ConEx                                                         B. Briscoe
Internet-Draft                                                         BT
Intended status: Informational                              M. Sridharan
Expires: August 18, 2014                                        Microsoft
                                                        February 14, 2014

 Network Performance Isolation in Data Centres using Congestion Policing
                   draft-briscoe-conex-data-centre-02

Abstract

This document describes how a multi-tenant (or multi-department) data centre operator can isolate tenants from network performance degradation due to each other's usage, but without losing the multiplexing benefits of a LAN-style network where anyone can use any amount of any resource. Zero per-tenant configuration and no implementation change is required on network equipment. Instead the solution is implemented with a simple change to the hypervisor (or container) beneath the tenant's virtual machines on every physical server connected to the network. These collectively enforce a very simple distributed contract - a single network allowance that each tenant can allocate among their virtual machines, even if distributed around the network. The solution uses layer-3 switches that support explicit congestion notification (ECN). It is best if the sending operating system supports congestion exposure (ConEx). Nonetheless, the operator can unilaterally deploy a complete solution while operating systems are being incrementally upgraded to support ConEx and ECN.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).
Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at http://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on August 18, 2014. 48 Copyright Notice 49 Copyright (c) 2014 IETF Trust and the persons identified as the 50 document authors. All rights reserved. 52 This document is subject to BCP 78 and the IETF Trust's Legal 53 Provisions Relating to IETF Documents 54 (http://trustee.ietf.org/license-info) in effect on the date of 55 publication of this document. Please review these documents 56 carefully, as they describe your rights and restrictions with respect 57 to this document. Code Components extracted from this document must 58 include Simplified BSD License text as described in Section 4.e of 59 the Trust Legal Provisions and are provided without warranty as 60 described in the Simplified BSD License. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 65 2. Features of the Solution . . . . . . . . . . . . . . . . . . . 4 66 3. Outline Design . . . . . . . . . . . . . . . . . . . . . . . . 7 67 4. Performance Isolation: Intuition . . . . . . . . . . . . . . . 9 68 4.1. Performance Isolation: The Problem . . . . . . . . . . . . 9 69 4.2. Why Congestion Policing Works . . . . . . . . . . . . . . 11 70 5. Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 71 5.1. Trustworthy Congestion Signals at Ingress . . . . . . . . 13 72 5.1.1. Tunnel Feedback vs. ConEx . . . . . . . . . . . . . . 14 73 5.1.2. ECN Recommended . . . . . . . . . . . . . . . . . . . 14 74 5.1.3. Summary: Trustworthy Congestion Signals at Ingress . . 15 75 5.2. Switch/Router Support . . . . . . . . . . . . . . . . . . 16 76 5.3. Congestion Policing . . . . . . . . . . . . . . . . . . . 17 77 5.4. Distributed Token Buckets . . . . . . . . . . . . . . . . 18 78 6. Incremental Deployment . . . . . . . . . . . . . . . . . . . . 19 79 6.1. Migration . . . . . . . . . . . . . . . . . . . . . . . . 19 80 6.2. Evolution . . . . . . . . . . . . . . . . . . . . . . . . 20 81 7. Related Approaches . . . . . . . . . . . . . . . . . . . . . . 20 82 8. Security Considerations . . . . . . . . . . . . . . . . . . . 21 83 9. IANA Considerations (to be removed by RFC Editor) . . . . . . 21 84 10. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 21 85 11. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 21 86 12. Informative References . . . . . . . . . . . . . . . . . . . . 21 87 Appendix A. Summary of Changes between Drafts (to be removed 88 by RFC Editor) . . . . . . . . . . . . . . . . . . . 23 90 1. Introduction 92 A number of companies offer hosting of virtual machines on their data 93 centre infrastructure--so-called infrastructure as a service (IaaS) 94 or 'cloud computing'. A set amount of processing power, memory, 95 storage and network are offered. Although processing power, memory 96 and storage are relatively simple to allocate on the 'pay as you go' 97 basis that has become common, the network is less easy to allocate, 98 given it is a naturally distributed system. 
100 This document describes how a data centre infrastructure provider can 101 offer isolated network performance to each tenant by deploying 102 congestion policing at every ingress to the data centre network, e.g. 103 in all the hypervisors (or containers). The data packets pick up 104 congestion information as they traverse the network, which is brought 105 to the ingress using one of two approaches: feedback tunnels or ConEx 106 (or a mix of the two). Then, these ingress congestion policers have 107 sufficient information to limit the amount of congestion any tenant 108 can cause anywhere in the whole meshed pool of data centre network 109 resources. This isolates the network performance experienced by each 110 tenant from the behaviour of all the others, without any tenant- 111 related configuration on any of the switches. 113 _How _it works is very simple and quick to describe. _Why_ this 114 approach provides performance isolation may be more difficult to 115 grasp. In particular, why it provides performance isolation across a 116 network of links, even though there is no isolation mechanism in each 117 link. Essentially, rather than limiting how much traffic can go 118 where, traffic is allowed anywhere and the policer finds out whenever 119 and wherever any traffic causes a small amount of congestion so that 120 it can prevent heavier congestion. 122 This document explains how it works, while a companion document 123 [conex-policing] builds up an intuition for why it works. 124 Nonetheless to make this document self-contained, brief summaries of 125 both the 'how' and the 'why' are given in sections 3 & 4. Then 126 Section 5 gives details of the design and Section 6 explains the 127 aspects of the design that enable incremental deployment. Finally 128 Section 7 introduces other attempts to solve the network performance 129 isolation problem and why they fall down in various ways. 131 The solution would also be just as applicable to isolate the network 132 performance of different departments within the private data centre 133 of an enterprise, which could be implemented without virtualisation. 134 However, it will be described as a multi-tenant scenario, which is 135 the more difficult case from a security point of view. 137 2. Features of the Solution 139 The following goals are met by the design, each of which is explained 140 subsequently: 142 o Performance isolation 144 o No loss of LAN-like openness and multiplexing benefits 146 o Zero tenant-related switch configuration 148 o No change to existing switch implementations 150 o Weighted performance differentiation 152 o Ultra-Simple contract--per-tenant network-wide allowance 154 o Sender constraint, but with transferrable allowance 156 o Transport-agnostic 158 o Extensible to wide-area and inter-data-centre interconnection 160 o Doesn't require traffic classes, or manages traffic within each 161 class 163 Performance Isolation with Openness of a LAN: The primary goal is to 164 ensure that each tenant of a data centre receives a minimum 165 assured performance from the whole network resource pool, but 166 without losing the efficiency savings from multiplexed use of 167 shared infrastructure (work-conserving). There is no need for 168 partitioning or reservation of network resources. 170 Zero Tenant-Related Switch Configuration: Performance isolation is 171 achieved with no per-tenant configuration of switches. All switch 172 resources are potentially available to all tenants. 
174 Separately, _forwarding_ isolation may (or may not) be configured 175 to ensure one tenant cannot receive traffic from another's virtual 176 network. However, _performance_ isolation is kept completely 177 orthogonal, and adds nothing to the configuration complexity of 178 the network. 180 No New Switch Implementation: Straightforward commodity switches (or 181 routers) are sufficient. Bulk explicit congestion notification 182 (ECN) is recommended, which is available in a growing range of 183 layer-3 switches (a layer-3 switch does switching at layer-2, but 184 it can use the Diffserv and ECN fields for traffic control if it 185 can find an IP header). 187 Weighted Performance Differentiation: A tenant gets network 188 performance in proportion to their allowance when constrained by 189 others, with no constraint otherwise. Importantly, this assurance 190 is not just instantaneous, but over time. And the assurance is 191 not just localised to each link but network-wide. This will be 192 explained later with reference to the numerical examples in 193 [conex-policing]. 195 Ultra-Simple Contract: The tenant needs to decide only two things: 196 The peak bit-rate connecting each virtual machine to the network 197 (as today) and an overall 'usage' allowance. This document 198 focuses on the latter. A tenant just decides one number for this 199 contracted allowance that can be shared between all the tenant's 200 virtual machines (VMs). The 'usage' allowance is a measure of 201 congestion-bit-rate, which will be explained later, but most 202 tenants will just think of it as a number, where more is better. 204 Multi-machine: A tenant operating multiple VMs has no need to decide 205 in advance which VMs will need more allowance and which less--an 206 automated process can allocate the allowance across the VMs, 207 shifting more to those that need it most, as they use it. 208 Therefore, performance cannot be constrained by poor choice of 209 allocations between VMs, removing a whole dimension from the 210 problem that tenants face when choosing their traffic contract. 211 The allocation process can be operated by the tenant, or provided 212 by the data centre operator as part of an enhanced platform to 213 complement the basic infrastructure (platform as a service or 214 PaaS). 216 Sender Constraint with transferrable allowance: By default, 217 constraints are always placed on data senders, determined by the 218 sending party's traffic contract. Nonetheless, if the receiving 219 party (or any other party) wishes to enhance performance it can 220 arrange this with the sender at the expense of its own sending 221 allowance. 223 For instance, when a VM sends data to a storage facility the 224 tenant that owns the VM consumes as much of their allowance as 225 necessary to achieve the desired sending performance. But by 226 default when that tenant later retrieves data from storage, the 227 storage facility is the sender, so the storage facility consumes 228 its allowance to determine performance in the reverse direction. 229 Nonetheless, during the retrieval request, the storage facility 230 can require that its sending 'costs' are covered by the receiving 231 VM's allowance. The design of this feature is beyond the scope of 232 this document, but the system provides all the hooks to build it 233 at the application (or transport) layer. 235 Transport-Agnostic: In a well-provisioned network, enforcement of 236 performance isolation rarely introduces constraints on network 237 behaviour. 
However, it continually counts how much each tenant is limiting the performance of others, and it will intervene to enforce performance isolation against only those tenants who most persistently constrain others. By default, this intervention is oblivious to flows and to the protocols and algorithms being used above the IP layer. However, flow-aware or application-aware prioritisation can be built on top, either by the tenant or by the data centre operator as a complementary PaaS facility.

Interconnection: The solution is designed so that interconnected networks can ensure each is accountable for the performance degradation it contributes to in other networks. If necessary, one network has the information to intervene at its ingress to limit traffic from another network that is degrading performance. Alternatively, with the proposed protocols, networks can see sufficient information in traffic arriving at their borders to give their neighbours financial incentives to limit the traffic themselves.

The present document focuses on a single-provider scenario, but evolution to interconnection with other data centres over wide-area networks, and interconnection with access networks, is briefly discussed in Section 6.2.

Intra-class: The solution does not need traffic to have been classified into classes. Or, if traffic is divided into classes, it manages contention for the resources of each class, independently of any scheduling between classes.

3.  Outline Design

   [Figure 1 (ASCII art, not reproduced here): each tenant's virtual
   machines (V11..V1m, ..., Vn1..Vnm) attach to the network through
   per-tenant congestion policers (T1, T2) in the hypervisor on each
   host (H1, H2, ..., Hn); the hosts connect to the network switches
   (S1, S2).]

The two (or more) policers associated with tenant T1 act as one logical policer.

           Figure 1: Edge Policing and the Hose Traffic Model

Edge policing: Traffic policing is located at the policy enforcement point where each sending host connects to the network, typically beneath the tenant's operating system in the hypervisor controlled by the infrastructure operator (Figure 1). In this respect, the approach has a similar arrangement to the Diffserv architecture with traffic policers forming a ring around the network [RFC2475].

(Multi-)Hose model: Each policer controls all traffic from the set of VMs associated with each tenant without regard to destination, similar to the Diffserv 'hose' model. If the tenant has VMs spread across multiple physical hosts, they are all constrained by one logical policer that feeds tokens to individual sub-policers within each hypervisor on each physical host (e.g. the two policers associated with tenant T1 in Figure 1). In other words, the network is treated as one resource pool.

Congestion policing: A congestion policer is very similar to a traditional bit-rate policer. A classifier associates each packet with the relevant tenant's meter to drain tokens from the associated token bucket, while at the same time the bucket fills with tokens at the tenant's contracted rate (Figure 2).

However, unlike a traditional policer, the tokens in a congestion policer represent congested bits (i.e. discarded or ECN-marked bits), not just any bits. So, the bits in ECN-marked packets in Figure 2 count as congested bits, while all other bits don't drain anything from the token bucket--unmarked packets are invisible to the meter. And a tenant's contracted fill rate (wi for tenant Ti in Figure 2) is only the rate of congested bits, not all bits. Then if, on average, any tenant tries to cause more congestion than their allowance, the policer will focus discards on that tenant's traffic to prevent any further increase in congestion for everyone else.

The detailed design section describes how congestion policers at the network ingress know the congestion that each packet will encounter in the network, as well as how the congestion policer limits both peak and average rates of congestion. (A minimal sketch of such a policer is given at the end of this section.)

   [Figure 2 (ASCII art, not reproduced here): a classifier separates
   the packet streams of tenants T1, T2, ..., Ti; a marking meter
   detects marked packets [*] in each stream, and only those marked
   bytes drain the tenant's congestion token bucket, which fills at
   the tenant's contracted rate (w1, w2, ..., wi); the bucket depth
   controls a policer that discards packets of tenants whose bucket
   is exhausted.]

              Figure 2: Bulk Congestion Policer Schematic

Optional Per-Flow policing: A congestion policer could be designed to focus policing on the particular data flow(s) contributing most to the excess congestion-bit-rate. However, since bulk per-tenant congestion policing is sufficient to protect other tenants, each tenant can choose per-flow policing for itself if it wants.

FIFO forwarding: If scheduling by traffic class is used in network buffers (for whatever reason), congestion policing can be used to isolate tenants from each other within each class. However, congestion policing will tend to keep queues short, therefore it is more likely that simple first-in first-out (FIFO) will be sufficient, with no need for any priority scheduling.

ECN marking recommended: All queues that might become congested should support bulk ECN marking. For any non-ECN-capable flows or packets, the solution enables ECN universally in the outer IP header of an edge-to-edge tunnel. It can use the edge-to-edge tunnel created by one of the network virtualisation overlay approaches, e.g. [nvgre, vxlan].

In the proposed approach, the network operator deploys capacity as usual--using previous experience to determine a reasonable contention ratio at every tier of the network. Then, the tenant contracts with the operator for the rate at which their congestion policer will allow them to contribute to congestion. [conex-policing] explains how the operator or tenant would determine an appropriate allowance.
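
To make the outline concrete, the following minimal sketch (in Python, with invented class and parameter names; it is an illustration under stated assumptions, not part of the specified design) shows the behaviour described above: only congestion-marked bytes drain the bucket, and discards ramp up gradually as the bucket approaches empty. The 90% ramp threshold is purely illustrative.

   import random
   import time

   class BulkCongestionPolicer:
       """Illustrative per-tenant congestion policer (hypervisor shim).

       Tokens represent congestion-bytes: the bucket fills at the
       tenant's contracted congestion-bit-rate and is drained only by
       the size of packets that carry a congestion signal (an ECN
       mark, a ConEx marking, or feedback from the egress tunnel)."""

       def __init__(self, fill_rate_Bps, depth_B):
           self.fill_rate = fill_rate_Bps   # contracted rate wi, bytes/s
           self.depth = depth_B             # bucket depth, bytes
           self.tokens = depth_B            # start with a full bucket
           self.last = time.monotonic()

       def _refill(self):
           now = time.monotonic()
           self.tokens = min(self.depth,
                             self.tokens + self.fill_rate * (now - self.last))
           self.last = now

       def admit(self, pkt_len_B, congestion_signalled):
           """Return True to forward the packet, False to discard it."""
           self._refill()
           if congestion_signalled:
               # Only congestion-marked bytes drain the bucket; all other
               # traffic is invisible to the policer.
               self.tokens -= pkt_len_B
           # Policing ramps up gradually as the bucket empties, rather
           # than switching abruptly to 100% discard when it is empty.
           fraction_empty = 1.0 - max(self.tokens, 0.0) / self.depth
           discard_prob = max(0.0, (fraction_empty - 0.9) / 0.1)
           return random.random() >= discard_prob

A hypervisor shim would call admit() for every packet sent by the tenant's VMs, setting congestion_signalled from an ECN mark, a ConEx marking or tunnel feedback, as described in Section 5.1.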

4.  Performance Isolation: Intuition

4.1.  Performance Isolation: The Problem

Network performance isolation traditionally meant that each user could be sure of a minimum guaranteed bit-rate. Such assurances are useful if traffic from each tenant follows relatively predictable paths and is fairly constant. If traffic demand is more dynamic and unpredictable (both over time and across paths), minimum bit-rate assurances can still be given, but they have to be very small relative to the available capacity, because a large number of users might all want to simultaneously share any one link, even though they rarely all use it at the same time.

This either means the shared capacity has to be greatly overprovided so that the assured level is large enough, or the assured level has to be small. The former is unnecessarily expensive; the latter doesn't really give a sufficiently useful assurance.

Round robin or fair queuing are other forms of isolation that guarantee that each user will get 1/N of the capacity of each link, where N is the number of active users at each link. This is fine if the number of active users (N) sharing a link is fairly predictable. However, if large numbers of tenants do not typically share any one link but at any time they all could (as in a data centre), a 1/N assurance is fairly worthless. Again, given N is typically small but could be very large, either the shared capacity has to be expensively overprovided, or the assured bit-rate has to be worthlessly small. The argument is no different for the weighted forms of these algorithms (WRR & WFQ).

Both these traditional forms of isolation try to give one tenant assured instantaneous bit-rate by constraining the instantaneous bit-rate of everyone else. This approach is flawed except in the special case when the load from every tenant on every link is continuous and fairly constant. The reality is usually very different: sources are on-off and the route taken varies, so that on any one link a source is more often off than on.

In these more realistic (non-constant) scenarios, the capacity available for any one tenant depends much more on _how often_ everyone else uses a link, not just _how much_ bit-rate everyone else would be entitled to if they did use it.

For instance, if 100 tenants are using a 1Gb/s link for 1% of the time, there is a good chance each will get the full 1Gb/s link capacity. But if just six of those tenants suddenly start using the link 50% of the time, whenever the other 94 tenants need the link, they will typically find 3 of these heavier tenants using it already. If a 1/N approach like round-robin were used, then the light tenants would suddenly get 1/4 * 1Gb/s = 250Mb/s on average. Round-robin cannot claim to isolate tenants from each other if they usually get 1Gb/s but sometimes they get 250Mb/s (and only 10Mb/s guaranteed in the worst case when all 100 tenants are active).
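
The figures in this example can be checked with a few lines of arithmetic; the following snippet is purely illustrative and uses only the numbers given above:

   # Check of the round-robin example above (values from the text; the
   # light tenants' own 1% load is low enough to ignore here).
   n_heavy, p_heavy = 6, 0.5      # six tenants active 50% of the time
   link_Gbps = 1.0

   # On average, a light tenant arriving at the link finds
   # n_heavy * p_heavy = 3 heavy tenants already using it.
   expected_heavy_active = n_heavy * p_heavy

   # Under round-robin the light tenant then gets a 1/N share.
   typical_share = link_Gbps / (expected_heavy_active + 1)   # 0.25 Gb/s
   worst_case    = link_Gbps / 100                           # all 100 active

   print(typical_share, worst_case)   # 0.25 and 0.01 (Gb/s)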

In contrast, congestion policing is the key to network performance isolation because it focuses policing only on those tenants that go fast over congested path(s) excessively and persistently over time. This keeps congestion below a design threshold everywhere so that everyone else can go fast. In this way, congestion policing takes account of highly variable loads (varying in time and varying across routes). And, if everyone's load happens to be constant, congestion policing converges on the same outcome as the traditional forms of isolation.

The other flaw in the traditional approaches to isolation, like WRR & WFQ, is that they actually prevent long-running flows from yielding to brief bursts from lighter tenants. A long-running flow can yield to brief flows and still complete nearly as soon as it would have otherwise (the brief flows complete sooner, freeing up the capacity for the longer flow sooner). However, WRR & WFQ prevent flows from even seeing the congestion signals that would allow them to co-ordinate between themselves, because they isolate each tenant completely into separate queues.

In summary, superficially, traditional approaches with separate queues sound good for isolation, but:

1. not when everyone's load is variable, so each tenant has no assurance about how many other queues there will be;

2. and not when each tenant can no longer even see the congestion signals from other tenants, so no-one's control algorithms can determine whether they would benefit most by pushing harder or yielding.

4.2.  Why Congestion Policing Works

[conex-policing] explains why congestion policing works using numerical examples from a data centre and schematic traffic plots (in ASCII art). The bullets below provide a summary of that explanation, which builds from the simple case of long-running flows through a single link up to a full meshed network with on-off flows of different sizes and different behaviours:

o Starting with the simple case of long-running flows focused on a single bottleneck link, tenants get weighted shares of the link, much like weighted round robin, but with no mechanism in any of the links. This is because losses (or ECN marks) are random, so if one tenant sends twice as much bit-rate it will suffer twice as many lost bits (or ECN-marked bits). So, at least for constant long-running flows, regulating congestion-bits gives the same outcome as regulating bits;

o In the more realistic case where flows are not all long-running but a mix of short to very long, it is explained that bit-rate is not a sufficient metric for isolating performance; how _often_ a tenant is sending (or not sending) is the significant factor for performance isolation, not whether bit-rate is shared equally whenever a source happens to be sending;

o Although it might seem that data volume would be a good measure of how often a tenant is sending, we then show why it is not. For instance, a tenant can send a large volume of data but hardly affect the performance of others -- by being more responsive to congestion. Using congestion-volume (congestion-bit-rate over time) in a policer encourages large data senders to be more responsive (to yield), giving other tenants much higher performance while hardly affecting their own performance.
Whereas, using straight volume as an allocation metric provides no distinction between high volume sources that yield and high volume sources that do not yield (the widespread behaviour today); a small worked comparison of the two metrics is given at the end of this section;

o We then show that a policer based on the congestion-bit-rate metric works across a network of links treating it as a pool of capacity, whereas other approaches treat each link independently, which is why the proposed approach requires none of the configuration complexity on switches that is involved in other approaches.

o We also show that a congestion policer can be arranged to limit bursts of congestion from sources that focus traffic onto a single link, even where one source may consist of a large aggregate of sources.

o We show that a congestion policer rewards traffic that shifts to less congested paths (e.g. multipath TCP or virtual machine motion). This means congestion policing encourages and ultimately forces end-systems to balance their load over the whole pool of bandwidth. The network can attempt to balance the load, but bulk congestion policing is particularly designed to encourage end-systems to do the job, either at the transport layer with multipath TCP [RFC6356] or at the application layer by moving virtual machines or choosing peer virtual machines in a similar way to BitTorrent.

o We show that congestion policing works on the pool of links, irrespective of whether individual links have significantly different capacities.

o We show that a congestion policer allows a wide variety of responses to congestion (e.g. New Reno TCP, Cubic TCP, Compound TCP, Data Centre TCP and even unresponsive UDP traffic), while still encouraging and enforcing a sufficient response to congestion from all sources taken together.

o Congestion policing can and will enforce a congestion response if a tenant persistently causes excessive congestion. This ensures that each tenant's minimum performance is isolated from the combined effects of everyone else. However, the purpose of congestion policing is not to intervene in everyone's rate control all the time. Rather it is to encourage each tenant to avoid being policed -- to keep the aggregate of all their flows' responses to congestion within an overall envelope and balanced across the network.

[conex-policing] also includes a section that gives guidance on how to estimate appropriate fill rates and sizes for congestion token buckets.
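
As a purely illustrative comparison of the volume and congestion-volume metrics (the traffic figures below are invented for illustration, not taken from [conex-policing]), consider two tenants that transfer the same volume in an hour but respond differently to congestion:

   # Two tenants each transfer the same volume over an hour, but one
   # yields to congestion while the other keeps pushing through it.
   GB = 8e9   # bits per gigabyte

   volume_sent   = {"yielding": 100 * GB, "unresponsive": 100 * GB}
   # Fraction of each tenant's bits that ended up ECN-marked (or lost).
   marking_ratio = {"yielding": 0.0002, "unresponsive": 0.01}

   for tenant, vol in volume_sent.items():
       congestion_volume = vol * marking_ratio[tenant]     # congestion-bits
       congestion_bit_rate = congestion_volume / 3600      # per second
       print(tenant, f"{congestion_volume/1e6:.0f} Mb congestion-volume,"
                     f" {congestion_bit_rate/1e3:.0f} kb/s average")

   # A volume-based allowance sees both tenants as identical (100 GB
   # each); a congestion-volume allowance sees the unresponsive tenant
   # consuming 50x more of the resource that actually matters to others.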

5.  Design

The design involves the following elements, each detailed in the following subsections:

1. Trustworthy Congestion Signals at Ingress

2. Switch/Router Support

3. Congestion Policing

4. Distributed Token Buckets

5.1.  Trustworthy Congestion Signals at Ingress

   [Figure 3 (ASCII art, not reproduced here): a Transport Sender and
   Transport Receiver exchange congestion feedback signals in the
   transport-layer feedback flow; a congested network device at the IP
   layer adds congestion signals to the data flow; the sender
   re-inserts these as (new) IP-layer ConEx signals, which an audit
   function near the receiver can compare against the actual
   congestion signals.]

        Figure 3: The ConEx Protocol in the Internet Architecture

The operator of the data centre infrastructure needs to trust this information, therefore it cannot just use the feedback in the end-to-end transport (e.g. TCP SACK or ECN echo congestion experienced flags) that might anyway be encrypted. Trusted congestion feedback may be implemented in either of the following two ways:

a. Either as a shim in both sending and receiving hypervisors using an edge-to-edge (host-host) tunnel controlled by the infrastructure operator, with feedback messages reporting congestion back to the sending host's hypervisor (in addition to the e2e feedback at the transport layer);

b. Or in the sending operating system using the congestion exposure protocol (ConEx [ConEx-Abstract-Mech]) with a ConEx audit function at the egress edge to check ConEx signals against actual congestion signals (Figure 3).

5.1.1.  Tunnel Feedback vs. ConEx

The feedback tunnel approach (a) is inefficient because it duplicates end-to-end feedback and it introduces at least a round trip's delay, whereas the ConEx approach (b) is more efficient and not delayed, because ConEx packets signal a conservative estimate of congestion in the upcoming round trip. Avoiding feedback delay is important for controlling congestion from aggregated short flows. However, ConEx signals will not necessarily be supported by the sending operating system.

Therefore, given ConEx IP packets are self-identifying, the best approach is to rely on ConEx signals when present and fill in with tunnelled feedback when not, on a packet-by-packet basis.

5.1.2.  ECN Recommended

Both approaches are much easier if explicit congestion notification (ECN [RFC3168]) is enabled on network switches and if all packets are ECN-capable. For non-ECN-capable packets, ECN support can be turned on in the outer of an edge-to-edge tunnel. The reasons that ECN helps in each case are given below (a sketch of the corresponding egress functions follows this list):

a. Tunnel Feedback: To feed back congestion signals, the tunnel egress needs to be able to detect forward congestion signals in the first place. If the only symptom of congestion is dropped packets, the egress has to watch for gaps in the sequence space of the transport protocol, which cannot be guaranteed to be possible--the IP payload may be encrypted, or an unknown protocol, or parts of the flow may be sent over diverse paths. The tunnel ingress could add its own sequence numbers (as done by some pseudowire protocols), but it is easier to simply turn on ECN at the ingress so that the egress can detect ECN markings.

b. ConEx: The audit function needs to be able to compare ConEx signals with actual congestion. So, as before, it needs to be able to detect congestion at the egress. Therefore the same arguments for ECN apply.
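
The following sketch indicates the two egress-edge functions implied above: counting congestion-marked bytes for tunnel feedback (a), and comparing ConEx markings against actual congestion for audit (b). The packet attributes, feedback message and all names are invented for illustration only.

   from collections import defaultdict

   ECN_CE = 0b11   # congestion experienced codepoint in the IP ECN field

   class EgressEdge:
       """Illustrative egress-edge functions for approaches (a) and (b).

       Packet objects are assumed to expose the outer ECN field, the
       ingress edge they came from, a flow identifier and their ConEx
       markings; the feedback mechanism is left abstract."""

       def __init__(self, send_feedback):
           self.send_feedback = send_feedback    # callable(ingress, bytes)
           self.ce_bytes = defaultdict(int)      # per-ingress CE counters
           self.conex_deficit = defaultdict(int) # per-flow audit state

       def on_packet(self, pkt):
           # (a) Tunnel feedback: count congestion-marked bytes so they
           # can be reported back to the ingress hypervisor periodically.
           if pkt.outer_ecn == ECN_CE:
               self.ce_bytes[pkt.ingress] += pkt.length

           # (b) ConEx audit: ConEx markings should cover at least the
           # actual congestion seen; a persistent deficit indicates an
           # understating sender and would trigger sanctions (not shown).
           if pkt.conex_capable:
               if pkt.outer_ecn == ECN_CE:
                   self.conex_deficit[pkt.flow] += pkt.length
               if pkt.conex_marked:
                   self.conex_deficit[pkt.flow] -= pkt.length

       def report(self):
           # Periodic feedback to each ingress (approach (a) only).
           for ingress, count in self.ce_bytes.items():
               self.send_feedback(ingress, count)
           self.ce_bytes.clear()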

5.1.3.  Summary: Trustworthy Congestion Signals at Ingress

The above cases can be arranged in a 2x2 matrix, to show when edge-to-edge tunnelling is needed and what function the tunnel would need to serve:

   +----------------+----------------+---------------------------------+
   | ConEx-capable? | ECN-capable: Y | ECN-capable: N                  |
   +----------------+----------------+---------------------------------+
   | Y              | No tunnel      | ECN-enabled tunnel              |
   |                | needed         |                                 |
   | N              | Tunnel         | ECN-enabled tunnel + Tunnel     |
   |                | Feedback       | feedback                        |
   +----------------+----------------+---------------------------------+

We can now summarise the steps necessary to ensure an ingress congestion policer obtains trustworthy congestion signals (an illustrative sketch of the ingress-edge steps follows at the end of this list):

1. Sending operating system:

   1. The sender SHOULD send ConEx-enabled and ECN-enabled packets whenever possible.

   2. If the sender uses IPv6 it can signal ConEx in a destination option header [conex-destopt].

   3. If the sender uses IPv4, it can signal ConEx markings by encoding them within the packet ID field as proposed in [ipv4-id-reuse].

2. Ingress edge:

   1. If an arriving packet is either not ConEx-capable or not ECN-capable it SHOULD be tunnelled to the appropriate egress edge in an outer IP header.

   2. A pre-existing edge-to-edge tunnel (e.g. [nvgre, vxlan]) can be used, irrespective of whether the packet is not ConEx-capable or not ECN-capable.

   3. Incoming ConEx signals MUST be copied to the outer. For an incoming IPv4 packet, this implies copying the ID field. For an incoming IPv6 packet, this implies copying the Destination Option header.

   4. In all cases, the tunnel ingress MUST use the normal mode of ECN tunnelling [RFC6040].

3. Directly after encapsulation (but not if the packet was not encapsulated):

   1. If and only if the ECN field of the outer header is not ECN-capable (Not-ECT, i.e. 00) it MUST be made ECN-capable by remarking it to ECT(0), i.e. 01.

   2. If the outer ECN field carries any other value than 00, it should be left unchanged.

4. Directly before the edge egress (irrespective of whether the packet is encapsulated):

   1. If the outer IP header is ConEx-capable, it MUST be passed through a ConEx audit function.

   2. If the packet is not ConEx-capable, it MUST be passed to a function that feeds back ECN marking statistics to the tunnel ingress. Such a function is also a requirement of [tunnel-cong-exp], which may be re-usable for this purpose {ToDo: to be confirmed}.

5. Egress Edge Decapsulator:

   1. Decapsulation must comply with [RFC6040]. This ensures that a congestion experienced marking (CE or 11) on the outer will lead to the packet being dropped if the inner indicates that the endpoints will not understand ECN (i.e. the inner ECN field is Not-ECT or 00). Effectively the egress edge drops such packets on behalf of the congested upstream buffer that marked it, because the packet appeared to be ECN-capable on the outside, but it is not ECN-capable on the inside. [RFC6040] was deliberately arranged like this so that it would drop such packets to give an equivalent congestion signal to the end-to-end transport.
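
As a rough illustration of ingress-edge steps 2 and 3 above, the following sketch shows the per-packet decisions in Python. The packet model, field names and the encapsulate() helper are assumptions of this sketch, not defined by this document.

   NOT_ECT, ECT0 = 0b00, 0b01   # IP ECN field codepoints

   def ingress_edge_process(pkt, encapsulate):
       """Illustrative per-packet logic for steps 2 and 3 of Section
       5.1.3.  `pkt` is assumed to expose .conex_capable, .ecn,
       .version and its ConEx encoding; `encapsulate` builds the outer
       header of the pre-existing edge-to-edge tunnel."""
       needs_tunnel = not pkt.conex_capable or pkt.ecn == NOT_ECT
       if not needs_tunnel:
           return pkt                    # forward natively; no tunnel needed

       outer = encapsulate(pkt)          # e.g. an NVGRE/VXLAN outer header

       # Step 2.3: copy incoming ConEx signals to the outer header.
       if pkt.conex_capable:
           if pkt.version == 4:
               outer.ipid = pkt.ipid                     # IPv4 ID encoding
           else:
               outer.conex_destopt = pkt.conex_destopt   # IPv6 dest. option

       # Step 2.4: normal-mode ECN tunnelling [RFC6040] copies the
       # inner ECN field into the outer at encapsulation.
       outer.ecn = pkt.ecn

       # Step 3.1: make the outer ECN-capable if it is not already, so
       # that congested switches can mark rather than drop.
       if outer.ecn == NOT_ECT:
           outer.ecn = ECT0
       # Step 3.2: any other outer ECN value is left unchanged.

       return outer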

5.2.  Switch/Router Support

Network switches/routers do not need any modification. However, both congestion detection by the tunnel (approach a) and ConEx audit (approach b) are significantly easier if switches support ECN.

Once switches support ECN, Data centre TCP [DCTCP] could optionally be used (DCTCP requires ECN). It also requires modified sender and receiver TCP algorithms as well as a more aggressive configuration of the active queue management (AQM) in the L3 switches or routers.

5.3.  Congestion Policing

Innovation in the design of congestion policers is expected and encouraged, but here we will describe one specific design to be concrete.

A bulk congestion policing function would most likely be implemented as a shim in the hypervisor. The hypervisor would create one instance of a bulk congestion policer per tenant on the physical machine, and it would ensure that all traffic sent by that tenant's VMs into the network would pass through the relevant congestion policer by associating every new virtual machine with the relevant policer.

A bulk congestion policing function has already been outlined in Section 3. To recap, it consists of a token bucket that is filled with congestion tokens at a constant rate. The bucket is drained by the size of every packet that carries a congestion marking. If the tunnel-feedback approach (a) were used, the bucket would be drained by congestion feedback from the tunnel egress, rather than markings on packets. If the ConEx approach (b) were used, the bucket would be drained by ConEx markings on the actual data packets being forwarded. A congestion policer will need to drain in response to either form of signal, because it is recommended that both approaches are used in combination.

Various more sophisticated congestion policer designs have been evaluated [CPolTrilogyExp]. In these experiments, it was found that it is better if the policer gradually increases discards as the bucket becomes empty. Also, isolation between tenants is better if each tenant is policed based on the combination of two buckets, not one (Figure 4; a sketch of this dual-bucket arrangement follows the figure):

1. A deep bucket (that would take minutes or even hours to fill at the contracted fill-rate) that constrains the tenant's long-term average rate of congestion (wi);

2. a very shallow bucket (e.g. only two or three MTU) that is filled considerably faster than the deep bucket (c * wi), where c = ~10, which prevents a tenant storing up a large backlog of tokens then causing congestion in one large burst.

In this arrangement each marked packet drains tokens from both buckets, and the probability of policer discard is taken as the worse of the two buckets.

   [Figure 4 (ASCII art, not reproduced here; legend as in the
   previous figure): a deep bucket fills at rate wi and a very shallow
   bucket fills at rate c*wi; marked packets drain both buckets, and
   the worse (emptier) of the two buckets triggers policing.]

    Figure 4: Dual Congestion Token Bucket (in place of each single
                      bucket in the previous figure)
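
A minimal sketch of this dual-bucket policer follows (Python; the class names, the 3-MTU shallow depth and the 90% ramp threshold are illustrative assumptions, with c = 10 as suggested above):

   import random
   import time

   class _Bucket:
       def __init__(self, fill_rate_Bps, depth_B):
           self.fill_rate, self.depth = fill_rate_Bps, depth_B
           self.tokens, self.last = depth_B, time.monotonic()

       def _refill(self):
           now = time.monotonic()
           self.tokens = min(self.depth,
                             self.tokens + self.fill_rate * (now - self.last))
           self.last = now

       def drain(self, n_bytes):
           self._refill()
           self.tokens -= n_bytes

       def emptiness(self):
           self._refill()
           return 1.0 - max(self.tokens, 0.0) / self.depth

   class DualCongestionPolicer:
       """The deep bucket limits the long-term congestion-bit-rate wi;
       the shallow bucket, filled at c*wi, limits bursts of congestion."""

       def __init__(self, wi_Bps, deep_depth_B, c=10, mtu_B=1500):
           self.deep = _Bucket(wi_Bps, deep_depth_B)
           self.shallow = _Bucket(c * wi_Bps, 3 * mtu_B)

       def admit(self, pkt_len_B, congestion_signalled):
           if congestion_signalled:
               # Each congestion-marked packet drains both buckets.
               self.deep.drain(pkt_len_B)
               self.shallow.drain(pkt_len_B)
           # Discard probability is taken from the worse (emptier)
           # bucket, ramping up gradually as it approaches empty.
           worst = max(self.deep.emptiness(), self.shallow.emptiness())
           discard_prob = max(0.0, (worst - 0.9) / 0.1)
           return random.random() >= discard_prob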

While the data centre network operator only needs to police congestion in bulk, tenants may wish to enforce their own limits on individual users or applications, as sub-limits of their overall allowance. Given that all the information used for policing is readily available within the transport layer of their own operating system, tenants can readily apply any such per-flow, per-user or per-application limitations. The tenant may operate their own fine-grained policing software, or such detailed control capabilities may be offered as part of the platform (platform as a service or PaaS).

5.4.  Distributed Token Buckets

A customer may run virtual machines on multiple physical nodes, in which case at the time each VM is instantiated the data centre operator will deploy a congestion policer in the hypervisor on each node where the customer is running a VM. The DC operator can arrange for these congestion policers to collectively enforce the per-customer congestion allowance, as a distributed policer.

A function to distribute a customer's tokens to the policer associated with each of the customer's VMs would be needed. This could be similar to the distributed rate limiting of [DRL], which uses a gossip-like protocol to fill the sub-buckets. Alternatively, a logically centralised bucket of congestion tokens could be used. It could be replicated for reliability, and then there could be simple 1-1 communication between the central bucket and each local token bucket. (A sketch of such a centralised allocation function is given at the end of this section.)

Importantly, congestion tokens can be freely reassigned between different VMs, because a congestion token is equivalent at any place or time in a network. In contrast, traditional bit-rate tokens cannot simply be reassigned from one VM to another without implications on the balance of network loading. This is because the parameters used for bit-rate policing depend on the topology and its capacity planning (open loop), whereas congestion policing complements the closed loop congestion avoidance system that adapts to the prevailing traffic and topology.

As well as distribution of tokens between the VMs of a tenant, it would similarly be feasible to allow transfer of tokens between tenants, also without breaking the performance isolation properties of the system. Secure token transfer mechanisms could be built above the underlying policing design described here, but that is beyond the current scope and therefore deferred to future work.
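
A sketch of one possible centralised allocation function follows. The demand-proportional policy and all names are illustrative assumptions; this document leaves the distribution function open (gossip, as in [DRL], is another option).

   class CentralTokenAllocator:
       """Illustrative central allocation of one tenant's contracted
       congestion-token fill-rate across per-host sub-policers."""

       def __init__(self, contracted_fill_Bps):
           self.total_fill = contracted_fill_Bps
           self.recent_use = {}     # host id -> congestion-bytes consumed

       def report_usage(self, host, congestion_bytes):
           # Each hypervisor periodically reports how many congestion
           # tokens its local sub-policer consumed in the last interval.
           self.recent_use[host] = congestion_bytes

       def allocations(self):
           # Divide the contracted fill rate in proportion to recent
           # consumption; +1 smoothing keeps a small share flowing to
           # hosts that were idle so their VMs can start sending.
           hosts = list(self.recent_use)
           if not hosts:
               return {}
           weights = {h: self.recent_use[h] + 1 for h in hosts}
           total = sum(weights.values())
           return {h: self.total_fill * weights[h] / total for h in hosts}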

6.  Incremental Deployment

6.1.  Migration

A mechanism to bring trustworthy congestion signals to the ingress (Section 5.1) is critical to this performance isolation solution. Section 5.1.1 compares the two solutions: b) ConEx, which is efficient and timely enough to police short flows; and a) tunnel-feedback, which is neither. However, ConEx requires deployment in host operating systems first, while tunnel feedback can be deployed unilaterally by the data centre operator in all hypervisors (or containers), without requiring support in guest operating systems.

Section 5.1.3 describes the steps necessary to support both approaches. This would provide an incremental deployment route with the best of both worlds: tunnel feedback could be deployed initially for unmodified guest OSs despite its weaknesses, and ConEx could gradually take over as it was deployed more widely in guest OSs. It is important not to deploy the tunnel feedback approach without checking for ConEx-capable packets, otherwise it will never be possible to migrate to ConEx. The advantages of being able to migrate to ConEx are:

o no duplicate feedback channel between hypervisors (sending and forwarding a large proportion of tiny packets), which would cause considerable packet processing overhead

o performance isolation includes the contribution to congestion from short (sub-round-trip-time) flows

6.2.  Evolution

Initially, the approach would be confined to intra-data centre traffic. With the addition of ECN support on network equipment (at least bottleneck access routers) in the WAN between data centres, it could straightforwardly be extended to inter-data centre scenarios, including across interconnected backbone networks.

Once this approach becomes deployed within data centres and possibly across interconnects between data centres and enterprise LANs, the necessary support will be implemented in a wide range of equipment used in these scenarios. Similar equipment is also used in other networks (e.g. broadband access and backhaul), so that it would start to be possible for these other networks to deploy a similar approach.

7.  Related Approaches

The Related Work section of [CongPol] provides a useful comparison of the approach proposed here against other attempts to solve similar problems.

When the hose model is used with Diffserv, capacity has to be considerably over-provisioned for all the unfortunate cases when multiple sources of traffic happen to coincide even though they are all in-contract at their respective ingress policers. Even so, every node within a Diffserv network also has to be configured to limit higher traffic classes to a maximum rate in case of really unusual traffic distributions that would starve lower priority classes. Therefore, for really important performance assurances, Diffserv is used in the 'pipe' model where the policer constrains traffic separately for each destination, and sufficient capacity is provided at each network node for the sum of all the peak contracted rates for paths crossing that node.

In contrast, the congestion policing approach is designed to give full performance assurances across a meshed network (the hose model), without having to divide a network up into pipes. If an unexpected distribution of traffic from all sources focuses on a congestion hotspot, it will increase the congestion-bit-rate seen by the policers of all sources contributing to the hotspot. The congestion policers then focus on these sources, which in turn limits the severity of the hotspot.

The critical improvement over Diffserv is that the ingress edges receive information about any congestion occurring in the middle, so they can limit how much congestion occurs, wherever it happens to occur. Previously Diffserv edge policers had to limit traffic generally in case it caused congestion, because they never knew whether it would (open loop control).

Congestion policing mechanisms could be used to assure the performance of one data flow (the 'pipe' model), but this would involve unnecessary complexity, given the approach works well for the 'hose' model.

Therefore, congestion policing allows capacity to be provisioned for the average case, not for the near-worst case when many unlikely cases coincide.
It assures performance for all traffic using just one traffic class, whereas Diffserv only assures performance for a small proportion of traffic by partitioning it off into higher priority classes and over-provisioning relative to the traffic contracts sold for this class.

{ToDo: Refer to [conex-policing] for comparison with WRR & WFQ}

Seawall {ToDo} [Seawall]

8.  Security Considerations

{ToDo}

9.  IANA Considerations (to be removed by RFC Editor)

This document does not require actions by IANA.

10.  Conclusions

{ToDo}

11.  Acknowledgments

Thanks to Yu-Shun Wang for comments on some of the practicalities.

Bob Briscoe is part-funded by the European Community under its Seventh Framework Programme through the Trilogy 2 project (ICT-317756). The views expressed here are solely those of the author.

12.  Informative References

   [CPolTrilogyExp]      Raiciu, C., Ed., "Progress on resource
                         control", Trilogy EU 7th Framework Project
                         ICT-216372 Deliverable 9, December 2009.

   [ConEx-Abstract-Mech] Mathis, M. and B. Briscoe, "Congestion
                         Exposure (ConEx) Concepts and Abstract
                         Mechanism", draft-ietf-conex-abstract-mech-08
                         (work in progress), October 2013.

   [CongPol]             Jacquet, A., Briscoe, B., and T. Moncaster,
                         "Policing Freedom to Use the Internet Resource
                         Pool", Proc ACM Workshop on Re-Architecting
                         the Internet (ReArch'08), December 2008.

   [DCTCP]               Alizadeh, M., Greenberg, A., Maltz, D.,
                         Padhye, J., Patel, P., Prabhakar, B.,
                         Sengupta, S., and M. Sridharan, "Data Center
                         TCP (DCTCP)", ACM SIGCOMM CCR 40(4)63--74,
                         October 2010.

   [DRL]                 Raghavan, B., Vishwanath, K., Ramabhadran, S.,
                         Yocum, K., and A. Snoeren, "Cloud control with
                         distributed rate limiting", ACM SIGCOMM
                         CCR 37(4)337--348, 2007.

   [RFC2475]             Blake, S., Black, D., Carlson, M., Davies, E.,
                         Wang, Z., and W. Weiss, "An Architecture for
                         Differentiated Services", RFC 2475,
                         December 1998.

   [RFC3168]             Ramakrishnan, K., Floyd, S., and D. Black,
                         "The Addition of Explicit Congestion
                         Notification (ECN) to IP", RFC 3168,
                         September 2001.

   [RFC6040]             Briscoe, B., "Tunnelling of Explicit
                         Congestion Notification", RFC 6040,
                         November 2010.

   [RFC6356]             Raiciu, C., Handley, M., and D. Wischik,
                         "Coupled Congestion Control for Multipath
                         Transport Protocols", RFC 6356, October 2011.

   [Seawall]             Shieh, A., Kandula, S., Greenberg, A., and C.
                         Kim, "Seawall: Performance Isolation in Cloud
                         Datacenter Networks", Proc 2nd USENIX Workshop
                         on Hot Topics in Cloud Computing, June 2010.

   [conex-destopt]       Krishnan, S., Kuehlewind, M., and C. Ucendo,
                         "IPv6 Destination Option for ConEx",
                         draft-ietf-conex-destopt-05 (work in
                         progress), October 2013.

   [conex-policing]      Briscoe, B., "Network Performance Isolation
                         using Congestion Policing",
                         draft-briscoe-conex-policing-01 (work in
                         progress), February 2014.

   [ipv4-id-reuse]       Briscoe, B., "Reusing the IPv4 Identification
                         Field in Atomic Packets",
                         draft-briscoe-intarea-ipv4-id-reuse-04 (work
                         in progress), February 2014.

   [nvgre]               Sridharan, M., Greenberg, A., Wang, Y., Garg,
                         P., Duda, K., Venkataramaiah, N., Ganga, I.,
                         Lin, G., Pearson, M., Thaler, P., and C.
                         Tumuluri, "NVGRE: Network Virtualization using
                         Generic Routing Encapsulation",
                         draft-sridharan-virtualization-nvgre-04 (work
                         in progress), February 2014.

   [tunnel-cong-exp]     Zhu, L., Zhang, H., and X. Gong, "Tunnel
                         Congestion Exposure",
                         draft-zhang-tsvwg-tunnel-congestion-exposure-00
                         (work in progress), October 2012.

   [vxlan]               Mahalingam, M., Dutt, D., Duda, K., Agarwal,
                         P., Kreeger, L., Sridhar, T., Bursell, M., and
                         C. Wright, "VXLAN: A Framework for Overlaying
                         Virtualized Layer 2 Networks over Layer 3
                         Networks",
                         draft-mahalingam-dutt-dcops-vxlan-08 (work in
                         progress), February 2014.

Appendix A.  Summary of Changes between Drafts (to be removed by RFC Editor)

Detailed changes are available from http://tools.ietf.org/html/draft-briscoe-conex-data-centre

From briscoe-01 to briscoe-02: Added clarification about intra-class applicability. Updated references.

From briscoe-conex-data-centre-00 to briscoe-conex-data-centre-01:

* Took out the text of Section 4 "Performance Isolation Intuition" and Section 6 "Parameter Setting" into a separate draft [conex-policing] and instead included only a summary in these sections, referring out for details.

* Considerably updated Section 5 "Design"

* Clarifications and updates throughout, including addition of diagrams

From briscoe-conex-initial-deploy-02 to briscoe-conex-data-centre-00:

* Split off data-centre scenario as a separate document, by popular request.

Authors' Addresses

   Bob Briscoe
   BT
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   EMail: bob.briscoe@bt.com
   URI:   http://bobbriscoe.net/

   Murari Sridharan
   Microsoft
   1 Microsoft Way
   Redmond, WA 98052

   Phone:
   Fax:
   EMail: muraris@microsoft.com
   URI: