Network Performance Isolation in Data Centres using Congestion Policing

Transport Area (ConEx)                                    Internet-Draft

B. Briscoe
BT, B54/77, Adastral Park, Martlesham Heath, Ipswich, IP5 3RE, UK
+44 1473 645196
bob.briscoe@bt.com
http://bobbriscoe.net/

M. Sridharan
Microsoft, 1 Microsoft Way, Redmond, WA 98052
muraris@microsoft.com

This document describes how a multi-tenant (or multi-department) data
centre operator can isolate tenants from network performance degradation
due to each other's usage, but without losing the multiplexing benefits
of a LAN-style network where anyone can use any amount of any resource.
Zero per-tenant configuration and no implementation change is required
on network equipment. Instead the solution is implemented with a simple
change to the hypervisor (or container) beneath the tenant's virtual
machines on every physical server connected to the network. These
collectively enforce a very simple distributed contract - a single
network allowance that each tenant can allocate among their virtual
machines, even if distributed around the network. The solution uses
layer-3 switches that support explicit congestion notification (ECN). It
is best if the sending operating system supports congestion exposure
(ConEx). Nonetheless, the operator can unilaterally deploy a complete
solution while operating systems are being incrementally upgraded to
support ConEx and ECN.

A number of companies offer hosting of virtual machines on their data
centre infrastructure—so-called infrastructure as a service (IaaS)
or 'cloud computing'. A set amount of processing power, memory, storage
and network is offered. Although processing power, memory and storage
are relatively simple to allocate on the 'pay as you go' basis that has
become common, the network is less easy to allocate, given it is a
naturally distributed system.

This document describes how a data centre infrastructure provider can
offer isolated network performance to each tenant by deploying
congestion policing at every ingress to the data centre network, e.g. in
all the hypervisors (or containers). The data packets pick up congestion
information as they traverse the network, which is brought to the
ingress using one of two approaches: feedback tunnels or ConEx (or a mix
of the two). Then, these ingress congestion policers have sufficient
information to limit the amount of congestion any tenant can cause
anywhere in the whole meshed pool of data centre network resources. This
isolates the network performance experienced by each tenant from the
behaviour of all the others, without any tenant-related configuration on
any of the switches.

How it works is very simple and quick to describe. Why this approach provides
performance isolation may be more difficult to grasp, in particular why it
provides performance isolation across a network of links even though there is
no isolation mechanism in each link. Essentially, rather than limiting how much
traffic can go where, traffic is allowed anywhere, and the policer finds out
whenever and wherever any traffic causes a small amount of congestion, so that
it can prevent heavier congestion.

This document explains how it works, while a companion document builds up an
intuition for why it works.
Nonetheless, to make this document self-contained, brief summaries of both the
'how' and the 'why' are given in the sections that follow. The design section
then gives details of the design and explains the aspects of the design that
enable incremental deployment. Finally, the related work section introduces
other attempts to solve the network performance isolation problem and explains
why they fall down in various ways.

The solution would also be just as applicable to isolate the network
performance of different departments within the private data centre of
an enterprise, which could be implemented without virtualisation.
However, it will be described as a multi-tenant scenario, which is the
more difficult case from a security point of view.

The following goals are met by the design, each of which is explained
subsequently:

o  Performance isolation
o  No loss of LAN-like openness and multiplexing benefits
o  Zero tenant-related switch configuration
o  No change to existing switch implementations
o  Weighted performance differentiation
o  Ultra-simple contract: a per-tenant network-wide allowance
o  Sender constraint, but with transferrable allowance
o  Transport-agnostic
o  Extensible to wide-area and inter-data-centre interconnection

The
primary goal is to ensure that each tenant of a data centre receives
a minimum assured performance from the whole network resource pool,
but without losing the efficiency savings from multiplexed use of
shared infrastructure (work-conserving). There is no need for
partitioning or reservation of network resources.

Performance
isolation is achieved with no per-tenant configuration of switches.
All switch resources are potentially available to all tenants.
Separately, forwarding
isolation may (or may not) be configured to ensure one tenant cannot
receive traffic from another's virtual network. However, performance isolation is kept completely
orthogonal, and adds nothing to the configuration complexity of the
network.

Straightforward
commodity switches (or routers) are sufficient. Bulk explicit
congestion notification (ECN) is recommended, which is available in
a growing range of layer-3 switches (a layer-3 switch does switching
at layer-2, but it can use the Diffserv and ECN fields for traffic
control if it can find an IP header).

A tenant gets
network performance in proportion to their allowance when
constrained by others, with no constraint otherwise. Importantly,
this assurance is not just instantaneous, but over time. And the
assurance is not just localised to each link but network-wide. This
will be explained later with reference to numerical examples.

The tenant needs to decide only two things: the peak bit-rate connecting each
virtual machine to the
network (as today) and an overall 'usage' allowance. This document
focuses on the latter. A tenant just decides one number for this
contracted allowance that can be shared between all the tenant's
virtual machines (VMs). The 'usage' allowance is a measure of
congestion-bit-rate, which will be explained later, but most tenants
will just think of it as a number, where more is better. A tenant operating multiple VMs has no
need to decide in advance which VMs will need more allowance and
which less—an automated process can allocate the allowance
across the VMs, shifting more to those that need it most, as they
use it. Therefore, performance cannot be constrained by poor choice
of allocations between VMs, removing a whole dimension from the
problem that tenants face when choosing their traffic contract. The
allocation process can be operated by the tenant, or provided by the
data centre operator as part of an enhanced platform to complement
the basic infrastructure (platform as a service or PaaS).

By
default, constraints are always placed on data senders, determined
by the sending party's traffic contract. Nonetheless, if the
receiving party (or any other party) wishes to enhance performance
it can arrange this with the sender at the expense of its own
sending allowance. For instance, when a VM
sends data to a storage facility the tenant that owns the VM
consumes as much of their allowance as necessary to achieve the
desired sending performance. But by default when that tenant later
retrieves data from storage, the storage facility is the sender, so
the storage facility consumes its allowance to determine performance
in the reverse direction. Nonetheless, during the retrieval request,
the storage facility can require that its sending 'costs' are
covered by the receiving VM's allowance. The design of this feature
is beyond the scope of this document, but the system provides all
the hooks to build it at the application (or transport) layer.

In a well-provisioned network,
enforcement of performance isolation rarely introduces constraints
on network behaviour. However, it continually counts how much each
tenant is limiting the performance of others, and it will intervene
to enforce performance isolation against only those tenants who most
persistently constrain others. By default, this intervention is
oblivious to flows and to the protocols and algorithms being used
above the IP layer. However, flow-aware or application-aware
prioritisation can be built on top, either by the tenant or by the
data centre operator as a complementary PaaS facility.

The solution is designed so that
interconnected networks can ensure each is accountable for the
performance degradation it contributes to in other networks. If
necessary, one network has the information to intervene at its
ingress to limit traffic from another network that is degrading
performance. Alternatively, with the proposed protocols, networks
can see sufficient information in traffic arriving at their borders
to give their neighbours financial incentives to limit the traffic
themselves.

The present document focuses on a single-provider scenario, but evolution to
interconnection with other data centres over wide-area networks, and
interconnection with access networks, is briefly discussed later under
incremental deployment.

Traffic policing is located at the
policy enforcement point where each sending host connects to the
network, typically beneath the tenant's operating system in the
hypervisor controlled by the infrastructure operator. In this respect, the
approach is arranged similarly to the Diffserv architecture, with traffic
policers forming a ring around the network.

Each policer controls all traffic
from the set of VMs associated with each tenant without regard to
destination, similar to the Diffserv 'hose' model. If the tenant has
VMs spread across multiple physical hosts, they are all constrained
by one logical policer that feeds tokens to individual sub-policers
within each hypervisor on each physical host (e.g. the two policers associated
with tenant T1). In other words, the network is treated as one resource pool.

A congestion policer is very
similar to a traditional bit-rate policer. A classifier associates
each packet with the relevant tenant's meter to drain tokens from
the associated token bucket, while at the same time the bucket fills
with tokens at the tenant's contracted rate.

However, unlike
a traditional policer, the tokens in a congestion policer represent
congested bits (i.e. discarded or ECN-marked bits), not just any
bits. So, the bits in ECN-marked packets count as congested bits, while all other
bits don't drain anything from the token bucket—unmarked
packets are invisible to the meter. And a tenant's contracted fill
rate (wi for tenant Ti) is only
the rate of congested bits, not all bits. Then if, on average, any
tenant tries to cause more congestion than their allowance, the
policer will focus discards on that tenant's traffic to prevent any
further increase in congestion for everyone else.

The detailed design section describes how congestion
policers at the network ingress know the congestion that each packet
will encounter in the network, as well as how the congestion policer
limits both peak and average rates of congestion.

A congestion policer could be designed to focus policing on the particular
data flow(s) contributing most to the excess congestion-bit-rate. However,
bulk per-tenant congestion policing is sufficient to protect other tenants;
each tenant can then choose to apply per-flow policing itself if it wants.
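To make the outline concrete, the following minimal sketch (Python; the class
and parameter names are illustrative assumptions, not from this document) shows
the metering behaviour just described: the bucket fills at the tenant's
contracted congestion-bit-rate wi, only congestion-marked bits drain it, and
unmarked packets are invisible to the meter. The detailed design later adds a
second, shallow bucket and a gradual discard probability.

import time

class CongestionMeter:
    """Per-tenant congestion token bucket, as outlined above (illustrative sketch).

    fill_rate_bps is the tenant's contracted congestion-bit-rate, wi.
    Only congestion-marked bits drain the bucket; unmarked packets are
    invisible to the meter.
    """

    def __init__(self, fill_rate_bps, depth_bits):
        self.fill_rate, self.depth = fill_rate_bps, depth_bits
        self.tokens, self.last = depth_bits, time.monotonic()

    def on_packet(self, size_bits, congestion_marked):
        now = time.monotonic()
        self.tokens = min(self.depth, self.tokens + self.fill_rate * (now - self.last))
        self.last = now
        if congestion_marked:                  # e.g. ECN CE, ConEx mark or tunnel feedback
            self.tokens -= size_bits
        return self.tokens > 0                 # False once the tenant exceeds its allowance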
If scheduling by traffic class is
used in network buffers (for whatever reason), congestion policing
can be used to isolate tenants from each other within each class.
However, congestion policing will tend to keep queues short,
therefore it is more likely that simple first-in first-out (FIFO)
will be sufficient, with no need for any priority scheduling.

All queues that might become
congested should support bulk ECN marking. For any non-ECN-capable
flows or packets, the solution enables ECN universally in the outer
IP header of an edge-to-edge tunnel. It can use the edge-to-edge
tunnel created by one of the network virtualisation overlay
approaches, e.g. NVGRE or VXLAN.

In the proposed approach, the network operator deploys capacity
as usual—using previous experience to determine a reasonable
contention ratio at every tier of the network. Then, the tenant
contracts with the operator for the rate at which their congestion
policer will allow them to contribute to congestion. The companion document
explains how the operator or tenant would determine an appropriate allowance.

Network performance isolation traditionally meant that each user
could be sure of a minimum guaranteed bit-rate. Such assurances are
useful if traffic from each tenant follows relatively predictable
paths and is fairly constant. If traffic demand is more dynamic and
unpredictable (both over time and across paths), minimum bit-rate
assurances can still be given, but they have to be very small relative
to the available capacity, because a large number of users might all
want to simultaneously share any one link, even though they rarely
all use it at the same time.

This either means the shared capacity has to be greatly
overprovided so that the assured level is large enough, or the assured
level has to be small. The former is unnecessarily expensive; the
latter doesn't really give a sufficiently useful assurance.

Round robin or fair queuing are other forms of isolation that
guarantee that each user will get 1/N of the capacity of each link,
where N is the number of active users at each link. This is fine if
the number of active users (N) sharing a link is fairly predictable.
However, if large numbers of tenants do not typically share any one
link but at any time they all could (as in a data centre), a 1/N
assurance is fairly worthless. Again, given N is typically small but
could be very large, either the shared capacity has to be expensively
overprovided, or the assured bit-rate has to be worthlessly small. The argument
is no different for the weighted forms of these algorithms, WRR & WFQ.

Both these traditional forms of isolation try to give one tenant
assured instantaneous bit-rate by constraining the instantaneous
bit-rate of everyone else. This approach is flawed except in the
special case when the load from every tenant on every link is
continuous and fairly constant. The reality is usually very different:
sources are on-off and the route taken varies, so that on any one link
a source is more often off than on.

In these more realistic (non-constant) scenarios, the capacity
available for any one tenant depends much more on how often everyone else uses
a link, not just how much bit-rate everyone else would be entitled to if they
did use it.

For instance, if 100 tenants are using a 1Gb/s link for 1% of the
time, there is a good chance each will get the full 1Gb/s link
capacity. But if just six of those tenants suddenly start using the
link 50% of the time, whenever the other 94 tenants need the link,
they will typically find 3 of these heavier tenants using it already.
If a 1/N approach like round-robin were used, then the light tenants
would suddenly get 1/4 * 1Gb/s = 250Mb/s on average. Round-robin
cannot claim to isolate tenants from each other if they usually get
1Gb/s but sometimes they get 250Mb/s (and only 10Mb/s guaranteed in
the worst case when all 100 tenants are active).
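The following short calculation (Python, purely illustrative) reproduces the
arithmetic of this example: with six tenants each active 50% of the time, a
light tenant arriving at a random moment finds three of them active on average,
so round-robin leaves it roughly a quarter of the link, and only 1/100 of the
link if everyone happens to be active at once.

link_bps = 1e9            # 1 Gb/s link
heavy_tenants = 6
heavy_duty_cycle = 0.5    # each heavy tenant active 50% of the time

# Expected number of heavy tenants already active when a light tenant arrives
expected_active = heavy_tenants * heavy_duty_cycle            # = 3

# Round-robin share for the light tenant in the typical case (3 heavy + itself)
typical_share = link_bps / (expected_active + 1)               # = 250 Mb/s

# Guaranteed share in the worst case, when all 100 tenants are active at once
worst_case_share = link_bps / 100                              # = 10 Mb/s

print(typical_share / 1e6, "Mb/s typical;", worst_case_share / 1e6, "Mb/s worst case")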
In contrast, congestion policing is the key to network performance isolation
because it focuses policing only on those tenants that go
fast over congested path(s) excessively and persistently over time.
This keeps congestion below a design threshold everywhere so that
everyone else can go fast. In this way, congestion policing takes
account of highly variable loads (varying in time and varying across
routes). And, if everyone's load happens to be constant, congestion
policing converges on the same outcome as the traditional forms of
isolation.

The other flaw in the traditional approaches to isolation, like WRR
& WFQ, is that they actually prevent long-running flows from
yielding to brief bursts from lighter tenants. A long-running flow can
yield to brief flows and still complete nearly as soon as it would
have otherwise (the brief flows complete sooner, freeing up the
capacity for the longer flow sooner). However, WRR & WFQ prevent
flows from even seeing the congestion signals that would allow them to
co-ordinate between themselves, because they isolate each tenant
completely into separate queues.

In summary, superficially, traditional approaches with separate queues sound
good for isolation, but:

o  not when everyone's load is variable, so each tenant has no assurance about
   how many other queues there will be;

o  and not when each tenant can no longer even see the congestion signals from
   other tenants, so no-one's control algorithms can determine whether they
   would benefit most by pushing harder or yielding.

The companion document explains why congestion policing
works using numerical examples from a data centre and schematic
traffic plots (in ASCII art). The bullets below provide a summary of
that explanation, which builds from the simple case of long-running
flows through a single link up to a full meshed network with on-off
flows of different sizes and different behaviours:Starting with the simple case of long-running flows focused on
a single bottleneck link, tenants get weighted shares of the link,
much like weighted round robin, but with no mechanism in any of
the links. This is because losses (or ECN marks) are random, so if
one tenant sends twice as much bit-rate it will suffer twice as
many lost bits (or ECN-marked bits). So, at least for constant
long-running flows, regulating congestion-bits gives the same
outcome as regulating bits;In the more realistic case where flows are not all long-running
but a mix of short to very long, it is explained that bit-rate is
not a sufficient metric for isolating performance; how often a tenant is sending (or not sending) is
the significant factor for performance isolation, not whether
bit-rate is shared equally whenever a source happens to be
sending;Although it might seem that data volume would be a good measure
of how often a tenant is sending, we then show why it is not. For
instance, a tenant can send a large volume of data but hardly
affect the performance of others — by being more responsive
to congestion. Using congestion-volume (congestion-bit-rate over
time) in a policer encourages large data senders to be more
responsive (to yield), giving other tenants much higher
performance while hardly affecting their own performance. Whereas,
using straight volume as an allocation metric provides no
distinction between high volume sources that yield and high volume
sources that do not yield (the widespread behaviour today);We then show that a policer based on the congestion-bit-rate
metric works across a network of links treating it as a pool of
capacity, whereas other approaches treat each link independently,
which is why the proposed approach requires none of the
configuration complexity on switches that is involved in other
approaches.We also show that a congestion policer can be arranged to limit
bursts of congestion from sources that focus traffic onto a single
link, even where one source may consist of a large aggregate of
sources.We show that a congestion policer rewards traffic that shifts
to less congested paths (e.g. multipath TCP or virtual machine
motion). This means congestion policing encourages and ultimately
forces end-systems to balance their load over the whole pool of
bandwidth. The network can attempt to balance the load, but bulk
congestion policing is particularly designed to encourage
end-systems to do the job, either at the transport layer with
multipath TCP or at the application layer
by moving virtual machines or choosing peer virtual machines in a
similar way to BitTorrent.We show that congestion policing works on the pool of links,
irrespective of whether individual links have significantly
different capacities.We show that a congestion policer allows a wide variety of
responses to congestion (e.g. New Reno TCP, Cubic TCP, Compound
TCP, Data Centre TCP and even unresponsive UDP traffic), while
still encouraging and enforcing a sufficient response to
congestion from all sources taken together.Congestion policing can and will enforce a congestion response
if a tenant persistently causes excessive congestion. This ensures
that each tenant's minimum performance is isolated from the
combined effects of everyone else. However, the purpose of
congestion policing is not to intervene in everyone's rate control
all the time. Rather it is encourage each tenant to avoid being
policed — to keep the aggregate of all their flows'
responses to congestion within an overall envelope and balanced
across the network. also includes a section that gives
guidance on how to estimate appropriate fill rates and sizes for
congestion token buckets.The design involves the following elements, each detailed in the
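As a short worked illustration of the first bullet above (the numbers are
illustrative assumptions, not taken from this document): if a bottleneck queue
marks or drops a fraction p of all bits at random, a tenant sending at bit-rate
x sees a congestion-bit-rate of roughly p * x. A policer that caps each
tenant's congestion-bit-rate at its allowance wi therefore caps its sustainable
bit-rate at about wi / p, so sustained bit-rates end up in proportion to the
allowances, much as weighted round robin would arrange, but without any
mechanism in the link.

# Illustrative numbers only: one bottleneck, random ECN marking at probability p.
p = 0.01                      # 1% of bits ECN-marked at the bottleneck

allowances = {"T1": 4e6,      # tenant T1 contracted for 4 Mb/s of congestion-bit-rate
              "T2": 2e6}      # tenant T2 contracted for 2 Mb/s

# A tenant sending at bit-rate x sees congestion-bit-rate ~ p * x, so the
# highest sustainable bit-rate under a congestion policer is x = wi / p.
sustainable = {tenant: w / p for tenant, w in allowances.items()}

print(sustainable)   # {'T1': 400000000.0, 'T2': 200000000.0}, i.e. 400 and 200 Mb/s,
                     # in the same 2:1 ratio as the allowances, like WRR weights.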
The design involves the following elements, each detailed in the following
subsections.

The operator of the data centre infrastructure needs to trust the congestion
information brought to the ingress; therefore it cannot just use the feedback in the
end-to-end transport (e.g. TCP SACK or ECN echo congestion experienced
flags) that might anyway be encrypted. Trusted congestion feedback may
be implemented in either of the following two ways:

a.  as a shim in both sending and receiving hypervisors using an edge-to-edge
    (host-host) tunnel controlled by the infrastructure operator, with feedback
    messages reporting congestion back to the sending host's hypervisor (in
    addition to the e2e feedback at the transport layer); or

b.  in the sending operating system using the congestion exposure protocol
    (ConEx), with a ConEx audit function at the egress edge to check ConEx
    signals against actual congestion signals.

The feedback tunnel approach (a) is inefficient because it
duplicates end-to-end feedback and it introduces at least a round
trip's delay, whereas the ConEx approach (b) is more efficient and
not delayed, because ConEx packets signal a conservative estimate of
congestion in the upcoming round trip. Avoiding feedback delay is
important for controlling congestion from aggregated short flows.
However, ConEx signals will not necessarily be supported by the
sending operating system.

Therefore, given ConEx IP packets are self-identifying, the best
approach is to rely on ConEx signals when present and fill in with
tunnelled feedback when not, on a packet-by-packet basis.

Both approaches are much easier if explicit congestion notification (ECN) is
enabled on network
switches and if all packets are ECN-capable. For non-ECN-capable
packets, ECN support can be turned on in the outer of an
edge-to-edge tunnel. The reasons that ECN helps in each case are:

Tunnel Feedback:  To feed back congestion signals, the tunnel
egress needs to be able to detect forward congestion signals in
the first place. If the only symptom of congestion is dropped
packets, the egress has to watch for gaps in the sequence space
of the transport protocol, which cannot be guaranteed to be
possible—the IP payload may be encrypted, or an unknown
protocol, or parts of the flow may be sent over diverse paths.
The tunnel ingress could add its own sequence numbers (as done
by some pseudowire protocols), but it is easier to simply turn
on ECN at the ingress so that the egress can detect ECN
markings.

ConEx:  The audit function needs to be able to compare ConEx
signals with actual congestion. So, as before, it needs to be
able to detect congestion at the egress. Therefore the same
arguments for ECN apply.

The above cases can be arranged in a 2x2 matrix, to show when edge-to-edge
tunnelling is needed and what function the tunnel would need to serve:

                   +--------------------+--------------------------+
                   | ECN-capable: Y     | ECN-capable: N           |
+------------------+--------------------+--------------------------+
| ConEx-capable: Y | No tunnel needed   | ECN-enabled tunnel       |
+------------------+--------------------+--------------------------+
| ConEx-capable: N | Tunnel feedback    | ECN-enabled tunnel       |
|                  |                    |   + tunnel feedback      |
+------------------+--------------------+--------------------------+

We can now summarise the steps necessary to ensure an ingress
congestion policer obtains trustworthy congestion signals:

Sending operating system:

   The sender SHOULD send ConEx-enabled and ECN-enabled packets whenever
   possible. If the sender uses IPv6, it can signal ConEx in a destination
   option header. If the sender uses IPv4, it can signal ConEx markings by
   encoding them within the packet ID field, as proposed for reuse of the IPv4
   Identification field in atomic packets.

Ingress edge:

   o  If an arriving packet is either not ConEx-capable or not ECN-capable, it
      SHOULD be tunnelled to the appropriate egress edge in an outer IP header.
      A pre-existing edge-to-edge tunnel (e.g. NVGRE or VXLAN) can be used,
      irrespective of whether the packet is not ConEx-capable or not
      ECN-capable.

   o  Incoming ConEx signals MUST be copied to the outer. For an incoming IPv4
      packet, this implies copying the ID field. For an incoming IPv6 packet,
      this implies copying the Destination Option header.

   o  In all cases, the tunnel ingress MUST use the normal mode of ECN
      tunnelling [RFC6040].

Directly after encapsulation (but not if the packet was not encapsulated):

   o  If and only if the ECN field of the outer header is not ECN-capable
      (Not-ECT, i.e. 00), it MUST be made ECN-capable by remarking it to
      ECT(0), i.e. 01.

   o  If the outer ECN field carries any other value than 00, it should be
      left unchanged.

Directly before the edge egress (irrespective of whether the packet is
encapsulated):

   o  If the outer IP header is ConEx-capable, it MUST be passed through a
      ConEx audit function.

   o  If the packet is not ConEx-capable, it MUST be passed to a function that
      feeds back ECN marking statistics to the tunnel ingress. Such a function
      is also a requirement of tunnel congestion exposure, which may be
      re-usable for this purpose {ToDo: to be confirmed}.

Egress edge decapsulator:

   Decapsulation must comply with [RFC6040]. This ensures that a congestion
   experienced marking (CE or 11) on the outer will lead to the packet being
   dropped if the inner indicates that the endpoints will not understand ECN
   (i.e. the inner ECN field is Not-ECT or 00). Effectively the egress edge
   drops such packets on behalf of the congested upstream buffer that marked
   it, because the packet appeared to be ECN-capable on the outside, but it is
   not ECN-capable on the inside. The decapsulation rules were deliberately
   arranged like this so that such packets are dropped to give an equivalent
   congestion signal to the end-to-end transport. A sketch of this edge
   processing is given below.
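The following Python sketch (illustrative only; the packet attributes and
helper functions are assumptions, not from this document, and field handling is
simplified) pulls together the per-packet decision logic described in the steps
above: at the ingress edge, tunnel any packet that is not ConEx-capable or not
ECN-capable and make the outer header ECN-capable if it is Not-ECT; at the
egress, audit ConEx-capable packets, feed ECN marking statistics back to the
ingress for the rest, and apply the [RFC6040]-style drop rule on decapsulation.

NOT_ECT, ECT0, CE = 0b00, 0b01, 0b11     # ECN field codepoints (ECT(1) = 0b10 omitted)

def ingress_edge(pkt, encapsulate):
    """Ingress-edge handling of one outgoing packet (illustrative sketch only)."""
    if not pkt.conex_capable or pkt.ecn == NOT_ECT:
        outer = encapsulate(pkt)          # pre-existing edge-to-edge tunnel, e.g. NVGRE/VXLAN
        outer.conex = pkt.conex           # copy any incoming ConEx signals to the outer header
        outer.ecn = pkt.ecn               # normal-mode ECN tunnelling on encapsulation
        if outer.ecn == NOT_ECT:          # directly after encapsulation: force ECN capability
            outer.ecn = ECT0
        return outer
    return pkt                            # ConEx- and ECN-capable: no tunnel needed

def egress_edge(outer, audit, feed_back, decapsulate):
    """Egress-edge handling, directly before and during decapsulation (sketch)."""
    if outer.conex_capable:
        audit(outer)                      # compare ConEx signals with actual congestion marks
    else:
        feed_back(outer.ecn == CE)        # report ECN marking statistics to the tunnel ingress
    inner = decapsulate(outer)
    if outer.ecn == CE and inner.ecn == NOT_ECT:
        return None                       # RFC 6040 decapsulation: drop on behalf of the marker
    if outer.ecn == CE:
        inner.ecn = CE                    # otherwise propagate the congestion mark inward
    return inner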
Network switches/routers do not need any modification. However, both congestion
detection by the tunnel (approach a) and ConEx audit
(approach b) are significantly easier if switches support ECN.

Once switches support ECN, Data Centre TCP
could optionally be used (DCTCP requires ECN). It also requires
modified sender and receiver TCP algorithms as well as a more
aggressive configuration of the active queue management (AQM) in the
L3 switches or routers.

Innovation in the design of congestion policers is expected and encouraged, but
here we will describe one specific design to be concrete.

A bulk congestion policing function would most likely be
implemented as a shim in the hypervisor. The hypervisor would create
one instance of a bulk congestion policer per tenant on the physical
machine, and it would ensure that all traffic sent by that tenant's
VMs into the network would pass through the relevant congestion
policer by associating every new virtual machine with the relevant
policer.

A bulk congestion policing function has already been outlined earlier. To
recap, it consists of a token
bucket that is filled with congestion tokens at a constant rate. The
bucket is drained by the size of every packet that carries a
congestion marking. If the tunnel-feedback approach (a) were used, the
bucket would be drained by congestion feedback from the tunnel egress,
rather than markings on packets. If the ConEx approach (b) were used,
the bucket would be drained by ConEx markings on the actual data
packets being forwarded. A congestion policer will need to drain in
response to either form of signal, because it is recommended that both
approaches are used in combination.

Various more sophisticated congestion policer designs have been evaluated. In
these experiments, it was found that it is better if the policer gradually
increases discards as the bucket becomes empty. Also, isolation between tenants
is better if each tenant is policed based on the combination of two buckets,
not one:

o  a deep bucket (that would take minutes or even hours to fill at the
   contracted fill-rate) that constrains the tenant's long-term average rate of
   congestion (wi);

o  a very shallow bucket (e.g. only two or three MTU) that is filled
   considerably faster than the deep bucket (c * wi, where c = ~10), which
   prevents a tenant storing up a large backlog of tokens then causing
   congestion in one large burst.
In this arrangement, each marked packet drains tokens from both buckets, and
the probability of policer discard is taken as the worse of the two buckets. A
minimal sketch of such a dual-bucket policer is given below.
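The following Python sketch illustrates the dual-bucket arrangement just
described; the parameter values (bucket depths, the 20% ramp at which discards
begin) are assumptions for illustration, not values specified by this document.
It fills a deep bucket at the contracted congestion-bit-rate wi and a shallow
bucket at c * wi, drains both only for congestion-marked bits, and derives a
discard probability that rises gradually as the emptier (worse) of the two
buckets approaches empty.

import random, time

class DualTokenBucket:
    """Per-tenant congestion policer with a deep and a shallow bucket (sketch)."""

    def __init__(self, w_bps, deep_depth_bits, c=10, shallow_depth_bits=3 * 1500 * 8):
        # each bucket: [fill rate, depth, current level, last refill time]
        self.buckets = [
            [w_bps, deep_depth_bits, deep_depth_bits, time.monotonic()],            # deep
            [c * w_bps, shallow_depth_bits, shallow_depth_bits, time.monotonic()],  # shallow
        ]

    def _refill(self):
        now = time.monotonic()
        for b in self.buckets:
            rate, depth, level, last = b
            b[2] = min(depth, level + rate * (now - last))
            b[3] = now

    def police(self, size_bits, congestion_marked):
        """Return True to forward the packet, False to discard it.

        Congestion-marked bits (ECN/ConEx marks or tunnel feedback) drain both
        buckets; the discard probability rises gradually as the emptier bucket
        (the 'worse' of the two) approaches empty.
        """
        self._refill()
        if congestion_marked:
            for b in self.buckets:
                b[2] = max(0.0, b[2] - size_bits)
        worst_fraction = min(b[2] / b[1] for b in self.buckets)  # 0 = empty, 1 = full
        discard_prob = max(0.0, 1.0 - worst_fraction / 0.2)      # illustrative 20% ramp
        return random.random() >= discard_prob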
While the data centre network operator only needs to police congestion in bulk,
tenants may wish to enforce their own limits on
individual users or applications, as sub-limits of their overall
allowance. Given that all the information used for policing is readily
available within the transport layer of their own operating system, tenants can
readily apply any such per-flow, per-user or
per-application limitations. The tenant may operate their own
fine-grained policing software, or such detailed control capabilities
may be offered as part of the platform (platform as a service or
PaaS).

A customer may run virtual machines on multiple physical nodes, in
which case at the time each VM is instantiated the data centre
operator will deploy a congestion policer in the hypervisor on each
node where the customer is running a VM. The DC operator can arrange
for these congestion policers to collectively enforce the per-customer
congestion allowance, as a distributed policer.

A function to distribute a customer's tokens to the policer
associated with each of the customer's VMs would be needed. This could
be similar to distributed rate limiting for cloud control,
which uses a gossip-like protocol to fill the sub-buckets.
Alternatively, a logically centralised bucket of congestion tokens could be
used. It could be replicated for reliability, and then there would be simple
1-1 communication between the central bucket and each local token bucket. A
sketch of such a token-distribution function is given below.
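As one possible realisation (an assumption for illustration, not a design
mandated by this document), a logically centralised allocator could hold the
tenant's contracted congestion-token fill rate and periodically re-divide it
among the sub-policers on each physical host in proportion to how fast each one
has recently been consuming tokens, so the allowance automatically shifts to
the VMs that need it most:

class CentralTokenAllocator:
    """Splits one tenant's contracted fill rate among per-host sub-policers (sketch)."""

    def __init__(self, contracted_fill_bps):
        self.total_fill = contracted_fill_bps   # the tenant's single network-wide allowance
        self.demand = {}                        # host id -> recently metered congestion-bit-rate

    def report(self, host, congestion_bps):
        """Each hypervisor periodically reports its sub-policer's recent drain rate."""
        self.demand[host] = congestion_bps

    def reallocate(self, min_share=0.05):
        """Return a new fill rate for each host's sub-bucket.

        Fill rate follows recent demand, but every host keeps a small minimum
        share so an idle VM can always start sending again.
        """
        if not self.demand:
            return {}
        floor = min_share * self.total_fill / len(self.demand)
        spare = self.total_fill - floor * len(self.demand)
        total_demand = sum(self.demand.values()) or 1.0
        return {host: floor + spare * d / total_demand
                for host, d in self.demand.items()}

A gossip-based alternative, as in distributed rate limiting, would exchange the
same demand reports directly between hypervisors instead of via a central
point.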
Importantly, congestion tokens can be freely reassigned between different VMs,
because a congestion token is equivalent at any place
or time in a network. In contrast, traditional bit-rate tokens cannot
simply be reassigned from one VM to another without implications on
the balance of network loading. This is because the parameters used
for bit-rate policing depend on the topology and its capacity planning
(open loop), whereas congestion policing complements the closed loop
congestion avoidance system that adapts to the prevailing traffic and
topology.

As well as distribution of tokens between the VMs of a tenant, it
would similarly be feasible to allow transfer of tokens between
tenants, also without breaking the performance isolation properties of
the system. Secure token transfer mechanisms could be built above the
underlying policing design described here, but that is beyond the
current scope and therefore deferred to future work.

A mechanism to bring trustworthy congestion signals to the ingress is critical
to this performance isolation solution. The earlier section on trustworthy
congestion signals compares the two solutions: b) ConEx, which is efficient and
timely enough to police short flows; and a) tunnel-feedback, which is neither.
However, ConEx
requires deployment in host operating systems first, while tunnel
feedback can be deployed unilaterally by the data centre operator in
all hypervisors (or containers), without requiring support in guest
operating systems.

That section describes the steps necessary to support both
approaches. This would provide an incremental deployment route with
the best of both worlds: tunnel feedback could be deployed initially
for unmodified guest OSs despite its weaknesses, and ConEx could
gradually take over as it was deployed more widely in guest OSs. It is
important not to deploy the tunnel feedback approach without checking
for ConEx-capable packets, otherwise it will never be possible to
migrate to ConEx. The advantages of being able to migrate to ConEx are:

o  no duplicate feedback channel between hypervisors (sending and forwarding a
   large proportion of tiny packets), which would cause considerable packet
   processing overhead;

o  performance isolation includes the contribution to congestion from short
   (sub-round-trip-time) flows.

Initially, the approach would be confined to intra-data centre
traffic. With the addition of ECN support on network equipment (at
least bottleneck access routers) in the WAN between data centres, it
could straightforwardly be extended to inter-data centre scenarios,
including across interconnected backbone networks.

Once this approach becomes deployed within data centres and
possibly across interconnects between data centres and enterprise
LANs, the necessary support will be implemented in a wide range of
equipment used in these scenarios. Similar equipment is also used in
other networks (e.g. broadband access and backhaul), so that it would
start to be possible for these other networks to deploy a similar
approach.

The Related Work section of the companion document provides a useful comparison
of the approach proposed here against other attempts to solve similar problems.

When the hose model is used with Diffserv, capacity has to be
considerably over-provisioned for all the unfortunate cases when
multiple sources of traffic happen to coincide even though they are all
in-contract at their respective ingress policers. Even so, every node
within a Diffserv network also has to be configured to limit higher
traffic classes to a maximum rate in case of really unusual traffic
distributions that would starve lower priority classes. Therefore, for
really important performance assurances, Diffserv is used in the 'pipe'
model where the policer constrains traffic separately for each
destination, and sufficient capacity is provided at each network node
for the sum of all the peak contracted rates for paths crossing that
node.

In contrast, the congestion policing approach is designed to give
full performance assurances across a meshed network (the hose model),
without having to divide a network up into pipes. If an unexpected
distribution of traffic from all sources focuses on a congestion
hotspot, it will increase the congestion-bit-rate seen by the policers
of all sources contributing to the hot-spot. The congestion policers
then focus on these sources, which in turn limits the severity of the
hot-spot.

The critical improvement over Diffserv is that the ingress edges receive
information about any congestion occurring in the middle, so they
can limit how much congestion occurs, wherever it happens to occur.
Previously Diffserv edge policers had to limit traffic generally in case
it caused congestion, because they never knew whether it would (open
loop control).

Congestion policing mechanisms could be used to assure the
performance of one data flow (the 'pipe' model), but this would involve
unnecessary complexity, given the approach works well for the 'hose'
model.

Therefore, congestion policing allows capacity to be provisioned for
the average case, not for the near-worst case when many unlikely cases
coincide. It assures performance for all traffic using just one traffic
class, whereas Diffserv only assures performance for a small proportion
of traffic by partitioning it off into higher priority classes and
over-provisioning relative to the traffic contracts sold for this class.

{ToDo: Refer to the companion document for comparison with WRR & WFQ}

Seawall {ToDo}

{ToDo}

This document does not require actions by IANA.

{ToDo}

Thanks to Yu-Shun Wang for comments on some of the
practicalities.

Bob Briscoe is part-funded by the European Community under its
Seventh Framework Programme through the Trilogy 2 project (ICT-317756).
The views expressed here are solely those of the author.

References

o  Congestion Exposure (ConEx) Concepts and Abstract Mechanism (Google; BT).

o  IPv6 Destination Option for ConEx.

o  Network Performance Isolation using Congestion Policing (the companion
   document referred to throughout).

o  Reusing the IPv4 Identification Field in Atomic Packets.

o  NVGRE: Network Virtualization using Generic Routing Encapsulation.

o  VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3
   Networks.

o  Tunnel Congestion Exposure.

o  Policing Freedom to Use the Internet Resource Pool (BT; BT & UCL; BT).

o  Progress on resource control (UCL).

o  Seawall: Performance Isolation in Cloud Datacenter Networks (Microsoft and
   Cornell Uni; Microsoft).

o  Data Center TCP (DCTCP).

o  Cloud control with distributed rate limiting.

Detailed changes are available from
http://tools.ietf.org/html/draft-briscoe-conex-data-centre

o  Took out the text of Section 4 "Performance Isolation Intuition" and Section
   6 "Parameter Setting" into a separate draft and instead included only a
   summary in these sections, referring out for details.

o  Considerably updated Section 5 "Design".

o  Clarifications and updates throughout, including addition of diagrams.

o  Split off data-centre scenario as a separate document, by popular request.