idnits 2.17.1 

draft-briscoe-tsvwg-re-ecn-border-cheat-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 14.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 1607.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 1584.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 1591.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 1597.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([RSVP-ECN], [Re-TCP], [PCN]),
     which it shouldn't.  Please replace those with straight textual mentions
     of the documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The exact meaning of the all-uppercase expression 'MAY NOT' is not
     defined in RFC 2119.  If it is intended as a requirements expression, it
     should be rewritten using one of the combinations defined in RFC 2119;
     otherwise it should not be all-uppercase.

  == The expression 'MAY NOT', while looking like RFC 2119 requirements text,
     is not defined in RFC 2119, and should not be used.  Consider using 'MUST
     NOT' instead (if that is what you mean).
     
     Found 'MAY NOT' in this paragraph:
     
     If the ingress gateway can guarantee that the network(s) that will
     carry the flow to its egress gateway all use a common identifier for the
     aggregate (e.g. a single MPLS network without ECMP routing), it MAY NOT
     set NF when it adds a new flow to an active aggregate and an NF packet
     need only be sent if a whole aggregate has been idle for more than 1
     second.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (February 27, 2006) is 6633 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-03) exists of
     draft-briscoe-tsvwg-cl-phb-01

  -- Possible downref: Normative reference to a draft: ref. 'PCN' 

  == Outdated reference: A later version (-01) exists of
     draft-lefaucheur-rsvp-ecn-00

  -- Possible downref: Normative reference to a draft: ref. 'RSVP-ECN' 

  == Outdated reference: A later version (-09) exists of
     draft-briscoe-tsvwg-re-ecn-tcp-01

  == Outdated reference: A later version (-04) exists of
     draft-briscoe-tsvwg-cl-architecture-02

  == Outdated reference: A later version (-20) exists of
     draft-ietf-nsis-rmd-06


     Summary: 4 errors (**), 0 flaws (~~), 8 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Transport Area Working Group                                  B. Briscoe
3	Internet-Draft                                                  BT & UCL
4	Expires: August 31, 2006                               February 27, 2006

6	        Emulating Border Flow Policing using Re-ECN on Bulk Data
7	               draft-briscoe-tsvwg-re-ecn-border-cheat-00

9	Status of this Memo

11	   By submitting this Internet-Draft, each author represents that any
12	   applicable patent or other IPR claims of which he or she is aware
13	   have been or will be disclosed, and any of which he or she becomes
14	   aware will be disclosed, in accordance with Section 6 of BCP 79.

16	   Internet-Drafts are working documents of the Internet Engineering
17	   Task Force (IETF), its areas, and its working groups.  Note that
18	   other groups may also distribute working documents as Internet-
19	   Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six months
22	   and may be updated, replaced, or obsoleted by other documents at any
23	   time.  It is inappropriate to use Internet-Drafts as reference
24	   material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt.

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	   This Internet-Draft will expire on August 31, 2006.

34	Copyright Notice

36	   Copyright (C) The Internet Society (2006).

38	Abstract

40	   Scaling per flow admission control to the Internet is a hard problem.
41	   A recently proposed approach combines Diffserv and pre-congestion
42	   notification (PCN) to provide a service slightly better than Intserv
43	   controlled load.  It scales to networks of any size, but only if
44	   domains trust each other to comply with admission control and rate
45	   policing.  This memo claims to solve this trust problem without
46	   losing scalability.  It describes bulk border policing that emulates
47	   per-flow policing with the help of another recently proposed
48	   extension to ECN, involving re-echoing ECN feedback (re-ECN).  With
49	   only passive, bulk measurements at borders, sanctions can be applied
50	   against cheating networks.

52	Status (to be removed by the RFC Editor)

54	   This memo is posted as an Internet-Draft with the intent to
55	   eventually progress to informational status.  It is envisaged that
56	   the necessary standards actions to realise the system described would
57	   sit in three other documents currently being discussed (but not on
58	   the standards track) in the IETF Transport Area [Re-TCP], [RSVP-ECN]
59	   & [PCN].  The authors seek comments from the Internet community on
60	   whether combining PCN and re-ECN is a sufficient solution to the
61	   admission control problem.

63	Table of Contents

65	   1.  Introduction
66	   2.  Requirements Notation
67	   3.  The Problem
68	     3.1.  The Traditional Per-flow Policing Problem
69	     3.2.  Generic Scenario
70	   4.  Re-ECN Protocol for an RSVP Transport
71	     4.1.  Protocol Overview
72	     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or
73	           v6)
74	     4.3.  Protocol Operation
75	     4.4.  Aggregate Bootstrap
76	     4.5.  Flow Bootstrap
77	   5.  Emulating Border Policing with Re-ECN
78	     5.1.  Policing Overview
79	     5.2.  Pre-requisite Contractual Arrangements
80	     5.3.  Emulation of Per-Flow Rate Policing: Rationale and
81	           Limits
82	     5.4.  Policing Dishonest Marking
83	     5.5.  Competitive Routing
84	     5.6.  Fail-safes
85	   6.  Analysis
86	   7.  Extensions
87	   8.  Design Choices and Rationale
88	   9.  IANA Considerations
89	   10. Security Considerations
90	   11. Conclusions
91	   12. Acknowledgements
92	   13. Comments Solicited
93	   14. References
94	     14.1. Normative References
95	     14.2. Informative References
96	   Appendix A.  Implementation
97	     A.1.  Ingress Gateway Algorithm for Blanking the RE bit
98	     A.2.  Bulk Downstream Congestion Metering Algorithm
99	     A.3.  Algorithm for Sanctioning Negative Traffic
100	   Author's Address
101	   Intellectual Property and Copyright Statements

103	1.  Introduction

105	   The Internet community largely lost interest in the Intserv
106	   architecture after it was clarified that it would be unlikely to
107	   scale to the whole Internet [RFC2208].  Although Intserv mechanisms
108	   proved impractical, the services it aimed to offer are still very
109	   much required.

111	   A recently proposed approach [CL-arch] combines Diffserv and pre-
112	   congestion notification (PCN) to provide a service slightly better
113	   than Intserv controlled load [RFC2211].  It scales to any size
114	   network, but only if domains trust each other to comply with
115	   admission control and rate policing.  This memo describes border
116	   policing measures to sanction networks that cheat each other.  The
117	   approach provides a sufficient emulation of flow rate policing at
118	   trust boundaries but without per-flow processing.  The emulation is
119	   not perfect, but it is sufficient to ensure that the punishment is at
120	   least proportionate to the severity of the cheat.

122	   The aim is to be able to claim that controlled load service can scale
123	   to any number of endpoints, even though such scaling must take
124	   account of the increasing numbers of networks and users who may all
125	   have conflicting interests.  To achieve such scaling, this memo
126	   combines two recent proposals, both of which it briefly recaps:

128	   o  A framework for admission control over Diffserv using pre-
129	      congestion notification [CL-arch] describes how bulk pre-
130	      congestion notification on routers within an edge-to-edge Diffserv
131	      region can emulate the precision of per-flow admission control to
132	      provide controlled load service without unscalable per-flow
133	      processing;

135	   o  Re-ECN: Adding Accountability to TCP/IP [Re-TCP].  The trick that
136	      addresses cheating at borders is to recognise that border policing
137	      is mainly necessary because cheating upstream networks will admit
138	      traffic when they shouldn't only as long as they don't directly
139	      experience the downstream congestion their misbehaviour can cause.
140	      The re-ECN protocol ensures upstream nodes honestly declare
141	      expected downstream congestion in all forwarded packets, which we
142	      then use to emulate border policing.

144	   Rather than the end-to-end arrangement used when re-ECN was specified
145	   for the TCP transport [Re-TCP], this memo specifies re-ECN in an
146	   edge-to-edge arrangement, making it applicable to the Diffserv
147	   admission control scenario in the framework.  Also, rather than using
148	   a TCP transport for regular congestion feedback, this memo specifies
149	   re-ECN using RSVP as the transport.  We use the proposed minor
150	   extension of RSVP that allows it to carry congestion feedback [RSVP-
151	   ECN], which is much less frequent but more precise than TCP.

153	   Of course, network operators may choose to process per-flow
154	   signalling at their borders for their own reasons, such as per-flow
155	   accounting.  But the goal of this document is to show that per-flow
156	   processing at borders is no longer necessary in order to provide end-
157	   to-end QoS using flow admission control.  To be clear, we are
158	   absolutely opposed to standardisation of technology that embeds
159	   particular business models into the Internet.  Our aim here is to
160	   provide a new metric (downstream congestion) at trust boundaries.
161	   Given the well-known significance of congestion in economics,
162	   operators can then use this new metric in their interconnection
163	   contracts if they choose.  This will enable competitive evolution of
164	   new business models (for examples see [IXQoS]), alongside more
165	   traditional models that depend on more costly per-flow processing at
166	   borders.

168	   We specify this protocol solution in detail in Section 4, after
169	   specifying the inter-domain policing problem more precisely and
170	   briefly recapping the framework for providing admission control using
171	   pre-congestion notification in Section 3.

173	   Having described the solution, this memo continues as follows: {ToDo:
174	   }

176	2.  Requirements Notation

178	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
179	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
180	   document are to be interpreted as described in [RFC2119].

182	3.  The Problem

184	3.1.  The Traditional Per-flow Policing Problem

186	   If we claim to be able to emulate per-flow policing with bulk
187	   policing at trust boundaries, we need to know exactly what we are
188	   emulating.  So, even though we expect it to become a historic
189	   practice, we will start from the traditional scenario with per-flow
190	   policing at trust boundaries to explain why it has always been
191	   considered necessary.

193	   To be able to take advantage of a reservation-based service such as
194	   controlled load, a source must reserve resources using a signalling
195	   protocol such as RSVP [RFC2205].  But, even if the source is
196	   authorised and admitted at the flow level, it cannot necessarily be
197	   trusted to send packets within the rate profile it requested.  For
198	   instance, without data rate policing, a source could reserve
199	   resources for an 8kbps audio flow but transmit a 6Mbps video (theft
200	   of service).  More subtly, the sender could generate bursts that were
201	   outside the profile it had requested.

203	   In traditional architectures, per-flow packet rate-policing is
204	   expensive and unscalable but, without it, a network is vulnerable to
205	   such theft of service (whether malicious or accidental).  Perhaps
206	   more importantly, if flows are allowed to send more data than they
207	   were permitted, the ability of admission control to give assurances
208	   to other flows will break.

210	   A signalled request refers to a flow of packets by its flow ID tuple
211	   (filter spec [RFC2205]) (or its security parameter index (SPI)
212	   [RFC2207] if port numbers are hidden by IPsec encryption).  But
213	   merely opening a pin-hole for packets that match an admitted flow ID
214	   is an insufficient policing mechanism.  The packet rate must also be
215	   policed to keep the flow within the requested flow spec [RFC2205].
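
To make the distinction between a pin-hole and a rate policer concrete, the following Python sketch polices one flow against a token-bucket profile. It is only an illustration: the token-bucket form, the names TokenBucketPolicer and count_conforming, and all parameter values are assumptions introduced here, not something this memo specifies.

```python
class TokenBucketPolicer:
    """Hypothetical per-flow policer; rate in bytes/s, depth in bytes."""

    def __init__(self, rate_bytes_per_s, bucket_depth_bytes):
        self.rate = rate_bytes_per_s
        self.depth = bucket_depth_bytes
        self.tokens = bucket_depth_bytes  # start with a full bucket
        self.last_time = 0.0

    def conforms(self, now, packet_bytes):
        """Return True if a packet arriving at time `now` is within profile."""
        # Refill tokens for the elapsed time, capped at the bucket depth.
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


def count_conforming(policer, n_packets, interval_s, packet_bytes):
    """Offer evenly spaced packets and count those within profile."""
    return sum(
        policer.conforms(i * interval_s, packet_bytes)
        for i in range(n_packets)
    )


# An 8 kbps reservation is 1000 bytes/s; the 200-byte depth is assumed.
honest = count_conforming(TokenBucketPolicer(1000, 200), 10, 0.1, 100)
# The same reservation offered 6 Mbps (750000 bytes/s) for one second.
cheat = count_conforming(TokenBucketPolicer(1000, 200), 7500, 1 / 7500, 100)
```

Under these assumed values every packet of the honest 8 kbps flow conforms, while only a handful of the 7500 packets of the 6 Mbps flow do, even though both flows match the same admitted flow ID.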

217	   Just as sources need not be trusted to keep within their requested
218	   flow spec, whole networks might also try to cheat.  We will now set
219	   up a concrete scenario to illustrate such cheats.  Imagine
220	   reservations for unidirectional flows from senders, through at least
221	   two networks, an edge network and its downstream transit provider.
222	   Imagine the edge network charges its retail customers per reservation
223	   but also has to pay its transit provider a charge per reservation.
224	   Typically, both its selling and buying charges might depend on the
225	   duration and rate of each reservation.  The actual levels of the
226	   selling and buying prices are irrelevant to our discussion (most
227	   likely the network will sell at a higher price than it buys, of
228	   course).

230	   A cheating ingress network could systematically reduce the size of
231	   its retail customers' reservation signalling requests before
232	   forwarding them to its transit provider (and systematically reinstate
233	   the responses on the way back).  It would then receive an honest
234	   income from its upstream retail customer but only pay for
235	   fraudulently smaller reservations downstream.  Equivalently, a
236	   cheating ingress network may feed the traffic from a number of flows
237	   into an aggregate reservation over the transit that is smaller than
238	   the total of all the flows.  Because of these fraud possibilities, in
239	   traditional QoS reservation architectures the downstream network
240	   polices at each border.  The policer checks that the actual sent data
241	   rate of each flow is within the signalled reservation.

243	   Reservation signalling could be authenticated end to end, but this
244	   wouldn't prevent the aggregation cheat just described.  For this
245	   reason, and to avoid the need for a global PKI, signalling integrity
246	   is typically only protected on a hop-by-hop basis [RFC2747].

248	   A variant of the above cheat is where a router in an honest
249	   downstream network denies admission to a new reservation, but a
250	   cheating upstream network still admits the flow.  For instance, the
251	   networks may be using Diffserv internally, but Intserv admission
252	   control at their borders [RFC2998].  The cheat would only work if
253	   they were using bulk Diffserv traffic policing at their borders,
254	   perhaps to avoid the cost/complexity of Intserv border policing.  As
255	   far as the cheating upstream network is concerned, it gets the
256	   revenue from the reservation, but it doesn't have to pay any
257	   downstream wholesale charges and the congestion is in someone else's
258	   network.  The cheating network may calculate that most of the flows
259	   affected by congestion in the downstream network aren't likely to be
260	   its own.  It may also calculate that the downstream router is
261	   probably not actually congested, but rather it is denying admission
262	   to new flows to protect bandwidth assigned to other lower priority
263	   services.

265	   To summarise, in traditional reservation signalling architectures, if
266	   a network cannot trust a neighbouring upstream network to rate-police
267	   each reservation, it has to check for itself that the data fits
268	   within each of the reservations it has admitted.

270	3.2.  Generic Scenario

272	   We will now describe a generic internetworking scenario that we will
273	   use to describe and to test our bulk policing proposal.  It consists
274	   of a number of networks and endpoints that do not fully trust each
275	   other to behave.  In Section 6 we will tie down exactly what we mean
276	   by partial trust, and we will consider the various combinations where
277	   some networks do not trust each other and others are colluding
278	   together.

280	    _    ___      _____________________________________       ___    _
281	   | |  |   |   _|__    ______    ______    ______    _|__   |   |  | |
282	   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
283	   | |  |   |  |    |  |Inter-|  |Inter-|  |Inter-|  |    |  |   |  | |
284	   | |  |   |  |    |  | ior  |  | ior  |  | ior  |  |    |  |   |  | |
285	   | |  |   |  |    |  |Domain|  |Domain|  |Domain|  |    |  |   |  | |
286	   | |  |   |  |    |  |  A   |  |  B   |  |  C   |  |    |  |   |  | |
287	   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
288	   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
289	   | |  |   |  |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |  |   |  | |
290	   | |==|   |==|Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |==|   |==| |
291	   | |  |   |  |G/W |  | |  | |  | |  | |  | |  | |  |G/W |  |   |  | |
292	   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
293	   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
294	   | |  |   |  |____|  |______|  |______|  |______|  |____|  |   |  | |
295	   |_|  |___|    |_____________________________________|     |___|  |_|

297	   Sx   Ingress               Diffserv region               Egress   Rx
298	   End  Access                                              Access  End
299	   Host Network                                            Network Host
300	                <-------- edge-to-edge signalling ------->
301	                          (for admission control)

303	   <-------------------end-to-end QoS signalling protocol------------->

305	   Figure 1: Generic Scenario (see text for explanation of terms)

307	   An ingress and egress gateway (Ingr G/W and Egr G/W in Figure 1)
308	   connect the interior Diffserv region to the edge access networks
309	   where routers (not shown) use per-flow reservation processing.
310	   Within the Diffserv region are three interior domains, A, B and C, as
311	   well as the inward facing interfaces of the ingress and egress
312	   gateways.  An ingress and egress border router (BR) is shown
313	   interconnecting each interior domain with the next.  There may be
314	   other interior routers (not shown) within each interior domain.

316	   In two paragraphs we now briefly recap how pre-congestion
317	   notification is intended to be used to control flow admission to a
318	   large Diffserv region.  The first paragraph describes data plane
319	   functions and the second describes signalling in the control plane.
320	   We omit many details from [CL-arch] including behaviour during
321	   routing changes.  For brevity here we assume other flows are already
322	   in progress across a path through the Diffserv region before a new
323	   one arrives, but how bootstrap works is described in Section 4.4.

325	   Figure 1 shows a single simplex reserved flow from the sending (Sx)
326	   end host to the receiving (Rx) end host.  The ingress gateway polices
327	   incoming traffic within its admitted reservation and remarks it to
328	   turn on an ECN-capable codepoint [RFC3168] and the controlled
329	   load (CL) Diffserv codepoint.  Together, these codepoints define
330	   which traffic is entitled to the enhanced scheduling of the CL
331	   behaviour aggregate on routers within the Diffserv region.  The CL
332	   PHB of interior routers consists of a scheduling behaviour and a new
333	   ECN marking behaviour that we call 'pre-congestion
334	   notification' [PCN].  The CL PHB simply re-uses the definition of
335	   expedited forwarding (EF) [RFC3246] for its scheduling behaviour.
336	   But it incorporates a new ECN marking behaviour, which sets the ECN
337	   field of an increasing number of CL packets to the admission marked
338	   (AM) codepoint as they approach a threshold rate that is lower than
339	   the line rate.  The use of virtual queues ensures real queues have
340	   hardly built up any congestion delay.

342	   The level of marking detected at the egress of the Diffserv region
343	   is then used by the signalling system in order to determine admission
344	   control.  The end-to-end QoS signalling (e.g.  RSVP) for a new
345	   reservation takes one giant hop from ingress to egress gateway,
346	   because interior routers within the Diffserv region are configured to
347	   ignore RSVP.  The egress gateway holds flow state because it takes
348	   part in the end-to-end reservation.  So it can classify all packets
349	   by flow and it can identify all flows that have the same previous
350	   RSVP hop (a CL-region-aggregate).  For each CL-region-aggregate of
351	   flows in progress, the egress gateway maintains a per-packet moving
352	   average of the fraction of pre-congestion-marked traffic.  Once an
353	   RSVP PATH message for a new reservation has hopped across the
354	   Diffserv region and reached the destination, an RSVP RESV message is
355	   returned.  As the RESV message passes, the egress gateway piggy-backs
356	   the relevant pre-congestion level onto it [RSVP-ECN].  Again,
357	   interior routers ignore the RSVP message, but the ingress gateway
358	   strips off the pre-congestion level.  If the pre-congestion level is
359	   above a threshold, the ingress gateway denies admission to the new
360	   reservation, otherwise it returns the original RESV signal back
361	   towards the data sender.
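
The control-plane steps in the paragraph above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions: the exponentially weighted form of the moving average, the weight of 0.01, the threshold of 0.05 and the names EgressMeter and admit are all hypothetical, chosen only to illustrate the per-packet average at the egress gateway and the threshold test at the ingress gateway.

```python
class EgressMeter:
    """Hypothetical per-aggregate meter kept by the egress gateway."""

    def __init__(self, weight=0.01):
        self.weight = weight  # assumed smoothing weight
        self.level = 0.0      # moving average of the marked fraction

    def on_packet(self, marked):
        """Update the per-packet moving average with one CL packet."""
        sample = 1.0 if marked else 0.0
        self.level += self.weight * (sample - self.level)
        return self.level


def admit(pre_congestion_level, threshold=0.05):
    """Ingress decision: deny if the level is above the threshold."""
    return pre_congestion_level <= threshold


# One meter per CL-region-aggregate; here a single idle-path aggregate.
meter = EgressMeter()
for _ in range(1000):
    meter.on_packet(marked=False)
decision = admit(meter.level)  # level piggy-backed on the RESV message
```

In this sketch the egress gateway only maintains the average; the decision is taken at the ingress gateway once it has stripped the level off the returning RESV message, as in the text.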

363	   Once a reservation is admitted, its traffic will always receive low
364	   delay service for the duration of the reservation.  This is because
365	   ingress gateways ensure that traffic not under a reservation cannot
366	   pass into the Diffserv region with the CL DSCP set.  So non-reserved
367	   traffic will always be treated with a lower priority PHB at each
368	   interior router.

370	4.  Re-ECN Protocol for an RSVP Transport

372	4.1.  Protocol Overview

374	   First we need to recap the way routers accumulate congestion marking
375	   along a path.  Each ECN-capable router marks some packets with CE,
376	   the marking probability increasing with the length of the virtual
377	   queue at its egress link [PCN].  With multiple ECN-capable routers on
378	   a path, the ECN field accumulates the fraction of CE marking that
379	   each router adds.  The combined effect of the packet marking of all
380	   the routers along the path signals congestion of the whole path to
381	   the receiver.  So, for example, if one router early in a path is
382	   marking 1% of packets and another later in a path is marking 2%,
383	   flows that pass through both routers will experience approximately 3%
384	   marking.
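
The "approximately 3%" in the example above can be checked with a short calculation. The Python sketch below assumes that each router marks independently and that a packet, once marked, stays marked; the function name path_marking is introduced here only for illustration.

```python
def path_marking(fractions):
    """Combined marking fraction over a path of marking routers.

    Assumes independent marking at each router and that a marked
    packet is never unmarked further along the path.
    """
    unmarked = 1.0
    for fraction in fractions:
        unmarked *= 1.0 - fraction
    return 1.0 - unmarked


# One router marking 1% followed by another marking 2%.
combined = path_marking([0.01, 0.02])
```

Under these assumptions the combined fraction is 1 - (0.99 * 0.98) = 0.0298, slightly less than the simple sum of 3% because some packets would have been marked by both routers.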

386	   The packets crossing an inter-domain trust boundary within the
387	   Diffserv region will all have come from different ingress gateways
388	   and will all be destined for different egress gateways.  We will show
389	   that the key to policing against theft of service is to be able to
390	   measure expected downstream pre-congestion on the paths between a
391	   border router and the egress gateways that packets are headed for.

393	   With the original ECN protocol, if CE markings crossing the border
394	   had been counted over a period, they would have represented the
395	   accumulated upstream pre-congestion that had already been experienced
396	   by those packets.  The general idea of re-ECN is for the ingress
397	   gateway to continuously encode path congestion into the IP header,
398	   where path means from ingress to egress gateway.  Then at any point
399	   on that path (e.g. between domains A & B in Figure 2 below), IP
400	   headers can be monitored to subtract upstream congestion from
401	   expected path congestion in order to give the expected downstream
402	   congestion still to be experienced until the egress gateway.

404	                  _____________________________________
405	                _|__    ______    ______    ______    _|__
406	               |    |  |  A   |  |  B   |  |  C   |  |    |
407	               +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+
408	               |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |
409	               |Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |
410	               |G/W |  | |  | |: | |  | |  | |  | |  |G/W |
411	               +----+  +-+  +-+: +-+  +-+  +-+  +-+  +----+
412	               |    |  |      |: |      |  |      |  |    |
413	               |____|  |______|: |______|  |______|  |____|
414	                 |_____________:_______________________|
415	                               :
416	                 |             :                       |
417	                 |<-upstream-->:<-expected downstream->|
418	                 | congestion  :      congestion       |
419	                 |     u               v ~= p - u      |
420	                 |                                     |
421	                 |<--- expected path congestion, p --->|

423	   Figure 2: Re-ECN concept
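
The relation v ~= p - u in Figure 2 can be illustrated numerically. The Python sketch below compares the subtraction used in the figure with the value obtained if marking at each router is assumed to be independent; that independence assumption and both function names are introduced here for illustration only.

```python
def downstream_approx(p, u):
    """Expected downstream congestion as in Figure 2: v ~= p - u."""
    return p - u


def downstream_independent(p, u):
    """Downstream congestion if marking is assumed independent.

    Then (1 - p) = (1 - u) * (1 - v), so v = 1 - (1 - p) / (1 - u).
    """
    return 1.0 - (1.0 - p) / (1.0 - u)


# Path congestion 2.98%, of which 1% has been experienced upstream.
p, u = 0.0298, 0.01
v_approx = downstream_approx(p, u)
v_indep = downstream_independent(p, u)
```

For these values the subtraction gives 0.0198 while the independence assumption gives 0.02, so at low marking fractions the two differ only slightly.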

425	4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

427	   In this section we define the names of the various codepoints of the
428	   re-ECN protocol, deferring description of their semantics to the
429	   following sections.  First we recap the re-ECN wire protocol proposed
430	   in [Re-TCP].  It uses the two bit ECN field broadly as in
431	   RFC3168 [RFC3168].  It also uses a new re-ECN extension (RE) bit.
432	   The actual position of the RE bit is different between IPv4 & v6
433	   headers so we will use an abstraction of the IPv4 and v6 wire
434	   protocols by just calling it the RE bit.  [Re-TCP] proposes using bit
435	   48 (currently unused) in the IPv4 header for the RE bit, while it
436	   proposes an ECN extension header for IPv6.

438	   Unlike the ECN field, the RE bit is intended to be set by the sender
439	   and remain unchanged along the path, although it can be read by
440	   network elements that understand the re-ECN protocol.  In the
441	   scenario used in this memo, an ingress gateway changes the setting of
442	   the RE bit, acting as a proxy for the sender, as permitted in the
443	   specification of re-ECN.

445	   Although the RE bit is a separate, single bit field, it can be read
446	   as an extension to the two-bit ECN field; the three concatenated bits
447	   in what we will call the extended ECN field (EECN) make eight
448	   codepoints available.  When the RE bit setting is "don't care", we
449	   use the RFC3168 names of the ECN codepoints, but [Re-TCP] proposes
450	   the following six codepoint names for when there is a need to be more
451	   specific.

453	   +-------+------------+------+-------------+-------------------------+
454	   |  ECN  | RFC3168    |  RE  | re-ECN      |      re-ECN meaning     |
455	   | field | codepoint  |  bit | codepoint   |                         |
456	   +-------+------------+------+-------------+-------------------------+
457	   |   00  | Not-ECT    |   0  | NRECT       |    Not re-ECN-capable   |
458	   |       |            |      |             |        transport        |
459	   |   00  | Not-ECT    |   1  | NF          |       No feedback       |
460	   |       |            |      |             |                         |
461	   |   01  | ECT(1)     |   0  | Re-Echo     |   Re-echoed congestion  |
462	   |       |            |      |             |         and RECT        |
463	   |   01  | ECT(1)     |   1  | RECT        |      re-ECN capable     |
464	   |       |            |      |             |        transport        |
465	   |   10  | ECT(0)     |   0  | --CU--      |     Currently unused    |
466	   |       |            |      |             |                         |
467	   |   10  | ECT(0)     |   1  | --CU--      |     Currently unused    |
468	   |       |            |      |             |                         |
469	   |   11  | CE         |   0  | CE(0)       |  Congestion experienced |
470	   |       |            |      |             |       with Re-Echo      |
471	   |   11  | CE         |   1  | CE(-1)      |  Congestion experienced |
472	   +-------+------------+------+-------------+-------------------------+

474	    Table 1: Re-cap of Default Extended ECN Codepoints Proposed for Re-
475	                                    ECN

477	   As permitted by RFC3168, [PCN] proposes new semantics for the ECN
478	   codepoints when combined with a Diffserv codepoint (DSCP) that uses
479	   pre-congestion notification.  It also proposes various alternative
480	   encodings for these semantics, attempting to fit five states into the
481	   four available ECN codepoints by making various compromises.  The
482	   five states are Not-ECT, ECT (ECN-capable transport), the ECN Nonce,
483	   Admission Marking (AM) and Pre-emption Marking (PM).

485	   One of the five states was for the ECN Nonce [RFC3540], but the
486	   capability we describe in this memo supersedes any need for the
487	   Nonce.  The ECN Nonce is an elegant scheme, but it only allows a
488	   sending node (or its proxy) to detect suppression of congestion
489	   marking by a cheating receiver.  Thus the Nonce requires the sender
490	   or its proxy to be trusted to respond correctly to congestion.  But
491	   this is precisely the main cheat we want to protect against (as well
492	   as many others).

494	   One of the compromises that [PCN] explores ("Alternative 5") leaves
495	   out support for the ECN Nonce.  Therefore we use that one.  Then,
496	   with the addition of the RE bit, the 8 encodings of the extended ECN
497	   (EECN) field become those defined in the table below.  Note that
498	   these codepoints only take on the semantics in the table below when
499	   combined with a Diffserv codepoint that the operator has defined as
500	   supporting pre-congestion notification.

502	   +--------+-----------+------+-------------+-------------------------+
503	   |   ECN  | PCN       |  RE  | re-ECN      |      re-ECN meaning     |
504	   |  field | codepoint |  bit | codepoint   |                         |
505	   +--------+-----------+------+-------------+-------------------------+
506	   |   00   | Not-ECT   |   0  | NRECT       |    Not re-ECN-capable   |
507	   |        |           |      |             |        transport        |
508	   |   00   | Not-ECT   |   1  | NF          |       No feedback       |
509	   |        |           |      |             |                         |
510	   |   01   | ECT(1)    |   0  | Re-Echo     |   Re-echoed congestion  |
511	   |        |           |      |             |         and RECT        |
512	   |   01   | ECT(1)    |   1  | RECT        |      re-ECN capable     |
513	   |        |           |      |             |        transport        |
514	   |   10   | AM        |   0  | AM(0)       |  Admission Marking with |
515	   |        |           |      |             |         Re-Echo         |
516	   |   10   | AM        |   1  | AM(-1)      |    Admission Marking    |
517	   |   11   | PM        |   0  | PM(0)       |   Pre-emption Marking   |
518	   |        |           |      |             |       with Re-Echo      |
519	   |   11   | PM        |   1  | PM(-1)      |   Pre-emption Marking   |
520	   +--------+-----------+------+-------------+-------------------------+

522	   Table 2: Extended ECN Codepoints if the Diffserv codepoint uses Pre-
523	                       congestion Notification (PCN)

525	   For the rest of this memo, we will not distinguish between Admission
526	   Marking and Pre-emption Marking (unless stated otherwise).  We will
527	   call both "congestion marking".  With the above encoding, congestion
528	   marking can be read to mean any packet with the left-most bit of the
529	   ECN field set.

531	   All but the "not re-ECN-capable transport" (NRECT) codepoint imply
532	   the presence of an ECN-capable transport.  Congested PCN-capable
533	   routers must drop, not mark, packets carrying the NRECT codepoint.
534	   Note that adding PCN-capability to a router will involve checking the
535	   RE bit as well as the ECN field and DSCP before deciding whether to
536	   drop or to mark a packet during congestion.  Router implementations
537	   might well append the RE bit to their internal representation of the
538	   ECN field, treating them internally as one 3-bit extended ECN value.
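
   As an illustration only (it is not part of [Re-TCP] or [PCN]), the
   following Python sketch shows how a router might treat the ECN field
   and the RE bit as one 3-bit extended ECN value, using the encoding of
   Table 2.  The function and dictionary names are our own assumptions.

```python
# Extended ECN (EECN) codepoint names from Table 2, keyed by
# (2-bit ECN field, RE bit).  All names here are illustrative only.
PCN_EECN_CODEPOINTS = {
    (0b00, 0): "NRECT",
    (0b00, 1): "NF",
    (0b01, 0): "Re-Echo",
    (0b01, 1): "RECT",
    (0b10, 0): "AM(0)",
    (0b10, 1): "AM(-1)",
    (0b11, 0): "PM(0)",
    (0b11, 1): "PM(-1)",
}


def eecn_codepoint(ecn: int, re_bit: int) -> str:
    """Return the Table 2 codepoint name for an ECN field and RE bit."""
    return PCN_EECN_CODEPOINTS[(ecn & 0b11, re_bit & 1)]


def is_congestion_marked(ecn: int) -> bool:
    """Admission or Pre-emption Marking: left-most ECN bit is set."""
    return bool(ecn & 0b10)


def must_drop_when_congested(ecn: int, re_bit: int) -> bool:
    """NRECT packets are dropped rather than marked during congestion."""
    return eecn_codepoint(ecn, re_bit) == "NRECT"
```

   Note that the drop-or-mark decision needs both the ECN field and the
   RE bit, because NF shares the ECN field value 00 with NRECT.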

540	4.3.  Protocol Operation

542	   In this section we will give an overview of the operation of the re-
543	   ECN protocol for an RSVP transport, deferring a detailed
544	   specification to the following sections.

546	   The re-ECN protocol involves a simple tweak to the action of the
547	   gateway at the ingress edge of the CL region.  In the framework just
548	   described [CL-arch], for each active traffic aggregate across the CL
549	   region (CL-region-aggregate) the ingress gateway will hold a fairly
550	   recent Congestion-Level-Estimate that the egress gateway will have
551	   fed back to it, piggybacked on the signalling that sets up each flow.
552	   For instance, one aggregate might have been experiencing 3% pre-
553	   congestion (that is, congestion marked octets whether Admission
554	   Marked or Pre-emption Marked).  In this case, the ingress gateway
555	   MUST clear the RE bit to "0" for the same percentage of octets of CL-
556	   packets (3%) and set it to "1" in the rest (97%).  Appendix A.1 gives
557	   a simple pseudo-code algorithm that the ingress gateway may use to do
558	   this.

560	   The RE bit is set and cleared this way round for incremental
561	   deployment reasons (see [Re-TCP]).  To avoid confusion we will use the
562	   term `blanking' (rather than marking) when the RE bit is cleared to
563	   "0", so we will talk of the `RE blanking fraction' as the fraction of
564	   octets with the RE bit cleared to "0".
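
   The following Python sketch illustrates one way an ingress gateway
   could blank the RE bit on a target fraction of octets.  It is not
   the algorithm of Appendix A.1; the class name, the octet 'debt'
   counter and the use of exact fractions are assumptions made purely
   for illustration.

```python
from fractions import Fraction


class ReBlanker:
    """Blank the RE bit on a target fraction of octets (sketch only)."""

    def __init__(self, blanking_fraction: Fraction) -> None:
        self.blanking_fraction = blanking_fraction
        # Octets that still have to be blanked to reach the target.
        self.debt_octets = Fraction(0)

    def re_bit_for(self, packet_octets: int) -> int:
        """Return the RE bit for the next packet: 0 = blanked, 1 = set."""
        self.debt_octets += self.blanking_fraction * packet_octets
        if self.debt_octets >= packet_octets:
            # Blanking this whole packet pays off part of the debt.
            self.debt_octets -= packet_octets
            return 0
        return 1


# Usage: 3% pre-congestion fed back by the egress, equal-sized packets.
blanker = ReBlanker(Fraction(3, 100))
re_bits = [blanker.re_bit_for(1000) for _ in range(10_000)]
blanked_packets = re_bits.count(0)
```

   Because the debt is counted in octets rather than packets, the same
   approach should hold the RE blanking fraction close to the target
   when packet sizes vary.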

566	       ^
567	       |
568	       |         RE blanking fraction
569	    3% |    +----------------------------+====+
570	       |    |                            |    |
571	    2% |    |                            |    |
572	       |    | congestion marking fraction|    |
573	    1% |    |     +----------------------+    |
574	       |    |     |                           |
575	    0% +----+=====+---------------------------+------>
576	            ^   <--A---> <---B---> <---C--->  ^        domain
577	            |     ^                      ^    |
578	        ingress   |                      |    egress
579	                1.00%                 2.00%          marking fraction

581	   Figure 3: Example Re-ECN Codepoint Marking fractions (Imprecise)

583	   Figure 3 illustrates our example.  The horizontal axis represents the
584	   index of each congestible resource (typically queues) along a path
585	   through the Internet.  The two superimposed plots show the fraction
586	   of each ECN codepoint observed along this path, assuming two
587	   congested routers somewhere within domains A and C.  The table
588	   below shows the downstream pre-congestion measured at various border
589	   observation points along the path.  These figures are actually
590	   reasonable approximations derived from more precise formulae given in
591	   Appendix A of [Re-TCP].  The RE bit is not changed by interior
592	   routers, so it can be seen that it acts as a reference against which
593	   the congestion marking fraction can be compared along the path.

595	   +--------------------------+---------------------------------------+
596	   | Border observation point | Approximate Downstream pre-congestion |
597	   +--------------------------+---------------------------------------+
598	   |       ingress -- A       |              3% - 0% = 3%             |
599	   |          A -- B          |              3% - 1% = 2%             |
600	   |          B -- C          |              3% - 1% = 2%             |
601	   |        C -- egress       |              3% - 3% = 0%             |
602	   +--------------------------+---------------------------------------+
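
   The arithmetic in the table above can be reproduced with a few lines
   of Python.  This is only a worked example using the approximate
   figures of Figure 3; the variable and function names are assumed.

```python
from fractions import Fraction

# RE blanking fraction set by the ingress; unchanged along the path.
RE_BLANKING = Fraction(3, 100)

# Approximate congestion marking fraction observed at each border.
MARKING_AT_BORDER = {
    "ingress -- A": Fraction(0, 100),
    "A -- B": Fraction(1, 100),
    "B -- C": Fraction(1, 100),
    "C -- egress": Fraction(3, 100),
}


def downstream_pre_congestion(re_blanking: Fraction, marking: Fraction) -> Fraction:
    """RE blanking fraction minus congestion marking fraction."""
    return re_blanking - marking


DOWNSTREAM = {
    border: downstream_pre_congestion(RE_BLANKING, marking)
    for border, marking in MARKING_AT_BORDER.items()
}
```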

604	   Note that the ingress determines the RE blanking fraction for each
605	   aggregate using the most recent feedback from the relevant egress,
606	   arriving with each new reservation, or each refresh.  These arrive
607	   relatively infrequently compared to the speed with which congestion
608	   changes.  Although this feedback will always be out of date, on
609	   average positive errors will cancel out negative over a sufficiently
610	   long duration.

612	   In summary, the network adds pre-congestion marking in the forward
613	   data path, the egress feeds its level back to the ingress in RSVP,
614	   then the ingress gateway re-echoes it into the forward data path by
615	   blanking the RE bit.  Hence the name re-ECN.  Then at any border
616	   within the Diffserv region, the pre-congestion marking that every
617	   passing packet will be expected to experience downstream can be
618	   measured to be the RE blanking fraction minus the congestion marking
619	   fraction.

621	4.4.  Aggregate Bootstrap

623	   When a new reservation PATH message arrives at the egress, if there
624	   are currently no flows in progress from the same ingress, there will
625	   be no state maintaining the current level of pre-congestion marking
626	   for the aggregate.  While the reservation signalling continues onward
627	   towards the receiving host, the egress gateway returns an RSVP
628	   message to the ingress with a flag [RSVP-ECN] asking the ingress to
629	   send a specified number of data probes between them.  This bootstrap
630	   behaviour is all described in the framework [CL-arch].

632	   However, with our new re-ECN scheme, the ingress does not know what
633	   proportion of the data probes should have the RE bit blanked, because
634	   it has no estimate yet of pre-congestion for the path across the
635	   Diffserv region.

637	   To be conservative, following the guidance for specifying other re-
638	   ECN transports in [Re-TCP], the ingress SHOULD set the NF codepoint
639	   of the extended ECN header in all probe packets (Table 2).  As per
640	   the framework, the egress gateway measures the fraction of
641	   congestion-marked probe octets and feeds back the resulting pre-
642	   congestion level to the ingress, piggy-backed on the returning
643	   reservation response (RESV) for the new flow.  Probe packets are
644	   identifiable by the egress because they have the ingress as the
645	   source and the egress as the destination in the IP header.

647	   It may seem inadvisable to expect the NF codepoint to be set on
648	   probes, given legacy firewalls etc. might discard such packets
649	   (because this flag had no previous legitimate use).  However, in the
650	   deployment scenarios envisaged for this admission control framework,
651	   each domain in the Diffserv region has to be explicitly configured to
652	   support the controlled load service.  So, before deploying the
653	   service, the operator MUST reconfigure such a misbehaving middlebox
654	   to allow through packets with the RE bit set.

656	   Note that we have said SHOULD rather than MUST for the NF setting
657	   behaviour of the ingress for probe packets.  This entertains the
658	   possibility of an ingress implementation having the benefit of other
659	   knowledge of the path, which it re-uses for a newly starting
660	   aggregate.  For instance, it may hold cached information from a
661	   recent use of the aggregate that is still sufficiently current to be
662	   useful.

664	   It might seem pedantic worrying about these few probe packets, but
665	   this behaviour ensures the system is safe, even if the proportion of
666	   probe packets becomes large.

668	4.5.  Flow Bootstrap

670	   It might be expected that a new flow within an active aggregate would
671	   need no special bootstrap behaviour.  If there was an aggregate
672	   already in progress between the gateways the new flow was about to
673	   use, it would inherit the prevailing RE blanking fraction.  And if
674	   there were no active aggregate, the aggregate bootstrap behaviour
675	   would be appropriate and sufficient for the new flow.

677	   However, for a number of reasons, at least the first packet of each
678	   new flow SHOULD be set to the NF codepoint, irrespective of whether
679	   it is joining an active aggregate or not.  If the first packet is
680	   unlikely to be reliably delivered, a number of NF packets MAY be sent
681	   to increase the probability that at least one is delivered to the
682	   egress gateway.

684	   If each flow does not start with an NF packet, it will be seen later
685	   that sanctions may be incorrectly applied at the interface before the
686	   egress gateway.  It will often be possible to apply sanctions at the
687	   granularity of aggregates rather than flows, but in an internetworked
688	   environment it cannot be guaranteed that aggregates will be
689	   identifiable in remote networks.  So setting NF at the start of each
690	   flow is a safe strategy.  For instance, a remote network may have
691	   equal cost multi-path (ECMP) routing enabled, causing flows between
692	   the same gateways to traverse different paths.

694	   After an idle period of more than 1 second, the ingress gateway
695	   SHOULD set the EECN field of the next packet it sends to NF.  This
696	   behaviour allows the design of network policers to be
697	   deterministic.

699	   If the ingress gateway can guarantee that the network(s) that will
700	   carry the flow to its egress gateway all use a common identifier for
701	   the aggregate (e.g. a single MPLS network without ECMP routing), it
702	   MAY omit NF when it adds a new flow to an active aggregate; an NF
703	   packet then need only be sent if a whole aggregate has been idle for
704	   more than 1 second.
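
   The bootstrap rules of this section might be sketched as follows.
   This is an illustration under our own assumptions: the class name is
   hypothetical, time is passed in as seconds, and the key may be a flow
   identifier or, where aggregates are reliably identifiable along the
   whole path, an aggregate identifier.

```python
from typing import Dict, Hashable

IDLE_LIMIT_SECONDS = 1.0  # idle period after which NF is sent again


class NfStarter:
    """Decide whether the next packet for a key should carry NF."""

    def __init__(self) -> None:
        # Time at which the last packet was sent for each key.
        self.last_sent: Dict[Hashable, float] = {}

    def use_nf(self, key: Hashable, now: float) -> bool:
        """True for the first packet of a key, or after a long idle gap."""
        last = self.last_sent.get(key)
        self.last_sent[key] = now
        return last is None or now - last > IDLE_LIMIT_SECONDS
```

   Keying on the flow gives the safe default; keying on the aggregate
   corresponds to the optional behaviour for networks that share a
   common aggregate identifier.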

706	5.  Emulating Border Policing with Re-ECN

708	   Note: In the rest of this memo, where the context makes it clear, we
709	   will loosely use the term 'congestion' rather than using the stricter
710	   'downstream pre-congestion'.  Also we will loosely talk of positive
711	   or negative traffic, meaning traffic where the moving average of the
712	   downstream pre-congestion metric is persistently positive or negative
713	   respectively.

715	   The notion of positive and negative downstream pre-congestion is
716	   because downstream pre-congestion is calculated by subtracting the
717	   congestion marking fraction from the RE blanking fraction.  Therefore
718	   packets can be considered to have a 'value multiplier' of +1, 0 or
719	   -1.  Blanking the RE bit increments the 'value multiplier' of a
720	   packet.  Congestion marking a packet decrements the 'value
721	   multiplier' (whether admission marking or pre-emption marking).  Both
722	   together cancel each other out (a neutral or zero 'value
723	   multiplier').  The NF codepoint is an exception.  It has the same
724	   positive 'value multiplier' as a re-echoed packet.  The table below
725	   specifies unambiguously the value multipliers of each extended ECN
726	   codepoint.

728	   +-------+------+-------------+--------------+-----------------------+
729	   |  ECN  |  RE  | re-ECN      | 'Value       |     re-ECN meaning    |
730	   | field |  bit | codepoint   | multiplier'  |                       |
731	   +-------+------+-------------+--------------+-----------------------+
732	   |   00  |   0  | NRECT       | n/a          |   Not re-ECN-capable  |
733	   |       |      |             |              |       transport       |
734	   |   00  |   1  | NF          | +1           |      No feedback      |
735	   |   01  |   0  | Re-Echo     | +1           |  Re-echoed congestion |
736	   |       |      |             |              |        and RECT       |
737	   |   01  |   1  | RECT        | 0            |     re-ECN capable    |
738	   |       |      |             |              |       transport       |
739	   |   10  |   0  | AM(0)       | 0            |   Admission Marking   |
740	   |       |      |             |              |      with Re-Echo     |
741	   |   10  |   1  | AM(-1)      | -1           |   Admission Marking   |
742	   |   11  |   0  | PM(0)       | 0            |  Pre-emption Marking  |
743	   |       |      |             |              |      with Re-Echo     |
744	   |   11  |   1  | PM(-1)      | -1           |  Pre-emption Marking  |
745	   +-------+------+-------------+--------------+-----------------------+

747	                Table 4: 'Sign' of Extended ECN Codepoints
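
   As a cross-check of the table above, the 'value multiplier' can be
   derived from the two fields directly.  The Python function below is
   an illustrative sketch with an assumed name; it returns None where
   the table says "n/a".

```python
from typing import Optional


def value_multiplier(ecn: int, re_bit: int) -> Optional[int]:
    """Return +1, 0 or -1 for an extended ECN codepoint (None for NRECT)."""
    if ecn == 0b00:
        # NF is the exception: RE bit set, yet it counts as positive.
        return +1 if re_bit == 1 else None
    blanked = 1 if re_bit == 0 else 0  # blanking RE increments
    marked = 1 if ecn & 0b10 else 0    # congestion marking decrements
    return blanked - marked
```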

749	   Just as we will loosely talk of positive and negative traffic when we
750	   mean the level of downstream pre-congestion in the stream of traffic,
751	   we will also talk of positive or negative packets, meaning whether a
752	   packet contributes positively or negatively to downstream pre-
753	   congestion.

755	5.1.  Policing Overview

757	   To emulate border policing, the general idea is for each domain to
758	   apply financial penalties to its upstream neighbour in proportion to
759	   the amount of downstream pre-congestion that the upstream network
760	   sends across the border.  This seems to encourage everyone to
761	   understate downstream pre-congestion to reduce the penalties they
762	   incur.  But it is in the last domain's interest to create a balancing
763	   upward pressure by applying sanctions to flows where the marking
764	   fraction goes negative before the egress gateway.

766	   Of course, some domains may trust other domains to comply without
767	   applying sanctions or penalties.  In these cases, no penalties need
768	   be applied.  The re-ECN protocol ensures downstream pre-congestion
769	   marking is passed on correctly whether or not penalties are applied
770	   to it, so the system works just as well with a mixture of some
771	   domains trusting each other and others not.

773	   Figure 4 uses the same example as in previous sections to show the
774	   downstream pre-congestion marking fraction, v, across a path through
775	   the Internet.  Downward arrows show the pressure for each domain to
776	   underdeclare downstream pre-congestion in traffic they pass to the
777	   next domain, because of the penalties.  Note that at the last egress
778	   of the Diffserv region, domain C should not agree to pay any
779	   penalties to the egress gateway for pre-congestion passed to the
780	   egress gateway.  Downstream pre-congestion to the egress gateway
781	   should have reached zero here, so if domain C agreed to pay for any
782	   downstream pre-congestion, it would give the egress gateway an
783	   incentive to overdeclare pre-congestion feedback and take the
784	   resulting profit from domain C.

786	   Providers should be free to agree the contractual terms they wish
787	   between themselves, so this memo does not propose to standardise how
788	   these penalties would be applied.  It is sufficient to standardise
789	   the re-ECN protocol so the downstream pre-congestion metric is
790	   available if providers choose to use it.  However, Section 5.2 gives
791	   some examples of how these penalties might be implemented.

793	               p e n a l t i e s
794	              /        |        \
795	       A     :         :         :
796	       |     |  <--A---> <---B---> <---C--->           domain
797	       |     V         :         :         :
798	    3% |    +-----+    |         |         :
799	       |    |     |    V         V         :
800	    2% |    |     +----------------------+ :
801	       |    |  downstream pre-congestion | :
802	    1% |    |     :                      | :
803	       |    |     :                      | :
804	    0% +----+----------------------------+====+------>
805	            :     :                      : A  :
806	            :     :                      : |  :
807	        ingress   :                      : :  egress
808	                1.00%                 2.00%:         pre-congestion
809	                                           |
810	                                       sanctions

812	   Figure 4: Policing Framework, showing creation of opposing pressures
813	   to underdeclare and overdeclare downstream pre-congestion, using
814	   penalties and sanctions

816	   Any traffic that persistently goes negative by the time it leaves a
817	   domain must not have been marked correctly in the first place.  A
818	   domain that discovers such traffic can adopt a range of strategies to
819	   protect itself.  Which strategy it uses will depend on policy,
820	   because it cannot immediately assume malice---there may be an
821	   innocent configuration error somewhere in the system.  So this memo
822	   also does not propose to standardise any particular mechanism, but
823	   Section 5.4 does give examples of how the underlying re-ECN protocol
824	   could be used to apply sanctions to persistently negative traffic.
825	   The ultimate sanction would be to drop such negative traffic
826	   indiscriminately, without regard to flows.  A less drastic sanction
827	   might be to focus drop on specific packets in specific flows to
828	   remove the negative bias while doing minimal harm.

830	   In all cases a management alarm SHOULD be raised on detecting
831	   persistently negative traffic and any automatic sanctions taken
832	   SHOULD be logged.  Even if the chosen policy is to take no automatic
833	   action, the cause can then be investigated manually.

835	   The incentive for domains not to tolerate negatively marked traffic
836	   depends on financial penalties never being negative.  That is, any
837	   level of negative marking only equates to zero penalty.  In other
838	   words, penalties are always paid in the same direction as the data,
839	   and never against the data flow.  This is consistent with the
840	   definition of physical congestion; when a resource is underutilised,
841	   it is not negatively congested, its congestion is just zero.  So,
842	   although short periods of negative marking can be tolerated to
843	   correct temporary overdeclarations due to lags in the feedback
844	   system, persistent downstream negative congestion can have no
845	   physical meaning and therefore must signify a problem.

847	   The upward arrow at the egress of domain C at its border with the
848	   egress gateway in Figure 4 represents this incentive not to allow
849	   negative traffic.  But the same upward pressure applies at every
850	   domain border (arrows not shown).

852	   With the above penalty system, each domain seems to have a perverse
853	   incentive to fake pre-congestion.  For instance, domain B's profit
854	   depends on the difference between pre-congestion at its ingress (its
855	   revenue) and at its egress (its cost).  So if B overstates internal
856	   pre-congestion it seems to increase its profit.  However, we can
857	   assume that domain A could bypass B, routing through other domains to
858	   reach the egress.  So the competitive discipline of least-cost
859	   routing can ensure that any domain tempted to fake pre-congestion for
860	   profit risks losing all its usage revenue.  The least congested route
861	   would eventually be able to win this competitive game, only as long
862	   as it didn't declare more fake pre-congestion than the next most
863	   competitive route.

865	   Again, this memo does not need to standardise any particular mechanism
866	   for routing based on re-ECN.  Section 5.5 explains why no new
867	   standards would be needed for congestion routing as long as re-ECN
868	   marking had been standardised.  That section also points to papers
869	   concerning optimising routing in the presence of usage charging.

871	5.2.  Pre-requisite Contractual Arrangements

873	   The re-ECN protocol has been chosen to solve the policing problem
874	   because it embeds a downstream pre-congestion metric in passing CL
875	   traffic that is difficult to lie about and can be measured in bulk.
876	   The ability to emulate border policing depends on network operators
877	   choosing to use this metric as one of the elements in their contracts
878	   with each other.

880	   Already many inter-domain agreements involve a capacity and a usage
881	   element.  The usage element may be based on volume or various
882	   measures of peak demand.  We expect that those network operators that
883	   choose to use pre-congestion notification for admission control would
884	   also be willing to consider using this downstream pre-congestion
885	   metric as a usage element in their interconnection contracts for
886	   admission controlled traffic.

888	   Appendix A.2 gives a suggested algorithm for metering downstream
889	   congestion at a border router.  It could hardly be simpler.  It
890	   involves accumulating the volume of packets with the RE bit blanked
891	   and the volume of those with congestion marking and subtracting the
892	   two.  In order to discard a persistent negative balance (see above),
893	   time is slotted into periods of, say, 10 seconds (or a time sufficient
894	   for a few rounds of feedback, depending on the level of aggregation).
895	   At the end of each timeslot, a positive balance between the two
896	   counters is added to a long-term counter and both counters are reset.
897	   If instead the balance is negative, it is discarded and a
898	   management alarm SHOULD also be raised.  Over an accounting period
899	   (say a month) the single metric in the long term counter represents
900	   all the downstream congestion caused by traffic passing the border
901	   meter.
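
   The metering behaviour just described might look like the following
   Python sketch.  It is not the algorithm of Appendix A.2; the class
   and attribute names are hypothetical, NF packets are counted as
   positive (as in Table 4), NRECT packets are ignored, and the alarm is
   represented by a simple counter.

```python
class BorderMeter:
    """Meter downstream pre-congestion at a border (sketch only)."""

    def __init__(self) -> None:
        self.blanked_octets = 0    # positive contributions this timeslot
        self.marked_octets = 0     # negative contributions this timeslot
        self.long_term_octets = 0  # accumulated over the accounting period
        self.alarms = 0            # stand-in for a management alarm

    def on_packet(self, octets: int, ecn: int, re_bit: int) -> None:
        """Account for one packet crossing the border."""
        if ecn == 0b00:
            if re_bit == 1:  # NF counts as positive
                self.blanked_octets += octets
            return           # NRECT is not counted
        if re_bit == 0:      # RE bit blanked
            self.blanked_octets += octets
        if ecn & 0b10:       # Admission or Pre-emption Marked
            self.marked_octets += octets

    def end_timeslot(self) -> int:
        """Close the timeslot and return its balance in octets."""
        balance = self.blanked_octets - self.marked_octets
        self.blanked_octets = 0
        self.marked_octets = 0
        if balance < 0:
            self.alarms += 1  # negative balance: discard and alarm
        else:
            self.long_term_octets += balance
        return balance
```

   A caller would invoke end_timeslot() from a timer (every 10 seconds,
   say) and read long_term_octets at the end of the accounting period.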

903	   Congestion has the dimension of [byte], being the product of volume
904	   transferred [byte] and percentage pre-congestion [dimensionless].
905	   The above algorithm effectively gives a measure of the volume
906	   transferred, but modulated by pre-congestion expected downstream.  So
907	   volume transferred during off-peak periods counts as nearly nothing,
908	   while volume transferred at peak times counts very highly.  The re-
909	   ECN protocol allows one network to measure how much pre-congestion
910	   has been 'dumped' into it by another network.  And then in turn how
911	   much of that pre-congestion it dumped into the next downstream
912	   network.

914	   Once this downstream pre-congestion metric is available, operators
915	   are free to choose how they incorporate it into their interconnection
916	   contracts [IXQoS].  Some may include a threshold volume of pre-
917	   congestion as a quality measure in their service level agreement,
918	   perhaps with a penalty clause if the upstream network exceeds this
919	   threshold over, say, a month.  Others may agree a set of tiered
920	   monthly thresholds, with increasing penalties as each threshold is
921	   exceeded.  But, it would be just as easy and more precise to do away
922	   with discrete thresholds, and instead make the penalty rise smoothly
923	   with the volume of pre-congestion by applying a price to pre-
924	   congestion itself.  Then the usage element of the interconnection
925	   contract would directly relate to the volume of pre-congestion caused
926	   by the upstream network.

928	   Typically, where capacity charges are concerned, lower tier customer
929	   networks pay higher tier provider networks.  So money flows from the
930	   edges to the middle of the internetwork where there is greater
931	   connectivity.  But penalties or charges for usage normally follow the
932	   same direction as the data flow---the direction of control at the
933	   network layer.  So, where a tier 2 provider sends data into a tier 3
934	   customer network, we would expect the penalty clauses for sending too
935	   much pre-congestion to be against the tier 3 network, even though it
936	   is the provider.

938	   The relative direction of penalties and charges is a constant source
939	   of confusion.  It may help to remember that data will be flowing in
940	   the other direction too.  So the provider network has as much
941	   opportunity to levy usage penalties as its customer, and it can set
942	   the price or strength of its own penalties higher if it chooses.
943	   Usage charges in both directions tend to cancel each other out, which
944	   confirms that usage-charging is less to do with revenue raising and
945	   more to do with encouraging load control discipline in order to
946	   smooth peaks and troughs, improving utilisation and quality.

948	   To focus the discussion, from now on, unless otherwise stated, we
949	   will assume a downstream network charges its upstream neighbour in
950	   proportion to the pre-congestion it sends, B_v, using the notation of
951	   Appendix A.2.  If they previously agreed the (fixed) price per byte
952	   of pre-congestion would be L, then the bill at the end of the month
953	   will simply be the product L.B_v, plus any fixed charges they may
954	   also have agreed.

956	   We are well aware that the IETF tries to avoid standardising
957	   technology that depends on a particular business model.  But our aim
958	   is merely to show that border policing can at least work with this
959	   one model; we can then assume that operators might experiment with
960	   the metric in other models.  Effectively, tiered thresholds are just
961	   more coarse-grained approximations of the fine-grained case we choose
962	   to examine.  Of course, operators are free to complement this pre-
963	   congestion-based usage element of their charges with traditional
964	   capacity charging, and we expect they will.

966	5.3.  Emulation of Per-Flow Rate Policing: Rationale and Limits

968	   The important feature of charging in proportion to congestion volume
969	   is that the penalty aggregates and deaggregates correctly along with
970	   packet flows.  This is because the penalty rises linearly with bit
971	   rate and linearly with congestion, because it is the product of them
972	   both.  So if the packets crossing a border consist of a thousand
973	   flows, and one of those flows doubles its rate, the ingress gateway
974	   forwarding that flow will have to put twice as much congestion
975	   marking into the packets of that flow.  And this extra congestion
976	   marking will add proportionately to the charges levied at every
977	   border the flow crosses in proportion to the amount of pre-congestion
978	   remaining on the path.

980	   As importantly, pre-congestion itself rises super-linearly with
981	   utilisation of a particular resource.  So if someone tries to push
982	   another flow into a path that is already signalling enough pre-
983	   congestion to warrant admission control, the penalty will be a lot
984	   greater than it would have been to add the same flow to a less
985	   congested path.  So, the system as a whole is fairly insensitive to
986	   the actual level of pre-congestion that each ingress chooses for
987	   triggering admission control.  The deterrent against exceeding
988	   whatever threshold is chosen rises very quickly with a small amount
989	   of cheating.

991	   These are the properties that allow re-ECN to emulate per-flow border
992	   policing of both rate and admission control.  When a whole inter-
993	   network is operating at normal (typically very low) congestion, the
994	   pre-congestion marking from virtual queues will be a little higher---
995	   still low, but more noticeable.  But this does not imply that usage
996	   /charges/ must also be low.  That depends on the /price/ L.

998	   For instance, combining capacity and volume charges is quite a common
999	   feature of interconnection agreements in today's Internet,
1000	   particularly since p2p file-sharing became popular.  Imagine that the
1001	   monthly payment between two networks is made up of a volume charge
1002	   and a capacity charge, and they usually turn out to be in a ratio of
1003	   about 1:2 (not atypical).  If charging for volume were replaced with
1004	   charging for congested volume, one would expect the price of
1005	   congestion to be arranged so that the total charge for usage remained
1006	   about the same (still about one third of the total settlement),
1007	   because that is evidently the charge that the market has found is
1008	   necessary to push back against usage.  So, if an average pre-
1009	   congestion fraction turned out to be 0.1%, one would expect that the
1010	   price L per byte of pre-congestion would be about 1000 times the
1011	   previously used per byte price for volume (before congestion metrics
1012	   were available).
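
	   The arithmetic behind this example can be sketched as follows.
	   The function names and the unit volume price are assumptions for
	   illustration; the only inputs taken from the text are the 1:2
	   ratio of volume to capacity charges and the 0.1% pre-congestion
	   fraction.

```python
def congestion_price(volume_price, precongestion_fraction):
    """Price L per byte of pre-congestion that keeps the usage charge unchanged.

    Derived from: volume * volume_price == (volume * fraction) * L
    """
    return volume_price / precongestion_fraction


def usage_share(volume_charge, capacity_charge):
    """Fraction of the total settlement that is usage-based."""
    return volume_charge / (volume_charge + capacity_charge)


VOLUME_PRICE = 1.0     # hypothetical per-byte volume price, arbitrary units
ratio = congestion_price(VOLUME_PRICE, 0.001) / VOLUME_PRICE
share = usage_share(1, 2)          # volume:capacity charges in ratio 1:2
```

	   Here ratio is the factor by which L exceeds the former per-byte
	   volume price, and share is the usage-based part of the settlement.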

1014	   From the above example it can be seen why operators will become
1015	   acutely sensitive to the congestion they cause in other networks,
1016	   which is of course the desired effect to encourage networks to
1017	   /control/ the congestion they allow their users to cause to others.

1019	   Effectively, usage charges will continuously flow from ingress
1020	   gateways to the places where there is mild pre-congestion, in
1021	   proportion to the data rates from those gateways and to the path pre-
1022	   congestion.

1024	   If anyone sends even one flow at a higher rate, they will immediately
1025	   have to pay proportionately more usage charges.  Because there is no
1026	   knowledge of reservations within the Diffserv region, no interior
1027	   router can police whether the rate of each flow is greater than each
1028	   reservation.  So the system doesn't truly emulate rate-policing of
1029	   each flow.  But there is no incentive to pack a higher rate into a
1030	   reservation, because the charges are directly proportional to rate,
1031	   irrespective of the reservation.

1033	   However, if virtual queues start to fill on any path, even though
1034	   real queues will still be able to provide low latency service, pre-
1035	   congestion marking will rise fairly quickly.  It may eventually reach
1036	   the threshold where the ingress gateway would deny admission to new
1037	   flows.  If the ingress gateway cheats and continues to admit new
1038	   flows, the affected virtual queues will rapidly fill, even though the
1039	   real queues will still be little worse than they were when admission
1040	   control should have been invoked.  The ingress gateway will have to
1041	   pay the penalty for such an extremely high pre-congestion level, so
1042	   the pressure to invoke admission control should become unbearable.

1044	   The above mechanisms protect against rational operators.  In
1045	   Section 5.6 we discuss how networks can protect themselves from
1046	   accidental or deliberate misconfiguration in neighbouring networks.

1048	5.4.  Policing Dishonest Marking

1050	   As CL traffic leaves the last network before the egress gateway
1051	   (domain C) the RE blanking fraction should match the congestion
1052	   marking fraction, when averaged over a sufficiently long duration
1053	   (perhaps ~10s to allow a few rounds of feedback through regular
1054	   signalling of new and refreshed reservations).

1056	   If domain C doesn't trust the networks around it to behave honestly,
1057	   it should install a monitor at its egress.  This monitor aims to
1058	   detect flows of CL packets that are persistently negative.  If flows
1059	   are positive, domain C need take no action---this simply means an
1060	   upstream network must be paying more penalties than it needs to.
1061	   Appendix A.3 gives a suggested algorithm for the monitor.

1063	   Note that the monitor operates on flows but we would like it not to
1064	   require per-flow state.  This is why we have been careful to ensure
1065	   that all flows MUST start with a packet marked with the NF codepoint.
1066	   If a flow does not start with the NF codepoint, a monitor is likely
1067	   to treat it unfavourably.  This incentivises setting of the NF
1068	   codepoint.

1070	   This also means that a monitor will be resistant to state exhaustion
1071	   attacks from other networks, as the monitor never creates state
1072	   unless an NF packet arrives.  And an NF packet counts positive, so it
1073	   will cost a lot for a network to send many of them.

1075	   Monitor algorithms will often maintain an average fraction of RE
1076	   blanked packets across flows.  When maintaining an average across
1077	   flows, a monitor MUST ignore packets with the NF codepoint set.  An
1078	   ingress gateway sets the NF codepoint when it does not have the
1079	   benefit of feedback from the egress.  So counting packets with NF
1080	   set would be likely to make the average unnecessarily positive,
1081	   providing headroom (or should we say footroom?) for dishonest
1082	   (negative) traffic.
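
	   A minimal sketch of why NF packets are left out of the average is
	   given below.  The packet representation (a pair of booleans) and
	   the traffic mix are assumptions made purely for illustration; they
	   are not the monitor algorithm of Appendix A.3.

```python
def positive_fraction(packets, ignore_nf=True):
    """Average fraction of positive packets across flows.

    packets is a list of (re_blanked, nf_set) boolean pairs.  An NF
    packet counts as positive when it is not ignored.
    """
    if ignore_nf:
        packets = [p for p in packets if not p[1]]
    if not packets:
        return 0.0
    positive = sum(1 for re_blanked, nf_set in packets if re_blanked or nf_set)
    return positive / len(packets)


established = [(True, False)] * 10 + [(False, False)] * 990   # 1% RE blanked
flow_starts = [(False, True)] * 50                            # NF packets
traffic = established + flow_starts

correct = positive_fraction(traffic)                    # NF ignored
inflated = positive_fraction(traffic, ignore_nf=False)  # NF counted
```

	   With this hypothetical mix, counting the NF packets raises the
	   average well above the 1% that established flows actually declare.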

1084	   If the monitor detects a persistently negative flow, it could drop
1085	   sufficient negative and neutral packets to force the flow to not be
1086	   negative.  This is the approach taken for the 'egress dropper' in
1087	   [Re-TCP], but for the scenario in this memo, where everyone would
1088	   expect everyone else to keep to the protocol, it is probably more
1089	   advisable to raise a management alarm.  So no ingress can
1090	   understate downstream pre-congestion without getting logged.  Then
1091	   the network operator can deal with the offending network at the human
1092	   level, out of band.

1094	5.5.  Competitive Routing

1096	   Goldenberg et al [Smart_rtg] refer to various commercial products and
1097	   present their own algorithms for moving traffic between multihomed
1098	   routes based on usage charges.  None of these systems require any
1099	   changes to standard protocols because the choice between the
1100	   available border gateway protocol (BGP) routes is based on a
1101	   combination of local knowledge of the charging regime and local
1102	   measurement of traffic levels.  If, as we propose, charges or
1103	   penalties were based on the level of re-ECN measured in passing
1104	   traffic, a similar optimisation could be achieved without requiring
1105	   any changes to standard routing protocols.

1107	   We must be clear that applying pre-congestion-based routing to this
1108	   admission control system remains an open research issue.  Traffic
1109	   engineering based on congestion requires careful damping to avoid
1110	   oscillations, and should not be attempted without adult supervision
1111	   :) Mortier & Pratt [ECN-BGP] have analysed traffic engineering based
1112	   on congestion.  Without the benefit of re-ECN, they had to add a
1113	   path attribute to BGP to advertise a route's downstream congestion
1114	   (actually they proposed that BGP should advertise the charge for
1115	   congestion, which we believe wrongly embeds an assumption into BGP
1116	   that congestion will be charged for).

1118	5.6.  Fail-safes

1120	   The mechanisms described so far create incentives for rational
1121	   operators to behave.  That is, one operator aims to make another
1122	   behave responsibly by applying penalties and expecting a rational
1123	   response that trades off costs against benefits.  It is usually
1124	   reasonable to assume that other network operators behave rationally
1125	   (policy routing can avoid those that might not).  But this approach
1126	   does not protect against the misconfigurations and accidents of other
1127	   operators.

1129	   Therefore, we propose the following two similar mechanisms at a
1130	   network's borders to provide "defence in depth":

1132	   Highly positive flows:  A small regular sample of RE blanked
1133	      packets should be picked randomly as they cross a border
1134	      interface.  Then subsequent packets matching the same source and
1135	      destination address and DSCP should be monitored.  If the RE
1136	      blanking rate is well above a threshold (to be determined by
1137	      operational practice), a management alarm SHOULD be raised, and
1138	      the flow MAY be automatically subject to focused drop.

1140	   Persistently negative flows:  A small regular sample of congestion
1141	      marked packets should be picked randomly as they cross a
1142	      border interface.  Then subsequent packets matching the same
1143	      source and destination address and DSCP should be monitored.  If
1144	      the RE blanking rate minus the congestion marking rate is
1145	      persistently negative, a management alarm SHOULD be raised, and
1146	      the flow MAY be automatically subject to focused drop.

1148	   Both these mechanisms rely on the fact that highly positive (or
1149	   negative) flows will appear more quickly in the sample if it is
1150	   selected randomly solely from positive (or negative) packets.
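
	   The effect of sampling solely from RE blanked packets can be
	   sketched numerically.  The flow keys, addresses, DSCP placeholder
	   and packet counts below are all hypothetical; the sketch only
	   compares a flow's share of all packets with its share of the RE
	   blanked packets from which the sample is drawn.

```python
def sample_shares(flows):
    """Compare each flow's share of all packets with its share of the
    RE blanked packets from which the random sample is drawn.

    flows maps a flow key to (total_packets, re_blanked_packets).
    """
    total = sum(t for t, _ in flows.values())
    blanked = sum(b for _, b in flows.values())
    shares = {}
    for key, (t, b) in flows.items():
        share_all = t / total if total else 0.0
        share_blanked = b / blanked if blanked else 0.0
        shares[key] = (share_all, share_blanked)
    return shares


DSCP = 0  # placeholder value; the DSCP used for CL traffic is not assumed here

# 99 ordinary flows, each with 1 in 1000 packets RE blanked ...
flows = {("192.0.2.%d" % i, "198.51.100.1", DSCP): (1000, 1)
         for i in range(1, 100)}
# ... and one highly positive flow with 1 in 10 packets RE blanked.
suspect = ("192.0.2.200", "198.51.100.1", DSCP)
flows[suspect] = (1000, 100)

share_all, share_blanked = sample_shares(flows)[suspect]
```

	   In this example the suspect flow carries 1% of all packets but
	   roughly half of the RE blanked ones, so a random pick from RE
	   blanked packets is far more likely to select it.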

1152	   Note that there is no assumption that users behave rationally.  The
1153	   system is protected from the vagaries of irrational user behaviour
1154	   by the ingress gateways, which transform internal penalties into a
1155	   deterministic admission control mechanism that prevents users from
1156	   misbehaving, by directly engineered means.

1158	6.  Analysis

1160	   The domains in Figure 1 are not expected to be completely malicious
1161	   towards each other.  After all, we can assume that they are all co-
1162	   operating to provide an internetworking service to the benefit of
1163	   each of them and their customers.  Otherwise their routing policies
1164	   would not interconnect them in the first place.  However, we assume
1165	   that they are also competitors of each other.  So a network may try
1166	   to contravene our proposed protocol if it would gain or make a
1167	   competitor lose, or both, but only if it can do so without being
1168	   caught.  Therefore we do not have to consider every possible random
1169	   attack one network could launch on the traffic of another, given
1170	   that one network can always drop or corrupt packets that it
1171	   forwards on behalf of another.

1173	   Therefore, we only consider new opportunities for /gainful/ attack
1174	   that our proposal introduces.  But to a certain extent we can also
1175	   rely on the in-depth defences we have described (Section 5.6),
1176	   intended to mitigate the potential impact of one network accidentally
1177	   misconfiguring the workings of this protocol.

1179	   In the generic scenario we introduced in Figure 1 the ingress and
1180	   egress gateways are shown in the most generic arrangement, without
1181	   any surrounding network.  This allows us to consider more specific
1182	   cases where these gateways and a neighbouring network are operated by
1183	   the same player.  As well as cases where the same player operates
1184	   neighbouring networks, we will also consider cases where the two
1185	   gateways collude as one player and where the sender and receiver
1186	   collude as one.  Collusion of other sets of domains is less likely,
1187	   but we will consider such cases.  In the general case, we will assume
1188	   none of the nine trust domains across the figure fully trusts any of
1189	   the others.

1191	   Taking the generic scenario in Figure 1, as we only propose to change
1192	   routers within the Diffserv region, we assume the operators of
1193	   networks outside the region will be doing per-flow policing.  That
1194	   is, we assume the networks outside the Diffserv region and the
1195	   gateways around its edges can protect themselves.  So our primary
1196	   concern is to be able to protect networks that don't do per-flow
1197	   policing from those that do.  The ingress and egress gateways are the
1198	   only way the outer 'enemy' can get at the middle victim, so we can
1199	   consider the gateways as the representatives of the 'enemy' as far as
1200	   domains A, B and C are concerned.  We will call this trust scenario
1201	   'edges against middles'.

1203	   Earlier in this memo, we outlined the classic border rate policing
1204	   problem (Section 3).  It will now be useful to spell out the
1205	   motivations that would create the lack of trust as the root cause of
1206	   the problem.  The more reservations a gateway can allow, the more
1207	   revenue it receives.  The middle networks want the edges to comply
1208	   with the admission control protocol when they become so congested
1209	   that their service to others might suffer.  The middle networks also
1210	   want to ensure the edges cannot steal more service from them than
1211	   they pay for.

1213	   In the context of this 'edges against middles' scenario, the re-ECN
1214	   protocol has two main effects:

1216	   o  The more pre-congestion there is on a path across the Diffserv
1217	      region, the higher the ingress gateway has to declare downstream
1218	      pre-congestion v_0.

1220	   o  because downstream pre-congestion should on average be zero at the
1221	      egress

1223	   An executive summary of our security analysis can be stated in two
1224	   parts, distinguished by the type of collusion considered.  In the
1225	   first case collusion is limited to neighbours in the feedback loop.
1226	   In other words, two neighbouring networks can be assumed to act as
1227	   one.  Or the egress gateway might collude with domain C.  Or the
1228	   ingress gateway might collude with domain A.  Or ingress and egress
1229	   gateways might collude with each other.

1231	   In these cases where only neighbours in the feedback loop collude,
1232	   all parties have a positive incentive to declare downstream pre-
1233	   congestion truthfully, and the ingress gateway has a positive
1234	   incentive to invoke admission control when congestion rises above the
1235	   admission threshold in any network in the region (including its own).
1236	   No party has an incentive to send more traffic than declared in
1237	   reservation signalling (even though only the gateways read this
1238	   signalling).  In short, no party can gain at the expense of another.

1240	   In the case of other forms of collusion (e.g. between domain A and C)
1241	   it would be possible for, say, A & C to create a tunnel between
1242	   themselves so that A would gain at the expense of B.  But C would then
1243	   lose the gain that A had made.  Therefore the value to A & C of
1244	   colluding to mount this attack seems questionable.  It is made more
1245	   questionable, because the attack can be statistically detected by B
1246	   using the second defence in depth mechanism mentioned already.  Note
1247	   that C can effectively prevent A attacking it through a tunnel, by
1248	   treating the tunnel end point as a direct link to a neighbouring
1249	   network, which falls back to the regular scenario without collusion.

1251	   {ToDo: Due to lack of time, the full write up of the security
1252	   analysis is deferred to the next version of this memo.}
1253	   Finally, it is well known that the best person to analyse the
1254	   security of a system is not the designer.  Therefore, our confident
1255	   claims must be hedged with doubt until others with an incentive to
1256	   break it have mounted a full analysis.

1258	7.  Extensions

1260	   If a different signalling system, such as NSIS, were used, but
1261	   providing admission control in a similar way using pre-congestion
1262	   notification (e.g. with RMD [NSIS-RMD]) a similar approach to re-ECN
1263	   could be used.

1265	8.  Design Choices and Rationale

1267	   The case for using re-feedback (a generalisation of re-ECN) to police
1268	   congestion response and provide QoS is made in [Re-fb].  Essentially,
1269	   the insight is that congestion crosses layers from the physical
1270	   upwards.  Therefore re-feedback polices congestion response based on
1271	   physical interfaces not addresses.  That is, the congestion leaving a
1272	   physical interface can be policed at the interface, rather than the
1273	   congestion on packets that claim to come from an address, which may
1274	   be spoofed.  Also, re-feedback does not actually require feedback.  A
1275	   source must act conservatively before it gets feedback.

1277	   On the subject of lack of feedback, the no feedback (NF) codepoint is
1278	   motivated by arguments for a state set-up bit in IP to prevent state
1279	   exhaustion attacks.  This idea was first put forward by David Clark
1280	   and documented in [Handley_Steps_DoS].  The idea is that network
1281	   layer datagrams should signal explicitly when they require state to
1282	   be created in the layer above (e.g. at flow start).  Then the higher
1283	   layer can refuse to create any state unless a datagram declares this
1284	   intent.  We believe the NF codepoint can be used to serve the same
1285	   purpose as the proposed more generic state-set-up bit.

1287	   The re-feedback paper [Re-fb] also makes the case for using an
1288	   economic interpretation of congestion, which is the basis of the
1289	   incentives-based approach used in this memo.  That paper also makes
1290	   the case against the use of classic feedback if the economic
1291	   interpretation of congestion is to be realised.  The problem with
1292	   using classic feedback for policing congestion is that it opens up
1293	   receiving networks to `denial of funds' attacks.

1295	   {ToDo: Further Design Rationale will be included in future versions
1296	   of this memo}

1298	9.  IANA Considerations

1300	   {ToDo:} This memo includes no request to IANA (yet).

1302	10.  Security Considerations

1304	   This whole memo concerns the security of a scalable admission control
1305	   system, in particular the analysis section.  Below, some specific
1306	   security issues are mentioned that did not fit elsewhere in the memo
1307	   or which comment on the robustness of the security provided by the
1308	   design.

1310	   Firstly, we must repeat the statement of applicability in the
1311	   analysis: that we only consider new opportunities for /gainful/
1312	   attack that our proposal introduces.  Despite only involving a few
1313	   bits, there is sufficient complexity in the whole system that there
1314	   are numerous possibilities for attacks not catered for.  But as far
1315	   as we are aware, none reap any benefit to the attacker.  It will
1316	   always be possible for one network to cause damage to another
1317	   neighbouring network's traffic by dropping or corrupting it as it
1318	   forwards it.  Therefore we do not believe networks would set their
1319	   routing policies to interconnect in the first place if they didn't
1320	   trust the other networks not to damage their traffic without any
1321	   /direct/ gain to themselves.

1323	   Having said this, we do want to highlight some of the weaker parts of
1324	   our argument.  We have argued that networks will be dissuaded from
1325	   faking congestion marking by the possibility that upstream networks
1326	   will route round them.  As we have said, these arguments are
1327	   intuitive and will remain fairly tenuous until proved in practice,
1328	   particularly close to the egress where less competitive routing is
1329	   likely.

1331	   We should also point out that the approach in this memo was only
1332	   designed to be robust for admission control.  We do not claim the
1333	   incentives will always be strong enough to force correct flow pre-
1334	   emption behaviour.  This is because pre-emption of flows tends to be
1335	   associated with much higher damage to an operator's reputation for
1336	   robust quality than denying admission.  However, in general the
1337	   incentives for correct flow pre-emption are similar to those for
1338	   admission control.

1340	   Finally, it may seem that the 8 codepoints that have been made
1341	   available by extending the ECN field with the RE bit have been used
1342	   rather wastefully.  In effect the RE bit has been used as an
1343	   orthogonal single bit in nearly all cases.  The only exception is
1344	   when the ECN field is cleared to "00".  The mapping of the codepoints
1345	   in an earlier version of this proposal used the codepoint space more
1346	   efficiently, but the scheme became vulnerable to a network operator
1347	   focusing its congestion marking to mark more positive than neutral
1348	   packets in order to reduce its penalties.

1350	   {ToDo: More security considerations will undoubtedly be added in
1351	   future versions of this memo.}

1353	11.  Conclusions

1355	   Using pre-congestion is a promising technique to control flow
1356	   admissions that will scale to any size network.  However, it requires
1357	   a mechanism to ensure that networks can interconnect even if they do
1358	   not trust each other to keep to the admission control protocols.
1359	   We claim that the re-ECN protocol provides such a mechanism, so that
1360	   one network can detect and prevent another network in the system from
1361	   cheating for its own gain.

1363	12.  Acknowledgements

1365	   All the following have given helpful comments and some may become co-
1366	   authors of later drafts: Arnaud Jacquet, Alessandro Salvatori, Steve
1367	   Rudkin, David Songhurst, John Davey, Ian Self, Anthony Sheppard (BT),
1368	   Stephen Hailes (UCL), Francois Le Faucheur, Anna Charny (Cisco),
1369	   Jozef Babiarz, Kwok-Ho Chan, Corey Alexander (Nortel), David Clark,
1370	   Bill Lehr, Sharon Gillett (MIT) and comments from participants in the
1371	   CFP/CRN inter-provider QoS and broadband working groups.

1373	13.  Comments Solicited

1375	   Comments and questions are encouraged and very welcome.  They can be
1376	   addressed to the IETF Transport Area working group's mailing list
1377	   <tsvwg@ietf.org>, and/or to the authors.

1379	14.  References

1381	14.1.  Normative References

1383	   [PCN]      Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
1384	              Charny, A., Liatsos, V., Babiarz, J., Chan, K., and S.
1385	              Dudley, "Pre-Congestion Notification",
1386	              draft-briscoe-tsvwg-cl-phb-01 (work in progress),
1387	              March 2006.

1389	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1390	              Requirement Levels", BCP 14, RFC 2119, March 1997.

1392	   [RFC2211]  Wroclawski, J., "Specification of the Controlled-Load
1393	              Network Element Service", RFC 2211, September 1997.

1395	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
1396	              of Explicit Congestion Notification (ECN) to IP",
1397	              RFC 3168, September 2001.

1399	   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
1400	              J., Courtney, W., Davari, S., Firoiu, V., and D.
1401	              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
1402	              Behavior)", RFC 3246, March 2002.

1404	   [RSVP-ECN]
1405	              Le Faucheur, F., Charny, A., Briscoe, B., Eardley, P.,
1406	              Babiarz, J., and K. Chan, "RSVP Extensions for Admission
1407	              Control over Diffserv using Pre-congestion Notification",
1408	              draft-lefaucheur-rsvp-ecn-00 (work in progress),
1409	              October 2005.

1411	   [Re-TCP]   Briscoe, B., Jacquet, A., and A. Salvatori, "Re-ECN:
1412	              Adding Accountability for Causing Congestion to TCP/IP",
1413	              draft-briscoe-tsvwg-re-ecn-tcp-01 (work in progress),
1414	              March 2006.

1416	14.2.  Informative References

1418	   [CL-arch]  Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
1419	              Charny, A., Babiarz, J., and K. Chan, "A Framework for
1420	              Admission Control over DiffServ using Pre-Congestion
1421	              Notification", draft-briscoe-tsvwg-cl-architecture-02
1422	              (work in progress), March 2006.

1424	   [ECN-BGP]  Mortier, R. and I. Pratt, "Incentive Based Inter-Domain
1425	              Routeing", Proc Internet Charging and QoS Technology
1426	              Workshop (ICQT'03) pp308--317, September 2003, <http://
1427	              research.microsoft.com/users/mort/publications.aspx>.

1429	   [IXQoS]    Briscoe, B. and S. Rudkin, "Commercial Models for IP
1430	              Quality of Service Interconnect", BT Technology Journal
1431	              (BTTJ) 23(2)171--195, April 2005,
1432	              <http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#ixqos>.

1434	   [NSIS-RMD]
1435	              Bader, A., Westberg, L., Karagiannis, G., Kappler, C., and
1436	              T. Phelan, "RMD-QOSM - The Resource Management in Diffserv
1437	              QOS Model", draft-ietf-nsis-rmd-06 (work in progress),
1438	              February 2006.

1440	   [RFC2205]  Braden, B., Zhang, L., Berson, S., Herzog, S., and S.
1441	              Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1
1442	              Functional Specification", RFC 2205, September 1997.

1444	   [RFC2207]  Berger, L. and T. O'Malley, "RSVP Extensions for IPSEC
1445	              Data Flows", RFC 2207, September 1997.

1447	   [RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
1448	              M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
1449	              ReSerVation Protocol (RSVP) Version 1 Applicability
1450	              Statement Some Guidelines on Deployment", RFC 2208,
1451	              September 1997.

1453	   [RFC2747]  Baker, F., Lindell, B., and M. Talwar, "RSVP Cryptographic
1454	              Authentication", RFC 2747, January 2000.

1456	   [RFC2998]  Bernet, Y., Ford, P., Yavatkar, R., Baker, F., Zhang, L.,
1457	              Speer, M., Braden, R., Davie, B., Wroclawski, J., and E.
1458	              Felstaine, "A Framework for Integrated Services Operation
1459	              over Diffserv Networks", RFC 2998, November 2000.

1461	   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
1462	              Congestion Notification (ECN) Signaling with Nonces",
1463	              RFC 3540, June 2003.

1465	   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
1466	              Salvatori, A., Soppera, A., and M. Koyabe, "Policing
1467	              Congestion Response in an Internetwork Using Re-Feedback",
1468	              ACM SIGCOMM CCR 35(4)277--288, August 2005, <http://
1469	              www.acm.org/sigs/sigcomm/sigcomm2005/
1470	              techprog.html#session8>.

1472	   [Smart_rtg]
1473	              Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
1474	              "Optimizing Cost and Performance for Multihoming", ACM
1475	              SIGCOMM CCR 34(4)79--92, October 2004,
1476	              <http://citeseer.ist.psu.edu/698472.html>.

1478	Appendix A.  Implementation

1480	A.1.  Ingress Gateway Algorithm for Blanking the RE bit

1482	   The ingress gateway receives regular feedback reporting the fraction
1483	   of congestion marked octets for each aggregate arriving at the
1484	   egress.  So for each aggregate it should blank the RE bit on the same
1485	   fraction of octets.  It is more efficient to calculate the reciprocal
1486	   of this fraction when the signalling arrives, Z_0 = 1 / Congestion-
1487	   Level-Estimate, which will be the number of bytes of packets the
1488	   ingress should send with the RE bit set between those it sends with
1489	   the RE bit blanked.  Z_0 will also take account of the sustainable
1490	   rate reported during the flow pre-emption process, if necessary.

1492	   A suitable pseudo-code algorithm for the ingress gateway is as
1493	   follows:

1495	   ====================================================================
1496	   B_i = 0                 /* interblank volume                     */
1497	   for each packet {
1498	       b = readLength()    /* set b to packet size                  */
1499	       B_i += b            /* accumulate interblank volume          */
1500	       if B_i < b * Z_0 {  /* test whether interblank volume...     */
1501	           writeRE(1)
1502	       } else {            /* ...exceeds blank RE spacing * pkt size*/
1503	           writeRE(0)      /* ...and if so, clear RE                */
1504	           B_i = 0         /* ...and re-set interblank volume       */
1505	       }
1506	   }
1507	   ====================================================================
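
	   For readers who prefer something executable, the pseudo-code above
	   can be transliterated into Python roughly as follows.  The function
	   name and the list-based interface are assumptions of this sketch,
	   as is the choice never to blank the RE bit when the reported
	   Congestion-Level-Estimate is zero.

```python
def blank_re_bits(packet_sizes, congestion_level_estimate):
    """Return the RE bit (1 = set, 0 = blanked) for each packet size in bytes."""
    if congestion_level_estimate <= 0:
        # Assumption: no congestion reported, so never blank the RE bit.
        return [1] * len(packet_sizes)
    z_0 = 1.0 / congestion_level_estimate   # blank RE spacing
    interblank = 0                          # interblank volume B_i
    re_bits = []
    for b in packet_sizes:
        interblank += b                     # accumulate interblank volume
        if interblank < b * z_0:
            re_bits.append(1)               # keep RE set
        else:
            re_bits.append(0)               # blank RE ...
            interblank = 0                  # ... and reset the volume
    return re_bits


# Eight equal-sized packets with a reported congestion level of 25%.
bits = blank_re_bits([1000] * 8, 0.25)
```

	   With equal-sized packets and a 25% estimate, every fourth packet
	   in this sketch has its RE bit blanked.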

1509	A.2.  Bulk Downstream Congestion Metering Algorithm

1511	   To meter the bulk amount of downstream pre-congestion in passing
1512	   traffic an algorithm is needed that accumulates the size of packets
1513	   with RE blanked (or NF set) and subtracts the size of congestion
1514	   marked packets, but ignores a persistently negative balance over a
1515	   duration of T ~ 10secs, say.  Three counters need to be maintained:

1517	      B_v: accumulated pre-congestion volume

1519	      B_s: pre-congestion volume in timeslot

1521	      B_t: total data volume

1523	   A suitable pseudo-code algorithm for a border router is as follows:

1525	   ====================================================================
1526	   B_v = 0
1527	   B_s = 0
1528	   B_t = 0
1529	   t = timeNow() + T           /* divide into timeslots of few secs */
1530	   for each packet {
1531	       b = readLength()            /* set b to packet size          */
1532	       B_t += b                    /* accumulate total volume       */
1533	       if readRE() == 0 || readEECN() == NF {
1534	           B_s += b                /* increment...                  */
1535	       } elseif readECN() == 1X {
1536	           B_s -= b                /* ...or decrement B_s...        */
1537	       }                           /*...depending on EECN field     */
1538	       if timeNow() > t {      /* every timeslot...                 */
1539	           if B_s > 0 {        /* count a negative balance as zero  */
1540	               B_v += B_s      /* otherwise accumulate the balance  */
1541	           }
1542	           B_s = 0                 /* re-set the temp counter...    */
1543	           t += T                  /* ...for the next timeslot      */
1544	       }
1545	   }
1546	   ====================================================================

1548	   At the end of an accounting period this counter B_v represents the
1549	   pre-congestion volume that penalties could be applied to, as
1550	   described in Section 5.2.

1552	   For instance, accumulated volume of pre-congestion through a border
1553	   interface over a month might be B_v = 5PB (petabyte = 10^15 byte).
1554	   This might have resulted from an average downstream pre-congestion
1555	   level of 1% on an accumulated total data volume of B_t = 500PB.
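
	   A rough Python transliteration of the metering pseudo-code is given
	   below as a sketch only.  The packet tuple layout and the function
	   name are assumptions.  Following the comments in the pseudo-code,
	   it is the per-timeslot balance B_s that is tested for sign, so a
	   negative timeslot contributes nothing to B_v.  As in the
	   pseudo-code, a timeslot is only closed when a later packet
	   arrives.

```python
def meter_precongestion(packets, timeslot=10.0):
    """Accumulate downstream pre-congestion volume over a packet stream.

    packets yields (time, size, re_bit, nf_set, congestion_marked) tuples,
    with time in seconds and size in bytes.  Returns (B_v, B_t).
    """
    b_v = 0          # accumulated pre-congestion volume
    b_s = 0          # pre-congestion balance in the current timeslot
    b_t = 0          # total data volume
    slot_end = None
    for time, size, re_bit, nf_set, congestion_marked in packets:
        if slot_end is None:
            slot_end = time + timeslot
        b_t += size
        if re_bit == 0 or nf_set:
            b_s += size                  # positive packet
        elif congestion_marked:
            b_s -= size                  # negative packet
        if time > slot_end:              # every timeslot ...
            if b_s > 0:                  # count a negative balance as zero
                b_v += b_s
            b_s = 0
            slot_end += timeslot
    return b_v, b_t


# Hypothetical trace: a positive timeslot followed by a negative one.
trace = [
    (0, 1000, 0, False, False), (1, 1000, 0, False, False),
    (2, 1000, 0, False, False), (3, 1000, 1, False, True),
    (11, 1000, 1, False, False),             # closes the first timeslot
    (12, 1000, 1, False, True), (13, 1000, 1, False, True),
    (21, 1000, 1, False, False),             # closes the second timeslot
]
b_v, b_t = meter_precongestion(trace)
```

	   In this trace the first timeslot contributes its positive balance
	   to B_v, while the negative balance of the second is discarded.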

1557	A.3.  Algorithm for Sanctioning Negative Traffic

1559	   {ToDo: Write up dropper with flow management algorithm and variant
1560	   with bounded flow state.}

1562	Author's Address

1564	   Bob Briscoe
1565	   BT & UCL
1566	   B54/77, Adastral Park
1567	   Martlesham Heath
1568	   Ipswich  IP5 3RE
1569	   UK

1571	   Phone: +44 1473 645196
1572	   Email: bob.briscoe@bt.com
1573	   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

1575	Intellectual Property Statement

1577	   The IETF takes no position regarding the validity or scope of any
1578	   Intellectual Property Rights or other rights that might be claimed to
1579	   pertain to the implementation or use of the technology described in
1580	   this document or the extent to which any license under such rights
1581	   might or might not be available; nor does it represent that it has
1582	   made any independent effort to identify any such rights.  Information
1583	   on the procedures with respect to rights in RFC documents can be
1584	   found in BCP 78 and BCP 79.

1586	   Copies of IPR disclosures made to the IETF Secretariat and any
1587	   assurances of licenses to be made available, or the result of an
1588	   attempt made to obtain a general license or permission for the use of
1589	   such proprietary rights by implementers or users of this
1590	   specification can be obtained from the IETF on-line IPR repository at
1591	   http://www.ietf.org/ipr.

1593	   The IETF invites any interested party to bring to its attention any
1594	   copyrights, patents or patent applications, or other proprietary
1595	   rights that may cover technology that may be required to implement
1596	   this standard.  Please address the information to the IETF at
1597	   ietf-ipr@ietf.org.

1599	Disclaimer of Validity

1601	   This document and the information contained herein are provided on an
1602	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
1603	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
1604	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
1605	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
1606	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
1607	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

1609	Copyright Statement

1611	   Copyright (C) The Internet Society (2006).  This document is subject
1612	   to the rights, licenses and restrictions contained in BCP 78, and
1613	   except as set forth therein, the authors retain all their rights.

1615	Acknowledgment

1617	   Funding for the RFC Editor function is currently provided by the
1618	   Internet Society.