idnits 2.17.1 

draft-briscoe-tsvwg-re-ecn-border-cheat-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 14.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 2279.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 2256.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2263.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 2269.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([RSVP-ECN], [Re-TCP], [PCN]),
     which it shouldn't.  Please replace those with straight textual mentions
     of the documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The exact meaning of the all-uppercase expression 'MAY NOT' is not
     defined in RFC 2119.  If it is intended as a requirements expression, it
     should be rewritten using one of the combinations defined in RFC 2119;
     otherwise it should not be all-uppercase.

  == The expression 'MAY NOT', while looking like RFC 2119 requirements text,
     is not defined in RFC 2119, and should not be used.  Consider using 'MUST
     NOT' instead (if that is what you mean).
     
     Found 'MAY NOT' in this paragraph:
     
     However, if the ingress gateway can guarantee that the network(s)
     that will carry the flow to its egress gateway all use a common
     identifier for the aggregate (e.g. a single MPLS network without ECMP
     routing), it MAY NOT set FNE when it adds a new flow to an active
     aggregate.  And an FNE packet need only be sent if a whole aggregate has
     been idle for more than 1 second.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 26, 2006) is 6513 days in the past.  Is this
     intentional?

  -- Found something which looks like a code comment -- if you have code
     sections in the document, please surround them with '<CODE BEGINS>' and
     '<CODE ENDS>' lines.


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-03) exists of
     draft-briscoe-tsvwg-cl-phb-02

  -- Possible downref: Normative reference to a draft: ref. 'PCN' 

  -- Possible downref: Normative reference to a draft: ref. 'RSVP-ECN' 

  == Outdated reference: A later version (-09) exists of
     draft-briscoe-tsvwg-re-ecn-tcp-02

  == Outdated reference: A later version (-04) exists of
     draft-briscoe-tsvwg-cl-architecture-03

  == Outdated reference: A later version (-01) exists of
     draft-davie-ecn-mpls-00

  == Outdated reference: A later version (-20) exists of
     draft-ietf-nsis-rmd-06


     Summary: 4 errors (**), 0 flaws (~~), 8 warnings (==), 11 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Transport Area Working Group                                  B. Briscoe
3	Internet-Draft                                                  BT & UCL
4	Expires: December 28, 2006                                 June 26, 2006

6	        Emulating Border Flow Policing using Re-ECN on Bulk Data
7	               draft-briscoe-tsvwg-re-ecn-border-cheat-01

9	Status of this Memo

11	   By submitting this Internet-Draft, each author represents that any
12	   applicable patent or other IPR claims of which he or she is aware
13	   have been or will be disclosed, and any of which he or she becomes
14	   aware will be disclosed, in accordance with Section 6 of BCP 79.

16	   Internet-Drafts are working documents of the Internet Engineering
17	   Task Force (IETF), its areas, and its working groups.  Note that
18	   other groups may also distribute working documents as Internet-
19	   Drafts.

21	   Internet-Drafts are draft documents valid for a maximum of six months
22	   and may be updated, replaced, or obsoleted by other documents at any
23	   time.  It is inappropriate to use Internet-Drafts as reference
24	   material or to cite them other than as "work in progress."

26	   The list of current Internet-Drafts can be accessed at
27	   http://www.ietf.org/ietf/1id-abstracts.txt.

29	   The list of Internet-Draft Shadow Directories can be accessed at
30	   http://www.ietf.org/shadow.html.

32	   This Internet-Draft will expire on December 28, 2006.

34	Copyright Notice

36	   Copyright (C) The Internet Society (2006).

38	Abstract

40	   Scaling per flow admission control to the Internet is a hard problem.
41	   A recently proposed approach combines Diffserv and pre-congestion
42	   notification (PCN) to provide a service slightly better than Intserv
43	   controlled load.  It scales to networks of any size, but only if
44	   domains trust each other to comply with admission control and rate
45	   policing.  This memo claims to solve this trust problem without
46	   losing scalability.  It describes bulk border policing that provides
47	   a sufficient emulation of per-flow policing with the help of another
48	   recently proposed extension to ECN, involving re-echoing ECN feedback
49	   (re-ECN).  With only passive bulk measurements at borders, sanctions
50	   can be applied against cheating networks.

52	Status (to be removed by the RFC Editor)

54	   This memo is posted as an Internet-Draft with the intent to
55	   eventually progress to informational status.  It is envisaged that
56	   the necessary standards actions to realise the system described would
57	   sit in three other documents currently being discussed (but not on
58	   the standards track) in the IETF Transport Area [Re-TCP], [RSVP-ECN]
59	   & [PCN].  The authors seek comments from the Internet community on
60	   whether combining PCN and re-ECN is a sufficient solution to the
61	   admission control problem.

63	Changes from previous drafts (to be removed by the RFC Editor)

65	   From -00 to -01:

67	      Added subsection on Border Accounting Mechanisms (Section 5.6.1)

69	      Section 4.2 on the re-ECN wire protocol clarified and re-organised
70	      to separately discuss re-ECN for default ECN marking and for pre-
71	      congestion marking (PCN).

73	      Router Forwarding Behaviour subsection added to re-organised
74	      section on Protocol Operation (Section 4.3).  Extensions section
75	      moved within Protocol Operations.

77	      Emulating Border Policing (Section 5) reorganised, starting with a
78	      new Terminology subsection heading, and a simplified overview
79	      section.  Added a large new subsection on Border Accounting
80	      Mechanisms within a new section bringing together other
81	      subsections on Border Mechanisms generally (Section 5.6).  Some
82	      text moved from old subsections into these new ones.

84	      Added section on Incremental Deployment (Section 7), drawing
85	      together relevant points about deployment made throughout.

87	      Sections on Design Rationale (Section 8) and Security
88	      Considerations (Section 9) expanded with some new material,
89	      including new attacks and their defences.

91	      Suggested Border Metering Algorithms improved (Appendix A.2) for
92	      resilience to newly identified attacks.

94	Table of Contents

96	   1.  Introduction
97	   2.  Requirements Notation
98	   3.  The Problem
99	     3.1.  The Traditional Per-flow Policing Problem
100	     3.2.  Generic Scenario
101	   4.  Re-ECN Protocol for an RSVP (or similar) Transport
102	     4.1.  Protocol Overview
103	     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or
104	           v6)
105	       4.2.1.  Re-ECN Recap
106	       4.2.2.  Re-ECN Combined with Pre-Congestion Notification
107	               (re-PCN)
108	     4.3.  Protocol Operation
109	       4.3.1.  Protocol Operation for an Established Flow
110	       4.3.2.  Aggregate Bootstrap
111	       4.3.3.  Flow Bootstrap
112	       4.3.4.  Router Forwarding Behaviour
113	       4.3.5.  Extensions
114	   5.  Emulating Border Policing with Re-ECN
115	     5.1.  Informal Terminology
116	     5.2.  Policing Overview
117	     5.3.  Pre-requisite Contractual Arrangements
118	     5.4.  Emulation of Per-Flow Rate Policing: Rationale and
119	           Limits
120	     5.5.  Sanctioning Dishonest Marking
121	     5.6.  Border Mechanisms
122	       5.6.1.  Border Accounting Mechanisms
123	       5.6.2.  Competitive Routing
124	       5.6.3.  Fail-safes
125	   6.  Analysis
126	   7.  Incremental Deployment
127	   8.  Design Choices and Rationale
128	   9.  Security Considerations
129	   10. IANA Considerations
130	   11. Conclusions
131	   12. Acknowledgements
132	   13. Comments Solicited
133	   14. References
134	     14.1. Normative References
135	     14.2. Informative References
136	   Appendix A.  Implementation
137	     A.1.  Ingress Gateway Algorithm for Blanking the RE flag
138	     A.2.  Downstream Congestion Metering Algorithms
139	       A.2.1.  Bulk Downstream Congestion Metering Algorithm
140	       A.2.2.  Inflation Factor for Persistently Negative Flows
141	     A.3.  Algorithm for Sanctioning Negative Traffic

143	   Author's Address
144	   Intellectual Property and Copyright Statements

146	1.  Introduction

148	   The Internet community largely lost interest in the Intserv
149	   architecture after it was clarified that it would be unlikely to
150	   scale to the whole Internet [RFC2208].  Although Intserv mechanisms
151	   proved impractical, the bandwidth reservation service Intserv aimed
152	   to offer is still very much required.

154	   A recently proposed approach [CL-deploy] combines Diffserv and pre-
155	   congestion notification (PCN) to provide a service slightly better
156	   than Intserv controlled load [RFC2211].  It scales to any size
157	   network, but only if domains trust their neighbours to have checked
158	   that upstream customers aren't taking more bandwidth than they
159	   reserved, either accidentally or deliberately.  This memo describes
160	   border policing measures so that one network can protect its
161	   interests, even if networks around it are deliberately trying to
162	   cheat.  The approach provides a sufficient emulation of flow rate
163	   policing at trust boundaries but without per-flow processing.  The
164	   emulation is not perfect, but it is sufficient to ensure that the
165	   punishment is at least proportionate to the severity of the cheat.

167	   The aim is to be able to scale controlled load service to any number
168	   of endpoints, even though such scaling must take account of the
169	   increasing numbers of networks and users who may all have conflicting
170	   interests.  To achieve such scaling, this memo combines two recent
171	   proposals, both of which it briefly recaps:

173	   o  A deployment model for admission control over Diffserv using pre-
174	      congestion notification [CL-deploy] describes how bulk pre-
175	      congestion notification on routers within an edge-to-edge Diffserv
176	      region can emulate the precision of per-flow admission control to
177	      provide controlled load service without unscalable per-flow
178	      processing;

180	   o  Re-ECN: Adding Accountability to TCP/IP [Re-TCP].  The trick that
181	      addresses cheating at borders is to recognise that border policing
182	      is mainly necessary because cheating upstream networks will admit
183	      traffic when they shouldn't only as long as they don't directly
184	      experience the downstream congestion their misbehaviour can cause.
185	      The re-ECN protocol requires upstream nodes to declare expected
186	      downstream congestion in all forwarded packets and it makes it in
187	      their interests to declare it honestly.  Operators can then
188	      monitor downstream congestion in bulk at borders to emulate
189	      policing.

191	   Rather than the end-to-end arrangement used when re-ECN was specified
192	   for the TCP transport [Re-TCP], this memo specifies re-ECN in an
193	   edge-to-edge arrangement, making it applicable to the above
194	   deployment model for admission control over Diffserv.  Also, rather
195	   than using a TCP transport for regular congestion feedback, this memo
196	   specifies re-ECN using RSVP as the transport for feedback [RSVP-ECN].
197	   A similar deployment model, but with a different transport for
198	   signalling congestion feedback could be used (e.g.  RMD [NSIS-RMD]
199	   uses NSIS).

201	   This memo aims to do two things: i) define how to apply the re-ECN
202	   protocol to the admission control over Diffserv scenario; and ii)
203	   explain why re-ECN sufficiently emulates border policing in that
204	   scenario.  Most of the memo is taken up with the second aim:
205	   explaining why it works.  Applying re-ECN to the scenario actually
206	   involves quite a trivial modification to the ingress gateway.  Our
207	   immediate goal is to convince everyone to build that modification in
208	   to ingress gateways from the start, whether first deployments require
209	   policing or not.  Otherwise, when we want to add policing, we will
210	   have built ourselves a legacy problem.  In other words, we aim to
211	   convince people to "Build in security from the start."

213	   The body of this memo is structured as follows:

215	      Section 3 describes the border policing problem.  We recap the
216	      traditional, unscalable view of how to solve the problem, and we
217	      recap the admission control solution which has the scalability we
218	      do not want to lose when we add border policing;

220	      Section 4 specifies the re-ECN protocol solution in detail;

222	      Section 5 explains how to use the protocol to emulate border
223	      policing, and why it works;

225	      Section 6 analyses the security of the proposed solution;

227	      Section 8 explains the sometimes subtle rationale behind our
228	      design decisions;

230	      Section 9 comments on the overall robustness of the security
231	      assumptions and lists specific security issues.

233	   It must be emphasised that we are not evangelical about removing per-
234	   flow processing from borders.  Network operators may choose to do
235	   per-flow processing at their borders for their own reasons, such as
236	   to support business models that require per-flow accounting.  Our aim
237	   is to show that per-flow processing at borders is no longer
238	   /necessary/ in order to provide end-to-end QoS using flow admission
239	   control.  Indeed, we are absolutely opposed to standardisation of
240	   technology that embeds particular business models into the Internet.
241	   Our aim is merely to provide a new useful metric (downstream
242	   congestion) at trust boundaries.  Given the well-known significance
243	   of congestion in economics, operators can then use this new metric in
244	   their interconnection contracts if they choose.  This will enable
245	   competitive evolution of new business models (for examples
246	   see [IXQoS]), alongside more traditional models that depend on more
247	   costly per-flow processing at borders.

249	2.  Requirements Notation

251	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
252	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
253	   document are to be interpreted as described in [RFC2119].

255	3.  The Problem

257	3.1.  The Traditional Per-flow Policing Problem

259	   If we claim to be able to emulate per-flow policing with bulk
260	   policing at trust boundaries, we need to know exactly what we are
261	   emulating.  So, even though we expect it to become a historic
262	   practice, we will start from the traditional scenario with per-flow
263	   policing at trust boundaries to explain why it has always been
264	   considered necessary.

266	   To be able to take advantage of a reservation-based service such as
267	   controlled load, a source must reserve resources using a signalling
268	   protocol such as RSVP [RFC2205].  An RSVP signalling request refers
269	   to a flow of packets by its flow ID tuple (filter spec [RFC2205]) (or
270	   its security parameter index (SPI) [RFC2207] if port numbers are
271	   hidden by IPsec encryption).  Other signalling protocols use similar
272	   flow identifiers.  But it is insufficient to merely authorise and
273	   admit a flow based on its identifiers, for instance merely opening a
274	   pin-hole for packets with identifiers that match an admitted flow ID.
275	   Once a flow is admitted, it cannot necessarily be trusted to send
276	   packets within the rate profile it requested.

278	   The packet rate must also be policed to keep the flow within the
279	   requested flow spec [RFC2205].  For instance, without data rate
280	   policing, a source could reserve resources for an 8 kbps audio flow
281	   but transmit a 6 Mbps video (theft of service).  More subtly, the
282	   sender could generate bursts that were outside the profile it had
283	   requested.
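
	   As an illustration only, a per-flow policer of the kind described
	   above can be sketched as a token bucket.  The class, function and
	   parameter names are hypothetical, and we assume a flow spec reduced
	   to just a rate and a bucket depth; a real flow spec carries more
	   parameters than this.

```python
class TokenBucketPolicer:
    """Hypothetical per-flow policer: rate in bytes/s, depth in bytes."""

    def __init__(self, rate_bytes_per_s, bucket_depth_bytes):
        self.rate = rate_bytes_per_s
        self.depth = bucket_depth_bytes
        self.tokens = float(bucket_depth_bytes)  # start with a full bucket
        self.last = 0.0                          # time of previous packet

    def conforms(self, now, packet_bytes):
        # Refill tokens for the time elapsed, capped at the bucket depth.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        # A packet is in profile only if enough tokens are available.
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


def fraction_conforming(policer, send_rate_bytes_per_s, packet_bytes,
                        duration_s):
    """Send equally spaced packets and return the in-profile fraction."""
    interval = packet_bytes / send_rate_bytes_per_s
    count = int(duration_s / interval)
    in_profile = sum(policer.conforms(i * interval, packet_bytes)
                     for i in range(count))
    return in_profile / count


# Reservation of 8 kbps (1000 bytes/s) with an assumed 3000 byte depth.
audio = fraction_conforming(TokenBucketPolicer(1000, 3000), 1000, 100, 10)
# Same reservation, but the source sends 6 Mbps (750000 bytes/s) video.
video = fraction_conforming(TokenBucketPolicer(1000, 3000), 750000, 1500, 10)
```

	   Under these assumptions the audio flow stays wholly in profile,
	   while nearly all of the video traffic is out of profile.  The cost
	   is one such state record per flow, which is what makes the approach
	   expensive at borders.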

285	   In traditional architectures, per-flow packet rate-policing is
286	   expensive and unscalable but, without it, a network is vulnerable to
287	   such theft of service (whether malicious or accidental).  Perhaps
288	   more importantly, if flows are allowed to send more data than they
289	   were permitted, the ability of admission control to give assurances
290	   to other flows will break.

292	   Just as sources cannot be trusted to keep within their requested
293	   flow spec, whole networks might also try to cheat.  We will now set
294	   up a concrete scenario to illustrate such cheats.  Imagine
295	   reservations for unidirectional flows from senders, through at least
296	   two networks, an edge network and its downstream transit provider.
297	   Imagine the edge network charges its retail customers per reservation
298	   but also has to pay its transit provider a charge per reservation.
299	   Typically, both its selling and buying charges might depend on the
300	   duration and rate of each reservation.  The actual levels of the
301	   selling and buying prices are irrelevant to our discussion (most
302	   likely the network will sell at a higher price than it buys, of
303	   course).

305	   A cheating ingress network could systematically reduce the size of
306	   its retail customers' reservation signalling requests before
307	   forwarding them to its transit provider (and systematically reinstate
308	   the responses on the way back).  It would then receive an honest
309	   income from its upstream retail customer but only pay for
310	   fraudulently smaller reservations downstream.  Equivalently, a
311	   cheating ingress network may feed the traffic from a number of flows
312	   into an aggregate reservation over the transit that is smaller than
313	   the total of all the flows.  Because of these fraud possibilities, in
314	   traditional QoS reservation architectures the downstream network
315	   polices at each border.  The policer checks that the actual sent data
316	   rate of each flow is within the signalled reservation.

318	   Reservation signalling could be authenticated end to end, but this
319	   wouldn't prevent the aggregation cheat just described.  For this
320	   reason, and to avoid the need for a global PKI, signalling integrity
321	   is typically only protected on a hop-by-hop basis [RFC2747].

323	   A variant of the above cheat is where a router in an honest
324	   downstream network denies admission to a new reservation, but a
325	   cheating upstream network still admits the flow.  For instance, the
326	   networks may be using Diffserv internally, but Intserv admission
327	   control at their borders [RFC2998].  The cheat would only work if
328	   they were using bulk Diffserv traffic policing at their borders,
329	   perhaps to avoid the cost/complexity of Intserv border policing.  As
330	   far as the cheating upstream network is concerned, it gets the
331	   revenue from the reservation, but it doesn't have to pay any
332	   downstream wholesale charges and the congestion is in someone else's
333	   network.  The cheating network may calculate that most of the flows
334	   affected by congestion in the downstream network aren't likely to be
335	   its own.  It may also calculate that the downstream router has been
336	   configured to deny admission to new flows in order to protect
337	   bandwidth assigned to other network services (e.g. enterprise VPNs).
338	   So the cheating network can steal capacity from the downstream
339	   operator's VPNs that are probably not actually congested.

341	   To summarise, in traditional reservation signalling architectures, if
342	   a network cannot trust a neighbouring upstream network to rate-police
343	   each reservation, it has to check for itself that the data rate fits
344	   within each of the reservations it has admitted.

346	3.2.  Generic Scenario

348	   We will now describe a generic internetworking scenario that we will
349	   use to describe and to test our bulk policing proposal.  It consists
350	   of a number of networks and endpoints that do not fully trust each
351	   other to behave.  In Section 6 we will tie down exactly what we mean
352	   by partial trust, and we will consider the various combinations where
353	   some networks do not trust each other and others are colluding
354	   together.

356	    _    ___      _____________________________________       ___    _
357	   | |  |   |   _|__    ______    ______    ______    _|__   |   |  | |
358	   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
359	   | |  |   |  |    |  |Inter-|  |Inter-|  |Inter-|  |    |  |   |  | |
360	   | |  |   |  |    |  | ior  |  | ior  |  | ior  |  |    |  |   |  | |
361	   | |  |   |  |    |  |Domain|  |Domain|  |Domain|  |    |  |   |  | |
362	   | |  |   |  |    |  |  A   |  |  B   |  |  C   |  |    |  |   |  | |
363	   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
364	   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
365	   | |  |   |  |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |  |   |\ | |
366	   | |==|   |==|Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |==|   |=>| |
367	   | |  |   |  |G/W |  | |  | |  | |  | |  | |  | |  |G/W |  |   |/ | |
368	   | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
369	   | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
370	   | |  |   |  |____|  |______|  |______|  |______|  |____|  |   |  | |
371	   |_|  |___|    |_____________________________________|     |___|  |_|

373	   Sx   Ingress               Diffserv region               Egress   Rx
374	   End  Access                                              Access  End
375	   Host Network                                            Network Host
376	                <-------- edge-to-edge signalling ------->
377	                          (for admission control)

379	   <-------------------end-to-end QoS signalling protocol------------->

381	   Figure 1: Generic Scenario (see text for explanation of terms)

383	   An ingress and egress gateway (Ingr G/W and Egr G/W in Figure 1)
384	   connect the interior Diffserv region to the edge access networks
385	   where routers (not shown) use per-flow reservation processing.
386	   Within the Diffserv region are three interior domains, A, B and C, as
387	   well as the inward facing interfaces of the ingress and egress
388	   gateways.  An ingress and egress border router (BR) is shown
389	   interconnecting each interior domain with the next.  There may be
390	   other interior routers (not shown) within each interior domain.

392	   In two paragraphs we now briefly recap how pre-congestion
393	   notification is intended to be used to control flow admission to a
394	   large Diffserv region.  The first paragraph describes data plane
395	   functions and the second describes signalling in the control plane.
396	   We omit many details from [CL-deploy] including behaviour during
397	   routing changes.  For brevity here we assume other flows are already
398	   in progress across a path through the Diffserv region before a new
399	   one arrives, but how bootstrap works is described in Section 4.3.2.

401	   Figure 1 shows a single simplex reserved flow from the sending (Sx)
402	   end host to the receiving (Rx) end host.  The ingress gateway polices
403	   incoming traffic within its admitted reservation and remarks it to
404	   turn on an ECN-capable codepoint [RFC3168] and the controlled load
405	   (CL) Diffserv codepoint.  Together, these codepoints define which
406	   traffic is entitled to the enhanced scheduling of the CL behaviour
407	   aggregate on routers within the Diffserv region.  The CL PHB of
408	   interior routers consists of a scheduling behaviour and a new ECN
409	   marking behaviour that we call 'pre-congestion notification' [PCN].
410	   The CL PHB simply re-uses the definition of expedited forwarding
411	   (EF) [RFC3246] for its scheduling behaviour.  But it incorporates a
412	   new ECN marking behaviour, which sets the ECN field of an increasing
413	   number of CL packets to the admission marked (AM) codepoint as they
414	   approach a threshold rate that is lower than the line rate.  The use
415	   of virtual queues ensures real queues have hardly built up any
416	   congestion delay.  The level of marking detected at the egress of the
417	   Diffserv region is then used by the signalling system in order to
418	   determine admission control as follows.
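
	   The marking behaviour just described can be pictured with the
	   following minimal sketch.  It is not the algorithm of [PCN]: the
	   linear ramp between two thresholds, and all names and numbers, are
	   assumptions made purely for illustration.  The virtual queue is a
	   counter drained at a configured rate below the line rate, so it
	   fills (and marking starts) before the real queue does.

```python
class VirtualQueueMarker:
    """Hypothetical virtual queue drained at less than the line rate."""

    def __init__(self, drain_rate_bytes_per_s, min_thresh_bytes,
                 max_thresh_bytes):
        self.drain = drain_rate_bytes_per_s
        self.min_t = min_thresh_bytes
        self.max_t = max_thresh_bytes
        self.backlog = 0.0   # virtual backlog in bytes
        self.last = 0.0      # arrival time of previous packet

    def marking_probability(self, now, packet_bytes):
        # Drain the virtual queue for the elapsed time, never below zero.
        drained = (now - self.last) * self.drain
        self.backlog = max(0.0, self.backlog - drained)
        self.last = now
        # Account for the arriving packet.
        self.backlog += packet_bytes
        # Assumed ramp: no marking below min, full marking above max.
        if self.backlog <= self.min_t:
            return 0.0
        if self.backlog >= self.max_t:
            return 1.0
        return (self.backlog - self.min_t) / (self.max_t - self.min_t)


def run(arrival_rate_bytes_per_s, packets=100, packet_bytes=100):
    """Return the marking probability seen by each of a run of packets."""
    marker = VirtualQueueMarker(1000, 500, 1500)
    interval = packet_bytes / arrival_rate_bytes_per_s
    return [marker.marking_probability(i * interval, packet_bytes)
            for i in range(packets)]


below = run(500)    # load at half the configured rate
above = run(2000)   # load at twice the configured rate
```

	   With load below the configured rate no packet is marked; once load
	   exceeds it, the marking probability climbs until every packet is
	   marked.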

420	   The end-to-end QoS signalling (e.g.  RSVP) for a new reservation
421	   takes one giant hop from ingress to egress gateway, because interior
422	   routers within the Diffserv region are configured to ignore RSVP.
423	   The egress gateway holds flow state because it takes part in the end-
424	   to-end reservation.  So it can classify all packets by flow and it
425	   can identify all flows that have the same previous RSVP hop (a CL-
426	   region-aggregate).  For each CL-region-aggregate of flows in
427	   progress, the egress gateway maintains a per-packet moving average of
428	   the fraction of pre-congestion-marked traffic.  Once an RSVP PATH
429	   message for a new reservation has hopped across the Diffserv region
430	   and reached the destination, an RSVP RESV message is returned.  As
431	   the RESV message passes, the egress gateway piggy-backs the relevant
432	   pre-congestion level onto it [RSVP-ECN].  Again, interior routers
433	   ignore the RSVP message, but the ingress gateway strips off the pre-
434	   congestion level.  If the pre-congestion level is above a threshold,
435	   the ingress gateway denies admission to the new reservation,
436	   otherwise it returns the original RESV signal back towards the data
437	   sender.
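
	   The admission decision described above can be sketched as follows.
	   The exponentially weighted form of the moving average, the weight,
	   the threshold value and all names are assumptions for illustration;
	   they are not taken from [RSVP-ECN] or [CL-deploy].

```python
class EgressAggregateMeter:
    """Hypothetical per-aggregate meter held at the egress gateway."""

    def __init__(self, weight=0.01):
        self.weight = weight          # assumed smoothing weight
        self.marked_fraction = 0.0    # moving average of marked packets

    def on_packet(self, is_marked):
        # Per-packet exponentially weighted moving average update.
        sample = 1.0 if is_marked else 0.0
        self.marked_fraction += self.weight * (sample - self.marked_fraction)
        return self.marked_fraction


def ingress_admits(pre_congestion_level, threshold):
    # Admission is denied only if the level is above the threshold.
    return pre_congestion_level <= threshold


quiet = EgressAggregateMeter()
for _ in range(1000):
    quiet.on_packet(False)            # no packet arrives marked

busy = EgressAggregateMeter()
for _ in range(1000):
    busy.on_packet(True)              # every packet arrives marked

admit_quiet = ingress_admits(quiet.marked_fraction, threshold=0.05)
admit_busy = ingress_admits(busy.marked_fraction, threshold=0.05)
```

	   Note that the egress gateway only measures and reports; the
	   decision itself is taken at the ingress gateway, using the level
	   piggy-backed on the RESV message.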

439	   Once a reservation is admitted, its traffic will always receive low
440	   delay service for the duration of the reservation.  This is because
441	   ingress gateways ensure that traffic not under a reservation cannot
442	   pass into the Diffserv region with the CL DSCP set.  So non-reserved
443	   traffic will always be treated with a lower priority PHB at each
444	   interior router.  And even if some disaster re-routes traffic after
445	   it has been admitted, if the traffic through any resource tips over a
446	   fail-safe threshold, pre-congestion notification will trigger flow-
447	   pre-emption to very quickly bring every router within the whole
448	   Diffserv region back below its operating point.

450	   The whole admission control system just described deliberately
451	   confines per-flow processing to the access edges of the network,
452	   where it will not limit the system's scalability.  But ideally we
453	   want to extend this approach to multiple networks, to take even more
454	   advantage of its scaling potential.  We would still need per-flow
455	   processing at the access edges of each network, but not at the high
456	   speed interfaces where they interconnect.  Even though such an
457	   admission control system would work technically, it would gain us no
458	   scaling advantage if each network also wanted to police the rate of
459	   each admitted flow for itself; border routers would still have to do
460	   complex packet operations per-flow anyway, given they don't trust
461	   upstream networks to do their policing for them.

463	   This memo describes how to emulate per-flow rate policing using bulk
464	   mechanisms at border routers, so the full scalability potential of
465	   pre-congestion notification is not limited by the need for per-flow
466	   policing mechanisms at borders, which would make borders the most
467	   cost-critical pinch-points.  Then we can achieve the long sought-for
468	   vision of secure Internet-wide bandwidth reservations without needing
469	   per-flow processing at all in core and border routers---where
470	   scalability is most critical.

472	4.  Re-ECN Protocol for an RSVP (or similar) Transport

474	4.1.  Protocol Overview

476	   First we need to recap the way routers accumulate congestion marking
477	   along a path.  Each ECN-capable router marks some packets with CE,
478	   the marking probability increasing with the length of the queue at
479	   its egress link.  The only difference with pre-congestion
480	   marking [PCN] is that marking is based on the length of a virtual
481	   queue, so that the real queue occupancy can remain very low.  We will
482	   use the terms congestion and pre-congestion interchangeably in the
483	   following unless it is important to distinguish between them.

485	   With multiple ECN-capable routers on a path, the ECN field
486	   accumulates the fraction of CE marking that each router adds.  The
487	   combined effect of the packet marking of all the routers along the
488	   path signals congestion of the whole path to the receiver.  So, for
489	   example, if one router early in a path is marking 1% of packets and
490	   another later in a path is marking 2%, flows that pass through both
491	   routers will experience approximately 3% marking.
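
   The "approximately 3%" in this example can be checked with a short
   calculation.  The sketch below assumes that each router marks
   packets independently of the others; that independence, and the
   function name, are our illustrative assumptions rather than part of
   the protocol.

```python
from math import prod

def combined_marking(fractions):
    """Fraction of packets marked at least once along the path.

    Assumes (for illustration) that routers mark independently, so a
    packet stays unmarked only if every router leaves it unmarked.
    """
    return 1.0 - prod(1.0 - f for f in fractions)

# The example from the text: 1% at one router, 2% at a later one.
path_marking = combined_marking([0.01, 0.02])
print(f"{path_marking:.4%}")  # slightly below the simple sum of 3%
```

   The result is a little under the simple sum because a packet
   already marked by the first router cannot be marked a second time.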

493	   The packets crossing an inter-domain trust boundary within the
494	   Diffserv region will all have come from different ingress gateways
495	   and will all be destined for different egress gateways.  We will show
496	   that the key to policing against theft of service is for a border
497	   router to be able to directly measure the congestion that is about to
498	   be caused by the traffic it forwards.  That is, it can measure
499	   locally the congestion on each of the downstream paths between itself
500	   and the egress gateways that its traffic is destined for.

502	   With the original ECN protocol, if CE markings crossing the border
503	   had been counted over a period, they would have represented the
504	   accumulated upstream congestion that had already been experienced by
505	   those packets.  The general idea of re-ECN is for the ingress gateway
506	   to continuously encode path congestion into the IP header where, in
507	   this case, `path' means from ingress to egress gateway.  Then at any
508	   point on that path (e.g. between domains A & B in Figure 2 below), IP
509	   headers can be monitored to subtract upstream congestion from
510	   expected path congestion in order to give the expected downstream
511	   congestion still to be experienced until the egress gateway.

513	   Importantly, it turns out that there is no need to monitor downstream
514	   congestion on a per-flow basis.  We will show that accounting for it
515	   in bulk across all flows will be sufficient.

517	                  _____________________________________
518	                _|__    ______    ______    ______    _|__
519	               |    |  |  A   |  |  B   |  |  C   |  |    |
520	               +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+
521	               |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |
522	               |Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |
523	               |G/W |  | |  | |: | |  | |  | |  | |  |G/W |
524	               +----+  +-+  +-+: +-+  +-+  +-+  +-+  +----+
525	               |    |  |      |: |      |  |      |  |    |
526	               |____|  |______|: |______|  |______|  |____|
527	                 |_____________:_______________________|
528	                               :
529	                 |             :                       |
530	                 |<-upstream-->:<-expected downstream->|
531	                 | congestion  :      congestion       |
532	                 |     u               v ~= p - u      |
533	                 |                                     |
534	                 |<--- expected path congestion, p --->|

536	   Figure 2: Re-ECN concept

538	4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

540	   In this section we define the names of the various codepoints of the
541	   re-ECN protocol when used with pre-congestion notification, deferring
542	   description of their semantics to the following sections.  But first
543	   we recap the re-ECN wire protocol proposed in [Re-TCP].

545	4.2.1.  Re-ECN Recap

547	   Re-ECN uses the two-bit ECN field broadly as in RFC3168 [RFC3168].
548	   It also uses a new re-ECN extension (RE) flag.  The actual position
549	   of the RE flag is different between IPv4 & v6 headers so we will use
550	   an abstraction of the IPv4 and v6 wire protocols by just calling it
551	   the RE flag.  [Re-TCP] proposes using bit 48 (currently unused) in
552	   the IPv4 header for the RE flag, while for IPv6 it proposes an ECN
553	   extension header.

555	   Unlike the ECN field, the RE flag is intended to be set by the sender
556	   and remain unchanged along the path, although it can be read by
557	   network elements that understand the re-ECN protocol.  In the
558	   scenario used in this memo, the ingress gateway acts as a proxy for
559	   the sender, setting the RE flag as permitted in the specification of
560	   re-ECN.

562	   Note that general-purpose routers do not have to read the RE flag,
563	   only special policing elements at borders do.  And no general-purpose
564	   routers have to change the RE flag, although the ingress and egress
565	   gateways do because in the edge-to-edge deployment model we are
566	   using, they act as proxies for the endpoints.  Therefore the RE flag
567	   does not even have to be visible to interior routers.  So the RE flag
568	   has no implications on protocols like MPLS.  Congested label
569	   switching routers (LSRs) would have to be able to notify their
570	   congestion with an ECN/PCN codepoint in the MPLS shim [ECN-MPLS], but
571	   like any interior IP router, they can be oblivious to the RE flag,
572	   which need only be read by border policing functions.

574	   Although the RE flag is a separate, single bit field, it can be read
575	   as an extension to the two-bit ECN field; the three concatenated bits
576	   in what we will call the extended ECN field (EECN) make eight
577	   codepoints available.  When the RE flag setting is "don't care", we
578	   use the RFC3168 names of the ECN codepoints, but [Re-TCP] proposes
579	   the following six codepoint names for when there is a need to be more
580	   specific.

582	   +-------+------------+------+---------------+-----------------------+
583	   |  ECN  | RFC3168    |  RE  | Extended ECN  |     Re-ECN meaning    |
584	   | field | codepoint  | flag | codepoint     |                       |
585	   +-------+------------+------+---------------+-----------------------+
586	   |   00  | Not-ECT    |   0  | Not-RECT      |   Not re-ECN-capable  |
587	   |       |            |      |               |       transport       |
588	   |   00  | Not-ECT    |   1  | FNE           |      Feedback not     |
589	   |       |            |      |               |      established      |
590	   |   01  | ECT(1)     |   0  | Re-Echo       |  Re-echoed congestion |
591	   |       |            |      |               |        and RECT       |
592	   |   01  | ECT(1)     |   1  | RECT          |     Re-ECN capable    |
593	   |       |            |      |               |       transport       |
594	   |   10  | ECT(0)     |   0  | ---           |     Legacy ECN use    |
595	   |       |            |      |               |        only           |
596	   |   10  | ECT(0)     |   1  | --CU--        |    Currently unused   |
597	   |       |            |      |               |                       |
598	   |   11  | CE         |   0  | CE(0)         |       Congestion      |
599	   |       |            |      |               |    experienced with   |
600	   |       |            |      |               |        Re-Echo        |
601	   |   11  | CE         |   1  | CE(-1)        |       Congestion      |
602	   |       |            |      |               |      experienced      |
603	   +-------+------------+------+---------------+-----------------------+

605	    Table 1: Re-cap of Default Extended ECN Codepoints Proposed for Re-
606	                                    ECN

608	4.2.2.  Re-ECN Combined with Pre-Congestion Notification (re-PCN)

610	   As permitted by the ECN specification [RFC3168], a proposal is
611	   currently being advanced in the IETF to define different semantics
612	   for how routers might mark the ECN field of certain packets.  The
613	   idea is to be able to notify congestion when the router's load
614	   approaches a logical limit, rather than the physical limit of the
615	   line.  This new marking is called pre-congestion notification [PCN]
616	   and we will use the term PCN-enabled router for a router that can
617	   apply pre-congestion notification marking to the ECN fields of
618	   packets.

620	   [RFC3168] recommends that a packet's Diffserv codepoint should
621	   determine which type of ECN marking it receives.  A Diffserv per-hop
622	   behaviour (PHB) can specify that routers should apply pre-congestion
623	   notification marking to PCN-capable packets.  We will call this a
624	   PCN-enhanced PHB.  A PCN-capable packet must meet two conditions: it
625	   must carry a DSCP that maps to a PCN-enhanced PHB and it must carry
626	   an ECN field that turns on PCN marking.

628	   As an example, the controlled load (CL) PHB might specify expedited
629	   forwarding as its scheduling behaviour and PCN marking as its
630	   congestion marking behaviour.  Then we would say the CL PHB is a PCN-
631	   enhanced PHB, and that packets with a DSCP that maps to the CL PHB
632	   and with ECN turned on are PCN-capable packets.

634	   [PCN] actually proposes that two logical limits should be used for
635	   pre-congestion notification, with the higher limit as a back-stop for
636	   dealing with anomalous events.  It envisages PCN will be used for
637	   admission control of inelastic real-time traffic, so marking at the
638	   lower limit will trigger admission control, while at the higher limit
639	   it will trigger flow pre-emption.

641	   Because it needs two types of congestion marking, PCN seems to need
642	   five states: Not-ECT, ECT (ECN-capable transport), the ECN Nonce,
643	   Admission Marking (AM) and Flow Pre-emption Marking (PM).  [PCN]
644	   proposes various alternative encodings of the ECN field, attempting
645	   various compromises to fit these five states into the four available
646	   ECN codepoints.

648	   One of the five states to make room for is the ECN Nonce [RFC3540],
649	   but the capability we describe in this memo supersedes any need for
650	   the Nonce.  The ECN Nonce is an elegant scheme, but it only allows a
651	   sending node (or its proxy) to detect suppression of congestion
652	   marking in the feedback loop.  Thus the Nonce requires the sender or
653	   its proxy to be trusted to respond correctly to congestion.  But this
654	   is precisely the main cheat we want to protect against (as well as
655	   many others).

657	   One of the compromise protocol encodings that [PCN] explores
658	   ("Alternative 5") leaves out support for the ECN Nonce.  Therefore we
659	   use that one.  This encoding of PCN markings is shown on the left of
660	   Table 2.  Note that these codepoints of the ECN field only take on
661	   the semantics of pre-congestion notification if they are combined with
662	   a Diffserv codepoint that the operator has configured to cause PCN
663	   marking, by mapping it to a PCN-enhanced PHB.

665	   For the rest of this memo, we will not distinguish between Admission
666	   Marking and Pre-emption Marking unless we need to be specific.  We
667	   will call both "congestion marking".  With the above encoding,
668	   congestion marking can be read to mean any packet with the left-most
669	   bit of the ECN field set.

671	   The re-ECN protocol can be used to control misbehaving sources
672	   whether congestion is with respect to a logical threshold (PCN) or
673	   the physical line rate (ECN).  In either case the RE flag can be used
674	   to create an extended ECN field.  For PCN-capable packets, the 8
675	   possible encodings of this 3-bit extended ECN (EECN) field are
676	   defined on the right of Table 2 below.  The purposes of these
677	   different codepoints will be introduced in subsequent sections.

679	   +-------+-----------------+------+-------------+--------------------+
680	   |  ECN  | PCN codepoint   |  RE  | Extended    |   Re-ECN meaning   |
681	   | field | (Alternative 5) | flag | ECN         |                    |
682	   |       |                 |      | codepoint   |                    |
683	   +-------+-----------------+------+-------------+--------------------+
684	   |   00  | Not-ECT         |   0  | Not-RECT    | Not re-ECN-capable |
685	   |       |                 |      |             |      transport     |
686	   |   00  | Not-ECT         |   1  | FNE         |    Feedback not    |
687	   |       |                 |      |             |     established    |
688	   |   01  | ECT(1)          |   0  | Re-Echo     |      Re-echoed     |
689	   |       |                 |      |             |   congestion and   |
690	   |       |                 |      |             |        RECT        |
691	   |   01  | ECT(1)          |   1  | RECT        |   Re-ECN capable   |
692	   |       |                 |      |             |      transport     |
693	   |   10  | AM              |   0  | AM(0)       |  Admission Marking |
694	   |       |                 |      |             |    with Re-Echo    |
695	   |   10  | AM              |   1  | AM(-1)      |  Admission Marking |
696	   |       |                 |      |             |                    |
697	   |   11  | PM              |   0  | PM(0)       |     Pre-emption    |
698	   |       |                 |      |             |    Marking with    |
699	   |       |                 |      |             |       Re-Echo      |
700	   |   11  | PM              |   1  | PM(-1)      |     Pre-emption    |
701	   |       |                 |      |             |       Marking      |
702	   +-------+-----------------+------+-------------+--------------------+

704	   Table 2: Extended ECN Codepoints if the Diffserv codepoint uses Pre-
705	                       congestion Notification (PCN)
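
   For readers who prefer code to tables, Table 2 can be transcribed
   as a simple lookup.  This is a minimal sketch: the dictionary and
   function names are hypothetical, and the ECN field and RE flag are
   assumed to have been extracted from the IP header already.

```python
# (ECN field, RE flag) -> extended ECN codepoint, transcribed from Table 2.
EECN_PCN = {
    (0b00, 0): "Not-RECT",
    (0b00, 1): "FNE",
    (0b01, 0): "Re-Echo",
    (0b01, 1): "RECT",
    (0b10, 0): "AM(0)",
    (0b10, 1): "AM(-1)",
    (0b11, 0): "PM(0)",
    (0b11, 1): "PM(-1)",
}

def eecn_codepoint(ecn, re_flag):
    """Name the extended ECN codepoint of a packet with a PCN DSCP."""
    return EECN_PCN[(ecn & 0b11, re_flag & 1)]

def is_congestion_marked(ecn):
    """True for AM or PM: the left-most bit of the ECN field is set."""
    return bool(ecn & 0b10)
```

   For example, eecn_codepoint(0b00, 1) yields "FNE", and
   is_congestion_marked() is true for both AM and PM, whatever the RE
   flag.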

707	4.3.  Protocol Operation

709	4.3.1.  Protocol Operation for an Established Flow

711	   The re-ECN protocol involves a simple tweak to the action of the
712	   gateway at the ingress edge of the CL region.  In the deployment
713	   model just described [CL-deploy], for each active traffic aggregate
714	   across the CL region (CL-region-aggregate) the ingress gateway will
715	   hold a fairly recent Congestion-Level-Estimate that the egress
716	   gateway will have fed back to it, piggybacked on the signalling that
717	   sets up each flow.  For instance, one aggregate might have been
718	   experiencing 3% pre-congestion (that is, congestion marked octets
719	   whether Admission Marked or Pre-emption Marked).  In this case, the
720	   ingress gateway MUST clear the RE flag to "0" for the same percentage
721	   of octets of CL-packets (3%) and set it to "1" in the rest (97%).
722	   Appendix A.1 gives a simple pseudo-code algorithm that the ingress
723	   gateway may use to do this.
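
   As an illustration only, and not the Appendix A.1 algorithm, one
   deterministic way to blank a target fraction of octets is to keep
   a running count of the octets still "owed" a blanked RE flag.  The
   class and method names below are hypothetical.

```python
class ReBlanker:
    """Blanks the RE flag on a target fraction of octets (illustrative)."""

    def __init__(self, blanking_fraction):
        self.fraction = blanking_fraction  # e.g. 0.03 for 3% pre-congestion
        self.debt = 0.0                    # octets still owed a blanked flag

    def update(self, congestion_level_estimate):
        """Adopt a fresh Congestion-Level-Estimate fed back by the egress."""
        self.fraction = congestion_level_estimate

    def re_flag(self, size):
        """Return the RE flag (0 = blanked, 1 = set) for a packet of
        `size` octets."""
        self.debt += self.fraction * size
        if self.debt >= size:
            self.debt -= size
            return 0
        return 1

blanker = ReBlanker(0.03)
flags = [blanker.re_flag(1000) for _ in range(1000)]
```

   With equal-sized packets this blanks roughly every 33rd packet.
   With mixed sizes, the fraction of blanked octets still tends
   towards the target, because the debt is accounted in octets rather
   than packets.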

725	   The RE flag is set and cleared this way round for incremental
726	   deployment reasons (see [Re-TCP]).  To avoid confusion we will use
727	   the term `blanking' (rather than marking) when the RE flag is cleared
728	   to "0", so we will talk of the `RE blanking fraction' as the fraction
729	   of octets with the RE flag cleared to "0".

731	       ^
732	       |
733	       |         RE blanking fraction
734	    3% |    +----------------------------+====+
735	       |    |                            |    |
736	    2% |    |                            |    |
737	       |    | congestion marking fraction|    |
738	    1% |    |     +----------------------+    |
739	       |    |     |                           |
740	    0% +----+=====+---------------------------+------>
741	            ^   <--A---> <---B---> <---C--->  ^        domain
742	            |     ^                      ^    |
743	        ingress   |                      |    egress
744	                1.00%                 2.00%          marking fraction

746	   Figure 3: Example Extended ECN codepoint Marking fractions
747	   (Imprecise)

749	   Figure 3 illustrates our example.  The horizontal axis represents the
750	   index of each congestible resource (typically queues) along a path
751	   through the Internet.  The two superimposed plots show the fraction
752	   of each ECN codepoint observed along this path, assuming there are
753	   two congested routers somewhere within domains A and C.  Table 3
754	   below shows the downstream pre-congestion measured at various border
755	   observation points along the path.  Figure 4 (later) shows the same
756	   results of these subtractions, but in graphical form like the above
757	   figure.  The tabulated figures are actually reasonable approximations
758	   derived from more precise formulae given in Appendix A of [Re-TCP].
759	   The RE flag is not changed by interior routers, so it can be seen
760	   that it acts as a reference against which the congestion marking
761	   fraction can be compared along the path.

763	   +--------------------------+---------------------------------------+
764	   | Border observation point | Approximate Downstream pre-congestion |
765	   +--------------------------+---------------------------------------+
766	   |       ingress -- A       |              3% - 0% = 3%             |
767	   |          A -- B          |              3% - 1% = 2%             |
768	   |          B -- C          |              3% - 1% = 2%             |
769	   |        C -- egress       |              3% - 3% = 0%             |
770	   +--------------------------+---------------------------------------+

772	   Table 3: Downstream Congestion Measured at Example Observation Points
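
   The subtractions in Table 3 can be reproduced as follows.  The
   per-domain figures are read off Figure 3 (1% added in A, none in B,
   2% in C); the simple sums used here are the same approximation as
   the table, not the more precise formulae in [Re-TCP].  Names are
   illustrative.

```python
# Pre-congestion marking added inside each domain in the Figure 3 example.
added = {"A": 0.01, "B": 0.00, "C": 0.02}
path = sum(added.values())  # RE blanking fraction set by the ingress (3%)

def downstream_at_borders(added, path):
    """Return {border: approximate downstream pre-congestion}."""
    borders, upstream, prev = {}, 0.0, "ingress"
    for domain, fraction in added.items():
        # Downstream = expected path congestion minus upstream congestion.
        borders[f"{prev} -- {domain}"] = path - upstream
        upstream += fraction
        prev = domain
    borders[f"{prev} -- egress"] = path - upstream
    return borders

for border, v in downstream_at_borders(added, path).items():
    print(f"{border:>14}: {v:.0%}")
```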

774	   Note that the ingress determines the RE blanking fraction for each
775	   aggregate using the most recent feedback from the relevant egress,
776	   arriving with each new reservation, or each refresh.  These updates
777	   arrive relatively infrequently compared to the speed with which
778	   congestion changes.  Although this feedback will always be out of
779	   date, on average positive errors should cancel out negative over a
780	   sufficiently long duration.

782	   In summary, the network adds pre-congestion marking in the forward
783	   data path, the egress feeds its level back to the ingress in RSVP (or
784	   similar signalling), then the ingress gateway re-echoes it into the
785	   forward data path by blanking the RE flag.  Hence the name re-ECN.
786	   Then at any border within the Diffserv region, the pre-congestion
787	   marking that every passing packet will be expected to experience
788	   downstream can be measured to be the RE blanking fraction minus the
789	   congestion marking fraction.

791	4.3.2.  Aggregate Bootstrap

793	   When a new reservation PATH message arrives at the egress, if there
794	   are currently no flows in progress from the same ingress, there will
795	   be no state maintaining the current level of pre-congestion marking
796	   for the aggregate.  While the reservation signalling continues onward
797	   towards the receiving host, the egress gateway returns an RSVP
798	   message to the ingress with a flag [RSVP-ECN] asking the ingress to
799	   send a specified number of data probes between them.  This bootstrap
800	   behaviour is all described in the deployment model [CL-deploy].

802	   However, with our new re-ECN scheme, the ingress does not know what
803	   proportion of the data probes should have the RE flag blanked,
804	   because it has no estimate yet of pre-congestion for the path across
805	   the Diffserv region.

807	   To be conservative, following the guidance for specifying other re-
808	   ECN transports in [Re-TCP], the ingress SHOULD set the FNE codepoint
809	   of the extended ECN header in all probe packets (Table 2).  As per
810	   the deployment model, the egress gateway measures the fraction of
811	   congestion-marked probe octets and feeds back the resulting pre-
812	   congestion level to the ingress, piggy-backed on the returning
813	   reservation response (RESV) for the new flow.  Probe packets are
814	   identifiable by the egress because they have the ingress as the
815	   source and the egress as the destination in the IP header.

817	   It may seem inadvisable to expect the FNE codepoint to be set on
818	   probes, given legacy firewalls etc. might discard such packets
819	   (because this flag had no previous legitimate use).  However, in the
820	   deployment scenarios envisaged, each domain in the Diffserv region
821	   has to be explicitly configured to support the controlled load
822	   service.  So, before deploying the service, the operator MUST
823	   reconfigure such a misbehaving middlebox to allow through packets
824	   with the RE flag set.

826	   Note that we have said SHOULD rather than MUST for the FNE setting
827	   behaviour of the ingress for probe packets.  This entertains the
828	   possibility of an ingress implementation having the benefit of other
829	   knowledge of the path, which it re-uses for a newly starting
830	   aggregate.  For instance, it may hold cached information from a
831	   recent use of the aggregate that is still sufficiently current to be
832	   useful.

834	   It might seem pedantic worrying about these few probe packets, but
835	   this behaviour ensures the system is safe, even if the proportion of
836	   probe packets becomes large.

838	4.3.3.  Flow Bootstrap

840	   It might be expected that a new flow within an active aggregate would
841	   need no special bootstrap behaviour.  If there was an aggregate
842	   already in progress between the gateways the new flow was about to
843	   use, it would inherit the prevailing RE blanking fraction.  And if
844	   there were no active aggregate, the bootstrap behaviour for an
845	   aggregate would be appropriate and sufficient for the new flow.

847	   However, for a number of reasons, at least the first packet of each
848	   new flow SHOULD be set to the FNE codepoint, irrespective of whether
849	   it is joining an active aggregate or not.  If the first packet is
850	   unlikely to be reliably delivered, a number of FNE packets MAY be
851	   sent to increase the probability that at least one is delivered to
852	   the egress gateway.

854	   If each flow does not start with an FNE packet, it will be seen later
855	   that sanctions may be too strict at the interface before the egress
856	   gateway.  It will often be possible to apply sanctions at the
857	   granularity of aggregates rather than flows, but in an internetworked
858	   environment it cannot be guaranteed that aggregates will be
859	   identifiable in remote networks.  So setting FNE at the start of each
860	   flow is a safe strategy.  For instance, a remote network may have
861	   equal cost multi-path (ECMP) routing enabled, causing different flows
862	   between the same gateways to traverse different paths.

864	   After an idle period of more than 1 second, the ingress gateway
865	   SHOULD set the EECN field of the next packet it sends to FNE.  This
866	   allows the design of network policers to be deterministic (see [Re-
867	   TCP]).

869	   However, if the ingress gateway can guarantee that the network(s)
870	   that will carry the flow to its egress gateway all use a common
871	   identifier for the aggregate (e.g. a single MPLS network without ECMP
872	   routing), it MAY omit FNE when it adds a new flow to an active
873	   aggregate.  And an FNE packet need only be sent if a whole aggregate
874	   has been idle for more than 1 second.
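
   The rules in this section can be summarised as a small decision
   function.  This is a sketch under stated assumptions: the function
   and parameter names are hypothetical, timestamps are in seconds,
   and `common_aggregate_id` stands for the ingress being able to
   guarantee a common aggregate identifier along the whole path.

```python
IDLE_LIMIT = 1.0  # seconds; the idle period named in the text

def next_packet_is_fne(now, last_sent, first_packet_of_flow,
                       common_aggregate_id=False):
    """Decide whether the ingress should send the next packet as FNE.

    `last_sent` is the time the previous packet was sent, or None if
    nothing has been sent yet.  When `common_aggregate_id` is True it
    refers to the whole aggregate; otherwise it refers to the flow.
    """
    idle = last_sent is None or (now - last_sent) > IDLE_LIMIT
    if common_aggregate_id:
        # A new flow joining an active aggregate need not start with FNE.
        return idle
    return first_packet_of_flow or idle
```

   Sending several FNE packets at the start of a flow whose first
   packet may be lost is left out of this sketch.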

876	4.3.4.  Router Forwarding Behaviour

878	   Adding re-ECN works well without modifying the forwarding behaviour
879	   of any routers.  However, below, two changes are proposed when
880	   forwarding packets with a per-hop-behaviour that requires pre-
881	   congestion notification:

883	   Preferential drop: When a router cannot avoid dropping ECN-capable
884	      packets, preferential dropping of packets with different extended
885	      ECN codepoints SHOULD be implemented between packets within a PHB
886	      that uses PCN marking.  The drop preference order to use is
887	      defined in Table 4.  Note that to reduce configuration complexity,
888	      Re-Echo and FNE MAY be given the same drop preference, but if
889	      feasible, FNE should be dropped in preference to Re-Echo.

891	   +--------+------+----------------+---------+------------------------+
892	   |   ECN  |  RE  | Extended ECN   | Drop    |     Re-ECN meaning     |
893	   |  field | flag | codepoint      | Pref    |                        |
894	   +--------+------+----------------+---------+------------------------+
895	   |   01   |   0  | Re-Echo        | 5/4     |  Re-echoed congestion  |
896	   |        |      |                |         |        and RECT        |
897	   |   00   |   1  | FNE            | 4       |      Feedback not      |
898	   |        |      |                |         |       established      |
899	   |   01   |   1  | RECT           | 3       |     Re-ECN capable     |
900	   |        |      |                |         |        transport       |
901	   |   10   |   0  | AM(0)          | 3       | Admission Marking with |
902	   |        |      |                |         |         Re-Echo        |
903	   |   10   |   1  | AM(-1)         | 3       |    Admission Marking   |
904	   |        |      |                |         |                        |
905	   |   11   |   0  | PM(0)          | 2       |   Pre-emption Marking  |
906	   |        |      |                |         |      with Re-Echo      |
907	   |   11   |   1  | PM(-1)         | 2       |   Pre-emption Marking  |
908	   |        |      |                |         |                        |
909	   |   00   |   0  | Not-RECT       | 1       |   Not re-ECN-capable   |
910	   |        |      |                |         |        transport       |
911	   +--------+------+----------------+---------+------------------------+

913	      Table 4: Drop Preference of Extended ECN Codepoints (1 = drop 1st)

915	      Given this proposal is being advanced at the same time as PCN
916	      itself, we strongly RECOMMEND that preferential drop based on
917	      extended ECN codepoint is added to router forwarding at the same
918	      time as PCN marking.  Preferential dropping can be difficult to
919	      implement, but we strongly RECOMMEND this security-related re-ECN
920	      improvement where feasible as it is an effective defence against
921	      flooding attacks.

923	   Marking vs. Drop: We propose that PCN-routers SHOULD inspect the RE
924	      flag as well as the ECN field to decide whether to drop or mark
925	      packets with PCN DSCPs.  They MUST choose drop if the codepoint
926	      of this extended ECN field is Not-RECT.  Otherwise they SHOULD
927	      mark (unless, of course, buffer space is exhausted).

929	      A PCN-capable router MUST NOT ever congestion mark a packet
930	      carrying the Not-RECT codepoint because the transport will only
931	      understand drop, not congestion marking.  But a PCN-capable router
932	      can mark rather than drop an FNE packet, even though its ECN field
933	      when looked at in isolation is '00' which appears to be a legacy
934	      Not-ECT packet.  Therefore, if a packet's RE flag is '1', even if
935	      its ECN field is '00', a PCN-enabled router SHOULD use congestion
936	      marking.  This allows the `feedback not established' (FNE)
937	      codepoint to be used for probe packets, in order to pick up PCN
938	      marking when bootstrapping an aggregate.

940	      ECN marking rather than dropping of FNE packets MUST only be
941	      deployed in controlled environments, such as that in [CL-deploy],
942	      where the presence of an egress node that understands ECN marking
943	      is assured.  Congestion events might otherwise be ignored if the
944	      receiver only understands drop, rather than ECN marking.  This is
945	      because there is no guarantee that ECN capability has been
946	      negotiated if feedback is not established (FNE).  Also, [Re-TCP]
947	      places the strong condition that a router MUST apply drop rather
948	      than marking to FNE packets unless it can guarantee that FNE
949	      packets are rate limited either locally or upstream.
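
   The two forwarding changes above might be sketched as follows.
   This is illustrative only: the names are hypothetical, codepoints
   are given by their Table 4 names, Re-Echo's "5/4" entry is taken
   as 5, and marking of FNE packets assumes the controlled
   environment and rate limiting that the text requires.

```python
# Drop preference from Table 4 (1 = drop first).
DROP_PREF = {
    "Re-Echo": 5, "FNE": 4, "RECT": 3, "AM(0)": 3, "AM(-1)": 3,
    "PM(0)": 2, "PM(-1)": 2, "Not-RECT": 1,
}

def pick_victim(queued_codepoints):
    """Index of the queued packet to drop first (lowest preference
    value; the earliest such packet on a tie)."""
    return min(range(len(queued_codepoints)),
               key=lambda i: DROP_PREF[queued_codepoints[i]])

def congestion_action(codepoint, buffer_full=False):
    """Return "drop" or "mark" for a packet the router must signal on."""
    if buffer_full or codepoint == "Not-RECT":
        return "drop"  # Not-RECT transports only understand drop
    return "mark"      # includes FNE, whose ECN field alone reads '00'
```

   For instance, with RECT, Re-Echo, Not-RECT and FNE packets queued,
   pick_victim() selects the Not-RECT packet.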

951	4.3.5.  Extensions

953	   If a different signalling system, such as NSIS, were used, but it
954	   provided admission control in a similar way, using pre-congestion
955	   notification (e.g. with RMD [NSIS-RMD]) we believe re-ECN could be
956	   used to protect against misbehaving networks in the same way as
957	   proposed above.

959	5.  Emulating Border Policing with Re-ECN

961	5.1.  Informal Terminology

963	   In the rest of this memo, where the context makes it clear, we will
964	   sometimes loosely use the term `congestion' rather than using the
965	   stricter `downstream pre-congestion'.  Also we will loosely talk of
966	   positive or negative flows, meaning flows where the moving average of
967	   the downstream pre-congestion metric is persistently positive or
968	   negative.  The notion of a negative metric arises because it is
969	   derived by subtracting one metric from another.  Of course actual
970	   downstream congestion cannot be negative, only the metric can
971	   (whether due to time lags or deliberate malice).

973	   Just as we will loosely talk of positive and negative flows, we will
974	   also talk of positive or negative packets, meaning packets that
975	   contribute positively or negatively to downstream pre-congestion.

977	   Therefore packets can be considered to have a `worth' of +1, 0 or -1,
978	   which, when multiplied by their size, indicates their contribution to
979	   downstream congestion.  Packets will usually be sent with a worth of
980	   0.  Blanking the RE flag increments the worth of a packet to +1.
981	   Congestion marking a packet decrements its worth (whether admission
982	   marking or pre-emption marking).  Congestion marking a previously
983	   blanked packet cancels out the positive and negative worth of each
984	   marking (a worth of 0).  The FNE codepoint is an exception.  It has
985	   the same positive worth as a packet with the Re-Echo codepoint.  The
986	   table below specifies unambiguously the worth of each extended ECN
987	   codepoint.  Note the order is different from the previous table to
988	   emphasise how congestion marking processes decrement the worth.

990	   +--------+------+------------------+-------+------------------------+
991	   |   ECN  |  RE  | Extended ECN     | Worth |     Re-ECN meaning     |
992	   |  field | flag | codepoint        |       |                        |
993	   +--------+------+------------------+-------+------------------------+
994	   |   00   |   0  | Not-RECT         | n/a   |   Not re-ECN-capable   |
995	   |        |      |                  |       |        transport       |
996	   |   01   |   0  | Re-Echo          | +1    |  Re-echoed congestion  |
997	   |        |      |                  |       |        and RECT        |
998	   |   10   |   0  | AM(0)            | 0     | Admission Marking with |
999	   |        |      |                  |       |         Re-Echo        |
1000	   |   11   |   0  | PM(0)            | 0     |   Pre-emption Marking  |
1001	   |        |      |                  |       |      with Re-Echo      |
1002	   |   00   |   1  | FNE              | +1    |      Feedback not      |
1003	   |        |      |                  |       |       established      |
1004	   |   01   |   1  | RECT             | 0     |     Re-ECN capable     |
1005	   |        |      |                  |       |        transport       |
1006	   |   10   |   1  | AM(-1)           | -1    |    Admission Marking   |
1007	   |        |      |                  |       |                        |
1008	   |   11   |   1  | PM(-1)           | -1    |   Pre-emption Marking  |
1009	   +--------+------+------------------+-------+------------------------+

1011	                Table 5: 'Worth' of Extended ECN Codepoints
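
   As an illustrative sketch only, Table 5 can be transcribed into a
   lookup keyed by the ECN field and the RE flag.  The names WORTH and
   contribution are hypothetical, and treating a Not-RECT packet
   (worth n/a) as contributing nothing is an assumption made here for
   convenience, not something the table states.

```python
# Worth of each extended ECN codepoint, keyed by (ECN field, RE flag),
# transcribed from Table 5.  None stands for "n/a" (Not-RECT).
WORTH = {
    (0b00, 0): ("Not-RECT", None),
    (0b01, 0): ("Re-Echo", +1),
    (0b10, 0): ("AM(0)", 0),
    (0b11, 0): ("PM(0)", 0),
    (0b00, 1): ("FNE", +1),
    (0b01, 1): ("RECT", 0),
    (0b10, 1): ("AM(-1)", -1),
    (0b11, 1): ("PM(-1)", -1),
}


def contribution(ecn, re_flag, size_octets):
    """Return worth * size in octets for one packet.

    Assumption: Not-RECT packets (worth n/a) are counted as zero.
    """
    _name, worth = WORTH[(ecn, re_flag)]
    if worth is None:
        return 0
    return worth * size_octets
```

   For example, contribution(0b01, 0, 1500) gives the positive
   contribution of a 1500 octet Re-Echo packet, and the same packet
   after admission marking, contribution(0b10, 0, 1500), contributes
   nothing.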

1013	5.2.  Policing Overview

1015	   It will be recalled that downstream congestion can be found by
1016	   subtracting upstream congestion from path congestion.  Figure 4
1017	   displays the difference between the two plots in Figure 3 to show
1018	   downstream pre-congestion across the same path through the Internet.

1020	   To emulate border policing, the general idea is for each domain to
1021	   apply penalties to its upstream neighbour in proportion to the amount
1022	   of downstream pre-congestion that the upstream network sends across
1023	   the border.  That is, the penalties should be in proportion to the
1024	   height of the plot.  Downward arrows in the figure show the resulting
1025	   pressure for each domain to under-declare downstream pre-congestion
1026	   in traffic they pass to the next domain, because of the penalties.

1028	               p e n a l t i e s
1029	              /        |        \
1030	       A     :         :         :
1031	       |     |  <--A---> <---B---> <---C--->           domain
1032	       |     V         :         :         :
1033	    3% |    +-----+    |         |         :
1034	       |    |     |    V         V         :
1035	    2% |    |     +----------------------+ :
1036	       |    |  downstream pre-congestion | :
1037	    1% |    |     :                      | :
1038	       |    |     :                      | :
1039	    0% +----+----------------------------+====+------>
1040	            :     :                      : A  :
1041	            :     :                      : |  :
1042	        ingress   :                      : :  egress
1043	                1.00%                 2.00%:         pre-congestion
1044	                                           |
1045	                                       sanctions

1047	   Figure 4: Policing Framework, showing creation of opposing pressures
1048	   to under-declare and over-declare downstream pre-congestion, using
1049	   penalties and sanctions

1051	   These penalties seem to encourage everyone to understate downstream
1052	   congestion in order to reduce the penalties they incur.  But a
1053	   balancing pressure is introduced by the last domain, which applies
1054	   sanctions to flows if downstream congestion goes negative before the
1055	   egress gateway.  The upward arrow at Domain C's border with the
1056	   egress gateway represents the incentive the sanctions would create to
1057	   prevent negative traffic.  The same upward pressure can be applied at
1058	   any domain border (arrows not shown).

1060	   Any flow that persistently goes negative by the time it leaves a
1061	   domain must not have been marked correctly in the first place.  A
1062	   domain that discovers such a flow can adopt a range of strategies to
1063	   protect itself.  Which strategy it uses will depend on policy,
1064	   because it cannot immediately assume malice---there may be an
1065	   innocent configuration error somewhere in the system.

1067	   This memo does not propose to standardise any particular mechanism to
1068	   detect persistently negative flows, but Section 5.5 does give
1069	   examples.  Note that we have used the term flow, but there will be no
1070	   need to delve into the transport layer for port numbers; identifiers
1071	   visible in the network layer will be sufficient (IP address pair,
1072	   DSCP, protocol ID).  The appendix also gives a mechanism to bound the
1073	   required flow state, preventing state exhaustion attacks.

1075	   Of course, some domains may trust other domains to comply with
1076	   admission control without applying sanctions or penalties.  In these
1077	   cases, the protocol should still be used but no penalties need be
1078	   applied.  The re-ECN protocol ensures downstream pre-congestion
1079	   marking is passed on correctly whether or not penalties are applied
1080	   to it, so the system works just as well with a mixture of some
1081	   domains trusting each other and others not.

1083	   Providers should be free to agree the contractual terms they wish
1084	   between themselves, so this memo does not propose to standardise how
1085	   these penalties would be applied.  It is sufficient to standardise
1086	   the re-ECN protocol so the downstream pre-congestion metric is
1087	   available if providers choose to use it.  However, the next section
1088	   (Section 5.3) gives some examples of how these penalties might be
1089	   implemented.

1091	5.3.  Pre-requisite Contractual Arrangements

1093	   The re-ECN protocol has been chosen to solve the policing problem
1094	   because it embeds a downstream pre-congestion metric in passing CL
1095	   traffic that is difficult to lie about and can be measured in bulk.
1096	   The ability to emulate border policing depends on network operators
1097	   choosing to use this metric as one of the elements in their contracts
1098	   with each other.

1100	   Already many inter-domain agreements involve a capacity and a usage
1101	   element.  The usage element may be based on volume or various
1102	   measures of peak demand.  We expect that those network operators who
1103	   choose to use pre-congestion notification for admission control would
1104	   also be willing to consider using this downstream pre-congestion
1105	   metric as a usage element in their interconnection contracts for
1106	   admission controlled (CL) traffic.

1108	   Congestion (or pre-congestion) has the dimension of [octet], being
1109	   the product of volume transferred [octet] and the congestion fraction
1110	   [dimensionless], which is the fraction of the offered load that the
1111	   network isn't able to serve (or would rather not serve in the case of
1112	   pre-congestion).  Measuring downstream congestion gives a measure of
1113	   the volume transferred but modulated by congestion expected
1114	   downstream.  So volume transferred during off-peak periods counts as
1115	   nearly nothing, while volume transferred at peak times counts very
1116	   highly.  The re-ECN protocol allows one network to measure how much
1117	   pre-congestion has been `dumped' into it by another network.  And
1118	   then in turn how much of that pre-congestion it dumped into the next
1119	   downstream network.

1121	   Section 5.6 describes mechanisms for calculating border penalties,
1122	   referring to Appendix A.2 for suggested metering algorithms for
1123	   downstream congestion at a border router.  Conceptually, metering
1124	   could hardly be simpler.  It broadly involves accumulating the
1125	   volume of packets with the RE flag blanked and the volume of those
1126	   with congestion marking, then subtracting the second from the first.

1128	   Once this downstream pre-congestion metric is available, operators
1129	   are free to choose how they incorporate it into their interconnection
1130	   contracts [IXQoS].  Some may include a threshold volume of pre-
1131	   congestion as a quality measure in their service level agreement,
1132	   perhaps with a penalty clause if the upstream network exceeds this
1133	   threshold over, say, a month.  Others may agree a set of tiered
1134	   monthly thresholds, with increasing penalties as each threshold is
1135	   exceeded.  But, it would be just as easy, and more resistant to
1136	   gaming, to do away with discrete thresholds, and instead make the
1137	   penalty rise smoothly with the volume of pre-congestion by applying a
1138	   price to pre-congestion itself.  Then the usage element of the
1139	   interconnection contract would directly relate to the volume of pre-
1140	   congestion caused by the upstream network.

1142	   The direction of penalties and charges relative to the direction of
1143	   traffic flow is a constant source of confusion.  Typically, where
1144	   capacity charges are concerned, lower tier customer networks pay
1145	   higher tier provider networks.  So money flows from the edges to the
1146	   middle of the internetwork, towards greater connectivity,
1147	   irrespective of the flow of data.  But we advise that penalties or
1148	   charges for usage should follow the same direction as the data
1149	   flow---the direction of control at the network layer.  Otherwise a
1150	   network lays itself open to `denial of funds' attacks.  So, where a
1151	   tier 2 provider sends data into a tier 3 customer network, we would
1152	   expect the penalty clauses for sending too much pre-congestion to be
1153	   against the tier 2 network, even though it is the provider.

1155	   It may help to remember that data will be flowing in the other
1156	   direction too.  So the provider network has as much opportunity to
1157	   levy usage penalties as its customer, and it can set the price or
1158	   strength of its own penalties higher if it chooses.  Usage charges in
1159	   both directions tend to cancel each other out, which confirms that
1160	   usage-charging is less to do with revenue raising and more to do with
1161	   encouraging load control discipline in order to smooth peaks and
1162	   troughs, improving utilisation and quality.

1164	   Further, when operators agree penalties in their interconnection
1165	   contracts for sending downstream congestion, they should make sure
1166	   that any level of negative marking only equates to zero penalty.  In
1167	   other words, penalties are always paid in the same direction as the
1168	   data, and never against the data flow, even if downstream congestion
1169	   seems to be negative.  This is consistent with the definition of
1170	   physical congestion; when a resource is underutilised, it is not
1171	   negatively congested.  Its congestion is just zero.  So, although
1172	   short periods of negative marking can be tolerated to correct
1173	   temporary over-declarations due to lags in the feedback system,
1174	   persistent downstream negative congestion can have no physical
1175	   meaning and therefore must signify a problem.  The incentive for
1176	   domains not to tolerate persistently negative traffic depends on this
1177	   principle that penalties must never be paid against the data flow.

1179	   Also note that at the last egress of the Diffserv region, domain C
1180	   should not agree to pay any penalties to the egress gateway for pre-
1181	   congestion passed to the egress gateway.  Downstream pre-congestion
1182	   to the egress gateway should have reached zero here.  If domain C
1183	   were to agree to pay for any remaining downstream pre-congestion, it
1184	   would give the egress gateway an incentive to over-declare pre-
1185	   congestion feedback and take the resulting profit from domain C.

1187	   To focus the discussion, from now on, unless otherwise stated, we
1188	   will assume a downstream network charges its upstream neighbour in
1189	   proportion to the pre-congestion it sends (V_b in the notation of
1190	   Appendix A.2).  Effectively tiered thresholds would be just more
1191	   coarse-grained approximations of the fine-grained case we choose to
1192	   examine.  If these neighbours had previously agreed that the (fixed)
1193	   price per octet of pre-congestion would be L, then the bill at the
1194	   end of the month would simply be the product L*V_b, plus any fixed
1195	   charges they may also have agreed.
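
   The billing arithmetic just described can be sketched as follows.
   This is a minimal illustration, not a proposed contract term: the
   function name and parameters are hypothetical, and the floor at
   zero reflects the principle stated earlier in this section that
   any level of negative marking equates to zero penalty.

```python
def monthly_usage_bill(price_per_octet, v_b_octets, fixed_charges=0.0):
    """Return L * V_b plus fixed charges for one accounting period.

    price_per_octet : the agreed price L per octet of pre-congestion
    v_b_octets      : metered downstream pre-congestion V_b in octets
    fixed_charges   : any fixed element the neighbours also agreed

    Assumption: a negative V_b is billed as zero, so that penalties
    are never paid against the direction of data flow.
    """
    return price_per_octet * max(v_b_octets, 0) + fixed_charges
```

   With L = 2.0 and V_b = 1000 octets plus a fixed charge of 50.0, the
   bill is 2050.0.  If V_b were -500 octets, only the fixed charge of
   50.0 would be due.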

1197	   We are well aware that the IETF tries to avoid standardising
1198	   technology that depends on a particular business model.  Indeed, this
1199	   principle is at the heart of all our own work.  Our aim here is to
1200	   make a new metric available that we believe is superior to all
1201	   existing metrics.  Then, our aim is to show that border policing can
1202	   at least work with the one model we have just outlined.  We assume
1203	   that operators might then experiment with the metric in other models.
1204	   Of course, operators are free to complement this pre-congestion-based
1205	   usage element of their charges with traditional capacity charging,
1206	   and we expect they will.

1208	   Also note well that everything we discuss in this memo only concerns
1209	   interconnection within the Diffserv region.  ISPs are free to sell or
1210	   give away reservations however they want on the retail market.  But
1211	   of course, interconnection charges will have a bearing on that.
1212	   Indeed, in the present scenario, the ingress gateway effectively
1213	   sells reservations on one side and buys congestion penalties on the
1214	   other.  As congestion rises, one can imagine the gateway discovering
1215	   that congestion penalties have risen higher than the (probably fixed)
1216	   revenue it will earn from selling the next flow reservation.  This
1217	   encourages the gateway to cut its losses by blocking new calls, which
1218	   is why we believe downstream congestion penalties can emulate per-
1219	   flow rate policing at borders, as the next section explains.

1221	5.4.  Emulation of Per-Flow Rate Policing: Rationale and Limits

1223	   The important feature of charging in proportion to congestion volume
1224	   is that the penalty aggregates and disaggregates correctly along with
1225	   packet flows.  This is because the penalty rises linearly with bit
1226	   rate (unless congestion is absolutely zero) and linearly with
1227	   congestion, because it is the product of them both.  So if the
1228	   packets crossing a border belong to a thousand flows, and one of
1229	   those flows doubles its rate, the ingress gateway forwarding that
1230	   flow will have to put twice as much congestion marking into the
1231	   packets of that flow.  And this extra congestion marking will add
1232	   proportionately to the penalties levied at every border the flow
1233	   crosses in proportion to the amount of pre-congestion remaining on
1234	   the path.
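
   A small worked example may make the aggregation property concrete.
   The figures below are invented for illustration (a thousand flows
   of 10,000 octet/s, each seeing 1% downstream pre-congestion over
   60 s), and the helper congestion_volume is hypothetical.  Exact
   fractions are used so that the arithmetic is not blurred by
   rounding.

```python
from fractions import Fraction


def congestion_volume(flows, seconds):
    """Sum rate * congestion fraction * time over (rate, fraction) pairs."""
    return sum(rate * fraction * seconds for rate, fraction in flows)


# A thousand flows, each 10,000 octet/s with 1% downstream pre-congestion.
flows = [(10_000, Fraction(1, 100))] * 1000
before = congestion_volume(flows, 60)

# One flow doubles its rate; every other flow is unchanged.
doubled = [(20_000, Fraction(1, 100))] + flows[1:]
after = congestion_volume(doubled, 60)

# The increase in the aggregate equals the doubled flow's original
# contribution, without the meter needing to identify the flow.
extra = after - before
```

   Here the aggregate before the change is 6,000,000 octets of
   pre-congestion, and doubling one flow adds 6,000 octets, which is
   exactly that flow's previous share.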

1236	   Effectively, usage charges will continuously flow from ingress
1237	   gateways to the places generating pre-congestion marking, in
1238	   proportion to the pre-congestion marking introduced and to the data
1239	   rates from those gateways.

1241	   As importantly, pre-congestion itself rises super-linearly with
1242	   utilisation of a particular resource.  So if someone tries to push
1243	   another flow into a path that is already signalling enough pre-
1244	   congestion to warrant admission control, the penalty will be a lot
1245	   greater than it would have been to add the same flow to a less
1246	   congested path.  This makes the incentive system fairly insensitive
1247	   to the actual level of pre-congestion for triggering admission
1248	   control that each ingress chooses.  The deterrent against exceeding
1249	   whatever threshold is chosen rises very quickly with a small amount
1250	   of cheating.

1252	   These are the properties that allow re-ECN to emulate per-flow border
1253	   policing of both rate and admission control.  It is not a perfect
1254	   emulation of per-flow border policing, but we claim it is sufficient
1255	   to at least ensure the cost to others of a cheat is borne by the
1256	   cheater, because the penalties are at least proportionate to the
1257	   level of the cheat.  If an edge network operator is selling
1258	   reservations at a large profit over the congestion cost, these pre-
1259	   congestion penalties will not be sufficient to ensure networks in the
1260	   middle get a share of those profits, but at least they can cover
1261	   their costs.

1263	   We will now explain with an example.  When a whole inter-network is
1264	   operating at normal (typically very low) congestion, the pre-
1265	   congestion marking from virtual queues will be a little higher than
1266	   if the real queues had been used---still low, but more noticeable.
1267	   But low congestion levels do not imply that usage /charges/ must also
1268	   be low.  Usage charges will depend on the /price/ L as well.

1270	   If the metric of the usage element of an interconnection agreement
1271	   was changed from pure volume to pre-congested volume, one would
1272	   expect the price of pre-congestion to be arranged so that the total
1273	   usage charge remained about the same.  So, if an average pre-
1274	   congestion fraction turned out to be 1/1000, one would expect that
1275	   the price L (per octet) of pre-congestion would be about 1000 times
1276	   the previously used (per octet) price for volume.  We should add that
1277	   a switch to pre-congestion is unlikely to exactly maintain the same
1278	   overall level of usage charges, but this argument will be
1279	   approximately true, because usage charge will rise to at least the
1280	   level the market finds necessary to push back against usage.
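
   The price scaling argument can be checked with a short
   calculation.  The numbers are purely illustrative (a hypothetical
   volume price and monthly volume), and the sketch assumes, as the
   paragraph above does, that the total usage charge is to stay the
   same after the switch.

```python
from fractions import Fraction


def equivalent_congestion_price(volume_price, mean_congestion_fraction):
    """Price L per octet of pre-congestion that keeps total charges equal."""
    return volume_price / mean_congestion_fraction


volume_price = Fraction(1, 10**9)   # hypothetical price per octet of volume
fraction = Fraction(1, 1000)        # average pre-congestion fraction
L = equivalent_congestion_price(volume_price, fraction)

volume = 10**12                     # hypothetical octets sent in a month
old_charge = volume_price * volume          # charge based on pure volume
new_charge = L * (volume * fraction)        # charge based on pre-congestion
```

   L comes out at 1000 times the volume price, and both charging
   bases yield the same monthly total.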

1282	   From the above example it can be seen why a 1000x higher price will
1283	   make operators become acutely sensitive to the congestion they cause
1284	   in other networks, which is of course the desired effect; to
1285	   encourage networks to /control/ the congestion they allow their users
1286	   to cause to others.

1288	   If any network sends even one flow at a higher rate, it will
1289	   immediately have to pay proportionately more usage charges.  Because
1290	   there is no knowledge of reservations within the Diffserv region, no
1291	   interior router can police whether the rate of each flow is greater
1292	   than each reservation.  So the system doesn't truly emulate rate-
1293	   policing of each flow.  But there is no incentive to pack a higher
1294	   rate into a reservation, because the charges are directly
1295	   proportional to rate, irrespective of the reservations.

1297	   However, if virtual queues start to fill on any path, even though
1298	   real queues will still be able to provide low latency service, pre-
1299	   congestion marking will rise fairly quickly.  It may eventually reach
1300	   the threshold where the ingress gateway would deny admission to new
1301	   flows.  If the ingress gateway cheats and continues to admit new
1302	   flows, the affected virtual queues will rapidly fill, even though the
1303	   real queues will still be little worse than they were when admission
1304	   control should have been invoked.  The ingress gateway will have to
1305	   pay the penalty for such an extremely high pre-congestion level, so
1306	   the pressure to invoke admission control should become unbearable.

1308	   The above mechanisms protect against rational operators.  In
1309	   Section 5.6.3 we discuss how networks can protect themselves from
1310	   accidental or deliberate misconfiguration in neighbouring networks.

1312	5.5.  Sanctioning Dishonest Marking

1314	   As CL traffic leaves the last network before the egress gateway
1315	   (domain C) the RE blanking fraction should match the congestion
1316	   marking fraction, when averaged over a sufficiently long duration
1317	   (perhaps ~10s to allow a few rounds of feedback through regular
1318	   signalling of new and refreshed reservations).

1320	   To protect itself, domain C should install a monitor at its egress.
1321	   It aims to detect flows of CL packets that are persistently negative.
1322	   If flows are positive, domain C need take no action---this simply
1323	   means an upstream network must be paying more penalties than it needs
1324	   to.  Appendix A.3 gives a suggested algorithm for the monitor,
1325	   meeting the criteria below.

1327	   o  It SHOULD introduce minimal false positives for honest flows;

1329	   o  It SHOULD quickly detect and sanction dishonest flows (minimal
1330	      false negatives);

1332	   o  It MUST be invulnerable to state exhaustion attacks from malicious
1333	      sources.  For instance, if the dropper uses flow-state, it should
1334	      not be possible for a source to send numerous packets, each with a
1335	      different flow ID, to force the dropper to exhaust its memory
1336	      capacity;

1338	   o  It MUST introduce sufficient loss in goodput so that malicious
1339	      sources cannot play off losses in the egress dropper against
1340	      higher allowed throughput.  Salvatori [CLoop_pol] describes this
1341	      attack, which involves the source understating path congestion
1342	      then inserting forward error correction (FEC) packets to
1343	      compensate expected losses.

1345	   Note that the monitor operates on flows but with careful design we
1346	   can avoid per-flow state.  This is why we have been careful to ensure
1347	   that all flows MUST start with a packet marked with the FNE
1348	   codepoint.  If a flow does not start with the FNE codepoint, a
1349	   monitor is likely to treat it unfavourably.  This risk makes it worth
1350	   setting the FNE codepoint at the start of a flow, even though there
1351	   is a cost to setting FNE (positive `worth').

1353	   Starting flows with an FNE packet also means that a monitor will be
1354	   resistant to state exhaustion attacks from other networks, as the
1355	   monitor can then be designed to never create state unless an FNE
1356	   packet arrives.  And an FNE packet counts positive, so it will cost a
1357	   lot for a network to send many of them.
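
   The idea of creating state only when an FNE packet arrives can be
   sketched as below.  This is not the algorithm of Appendix A.3.  The
   class FlowMonitor, the max_flows cap and the use of a simple
   cumulative balance to stand in for `persistently negative' are all
   assumptions made for illustration.  The flow identifier is taken to
   be a tuple of network layer fields (address pair, DSCP, protocol
   ID).

```python
# Worth of each re-ECN-capable codepoint, from Table 5.
WORTH = {
    "Re-Echo": 1, "FNE": 1,
    "RECT": 0, "AM(0)": 0, "PM(0)": 0,
    "AM(-1)": -1, "PM(-1)": -1,
}


class FlowMonitor:
    """Tracks per-flow worth, creating state only on an FNE packet."""

    def __init__(self, max_flows):
        self.max_flows = max_flows   # hypothetical hard cap on flow state
        self.balance = {}            # flow_id -> running worth * size (octets)

    def observe(self, flow_id, codepoint, size_octets):
        """Account for one packet; return False if the flow is untracked."""
        if flow_id not in self.balance:
            # Never create state unless the flow starts with FNE,
            # and never beyond the configured cap.
            if codepoint != "FNE" or len(self.balance) >= self.max_flows:
                return False
            self.balance[flow_id] = 0
        self.balance[flow_id] += WORTH[codepoint] * size_octets
        return True

    def negative_flows(self):
        """Return tracked flows whose cumulative balance is negative."""
        return [f for f, b in self.balance.items() if b < 0]
```

   What a monitor does with packets from untracked flows (the False
   return above) is a policy choice that this sketch leaves open.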

1359	   Monitor algorithms will often maintain a moving average across flows
1360	   of the fraction of RE blanked packets.  When maintaining an average
1361	   across flows, a monitor MUST ignore packets with the FNE codepoint
1362	   set.  An ingress gateway sets the FNE codepoint when it does not have
1363	   the benefit of feedback from the egress.  So counting packets with
1364	   FNE set would be likely to make the average unnecessarily
1365	   positive, providing headroom (or should we say footroom?) for
1366	   dishonest (negative) traffic.
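
   One way such a moving average might be maintained is sketched
   below.  The exponentially weighted form, the weight parameter and
   the per-packet (rather than per-octet) sampling are assumptions
   chosen for brevity.  The only behaviour taken from the text is
   that packets carrying the FNE codepoint leave the average
   untouched.

```python
# Codepoints of re-ECN-capable packets whose RE flag is blanked (Table 5).
RE_BLANKED = {"Re-Echo", "AM(0)", "PM(0)"}


def update_blanked_fraction(average, codepoint, weight=0.01):
    """Return the updated moving average of the RE blanked fraction.

    Packets with the FNE codepoint are ignored.  Every other packet
    contributes a sample of 1.0 if its RE flag is blanked, else 0.0.
    """
    if codepoint == "FNE":
        return average
    sample = 1.0 if codepoint in RE_BLANKED else 0.0
    return (1 - weight) * average + weight * sample
```

   A smaller weight gives a longer memory, which trades detection
   speed against sensitivity to short bursts.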

1368	   If the monitor detects a persistently negative flow, it could drop
1369	   sufficient negative and neutral packets to force the flow to not be
1370	   negative.  This is the approach taken for the `egress dropper' in
1371	   [Re-TCP], but for the scenario in this memo, where everyone would
1372	   expect everyone else to keep to the protocol, a management alarm
1373	   SHOULD be raised on detecting persistently negative traffic and any
1374	   automatic sanctions taken SHOULD be logged.  Even if the chosen
1375	   policy is to take no automatic action, the cause can then be
1376	   investigated manually.

1378	   Then no ingress can understate downstream pre-congestion without
1379	   its action being logged.  So network operators can deal
1380	   with offending networks at the human level, out of band.  As a last
1381	   resort, perhaps where the ingress gateway address seems to have been
1382	   spoofed in the signalling, packets can be dropped.  Drops could be
1383	   focused on just sufficient packets in misbehaving flows to remove the
1384	   negative bias while doing minimal harm.

1386	   A future version of this memo may define a control message that could
1387	   be used to notify an offending ingress gateway (possibly via the
1388	   egress gateway) that it is sending persistently negative flows.
1389	   However, we are aware that such messages could be used to test the
1390	   sensitivity of the detection system, so currently we prefer silent
1391	   sanctions.

1393	   An extreme scenario would be where an ingress gateway (or set of
1394	   gateways) mounted a DoS attack against another network.  If their
1395	   traffic caused sufficient congestion to lead to drop but they
1396	   understated path congestion to avoid penalties for causing high
1397	   congestion, the preferential drop recommendations in Section 4.3.4
1398	   would at least ensure that these flows would always be dropped before
1399	   honest flows.

1401	5.6.  Border Mechanisms

1403	5.6.1.  Border Accounting Mechanisms

1405	   One of the main design goals of re-ECN was for border security
1406	   mechanisms to be as simple as possible, otherwise they would become
1407	   the pinch-points that limit scalability of the whole internetwork.
1408	   As the title of this memo suggests, we want to avoid per-flow
1409	   processing at borders.  We also want to keep to passive mechanisms
1410	   that can monitor traffic in parallel to forwarding, rather than
1411	   having to filter traffic inline---in series with forwarding.  As data
1412	   rates continue to rise, we suspect that all-optical interconnection
1413	   between networks will soon be a requirement.  So we want to avoid any
1414	   new need for buffering (even though border filtering is current
1415	   practice for other reasons, we don't want to make it even less likely
1416	   that we will ever get rid of it).

1418	   So far, we have been able to keep the border mechanisms simple,
1419	   despite having had to harden them against some subtle attacks on the
1420	   re-ECN design.  The mechanisms are still passive and avoid per-flow
1421	   processing, although we do use filtering as a fail-safe to
1422	   temporarily shield against extreme events in other networks, such as
1423	   accidental misconfigurations (Section 5.6.3).

1425	   The basic accounting mechanism at each border interface simply
1426	   involves accumulating the volume of packets with positive worth (Re-
1427	   Echo and FNE), and subtracting the volume of those with negative
1428	   worth: AM(-1) and PM(-1).  Even though this mechanism takes no regard
1429	   of flows, over an accounting period (say a month) this subtraction
1430	   will account for the downstream congestion caused by all the flows
1431	   traversing the interface, wherever they come from, and wherever they
1432	   go to.  The two networks can agree to use this metric however they
1433	   wish to determine some congestion-related penalty against the
1434	   upstream network (see Section 5.3 for examples).  Although the
1435	   algorithm could hardly be simpler, it is spelled out using pseudo-
1436	   code in Appendix A.2.1.
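
   For readers who prefer code to prose, the bulk accounting just
   described might look like the following.  This is a hedged sketch
   and not the pseudo-code of Appendix A.2.1.  The class and method
   names are hypothetical, and codepoints are identified by the names
   used in Table 5.

```python
POSITIVE = {"Re-Echo", "FNE"}      # codepoints with worth +1
NEGATIVE = {"AM(-1)", "PM(-1)"}    # codepoints with worth -1


class BorderMeter:
    """Per-interface bulk meter that keeps no per-flow state."""

    def __init__(self):
        self.positive_octets = 0
        self.negative_octets = 0

    def count(self, codepoint, size_octets):
        """Accumulate one packet's size by the sign of its worth."""
        if codepoint in POSITIVE:
            self.positive_octets += size_octets
        elif codepoint in NEGATIVE:
            self.negative_octets += size_octets
        # Codepoints with worth 0 (or n/a) change neither counter.

    def downstream_congestion(self):
        """Positive volume minus negative volume, in octets."""
        return self.positive_octets - self.negative_octets
```

   The two counters are all the state the meter needs per interface,
   whatever the number of flows crossing it.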

1438	   Various attempts to subvert the re-ECN design have been made.  In all
1439	   cases their root cause is persistently negative flows.  But, after
1440	   describing these attacks we will show that we don't actually have to
1441	   get rid of all persistently negative flows in order to thwart the
1442	   attacks.

1444	   In honest flows, downstream congestion is measured as positive minus
1445	   negative volume.  So if all flows are honest (i.e. not persistently
1446	   negative), adding all positive volume and all negative volume without
1447	   regard to flows will give an aggregate measure of downstream
1448	   congestion.  But such simple aggregation is only possible if no flows
1449	   are persistently negative.  Unless persistently negative flows are
1450	   completely removed, they will reduce the aggregate measure of
1451	   congestion.  The aggregate may still be positive overall, but not as
1452	   positive as it would have been had the negative flows been removed.
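
   The polluting effect can be seen with invented per-flow figures.
   The sketch below only illustrates the bias.  It is not a border
   mechanism, since a real border meter sees the aggregate and not
   the per-flow balances listed here.

```python
def aggregate(per_flow_balances):
    """Bulk measure: sum of positive-minus-negative octets over all flows."""
    return sum(per_flow_balances)


honest = [300, 0, 450]        # hypothetical honest flows (octets)
polluted = honest + [-500]    # plus one persistently negative flow

# The aggregate is still positive, but understated by the negative flow.
shortfall = aggregate(honest) - aggregate(polluted)
```

   The aggregate falls from 750 to 250 octets, so the upstream
   network would be charged for 500 octets less than the honest
   flows alone warrant.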

1454	   In Section 5.5 we discussed how to sanction traffic to remove, or at
1455	   least to identify, persistently negative flows.  But, even if the
1456	   sanction for negative traffic is to discard it, unless it is
1457	   discarded at the exact point it goes negative, it will wrongly
1458	   subtract from aggregate downstream congestion, at least at any
1459	   borders it crosses after it has gone negative but before it is
1460	   discarded.

1462	   We rely on sanctions to deter dishonest understatement of congestion.
1463	   But even the ultimate sanction of discard can only be effective if
1464	   the sender is bothered about the data getting through to its
1465	   destination.  A number of attacks have been identified where a sender
1466	   gains from sending dummy traffic or it can attack someone or
1467	   something using dummy traffic even though it isn't communicating any
1468	   information to anyone:

1470	   o  A network can simply create its own dummy traffic to congest
1471	      another network, perhaps causing it to lose business at no cost to
1472	      the attacking network.  This is a form of denial of service
1473	      perpetrated by one network on another.  The preferential drop
1474	      measures in Section 4.3.4 provide crude protection against such
1475	      attacks, but we are not overly worried about more accurate
1476	      prevention measures, because it is already possible for networks
1477	      to DoS other networks on the general Internet, but they generally
1478	      don't because of the grave consequences of being found out.  We
1479	      are only concerned if re-ECN increases the motivation for such an
1480	      attack, as in the next example.

1482	   o  A network can just generate negative traffic and send it over its
1483	      border with a neighbour to reduce the overall penalties that it
1484	      should pay to that neighbour.  It could even initialise the TTL so
1485	      it expired shortly after entering the neighbouring network,
1486	      reducing the chance of detection further downstream.  This attack
1487	      need not be motivated by a desire to deny service and indeed need
1488	      not cause denial of service.  A network's main motivator would
1489	      most likely be to reduce the penalties it pays to a neighbour.
1490	      But, the prospect of financial gain might tempt the network into
1491	      mounting a DoS attack on the other network as well, given the gain
1492	      would offset some of the risk of being detected.

1494	   Note that we have not included DoS by Internet hosts in the above
1495	   list of attacks, because we have restricted ourselves to a scenario
1496	   with edge-to-edge admission control across a Diffserv region.  In
1497	   this case, the edge ingress gateways insulate the Diffserv region
1498	   from DoS by Internet hosts.  Re-ECN resists more general DoS attacks,
1499	   but this is discussed in [Re-TCP].

1501	   The first step towards a solution to all these problems with negative
1502	   flows is to be able to estimate the contribution they make to
1503	   downstream congestion at a border and to correct the measure
1504	   accordingly.  Although ideally we want to remove negative flows
1505	   themselves, perhaps surprisingly, the most effective first step is to
1506	   cancel out the polluting effect negative flows have on the measure of
1507	   downstream congestion at a border.  It is more important to get an
1508	   unbiased estimate of their effect, than to try to remove them all.  A
1509	   suggested algorithm to give an unbiased estimate of the contribution
1510	   from negative flows to the downstream congestion measure is given in
1511	   Appendix A.2.2.

1513	   Although making an accurate assessment of the contribution from
1514	   negative flows may not be easy, just the single step of neutralising
1515	   their polluting effect on congestion metrics removes all the gains
1516	   networks could otherwise make from mounting dummy traffic attacks on
1517	   each other.  This puts all networks on the same side (only with
1518	   respect to negative flows of course), rather than being pitched
1519	   against each other.  The network where this flow goes negative as
1520	   well as all the networks downstream lose out from not being
1521	   reimbursed for any congestion this flow causes.  So they all have an
1522	   interest in getting rid of these negative flows.  Networks forwarding
1523	   a flow before it goes negative aren't strictly on the same side, but
1524	   they are disinterested bystanders---they don't care that the flow
1525	   goes negative downstream, but at least they can't actively gain from
1526	   making it go negative.  The problem becomes localised so that once a
1527	   flow goes negative, all the networks from where it happens and beyond
1528	   downstream each have a small problem, each can detect it has a
1529	   problem and each can get rid of the problem if it chooses to.  But
1530	   negative flows can no longer be used for any new attacks.

1532	   Once an unbiased estimate of the effect of negative flows can be
1533	   made, the problem reduces to detecting and preferably removing flows
1534	   that have gone negative as soon as possible.  But importantly,
1535	   complete eradication of negative flows is no longer critical---best
1536	   endeavours will be sufficient.

1538	   Note that the guiding principle behind all the above discussion is
1539	   that any gain from subverting the protocol should be precisely
1540	   neutralised, rather than punished.  If a gain is punished to a
1541	   greater extent than is sufficient to neutralise it, it will most
1542	   likely open up a new vulnerability, where the amplifying effect of
1543	   the punishment mechanism can be turned on others.

1545	   For instance, if possible, flows should be removed as soon as they go
1546	   negative, but we do NOT RECOMMEND any attempts to discard such flows
1547	   further upstream while they are still positive.  Such over-zealous
1548	   push-back is unnecessary and potentially dangerous.  These flows have
1549	   paid their `fare' up to the point they go negative, so there is no
1550	   harm in delivering them that far.  If someone downstream asks for a
1551	   flow to be dropped as near to the source as possible, because they
1552	   say it is going to become negative later, an upstream node cannot
1553	   test the truth of this assertion.  Rather than have to authenticate
1554	   such messages, re-ECN has been designed so that flows can be dropped
1555	   solely based on locally measurable evidence.  A message hinting that
1556	   a flow should be watched closely to test for negativity is fine, but
1557	   a message claiming that a positive flow will go negative later, and
1558	   should therefore be dropped, is not.

1560	5.6.2.  Competitive Routing

1562	   With the above penalty system, each domain seems to have a perverse
1563	   incentive to fake pre-congestion.  For instance domain B profits from
1564	   the difference between penalties it receives at its ingress (its
1565	   revenue) and those it pays at its egress (its cost).  So if B
1566	   overstates internal pre-congestion it seems to increase its profit.
1567	   However, we can assume that domain A could bypass B, routing through
1568	   other domains to reach the egress.  So the competitive discipline of
1569	   least-cost routing can ensure that any domain tempted to fake pre-
1570	   congestion for profit risks losing /all/ its incoming traffic.  The
1571	   least congested route would eventually be able to win this
1572	   competitive game, only as long as it didn't declare more fake pre-
1573	   congestion than the next most competitive route.
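
   The competitive argument above can be illustrated with a minimal
   sketch.  It is not part of the proposal: the Route type, the
   function names and all the figures are hypothetical, and it assumes
   that domain A simply picks the route that declares the least
   pre-congestion and that one mark costs one penalty unit.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    real_marks: int  # genuine pre-congestion marks per 10,000 packets
    fake_marks: int  # extra marks declared purely for profit (assumed)

    @property
    def declared_marks(self) -> int:
        return self.real_marks + self.fake_marks


def cheapest_route(routes):
    """Domain A picks the route declaring the least pre-congestion."""
    return min(routes, key=lambda r: r.declared_marks)


def faking_gain(route, routes):
    """Penalty units gained from faking; zero if the traffic goes elsewhere."""
    return route.fake_marks if cheapest_route(routes) is route else 0


honest_rival = Route("via B'", real_marks=120, fake_marks=0)
greedy_b = Route("via B", real_marks=100, fake_marks=40)
modest_b = Route("via B", real_marks=100, fake_marks=10)

# Faking more than the gap to the next best route loses all the traffic.
print(faking_gain(greedy_b, [greedy_b, honest_rival]))
# Faking less than the gap keeps the traffic and the small extra margin.
print(faking_gain(modest_b, [modest_b, honest_rival]))
```

   Under these assumptions the gain from faking is bounded by the gap
   between the least congested route and its nearest competitor.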

1575	   This memo does not need to standardise any particular mechanism for
1576	   routing based on re-ECN.  Goldenberg et al. [Smart_rtg] refer to
1577	   various commercial products and present their own algorithms for
1578	   moving traffic between multi-homed routes based on usage charges.
1579	   None of these systems requires any changes to standard protocols
1580	   because the choice between the available border gateway protocol
1581	   (BGP) routes is based on a combination of local knowledge of the
1582	   charging regime and local measurement of traffic levels.  If, as we
1583	   propose, charges or penalties were based on the level of re-ECN
1584	   measured in passing traffic, a similar optimisation could be achieved
1585	   without requiring any changes to standard routing protocols.

1587	   We must be clear that applying pre-congestion-based routing to this
1588	   admission control system remains an open research issue.  Traffic
1589	   engineering based on congestion requires careful damping to avoid
1590	   oscillations, and should not be attempted without adult supervision
1591	   :).  Mortier & Pratt [ECN-BGP] have analysed traffic engineering based
1592	   on congestion.  But without the benefit of re-ECN, they had to add a
1593	   path attribute to BGP to advertise a route's downstream congestion
1594	   (actually they proposed that BGP should advertise the charge for
1595	   congestion, which we believe wrongly embeds an assumption into BGP
1596	   that the only thing to do with congestion is charge for it).

1598	5.6.3.  Fail-safes

1600	   The mechanisms described so far create incentives for rational
1601	   operators to behave.  That is, one operator aims to make another
1602	   behave responsibly by applying penalties and expects a rational
1603	   response (i.e. one that trades off costs against benefits).  It is
1604	   usually reasonable to assume that other network operators will behave
1605	   rationally (policy routing can avoid those that might not).  But this
1606	   approach does not protect against the misconfigurations and accidents
1607	   of other operators.

1609	   Therefore, we propose the following two mechanisms at a network's
1610	   borders to provide "defence in depth".  Both are similar:

1612	   Highly positive flows: A small sample of positive packets should be
1613	      picked randomly as they cross a border interface.  Then subsequent
1614	      packets matching the same source and destination address and DSCP
1615	      should be monitored.  If the fraction of positive marking is well
1616	      above a threshold (to be determined by operational practice), a
1617	      management alarm SHOULD be raised, and the flow MAY be
1618	      automatically subject to focused drop.

1620	   Persistently negative flows: A small sample of congestion marked
1621	      packets should be picked randomly as they cross a border
1622	      interface.  Then subsequent packets matching the same source and
1623	      destination address and DSCP should be monitored.  If the RE
1624	      blanking fraction minus the congestion marking fraction is
1625	      persistently negative, a management alarm SHOULD be raised, and
1626	      the flow MAY be automatically subject to focused drop.

1628	   Both these mechanisms rely on the fact that, when the sample is
1629	   drawn randomly solely from positive (or negative) packets, highly
1630	   positive (or negative) flows appear in it more quickly.
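
   One possible shape for these two monitors is sketched below.  It is
   an illustration only, not a specified algorithm.  The Packet and
   BorderMonitor names, the sampling probability, the minimum packet
   count and the alarm predicates are all assumptions, and "persistently
   negative" is approximated by a balance that is negative over at least
   min_packets monitored packets.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dscp: int
    worth: int  # +1 positive (RE blanked), -1 congestion marked, 0 neutral


class BorderMonitor:
    """Samples flows via packets of one worth, then watches those flows."""

    def __init__(self, trigger_worth, sample_prob, min_packets, alarm, rng=None):
        self.trigger_worth = trigger_worth
        self.sample_prob = sample_prob
        self.min_packets = min_packets  # assumed warm-up before judging
        self.alarm = alarm              # predicate(pos_fraction, neg_fraction)
        self.rng = rng or random.Random()
        self.watched = {}               # flow key -> [packets, positive, negative]

    def observe(self, pkt):
        """Return True if this packet's flow should raise a management alarm."""
        key = (pkt.src, pkt.dst, pkt.dscp)
        if key not in self.watched:
            # Only packets of the trigger worth can start monitoring a flow.
            if pkt.worth == self.trigger_worth and self.rng.random() < self.sample_prob:
                self.watched[key] = [0, 0, 0]
            return False
        counts = self.watched[key]
        counts[0] += 1
        if pkt.worth > 0:
            counts[1] += 1
        elif pkt.worth < 0:
            counts[2] += 1
        if counts[0] < self.min_packets:
            return False
        return self.alarm(counts[1] / counts[0], counts[2] / counts[0])


POSITIVE_THRESHOLD = 0.5  # placeholder; to be set by operational practice

highly_positive = BorderMonitor(
    trigger_worth=+1, sample_prob=0.01, min_packets=100,
    alarm=lambda pos, neg: pos > POSITIVE_THRESHOLD)

persistently_negative = BorderMonitor(
    trigger_worth=-1, sample_prob=0.01, min_packets=100,
    alarm=lambda pos, neg: pos - neg < 0)
```

   Each packet crossing the border interface would be passed to both
   monitors' observe() method.  Whether an alarm leads to focused drop
   is left to the operator, as in the text above.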

1632	   Note that there is no assumption that /users/ behave rationally.  The
1633	   system is protected from the vagaries of irrational user behaviour by
1634	   the ingress gateways, which transform internal penalties into a
1635	   deterministic, admission control mechanism that prevents users from
1636	   misbehaving, by directly engineered means.

1638	6.  Analysis

1640	   The domains in Figure 1 are not expected to be completely malicious
1641	   towards each other.  After all, we can assume that they are all co-
1642	   operating to provide an internetworking service to the benefit of
1643	   each of them and their customers.  Otherwise their routing policies
1644	   would not interconnect them in the first place.  However, we assume
1645	   that they are also competitors of each other.  So a network may try
1646	   to contravene our proposed protocol if it would gain or make a
1647	   competitor lose, or both, but only if it can do so without being
1648	   caught.  Therefore we do not have to consider every possible random
1649	   attack one network could launch on the traffic of another, given
1650	   that one network can always drop or corrupt packets that it
1651	   forwards on behalf of another anyway.

1653	   Therefore, we only consider new opportunities for /gainful/ attack
1654	   that our proposal introduces.  But to a certain extent we can also
1655	   rely on the in-depth defences we have described (Section 5.6.3),
1656	   which are intended to mitigate the potential impact of one network
1657	   accidentally misconfiguring the workings of this protocol.

1659	   The ingress and egress gateways are shown in the most generic
1660	   arrangement possible in Figure 1, without any surrounding network.
1661	   This allows us to consider more specific cases where these gateways
1662	   and a neighbouring network are operated by the same player.  As well
1663	   as cases where the same player operates neighbouring networks, we
1664	   will also consider cases where the two gateways collude as one player
1665	   and where the sender and receiver collude as one.  Collusion of other
1666	   sets of domains is less likely, but we will consider such cases.  In
1667	   the general case, we will assume none of the nine trust domains
1668	   across the figure fully trust any of the others.

1670	   As we only propose to change routers within the Diffserv region, we
1671	   assume the operators of networks outside the region will be doing
1672	   per-flow policing.  That is, we assume the networks outside the
1673	   Diffserv region and the gateways around its edges can protect
1674	   themselves.  So given we are proposing to remove flow policing from
1675	   some networks, our primary concern must be to protect networks that
1676	   don't do per-flow policing (the potential `victims') from those that
1677	   do (the `enemy').  The ingress and egress gateways are the only way
1678	   the outer enemy can get at the middle victim, so we can consider the
1679	   gateways as the representatives of the enemy as far as domains A, B
1680	   and C are concerned.  We will call this trust scenario `edges against
1681	   middles'.

1683	   Earlier in this memo, we outlined the classic border rate policing
1684	   problem (Section 3).  It will now be useful to reiterate the
1685	   motivations that are the root cause of the problem.  The more
1686	   reservations a gateway can allow, the more revenue it receives.  The
1687	   middle networks want the edges to comply with the admission control
1688	   protocol when they become so congested that their service to others
1689	   might suffer.  The middle networks also want to ensure the edges
1690	   cannot steal more service from them than they are entitled to.

1692	   In the context of this `edges against middles' scenario, the re-ECN
1693	   protocol has two main effects:

1695	   o  The more pre-congestion there is on a path across the Diffserv
1696	      region, the higher the ingress gateway must declare downstream
1697	      pre-congestion.

1699	   o  If the ingress gateway does not declare downstream pre-congestion
1700	      high enough on average, it will `hit the ground before the
1701	      runway', going negative and triggering sanctions, either directly
1702	      against the traffic or against the ingress gateway at a management
1703	      level.
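
   The two effects above can be shown with a small worked example.  It
   is a sketch under stated assumptions, not the protocol's arithmetic:
   marking is expressed in whole marks per 1,000 packets, marking
   fractions are treated as simply additive along the path (reasonable
   only when they are small), and the function names are hypothetical.

```python
def border_balances(declared, marking_by_domain):
    """Downstream pre-congestion metric seen as traffic leaves each domain.

    declared: marks per 1,000 packets declared by the ingress gateway.
    marking_by_domain: pre-congestion marks per 1,000 packets added by
    each domain in path order (e.g. domains A, B and C).
    """
    balances = []
    remaining = declared
    for marking in marking_by_domain:
        remaining -= marking
        balances.append(remaining)
    return balances


def first_negative(balances):
    """Index of the first domain after which the metric is negative, or None."""
    for index, balance in enumerate(balances):
        if balance < 0:
            return index
    return None


path_marking = [10, 20, 10]  # hypothetical marking in domains A, B, C

truthful = border_balances(40, path_marking)   # declares the whole path
too_low = border_balances(20, path_marking)    # under-declares

print(truthful, first_negative(truthful))
print(too_low, first_negative(too_low))
```

   With a truthful declaration the metric reaches zero exactly at the
   egress.  With the under-declaration it goes negative inside the
   second domain, which is the point at which sanctions could be
   triggered.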

1705	   An executive summary of our security analysis can be stated in three
1706	   parts, distinguished by the type of collusion considered.

1708	   Neighbour-only Middle-Middle Collusion: Here there is no collusion or
1709	      collusion is limited to neighbours in the feedback loop.  In other
1710	      words, two neighbouring networks can be assumed to act as one.  Or
1711	      the egress gateway might collude with domain C.  Or the ingress
1712	      gateway might collude with domain A.  Or ingress and egress
1713	      gateways might collude with each other.

1715	      In these cases where only neighbours in the feedback loop collude,
1716	      we conclude that all parties have a positive incentive to declare
1717	      downstream pre-congestion truthfully, and the ingress gateway has
1718	      a positive incentive to invoke admission control when congestion
1719	      rises above the admission threshold in any network in the region
1720	      (including its own).  No party has an incentive to send more
1721	      traffic than declared in reservation signalling (even though only
1722	      the gateways read this signalling).  In short, no party can gain
1723	      at the expense of another.

1725	   Non-neighbour Middle-Middle Collusion: In the case of other forms of
1726	      collusion between middle networks (e.g. between domain A and C) it
1727	      would be possible for say A & C to create a tunnel between
1728	      themselves so that A would gain at the expense of B.  But C would
1729	      then lose the gain that A had made.  Therefore the value to A & C
1730	      of colluding to mount this attack seems questionable.  It is made
1731	      more questionable, because the attack can be statistically
1732	      detected by B using the second `defence in depth' mechanism
1733	      mentioned already.  Note that C can defend itself from being
1734	      attacked through a tunnel by treating the tunnel end point as a
1735	      direct link to a neighbouring network (e.g. as if A were a
1736	      neighbour of C, via the tunnel), which falls back to the safety of
1737	      the neighbour-only scenario.

1739	   Middle-Edge Collusion: Collusion between networks or gateways within
1740	      the Diffserv region and networks or users outside the region has
1741	      not yet been fully analysed.  The presence of full per-flow
1742	      policing at the ingress gateway seems to make this a less likely
1743	      source of a successful attack.

1745	   {ToDo: Due to lack of time, the full write up of the security
1746	   analysis is deferred to the next version of this memo.}

1748	   Finally, it is well known that the best person to analyse the
1749	   security of a system is not the designer.  Therefore, our confident
1750	   claims must be hedged with doubt until others with perhaps a greater
1751	   incentive to break it have mounted a full analysis.

1753	7.  Incremental Deployment

1755	   We believe ECN has so far not been widely deployed because it
1756	   requires widespread end system and network deployment just to achieve
1757	   a marginal improvement in performance.  The ability to offer a new
1758	   service (admission control) would be a much stronger driver for ECN
1759	   deployment.

1761	   As stated in the introduction, the aim of this memo is to "build in
1762	   security from the start" when admission control is based on pre-
1763	   congestion notification.  However, the proposal has been designed so
1764	   that security can be added some time after first deployment.  Given
1765	   admission control based on pre-congestion notification requires few
1766	   changes to standards, it should be deployable fairly soon.  However,
1767	   re-ECN requires a change to IP, which may take a little longer.

1769	   We expect that initial deployments of PCN-based admission control
1770	   will be confined to single networks, or to clubs of networks that
1771	   trust each other.  The proposal in this memo will only become
1772	   relevant once networks with conflicting interests wish to
1773	   interconnect their admission controlled services, but without the
1774	   scalability constraints of per-flow border policing.  It will not be
1775	   possible to use re-ECN, even in a controlled environment between
1776	   consenting operators, unless it is standardised into IP.  Given the
1777	   IPv4 header has limited space for further changes, current IESG
1778	   policy [{ToDo: ref?}] is not to allow experimental use of codepoints
1779	   in the IPv4 header, as whenever an experiment isn't taken up, the
1780	   space it used tends to be impossible to reclaim.

1782	   If PCN-based admission control is deployed before re-ECN is
1783	   standardised into IP, wherever a network (or club of networks)
1784	   connects to another network (or club of networks) with conflicting
1785	   interests, they will place a gateway between the two regions that
1786	   does per-flow rate policing and admission control.  If re-ECN is
1787	   eventually standardised into IP, it will be possible for these
1788	   separate regions to upgrade all their gateways to use re-ECN before
1789	   removing the per-flow policing gateways between them.  Given the
1790	   edge-to-edge deployment model of PCN-based admission control, it is
1791	   reasonable to imagine this incremental deployment model without
1792	   needing to cater for partial deployment of re-ECN in just some of the
1793	   gateways around one Diffserv region.

1795	   Only the edge gateways around a Diffserv region have to be upgraded
1796	   to add re-ECN support, not interior routers.  It is also necessary to
1797	   add the mechanisms that use re-ECN to secure a network against
1798	   misbehaving gateways and networks.  Specifically, these are the
1799	   border mechanisms (Section 5.6) and the mechanisms to sanction
1800	   dishonest marking (Section 5.5).

1802	   We also RECOMMEND adding improvements to forwarding on interior
1803	   routers (Section 4.3.4).  But the system works whether all, some or
1804	   none are upgraded, so interior routers may be upgraded in a piecemeal
1805	   fashion at any time.

1807	8.  Design Choices and Rationale

1809	   The primary insight of this work is that downstream congestion is the
1810	   metric that would be most useful to control an internetwork, and
1811	   particularly to police how one network responds to the congestion it
1812	   causes in a remote network.  This is the problem that has previously
1813	   made it so hard to provide scalable admission control.

1815	   The case for using re-feedback (a generalisation of re-ECN) to police
1816	   congestion response and provide QoS is made in [Re-fb].  Essentially,
1817	   the insight is that congestion is a factor that crosses layers from
1818	   the physical upwards.  Therefore re-feedback polices congestion where
1819	   it emerges from a physical interface between networks.  This is
1820	   achieved by bringing the congestion information to the interface,
1821	   rather than examining packet addressing where there is congestion.
1822	   Then congestion crossing the physical interface at a border can be
1823	   policed at the interface, rather than policing the congestion on
1824	   packets that claim to come from an address (which may be spoofed).
1825	   Also, re-feedback works in the network layer independently of other
1826	   layers---despite its name re-feedback does not actually require
1827	   feedback.  It requires a source to act conservatively before it gets
1828	   feedback.

1830	   On the subject of lack of feedback, the feedback not established
1831	   (FNE) codepoint is motivated by arguments for a state set-up bit in
1832	   IP to prevent state exhaustion attacks.  This idea was first put
1833	   forward informally by David Clark and documented by Handley and
1834	   Greenhalgh in [Steps_DoS].  The idea is that network layer datagrams
1835	   should signal explicitly when they require state to be created in the
1836	   network layer or the layer above (e.g. at flow start).  Then a node
1837	   can refuse to create any state unless a datagram declares this
1838	   intent.  We believe the proposed FNE codepoint serves the same
1839	   purpose as the proposed state-set-up bit, but it has been overloaded
1840	   with a more specific purpose, using it on more packets than just the
1841	   first in a flow, but never fewer (i.e. it is idempotent).  In effect
1842	   the FNE codepoint serves the purpose of a `soft-state set-up
1843	   codepoint'.

1845	   The re-feedback paper [Re-fb] also makes the case for converting the
1846	   economic interpretation of congestion into hard engineering
1847	   mechanisms, which is the basis of the approach used in this memo.  The
1848	   admission control gateways around the Diffserv region use hard
1849	   engineering, not incentives, to prevent end users from sending more
1850	   traffic than they have reserved.  Incentive-based mechanisms are only
1851	   used between networks, because they are expected to respond to
1852	   incentives more rationally than end-users can be expected to.
1853	   However, even then, a network can use fail-safes to protect itself
1854	   from excessively unusual behaviour by neighbouring networks, whether
1855	   due to an accidental misconfiguration or malicious intent.

1857	   The guiding principle behind the incentive-based approach used
1858	   between networks is that any gain from subverting the protocol should
1859	   be precisely neutralised, rather than punished.  If a gain is
1860	   punished to a greater extent than is sufficient to neutralise it, it
1861	   will most likely open up a new vulnerability, where the amplifying
1862	   effect of the punishment mechanism can be turned on others.

1864	   The re-feedback paper also makes the case against the use of
1865	   congestion charging to police congestion if it is based on classic
1866	   feedback (where only upstream congestion is visible to network
1867	   elements).  It argues this would open up receiving networks to
1868	   `denial of funds' attacks and would require end users to accept
1869	   dynamic pricing (which few would).

1871	   Re-ECN has been deliberately designed to simplify policing at the
1872	   borders between networks.  These trust boundaries are the critical
1873	   pinch-points that will limit the scalability of the whole
1874	   internetwork unless the overall design minimises the complexity of
1875	   security functions at these borders.  The border mechanisms described
1876	   in this memo run passively in parallel to data forwarding and they do
1877	   not require per-flow processing.

1879	9.  Security Considerations

1881	   This whole memo concerns the security of a scalable admission control
1882	   system, the analysis section in particular.  Below, some specific
1883	   security issues are mentioned that did not belong elsewhere or which
1884	   comment on the overall robustness of the security provided by the
1885	   design.

1887	   Firstly, we must repeat the statement of applicability in the
1888	   analysis: that we only consider new opportunities for /gainful/
1889	   attack that our proposal introduces, particularly if the attacker can
1890	   avoid being identified.  Despite only involving a few bits, there is
1891	   sufficient complexity in the whole system that there are probably
1892	   numerous possibilities for other attacks.  However, as far as we are
1893	   aware, none yields any benefit to the attacker.  For instance, it would
1894	   be possible for a downstream network to remove the congestion
1895	   markings introduced by an upstream network, but it would only lose
1896	   out on the penalties it could apply to a downstream network.

1898	   When one network forwards a neighbouring network's traffic it will
1899	   always be possible to cause damage by dropping or corrupting it.
1900	   Therefore we do not believe networks would set their routing policies
1901	   to interconnect in the first place if they didn't trust the other
1902	   networks not to arbitrarily damage their traffic.

1904	   Having said this, we do want to highlight some of the weaker parts of
1905	   our argument.  We have argued that networks will be dissuaded from
1906	   faking congestion marking by the possibility that upstream networks
1907	   will route round them.  As we have said, these arguments are based on
1908	   fairly delicate assumptions and will remain fairly tenuous until
1909	   proved in practice, particularly close to the egress where less
1910	   competitive routing is likely.

1912	   We should also point out that the approach in this memo was only
1913	   designed to be robust for admission control.  We do not claim the
1914	   incentives will always be strong enough to force correct flow pre-
1915	   emption behaviour.  This is because a user will tend to perceive much
1916	   greater loss in value if a flow is pre-empted than if admission is
1917	   denied at the start.  However, in general the incentives for correct
1918	   flow pre-emption are similar to those for admission control.

1920	   Finally, it may seem that the 8 codepoints that have been made
1921	   available by extending the ECN field with the RE flag have been used
1922	   rather wastefully.  In effect the RE flag has been used as an
1923	   orthogonal single bit in nearly all cases, the only exception being
1924	   when the ECN field is cleared to "00".  The mapping of the codepoints
1925	   in an earlier version of this proposal used the codepoint space more
1926	   efficiently, but the scheme became vulnerable to a network operator
1927	   focusing its congestion marking to mark more positive than neutral
1928	   packets in order to reduce its penalties.

1930	   With the scheme as now proposed, once the RE flag is set or cleared
1931	   by the sender or its proxy, it should not be written by the network,
1932	   only read.  So the gateways can detect if any network maliciously
1933	   alters the RE flag.  IPSec AH integrity checking does not cover the
1934	   IPv4 option flags (they were considered mutable---even the one we
1935	   propose using for the RE flag that was `currently unused' when IPSec
1936	   was defined).  But it would be sufficient for a pair of gateways to
1937	   make random checks on whether the RE flag was the same when it
1938	   reached the egress gateway as when it left the ingress.  Indeed, if
1939	   IPSec AH had covered the RE flag, any network intending to alter
1940	   sufficient RE flags to make a gain would have focused its alterations
1941	   on packets without authenticating headers (AHs).

1943	   No cryptographic algorithms have been harmed in the making of this
1944	   proposal.

1946	10.  IANA Considerations

1948	   This memo includes no request to IANA.

1950	11.  Conclusions

1952	   This memo builds on a promising technique to solve the classic
1953	   problem of making flow admission control scale to any size network.
1954	   It involves the use of Diffserv in a deployment model that uses pre-
1955	   congestion notification feedback to control admission into a network
1956	   path [CL-deploy].  However as it stands, that deployment model
1957	   depends on all network domains trusting each other to comply with the
1958	   protocols, invoking admission control and flow pre-emption when
1959	   requested.

1961	   We propose that the congestion feedback used in that deployment model
1962	   should be re-echoed into the forward data path, by making a trivial
1963	   modification to the ingress gateway.  We then explain how the
1964	   resulting downstream pre-congestion metric in packets can be
1965	   monitored in bulk at borders to sufficiently emulate flow rate
1966	   policing.

1968	   We claim the result of combining these two approaches is an admission
1969	   control system that scales to any size network /and/ any number of
1970	   interconnected networks, even if they all act in their own interests.

1972	   This proposal aims to convince its readers to "Design in Security
1973	   from the start," by building modified ingress gateways from day one,
1974	   even if border policing is not needed at first.  This way, we will
1975	   not build ourselves tomorrow's legacy problem.

1977	   Re-echoing congestion feedback is based on a principled technique
1978	   called Re-ECN [Re-TCP], designed to add accountability for causing
1979	   congestion to the general-purpose IP datagram service.  Re-ECN
1980	   proposes to consume the last completely unused bit in the basic IPv4
1981	   header.

1983	12.  Acknowledgements

1985	   All the following have given helpful comments and some may become co-
1986	   authors of later drafts: Arnaud Jacquet, Alessandro Salvatori, Steve
1987	   Rudkin, David Songhurst, John Davey, Ian Self, Anthony Sheppard,
1988	   Carla Di Cairano-Gilfedder (BT), Mark Handley (who identified the
1989	   excess canceled packets attack), Stephen Hailes, Adam Greenhalgh
1990	   (UCL), Francois Le Faucheur, Anna Charny (Cisco), Jozef Babiarz,
1991	   Kwok-Ho Chan, Corey Alexander (Nortel), David Clark, Bill Lehr,
1992	   Sharon Gillett, Steve Bauer (MIT) (who publicised various dummy
1993	   traffic attacks), Sally Floyd (ICIR) and comments from participants
1994	   in the CFP/CRN inter-provider QoS and broadband working groups.

1996	13.  Comments Solicited

1998	   Comments and questions are encouraged and very welcome.  They can be
1999	   addressed to the IETF Transport Area working group's mailing list
2000	   <tsvwg@ietf.org>, and/or to the authors.

2002	14.  References

2004	14.1.  Normative References

2006	   [PCN]      Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
2007	              Charny, A., Liatsos, V., Babiarz, J., Chan, K., Dudley,
2008	              S., Westberg, L., Bader, A., and G. Karagiannis, "Pre-
2009	              Congestion Notification Marking",
2010	              draft-briscoe-tsvwg-cl-phb-02 (work in progress),
2011	              June 2006.

2013	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2014	              Requirement Levels", BCP 14, RFC 2119, March 1997.

2016	   [RFC2211]  Wroclawski, J., "Specification of the Controlled-Load
2017	              Network Element Service", RFC 2211, September 1997.

2019	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
2020	              of Explicit Congestion Notification (ECN) to IP",
2021	              RFC 3168, September 2001.

2023	   [RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
2024	              J., Courtney, W., Davari, S., Firoiu, V., and D.
2025	              Stiliadis, "An Expedited Forwarding PHB (Per-Hop
2026	              Behavior)", RFC 3246, March 2002.

2028	   [RSVP-ECN]
2029	              Le Faucheur, F., Charny, A., Briscoe, B., Eardley, P.,
2030	              Babiarz, J., and K. Chan, "RSVP Extensions for Admission
2031	              Control over Diffserv using Pre-congestion Notification",
2032	              draft-lefaucheur-rsvp-ecn-01 (work in progress),
2033	              June 2006.

2035	   [Re-TCP]   Briscoe, B., Jacquet, A., and A. Salvatori, "Re-ECN:
2036	              Adding Accountability for Causing Congestion to TCP/IP",
2037	              draft-briscoe-tsvwg-re-ecn-tcp-02 (work in progress),
2038	              June 2006.

2040	14.2.  Informative References

2042	   [CL-deploy]
2043	              Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F.,
2044	              Charny, A., Babiarz, J., Chan, K., Westberg, L., Bader,
2045	              A., and G. Karagiannis, "A Deployment Model for Admission
2046	              Control over DiffServ using Pre-Congestion Notification",
2047	              draft-briscoe-tsvwg-cl-architecture-03 (work in progress),
2048	              June 2006.

2050	   [CLoop_pol]
2051	              Salvatori, A., "Closed Loop Traffic Policing", Politecnico
2052	              Torino and Institut Eurecom Masters Thesis ,
2053	              September 2005.

2055	   [ECN-BGP]  Mortier, R. and I. Pratt, "Incentive Based Inter-Domain
2056	              Routeing", Proc Internet Charging and QoS Technology
2057	              Workshop (ICQT'03) pp308--317, September 2003,
2058	              <http://research.microsoft.com/users/mort/publications.aspx>.

2060	   [ECN-MPLS]
2061	              Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion
2062	              Marking in MPLS", draft-davie-ecn-mpls-00 (work in
2063	              progress), June 2006.

2065	   [IXQoS]    Briscoe, B. and S. Rudkin, "Commercial Models for IP
2066	              Quality of Service Interconnect", BT Technology Journal
2067	              (BTTJ) 23(2)171--195, April 2005,
2068	              <http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#ixqos>.

2070	   [NSIS-RMD]
2071	              Bader, A., Westberg, L., Karagiannis, G., Kappler, C., and
2072	              T. Phelan, "RMD-QOSM - The Resource Management in Diffserv
2073	              QOS Model", draft-ietf-nsis-rmd-06 (work in progress),
2074	              February 2006.

2076	   [RFC2205]  Braden, B., Zhang, L., Berson, S., Herzog, S., and S.
2078	              Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1
2079	              Functional Specification", RFC 2205, September 1997.

2081	   [RFC2207]  Berger, L. and T. O'Malley, "RSVP Extensions for IPSEC
2082	              Data Flows", RFC 2207, September 1997.

2084	   [RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
2085	              M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
2086	              ReSerVation Protocol (RSVP) Version 1 Applicability
2087	              Statement Some Guidelines on Deployment", RFC 2208,
2088	              September 1997.

2090	   [RFC2747]  Baker, F., Lindell, B., and M. Talwar, "RSVP Cryptographic
2091	              Authentication", RFC 2747, January 2000.

2093	   [RFC2998]  Bernet, Y., Ford, P., Yavatkar, R., Baker, F., Zhang, L.,
2094	              Speer, M., Braden, R., Davie, B., Wroclawski, J., and E.
2095	              Felstaine, "A Framework for Integrated Services Operation
2096	              over Diffserv Networks", RFC 2998, November 2000.

2098	   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
2099	              Congestion Notification (ECN) Signaling with Nonces",
2100	              RFC 3540, June 2003.

2102	   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
2103	              Salvatori, A., Soppera, A., and M. Koyabe, "Policing
2104	              Congestion Response in an Internetwork Using Re-Feedback",
2105	              ACM SIGCOMM CCR 35(4)277--288, August 2005, <http://
2106	              www.acm.org/sigs/sigcomm/sigcomm2005/
2107	              techprog.html#session8>.

2109	   [Smart_rtg]
2110	              Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
2111	              "Optimizing Cost and Performance for Multihoming", ACM
2112	              SIGCOMM CCR 34(4)79--92, October 2004,
2113	              <http://citeseer.ist.psu.edu/698472.html>.

2115	   [Steps_DoS]
2116	              Handley, M. and A. Greenhalgh, "Steps towards a DoS-
2117	              resistant Internet Architecture", Proc. ACM SIGCOMM
2118	              workshop on Future directions in network architecture
2119	              (FDNA'04) pp 49--56, August 2004.

2121	Appendix A.  Implementation
2122	A.1.  Ingress Gateway Algorithm for Blanking the RE flag

2124	   The ingress gateway receives regular feedback reporting the fraction
2125	   of congestion-marked octets for each aggregate arriving at the
2126	   egress.  So for each aggregate it should blank the RE flag on the
2127	   same fraction of octets.  It is more efficient to calculate the
2128	   reciprocal of this fraction when the signalling arrives, Z_0 = (1 /
2129	   Congestion-Level-Estimate).  Z_0 will be the number of octets of
2130	   packets the ingress should send with the RE flag set between those it
2131	   sends with the RE flag blanked.  Z_0 will also take account of the
2132	   sustainable rate reported during the flow pre-emption process, if
2133	   necessary.

2135	   A suitable pseudo-code algorithm for the ingress gateway is as
2136	   follows:

2138	   ====================================================================
2139	   B_i = 0                 /* interblank volume                     */
2140	   for each PCN-capable packet {
2141	       b = readLength()    /* set b to packet size                  */
2142	       B_i += b            /* accumulate interblank volume          */
2143	       if B_i < b * Z_0 {  /* test whether interblank volume...     */
2144	           writeRE(1)
2145	       } else {            /* ...exceeds blank RE spacing * pkt size*/
2146	           writeRE(0)      /* ...and if so, clear RE                */
2147	           B_i = 0         /* ...and re-set interblank volume       */
2148	       }
2149	   }
2150	   ====================================================================
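
   As an illustration only, the pseudo-code above could be rendered in
   Python roughly as follows.  The function name blank_re_flags, the
   representation of packets as a list of sizes in octets, and the
   returned list of RE flag values are assumptions made for this sketch;
   a real gateway would read and write packet headers instead.

```python
def blank_re_flags(packet_sizes, cle):
    """Return the RE flag (1 = set, 0 = blanked) chosen for each packet.

    packet_sizes: sizes in octets of successive PCN-capable packets
                  (hypothetical input standing in for readLength())
    cle:          Congestion-Level-Estimate from the egress, 0 < cle <= 1
    """
    z_0 = 1.0 / cle          # blank RE spacing, computed once per feedback
    b_i = 0                  # interblank volume
    flags = []
    for b in packet_sizes:
        b_i += b             # accumulate interblank volume
        if b_i < b * z_0:    # spacing not yet reached...
            flags.append(1)  # ...so send with RE set
        else:
            flags.append(0)  # blank RE...
            b_i = 0          # ...and re-set interblank volume
    return flags
```

   With equal-sized packets and a Congestion-Level-Estimate of 25%,
   this sketch blanks RE on every fourth packet.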

2152	A.2.  Downstream Congestion Metering Algorithms

2154	A.2.1.  Bulk Downstream Congestion Metering Algorithm

2156	   To meter the bulk amount of downstream pre-congestion in traffic
2157	   crossing an inter-domain border, an algorithm is needed that
2158	   accumulates the size of positive packets and subtracts the size of
2159	   negative packets.  We maintain two counters:

2161	      V_b: accumulated pre-congestion volume

2163	      B: total data volume (in case it is needed)

2165	   A suitable pseudo-code algorithm for a border router is as follows:

2167	   ====================================================================
2168	   V_b = 0
2169	   B   = 0
2170	   for each PCN-capable packet {
2171	       b = readLength(packet)      /* set b to packet size          */
2172	       B += b                      /* accumulate total volume       */
2173	       if readEECN(packet) in {Re-Echo, FNE} {
2174	           V_b += b                /* increment...                  */
2175	       } elseif readEECN(packet) in {AM(-1), PM(-1)} {
2176	           V_b -= b                /* ...or decrement V_b...        */
2177	       }                           /*...depending on EECN field     */
2178	   }
2179	   ====================================================================

2181	   At the end of an accounting period this counter V_b represents the
2182	   pre-congestion volume that penalties could be applied to, as
2183	   described in Section 5.3.

2185	   For instance, accumulated volume of pre-congestion through a border
2186	   interface over a month might be V_b = 5PB (petabyte = 10^15 byte).
2187	   This might have resulted from an average downstream pre-congestion
2188	   level of 1% on an accumulated total data volume of B = 500PB.
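
   The bulk meter above might be sketched in Python as follows.  The
   codepoint labels follow the pseudo-code, but representing each packet
   as a (size, codepoint) tuple, the label "other" for any remaining
   codepoint, and the function name meter_bulk are assumptions made
   purely for illustration.

```python
POSITIVE = {"Re-Echo", "FNE"}     # codepoints that increment V_b
NEGATIVE = {"AM(-1)", "PM(-1)"}   # codepoints that decrement V_b

def meter_bulk(packets):
    """Return (V_b, B) for an iterable of (size_in_octets, eecn) tuples."""
    v_b = 0       # accumulated pre-congestion volume
    b_total = 0   # total data volume
    for size, eecn in packets:
        b_total += size
        if eecn in POSITIVE:
            v_b += size
        elif eecn in NEGATIVE:
            v_b -= size
        # any other codepoint leaves V_b unchanged
    return v_b, b_total

# Hypothetical traffic: one positive, one unmarked, one negative, one FNE.
v_b, b_total = meter_bulk([(1500, "Re-Echo"), (1500, "other"),
                           (1500, "AM(-1)"), (500, "FNE")])
```

   Dividing V_b by B gives the average downstream pre-congestion level
   over the period, as in the 5PB / 500PB = 1% example above.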

2190	A.2.2.  Inflation Factor for Persistently Negative Flows

2192	   The following process is suggested to complement the simple algorithm
2193	   above in order to protect against the various attacks from
2194	   persistently negative flows described in Section 5.6.1.  As explained
2195	   in that section, the most important and first step is to estimate the
2196	   contribution of persistently negative flows to the bulk volume of
2197	   downstream pre-congestion and to inflate this bulk volume as if these
2198	   flows weren't there.  The process below has been designed to give an
2199	   unbiased estimate, but it may be possible to define other processes
2200	   that achieve similar ends.

2202	   While the above simple metering algorithm is counting the bulk of
2203	   traffic over an accounting period, the meter should also select a
2204	   subset of the whole flow ID space that is small enough to be able to
2205	   realistically measure but large enough to give a realistic sample.
2206	   Many different samples of different subsets of the ID space should be
2207	   taken at different times during the accounting period, preferably
2208	   covering the whole ID space.  During each sample, the meter should
2209	   count the volume of positive packets and subtract the volume of
2210	   negative, maintaining a separate account for each flow in the sample.
2211	   Each sample should run far longer than the large majority of flows,
2212	   to avoid a bias from missing the starts and ends of flows, which tend
2213	   to be positive and negative respectively.

2215	   Once the accounting period finishes, the meter should calculate the
2216	   total of the accounts V_{bI} for the subset of flows I in the sample,
2217	   and the total of the accounts V_{fI} excluding flows with a negative
2218	   account from the subset I.  Then the weighted mean of all these
2219	   samples should be taken: a_S = sum_{forall I} V_{fI} / sum_{forall I}
2220	   V_{bI}.

2222	   If V_b is the result of the bulk accounting algorithm over the
2223	   accounting period (Appendix A.2.1) it can be inflated by this factor
2224	   a_S to get a good unbiased estimate of the volume of downstream
2225	   congestion over the accounting period, a_S * V_b, without being
2226	   polluted by the effect of persistently negative flows.
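
   A minimal sketch of the inflation factor calculation, under stated
   assumptions, is given below in Python.  The function name
   inflation_factor and the input layout (one dict per sample, mapping
   a flow ID to that flow's account) are hypothetical.  The text does
   not say what to do when the sampled bulk total is not positive, so
   raising an error in that case is also an assumption.

```python
def inflation_factor(samples):
    """Estimate a_S from the per-flow accounts gathered in each sample.

    samples: list of dicts, one per sample I, mapping a flow ID to that
             flow's account (positive minus negative volume, in octets)
    """
    v_f = 0   # sum over samples of V_{fI}: non-negative accounts only
    v_b = 0   # sum over samples of V_{bI}: all sampled accounts
    for accounts in samples:
        v_b += sum(accounts.values())
        v_f += sum(a for a in accounts.values() if a >= 0)
    if v_b <= 0:
        # Assumption: a_S is treated as undefined in this case.
        raise ValueError("sampled bulk volume is not positive")
    return v_f / v_b

# Two hypothetical samples; flow "f2" is persistently negative.
a_s = inflation_factor([{"f1": 300, "f2": -100}, {"f3": 200, "f4": 0}])
inflated = a_s * 4000   # applied to a hypothetical bulk V_b of 4000 octets
```

   Here the sampled totals are 500 and 400 octets, so a_S = 1.25 and
   the bulk figure of 4000 octets is inflated to 5000.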

2228	A.3.  Algorithm for Sanctioning Negative Traffic

2230	   {ToDo: Write up algorithms similar to Appendix D of [Re-TCP] for the
2231	   negative flow monitor with flow management algorithm and the variant
2232	   with bounded flow state.}

2234	Author's Address

2236	   Bob Briscoe
2237	   BT & UCL
2238	   B54/77, Adastral Park
2239	   Martlesham Heath
2240	   Ipswich  IP5 3RE
2241	   UK

2243	   Phone: +44 1473 645196
2244	   Email: bob.briscoe@bt.com
2245	   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

2247	Intellectual Property Statement

2249	   The IETF takes no position regarding the validity or scope of any
2250	   Intellectual Property Rights or other rights that might be claimed to
2251	   pertain to the implementation or use of the technology described in
2252	   this document or the extent to which any license under such rights
2253	   might or might not be available; nor does it represent that it has
2254	   made any independent effort to identify any such rights.  Information
2255	   on the procedures with respect to rights in RFC documents can be
2256	   found in BCP 78 and BCP 79.

2258	   Copies of IPR disclosures made to the IETF Secretariat and any
2259	   assurances of licenses to be made available, or the result of an
2260	   attempt made to obtain a general license or permission for the use of
2261	   such proprietary rights by implementers or users of this
2262	   specification can be obtained from the IETF on-line IPR repository at
2263	   http://www.ietf.org/ipr.

2265	   The IETF invites any interested party to bring to its attention any
2266	   copyrights, patents or patent applications, or other proprietary
2267	   rights that may cover technology that may be required to implement
2268	   this standard.  Please address the information to the IETF at
2269	   ietf-ipr@ietf.org.

2271	Disclaimer of Validity

2273	   This document and the information contained herein are provided on an
2274	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2275	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
2276	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
2277	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
2278	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2279	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

2281	Copyright Statement

2283	   Copyright (C) The Internet Society (2006).  This document is subject
2284	   to the rights, licenses and restrictions contained in BCP 78, and
2285	   except as set forth therein, the authors retain all their rights.

2287	Acknowledgment

2289	   Funding for the RFC Editor function is currently provided by the
2290	   Internet Society.