PCN Working Group                                             B. Briscoe
Internet-Draft                                                  BT & UCL
Intended status: Standards Track                      September 13, 2008
Expires: March 17, 2009

        Emulating Border Flow Policing using Re-PCN on Bulk Data
                  draft-briscoe-re-pcn-border-cheat-02

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on March 17, 2009.

Abstract

   Scaling per-flow admission control to the Internet is a hard
   problem.  The approach of combining Diffserv and pre-congestion
   notification (PCN) provides a service slightly better than Intserv
   controlled load that scales to networks of any size without needing
   Diffserv's usual overprovisioning, but only if domains trust each
   other to comply with admission control and rate policing.  This
   memo claims to solve this trust problem without losing scalability.
   It provides a sufficient emulation of per-flow policing at borders,
   but with only passive bulk metering rather than per-flow
   processing.  The measurements are sufficient to apply penalties
   against cheating neighbour networks.
Table of Contents

   1.  Introduction
   2.  Requirements Notation
   3.  The Problem
     3.1.  The Traditional Per-flow Policing Problem
     3.2.  Generic Scenario
   4.  Re-ECN Protocol in IP with Two Congestion Marking Levels
     4.1.  Protocol Overview
     4.2.  Re-PCN Abstracted Network Layer Wire Protocol (IPv4 or v6)
       4.2.1.  Re-ECN Recap
       4.2.2.  Re-ECN Combined with Pre-Congestion Notification
               (re-PCN)
     4.3.  Protocol Operation
       4.3.1.  Protocol Operation for an Established Flow
       4.3.2.  Aggregate Bootstrap
       4.3.3.  Flow Bootstrap
       4.3.4.  Router Forwarding Behaviour
       4.3.5.  Extensions
   5.  Emulating Border Policing with Re-ECN
     5.1.  Informal Terminology
     5.2.  Policing Overview
     5.3.  Pre-requisite Contractual Arrangements
     5.4.  Emulation of Per-Flow Rate Policing: Rationale and Limits
     5.5.  Sanctioning Dishonest Marking
     5.6.  Border Mechanisms
       5.6.1.  Border Accounting Mechanisms
       5.6.2.  Competitive Routing
       5.6.3.  Fail-safes
   6.  Analysis
   7.  Incremental Deployment
   8.  Design Choices and Rationale
   9.  Security Considerations
   10. IANA Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Implementation
     A.1.  Ingress Gateway Algorithm for Blanking the RE flag
     A.2.  Downstream Congestion Metering Algorithms
       A.2.1.  Bulk Downstream Congestion Metering Algorithm
       A.2.2.  Inflation Factor for Persistently Negative Flows
     A.3.  Algorithm for Sanctioning Negative Traffic

   Author's Address
   Intellectual Property and Copyright Statements

Status (to be removed by the RFC Editor)

   The IETF PCN working group is initially chartered to consider PCN
   domains only under a single trust authority.  However, after its
   initial work is complete the charter says the working group may
   re-charter to consider concatenated Diffserv domains, amongst other
   new work items.  The charter ends by stating "The details of these
   work items are outside the scope of the initial phase; but the WG
   may consider their requirements to design components that are
   sufficiently general to support such extensions in the future."
   This memo is therefore contributed to describe how PCN could be
   extended to inter-domain use.  We wanted to document the solution
   to reduce the chances that something else eats up the codepoint
   space needed before PCN re-charters to consider inter-domain
   issues.  Losing the chance to standardise this simple, scalable
   solution to the problem of inter-domain flow admission control
   would be unfortunate (an understatement), given that it took years
   to find, and even then it was very difficult to find codepoint
   space for it.

   The scheme described here (Section 4) requires the PCN ingress
   gateway to re-echo any PCN feedback it receives back into the
   forward stream of IP packets (hence we call this scheme re-PCN).
   Re-PCN works in a very similar way to the re-ECN proposal on which
   it is based [I-D.briscoe-tsvwg-re-ecn-tcp], the only difference
   being that PCN might encode three states of congestion, whereas
   ECN encodes two.  This document is written to stand alone from
   re-ECN, so that readers do not have to read
   [I-D.briscoe-tsvwg-re-ecn-tcp].

   The authors seek comments from the Internet community on whether
   combining PCN and re-ECN to create re-PCN in this way is a
   sufficient solution to the problem of scaling microflow admission
   control to the Internet as a whole.  Here we emphasise that
   scaling is not just a matter of the number of flows, but also of
   the number of security entities -- networks and users -- who may
   all have conflicting interests.

   This memo is posted as an Internet-Draft with the intent that it
   eventually be broken down into two documents: one for the
   standards track and one for informational status.  But until it
   becomes an item of IETF working group business, the whole proposal
   has been kept together to aid understanding.  Only the text of
   Section 4 of this document is intended to be normative (requiring
   standardisation).
   The rest of the sections are merely informative, describing how a
   system might be built from these protocols by the operators of an
   internetwork.  Note in particular that the policing and monitoring
   functions proposed for the trust boundaries between operators
   would not need standardisation by the IETF.  They simply represent
   one possible way that the proposed protocols could be used to
   extend the PCN architecture [I-D.ietf-pcn-architecture] to span
   multiple domains without mutual trust between the operators.

Dependencies (to be removed by the RFC Editor)

   To realise the system described, this document also depends on
   other documents chartered in the IETF Transport Area progressing
   along the standards track:

   o  Pre-congestion notification (PCN) marking on interior nodes
      [I-D.eardley-pcn-marking-behaviour], chartered for
      standardisation in the PCN w-g;

   o  The baseline encoding of pre-congestion notification in the IP
      header [I-D.moncaster-pcn-baseline-encoding], also chartered
      for standardisation in the PCN w-g;

   o  Feedback of aggregate PCN measurements by suitably extending
      the admission control signalling protocol (e.g. the RSVP
      extension [RSVP-ECN] or the NSIS extension
      [I-D.arumaithurai-nsis-pcn]).

   The baseline encoding makes no new demands on codepoint space in
   the IP header but provides just two PCN encoding states (not
   marked and marked).  The PCN architecture recognises that
   operators might want PCN marking to trigger two functions
   (admission control and flow termination) at different levels of
   pre-congestion, which seems to require three encoding states.  A
   scheme has been proposed [I-D.charny-pcn-single-marking] that can
   do both functions with just two encoding states, but simulations
   have shown it performs poorly under certain conditions that might
   be typical.
   As it seems likely that PCN might need three encoding states to be
   fully operational, we want to be sure that three encoding states
   can be extended to work inter-domain.  Therefore, we have defined
   a three-state extension encoding scheme in this document, then
   added the re-PCN scheme to it.  The three-state encoding we have
   chosen depends on standardisation of yet another document in the
   IETF Transport Area:

   o  Propagation beyond the tunnel decapsulator of any changes in
      the ECN field to ECT(0) or ECT(1) made within a tunnel (the
      ideal decapsulation rules of [I-D.briscoe-tsvwg-ecn-tunnel]).

Changes from previous drafts (to be removed by the RFC Editor)

   Full diffs of incremental changes between drafts are available at
   URL:

   Changes from to (current version):

      Considerably updated the 'Status' note to explain the
      relationship of this draft to other documents in the IETF
      process (or not) and to chartered PCN w-g activity.

      Split out the dependencies into a separate note and added
      dependencies on new PCN documents in progress.

      Made the scalability motivation in the introduction clearer,
      explaining why Diffserv over-provisioning doesn't scale unless
      PCN is used.

      Clarified that the standards action in Section 4 is to define
      the meanings of the combination of fields in the IP header:
      the RE flag and 2-level congestion marking in the ECN field;
      and that it is not characterised by a particular feedback
      style in the transport.

      Switched round the two ECT codepoints to be compatible with
      the new PCN baseline encoding and used less confusing naming
      for re-PCN codepoints (Section 4).

      Generalised rules for encoding probes when bootstrapping or
      re-starting aggregates and flows (Section 4.3.2).

      Downgraded drop sanction behaviour from MUST to conditional
      SHOULD (Section 5.5).
      Added incremental deployment safety justification for the
      choice of which way round the RE flag works (Section 7).

      Added a possible vulnerability to brief attacks, and a
      possible solution, to the security considerations (Section 9).

      Updated references and terminology, particularly taking
      account of recent new PCN w-g documents.

      Replaced the suggested Ingress Gateway Algorithm for Blanking
      the RE flag (Appendix A.1).

      Clarifications throughout.

   Changes from to :

      Updated references.

   Changes from to :

      Changed the filename to associate it with the new IETF PCN
      w-g, rather than the TSVWG w-g.

      Introduction: Clarified that bulk policing only replaces
      per-flow policing at interior inter-domain borders, while
      per-flow policing is still needed at the access interface to
      the internetwork.  Also clarified that the aim is to
      neutralise any gains from cheating using local bilateral
      contracts between neighbouring networks, rather than merely
      to identify remote cheaters.

      Section 3.1: Described the traditional per-flow policing
      problem with inter-domain reservations more precisely,
      particularly with respect to the direction of reservations
      and of traffic flows.

      Clarified the status of Section 5 onwards, in particular that
      policers and monitors would not need standardisation, but
      that the protocol in Section 4 would require standardisation.

      Section 5.6.2 on competitive routing: Added discussion of
      direct incentives for a receiver to switch to a different
      provider even if the provider has a termination monopoly.

      Clarified that "Designing in security from the start" merely
      means allowing codepoint space in the PCN protocol encoding.
      There is no need to actually implement inter-domain security
      mechanisms for solutions confined to a single domain.
      Updated some references and added a reference to the Security
      Considerations, as well as other minor corrections and
      improvements.

   Changes from to :

      Added a subsection on Border Accounting Mechanisms
      (Section 5.6.1).

      Section 4.2 on the re-ECN wire protocol clarified and
      re-organised to separately discuss re-ECN for default ECN
      marking and for pre-congestion marking (PCN).

      Router Forwarding Behaviour subsection added to the
      re-organised section on Protocol Operation (Section 4.3).
      Extensions section moved within Protocol Operation.

      Emulating Border Policing (Section 5) reorganised, starting
      with a new Terminology subsection heading and a simplified
      overview section.  Added a large new subsection on Border
      Accounting Mechanisms within a new section bringing together
      other subsections on Border Mechanisms generally
      (Section 5.6).  Some text moved from old subsections into
      these new ones.

      Added a section on Incremental Deployment (Section 7),
      drawing together relevant points about deployment made
      throughout.

      Sections on Design Rationale (Section 8) and Security
      Considerations (Section 9) expanded with some new material,
      including new attacks and their defences.

      Suggested Border Metering Algorithms improved (Appendix A.2)
      for resilience to newly identified attacks.

1.  Introduction

   The Internet community largely lost interest in the Intserv
   architecture after it was clarified that it would be unlikely to
   scale to the whole Internet [RFC2208].  Although Intserv
   mechanisms proved impractical, the bandwidth reservation service
   they aimed to offer is still very much required.

   A recently proposed approach [I-D.ietf-pcn-architecture] combines
   Diffserv and pre-congestion notification (PCN) to provide a
   service slightly better than Intserv controlled load [RFC2211].
   PCN does not require the considerable over-provisioning that is
   normally needed for admission control over Diffserv [RFC2998] to
   be robust against re-routes or variation in the traffic matrix.
   It has been proved that Diffserv's over-provisioning requirement
   grows linearly with the network diameter in hops [QoS_scale].

   A number of PCN domains can be concatenated into a larger PCN
   region without any per-flow processing between them, but only if
   each domain trusts the ingress network to have checked that
   upstream customers aren't taking more bandwidth than they
   reserved, either accidentally or deliberately.  Unfortunately,
   networks can gain considerably by breaking this trust.  One way
   for a network to protect itself against others is to handle flow
   signalling at its own border and police traffic against
   reservations itself.  However, this reintroduces the per-flow
   unscalability at borders that Intserv over Diffserv suffers from.

   This memo describes a protocol called re-PCN that enables bulk
   border measurements so that one network can protect its
   interests, even if the networks around it are deliberately trying
   to cheat.  The approach provides a sufficient emulation of flow
   rate policing at trust boundaries but without per-flow
   processing.  Per-flow rate policing for each reservation is still
   expected to be used at the access edge of the internetwork, but
   at the borders between networks bulk policing can be used to
   emulate per-flow policing.  The emulation is not perfect, but it
   is sufficient to ensure that the punishment is at least
   proportionate to the severity of the cheat.  Re-PCN requires
   neither the unscalable over-provisioning of Diffserv nor the
   per-flow processing at borders of Intserv over Diffserv.
   It should therefore scale controlled load service to the whole
   internetwork without the cost of Diffserv's linearly increasing
   over-provisioning, or the cost of per-flow policing at each
   border.  To achieve such scaling, this memo combines two recent
   proposals, both of which it briefly recaps:

   o  The pre-congestion notification (PCN) architecture
      [I-D.ietf-pcn-architecture] describes how bulk pre-congestion
      notification on routers within an edge-to-edge Diffserv region
      can emulate the precision of per-flow admission control to
      provide controlled load service without unscalable per-flow
      processing;

   o  Re-ECN: Adding Accountability to TCP/IP
      [I-D.briscoe-tsvwg-re-ecn-tcp].

   We coin the term re-PCN for the combination of PCN and re-ECN.

   The trick that addresses cheating at borders is to recognise that
   border policing is mainly necessary because cheating upstream
   networks will admit traffic when they shouldn't only as long as
   they don't directly experience the downstream congestion their
   misbehaviour can cause.  The re-ECN protocol ensures that a
   network can be made to experience the congestion it causes in
   other networks.  Re-ECN requires the sending node to declare
   expected downstream congestion in all packets, and it makes it in
   the sender's interest to declare this honestly.  At the border
   between upstream network 'A' and downstream network 'B' (say),
   both networks can monitor packets crossing the border to measure
   how much congestion 'A' is causing in 'B' and beyond.  'B' can
   then include a limit or penalty based on this metric in its
   contract with 'A'.  This is how 'A' experiences the effect of the
   congestion it causes in other networks.  'A' no longer gains by
   admitting traffic when it shouldn't, which is why we can say
   re-PCN emulates flow policing, even though it doesn't measure
   flows.
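   The border measurement just described needs no per-flow state:
   each neighbour simply counts marked bytes in bulk.  The following
   is a minimal sketch of the idea (in the spirit of the bulk
   downstream congestion metering algorithm of Appendix A.2.1); the
   marking names and class are illustrative placeholders, not the
   draft's wire encoding:

```python
# Bulk meter for one direction of a border.  Marking names are
# hypothetical labels for this sketch, not re-PCN codepoints.
POSITIVE = "re-echo"     # sender's declaration of expected congestion
NEGATIVE = "pcn-marked"  # congestion already experienced upstream

class BorderMeter:
    """Passively counts bytes per marking; keeps no per-flow state."""

    def __init__(self):
        self.positive = 0
        self.negative = 0
        self.total = 0

    def observe(self, size_bytes, marking):
        self.total += size_bytes
        if marking == POSITIVE:
            self.positive += size_bytes
        elif marking == NEGATIVE:
            self.negative += size_bytes

    def downstream_congestion(self):
        """Declared congestion minus congestion already seen upstream
        approximates the fraction expected downstream of this border."""
        if self.total == 0:
            return 0.0
        return (self.positive - self.negative) / self.total

# 'B' meters traffic arriving from 'A': if senders behind 'A' declare
# 2% congestion and 0.5% of bytes were already marked upstream, 'A'
# is accountable for roughly 1.5% congestion beyond the border.
meter = BorderMeter()
for _ in range(4):
    meter.observe(1500, POSITIVE)    # 4 of 200 packets re-echoed
meter.observe(1500, NEGATIVE)        # 1 of 200 already marked
for _ in range(195):
    meter.observe(1500, "not-marked")
```

   'B' would feed this single number, accumulated over an accounting
   period, into the limit or penalty clause of its contract with 'A'.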
   The aim is not to enable a network to _identify_ some remote
   cheating party, which would rarely be useful, given that the
   victim network would be unlikely to be able to seek redress from a
   cheater in some remote part of the world with whom no direct
   contractual relationship exists.  Rather, the aim is to ensure
   that any gain from cheating will be cancelled out by penalties
   applied to the cheating party by its local network.  Further, the
   solution ensures that each of the chain of networks between the
   cheater and the victim will lose out if it doesn't apply penalties
   to its neighbour.  Thus the solution builds on the local bilateral
   contractual relationships that already exist between neighbouring
   networks.

   Rather than the end-to-end arrangement used when re-ECN was
   specified for the TCP transport [I-D.briscoe-tsvwg-re-ecn-tcp],
   this memo specifies re-ECN in an edge-to-edge arrangement, making
   it applicable to deployment models where admission control over
   Diffserv is based on pre-congestion notification.  Also, rather
   than using a TCP transport for regular congestion feedback, this
   memo specifies re-ECN using RSVP as the transport for feedback
   [RSVP-ECN].  RSVP is used to be concrete, but a similar deployment
   model with a different transport for signalling congestion
   feedback could be used (e.g. Arumaithurai
   [I-D.arumaithurai-nsis-pcn] and RMD [I-D.ietf-nsis-rmd] both use
   NSIS).

   This memo aims to do two things: i) define how to apply the re-PCN
   protocol to the admission control over Diffserv scenario; and ii)
   explain why re-PCN sufficiently emulates border policing in that
   scenario.  Most of the memo is taken up with the second aim:
   explaining why it works.  Applying re-PCN to the scenario actually
   involves quite a trivial modification to the ingress gateway.
   That modification can be added to gateways later, so our immediate
   goal is to convince everyone to have the foresight to define the
   PCN wire protocol encoding to accommodate the extended codepoints
   defined in this document, whether or not first deployments require
   border policing.  Otherwise, when we want to add policing, we will
   have built ourselves a legacy problem.  In other words, we aim to
   convince people to "Design in security from the start."

   The body of this memo is structured as follows:

      Section 3 describes the border policing problem.  We recap the
      traditional, unscalable view of how to solve the problem, and
      we recap the admission control solution that has the
      scalability we do not want to lose when we add border policing;

      Section 4 specifies the re-PCN protocol solution in detail;

      Section 5 explains how to use the protocol to emulate border
      policing, and why it works;

      Section 6 analyses the security of the proposed solution;

      Section 8 explains the sometimes subtle rationale behind our
      design decisions;

      Section 9 comments on the overall robustness of the security
      assumptions and lists specific security issues.

   It must be emphasised that we are not evangelical about removing
   per-flow processing from borders.  Network operators may choose to
   do per-flow processing at their borders for their own reasons,
   such as to support business models that require per-flow
   accounting.  Our aim is to show that per-flow processing at
   borders is no longer _necessary_ in order to provide end-to-end
   QoS using flow admission control.  Indeed, we are absolutely
   opposed to standardisation of technology that embeds particular
   business models into the Internet.  Our aim is merely to provide a
   new useful metric (downstream congestion) at trust boundaries.
   Given the well-known significance of congestion in economics,
   operators can then use this new metric in their interconnection
   contracts if they choose.  This will enable competitive evolution
   of new business models (for examples see [IXQoS]), even for sets
   of flows running alongside another set across the same border but
   using the more traditional model that depends on more costly
   per-flow processing at each border.

2.  Requirements Notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
   in this document are to be interpreted as described in [RFC2119].

3.  The Problem

3.1.  The Traditional Per-flow Policing Problem

   If we claim to be able to emulate per-flow policing with bulk
   policing at trust boundaries, we need to know exactly what we are
   emulating.  So we will start from the traditional scenario with
   per-flow policing at trust boundaries, to explain why it has
   always been considered necessary.

   To be able to take advantage of a reservation-based service such
   as controlled load, a source-destination pair must reserve
   resources using a signalling protocol such as RSVP [RFC2205].  An
   RSVP signalling request refers to a flow of packets by its flow ID
   tuple (the filter spec [RFC2205]), or by its security parameter
   index (SPI) [RFC2207] if port numbers are hidden by IPsec
   encryption.  Other signalling protocols use similar flow
   identifiers.  But it is insufficient merely to authorise and admit
   a flow based on its identifiers, for instance merely opening a
   pin-hole for packets with identifiers that match an admitted flow
   ID, because once a flow is admitted, it cannot necessarily be
   trusted to send packets within the rate profile it requested.

   The packet rate must also be policed to keep the flow within the
   requested flow spec [RFC2205].
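   Such per-flow rate policing is classically a token bucket checked
   against each reservation's traffic parameters.  A minimal sketch
   follows; the rate and depth parameters loosely correspond to the
   token bucket of an RSVP TSpec, but the names and numbers are ours,
   invented for illustration:

```python
# Per-flow token-bucket policer of the kind a traditional border
# would keep for each admitted reservation.  Parameter names are
# illustrative, not RSVP field names.
class TokenBucketPolicer:
    def __init__(self, rate_bytes_per_s, depth_bytes):
        self.rate = rate_bytes_per_s   # token refill rate
        self.depth = depth_bytes       # maximum burst size
        self.tokens = depth_bytes      # start with a full bucket
        self.last_arrival = 0.0

    def conforms(self, now_s, pkt_bytes):
        """True if the packet fits the profile; a real policer would
        drop or re-mark non-conformant packets."""
        elapsed = now_s - self.last_arrival
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        self.last_arrival = now_s
        if pkt_bytes <= self.tokens:
            self.tokens -= pkt_bytes
            return True
        return False

# An 8 kb/s reservation (1000 bytes/s) with a 2000-byte bucket admits
# an occasional packet but rejects back-to-back large packets.
policer = TokenBucketPolicer(rate_bytes_per_s=1000, depth_bytes=2000)
```

   The cost that the rest of this section is concerned with is
   precisely that a border router must hold and update one such
   bucket, plus the associated flow classifier state, for every
   admitted flow crossing it.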
   For instance, without data rate policing, a source-destination
   pair could reserve resources for an 8 kbps audio flow but the
   source could transmit a 6 Mbps video (theft of service).  More
   subtly, the sender could generate bursts that were outside the
   requested profile.

   In traditional architectures, per-flow packet rate policing is
   expensive and unscalable but, without it, a network is vulnerable
   to such theft of service (whether malicious or accidental).
   Perhaps more importantly, if flows are allowed to send more data
   than they were permitted, the ability of admission control to give
   assurances to other flows will break.

   Just as sources cannot be trusted to keep within the requested
   flow spec, whole networks might also try to cheat.  We will now
   set up a concrete scenario to illustrate such cheats.  Imagine
   reservations for unidirectional flows through at least two
   networks: an edge network and its downstream transit provider.
   Imagine the edge network charges its retail customers per
   reservation but also has to pay its transit provider a charge per
   reservation.  Typically, the charges both for buying from the
   transit provider and for selling to the retail customer might
   depend on the duration and rate of each reservation.  The levels
   of the actual selling and buying prices are irrelevant to our
   discussion (most likely the network will sell at a higher price
   than it buys, of course).

   A cheating ingress network could systematically reduce the size of
   its retail customers' reservation signalling requests (e.g. the
   SENDER_TSPEC object in RSVP's PATH message) before forwarding them
   to its transit provider, and systematically reinstate the
   responses on the way back (e.g. the FLOWSPEC object in RSVP's RESV
   message).  It would then receive an honest income from its
   upstream retail customer but only pay for fraudulently smaller
   reservations downstream.
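   To see the scale of the incentive, here is a toy calculation of
   the gain from the TSPEC-reduction cheat just described.  All
   tariffs and rates are invented for the example, assuming simple
   rate x duration pricing at both interfaces:

```python
# Hypothetical illustration of the TSPEC-reduction cheat's gain.
# Prices, rates and the linear tariff are assumptions of this
# sketch, not figures from any real interconnect contract.
def reservation_charge(rate_kbps, duration_s, price_per_kbps_s):
    return rate_kbps * duration_s * price_per_kbps_s

honest_tspec_kbps = 64    # what the retail customer reserved and pays for
reduced_tspec_kbps = 16   # what the cheat signals to its transit provider
duration_s = 300

retail_income = reservation_charge(honest_tspec_kbps, duration_s, 0.002)
wholesale_cost = reservation_charge(reduced_tspec_kbps, duration_s, 0.001)
honest_wholesale_cost = reservation_charge(honest_tspec_kbps, duration_s,
                                           0.001)

# The cheat pockets the wholesale charge avoided on the difference
# between the honest and the reduced TSPEC, on every reservation.
extra_gain = honest_wholesale_cost - wholesale_cost
```

   The customer still receives only the reduced downstream
   reservation, so the cheat is stealing either from its customer or,
   if the traffic is sent anyway, from the transit provider's other
   flows.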
   A similar but opposite trick (increasing the TSPEC and decreasing
   the FLOWSPEC) could be perpetrated by the receiver's access
   network if the reservation was paid for by the receiver.

   Equivalently, a cheating ingress network may feed the traffic from
   a number of flows into an aggregate reservation over the transit
   network that is smaller than the total of all the flows.  Because
   of these fraud possibilities, in traditional QoS reservation
   architectures the downstream network polices traffic at each
   border.  The policer checks that the actual sent data rate of each
   flow is within the signalled reservation.

   Reservation signalling could be authenticated end to end, but this
   wouldn't prevent the aggregation cheat just described.  For this
   reason, and to avoid the need for a global PKI, signalling
   integrity is typically only protected on a hop-by-hop basis
   [RFC2747].

   A variant of the above cheat is where a router in an honest
   downstream network denies admission to a new reservation, but a
   cheating upstream network still admits the flow.  For instance,
   the networks may be using Diffserv internally, but Intserv
   admission control at their borders [RFC2998].  The cheat would
   only work if they were using bulk Diffserv traffic policing at
   their borders, perhaps to avoid the cost and complexity of Intserv
   border policing.  As far as the cheating upstream network is
   concerned, it gets the revenue from the reservation, but it
   doesn't have to pay any downstream wholesale charges, and the
   congestion is in someone else's network.  The cheating network may
   calculate that most of the flows affected by congestion in the
   downstream network aren't likely to be its own.  It may also
   calculate that the downstream router has been configured to deny
   admission to new flows in order to protect bandwidth assigned to
   other network services (e.g. enterprise VPNs).
554 So the cheating network can steal capacity from the downstream 555 operator's VPNs that are probably not actually congested. 557 All the above cheats are framed in the context of RSVP's receiver 558 confirmed reservation model, but similar cheats are possible with 559 sender-initiated and other models. 561 To summarise, in traditional reservation signalling architectures, if 562 a network cannot trust a neighbouring upstream network to rate-police 563 each reservation, it has to check for itself that the data rate fits 564 within each of the reservations it has admitted. 566 3.2. Generic Scenario 568 We will now describe a generic internetworking scenario that we will 569 use to describe and to test our bulk policing proposal. It consists 570 of a number of networks and endpoints that do not fully trust each 571 other to behave. In Section 6 we will tie down exactly what we mean 572 by partial trust, and we will consider the various combinations where 573 some networks do not trust each other and others are colluding 574 together. 
576 _ ___ _____________________________________ ___ _ 577 | | | | _|__ ______ ______ ______ _|__ | | | | 578 | | | | | | | | | | | | | | | | | | 579 | | | | | | |Inter-| |Inter-| |Inter-| | | | | | | 580 | | | | | | | ior | | ior | | ior | | | | | | | 581 | | | | | | |Domain| |Domain| |Domain| | | | | | | 582 | | | | | | | A | | B | | C | | | | | | | 583 | | | | | | | | | | | | | | | | | | 584 | | | | +----+ +-+ +-+ +-+ +-+ +-+ +-+ +----+ | | | | 585 | | | | | | |B| |B| |B| |B| |B| |B| | | | |\ | | 586 | |==| |==|Ingr|==|R| |R|==|R| |R|==|R| |R|==|Egr |==| |=>| | 587 | | | | |G/W | | | | | | | | | | | | | |G/W | | |/ | | 588 | | | | +----+ +-+ +-+ +-+ +-+ +-+ +-+ +----+ | | | | 589 | | | | | | | | | | | | | | | | | | 590 | | | | |____| |______| |______| |______| |____| | | | | 591 |_| |___| |_____________________________________| |___| |_| 593 Sx Ingress Diffserv region Egress Rx 594 End Access Access End 595 Host Network Network Host 596 <-------- edge-to-edge signalling -------> 597 (for admission control) 599 <-------------------end-to-end QoS signalling protocol-------------> 601 Figure 1: Generic Scenario (see text for explanation of terms) 603 An ingress and egress gateway (Ingr G/W and Egr G/W in Figure 1) 604 connect the interior Diffserv region to the edge access networks 605 where routers (not shown) use per-flow reservation processing. 606 Within the Diffserv region are three interior domains, 'A', 'B' and 607 'C', as well as the inward facing interfaces of the ingress and 608 egress gateways. An ingress and egress border router (BR) is shown 609 interconnecting each interior domain with the next. There will 610 typically be other interior routers (not shown) within each interior 611 domain. 613 In two paragraphs we now briefly recap how pre-congestion 614 notification is intended to be used to control flow admission to a 615 large Diffserv region. 
The first paragraph describes data plane 616 functions and the second describes signalling in the control plane. 617 We omit many details from [I-D.ietf-pcn-architecture] including 618 behaviour during routing changes. For brevity here we assume other 619 flows are already in progress across a path through the Diffserv 620 region before a new one arrives, but how bootstrap works is described 621 in Section 4.3.2. 623 Figure 1 shows a single simplex reserved flow from the sending (Sx) 624 end host to the receiving (Rx) end host. The ingress gateway polices 625 incoming traffic and colours conforming traffic within an admitted 626 reservation to a combination of Diffserv codepoint and ECN field that 627 defines the traffic as 'PCN-enabled'. This redefines the meaning of 628 the ECN field as a PCN field, which is largely the same as ECN 629 [RFC3168], but with slightly different semantics defined in 630 [I-D.moncaster-pcn-baseline-encoding] (or various extensions that are 631 currently experimental). The Diffserv region is called a PCN-region 632 because all the queues within it are PCN-enabled. This means the 633 per-hop behaviour they apply to PCN-enabled traffic consists of both 634 a scheduling behaviour and a new ECN marking behaviour that we call 635 `pre-congestion notification' [I-D.eardley-pcn-marking-behaviour]. A 636 PCN-enabled queue typically re-uses the definition of expedited 637 forwarding (EF) [RFC3246] for its scheduling behaviour. The new 638 congestion marking behaviour sets the PCN field of an increasing 639 proportion of PCN packets to the PCN-marked (PM) codepoint 640 [I-D.moncaster-pcn-baseline-encoding] as their load approaches a 641 threshold rate that is lower than the line rate 642 [I-D.eardley-pcn-marking-behaviour]. This can be achieved with an 643 algorithm similar to a token-bucket called a virtual queue. 
The aim 644 is for a queue to start marking PCN traffic to trigger admission 645 control before the real queue builds up any congestion delay. The 646 level of a queue's pre-congestion marking is detected at the egress 647 of the Diffserv region and used by the signalling system to control 648 admission of further traffic that would otherwise overload that 649 queue, as follows. 651 The end-to-end QoS signalling for a new reservation (to be concrete 652 we will use RSVP) takes one giant hop from ingress to egress gateway, 653 because interior routers within the Diffserv region are configured to 654 ignore RSVP. The egress gateway holds flow state because it takes 655 part in the end-to-end reservation. So it can classify all packets 656 by flow and it can identify all flows that have the same previous 657 RSVP hop (an ingress-egress-aggregate). For each ingress-egress- 658 aggregate of flows in progress, the egress gateway maintains a per- 659 packet moving average of the fraction of pre-congestion-marked 660 traffic. Once an RSVP PATH message for a new reservation has hopped 661 across the Diffserv region and reached the destination, an RSVP RESV 662 message is returned. As the RESV message passes, the egress gateway 663 piggy-backs the relevant pre-congestion level onto it [RSVP-ECN]. 664 Again, interior routers ignore the RSVP message, but the ingress 665 gateway strips off the pre-congestion level. If the pre-congestion 666 level is above a threshold, the ingress gateway denies admission to 667 the new reservation, otherwise it returns the original RESV signal 668 back towards the data sender. 670 Once a reservation is admitted, its traffic will always receive low 671 delay service for the duration of the reservation. This is because 672 ingress gateways ensure that traffic not under a reservation cannot 673 pass into the PCN-region with a Diffserv codepoint that gives it 674 priority over the capacity used for PCN traffic. 
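The gateway behaviour just described can be sketched as below: the egress keeps a per-packet moving average of the pre-congestion-marked fraction for each ingress-egress-aggregate, and the ingress compares the level piggy-backed on the RESV against a threshold. The smoothing weight, the threshold value and all names here are illustrative assumptions, not values from the PCN drafts:

```python
class EgressAggregateMeter:
    """Per ingress-egress-aggregate EWMA of the fraction of
    pre-congestion-marked traffic, updated on every packet
    (the weight 0.01 is an illustrative choice)."""

    def __init__(self, weight=0.01):
        self.weight = weight
        self.marked_fraction = 0.0

    def on_packet(self, pcn_marked):
        sample = 1.0 if pcn_marked else 0.0
        self.marked_fraction += self.weight * (sample - self.marked_fraction)


def admit(reported_marking_fraction, admission_threshold=0.05):
    """Ingress decision on receipt of the RESV carrying the piggy-backed
    pre-congestion level: deny if the level is above the threshold."""
    return reported_marking_fraction < admission_threshold
```

A lightly pre-congested aggregate (say 3% marking) would pass the threshold test and the RESV would be forwarded on towards the sender; a heavily marked one would be denied.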
676 Even if some disaster re-routes traffic after it has been admitted, 677 if the PCN traffic through any PCN resource tips over a higher, fail- 678 safe threshold, pre-congestion notification can trigger flow 679 termination to very quickly bring every router within the whole PCN- 680 region back below its operating point. The same marking process and 681 ECN codepoint can be used for both admission control and flow 682 termination, by simply triggering them at different fractions of 683 marking [I-D.charny-pcn-single-marking]. However simulations have 684 confirmed that this approach is not robust in all circumstances that 685 might typically be encountered, so approaches with two thresholds and 686 two congestion encodings are expected to be required in production 687 networks. 689 The whole admission control system just described deliberately 690 confines per-flow processing to the access edges of the network, 691 where it will not limit the system's scalability. But ideally we 692 want to extend this approach to multiple networks, to take even more 693 advantage of its scaling potential. We would still need per-flow 694 processing at the access edges of each network, but not at the high 695 speed interfaces where they interconnect. Even though such an 696 admission control system would work technically, it would gain us no 697 scaling advantage if each network also wanted to police the rate of 698 each admitted flow for itself--border routers would still have to do 699 complex packet operations per-flow anyway, given they don't trust 700 upstream networks to do their policing for them. 702 This memo describes how to emulate per-flow rate policing using bulk 703 mechanisms at border routers. Otherwise the full scalability 704 potential of pre-congestion notification would be limited by the need 705 for per-flow policing mechanisms at borders, which would make borders 706 the most cost-critical pinch-points. 
Instead we can achieve the long 707 sought-for vision of secure Internet-wide bandwidth reservations 708 without over-generous provisioning or per-flow processing. We still 709 use per-flow processing at the edge routers closest to the end-user, 710 but we need no per-flow processing at all in core _or border 711 routers_--where scalability is most critical. 713 4. Re-ECN Protocol in IP with Two Congestion Marking Levels 715 4.1. Protocol Overview 717 First we need to recap the way routers accumulate PCN congestion 718 marking along a path (it accumulates the same way as ECN). Each PCN- 719 capable queue into a link might mark some packets with a PCN-marked 720 (PM) codepoint, the marking probability increasing with the length of 721 the queue [I-D.eardley-pcn-marking-behaviour]. With a series of PCN- 722 capable routers on a path, a stream of packets accumulates the 723 fraction of PCN markings that each queue adds. The combined effect 724 of the packet marking of all the queues along the path signals 725 congestion of the whole path to the receiver. So, for example, if 726 one queue early in a path is marking 1% of packets and another later 727 in a path is marking 2%, flows that pass through both queues will 728 experience approximately 3% marking over a sequence of packets. 730 (Note: Whenever the word 'congestion' is used in this document it 731 should be taken to mean congestion of the virtual resource assigned 732 for use by PCN-traffic. This avoids cumbersome repetition of the 733 strictly correct term 'pre-congestion'.) 735 The packets crossing an inter-domain trust boundary within the PCN- 736 region will all have come from different ingress gateways and will 737 all be destined for different egress gateways. 
We will show that the 738 key to policing against theft of service is for a border router to be 739 able to directly measure the congestion that is about to be caused by 740 the packets it forwards into any of the downstream paths between 741 itself and the egress gateways that each packet is destined for. The 742 purpose of the re-PCN protocol is to make packets automatically carry 743 this information, which then merely needs to be counted locally at 744 the border. 746 With the original PCN protocol, if a border router, e.g. that between 747 domains 'A' & 'B' (Figure 2), counts PCN markings crossing the border 748 over a period, they represent the accumulated congestion that has 749 already been experienced by those packets (congestion upstream of the 750 border, u). The idea of re-PCN is to make the ingress gateway 751 continuously encode the path congestion it knows into a new field in 752 the IP header (in this case, `path' means the path from the ingress 753 to the egress gateway). This new field is _not_ altered by queues 754 along the path. Then at any point on that path (e.g. between domains 755 'A' & 'B'), IP headers can be monitored to measure both expected path 756 congestion, p, and upstream congestion, u. Then congestion expected 757 downstream of the border, v, can be derived simply by subtracting 758 upstream congestion from expected path congestion. That is, v ~= p - 759 u. 761 Importantly, it turns out that there is no need to monitor downstream 762 congestion on a per-flow, per-path or per-aggregate basis. We will 763 show that accounting for it in bulk by counting the volume of all 764 marked packets will be sufficient.
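The bulk accounting this implies can be sketched as below. Packets are abstracted here as (size, RE flag, PCN-marked) tuples -- our own abstraction of the wire protocol fields, not a real packet format -- and the border keeps only three octet counters, with no per-flow, per-path or per-aggregate state:

```python
def downstream_congestion(packets):
    """Bulk border meter: derive expected downstream congestion
    v ~= p - u purely from octet counts over a measurement period."""
    total = blanked = marked = 0
    for size, re_flag, pcn_marked in packets:
        total += size
        if re_flag == 0:      # RE 'blanked' by the ingress gateway
            blanked += size
        if pcn_marked:        # PM codepoint set by an upstream queue
            marked += size
    p = blanked / total       # expected whole-path congestion
    u = marked / total        # congestion accumulated upstream of here
    return p - u              # congestion expected downstream, v
```

For example, if 3% of octets arrive with the RE flag blanked and 1% arrive PCN-marked, the border infers roughly 2% congestion downstream of itself.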
766 _____________________________________ 767 _|__ ______ ______ ______ _|__ 768 | | | A | | B | | C | | | 769 +----+ +-+ +-+ +-+ +-+ +-+ +-+ +----+ 770 | | |B| |B| |B| |B| |B| |B| | | 771 |Ingr|==|R| |R|==|R| |R|==|R| |R|==|Egr | 772 |G/W | | | | |: | | | | | | | | |G/W | 773 +----+ +-+ +-+: +-+ +-+ +-+ +-+ +----+ 774 | | | |: | | | | | | 775 |____| |______|: |______| |______| |____| 776 |_____________:_______________________| 777 : 778 | : | 779 |<-upstream-->:<-expected downstream->| 780 | congestion : congestion | 781 | u v ~= p - u | 782 | | 783 |<--- expected path congestion, p --->| 785 Figure 2: Re-ECN concept 787 4.2. Re-PCN Abstracted Network Layer Wire Protocol (IPv4 or v6) 789 In this section we define the names of the various codepoints of the 790 extended ECN field when used with pre-congestion notification, 791 deferring description of their semantics to the following sections. 792 But first we recap the re-ECN wire protocol proposed in 793 [I-D.briscoe-tsvwg-re-ecn-tcp]. 795 4.2.1. Re-ECN Recap 797 Re-ECN uses the two-bit ECN field broadly as in RFC3168 [RFC3168]. 798 It also uses a new re-ECN extension (RE) flag. The actual position 799 of the RE flag is different between IPv4 & v6 headers, so we will use 800 an abstraction of the IPv4 and v6 wire protocols by just calling it 801 the RE flag. [I-D.briscoe-tsvwg-re-ecn-tcp] proposes using bit 48 802 (currently unused) in the IPv4 header for the RE flag, while for IPv6 803 it proposes a congestion extension header. 805 Unlike the ECN field, the RE flag is intended to be set by the sender 806 and remain unchanged along the path, although it can be read by 807 network elements that understand the re-ECN protocol. In the 808 scenario used in this memo, the ingress gateway is the 'sender' as 809 far as the scope of the PCN region is concerned, so it sets the RE 810 flag (as permitted for sender proxies in the specification of re- 811 ECN).
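As a sketch of this abstraction, the two-bit ECN field and the one-bit RE flag can be read together and mapped to the extended codepoint names recapped in Table 1 below. The lookup table and helper function are ours, purely for illustration:

```python
# Map (ECN field, RE flag) to the extended ECN codepoint names of
# Table 1; '---' is the legacy-ECN-only combination.
EECN_NAMES = {
    (0b00, 0): "Not-RECT",   # not re-ECN-capable transport
    (0b00, 1): "FNE",        # feedback not established
    (0b10, 0): "---",        # legacy ECN use only
    (0b10, 1): "--CU--",     # currently unused
    (0b01, 0): "Re-Echo",    # re-echoed congestion (and RECT)
    (0b01, 1): "RECT",       # re-ECN capable transport
    (0b11, 0): "CE(0)",      # congestion experienced with Re-Echo
    (0b11, 1): "CE(-1)",     # congestion experienced
}

def eecn_name(ecn_field, re_flag):
    """ecn_field is the two-bit ECN field; re_flag is the single RE bit."""
    return EECN_NAMES[(ecn_field & 0b11, re_flag & 1)]
```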
813 Note that general-purpose routers do not have to read the RE flag; 814 only special policing elements at borders do. And no general-purpose 815 routers have to change the RE flag, although the ingress and egress 816 gateways do because, in the edge-to-edge deployment model we are 817 using, they act as the endpoints of the PCN region. Therefore the RE 818 flag does not even have to be visible to interior routers. So the RE 819 flag has no implications for protocols like MPLS. Congested label 820 switching routers (LSRs) would have to be able to notify their 821 congestion with an ECN/PCN codepoint in the MPLS shim [RFC5129], but 822 like any interior IP router, they can be oblivious to the RE flag, 823 which need only be read by border policing functions. 825 Although the RE flag is a separate single-bit field, it can be read 826 as an extension to the two-bit ECN field; the three concatenated bits 827 in what we will call the extended ECN field (EECN) make eight 828 codepoints available. When the RE flag setting is "don't care", we 829 use the RFC3168 names of the ECN codepoints, but 830 [I-D.briscoe-tsvwg-re-ecn-tcp] proposes the following six codepoint 831 names for when there is a need to be more specific.
833 +--------+-------------+-------+-------------+----------------------+ 834 | ECN | RFC3168 | RE | Extended | Re-ECN meaning | 835 | field | codepoint | flag | ECN | | 836 | | | | codepoint | | 837 +--------+-------------+-------+-------------+----------------------+ 838 | 00 | Not-ECT | 0 | Not-RECT | Not re-ECN-capable | 839 | | | | | transport | 840 | 00 | Not-ECT | 1 | FNE | Feedback not | 841 | | | | | established | 842 | 10 | ECT(0) | 0 | --- | Legacy ECN use | 843 | | | | | only | 844 | 10 | ECT(0) | 1 | --CU-- | Currently unused | 845 | | | | | | 846 | 01 | ECT(1) | 0 | Re-Echo | Re-echoed congestion | 847 | | | | | and RECT | 848 | 01 | ECT(1) | 1 | RECT | Re-ECN capable | 849 | | | | | transport | 850 | 11 | CE | 0 | CE(0) | Congestion | 851 | | | | | experienced with | 852 | | | | | Re-Echo | 853 | 11 | CE | 1 | CE(-1) | Congestion | 854 | | | | | experienced | 855 +--------+-------------+-------+-------------+----------------------+ 857 Table 1: Re-cap of Default Extended ECN Codepoints Proposed for Re- 858 ECN 860 4.2.2. Re-ECN Combined with Pre-Congestion Notification (re-PCN) 862 As permitted by the ECN specification [RFC3168] and by the guidelines 863 for specifying alternative semantics for the ECN field [RFC4774], a 864 proposal is currently being advanced in the IETF to define different 865 semantics for how queues might mark the ECN field of certain packets. 866 The idea is to be able to notify congestion when the queue's load 867 approaches a logical limit, rather than the physical limit of the 868 line. This new marking is called pre-congestion 869 notification [I-D.eardley-pcn-marking-behaviour] and we will use the 870 term PCN-enabled queue for a queue that can apply pre-congestion 871 notification marking to the ECN fields of packets. 873 [RFC3168] recommends that a packet's Diffserv codepoint should 874 determine which type of ECN marking it receives. 
A PCN-capable 875 packet must meet two conditions: it must carry a DSCP that has been 876 associated with PCN marking, and it must carry an ECN field that turns 877 on PCN marking. 879 As an example, a packet carrying the VOICE-ADMIT 880 [I-D.ietf-tsvwg-admitted-realtime-dscp] DSCP would be associated with 881 expedited forwarding [RFC3246] as its scheduling behaviour and pre- 882 congestion notification as its congestion marking behaviour. PCN 883 would only be turned on within a PCN-region by an ECN codepoint other 884 than Not-ECT (00). Then we would describe packets with the VOICE- 885 ADMIT DSCP and with ECN turned on as PCN-capable packets. 887 [I-D.eardley-pcn-marking-behaviour] actually proposes that two 888 logical limits can be used for pre-congestion notification, with the 889 higher limit as a back-stop for dealing with anomalous events. It 890 envisages that PCN will be used for admission control of inelastic real-time 891 traffic, so marking at the lower limit will trigger admission 892 control, while at the higher limit it will trigger flow termination. 894 Because it needs two types of congestion marking, PCN needs four 895 states: Not PCN-capable (Not-PCN), PCN-capable but not PCN-marked 896 (NM), Admission Marked (AM) and Flow Termination Marked (TM). A 897 proposed encoding of the four required PCN states is shown on the 898 left of Table 2. Note that these codepoints of the ECN field only 899 take on the semantics of pre-congestion notification if they are 900 combined with a Diffserv codepoint that the operator has configured 901 to be associated with PCN marking. 903 This encoding only correctly traverses an IP in IP tunnel if the 904 ideal decapsulation rules in [I-D.briscoe-tsvwg-ecn-tunnel] are 905 followed when combining the ECN fields of the outer and inner 906 headers.
If instead the decapsulation rules in [RFC3168] or 907 [RFC4301] are followed, any admission marking applied to an outer 908 header will be incorrectly removed on decapsulation at the tunnel 909 egress. 911 The RFC3168 ECN field includes space for the experimental ECN 912 Nonce [RFC3540], which seems to require a fifth state if it is also 913 needed with re-PCN. But re-PCN supersedes any need for the Nonce 914 within the PCN-region. The ECN Nonce is an elegant scheme, but it 915 only allows a sending node (or its proxy) to detect suppression of 916 congestion marking in the feedback loop. Thus the Nonce requires the 917 sender (or in our case the PCN ingress) to be trusted to respond 918 correctly to congestion. But this is precisely the main cheat we 919 want to protect against (as well as many others). Also, the ECN 920 nonce only works once the receiver has placed packets in the same 921 order as they left the ingress, which cannot be done by an edge node 922 without adding unnecessary edge-to-edge packet ordering. Nonetheless, 923 if the ECN nonce were in use outside the PCN region (end-to-end), the 924 ingress would have to tunnel the arriving IP header across the PCN 925 region ([I-D.ietf-pcn-architecture]). 927 For the rest of this memo, we will use 928 "congestion marking" or "PCN 929 marking" to mean either Admission Marking or Termination Marking, unless we need to be specific. With the above encoding, 930 congestion marking can be read to mean any packet with the right-most 931 bit of the ECN field set. 933 The re-ECN protocol can be used to control misbehaving sources 934 whether congestion is with respect to a logical threshold (PCN) or 935 the physical line rate (ECN). In either case the RE flag can be used 936 to create an extended ECN field. For PCN-capable packets, the 8 937 possible encodings of this 3-bit extended PCN (EPCN) field are 938 defined on the right of Table 2 below.
The purposes of these 939 different codepoints will be introduced in subsequent sections. 941 +--------+-----------+-------+-----------------+--------------------+ 942 | ECN | PCN | RE | Extended PCN | Re-PCN meaning | 943 | field | codepoint | flag | codepoint | | 944 +--------+-----------+-------+-----------------+--------------------+ 945 | 00 | Not-PCN | 0 | Not-PCN | Not PCN-capable | 946 | | | | | transport | 947 | 00 | Not-PCN | 1 | FNE | Feedback not | 948 | | | | | established | 949 | 10 | NM | 0 | Re-PCT-Echo | Re-echoed | 950 | | | | | congestion and | 951 | | | | | Re-PCT | 952 | 10 | NM | 1 | Re-PCT | Re-PCN capable | 953 | | | | | transport | 954 | 01 | AM | 0 | AM(0) | Admission Marking | 955 | | | | | with Re-Echo | 956 | 01 | AM | 1 | AM(-1) | Admission Marking | 957 | | | | | | 958 | 11 | TM | 0 | TM(0) | Termination | 959 | | | | | Marking with | 960 | | | | | Re-Echo | 961 | 11 | TM | 1 | TM(-1) | Termination | 962 | | | | | Marking | 963 +--------+-----------+-------+-----------------+--------------------+ 965 Table 2: Extended ECN Codepoints if the Diffserv codepoint uses Pre- 966 congestion Notification (PCN) 968 Note that Table 2 shows re-PCN uses ECT(0) but Table 1 shows re-ECN 969 uses ECT(1) for the unmarked state. The difference is intended-- 970 although it makes it harder to remember the two schemes, it makes 971 them both safer during incremental deployment. 973 4.3. Protocol Operation 974 4.3.1. Protocol Operation for an Established Flow 976 The re-PCN protocol involves a simple addition to the action of the 977 gateway at the ingress edge of the PCN region (the PCN-ingress-node). 978 But first we will recap how PCN works without the addition. For each 979 active traffic aggregate across a PCN region (ingress-egress- 980 aggregate) the egress gateway measures the level of PCN marking and 981 feeds it back to the ingress piggy-backed as 'PCN-feedback- 982 information' on any control signal passing between the nodes (e.g. 
983 every flow set-up, refresh or tear-down). Therefore the ingress 984 gateway will always hold a fairly recent (typically at most 30sec) 985 estimate of the ingress-egress-aggregate congestion level. For 986 instance, one aggregate might have been experiencing 3% pre- 987 congestion (that is, congestion marked octets whether Admission 988 Marked or Termination Marked). 990 To comply with the re-PCN protocol, for all PCN packets in each 991 ingress-egress-aggregate the ingress gateway MUST clear the RE flag 992 to "0" for the same percentage of octets as its current estimate of 993 congestion on the aggregate (e.g. 3%) and set it to "1" in the rest 994 (97%). Appendix A.1 gives a simple pseudo-code algorithm that the 995 ingress gateway may use to do this. 997 The RE flag is set and cleared this way round for incremental 998 deployment reasons (see Section 7). To avoid confusion we will use 999 the term `blanking' (rather than marking) when the RE flag is cleared 1000 to "0", so we will talk of the `RE blanking fraction' as the fraction 1001 of octets with the RE flag cleared to "0". 1003 ^ 1004 | 1005 | RE blanking fraction 1006 3% | +----------------------------+====+ 1007 | | | | 1008 2% | | | | 1009 | | congestion marking fraction| | 1010 1% | | +----------------------+ | 1011 | | | | 1012 0% +----+=====+---------------------------+------> 1013 ^ <--A---> <---B---> <---C---> ^ domain 1014 | ^ ^ | 1015 ingress | | egress 1016 1.00% 2.00% marking fraction 1018 Figure 3: Example Extended ECN codepoint Marking fractions 1019 (Imprecise) 1021 Figure 3 illustrates our example. The horizontal axis represents the 1022 index of each congestible resource (typically queues) along a path 1023 through the Internet. The two superimposed plots show the fraction 1024 of each extended PCN codepoint observed along this path, assuming 1025 there are two congested routers somewhere within domains A and C. 
And 1026 Table 3 below shows the downstream pre-congestion measured at various 1027 border observation points along the path. Figure 4 (later) shows the 1028 same results of these subtractions, but in graphical form like the 1029 above figure. The tabulated figures are actually reasonable 1030 approximations derived from more precise formulae given in Appendix A 1031 of [I-D.briscoe-tsvwg-re-ecn-tcp]. The RE flag is not changed by 1032 interior routers, so it can be seen that it acts as a reference 1033 against which the congestion marking fraction can be compared along 1034 the path. 1036 +--------------------------+---------------------------------------+ 1037 | Border observation point | Approximate Downstream pre-congestion | 1038 +--------------------------+---------------------------------------+ 1039 | ingress -- A | 3% - 0% = 3% | 1040 | A -- B | 3% - 1% = 2% | 1041 | B -- C | 3% - 1% = 2% | 1042 | C -- egress | 3% - 3% = 0% | 1043 +--------------------------+---------------------------------------+ 1045 Table 3: Downstream Congestion Measured at Example Observation Points 1047 Note that the ingress determines the RE blanking fraction for each 1048 aggregate using the most recent feedback from the relevant egress, 1049 arriving with each new reservation, or each refresh. These updates 1050 arrive relatively infrequently compared to the speed with which 1051 congestion changes. Although this feedback will always be out of 1052 date, on average positive errors should cancel out negative over a 1053 sufficiently long duration. 1055 In summary, the network adds pre-congestion marking in the forward 1056 data path, the egress feeds its level back to the ingress in RSVP (or 1057 similar signalling), then the ingress gateway re-echoes it into the 1058 forward data path by blanking the RE flag. 
Then at any border within 1059 the PCN-region, the pre-congestion marking that every passing packet 1060 will be expected to experience downstream can be measured to be the 1061 RE blanking fraction minus the congestion marking fraction. 1063 4.3.2. Aggregate Bootstrap 1065 When a new reservation PATH message arrives at the egress, if there 1066 are currently no flows in progress from the same ingress, there will 1067 be no state maintaining the current level of pre-congestion marking 1068 for the aggregate. In the case of RSVP reservation signalling, while 1069 the signal continues onward towards the receiving host, the egress 1070 gateway can return an RSVP message to the ingress with a 1071 flag [RSVP-ECN] asking the ingress to send a specified number of data 1072 probes between them. The more general possibilities for bootstrap 1073 behaviour are described in the PCN 1074 architecture [I-D.ietf-pcn-architecture], including using the 1075 reservation signal itself as a probe. 1077 However, with our new re-PCN scheme, the ingress does not know what 1078 proportion of the data probes should have the RE flag blanked, 1079 because it has no estimate yet of pre-congestion for the path across 1080 the PCN-region. 1082 To be conservative, following the guidance for specifying other re- 1083 ECN transports in [I-D.briscoe-tsvwg-re-ecn-tcp], the ingress SHOULD 1084 set the FNE codepoint of the extended PCN header in all probe packets 1085 (Table 2). As per the PCN deployment model, the egress gateway 1086 measures the fraction of congestion-marked probe octets and feeds 1087 back the resulting pre-congestion level to the ingress, piggy-backed 1088 on the returning reservation response (RESV) for the new flow. Probe 1089 packets are identifiable by the egress because they carry the FNE 1090 codepoint. 1092 It may seem inadvisable to expect the FNE codepoint to be set on 1093 probes, given legacy firewalls etc. 
might discard such packets 1094 (because this flag had no previous legitimate use). However, in the 1095 deployment scenarios envisaged, each domain in the PCN-region has to 1096 be explicitly configured to support the admission controlled service. 1097 So, before deploying the service, the operator MUST reconfigure such 1098 a badly implemented middlebox to allow through packets with the RE 1099 flag set. 1101 Note that we have said SHOULD rather than MUST for the FNE setting 1102 behaviour of the ingress for probe packets. This entertains the 1103 possibility of an ingress implementation having the benefit of other 1104 knowledge of the path, which it re-uses for a newly starting 1105 aggregate. For instance, it may hold cached information from a 1106 recent use of the aggregate that is still sufficiently current to be 1107 useful. If not all probe packets are set to FNE, the ingress will 1108 have to ensure probe packets are identifiable by some other means, 1109 perhaps by using the egress as the destination address. 1111 It might seem pedantic worrying about these few probe packets, but 1112 this behaviour ensures the system is safe, even if the proportion of 1113 probe packets becomes large. 1115 4.3.3. Flow Bootstrap 1117 It might be expected that a new flow within an active aggregate would 1118 need no special bootstrap behaviour. If there was an aggregate 1119 already in progress between the gateways the new flow was about to 1120 use, it would inherit the prevailing RE blanking fraction. And if 1121 there were no active aggregate, the bootstrap behaviour for an 1122 aggregate would be appropriate and sufficient for the new flow. 1124 However, for a number of reasons, at least the first packet of each 1125 new flow SHOULD be set to the FNE codepoint, irrespective of whether 1126 it is joining an active aggregate or not. 
If the first packet is 1127 unlikely to be reliably delivered, a number of FNE packets MAY be 1128 sent to increase the probability that at least one is delivered to 1129 the egress gateway. 1131 If each flow does not start with an FNE packet, it will be seen later 1132 that sanctions may be too strict at the interface before the egress 1133 gateway. It will often be possible to apply sanctions at the 1134 granularity of aggregates rather than flows, but in an internetworked 1135 environment it cannot be guaranteed that aggregates will be 1136 identifiable in remote networks. So setting FNE at the start of each 1137 flow is a safe strategy. For instance, a remote network may have 1138 equal cost multi-path (ECMP) routing enabled, causing different flows 1139 between the same gateways to traverse different paths. 1141 After an idle period of more than 1 second, the ingress gateway 1142 SHOULD set the EPCN field of the next packet it sends to FNE. This 1143 allows the design of network policers to be deterministic (see 1144 [I-D.briscoe-tsvwg-re-ecn-tcp]). 1146 However, if the ingress gateway can guarantee that the network(s) 1147 that will carry the flow to its egress gateway all use a common 1148 identifier for the aggregate (e.g. a single MPLS network without ECMP 1149 routing), it MAY omit setting FNE when it adds a new flow to an active 1150 aggregate. And an FNE packet need only be sent if a whole aggregate 1151 has been idle for more than 1 second. 1153 4.3.4. Router Forwarding Behaviour 1155 Adding re-PCN works well with the regular PCN forwarding behaviour of 1156 interior queues.
   However, below, two optional changes are proposed when forwarding
   packets with a per-hop behaviour that requires pre-congestion
   notification:

   Preferential drop:  When a router cannot avoid dropping PCN-capable
      packets, preferential dropping of packets with different extended
      PCN codepoints SHOULD be implemented between packets within a PHB
      that uses PCN marking.  The drop preference order to use is
      defined in Table 4.  Note that, to reduce configuration
      complexity, Re-PCT-Echo and FNE MAY be given the same drop
      preference, but, if feasible, FNE SHOULD be dropped in preference
      to Re-PCT-Echo.

      If this proposal were advanced at the same time as PCN itself, we
      would recommend that preferential drop based on the extended PCN
      codepoint SHOULD be added to router forwarding at the same time
      as PCN marking.  Preferential dropping can be difficult to
      implement, but we RECOMMEND this security-related re-PCN
      improvement where feasible, as it is an effective defence against
      flooding attacks.

   Marking vs. Drop:  We propose that PCN-routers SHOULD inspect the RE
      flag as well as the ECN field to decide whether to drop or mark
      PCN DSCPs.  They MUST choose drop if the codepoint of this
      extended ECN field is Not-PCN.  Otherwise they SHOULD mark
      (unless, of course, buffer space is exhausted).

      A PCN-capable router MUST NOT ever congestion mark a packet
      carrying the Not-PCN codepoint, because the transport will only
      understand drop, not congestion marking.  But a PCN-capable
      router can mark rather than drop an FNE packet, even though its
      ECN field, when looked at in isolation, is '00', which appears to
      be a legacy Not-ECT packet.  Therefore, if a packet's RE flag is
      '1', even if its ECN field is '00', a PCN-enabled router SHOULD
      use congestion marking.
      This allows the `feedback not established' (FNE) codepoint to be
      used for probe packets, in order to pick up PCN marking when
      bootstrapping an aggregate.

      PCN marking rather than dropping of FNE packets MUST only be
      deployed in controlled environments, such as that in
      [I-D.ietf-pcn-architecture], where the presence of an egress node
      that understands PCN marking is assured.  Congestion events might
      otherwise be ignored if the receiver only understands drop,
      rather than PCN marking.  This is because there is no guarantee
      that PCN capability has been negotiated if feedback is not
      established (FNE).  Also, [I-D.briscoe-tsvwg-re-ecn-tcp] places
      the strong condition that a router MUST apply drop rather than
      marking to FNE packets unless it can guarantee that FNE packets
      are rate limited either locally or upstream.

   +---------+-------+-----------------+---------+---------------------+
   | PCN     | RE    | Extended PCN    | Drop    | Re-PCN meaning      |
   | field   | flag  | codepoint       | Pref    |                     |
   +---------+-------+-----------------+---------+---------------------+
   | 10      | 0     | Re-PCT-Echo     | 5/4     | Re-echoed           |
   |         |       |                 |         | congestion and      |
   |         |       |                 |         | Re-PCT              |
   | 00      | 1     | FNE             | 4       | Feedback not        |
   |         |       |                 |         | established         |
   | 10      | 1     | Re-PCT          | 3       | Re-PCN capable      |
   |         |       |                 |         | transport           |
   | 01      | 0     | AM(0)           | 3       | Admission Marking   |
   |         |       |                 |         | with Re-Echo        |
   | 01      | 1     | AM(-1)          | 3       | Admission Marking   |
   |         |       |                 |         |                     |
   | 11      | 0     | TM(0)           | 2       | Termination Marking |
   |         |       |                 |         | with Re-Echo        |
   | 11      | 1     | TM(-1)          | 2       | Termination Marking |
   |         |       |                 |         |                     |
   | 00      | 0     | Not-PCN         | 1       | Not PCN-capable     |
   |         |       |                 |         | transport           |
   +---------+-------+-----------------+---------+---------------------+

    Table 4: Drop Preference of Extended ECN Codepoints (1 = drop 1st)

4.3.5. Extensions
   If a different signalling system, such as NSIS, were used, but it
   provided admission control in a similar way using pre-congestion
   notification (e.g. Arumaithurai [I-D.arumaithurai-nsis-pcn] or
   RMD [I-D.ietf-nsis-rmd]), we believe re-PCN could be used to protect
   against misbehaving networks in the same way as proposed above.

5. Emulating Border Policing with Re-ECN

   The following sections are informative, not normative.  The re-PCN
   protocol described in Section 4 above would require standardisation,
   whereas operators, acting in their own interests, would be expected
   to deploy policing and monitoring functions similar to those
   proposed in the sections below without any further need for
   standardisation by the IETF.  Flexibility is expected in exactly how
   policing and monitoring are done.

5.1. Informal Terminology

   In the rest of this memo, where the context makes it clear, we will
   sometimes loosely use the term `congestion' rather than the stricter
   `downstream pre-congestion'.  Also we will loosely talk of positive
   or negative flows, meaning flows where the moving average of the
   downstream pre-congestion metric is persistently positive or
   negative.  The notion of a negative metric arises because it is
   derived by subtracting one metric from another.  Of course, actual
   downstream congestion cannot be negative; only the metric can
   (whether due to time lags or deliberate malice).

   Just as we will loosely talk of positive and negative flows, we will
   also talk of positive or negative packets, meaning packets that
   contribute positively or negatively to downstream pre-congestion.

   Therefore packets can be considered to have a `worth' of +1, 0 or
   -1, which, when multiplied by their size, indicates their
   contribution to downstream congestion.  Packets will usually be
   initialised by the PCN ingress with a worth of 0.
   Blanking the RE flag increments the worth of a packet to +1.
   Congestion marking a packet decrements its worth (whether admission
   marking or termination marking).  Congestion marking a previously
   blanked packet cancels out the positive worth with the negative
   worth of the congestion marking (resulting in a packet worth 0).
   The FNE codepoint is an exception: it has the same positive worth as
   a packet with the Re-PCT-Echo codepoint.  The table below specifies
   unambiguously the worth of each extended PCN codepoint.  Note the
   order is different from the previous table, to emphasise how
   congestion marking processes decrement the worth (with the exception
   of FNE).

   +---------+-------+------------------+-------+----------------------+
   | ECN     | RE    | Extended PCN     | Worth | Re-PCN meaning       |
   | field   | flag  | codepoint        |       |                      |
   +---------+-------+------------------+-------+----------------------+
   | 00      | 0     | Not-PCN          | n/a   | Not PCN-capable      |
   |         |       |                  |       | transport            |
   | 10      | 0     | Re-PCT-Echo      | +1    | Re-echoed congestion |
   |         |       |                  |       | and Re-PCT           |
   | 01      | 0     | AM(0)            | 0     | Admission Marking    |
   |         |       |                  |       | with Re-Echo         |
   | 11      | 0     | TM(0)            | 0     | Termination Marking  |
   |         |       |                  |       | with Re-Echo         |
   | 00      | 1     | FNE              | +1    | Feedback not         |
   |         |       |                  |       | established          |
   | 10      | 1     | Re-PCT           | 0     | Re-PCN capable       |
   |         |       |                  |       | transport            |
   | 01      | 1     | AM(-1)           | -1    | Admission Marking    |
   |         |       |                  |       |                      |
   | 11      | 1     | TM(-1)           | -1    | Termination Marking  |
   +---------+-------+------------------+-------+----------------------+

             Table 5: 'Worth' of Extended ECN Codepoints

5.2. Policing Overview

   It will be recalled that downstream congestion can be found by
   subtracting upstream congestion from path congestion.  Figure 4
   displays the difference between the two plots in Figure 3 to show
   downstream pre-congestion across the same path through the Internet.
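   That subtraction can be illustrated with a small fragment (a
   non-normative Python sketch; the example fractions are invented, not
   taken from Figure 3):

```python
def downstream_pre_congestion(path_fraction: float,
                              upstream_fraction: float) -> float:
    """Downstream pre-congestion = path congestion - upstream congestion.

    path_fraction:     fraction of traffic with the RE flag blanked,
                       i.e. the declared whole-path pre-congestion
    upstream_fraction: fraction congestion marked so far on the path
    """
    return path_fraction - upstream_fraction

# At the ingress nothing has been marked yet, so all of the declared
# pre-congestion is still downstream; by the egress the two fractions
# should cancel, leaving zero.
print(downstream_pre_congestion(0.02, 0.0))   # 0.02 at the ingress
print(downstream_pre_congestion(0.02, 0.02))  # 0.0 at the egress
```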
   To emulate border policing, the general idea is for each domain to
   apply penalties to its upstream neighbour in proportion to the
   amount of downstream pre-congestion that the upstream network sends
   across the border.  That is, the penalties should be in proportion
   to the height of the plot.  Downward arrows in the figure show the
   resulting pressure, due to the penalties, for each domain to
   under-declare downstream pre-congestion in the traffic it passes to
   the next domain.

                      p e n a l t i e s
                     /        |        \
      A              :        :         :
      |  |      <--A---> <---B---> <---C--->     domain
      |  V           :        :         :
   3% |    +-----+   |        |         :
      |    |     |   V        V         :
   2% |    |     +----------------------+    :
      |    | downstream pre-congestion  |    :
   1% |    |         :                  |    :
      |    |         :                  |    :
   0% +----+----------------------------+====+------>
           :         :                  : A  :
           :         :                  : |  :
       ingress       :                  :   egress
         1.00%       2.00%              :  pre-congestion
                                          |
                                      sanctions

    Figure 4: Policing Framework, showing creation of opposing
    pressures to under-declare and over-declare downstream
    pre-congestion, using penalties and sanctions

   These penalties seem to encourage everyone to understate downstream
   congestion in order to reduce the penalties they incur.  But a
   balancing pressure is introduced by the last domain (strictly, by
   any domain), which applies sanctions to flows if downstream
   congestion goes negative before the egress gateway.  The upward
   arrow at Domain C's border with the egress gateway represents the
   incentive the sanctions would create to prevent negative traffic.
   The same upward pressure can be applied at any domain border (arrows
   not shown).

   Any flow that persistently goes negative by the time it leaves a
   domain must not have been marked correctly in the first place.  A
   domain that discovers such a flow can adopt a range of strategies to
   protect itself.
   Which strategy it uses will depend on policy, because it cannot
   immediately assume malice--there may be an innocent configuration
   error somewhere in the system.

   This memo does not propose to standardise any particular mechanism
   to detect persistently negative flows, but Section 5.5 does give
   examples.  Note that we have used the term flow, but there will be
   no need to delve into the transport layer for port numbers;
   identifiers visible in the network layer will be sufficient (IP
   address pair, DSCP, protocol ID).  The appendix also gives a
   mechanism to limit the required flow state, preventing state
   exhaustion attacks.

   Of course, some domains may trust other domains to comply with
   admission control without applying sanctions or penalties.  In these
   cases, the protocol should still be used, but no penalties need be
   applied.  The re-PCN protocol ensures downstream pre-congestion
   marking is passed on correctly whether or not penalties are applied
   to it, so the system works just as well with a mixture of some
   domains trusting each other and others not.

   Providers should be free to agree the contractual terms they wish
   between themselves, so this memo does not propose to standardise how
   these penalties would be applied.  It is sufficient to standardise
   the re-PCN protocol so that the downstream pre-congestion metric is
   available if providers choose to use it.  However, the next section
   (Section 5.3) gives some examples of how these penalties might be
   implemented.

5.3. Pre-requisite Contractual Arrangements

   The re-PCN protocol has been chosen to solve the policing problem
   because it embeds a downstream pre-congestion metric in passing PCN
   traffic that is difficult to lie about and can be measured in bulk.
   The ability to emulate border policing depends on network operators
   choosing to use this metric as one of the elements in their
   contracts with each other.

   Already, many inter-domain agreements involve a capacity element and
   a usage element.  The usage element may be based on volume or on
   various measures of peak demand.  We expect that those network
   operators who choose to use pre-congestion notification for
   admission control would also be willing to consider using this
   downstream pre-congestion metric as a usage element in their
   interconnection contracts for admission controlled (PCN) traffic.

   Congestion (or pre-congestion) has the dimension of [octet], being
   the product of the volume transferred [octet] and the congestion
   fraction [dimensionless], which is the fraction of the offered load
   that the network isn't able to serve (or would rather not serve, in
   the case of pre-congestion).  Measuring downstream congestion
   therefore gives a measure of the volume transferred, but modulated
   by the congestion expected downstream.  So volume transferred during
   off-peak periods counts for nearly nothing, while volume transferred
   at peak times or over temporarily congested links counts very
   highly.  The re-PCN protocol allows one network to measure how much
   pre-congestion has been `dumped' into it by another network, and
   then, in turn, how much of that pre-congestion it dumped into the
   next downstream network.

   Section 5.6 describes mechanisms for calculating border penalties,
   referring to Appendix A.2 for suggested metering algorithms for
   downstream congestion at a border router.  Conceptually, it could
   hardly be simpler.  It broadly involves accumulating the volume of
   packets with the RE flag blanked and the volume of those with
   congestion marking, then subtracting the two.
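   As a non-normative illustration of that accumulate-and-subtract
   process (a Python sketch, not the pseudo-code of Appendix A.2; the
   worths follow Table 5):

```python
# Bulk meter of downstream pre-congestion at a border interface.
# Packets of positive worth (Re-PCT-Echo, FNE) add their size in
# octets; packets of negative worth (AM(-1), TM(-1)) subtract theirs.
# Neutral codepoints contribute nothing.  No per-flow state is kept.
WORTH = {
    "Re-PCT-Echo": +1, "FNE": +1,
    "Re-PCT": 0, "AM(0)": 0, "TM(0)": 0,
    "AM(-1)": -1, "TM(-1)": -1,
}

class BorderMeter:
    def __init__(self) -> None:
        self.volume = 0  # accumulated downstream pre-congestion [octets]

    def packet(self, codepoint: str, size_octets: int) -> None:
        self.volume += WORTH.get(codepoint, 0) * size_octets

meter = BorderMeter()
meter.packet("Re-PCT-Echo", 1500)  # RE flag blanked: +1500
meter.packet("Re-PCT", 1500)       # neutral: no change
meter.packet("AM(-1)", 1500)       # admission marked: -1500
print(meter.volume)  # 0: the marking cancels the blanked packet
```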
   Once this downstream pre-congestion metric is available, operators
   are free to choose how they incorporate it into their
   interconnection contracts [IXQoS].  Some may include a threshold
   volume of pre-congestion as a quality measure in their service level
   agreement, perhaps with a penalty clause if the upstream network
   exceeds this threshold over, say, a month.  Others may agree a set
   of tiered monthly thresholds, with increasing penalties as each
   threshold is exceeded.  But it would be just as easy, and more
   resistant to gaming, to do away with discrete thresholds and instead
   make the penalty rise smoothly with the volume of pre-congestion, by
   applying a price to pre-congestion itself.  Then the usage element
   of the interconnection contract would directly relate to the volume
   of pre-congestion caused by the upstream network.

   The direction of penalties and charges relative to the direction of
   traffic flow is a constant source of confusion.  Typically, where
   capacity charges are concerned, lower-tier customer networks pay
   higher-tier provider networks.  So money flows from the edges to the
   middle of the internetwork, towards greater connectivity,
   irrespective of the flow of data.  But we advise that penalties or
   charges for usage should follow the same direction as the data
   flow--the direction of control at the network layer.  Otherwise a
   network lays itself open to `denial of funds' attacks.  So, where a
   tier 2 provider sends data into a tier 3 customer network, we would
   expect the penalty clauses for sending too much pre-congestion to be
   against the tier 2 network, even though it is the provider.

   It may help to remember that data will be flowing in the other
   direction too.
   So the provider network has as much opportunity to levy usage
   penalties as its customer, and it can set the price or strength of
   its own penalties higher if it chooses.  Usage charges in both
   directions tend to cancel each other out, which confirms that usage
   charging has less to do with raising revenue and more to do with
   encouraging load control discipline, in order to smooth peaks and
   troughs, improving utilisation and quality.

   Further, when operators agree penalties in their interconnection
   contracts for sending downstream congestion, they should make sure
   that any level of negative marking equates only to zero penalty.  In
   other words, penalties are always paid in the same direction as the
   data, and never against the data flow, even if downstream congestion
   seems to be negative.  This is consistent with the definition of
   physical congestion; when a resource is underutilised, it is not
   negatively congested.  Its congestion is just zero.  So, although
   short periods of negative marking can be tolerated, to correct
   temporary over-declarations due to lags in the feedback system,
   persistent negative downstream congestion can have no physical
   meaning and therefore must signify a problem.  The incentive for
   domains not to tolerate persistently negative traffic depends on
   this principle that negative penalties must never be paid for
   negative congestion.

   Also note that, at the last egress of the PCN-region, domain C
   should not agree to pay any penalties to the egress gateway for
   pre-congestion passed to the egress gateway.  Downstream
   pre-congestion should have reached zero by this point.  If domain C
   were to agree to pay for any remaining downstream pre-congestion, it
   would give the egress gateway an incentive to over-declare
   pre-congestion feedback and take the resulting profit from domain C.
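   The rule that negative marking never earns a negative penalty might
   be captured in a settlement calculation as a simple clamp (an
   illustrative Python sketch; the price and volumes are invented):

```python
def monthly_penalty(price_per_octet: float,
                    congestion_volume_octets: float) -> float:
    """Penalty owed by the upstream network for the downstream
    pre-congestion it sent over the accounting period.

    A negative accumulated volume has no physical meaning (an
    underutilised resource has zero congestion, not negative
    congestion), so it earns no credit: the penalty is clamped at
    zero rather than ever being paid against the data flow.
    """
    return max(0.0, price_per_octet * congestion_volume_octets)

print(monthly_penalty(0.25, 2_000_000))  # 500000.0
print(monthly_penalty(0.25, -500_000))   # 0.0: never a negative penalty
```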
   To focus the discussion, from now on, unless otherwise stated, we
   will assume a downstream network charges its upstream neighbour in
   proportion to the pre-congestion it sends (V_b in the notation of
   Appendix A.2).  Effectively, tiered thresholds would just be more
   coarse-grained approximations of the fine-grained case we choose to
   examine.  If these neighbours had previously agreed that the (fixed)
   price per octet of pre-congestion would be L, then the bill at the
   end of the month would simply be the product L*V_b, plus any fixed
   charges they may also have agreed.

   We are well aware that the IETF tries to avoid standardising
   technology that depends on a particular business model.  Indeed,
   this principle is at the heart of all our own work.  Our aim here is
   to make a new metric available that we believe is superior to all
   existing metrics.  Then, our aim is to show that bulk border
   policing can at least work with the one model we have just outlined.
   Of course, operators are free to complement this pre-congestion-
   based usage element of their charges with traditional capacity
   charging, and we expect they will.  But if operators don't want to
   use this business model at all, they don't have to do bulk border
   policing.  We also assume that operators might experiment with the
   metric in other models.

   Also note well that everything we discuss in this memo only concerns
   interconnection within the PCN-region.  ISPs are free to sell or
   give away reservations however they want on the retail market, but
   of course interconnection charges will have a bearing on that.
   Indeed, in the present scenario, the ingress gateway effectively
   sells reservations on one side and buys congestion penalties on the
   other.
   As congestion rises, one can imagine the gateway discovering that
   congestion penalties have risen higher than the (probably fixed)
   revenue it will earn from selling the next flow reservation.  This
   encourages the gateway to cut its losses by blocking new calls,
   which is why we believe downstream congestion penalties can emulate
   per-flow rate policing at borders, as the next section explains.

5.4. Emulation of Per-Flow Rate Policing: Rationale and Limits

   The important feature of charging in proportion to congestion volume
   is that the penalty aggregates and disaggregates correctly along
   with packet flows.  This is because the penalty rises linearly with
   bit rate (unless congestion is absolutely zero) and linearly with
   congestion, being the product of them both.  So if the packets
   crossing a border belong to a thousand flows, and one of those flows
   doubles its rate, the ingress gateway forwarding that flow will have
   to put twice as much congestion marking into the packets of that
   flow.  And this extra congestion marking will add proportionately to
   the penalties levied at every border the flow crosses, in proportion
   to the amount of pre-congestion remaining on the path.

   Effectively, usage charges will continuously flow from ingress
   gateways to the places generating pre-congestion marking, in
   proportion to the pre-congestion marking introduced and to the data
   rates from those gateways.

   As importantly, pre-congestion itself rises super-linearly with the
   utilisation of a particular resource.  So if someone tries to push
   another flow into a path that is already signalling enough
   pre-congestion to warrant admission control, the penalty will be a
   lot greater than it would have been to add the same flow to a less
   congested path.
   This makes the incentive system fairly insensitive to the actual
   level of pre-congestion that each ingress chooses for triggering
   admission control.  The deterrent against exceeding whatever
   threshold is chosen rises very quickly with a small amount of
   cheating.

   These are the properties that allow re-PCN to emulate per-flow
   border policing of both rate and admission control.  It is not a
   perfect emulation of per-flow border policing, but we claim it is
   sufficient to at least ensure that the cost to others of a cheat is
   borne by the cheater, because the penalties are at least
   proportionate to the level of the cheat.  If an edge network
   operator is selling reservations at a large profit over the
   congestion cost, these pre-congestion penalties will not be
   sufficient to ensure networks in the middle get a share of those
   profits, but at least they can cover their costs.

   We will now explain with an example.  When a whole inter-network is
   operating at normal (typically very low) congestion, the
   pre-congestion marking from virtual queues will be a little higher
   than if the real queues had been used--still low, but more
   noticeable.  But low congestion levels do not imply that usage
   _charges_ must also be low.  Usage charges will depend on the
   _price_ L as well.

   If the metric of the usage element of an interconnection agreement
   were changed from pure volume to pre-congested volume, one would
   expect the price of pre-congestion to be arranged so that the total
   usage charge remained about the same.  So, if the average
   pre-congestion fraction turned out to be 1/1000, one would expect
   the price L (per octet) of pre-congestion to be about 1000 times the
   previously used (per octet) price for volume.
   We should add that a switch to pre-congestion is unlikely to
   maintain exactly the same overall level of usage charges, but this
   argument will be approximately true, because the usage charge will
   rise to at least the level the market finds necessary to push back
   against usage.

   From the above example it can be seen why a 1000x higher price will
   make operators acutely sensitive to the congestion they cause in
   other networks, which is of course the desired effect: to encourage
   networks to _avoid_ the congestion they allow their users to cause
   to others.

   If any network sends even one flow at a higher rate, it will
   immediately have to pay proportionately more usage charges.  Because
   there is no knowledge of reservations within the PCN-region, no
   interior router can police whether the rate of each flow is greater
   than its reservation.  So the system doesn't truly emulate
   rate-policing of each flow.  But there is no incentive to pack a
   higher rate into a reservation, because the charges are directly
   proportional to rate, irrespective of the reservations.

   However, if virtual queues start to fill on any path, even though
   real queues will still be able to provide a low latency service,
   pre-congestion marking will rise fairly quickly.  It may eventually
   reach the threshold at which the ingress gateway would deny
   admission to new flows.  If the ingress gateway cheats and continues
   to admit new flows, the affected virtual queues will rapidly fill,
   even though the real queues will still be little worse than they
   were when admission control should have been invoked.  The ingress
   gateway will have to pay the penalty for such an extremely high
   pre-congestion level, so the pressure to invoke admission control
   should become unbearable.

   The above mechanisms protect against rational operators.
   In Section 5.6.3 we discuss how networks can protect themselves from
   accidental or deliberate misconfiguration in neighbouring networks.

5.5. Sanctioning Dishonest Marking

   As PCN traffic leaves the last network before the egress gateway
   (domain 'C' in Figure 4), the RE blanking fraction should match the
   congestion marking fraction, when averaged over a sufficiently long
   duration (perhaps ~10s, to allow a few rounds of feedback through
   regular signalling of new and refreshed reservations).

   To protect itself, domain 'C' should install a monitor at its
   egress.  The monitor aims to detect flows of PCN packets that are
   persistently negative.  If flows are positive, domain 'C' need take
   no action--this simply means an upstream network must be paying more
   penalties than it needs to.  Appendix A.3 gives a suggested
   algorithm for the monitor, meeting the criteria below.

   o  It SHOULD introduce minimal false positives for honest flows;

   o  It SHOULD quickly detect and sanction dishonest flows (minimal
      false negatives);

   o  It MUST be invulnerable to state exhaustion attacks from
      malicious sources.  For instance, if the dropper uses flow state,
      it should not be possible for a source to send numerous packets,
      each with a different flow ID, to force the dropper to exhaust
      its memory capacity;

   o  If drop is used as a sanction, it SHOULD introduce sufficient
      loss in goodput that malicious sources cannot play off losses in
      the egress dropper against higher allowed throughput.
      Salvatori [CLoop_pol] describes this attack, which involves the
      source understating path congestion then inserting forward error
      correction (FEC) packets to compensate for the expected losses.

   Note that the monitor operates on flows, but with careful design we
   can avoid per-flow state.
   This is why we have been careful to ensure that all flows MUST start
   with a packet marked with the FNE codepoint.  If a flow does not
   start with the FNE codepoint, a monitor is likely to treat it
   unfavourably.  This risk makes it worth setting the FNE codepoint at
   the start of a flow, even though there is a cost to setting FNE
   (positive `worth').

   Starting flows with an FNE packet also means that a monitor will be
   resistant to state exhaustion attacks from other networks, as the
   monitor can then be designed never to create state unless an FNE
   packet arrives.  And an FNE packet counts positive, so it will cost
   a network a lot to send many of them.

   Monitor algorithms will often maintain a moving average across flows
   of the fraction of RE-blanked packets.  When maintaining an average
   across flows, a monitor MUST ignore packets with the FNE codepoint
   set.  An ingress gateway sets the FNE codepoint when it does not
   have the benefit of feedback from the egress.  So counting FNE
   packets would be likely to make the average unnecessarily positive,
   providing headroom (or should we say footroom?) for dishonest
   (negative) traffic.

   If the monitor detects a persistently negative flow, it could drop
   sufficient negative and neutral packets to force the flow not to be
   negative.  This is the approach taken for the `egress dropper' in
   [I-D.briscoe-tsvwg-re-ecn-tcp], but for the scenario in this memo,
   where everyone would expect everyone else to keep to the protocol, a
   management alarm SHOULD be raised on detecting persistently negative
   traffic, and any automatic sanctions taken SHOULD be logged.  Even
   if the chosen policy is to take no automatic action, the cause can
   then be investigated manually.

   Then no ingress can understate downstream pre-congestion without its
   action being logged.
   So network operators can deal with offending networks at the human
   level, out of band.  As a last resort, perhaps where the ingress
   gateway address seems to have been spoofed in the signalling,
   packets can be dropped.  Drops could be focused on just sufficient
   packets in misbehaving flows to remove the negative bias while doing
   minimal harm.

   A future version of this memo may define a control message that
   could be used to notify an offending ingress gateway (possibly via
   the egress gateway) that it is sending persistently negative flows.
   However, we are aware that such messages could be used to test the
   sensitivity of the detection system, so currently we prefer silent
   sanctions.

   An extreme scenario would be where an ingress gateway (or set of
   gateways) mounted a DoS attack against another network.  If their
   traffic caused sufficient congestion to lead to drop, but they
   understated path congestion to avoid penalties for causing high
   congestion, the preferential drop recommendations in Section 4.3.4
   would at least ensure that these flows would always be dropped
   before honest flows.

5.6. Border Mechanisms

5.6.1. Border Accounting Mechanisms

   One of the main design goals of re-PCN was for border security
   mechanisms to be as simple as possible; otherwise they would become
   the pinch-points that limit the scalability of the whole
   internetwork.  As the title of this memo suggests, we want to avoid
   per-flow processing at borders.  We also want to keep to passive
   mechanisms that can monitor traffic in parallel to forwarding,
   rather than having to filter traffic inline--in series with
   forwarding.  As data rates continue to rise, we suspect that
   all-optical interconnection between networks will soon be a
   requirement.
   So we want to avoid any new need for buffering (even though border
   filtering is current practice for other reasons, we don't want to
   make it even less likely that we will ever get rid of it).

   So far, we have been able to keep the border mechanisms simple,
   despite having had to harden them against some subtle attacks on the
   re-PCN design.  The mechanisms are still passive and avoid per-flow
   processing, although we do use filtering as a fail-safe, to
   temporarily shield against extreme events in other networks, such as
   accidental misconfigurations (Section 5.6.3).

   The basic accounting mechanism at each border interface simply
   involves accumulating the volume of packets with positive worth
   (Re-PCT-Echo and FNE) and subtracting the volume of those with
   negative worth (AM(-1) and TM(-1)).  Even though this mechanism
   takes no regard of flows, over an accounting period (say a month)
   this subtraction will account for the downstream congestion caused
   by all the flows traversing the interface, wherever they come from
   and wherever they go to.  The two networks can agree to use this
   metric however they wish to determine some congestion-related
   penalty against the upstream network (see Section 5.3 for examples).
   Although the algorithm could hardly be simpler, it is spelled out in
   pseudo-code in Appendix A.2.1.

   Various attempts to subvert the re-ECN design have been made.  In
   all cases their root cause is persistently negative flows.  But,
   after describing these attacks, we will show that we don't actually
   have to get rid of all persistently negative flows in order to
   thwart the attacks.

   In honest flows, downstream congestion is measured as positive minus
   negative volume.  So if all flows are honest (i.e.
not persistently negative), adding all positive volume and all negative volume without regard to flows will give an aggregate measure of downstream congestion.  But such simple aggregation is only possible if no flows are persistently negative.  Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion.  The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 5.5 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows.  But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.

We rely on sanctions to deter dishonest understatement of congestion.  But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination.  A number of attacks have been identified where a sender gains from sending dummy traffic, or can attack someone or something using dummy traffic, even though it isn't communicating any information to anyone:

o  A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network.  This is a form of denial of service perpetrated by one network on another.
The preferential drop measures in Section 4.3.4 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, yet they generally don't, because of the grave consequences of being found out.  We are only concerned if re-PCN increases the motivation for such an attack, as in the next example.

o  A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour.  It could even initialise the TTL so it expired shortly after entering the neighbouring network, reducing the chance of detection further downstream.  This attack need not be motivated by a desire to deny service, and indeed need not cause denial of service.  A network's main motivator would most likely be to reduce the penalties it pays to a neighbour.  But the prospect of financial gain might tempt the network into mounting a DoS attack on the other network as well, given the gain would offset some of the risk of being detected.

Note that we have not included DoS by Internet hosts in the above list of attacks, because we have restricted ourselves to a scenario with edge-to-edge admission control across a PCN-region.  In this case, the edge ingress gateways insulate the PCN-region from DoS by Internet hosts.  Re-ECN resists more general DoS attacks, as discussed in [I-D.briscoe-tsvwg-re-ecn-tcp].

The first step towards a solution to all these problems with negative flows is to be able to estimate the contribution they make to downstream congestion at a border and to correct the measure accordingly.
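The bias that needs correcting can be seen with a few invented numbers.  This is a toy sketch: the per-flow volumes are made up for illustration, and the flow-blind subtraction mirrors the bulk border mechanism described above.

```python
# Invented per-flow volumes (bytes) of positively and negatively
# worthed packets crossing one border during an accounting period.
flows = {
    "honest-1": (300, 100),   # net +200: covers its downstream congestion
    "honest-2": (500, 200),   # net +300
    "cheat":    (50, 400),    # net -350: persistently negative
}

# Flow-blind subtraction, as done by the bulk border mechanism.
aggregate = sum(pos - neg for pos, neg in flows.values())

# What the metric would read if the negative flow were removed.
honest_only = sum(pos - neg for name, (pos, neg) in flows.items()
                  if name != "cheat")

print(aggregate, honest_only)  # 150 500
```

The aggregate still comes out positive (150), but it understates the downstream congestion caused by the honest traffic (500) by the cheat's net negative volume.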
Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border.  It is more important to get an unbiased estimate of their effect than to try to remove them all.  A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix A.2.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other.  This puts all networks on the same side (only with respect to negative flows, of course), rather than pitting them against each other.  The network where a flow goes negative, as well as all the networks downstream, loses out from not being reimbursed for any congestion this flow causes.  So they all have an interest in getting rid of these negative flows.  Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders--they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative.  The problem becomes localised: once a flow goes negative, each of the networks from that point downstream has a small problem, can detect that it has a problem, and can get rid of the problem if it chooses to.  But negative flows can no longer be used for any new attacks.

Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible.
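To make the two mechanisms concrete, here is a hedged Python sketch of the flow-blind border meter together with a sampled correction for negative flows.  It illustrates the ideas attributed above to Appendix A.2.1 and A.2.2; it does not reproduce the pseudo-code there, and the codepoint names, flow identifier, and sampling scheme are our assumptions.

```python
import random

POSITIVE = {"Re-PCT-Echo", "FNE"}   # packets of positive worth (+1)
NEGATIVE = {"AM(-1)", "TM(-1)"}     # packets of negative worth (-1)

class BorderMeter:
    """Flow-blind accounting at one border interface."""
    def __init__(self):
        self.volume = 0             # signed bytes over the accounting period

    def on_packet(self, codepoint, size):
        if codepoint in POSITIVE:
            self.volume += size
        elif codepoint in NEGATIVE:
            self.volume -= size     # neutral packets need no processing

class NegativeFlowEstimator:
    """Sampled estimate of the volume contributed by negative flows,
    to be added back to the bulk metric to remove its bias."""
    def __init__(self, sample_rate, seed=None):
        self.sample_rate = sample_rate
        self.balance = {}           # sampled flow-id -> signed byte balance
        self.skipped = set()        # flows deliberately left unsampled
        self.rng = random.Random(seed)

    def on_packet(self, flow_id, codepoint, size):
        if flow_id in self.skipped:
            return
        if flow_id not in self.balance:
            # decide once, on first sight, whether to sample this flow
            if self.rng.random() >= self.sample_rate:
                self.skipped.add(flow_id)
                return
            self.balance[flow_id] = 0
        if codepoint in POSITIVE:
            self.balance[flow_id] += size
        elif codepoint in NEGATIVE:
            self.balance[flow_id] -= size

    def correction(self):
        """Negative volume in the sample, scaled up by the sampling rate."""
        neg = sum(-b for b in self.balance.values() if b < 0)
        return neg / self.sample_rate
```

With a sampling rate of 1 (every flow monitored), a flow sending 300 positive and 100 negative bytes alongside one sending 50 positive and 400 negative bytes leaves the meter at -150, while the estimator's correction of 350 restores the honest flow's net contribution of +200.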
But importantly, complete eradication of negative flows is no longer critical--best endeavours will be sufficient.

Note that the guiding principle behind all the above discussion is that any gain from subverting the protocol should be precisely neutralised, rather than punished.  If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

For instance, if possible, flows should be removed as soon as they go negative, but we do NOT RECOMMEND any attempts to discard such flows further upstream while they are still positive.  Such over-zealous push-back is unnecessary and potentially dangerous.  These flows have paid their `fare' up to the point they go negative, so there is no harm in delivering them that far.  If someone downstream asks for a flow to be dropped as near to the source as possible, because they say it is going to become negative later, an upstream node cannot test the truth of this assertion.  Rather than have to authenticate such messages, re-PCN has been designed so that flows can be dropped solely on the basis of locally measurable evidence.  A message hinting that a flow should be watched closely to test for negativity is fine.  But not a message claiming that a currently positive flow will go negative later and should therefore be dropped.

5.6.2.  Competitive Routing

With the above penalty system, each domain seems to have a perverse incentive to fake pre-congestion.  For instance, domain 'B' profits from the difference between the penalties it receives at its ingress (its revenue) and those it pays at its egress (its cost).  So if 'B' overstates internal pre-congestion, it seems to increase its profit.
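That apparent arbitrage can be shown with invented numbers.  This is a toy sketch: the proportional penalty rule and the volumes are our assumptions, not part of this memo.  'B' is paid at its ingress for all pre-congestion downstream of that border ('B''s own plus 'C''s), but pays at its egress only for pre-congestion downstream of that border ('C''s), so its margin is exactly its own declared contribution.

```python
def b_margin(b_declared, c_declared, price_per_unit=1.0):
    """B's penalty margin under an invented proportional penalty rule."""
    revenue = price_per_unit * (b_declared + c_declared)  # paid to B at ingress
    cost = price_per_unit * c_declared                    # paid by B at egress
    return revenue - cost

honest = b_margin(b_declared=10, c_declared=5)   # margin == 10
faked = b_margin(b_declared=30, c_declared=5)    # overstating 3x: margin == 30
```

Tripling the declared internal pre-congestion triples 'B''s margin: this is the perverse incentive that competitive routing pressure is needed to keep in check.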
However, we can assume that domain 'A' could bypass 'B', routing through other domains to reach the egress.  So the competitive discipline of least-cost routing can ensure that any domain tempted to fake pre-congestion for profit risks losing _all_ its incoming traffic.  The least congested route would eventually win this competitive game, but only as long as it didn't declare more fake pre-congestion than the next most competitive route.

The competitive effect of interdomain routing might be weaker nearer to the egress.  For instance, 'C' may be the only route 'B' can take to reach the ultimate receiver.  And if 'C' over-penalises 'B', the egress gateway and the ultimate receiver seem to have no incentive to move their terminating attachment to another network, because only 'B' and those upstream of 'B' suffer the higher penalties.  However, we must remember that we are only looking at the money flows at the unidirectional network layer.  There are likely to be all sorts of higher level business models constructed over the top of these low level 'sender-pays' penalties.  For instance, we might expect a session layer charging model where the session originator pays for a pair of duplex flows, one as receiver and one as sender.  Traditionally this has been a common model for telephony, and we might expect it to be used, at least sometimes, for other media such as video.  Wherever such a model is used, the data receiver will be directly affected if its sessions terminate through a network like 'C' that fakes congestion to over-penalise 'B'.  So end-customers will experience a direct competitive pressure to switch to cheaper networks, away from networks like 'C' that try to over-penalise 'B'.

This memo does not need to standardise any particular mechanism for routing based on re-PCN.
Goldenberg et al [Smart_rtg] refer to various commercial products and present their own algorithms for moving traffic between multi-homed routes based on usage charges.  None of these systems require any changes to standard protocols, because the choice between the available border gateway protocol (BGP) routes is based on a combination of local knowledge of the charging regime and local measurement of traffic levels.  If, as we propose, charges or penalties were based on the level of re-PCN measured locally in passing traffic, a similar optimisation could be achieved without requiring any changes to standard routing protocols.

We must be clear that applying pre-congestion-based routing to this admission control system remains an open research issue.  Traffic engineering based on congestion requires careful damping to avoid oscillations, and should not be attempted without adult supervision :)  Mortier & Pratt [ECN-BGP] have analysed traffic engineering based on congestion, but without the benefit of re-ECN or re-PCN they had to add a path attribute to BGP to advertise a route's downstream congestion (actually they proposed that BGP should advertise the charge for congestion, which we believe wrongly embeds into BGP the assumption that the only thing to do with congestion is charge for it).

5.6.3.  Fail-safes

The mechanisms described so far create incentives for rational operators to behave.  That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits).  It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not).  But this approach does not protect against the misconfigurations and accidents of other operators.
Therefore, we propose the following two similar mechanisms at a network's borders to provide "defence in depth":

Highly positive flows:  A small sample of positive packets should be picked randomly as they cross a border interface.  Then subsequent packets matching the same source and destination address and DSCP should be monitored.  If the fraction of positive marking is well above a threshold (to be determined by operational practice), a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Persistently negative flows:  A small sample of congestion marked packets should be picked randomly as they cross a border interface.  Then subsequent packets matching the same source and destination address and DSCP should be monitored.  If the RE blanking fraction minus the congestion marking fraction is persistently negative, a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Both these mechanisms rely on the fact that highly positive (or negative) flows will appear in the sample more quickly, because selection is made randomly solely from positive (or negative) packets.

Note that there is no assumption that _users_ behave rationally.  The system is protected from the vagaries of irrational user behaviour by the ingress gateways, which transform internal penalties into a deterministic admission control mechanism that prevents users from misbehaving, by directly engineered means.

6.  Analysis

The domains in Figure 1 are not expected to be completely malicious towards each other.  After all, we can assume that they are all co-operating to provide an internetworking service to the benefit of each of them and their customers.  Otherwise their routing policies would not interconnect them in the first place.
However, we assume that they are also competitors of each other.  So a network may try to contravene our proposed protocol if it would gain, or make a competitor lose, or both--but only if it can do so without being caught.  Therefore we do not have to consider every possible random attack one network could launch on the traffic of another, given that one network can always drop or corrupt packets that it forwards on behalf of another anyway.

Therefore, we only consider new opportunities for _gainful_ attack that our proposal introduces.  But to a certain extent we can also rely on the defence-in-depth measures we have described (Section 5.6.3), which are intended to mitigate the potential impact if one network accidentally misconfigures the workings of this protocol.

The ingress and egress gateways are shown in the most generic arrangement possible in Figure 1, without any surrounding network.  This allows us to consider more specific cases where these gateways and a neighbouring network are operated by the same player.  As well as cases where the same player operates neighbouring networks, we will also consider cases where the two gateways collude as one player and where the sender and receiver collude as one.  Collusion of other sets of domains is less likely, but we will consider such cases.  In the general case, we will assume none of the nine trust domains across the figure fully trusts any of the others.

As we only propose to change routers within the PCN-region, we assume the operators of networks outside the region will be doing per-flow policing.  That is, we assume the networks outside the PCN-region and the gateways around its edges can protect themselves.
So, given we are proposing to remove flow policing from some networks, our primary concern must be to protect the networks that don't do per-flow policing (the potential `victims') from those that do (the `enemy').  The ingress and egress gateways are the only way the outer enemy can get at the middle victim, so we can consider the gateways as the representatives of the enemy as far as domains 'A', 'B' and 'C' are concerned.  We will call this trust scenario `edges against middles'.

Earlier in this memo, we outlined the classic border rate policing problem (Section 3).  It will now be useful to reiterate the motivations that are the root cause of the problem.  The more reservations a gateway can allow, the more revenue it receives.  The middle networks want the edges to comply with the admission control protocol when they become so congested that their service to others might suffer.  The middle networks also want to ensure the edges cannot steal more service from them than they are entitled to.

In the context of this `edges against middles' scenario, the re-PCN protocol has two main effects:

o  The more pre-congestion there is on a path across the PCN-region, the higher the ingress gateway must declare downstream pre-congestion.

o  If the ingress gateway does not declare downstream pre-congestion high enough on average, it will `hit the ground before the runway', going negative and triggering sanctions, either directly against the traffic or against the ingress gateway at a management level.

An executive summary of our security analysis can be stated in three parts, distinguished by the type of collusion considered.

Neighbour-only Middle-Middle Collusion:  Here either there is no collusion, or collusion is limited to neighbours in the feedback loop.  In other words, two neighbouring networks can be assumed to act as one.
Or the egress gateway might collude with domain 'C'.  Or the ingress gateway might collude with domain 'A'.  Or the ingress and egress gateways might collude with each other.

In these cases where only neighbours in the feedback loop collude, we conclude that all parties have a positive incentive to declare downstream pre-congestion truthfully, and the ingress gateway has a positive incentive to invoke admission control when congestion rises above the admission threshold in any network in the region (including its own).  No party has an incentive to send more traffic than declared in reservation signalling (even though only the gateways read this signalling).  In short, no party can gain at the expense of another.

Non-neighbour Middle-Middle Collusion:  In the case of other forms of collusion between middle networks (e.g. between domains 'A' and 'C'), it would be possible for, say, 'A' and 'C' to create a tunnel between themselves so that 'A' would gain at the expense of 'B'.  But 'C' would then lose the gain that 'A' had made.  Therefore the value to 'A' and 'C' of colluding to mount this attack seems questionable.  It is made more questionable because the attack can be statistically detected by 'B' using the second `defence in depth' mechanism mentioned already.  Note that 'C' can defend itself from being attacked through a tunnel by treating the tunnel end point as a direct link to a neighbouring network (e.g. as if 'A' were a neighbour of 'C', via the tunnel), which falls back to the safety of the neighbour-only scenario.

Middle-Edge Collusion:  Collusion between networks or gateways within the PCN-region and networks or users outside the region has not yet been fully analysed.  The presence of full per-flow policing at the ingress gateway seems to make this a less likely source of a successful attack.
{ToDo: Due to lack of time, the full write-up of the security analysis is deferred to the next version of this memo.}

Finally, it is well known that the best person to analyse the security of a system is not its designer.  Therefore, our confident claims must be hedged with doubt until others, with perhaps a greater incentive to break the system, have mounted a full analysis.

7.  Incremental Deployment

We believe ECN has so far not been widely deployed because it requires both end-system and widespread network deployment just to achieve a marginal improvement in performance.  The ability to offer a new service (admission control) would be a much stronger driver for ECN deployment.

As stated in the introduction, the aim of this memo is to "Design in security from the start" when admission control is based on pre-congestion notification.  The proposal has been designed so that security can be added some time after first deployment, but only if the PCN wire protocol encoding is defined with the foresight to accommodate the extended set of codepoints defined in this document.  Given that admission control based on pre-congestion notification requires few changes to standards, it should be deployable fairly soon.  However, re-PCN requires a change to IP, which may take a little longer :)

We expect that initial deployments of PCN-based admission control will be confined to single networks, or to clubs of networks that trust each other.  The proposal in this memo will only become relevant once networks with conflicting interests wish to interconnect their admission controlled services, but without the scalability constraints of per-flow border policing.  It will not be possible to use re-PCN, even in a controlled environment between consenting operators, unless it is standardised into IP.
Given that the IPv4 header has limited space for further changes, current IESG policy [RFC4727] is not to allow experimental use of codepoints in the IPv4 header, because whenever an experiment isn't taken up, the space it used tends to be impossible to reclaim.  Therefore, for IPv4 at least, we will need to find a way to run an experiment so that the header fields it uses can be reclaimed if the experiment is not a success.

If PCN-based admission control is deployed before re-PCN is standardised into IP, wherever a network (or club of networks) connects to another network (or club of networks) with conflicting interests, they will place a gateway between the two regions that does per-flow rate policing and admission control.  If re-PCN is eventually standardised into IP, it will be possible for these separate regions to upgrade all their ingress gateways to support re-PCN before removing the per-flow policing gateways between them.  Given the edge-to-edge deployment model of PCN-based admission control, it is reasonable to expect that incremental deployment of re-PCN will be feasible on a domain-by-domain basis, without needing to cater for partial deployment of re-PCN in just some of the gateways around one PCN-domain.

Nonetheless, if the upgrade of one ingress gateway is accidentally overlooked, the RE flag has been defined the safe way round for the default legacy behaviour (leaving RE cleared as "0").  A legacy ingress will appear to be declaring a high level of pre-congestion into the aggregate.  The fail-safe border mechanism in Section 5.6.3 might trigger management alarms (which would help in tracking down the need to upgrade the ingress), but all packets would continue to be delivered safely, because overstatement of downstream congestion requires no sanction.
Only the ingress edge gateways around a PCN-region have to be upgraded to add re-PCN support, not interior routers.  It is also necessary to add the mechanisms that monitor re-PCN to secure a network against misbehaving gateways and networks.  Specifically, these are the border mechanisms (Section 5.6) and the mechanisms to sanction dishonest marking (Section 5.5).

We also RECOMMEND adding improvements to forwarding on interior routers (Section 4.3.4).  But the system works whether all, some or none are upgraded, so interior routers may be upgraded in a piecemeal fashion at any time.

8.  Design Choices and Rationale

The primary insight of this work is that downstream congestion is the metric that would be most useful to control an internetwork, and particularly to police how one network responds to the congestion it causes in a remote network.  This is the problem that has previously made it so hard to provide scalable admission control.

The case for using re-feedback (a generalisation of re-ECN) to police congestion response and provide QoS is made in [Re-fb].  Essentially, the insight is that congestion is a factor that crosses layers, from the physical upwards.  Therefore re-feedback polices congestion as it crosses the physical interface between networks.  This is achieved by bringing information about congestion of resources later on the path to the interface, rather than trying to deal with congestion where it happens by examining the notoriously unreliable source address in packets.  Then congestion crossing the physical interface at a border can be policed at the interface, rather than policing the congestion on packets that claim to come from an address (which may be spoofed).  Also, re-feedback works in the network layer independently of other layers--despite its name, re-feedback does not actually require feedback.
It makes a source act conservatively before it gets feedback.

On the subject of lack of feedback, the feedback not established (FNE) codepoint is motivated by arguments for a state set-up bit in IP to prevent state exhaustion attacks.  This idea was first put forward informally by David Clark and developed by Handley and Greenhalgh in [Steps_DoS].  The idea is that network layer datagrams should signal explicitly when they require state to be created in the network layer or the layer above (e.g. at flow start).  Then a node can refuse to create any state unless a datagram declares this intent.  We believe the proposed FNE codepoint serves the same purpose as the proposed state set-up bit, but it has been overloaded with a more specific purpose, using it on more packets than just the first in a flow, but never fewer (i.e. it is idempotent).  In effect, the FNE codepoint serves the purpose of a `soft-state set-up codepoint'.

The re-feedback paper [Re-fb] also makes the case for converting the economic interpretation of congestion into a hard engineering mechanism, which is the basis of the approach used in this memo.  The admission control gateways around the PCN-region use hard engineering, not incentives, to prevent end users from sending more traffic than they have reserved.  Incentive-based mechanisms are only used between networks, because networks can be expected to respond to incentives more rationally than end-users can.  However, even then, a network can use fail-safes to protect itself from excessively unusual behaviour by neighbouring networks, whether due to accidental misconfiguration or malicious intent.

The guiding principle behind the incentive-based approach used between networks is that any gain from subverting the protocol should be precisely neutralised, rather than punished.
If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others.

The re-feedback paper also makes the case against the use of congestion charging to police congestion if it is based on classic feedback (where only upstream congestion is visible to network elements).  It argues this would open up receiving networks to `denial of funds' attacks and would require end users to accept dynamic pricing (which few would).

Re-PCN has been deliberately designed to simplify policing at the borders between networks.  These trust boundaries are the critical pinch-points that will limit the scalability of the whole internetwork unless the overall design minimises the complexity of security functions at these borders.  The border mechanisms described in this memo run passively, in parallel to data forwarding, and they do not require per-flow processing.

9.  Security Considerations

This whole memo concerns the security of a scalable admission control system, in particular the analysis section (Section 6).  Below, some specific security issues are mentioned that did not belong elsewhere, or that comment on the overall robustness of the security provided by the design.

Firstly, we must repeat the statement of applicability in the analysis: we only consider new opportunities for _gainful_ attack that our proposal introduces, particularly if the attacker can avoid being identified.  Despite only involving a few bits, there is sufficient complexity in the whole system that there are probably numerous possibilities for other attacks.  However, as far as we are aware, none reaps any benefit for the attacker.
For instance, it would be possible for a downstream network to remove the congestion markings introduced by an upstream network, but it would only lose out on the penalties it could apply to a downstream network.

When one network forwards a neighbouring network's traffic, it will always be possible to cause damage by dropping or corrupting it.  Therefore we do not believe networks would set their routing policies to interconnect in the first place if they didn't trust the other networks not to damage their traffic arbitrarily.

Having said this, we do want to highlight some of the weaker parts of our argument.

o  We have argued that networks will be dissuaded from faking congestion marking by the possibility that upstream networks will route round them.  As we have said, these arguments are based on fairly delicate assumptions and will remain fairly tenuous until proved in practice, particularly close to the egress, where less competitive routing is likely.

o  Given the congestion feedback system is piggy-backed on flow signalling, which can be fairly infrequent, sanctions may not be appropriate until a flow has been persistently negative for perhaps 20s.  This may allow brief attacks to go unpunished.  However, vulnerability to brief attacks may be reduced if the egress triggers asynchronous feedback when the congestion level on an aggregate has risen sufficiently since the last feedback, rather than waiting for the next opportunity to piggy-back on a signal.

o  We should also point out that the approach in this memo was only designed to be robust for admission control.  We do not claim the incentives will always be strong enough to force correct flow termination behaviour.  This is because a user will tend to perceive much greater loss in value if a flow is terminated than if admission is denied at the start.
However, in general the incentives for correct flow termination are similar to those for admission control.

Finally, it may seem that the 8 codepoints made available by extending the ECN field with the RE flag have been used rather wastefully.  In effect, the RE flag has been used as an orthogonal single bit in nearly all cases, the only exception being when the ECN field is cleared to "00".  The mapping of the codepoints in an earlier version of this proposal used the codepoint space more efficiently, but the scheme became vulnerable to a network operator focusing its congestion marking so as to mark more positive than neutral packets in order to reduce its penalties (see Appendix B of [I-D.briscoe-tsvwg-re-ecn-tcp]).

With the scheme as now proposed, once the RE flag is set or cleared by the sender or its proxy, it should not be written by the network, only read.  So the gateways can detect if any network maliciously alters the RE flag.  IPSec AH integrity checking does not cover the IPv4 option flags (they were considered mutable--even the one we propose using for the RE flag, which was `currently unused' when IPSec was defined).  But it would be sufficient for a pair of gateways to make random checks on whether the RE flag was the same when it reached the egress gateway as when it left the ingress.  Indeed, if IPSec AH had covered the RE flag, any network intending to alter sufficient RE flags to make a gain would have focused its alterations on packets without authentication headers (AHs).

Therefore, no cryptographic algorithms have been exploited in the making of this proposal.

10.  IANA Considerations

This memo includes no request to IANA.

11.  Conclusions

This memo solves the classic problem of making flow admission control scale to any size network.
It builds on a technique called PCN, which uses Diffserv within a
domain and pre-congestion notification feedback to control admission
into each network path across the domain [I-D.ietf-pcn-architecture].

Without PCN, Diffserv requires over-provisioning that must grow
linearly with network diameter to cater for variation in the traffic
matrix.  However, even with PCN, multiple network domains can only
join together into one larger PCN region if all domains trust each
other to comply with the protocols, invoking admission control and
flow termination when requested.  Domains could join together and
still police flows at their borders by requiring reservation
signalling to touch each border and only using PCN internally to each
domain.  But the per-flow processing at borders would still limit
scalability.

Instead, this memo proposes a technique called re-PCN which enables a
PCN region to extend across multiple domains, without unscalable per-
flow processing at borders, and still without the need for linear
growth in capacity over-provisioning as the hop-diameter of the
Diffserv region grows.

We propose that the congestion feedback used for PCN-based admission
control should be re-echoed into the forward data path, by making a
trivial modification to the ingress gateway.  We then explain how the
resulting downstream pre-congestion metric in packets can be
monitored in bulk at borders to sufficiently emulate flow-rate
policing.

We claim the result of combining these two approaches is an admission
control system that scales to any size network _and_ any number of
interconnected networks, even if they all act in their own interests.
This proposal aims to convince its readers to "design in security
from the start", by ensuring the PCN wire protocol encoding can
accommodate the extended set of codepoints defined in this document,
even if per-flow policing is used at first rather than the bulk
border policing described here.  This way, we will not build
ourselves tomorrow's legacy problem.

Re-echoing congestion feedback is based on a principled technique
called re-ECN [I-D.briscoe-tsvwg-re-ecn-tcp], designed to add
accountability for causing congestion to the general-purpose IP
datagram service.  Re-ECN proposes to consume the last completely
unused bit in the basic IPv4 header; in IPv6 it uses an extension
header.

12.  Acknowledgements

All the following have given helpful comments either on re-PCN or on
relevant parts of re-ECN that re-PCN uses: Arnaud Jacquet, Alessandro
Salvatori, Steve Rudkin, David Songhurst, John Davey, Ian Self,
Anthony Sheppard, Carla Di Cairano-Gilfedder (BT), Mark Handley (who
identified the excess canceled packets attack), Stephen Hailes, Adam
Greenhalgh (UCL), Francois Le Faucheur, Anna Charny (Cisco), Jozef
Babiarz, Kwok-Ho Chan, Corey Alexander (Nortel), David Clark, Bill
Lehr, Sharon Gillett, Steve Bauer (MIT) (who publicised various dummy
traffic attacks), Sally Floyd (ICIR), and participants in the
CFP/CRN Inter-Provider QoS, Broadband and DoS-Resistant Internet
working groups.

13.  Comments Solicited

Comments and questions are encouraged and very welcome.  They can be
addressed to the IETF Congestion and Pre-Congestion Notification
working group's mailing list, and/or to the author(s).

14.  References

14.1.  Normative References

[I-D.briscoe-tsvwg-ecn-tunnel]
           Briscoe, B., "Layered Encapsulation of Congestion
           Notification", draft-briscoe-tsvwg-ecn-tunnel-01 (work in
           progress), July 2008.
[I-D.briscoe-tsvwg-re-ecn-tcp]
           Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
           "Re-ECN: Adding Accountability for Causing Congestion to
           TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-06 (work in
           progress), August 2008.

[I-D.eardley-pcn-marking-behaviour]
           Eardley, P., "Marking behaviour of PCN-nodes",
           draft-eardley-pcn-marking-behaviour-01 (work in progress),
           June 2008.

[I-D.moncaster-pcn-baseline-encoding]
           Moncaster, T., Briscoe, B., and M. Menth, "Baseline
           Encoding and Transport of Pre-Congestion Information",
           draft-moncaster-pcn-baseline-encoding-02 (work in
           progress), July 2008.

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2211]  Wroclawski, J., "Specification of the Controlled-Load
           Network Element Service", RFC 2211, September 1997.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
           of Explicit Congestion Notification (ECN) to IP",
           RFC 3168, September 2001.

[RFC3246]  Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec,
           J., Courtney, W., Davari, S., Firoiu, V., and D.
           Stiliadis, "An Expedited Forwarding PHB (Per-Hop
           Behavior)", RFC 3246, March 2002.

[RFC4774]  Floyd, S., "Specifying Alternate Semantics for the
           Explicit Congestion Notification (ECN) Field", BCP 124,
           RFC 4774, November 2006.

14.2.  Informative References

[CLoop_pol]
           Salvatori, A., "Closed Loop Traffic Policing", Politecnico
           Torino and Institut Eurecom Masters Thesis,
           September 2005.

[ECN-BGP]  Mortier, R. and I. Pratt, "Incentive Based Inter-Domain
           Routeing", Proc Internet Charging and QoS Technology
           Workshop (ICQT'03) pp308--317, September 2003.
[I-D.arumaithurai-nsis-pcn]
           Arumaithurai, M., "NSIS PCN-QoSM: A Quality of Service
           Model for Pre-Congestion Notification (PCN)",
           draft-arumaithurai-nsis-pcn-00 (work in progress),
           September 2007.

[I-D.charny-pcn-single-marking]
           Charny, A., Zhang, X., Faucheur, F., and V. Liatsos,
           "Pre-Congestion Notification Using Single Marking for
           Admission and Termination",
           draft-charny-pcn-single-marking-03 (work in progress),
           November 2007.

[I-D.ietf-nsis-rmd]
           Bader, A., "RMD-QOSM - The Resource Management in Diffserv
           QOS Model", draft-ietf-nsis-rmd-12 (work in progress),
           November 2007.

[I-D.ietf-pcn-architecture]
           Eardley, P., "Pre-Congestion Notification (PCN)
           Architecture", draft-ietf-pcn-architecture-06 (work in
           progress), September 2008.

[I-D.ietf-tsvwg-admitted-realtime-dscp]
           Baker, F., Polk, J., and M. Dolly, "DSCPs for Capacity-
           Admitted Traffic",
           draft-ietf-tsvwg-admitted-realtime-dscp-04 (work in
           progress), February 2008.

[IXQoS]    Briscoe, B. and S. Rudkin, "Commercial Models for IP
           Quality of Service Interconnect", BT Technology Journal
           (BTTJ) 23(2)171--195, April 2005.

[QoS_scale]
           Reid, A., "Economics and Scalability of QoS Solutions", BT
           Technology Journal (BTTJ) 23(2)97--117, April 2005.

[RFC2205]  Braden, B., Zhang, L., Berson, S., Herzog, S., and S.
           Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1
           Functional Specification", RFC 2205, September 1997.

[RFC2207]  Berger, L. and T. O'Malley, "RSVP Extensions for IPSEC
           Data Flows", RFC 2207, September 1997.

[RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
           M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
           ReSerVation Protocol (RSVP) Version 1 Applicability
           Statement Some Guidelines on Deployment", RFC 2208,
           September 1997.

[RFC2747]  Baker, F., Lindell, B., and M.
Talwar, "RSVP Cryptographic Authentication", RFC 2747,
           January 2000.

[RFC2998]  Bernet, Y., Ford, P., Yavatkar, R., Baker, F., Zhang, L.,
           Speer, M., Braden, R., Davie, B., Wroclawski, J., and E.
           Felstaine, "A Framework for Integrated Services Operation
           over Diffserv Networks", RFC 2998, November 2000.

[RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
           Congestion Notification (ECN) Signaling with Nonces",
           RFC 3540, June 2003.

[RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
           Internet Protocol", RFC 4301, December 2005.

[RFC4727]  Fenner, B., "Experimental Values In IPv4, IPv6, ICMPv4,
           ICMPv6, UDP, and TCP Headers", RFC 4727, November 2006.

[RFC5129]  Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion
           Marking in MPLS", RFC 5129, January 2008.

[RSVP-ECN]
           Le Faucheur, F., Charny, A., Briscoe, B., Eardley, P.,
           Babiarz, J., and K. Chan, "RSVP Extensions for Admission
           Control over Diffserv using Pre-congestion Notification",
           draft-lefaucheur-rsvp-ecn-01 (work in progress),
           June 2006.

[Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
           Salvatori, A., Soppera, A., and M. Koyabe, "Policing
           Congestion Response in an Internetwork Using Re-Feedback",
           ACM SIGCOMM CCR 35(4)277--288, August 2005.

[Smart_rtg]
           Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
           "Optimizing Cost and Performance for Multihoming", ACM
           SIGCOMM CCR 34(4)79--92, October 2004.

[Steps_DoS]
           Handley, M. and A. Greenhalgh, "Steps towards a DoS-
           resistant Internet Architecture", Proc. ACM SIGCOMM
           workshop on Future directions in network architecture
           (FDNA'04) pp 49--56, August 2004.

Appendix A.  Implementation

A.1.
Ingress Gateway Algorithm for Blanking the RE Flag

The ingress gateway receives regular feedback 'PCN-feedback-
information' reporting the fraction of congestion-marked octets for
each aggregate arriving at the egress.  So for each aggregate it
should blank the RE flag on this fraction of octets.  A suitable
pseudo-code algorithm for the ingress gateway is as follows:

====================================================================
for each PCN-capable packet {
    if (RAND(0,1) <= PCN-feedback-information)
        writeRE(0);
    else
        writeRE(1);
}
====================================================================

A.2.  Downstream Congestion Metering Algorithms

A.2.1.  Bulk Downstream Congestion Metering Algorithm

To meter the bulk amount of downstream pre-congestion in traffic
crossing an inter-domain border, an algorithm is needed that
accumulates the size of positive packets and subtracts the size of
negative packets.  We maintain two counters:

V_b:  accumulated pre-congestion volume

B:  total data volume (in case it is needed)

A suitable pseudo-code algorithm for a border router is as follows:

====================================================================
V_b = 0
B = 0
for each PCN-capable packet {
    b = readLength(packet)    /* set b to packet size */
    B += b                    /* accumulate total volume */
    if (readEPCN(packet) == Re-PCT-Echo ||
        readEPCN(packet) == FNE) {
        V_b += b              /* increment... */
    } else if (readEPCN(packet) == AM(-1) ||
               readEPCN(packet) == TM(-1)) {
        V_b -= b              /* ...or decrement V_b... */
    }                         /* ...depending on EPCN field */
}
====================================================================

At the end of an accounting period this counter V_b represents the
pre-congestion volume that penalties could be applied to, as
described in Section 5.3.
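For concreteness, the bulk metering loop above can be sketched in
Python.  This is an illustrative translation only: packets are
modelled here as (marking, length) tuples, and the marking names
simply mirror the pseudo-code; a real border router would read the
marking from the EPCN field of each forwarded packet.

```python
# Illustrative sketch of the bulk downstream congestion meter.
# Packets are modelled as (marking, length) tuples; the marking names
# mirror the pseudo-code above and are not a real packet-capture API.

POSITIVE = {"Re-PCT-Echo", "FNE"}   # markings that increment V_b
NEGATIVE = {"AM(-1)", "TM(-1)"}     # markings that decrement V_b

def meter(packets):
    """Return (V_b, B): accumulated pre-congestion volume and total
    data volume, in bytes, over one accounting period."""
    V_b = 0
    B = 0
    for marking, length in packets:
        B += length                  # accumulate total volume
        if marking in POSITIVE:
            V_b += length            # positive packets add...
        elif marking in NEGATIVE:
            V_b -= length            # ...negative packets subtract
        # neutral markings leave V_b unchanged
    return V_b, B
```

For example, a positive 1500-byte packet, a negative 500-byte packet
and a neutral 1000-byte packet would leave V_b = 1000 and B = 3000.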
For instance, the accumulated volume of pre-congestion through a
border interface over a month might be V_b = 5 TB (terabyte = 10^12
byte).  This might have resulted from an average downstream pre-
congestion level of 0.001% on an accumulated total data volume of
B = 500 PB (petabyte = 10^15 byte).

A.2.2.  Inflation Factor for Persistently Negative Flows

The following process is suggested to complement the simple algorithm
above, in order to protect against the various attacks from
persistently negative flows described in Section 5.6.1.  As explained
in that section, the most important first step is to estimate the
contribution of persistently negative flows to the bulk volume of
downstream pre-congestion, and to inflate this bulk volume as if
these flows weren't there.  The process below has been designed to
give an unbiased estimate, but it may be possible to define other
processes that achieve similar ends.

While the simple metering algorithm (Appendix A.2.1) is counting the
bulk of traffic over an accounting period, the meter should also
select a subset of the whole flow ID space that is small enough to
measure realistically but large enough to give a representative
sample.  Many different samples of different subsets of the ID space
should be taken at different times during the accounting period,
preferably covering the whole ID space.  During each sample, the
meter should count the volume of positive packets and subtract the
volume of negative packets, maintaining a separate account for each
flow in the sample.  Each sample should run a lot longer than the
large majority of flows, to avoid a bias from missing the starts and
ends of flows, which tend to be positive and negative respectively.
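The per-flow accounting within one sample might be sketched as
follows.  This is a hypothetical illustration, assuming the same
(flow_id, marking, length) packet model as the metering sketch above;
the in_sample predicate and helper names are inventions for this
example, not part of the draft.

```python
# Hypothetical sketch of per-flow accounting within one sample of the
# flow-ID space.  Packets are (flow_id, marking, length) tuples; the
# in_sample predicate selects the sampled subset of the flow-ID space.

POSITIVE = {"Re-PCT-Echo", "FNE"}
NEGATIVE = {"AM(-1)", "TM(-1)"}

def sample_accounts(packets, in_sample):
    """Signed volume account per sampled flow: positive packets add
    to a flow's account, negative packets subtract from it."""
    accounts = {}
    for flow_id, marking, length in packets:
        if not in_sample(flow_id):
            continue                 # flow outside this sample
        if marking in POSITIVE:
            accounts[flow_id] = accounts.get(flow_id, 0) + length
        elif marking in NEGATIVE:
            accounts[flow_id] = accounts.get(flow_id, 0) - length
    return accounts

def sample_totals(accounts):
    """Return (V_bI, V_fI): the total over all accounts in the
    sample, and the total excluding flows with a negative account."""
    V_bI = sum(accounts.values())
    V_fI = sum(v for v in accounts.values() if v >= 0)
    return V_bI, V_fI
```

Accumulating these per-sample totals over many samples yields the
quantities from which an unbiased inflation factor can be formed.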
Once the accounting period finishes, the meter should calculate the
total of the accounts V_{bI} for the subset of flows I in the sample,
and the total of the accounts V_{fI} excluding flows with a negative
account from the subset I.  Then the weighted mean over all these
samples should be taken:

   a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}

If V_b is the result of the bulk accounting algorithm over the
accounting period (Appendix A.2.1), it can be inflated by this factor
a_S to give a good unbiased estimate of the volume of downstream
congestion over the accounting period, a_S * V_b, without being
polluted by the effect of persistently negative flows.

A.3.  Algorithm for Sanctioning Negative Traffic

{ToDo: Write up algorithms similar to Appendix E of
[I-D.briscoe-tsvwg-re-ecn-tcp] for the negative flow monitor with
flow management algorithm and the variant with bounded flow state.}

Author's Address

Bob Briscoe
BT & UCL
B54/77, Adastral Park
Martlesham Heath
Ipswich  IP5 3RE
UK

Phone: +44 1473 645196
Email: bob.briscoe@bt.com
URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Full Copyright Statement

Copyright (C) The IETF Trust (2008).

This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights.  Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard.  Please address the information to the IETF at
ietf-ipr@ietf.org.

Acknowledgment

This document was produced using xml2rfc v1.33 (of
http://xml.resource.org/) from a source in RFC-2629 XML format.