

2	Transport Area Working Group                                  B. Briscoe
3	Internet-Draft                                                  BT & UCL
4	Intended status: Standards Track                              A. Jacquet
5	Expires: July 13, 2008                                      T. Moncaster
6	                                                                A. Smith
7	                                                                      BT
8	                                                        January 10, 2008

10	     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
11	                   draft-briscoe-tsvwg-re-ecn-tcp-05

13	Status of this Memo

15	   By submitting this Internet-Draft, each author represents that any
16	   applicable patent or other IPR claims of which he or she is aware
17	   have been or will be disclosed, and any of which he or she becomes
18	   aware will be disclosed, in accordance with Section 6 of BCP 79.

20	   Internet-Drafts are working documents of the Internet Engineering
21	   Task Force (IETF), its areas, and its working groups.  Note that
22	   other groups may also distribute working documents as Internet-
23	   Drafts.

25	   Internet-Drafts are draft documents valid for a maximum of six months
26	   and may be updated, replaced, or obsoleted by other documents at any
27	   time.  It is inappropriate to use Internet-Drafts as reference
28	   material or to cite them other than as "work in progress."

30	   The list of current Internet-Drafts can be accessed at
31	   http://www.ietf.org/ietf/1id-abstracts.txt.

33	   The list of Internet-Draft Shadow Directories can be accessed at
34	   http://www.ietf.org/shadow.html.

36	   This Internet-Draft will expire on July 13, 2008.

38	Copyright Notice

40	   Copyright (C) The IETF Trust (2008).

42	Abstract

44	   This document introduces a new protocol for explicit congestion
45	   notification (ECN), termed re-ECN, which can be deployed
46	   incrementally around unmodified routers.  The protocol arranges an
47	   extended ECN field in each packet so that, as it crosses any
48	   interface in an internetwork, it will carry a truthful prediction of
49	   congestion on the remainder of its path.  Then the upstream party at
50	   any trust boundary in the internetwork can be held responsible for
51	   the congestion they cause, or allow to be caused.  So, networks can
52	   introduce straightforward accountability and policing mechanisms for
53	   incoming traffic from end-customers or from neighbouring network
54	   domains.  The purpose of this document is to specify the re-ECN
55	   protocol at the IP layer and to give guidelines on any consequent
56	   changes required to transport protocols.  It includes the changes
57	   required to TCP both as an example and as a specification.  It also
58	   gives examples of mechanisms that can use the protocol to ensure data
59	   sources respond correctly to congestion.  And it describes example
60	   mechanisms that ensure the dominant selfish strategy of both network
61	   domains and end-points will be to set the extended ECN field
62	   honestly.

64	Authors' Statement: Status (to be removed by the RFC Editor)

66	   Although the re-ECN protocol is intended to make a simple but far-
67	   reaching change to the Internet architecture, the most immediate
68	   priority for the authors is to delay any move of the ECN nonce to
69	   Proposed Standard status.  The argument for this position is
70	   developed in Appendix I.

72	Changes from previous drafts (to be removed by the RFC Editor)

74	   Full diffs created using the rfcdiff tool are available at
75	   

77	   From -04 to -05 (current version):

79	      Completed justification for packet marking with FNE during
80	      slow-start (Appendix D).

82	      Minor editorial changes throughout.

84	   From -03 to -04:

86	      Clarified reasons for holding back ECN nonce (Section 3.2 &
87	      Appendix I).

89	      Clarified Figure 1.

91	      Added Section 4.1.1.1 on equivalence of drops and ECN marks.

93	      Improved precision of Section 5.6 on IP in IP tunnels.

95	      Explained that RTT fairness is possible to enforce, but unlikely
96	      to be required (Section 6.1.3 & Appendix F).

98	      Explained that bulk per-user policing should be adequate but per-
99	      flow policing is also possible if desired, though it is not likely
100	      to be necessary (Section 6.1.5 & Appendix G).

102	      Reinforced need for passive policing at inter-domain borders to
103	      enable all-optical networking (Section 6.1.6).

105	      Minor editorial changes throughout.

107	   From -02 to -03:

109	      Started guidelines for re-ECN support in DCCP and SCTP.

111	      Added annex on limitations of nonce mechanism.

113	      Minor editorial changes throughout.

115	   From -01 to -02:

117	      Explanation on informal terminology in Section 3.4 clarified.

119	      IPv6 wire protocol encoding added (Section 5.2).

121	      Text on (non-)issues with tunnels, encryption and link layer
122	      congestion notification added (Section 5.6 & Section 5.7).

124	      Section added giving evolvability arguments against encouraging
125	      bottleneck policing (Section 6.1.2).  And text on re-ECN's
126	      evolvability by design added to Section 6.1.3.

128	      Text on inter-domain policing (Section 6.1.6) and inter-domain
129	      fail-safes (Section 6.1.7) added.

131	   From -00 to -01:

133	      Encoding of re-ECN wire protocol changed for reasons given in
134	      Appendix B and consequently draft substantially re-written.

136	      Substantial text added in sections on applications, incremental
137	      deployment, architectural rationale and security considerations.

139	Table of Contents

141	   1.  Introduction
142	   2.  Requirements notation
143	   3.  Protocol Overview
144	     3.1.  Background and Applicability
145	     3.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or
146	           v6)
147	     3.3.  Re-ECN Protocol Operation
148	     3.4.  Informal Terminology
149	   4.  Transport Layers
150	     4.1.  TCP
151	       4.1.1.  RECN mode: Full re-ECN capable transport
152	       4.1.2.  RECN-Co mode: Re-ECT Sender with a Vanilla or
153	               Nonce ECT Receiver
154	       4.1.3.  Capability Negotiation
155	       4.1.4.  Extended ECN (EECN) Field Settings during Flow
156	               Start or after Idle Periods
157	       4.1.5.  Pure ACKS, Retransmissions, Window Probes and
158	               Partial ACKs
159	     4.2.  Other Transports
160	       4.2.1.  General Guidelines for Adding Re-ECN to Other
161	               Transports
162	       4.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS
163	       4.2.3.  Guidelines for adding Re-ECN to DCCP
164	       4.2.4.  Guidelines for adding Re-ECN to SCTP
165	   5.  Network Layer
166	     5.1.  Re-ECN IPv4 Wire Protocol
167	     5.2.  Re-ECN IPv6 Wire Protocol
168	     5.3.  Router Forwarding Behaviour
169	     5.4.  Justification for Setting the First SYN to FNE
170	     5.5.  Control and Management
171	       5.5.1.  Negative Balance Warning
172	       5.5.2.  Rate Response Control
173	     5.6.  IP in IP Tunnels
174	     5.7.  Non-Issues
175	   6.  Applications
176	     6.1.  Policing Congestion Response
177	       6.1.1.  The Policing Problem
178	       6.1.2.  The Case Against Bottleneck Policing
179	       6.1.3.  Re-ECN Incentive Framework
180	       6.1.4.  Egress Dropper
181	       6.1.5.  Policing
182	       6.1.6.  Inter-domain Policing
183	       6.1.7.  Inter-domain Fail-safes
184	       6.1.8.  Simulations
185	     6.2.  Other Applications
186	       6.2.1.  DDoS Mitigation
187	       6.2.2.  End-to-end QoS
188	       6.2.3.  Traffic Engineering
189	       6.2.4.  Inter-Provider Service Monitoring
190	     6.3.  Limitations
191	   7.  Incremental Deployment
192	     7.1.  Incremental Deployment Features
193	     7.2.  Incremental Deployment Incentives
194	   8.  Architectural Rationale
195	   9.  Related Work
196	     9.1.  Policing Rate Response to Congestion
197	     9.2.  Congestion Notification Integrity
198	     9.3.  Identifying Upstream and Downstream Congestion
199	   10. Security Considerations
200	   11. IANA Considerations
201	   12. Conclusions
202	   13. Acknowledgements
203	   14. Comments Solicited
204	   15. References
205	     15.1. Normative References
206	     15.2. Informative References
207	   Appendix A.  Precise Re-ECN Protocol Operation
208	   Appendix B.  Justification for Two Codepoints Signifying Zero
209	                Worth Packets
210	   Appendix C.  ECN Compatibility
211	   Appendix D.  Packet Marking with FNE During Flow Start
212	   Appendix E.  Example Egress Dropper Algorithm
213	   Appendix F.  Re-TTL
214	   Appendix G.  Policer Designs to ensure Congestion
215	                Responsiveness
216	     G.1.  Per-user Policing
217	     G.2.  Per-flow Rate Policing
218	   Appendix H.  Downstream Congestion Metering Algorithms
219	     H.1.  Bulk Downstream Congestion Metering Algorithm
220	     H.2.  Inflation Factor for Persistently Negative Flows
221	   Appendix I.  Argument for holding back the ECN nonce
222	   Authors' Addresses
223	   Intellectual Property and Copyright Statements

225	1.  Introduction

227	   This document aims:

229	   o  To provide a complete specification of the addition of the re-ECN
230	      protocol to IP and guidelines on how to add it to transport layer
231	      protocols, including a complete specification of re-ECN in TCP as
232	      an example;

234	   o  To show how a number of hard problems become much easier to solve
235	      once re-ECN is available in IP.

237	   A general statement of the problem solved by re-ECN is to provide
238	   sufficient information in each IP datagram to be able to hold senders
239	   and whole networks accountable for the congestion they cause
240	   downstream, before they cause it.  But the every-day problems that
241	   re-ECN can solve are much more recognisable than this rather generic
242	   statement: mitigating distributed denial of service (DDoS);
243	   simplifying differentiation of quality of service (QoS); policing
244	   compliance to congestion control; and so on.

246	   Uniquely, re-ECN manages to enable solutions to these problems
247	   without unduly stifling innovative new ways to use the Internet.
248	   This was a hard balance to strike, given it could be argued that DDoS
249	   is an innovative way to use the Internet.  The most valuable insight
250	   was to allow each network to choose the level of constraint it wishes
251	   to impose.  Also re-ECN has been carefully designed so that networks
252	   that choose to use it conservatively can protect themselves against
253	   the congestion caused in their network by users on other networks
254	   with more liberal policies.

256	   For instance, some network owners want to block applications like
257	   voice and video unless their network is compensated for the extra
258	   share of bottleneck bandwidth taken.  These real-time applications
259	   tend to be unresponsive when congestion arises, whereas elastic TCP-
260	   based applications back away quickly, ending up taking a much smaller
261	   share of congested capacity for themselves.  Other network owners
262	   want to invest in large amounts of capacity and make their gains from
263	   simplicity of operation and economies of scale.

265	   Re-ECN allows the more conservative networks to police out flows that
266	   have not asked to be unresponsive to congestion, not because they are
267	   voice or video, but simply because they don't respond to congestion.
268	   But it also allows other networks to choose not to police.
269	   Crucially, when flows from liberal networks cross into a conservative
270	   network, re-ECN enables the conservative network to apply penalties
271	   to its neighbouring networks for the congestion they allow to be
272	   caused.  And these penalties can be applied to bulk data, without
273	   regard to flows.

275	   Then, if unresponsive applications become so dominant that some of
276	   the more liberal networks experience congestion collapse [RFC3714],
277	   they can change their minds and use re-ECN to apply tighter controls
278	   in order to bring congestion back under control.

280	   Re-ECN works by arranging that each packet arrives at each network
281	   element carrying a view of expected congestion on its own downstream
282	   path, albeit averaged over multiple packets.  Most usefully,
283	   congestion on the remainder of the path becomes visible in the IP
284	   header at the first ingress.  Many of the applications of re-ECN
285	   involve a policer at this ingress using the view of downstream
286	   congestion arriving in packets to police or control the packet rate.

288	   Importantly, the scheme is recursive: a whole network harbouring
289	   users causing congestion in downstream networks can be held
290	   responsible or policed by its downstream neighbour.

292	   This document is structured as follows.  First an overview of the re-
293	   ECN protocol is given (Section 3), outlining its attributes and
294	   explaining conceptually how it works as a whole.  The two main parts
295	   of the document follow, as described above.  That is, the protocol
296	   specification divided into transport (Section 4) and network
297	   (Section 5) layers, then the applications it can be put to, such as
298	   policing DDoS, QoS and congestion control (Section 6).  Although
299	   these applications do not require standardisation themselves, they
300	   are described in a fair degree of detail in order to explain how re-
301	   ECN can be used.  Given re-ECN proposes to use the last undefined bit
302	   in the IPv4 header, we felt it necessary to outline the potential
303	   that re-ECN could release in return for being given that bit.

305	   Deployment issues discussed throughout the document are brought
306	   together in Section 7, which is followed by a brief section
307	   explaining the somewhat subtle rationale for the design from an
308	   architectural perspective (Section 8).  We end by describing related
309	   work (Section 9), listing security considerations (Section 10) and
310	   finally drawing conclusions (Section 12).

312	2.  Requirements notation

314	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
315	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
316	   document are to be interpreted as described in [RFC2119].

318	   This document first specifies a protocol, then describes a framework
319	   that creates the right incentives to ensure compliance to the
320	   protocol.  This could cause confusion because the second part of the
321	   document considers many cases where malicious nodes may not comply
322	   with the protocol.  When such contingencies are described, if any of
323	   the above keywords are not capitalised, that is deliberate.  So, for
324	   instance, the following two apparently contradictory sentences would
325	   be perfectly consistent: i) x MUST do this; ii) x may not do this.

327	3.  Protocol Overview

329	3.1.  Background and Applicability

331	   First we briefly recap the essentials of the ECN protocol [RFC3168].
332	   Two bits in the IP header (v4 or v6) are assigned to the ECN field.
333	   The sender clears the field to "00" (Not-ECT) if either end-point
334	   transport is not ECN-capable.  Otherwise it indicates an ECN-capable
335	   transport (ECT) using either of the two code-points "10" or "01"
336	   (ECT(0) and ECT(1) resp.).

338	   ECN-capable routers probabilistically set "11" if congestion is
339	   experienced (CE), the marking probability increasing with the length
340	   of the queue at its egress link (typically using the RED
341	   algorithm [RFC2309]).  However, they still drop rather than mark Not-
342	   ECT packets.  With multiple ECN-capable routers on a path, a flow of
343	   packets accumulates the fraction of CE marking that each router adds.
344	   The combined effect of the packet marking of all the routers along
345	   the path signals congestion of the whole path to the receiver.  So,
346	   for example, if one router early in a path is marking 1% of packets
347	   and another later in a path is marking 2%, flows that pass through
348	   both routers will experience approximately 3% marking (see Appendix A
349	   for a precise treatment).
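
   As a rough illustration of the arithmetic in the example above, the
   following Python sketch combines per-router CE marking fractions.  It
   assumes that each router marks independently of the others; that
   assumption, and the function name, are introduced here purely for
   illustration (Appendix A remains the reference for the precise
   treatment).

```python
def path_marking_fraction(router_fractions):
    """Fraction of packets CE-marked by at least one router on a path.

    Assumes (for illustration only) that each router marks packets
    independently of every other router on the path.
    """
    unmarked = 1.0
    for p in router_fractions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("marking fraction must be within [0, 1]")
        # A packet stays unmarked only if this router also leaves it alone.
        unmarked *= 1.0 - p
    return 1.0 - unmarked


# The two-router example from the text: 1% marking, then 2% marking.
combined = path_marking_fraction([0.01, 0.02])
print(f"combined marking fraction: {combined:.4f}")  # slightly under 0.03
```

   Under this assumption the combined fraction comes out slightly below
   the simple sum of 3%, which is why the text says "approximately".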

351	   The choice of two ECT code-points in the ECN field [RFC3168]
352	   permitted future flexibility, optionally allowing the sender to
353	   encode the experimental ECN nonce [RFC3540] in the packet stream.
354	   The nonce is designed to allow a sender to check the integrity of
355	   congestion feedback.  But Section 9.2 explains that it still gives no
356	   control over how fast the sender transmits as a result of the
357	   feedback.  On the other hand, re-ECN is designed both to ensure that
358	   congestion is declared honestly and that the sender's rate responds
359	   appropriately.

361	   Re-ECN is based on a feedback arrangement called `re-
362	   feedback' [Re-fb].  The word is short for either receiver-aligned,
363	   re-inserted or re-echoed feedback.  But it actually works even when
364	   no feedback is available.  In fact it has been carefully designed to
365	   work for single datagram flows.  It also encourages aggregation of
366	   single packet flows by congestion control proxies.  Then, even if the
367	   traffic mix of the Internet were to become dominated by short
368	   messages, it would still be possible to control congestion
369	   effectively and efficiently.

371	   Changing the Internet's feedback architecture seems to imply
372	   considerable upheaval.  But re-ECN can be deployed incrementally at
373	   the transport layer around unmodified routers using existing fields
374	   in IP (v4 or v6).  However it does also require the last undefined
375	   bit in the IPv4 header, which it uses in combination with the 2-bit
376	   ECN field to create four new codepoints.  Nonetheless, changes to IP
377	   routers are RECOMMENDED in order to improve resilience against DoS
378	   attacks.  Similarly, re-ECN works best if both the sender and
379	   receiver transports are re-ECN-capable, but it can work with just
380	   sender support.  Section 7.1 summarises the incremental deployment
381	   strategy.

383	   The re-ECN protocol makes no changes to, and has no effect on, the
384	   TCP congestion control algorithm or on other rate responses to
385	   congestion.  Re-ECN is only concerned with enabling the ingress
386	   network to police that a source is complying with a congestion
387	   control algorithm, which is orthogonal to congestion control itself.

389	   Before re-ECN can be considered worthy of using up the last bit in
390	   the IP header, we must be sure that all our claims are robust.  We
391	   have gradually been reducing the list of outstanding issues, but the
392	   few that still remain are listed in Section 6.3.  We expect new
393	   attacks may still be found, but we offer the re-ECN protocol on the
394	   basis that it is built on fairly solid theoretical foundations and,
395	   so far, it has proved possible to keep it relatively robust.

397	3.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

399	   The re-ECN wire protocol uses the two-bit ECN field broadly as in
400	   RFC3168 [RFC3168] as described above, but with five differences of
401	   detail (brought together in a list in Section 7.1).  This
402	   specification defines a new re-ECN extension (RE) flag.  We will
403	   defer the definition of the actual position of the RE flag in the
404	   IPv4 & v6 headers until Section 5.  Until then it will suffice to use
405	   an abstraction of the IPv4 and v6 wire protocols by just calling it
406	   the RE flag.

408	   Unlike the ECN field, the RE flag is intended to be set by the sender
409	   and remain unchanged along the path, although it can be read by
410	   network elements that understand the re-ECN protocol.  It is feasible
411	   that a network element MAY change the setting of the RE flag, perhaps
412	   acting as a proxy for an end-point, but such a protocol would have to
413	   be defined in another specification (e.g. [Re-PCN]).

415	   Although the RE flag is a separate, single bit field, it can be read
416	   as an extension to the two-bit ECN field; the three concatenated bits
417	   in what we will call the extended ECN field (EECN) making eight
418	   codepoints.  We will use the RFC3168 names of the ECN codepoints to
419	   describe settings of the ECN field when the RE flag setting is "don't
420	   care", but we also define the following six extended ECN codepoint
421	   names for when we need to be more specific.

423	   RFC3168 ECN defines uses for all four codepoints of the two-bit ECN
424	   field.  This memo widens the codepoint space to eight, and uses six
425	   codepoints.  One of re-ECN's codepoints is an alternative use of the
426	   codepoint set aside in RFC3168 for the ECN nonce (ECT(1)).
427	   Transports not using re-ECN can still use the ECN nonce, while those
428	   using re-ECN do not need to as long as the sender is also checking
429	   for transport protocol compliance [I-D.moncaster-tcpm-rcv-cheat].
430	   The case for doing this is given in Appendix I.  Two re-ECN
431	   codepoints are given compatible uses to those defined in RFC3168
432	   (Not-ECT and CE).  The other codepoint used by RFC3168 (ECT(0)) isn't
433	   used for re-ECN.  Altogether this leaves one codepoint of the eight
434	   unused and available for future use.

436	   +-------+------------+------+--------------+------------------------+
437	   |  ECN  | RFC3168    |  RE  | Extended ECN |     Re-ECN meaning     |
438	   | field | codepoint  | flag | codepoint    |                        |
439	   +-------+------------+------+--------------+------------------------+
440	   |   00  | Not-ECT    |   0  | Not-RECT     |   Not re-ECN-capable   |
441	   |       |            |      |              |        transport       |
442	   |   00  | Not-ECT    |   1  | FNE          |      Feedback not      |
443	   |       |            |      |              |       established      |
444	   |   01  | ECT(1)     |   0  | Re-Echo      |  Re-echoed congestion  |
445	   |       |            |      |              |        and RECT        |
446	   |   01  | ECT(1)     |   1  | RECT         |     Re-ECN capable     |
447	   |       |            |      |              |        transport       |
448	   |   10  | ECT(0)     |   0  | ---          |   Legacy ECN use only  |
449	   |       |            |      |              |                        |
450	   |   10  | ECT(0)     |   1  | --CU--       |    Currently unused    |
451	   |       |            |      |              |                        |
452	   |   11  | CE         |   0  | CE(0)        |   Re-Echo canceled by  |
453	   |       |            |      |              | congestion experienced |
454	   |   11  | CE         |   1  | CE(-1)       | Congestion experienced |
455	   +-------+------------+------+--------------+------------------------+

457	                     Table 1: Extended ECN Codepoints
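
   Table 1 can be read as a lookup from the pair (ECN field, RE flag) to
   an extended ECN codepoint.  The Python sketch below transcribes the
   table in that form.  It works on the abstracted wire protocol only:
   the actual bit position of the RE flag is deferred to Section 5, and
   the identifiers `EECN`, `EECN_TABLE` and `decode_eecn` are
   hypothetical names chosen for this example.

```python
from enum import Enum


class EECN(Enum):
    """Extended ECN codepoint names, as listed in Table 1."""
    NOT_RECT = "Not-RECT"    # Not re-ECN-capable transport
    FNE = "FNE"              # Feedback not established
    RE_ECHO = "Re-Echo"      # Re-echoed congestion and RECT
    RECT = "RECT"            # Re-ECN capable transport
    LEGACY = "---"           # Legacy ECN use only
    CU = "--CU--"            # Currently unused
    CE_0 = "CE(0)"           # Re-Echo canceled by congestion experienced
    CE_MINUS_1 = "CE(-1)"    # Congestion experienced


# Keys are (two-bit ECN field, RE flag), row by row from Table 1.
EECN_TABLE = {
    (0b00, 0): EECN.NOT_RECT,
    (0b00, 1): EECN.FNE,
    (0b01, 0): EECN.RE_ECHO,
    (0b01, 1): EECN.RECT,
    (0b10, 0): EECN.LEGACY,
    (0b10, 1): EECN.CU,
    (0b11, 0): EECN.CE_0,
    (0b11, 1): EECN.CE_MINUS_1,
}


def decode_eecn(ecn_field: int, re_flag: int) -> EECN:
    """Map an abstract (ECN field, RE flag) pair to its Table 1 codepoint."""
    if ecn_field not in (0b00, 0b01, 0b10, 0b11) or re_flag not in (0, 1):
        raise ValueError("ECN field must be 0..3 and RE flag must be 0 or 1")
    return EECN_TABLE[(ecn_field, re_flag)]


print(decode_eecn(0b01, 1).value)  # ECT(1) with the RE flag set
```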

459	3.3.  Re-ECN Protocol Operation

461	   In this section we will give an overview of the operation of the re-
462	   ECN protocol for TCP/IP, leaving a detailed specification to the
463	   following sections.  Other transports will be discussed later.

465	   In summary, the protocol adds a third `re-echo' stage to the existing
466	   TCP/IP ECN protocol.  Whenever the network adds CE congestion
467	   signalling to the IP header on the forward data path, the receiver
468	   feeds it back to the ingress using TCP, then the sender re-echoes it
469	   into the forward data path using the RE flag in the next packet.

471	   Prior to receiving any feedback a sender will not know which setting
472	   of the RE flag to use, so it sets the feedback not established (FNE)
473	   codepoint.  The network reads the FNE codepoint conservatively as
474	   equivalent to re-echoed congestion.

476	   Specifically, once a flow is established, a re-ECN sender always
477	   initialises the ECN field to ECT(1).  And it usually sets the RE flag
478	   to "1".  Whenever a router re-marks a packet to CE, the receiver
479	   feeds back this event to the sender.  On receiving this feedback, the
480	   re-ECN sender will clear the RE flag to "0" in the next packet it
481	   sends.

483	   We chose to set and clear the RE flag this way round to ease
484	   incremental deployment (see Section 7.1).  To avoid confusion we will
485	   use the term `blanking' (rather than marking) when the RE flag is
486	   cleared to "0".  So, over a stream of packets, we will talk of the
487	   `RE blanking fraction' as the fraction of octets in packets with the
488	   RE flag cleared to "0".
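
   The sender behaviour outlined above can be sketched in a few lines of
   Python.  This is a deliberately simplified model, not the TCP
   procedure specified in Section 4: it assumes one blanked packet per
   CE mark reported, ignores how the feedback is encoded, and uses a
   hypothetical class name and method names throughout.

```python
class ReEcnSenderSketch:
    """Toy model of the re-ECN sender overview (illustrative only)."""

    def __init__(self):
        self.feedback_established = False
        self.pending_re_echoes = 0   # CE marks fed back but not yet re-echoed
        self.octets_sent = 0
        self.octets_blanked = 0

    def on_feedback(self, ce_marks_reported=0):
        """Record feedback from the receiver (encoding is out of scope)."""
        self.feedback_established = True
        self.pending_re_echoes += ce_marks_reported

    def next_packet(self, size_octets):
        """Return the abstract (ECN field, RE flag) for the next packet."""
        self.octets_sent += size_octets
        if not self.feedback_established:
            return (0b00, 1)         # FNE: feedback not established
        if self.pending_re_echoes > 0:
            self.pending_re_echoes -= 1
            self.octets_blanked += size_octets
            return (0b01, 0)         # Re-Echo: ECT(1) with RE blanked to 0
        return (0b01, 1)             # RECT: ECT(1) with RE set to 1

    def re_blanking_fraction(self):
        """Fraction of octets sent in packets with the RE flag cleared."""
        if self.octets_sent == 0:
            return 0.0
        return self.octets_blanked / self.octets_sent


sender = ReEcnSenderSketch()
print(sender.next_packet(1000))      # before any feedback: FNE
sender.on_feedback(ce_marks_reported=1)
print(sender.next_packet(1000))      # re-echoes the reported CE mark
print(sender.next_packet(1000))      # back to the usual RECT setting
print(sender.re_blanking_fraction())
```

   Counting octets rather than packets in `re_blanking_fraction` follows
   the definition of the RE blanking fraction given above.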

490	         _      _                      _      _
491	       /   \  /   \                  /   \  /   \
492	       | S |--| 0 | - - - - - - - -  | i |--| D |
493	       \ _ /  \ _ /                  \ _ /  \ _ /
494	         .      .                      .      .
495	       ^ .      .                      .      .
496	       | .      .                      .      .
497	       | .     RE blanking fraction    .      .
498	    3% |-------------------------------+=======
499	       | .      .                      |      .
500	    2% | .      .                      |      .
501	       | .      .  CE marking fraction |      .
502	    1% | .      +----------------------+      .
503	       | .      |                      .      .
504	    0% +--------------------------------------->
505	         ^      0     ^                i      ^   resource index
506	         0      ^     1                ^      2   observation points
507	                |                      |
508	              1.00%                  2.00%        marking fraction

510	                 Figure 1: A 2-Router Example (Imprecise)

512	   Figure 1 uses a simple network to illustrate how re-ECN allows
513	   routers to measure downstream congestion.  The horizontal axis
514	   represents the index of each congestible resource (typically queues)
515	   along a path through the Internet.  There may be many routers on the
516	   path, but we assume only two are currently congested (those with
517	   resource index 0 and i).  The two superimposed plots show the
518	   fraction of each extended ECN codepoint in a flow observed along this
519	   path.  Given about 3% of packets reaching the destination are marked
520	   CE, in response to feedback the sender will blank the RE flag in
521	   about 3% of packets it sends.  Then approximate downstream congestion
522	   can be measured at the observation points shown along the path by
523	   subtracting the CE marking fraction from the RE blanking fraction, as
524	   shown in the table below (Appendix A derives these approximations
525	   from a precise analysis).

527	           +-------------------+------------------------------+
528	           | Observation point | Approx downstream congestion |
529	           +-------------------+------------------------------+
530	           |         0         |         3% - 0% = 3%         |
531	           |         1         |         3% - 1% = 2%         |
532	           |         2         |         3% - 3% = 0%         |
533	           +-------------------+------------------------------+

535	   Table 2: Downstream Congestion Measured at Example Observation Points

537	   All along the path, whole-path congestion remains unchanged so it can
538	   be used as a reference against which to compare upstream congestion.
539	   The difference predicts downstream congestion for the rest of the
540	   path.  Therefore, measuring the fractions of each codepoint at any
541	   point in the Internet will reveal upstream, downstream and whole path
542	   congestion.
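
   As an informal illustration only, the subtraction behind Table 2 can
   be sketched in a few lines of Python.  The function and variable
   names are hypothetical, and the input fractions are simply the
   imprecise values read from Figure 1.

```python
def downstream_congestion(re_blanking_fraction, ce_marking_fraction):
    """Approximate downstream congestion seen at one observation point.

    The RE blanking fraction stands for whole-path congestion and the CE
    marking fraction for upstream congestion; the difference approximates
    congestion on the rest of the path.
    """
    return re_blanking_fraction - ce_marking_fraction


# (RE blanking fraction, CE marking fraction) at observation points 0, 1, 2
# of Figure 1. These are the illustrative values from the figure.
observed = {0: (0.03, 0.00), 1: (0.03, 0.01), 2: (0.03, 0.03)}

estimates = {
    point: downstream_congestion(blanked, marked)
    for point, (blanked, marked) in observed.items()
}
```

   A result that is persistently below zero at some observation point
   would correspond to what Section 3.4 calls a negative flow.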

544	   Note that we have introduced discussion of marking and blanking
545	   fractions solely for illustration.  To be absolutely clear, these
546	   fractions are averages that would result from the behaviour of a TCP
547	   protocol handler mechanically blanking outgoing packets in direct
548	   response to incoming feedback---we are not saying any protocol
549	   handler works with these average fractions directly.

551	3.4.  Informal Terminology

553	   In the rest of this memo we will loosely talk of positive or negative
554	   flows, meaning flows where the moving average of the downstream
555	   congestion metric is persistently positive or negative.  The notion
556	   of a negative metric arises because it is derived by subtracting one
557	   metric from another.  Of course actual downstream congestion cannot
558	   be negative, only the metric can (whether due to time lags or
559	   deliberate malice).

561	   Just as we will loosely talk of positive and negative flows, we will
562	   also talk of positive or negative packets, meaning packets that
563	   contribute positively or negatively to the downstream congestion
564	   metric.

566	   Therefore we will talk of packets having `worth' of +1, 0 or -1,
567	   which, when multiplied by their size, indicates their contribution to
568	   the downstream congestion metric.

570	   Figure 2 shows the main state transitions of the system once a flow
571	   is established, showing the worth of packets in each state.  When the
572	   network congestion marks a packet it decrements its worth (moving
573	   from the left of the main square to the right).  When the sender
574	   blanks the RE flag in order to re-echo congestion it increments the
575	   worth of a packet (moving from the bottom of the main square to the
576	   top).

578	   Sender state         Sent     Worth            Received   Worth
579	                        packet                    packet
580	            +----------------------------------------------------+
581	            |                                                    ^
582	            V                                                    |
583	   Congestion echoed -->Re-Echo  +1  --+--->      CE(0)      0 --+
584	                        (positive)     |            (canceled)   |
585	                                       V    network              |
586	                                       |   congestion            |
587	                                       |                         |
588	   Flow established --> RECT      0  ----+->      CE(-1)    -1 --+
589	            ^           (neutral)      | |          (negative)
590	            |                          | |
591	            |                      no  V V
592	            |               congestion | |
593	            +-----------<--------------+-+

595	        Figure 2: Re-ECN System State Diagram (bootstrap not shown)

597	   The idea is that every time the network decrements the worth of a
598	   packet, the sender increments the worth of a later packet.  Then,
599	   over time, as many positive octets should arrive at the receiver as
600	   negative.  Note we have said octets not packets, so if packets are of
601	   different sizes, the worth should be incremented on enough octets to
602	   balance the octets in negative packets arriving at the receiver.  It
603	   is this balance that will allow the network to hold the sender
604	   accountable for the congestion it causes, as we shall see.  The
605	   informal outline below uses TCP as an example transport, but the idea
606	   would be broadly similar for any transport that adapts its rate to
607	   congestion.

609	   We will start with the sender in `flow established' state.  Normally,
610	   as acknowledgements of earlier packets arrive that don't feed back
611	   any congestion, the congestion window can be opened, so the sender
612	   goes round the smaller sub-loop, sending RECT packets (worth 0) and
613	   returning to the flow established state to send another one.  If a
614	   router congestion marks one of the packets, it decrements the
615	   packet's worth.  The sender will have been continuing to traverse
616	   round the smaller feedback loop every time acknowledgements arrive.
617	   But when congestion feedback returns from this packet that was marked
618	   with -1 worth (the largest loop in the figure) the sender jumps to
619	   the congestion echoed state in order to re-echo the congestion,
620	   incrementing the worth of the next packet to +1 by blanking its RE
621	   flag.  The sender then returns to the flow established state and
622	   continues round the smaller loop, sending packets worth 0.  Note that
623	   the size of the loops is just an artefact of the figure; it is not
624	   meant to imply that one loop is slower than the other - they are both
625	   the same end to end feedback loop.

627	   If a packet carrying re-echoed congestion happens to also be
628	   congestion marked, the +1 worth added by the sender will be cancelled
629	   out by the -1 network congestion marking.  Although the two worth
630	   values correctly cancel out, neither the congestion marking nor the
631	   re-echoed congestion are lost, because the RE bit and the ECN field
632	   are orthogonal.  So, whenever this happens, the receiver will
633	   correctly detect and re-echo the new congestion event as well (the
634	   top sub-loop).  When we need to distinguish, we will sometimes call a
635	   packet marked RECT 'neutral' (0 worth), while we will call the CE(0)
636	   marking 'canceled' (also 0 worth).  If a re-echoed packet isn't
637	   unlucky enough to be further congestion marked, the sender will
638	   return to the flow established state and continue to send RECT
639	   packets (worth 0).

641	   The table below specifies unambiguously the worth of each extended
642	   ECN codepoint.  Note the order is different from the previous table
643	   to better show how the worth increments and decrements.  The FNE
644	   codepoint is an exception.  It is used in the flow bootstrap process
645	   (explained later) and has the same positive (+1) worth as a packet
646	   with the Re-Echo codepoint.

648	   +--------+------+----------------+-------+--------------------------+
649	   |   ECN  |  RE  | Extended ECN   | Worth |      Re-ECN meaning      |
650	   |  field |  bit | codepoint      |       |                          |
651	   +--------+------+----------------+-------+--------------------------+
652	   |   00   |   0  | Not-RECT       | ...   |    Not re-ECN-capable    |
653	   |        |      |                |       |         transport        |
654	   |   01   |   0  | Re-Echo        | +1    | Re-echoed congestion and |
655	   |        |      |                |       |           RECT           |
656	   |   10   |   0  | ---            | ...   |  Legacy ECN use only     |
657	   |   11   |   0  | CE(0)          |  0    |    Re-Echo canceled by   |
658	   |        |      |                |       |  congestion experienced  |
659	   |   00   |   1  | FNE            | +1    | Feedback not established |
660	   |   01   |   1  | RECT           |  0    | Re-ECN capable transport |
661	   |   10   |   1  | --CU--         | ...   |     Currently unused     |
662	   |        |      |                |       |                          |
663	   |   11   |   1  | CE(-1)         | -1    |  Congestion experienced  |
664	   +--------+------+----------------+-------+--------------------------+

666	                Table 3: 'Worth' of Extended ECN Codepoints
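
   The following Python sketch restates Table 3 as a lookup table and
   shows the octet balance described earlier in this section.  It is
   illustrative, not normative: the names EECN, congestion_mark and
   octet_balance are hypothetical, codepoints shown with "..." in the
   table are given a worth of None, and only the two marking transitions
   drawn in Figure 2 are modelled.

```python
# (ECN field, RE bit) -> (extended ECN codepoint, worth).
# None means Table 3 gives no worth for that codepoint ("...").
EECN = {
    (0b00, 0): ("Not-RECT", None),
    (0b01, 0): ("Re-Echo", +1),
    (0b10, 0): ("---", None),
    (0b11, 0): ("CE(0)", 0),
    (0b00, 1): ("FNE", +1),
    (0b01, 1): ("RECT", 0),
    (0b10, 1): ("--CU--", None),
    (0b11, 1): ("CE(-1)", -1),
}


def congestion_mark(ecn, re_bit):
    """Model a router re-marking a RECT or Re-Echo packet to CE.

    Only the ECN field changes (01 -> 11); the RE bit is left untouched,
    which is why the two fields are described as orthogonal.
    """
    if ecn != 0b01:
        raise ValueError("this sketch only models marking of ECN field 01")
    return (0b11, re_bit)


def octet_balance(packets):
    """Sum worth * size over packets given as (ecn, re_bit, size_in_octets)."""
    total = 0
    for ecn, re_bit, size in packets:
        worth = EECN[(ecn, re_bit)][1]
        if worth is not None:
            total += worth * size
    return total
```

   For instance, one CE(-1) packet, one Re-Echo packet and one RECT
   packet of equal size sum to a balance of zero octets.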

668	4.  Transport Layers

670	4.1.  TCP

672	   Re-ECN capability at the sender is essential.  At the receiver it is
673	   optional, as long as the receiver has a basic (`vanilla flavour')
674	   RFC3168-compliant ECN-capable transport (ECT) [RFC3168].  Given re-
675	   ECN is not the first attempt to define the semantics of the ECN
676	   field, we give a table below summarising what happens for various
677	   combinations of capabilities of the sender S and receiver R, as
678	   indicated in the first four columns below.  The last column gives the
679	   mode a half-connection should be in after the first two segments of
680	   TCP's three-way handshake.

682	   +--------+--------------+------------+---------+--------------------+
683	   | Re-ECT |   ECT-Nonce  |     ECT    | Not-ECT |         S-R        |
684	   |        |   (RFC3540)  |  (RFC3168) |         |   Half-connection  |
685	   |        |              |            |         |        Mode        |
686	   +--------+--------------+------------+---------+--------------------+
687	   |   SR   |              |            |         |        RECN        |
688	   |    S   |       R      |            |         |       RECN-Co      |
689	   |    S   |              |      R     |         |       RECN-Co      |
690	   |    S   |              |            |    R    |       Not-ECT      |
691	   +--------+--------------+------------+---------+--------------------+

693	       Table 4: Modes of TCP Half-connection for Combinations of ECN
694	                  Capabilities of Sender S and Receiver R

696	   We will describe what happens in each mode, then describe how they
697	   are negotiated.  The abbreviations for the modes in the above table
698	   mean:

700	   RECN:  Full re-ECN capable transport

702	   RECN-Co:  Re-ECN sender in compatibility mode with a
703	      vanilla [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable
704	      receiver.  Implementation of this mode is OPTIONAL.

706	   Not-ECT:  Not ECN-capable transport, as defined in [RFC3168] for when
707	      at least one of the transports does not understand even basic ECN
708	      marking.

710	   Note that we use the term Re-ECT for a host transport that is re-ECN-
711	   capable but RECN for the modes of the half connections between hosts
712	   when they are both Re-ECT.  If a host transport is Re-ECT, this fact
713	   alone does NOT imply either of its half connections will necessarily
714	   be in RECN mode, at least not until it has confirmed that the other
715	   host is Re-ECT.

717	4.1.1.  RECN mode: Full re-ECN capable transport

719	   In full RECN mode, for each half connection, both the sender and the
720	   receiver each maintain an unsigned integer counter we will call ECC
721	   (echo congestion counter).  The receiver maintains a count, modulo 8,
722	   of how many times a CE marked packet has arrived during the half-
723	   connection.  Once a RECN connection is established, the three TCP
724	   option flags (ECE, CWR & NS) used for ECN-related functions in other
725	   versions of ECN are used as a 3-bit field for the receiver to
726	   repeatedly tell the sender the current value of ECC whenever it sends
727	   a TCP ACK.  We will call this the echo congestion increment (ECI)
728	   field.  This overloaded use of these 3 option flags as one 3-bit ECI
729	   field is shown in Figure 4.  The actual definition of the TCP header,
730	   including the addition of support for the ECN nonce, is shown for
731	   comparison in Figure 3.  This specification does not redefine the
732	   names of these three TCP option flags, it merely overloads them with
733	   another definition once a flow is established.

735	        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
736	      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
737	      |               |           | N | C | E | U | A | P | R | S | F |
738	      | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
739	      |               |           |   | R | E | G | K | H | T | N | N |
740	      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

742	    Figure 3: The (post-ECN Nonce) definition of bytes 13 and 14 of the
743	                                TCP Header

745	        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
746	      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
747	      |               |           |           | U | A | P | R | S | F |
748	      | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
749	      |               |           |           | G | K | H | T | N | N |
750	      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

752	    Figure 4: Definition of the ECI field within bytes 13 and 14 of the
753	   TCP Header, overloading the current definitions above for established
754	                                RECN flows.
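
   As a non-normative sketch, the ECI field of Figure 4 could be read
   and written as follows, treating bytes 13 and 14 of the TCP header as
   one 16-bit integer in network byte order.  The helper names are
   hypothetical.  The shift assumes the figure's bit numbering, where
   bit 0 is the most significant bit, so bit 9 (ECE) has weight 0x040.

```python
ECI_SHIFT = 6                  # bit 9 of 16, counted from the left, is 1 << 6
ECI_MASK = 0x7 << ECI_SHIFT    # NS (0x100) | CWR (0x080) | ECE (0x040)


def get_eci(word):
    """Extract the 3-bit ECI value from bytes 13 and 14 of the TCP header."""
    return (word & ECI_MASK) >> ECI_SHIFT


def set_eci(word, eci):
    """Return the 16-bit word with its ECI field replaced by eci (0..7)."""
    return (word & ~ECI_MASK & 0xFFFF) | ((eci & 0x7) << ECI_SHIFT)
```

   This interpretation applies only to segments without the SYN flag on
   a half-connection in RECN mode; during the handshake the same three
   flags carry the capability negotiation instead.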

756	   Receiver Action in RECN Mode

758	      Every time a CE marked packet arrives at a receiver in RECN mode,
759	      the receiver transport increments its local value of ECC modulo 8
760	      and MUST echo its value to the sender in the ECI field of the next
761	      ACK.  It MUST repeat the same value of ECI in every subsequent ACK
762	      until the next CE event, when it increments ECI again.

764	      The increment of the local ECC values is modulo 8 so the field
765	      value simply wraps round back to zero when it overflows.  The
766	      least significant bit is to the right (labelled bit 9).

768	      A receiver in RECN mode MAY delay the echo of a CE to the next
769	      delayed-ACK, which would be necessary if ACK-withholding were
770	      implemented.
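
      The receiver behaviour above might be sketched as follows.  This
      is a minimal illustration with hypothetical names; it leaves out
      delayed ACKs and everything else a real TCP receiver does.

```python
class RecnReceiver:
    """Keeps the receiver's echo congestion counter (ECC), modulo 8."""

    def __init__(self):
        self.ecc = 0

    def on_data_packet(self, ce_marked):
        """Count an arriving data packet; wrap to zero after 7."""
        if ce_marked:
            self.ecc = (self.ecc + 1) % 8

    def eci_for_ack(self):
        """Value to place in the ECI field of every ACK until the next CE."""
        return self.ecc
```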

772	   Sender Action in RECN Mode

774	      On the arrival of every ACK, the sender compares the ECI field
775	      with its own ECC value, then replaces its local value with that
776	      from the ACK.  The difference D is assumed to be the number of CE
777	      marked packets that arrived at the receiver since it sent the
778	      previously received ACK (but see below for the sender's safety
779	      strategy).  Whenever the ECI field increments by D (and/or d drops
780	      are detected), the sender MUST clear the RE flag to "0" in the IP
781	      header of the next D' data packets it sends (where D' = D + d),
782	      effectively re-echoing each single increment of ECI.  Otherwise
783	      the data sender MUST send all data packets with RE set to "1".

785	      As a general rule, once a flow is established, as well as setting
786	      or clearing the RE flag as above, a data sender in RECN mode MUST
787	      always set the ECN field to ECT(1).  However, the settings of the
788	      extended ECN field during flow start are defined in Section 4.1.4.

790	      As we have already emphasised, the re-ECN protocol makes no
791	      changes and has no effect on the TCP congestion control algorithm.
792	      So, each increment of ECI (or detection of a drop) also triggers
793	      the standard TCP congestion response, but with no more than one
794	      congestion response per round trip, as usual.

796	      A TCP sender also acts as the receiver for the other half-
797	      connection.  The host will maintain two ECC values S.ECC and R.ECC
798	      as sender and receiver respectively.  Every TCP header sent by a
799	      host in RECN mode will also repeat the prevailing value of R.ECC
800	      in its ECI field.  If a sender in RECN mode has to retransmit a
801	      packet due to a suspected loss, the re-transmitted packet MUST
802	      carry the latest prevailing value of R.ECC when it is re-
803	      transmitted, which will not necessarily be the one it carried
804	      originally.
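
      A minimal sketch of the sender's bookkeeping is given below.  The
      class and method names are hypothetical.  The sketch deliberately
      omits the safety strategy of Section 4.1.1.2 and the TCP
      congestion response, and assumes the caller supplies the number of
      drops d detected alongside each ACK.

```python
class RecnSender:
    """Tracks the sender's copy of ECC and the packets still to be blanked."""

    def __init__(self):
        self.ecc = 0        # last ECI value seen from the receiver
        self.to_blank = 0   # data packets that still need RE cleared to 0

    def on_ack(self, eci, drops=0):
        """Process the ECI field of an arriving ACK; return D' = D + d."""
        d_marks = (eci - self.ecc) % 8   # apparent increase D, modulo 8
        self.ecc = eci                   # replace the local value
        self.to_blank += d_marks + drops
        return d_marks + drops

    def next_re_flag(self):
        """RE flag for the next data packet: 0 while re-echoes are owed."""
        if self.to_blank > 0:
            self.to_blank -= 1
            return 0
        return 1
```

      Because the difference is taken modulo 8, an ECI value that wraps
      from 7 back to 0 is still read as an increase of one.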

806	4.1.1.1.  Drops and Marks

808	   Re-ECN is based on the ECN protocol [RFC3168] which in turn is
809	   typically based on the RED algorithm [RFC2309].  This algorithm marks
810	   packets as CE with a probability that increases as the size of the
811	   router queue increases.  However, if the queue becomes too full then
812	   it will revert to dropping packets.  Because of this it is important
813	   that re-ECN treats each packet drop it detects as if it were actually
814	   a CE mark.  This ensures that it can continue to correctly echo
815	   congestion even through a highly congested path.

817	   In order to ensure that drops are correctly echoed the sender needs
818	   to add the number of drops detected per RTT to the difference in ECI
819	   value waiting to be echoed.  A drop is defined as set out in
820	   [RFC2581] -- if the connection is in slow start then a single
821	   duplicate acknowledgement will be treated as an indication of a drop.
822	   When the system is in the congestion avoidance stage then 3 duplicate
823	   acknowledgements will be treated as a sign of a drop.  In all cases,
824	   if a re-transmission time-out occurs then that will be treated as a
825	   drop.

827	4.1.1.2.  Safety against Long Pure ACK Loss Sequences

829	   The ECI method was chosen for echoing congestion marking because a
830	   re-ECN sender needs to know about every CE mark arriving at the
831	   receiver, not just whether at least one arrives within a round trip
832	   time (which is all the ECE/CWR mechanism supported).  And, as pure
833	   ACKs are not protected by TCP reliable delivery, we repeat the same
834	   ECI value in every ACK until it changes.  Even if many ACKs in a row
835	   are lost, as soon as one gets through, the ECI field it repeats from
836	   previous ACKs that didn't get through will update the sender on how
837	   many CE marks arrived since the last ACK got through.

839	   The sender will only lose a record of the arrival of a CE mark if all
840	   the ACKs are lost (and all of them were pure ACKs) for a stream of
841	   data long enough to contain 8 or more CE marks.  So, if the marking
842	   fraction was p, at least 8/p pure ACKs would have to be lost.  For
843	   example, if p was 5%, a sequence of 160 pure ACKs would all have to
844	   be lost.  To protect against such extremely unlikely events, if a re-
845	   ECN sender detects a sequence of pure ACKs has been lost it SHOULD
846	   assume the ECI field wrapped as many times as possible within the
847	   sequence.

849	   Specifically, if a re-ECN sender receives an ACK with an
850	   acknowledgement number that acknowledges L segments since the
851	   previous ACK but with a sequence number unchanged from the previously
852	   received ACK, it SHOULD conservatively assume that the ECI field
853	   incremented by D' = L - ((L-D) mod 8), where D is the apparent
854	   increase in the ECI field.  For example if the ACK arriving after 9
855	   pure ACK losses apparently increased ECI by 2, the assumed increment
856	   of ECI would still be 2.  But if ECI apparently increased by 2 after
857	   11 pure ACK losses, ECI should be assumed to have increased by 10.
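
   The conservative assumption above can be written as a one-line
   function.  The function name is hypothetical; L and D are as defined
   in the preceding paragraph, and the sketch assumes D is no greater
   than L.

```python
def assumed_eci_increment(L, D):
    """Largest increment <= L that is congruent to D modulo 8.

    L is the number of segments newly acknowledged and D is the apparent
    increase in the ECI field (0..7). Assumes D <= L.
    """
    return L - ((L - D) % 8)


# The two worked examples from the text.
after_9_losses = assumed_eci_increment(9, 2)     # 2: no room for a wrap
after_11_losses = assumed_eci_increment(11, 2)   # 10: one wrap assumed
```

   In other words, the sender adds as many whole wraps of 8 to the
   apparent increase as will fit within the L segments acknowledged.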

859	   A re-ECN sender MAY implement a heuristic algorithm to predict beyond
860	   reasonable doubt that the ECI field probably did not wrap within a
861	   sequence of lost pure ACKs.  But such an algorithm is NOT REQUIRED.
862	   Such an algorithm MUST NOT be used unless it is proven to work even
863	   in the presence of correlation between high ACK loss rate on the back
864	   channel and high CE marking rate on the forward channel.

866	   Whatever assumption a re-ECN sender makes about potentially lost CE
867	   marks, both its congestion control and its re-echoing behaviour
868	   SHOULD be consistent with the assumption it makes.

870	4.1.2.  RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver

872	   If the half-connection is in RECN-Co mode, ECN feedback proceeds no
873	   differently to that of vanilla ECN.  In other words, the receiver
874	   sets the ECE flag repeatedly in the TCP header and the sender
875	   responds by setting the CWR flag.  Although RECN-Co mode is used when
876	   the receiver has not implemented the re-ECN protocol, the sender can
877	   infer enough from its vanilla ECN feedback to set or clear the RE
878	   flag reasonably well.  Specifically, every time the receiver toggles
879	   the ECE field from "0" to "1" (or a loss is detected), as well as
880	   setting CWR in the TCP flags, the re-ECN sender MUST blank the RE
881	   flag of the next packet to "0" as it would do in full RECN mode.
882	   Otherwise, the data sender SHOULD send all other packets with RE set
883	   to "1".  Once a flow is established, a re-ECN data sender in RECN-Co
884	   mode MUST always set the ECN field to ECT(1).
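
   One possible reading of this rule is sketched below, with
   hypothetical names.  It assumes that each 0-to-1 transition of ECE
   and each detected loss counts as a separate event to re-echo; the
   text above does not say how a transition and a loss arriving together
   should be counted, so that choice is an assumption of the sketch.

```python
class RecnCoSender:
    """Infers re-echoes from vanilla ECE feedback in RECN-Co mode."""

    def __init__(self):
        self.last_ece = 0
        self.to_blank = 0   # data packets that still need RE cleared to 0

    def on_ack(self, ece, loss_detected=False):
        """Register feedback: a rising edge of ECE, a loss, or neither."""
        if ece == 1 and self.last_ece == 0:
            self.to_blank += 1
        if loss_detected:
            self.to_blank += 1   # assumption: a loss is counted separately
        self.last_ece = ece

    def next_re_flag(self):
        """RE flag for the next data packet: 0 while re-echoes are owed."""
        if self.to_blank > 0:
            self.to_blank -= 1
            return 0
        return 1
```

   Repeated ACKs that keep ECE set do not trigger further blanking,
   which is how a second CE mark within the same round trip goes
   unnoticed by the sender.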

886	   If a CE marked packet arrives at the receiver within a round trip
887	   time of a previous mark, the receiver will still be echoing ECE for
888	   the last CE mark.  Therefore, such a mark will be missed by the
889	   sender.  Of course, this isn't of concern for congestion control, but
890	   it does mean that very occasionally the RE blanking fraction will be
891	   understated.  Therefore flows in RECN-Co mode may occasionally be
892	   mistaken for very lightly cheating flows and consequently might
893	   suffer a small number of packet drops through an egress dropper
894	   (Section 6.1.4).  We expect re-ECN would be deployed for some time
895	   before policers and droppers start to enforce it.  So, given there is
896	   not much ECN deployment yet anyway, this minor problem may affect
897	   only a very small proportion of flows, reducing to nothing over the
898	   years as vanilla ECN hosts upgrade.  The use of RECN-Co mode would
899	   need to be reviewed in the light of experience at the time of re-ECN
900	   deployment.

902	   RECN-Co mode is OPTIONAL.  Re-ECN implementers who want to keep their
903	   code simple, MAY choose not to implement this mode.  If they do not,
904	   a re-ECN sender SHOULD fall back to vanilla ECT mode in the presence
905	   of an ECN-capable receiver.  It MAY choose to fall back to the ECT-
906	   Nonce mode, but if re-ECN implementers don't want to be bothered with
907	   RECN-Co mode, they probably won't want to add an ECT-Nonce mode
908	   either.

910	4.1.2.1.  Re-ECN support for the ECN Nonce

912	   A TCP half-connection in RECN-Co mode MUST NOT support the ECN
913	   Nonce [RFC3540].  This means that the sending code of a re-ECN
914	   implementation will never need to include ECN Nonce support.  Re-ECN
915	   is intended to provide wider protection than the ECN nonce against
916	   congestion control misbehaviour, and re-ECN only requires support
917	   from the sender, therefore it is preferable to specifically rule out
918	   the need for dual sender implementations.  As a consequence, a re-ECN
919	   capable sender will never set ECT(0), so it will be easier for
920	   network elements to discriminate re-ECN traffic flows from other ECN
921	   traffic, which will always contain some ECT(0) packets.

923	   However, a re-ECN implementation MAY include receiving
924	   code that complies with the ECN Nonce protocol when interacting with
925	   a sender that supports the ECN nonce (rather than re-ECN), but this
926	   support is NOT REQUIRED.

928	   RFC3540 allows an ECN nonce sender to choose whether to sanction a
929	   receiver that does not ever set the nonce sum.  Given re-ECN is
930	   intended to provide wider protection than the ECN nonce against
931	   congestion control misbehaviour, implementers of re-ECN receivers MAY
932	   choose not to implement backwards compatibility with the ECN nonce
933	   capability.  This may be because they deem that the risk of sanctions
934	   is low, perhaps because significant deployment of the ECN nonce seems
935	   unlikely at implementation time.

937	4.1.3.  Capability Negotiation

939	   During the TCP hand-shake at the start of a connection, an originator
940	   of the connection (host A) with a re-ECN-capable transport MUST
941	   indicate it is Re-ECT by setting the TCP options NS=1, CWR=1 and
942	   ECE=1 in the initial SYN.

944	   A responding Re-ECT host (host B) MUST return a SYN ACK with flags
945	   CWR=1 and ECE=0.  The responding host MUST NOT set this combination
946	   of flags unless the preceding SYN has already indicated Re-ECT
947	   support as above.  A Re-ECT server (B) can use either setting of the
948	   NS flag combined with this type of SYN ACK in response to a SYN from
949	   a Re-ECT client (A).  Normally a Re-ECT server will reply to a Re-ECT
950	   client with NS=0, but in the special circumstance below it can return
951	   a SYN ACK with NS=1.

953	   If the initial SYN from Re-ECT client A is marked CE(-1), a Re-ECT
954	   server B MUST increment its local value of ECC.  But B cannot reflect
955	   the value of ECC in the SYN ACK, because it is still using the 3 bits
956	   to negotiate connection capabilities.  So, server B MUST set the
957	   alternative TCP header flags in its SYN ACK: NS=1, CWR=1 and ECE=0.

959	   These handshakes are summarised in Table 5 below, with X meaning
960	   `don't care'.  The handshakes used for the other flavours of ECN are
961	   also shown for comparison.  To compress the width of the table, the
962	   headings of the first four columns have been severely abbreviated, as
963	   follows:

965	      R: *R*e-ECT

967	      N: ECT-*N*once (RFC3540)

969	      E: *E*CT (RFC3168)

971	      I: Not-ECT (*I*mplicit congestion notification).

973	   These correspond with the same headings used in Table 4.  Indeed, the
974	   resulting modes in the last two columns of the table below are a more
975	   comprehensive way of saying the same thing as Table 4.

977	   +----+---+---+---+------------+-------------+-----------+-----------+
978	   | R  | N | E | I |   SYN A-B  | SYN ACK B-A |  A-B Mode |  B-A Mode |
979	   +----+---+---+---+------------+-------------+-----------+-----------+
980	   |    |   |   |   | NS CWR ECE |  NS CWR ECE |           |           |
981	   | AB |   |   |   |  1   1   1 |  X   1   0  |    RECN   |    RECN   |
982	   | A  | B |   |   |  1   1   1 |  1   0   1  |  RECN-Co  | ECT-Nonce |
983	   | A  |   | B |   |  1   1   1 |  0   0   1  |  RECN-Co  |    ECT    |
984	   | A  |   |   | B |  1   1   1 |  0   0   0  |  Not-ECT  |  Not-ECT  |
985	   | B  | A |   |   |  0   1   1 |  0   0   1  | ECT-Nonce |  RECN-Co  |
986	   | B  |   | A |   |  0   1   1 |  0   0   1  |    ECT    |  RECN-Co  |
987	   | B  |   |   | A |  0   0   0 |  0   0   0  |  Not-ECT  |  Not-ECT  |
988	   +----+---+---+---+------------+-------------+-----------+-----------+

990	      Table 5: TCP Capability Negotiation between Originator (A) and
991	                               Responder (B)

993	   As soon as a re-ECN capable TCP server receives a SYN, it MUST set
994	   its two half-connections into the modes given in Table 5.  As soon as
995	   a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
996	   half-connections into the modes given in Table 5.  The half-
997	   connections will remain in these modes for the rest of the
998	   connection, including for the third segment of TCP's three-way hand-
999	   shake (the ACK).

1001	   {ToDo: Consider SYNs within a connection.}

1003	   Recall that, if the SYN ACK reflects the same flag settings as the
1004	   preceding SYN (because there is a broken legacy implementation that
1005	   behaves this way), RFC3168 specifies that the whole connection MUST
1006	   revert to Not-ECT.

1008	   Also note that, whenever the SYN flag of a TCP segment is set
1009	   (including when the ACK flag is also set), the NS, CWR and ECE flags
1010	   MUST NOT be interpreted as the 3-bit ECI value, which is only set as
1011	   a copy of the local ECC value in non-SYN packets.
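
   As an illustration, the first four rows of Table 5 can be read as a
   decision procedure for a Re-ECT originator A that has sent a SYN with
   NS=1, CWR=1 and ECE=1.  The function names are hypothetical.
   Treating any flag combination that is not listed in the table as
   Not-ECT is an assumption of this sketch, not something the table
   states.

```python
def modes_for_re_ect_originator(ns, cwr, ece):
    """Return (A-B mode, B-A mode) from the flags of the SYN ACK."""
    flags = (ns, cwr, ece)
    if flags == (1, 1, 1):
        # SYN ACK reflects the SYN flags: broken legacy implementation.
        return ("Not-ECT", "Not-ECT")
    if (cwr, ece) == (1, 0):
        # NS is "don't care" for the choice of mode.
        return ("RECN", "RECN")
    if flags == (1, 0, 1):
        return ("RECN-Co", "ECT-Nonce")
    if flags == (0, 0, 1):
        return ("RECN-Co", "ECT")
    # (0, 0, 0), plus unlisted combinations (assumption of this sketch).
    return ("Not-ECT", "Not-ECT")


def syn_was_ce_marked(ns, cwr, ece):
    """True if a Re-ECT responder signals that the initial SYN arrived CE(-1)."""
    return (ns, cwr, ece) == (1, 1, 0)
```

   The second helper reflects the special case described earlier in this
   section, where server B uses NS=1 in place of an ECI value that it
   cannot yet send.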

1013	4.1.4.  Extended ECN (EECN) Field Settings during Flow Start or after
1014	        Idle Periods

1016	   If the originator (A) of a TCP connection supports re-ECN it MUST set
1017	   the extended ECN (EECN) field in the IP header of the initial SYN
1018	   packet to the feedback not established (FNE) codepoint.

1020	   FNE is a new extended ECN codepoint defined by this specification
1021	   (Section 3.2).  The feedback not established (FNE) codepoint is used
1022	   when the transport does not have the benefit of ECN feedback so it
1023	   cannot decide whether to set or clear the RE flag.

1025	   If after receiving a SYN the server B has set its sending half-
1026	   connection into RECN mode or RECN-Co mode, it MUST set the extended
1027	   ECN field in the IP header of its SYN ACK to the feedback not
1028	   established (FNE) codepoint.  Note the careful wording here, which
1029	   means that Re-ECT server B MUST set FNE on a SYN ACK whether it is
1030	   responding to a SYN from a Re-ECT client or from a client that is
1031	   merely ECN-capable.

1033	   The original ECN specification [RFC3168] required SYNs and SYN ACKs
1034	   to use the Not-ECT codepoint of the ECN field.  The aim was to
1035	   prevent well-known DoS attacks such as SYN flooding being able to
1036	   gain from the advantage that ECN capability afforded over drop at
1037	   ECN-capable routers.

1039	   For a SYN ACK, Kuzmanovic [I-D.ietf-tcpm-ecnsyn] has shown that this
1040	   caution was unnecessary, and proposes to allow a SYN ACK to be ECN-
1041	   capable to improve performance.  We have gone further by proposing to
1042	   make the initial SYN ECN-capable too.  By stipulating the FNE
1043	   codepoint for the initial SYN, we comply with RFC3168 in word but not
1044	   in spirit, because we have indeed set the ECN field to Not-ECT, but
1045	   we have extended the ECN field with another bit.  And it will be seen
1046	   (Section 5.3) that we have defined one setting of that bit to mean an
1047	   ECN-capable transport.  Therefore, by proposing that the FNE
1048	   codepoint MUST be used on the initial SYN of a connection, we have
1049	   (deliberately) made the initial SYN ECN-capable.  Section 5.4
1050	   justifies deciding to make the initial SYN ECN-capable.

1052	   Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will
1053	   have already been set on the initial SYN and possibly the SYN ACK as
1054	   above.  But each re-ECN sender will have to set FNE cautiously on a
1055	   few data packets as well, given a number of packets will usually have
1056	   to be sent before sufficient congestion feedback is received.  The
1057	   behaviour will be different depending on the mode of the half-
1058	   connection:

1060	   RECN mode:  Given the constraints on TCP's initial window [RFC3390]
1061	      and its exponential window increase during slow start
1062	      phase [RFC2581], it turns out that the sender SHOULD set FNE on
1063	      the first and third data packets in its flow, assuming equal sized
1064	      data packets once a flow is established.  Appendix D presents the
1065	      calculation that led to this conclusion.  Below, after running
1066	      through the start of an example TCP session, we give the intuition
1067	      learned from that calculation.

1069	   RECN-Co mode:  A re-ECT sender that switches into re-ECN
1070	      compatibility mode or into Not-ECT mode (because it has detected
1071	      the corresponding host is not re-ECN capable) MUST limit its
1072	      initial window to 1 segment.  The reasoning behind this constraint
1073	      is given in Section 5.4.  Having set this initial window, a re-ECN
1074	      sender in RECN-Co mode SHOULD set FNE on the first and third data
1075	      packets in a flow, as for RECN mode.

1077	   +----+------+----------------+-------+-------+---------------+------+
1078	   |    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
1079	   +----+------+----------------+-------+-------+---------------+------+
1080	   |    | Byte |  SEQ  ACK CTL  | EECN  | EECN  |  SEQ  ACK CTL | Byte |
1081	   | -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
1082	   |  1 |      | 0100      SYN  | FNE   | -->   |      R.ECC=0  |      |
1083	   |    |      |    CWR,ECE,NS  |       |       |               |      |
1084	   |  2 |      |      R.ECC=0   | <--   | FNE   | 0300 0101     |      |
1085	   |    |      |                |       |       |   SYN,ACK,CWR |      |
1086	   |  3 |      | 0101 0301 ACK  | RECT  | -->   |      R.ECC=0  |      |
1087	   |  4 | 1000 | 0101 0301 ACK  | FNE   | -->   |      R.ECC=0  |      |
1088	   |  5 |      |      R.ECC=0   | <--   | FNE   | 0301 1102 ACK | 1460 |
1089	   |  6 |      |      R.ECC=0   | <--   | RECT  | 1762 1102 ACK | 1460 |
1090	   |  7 |      |      R.ECC=0   | <--   | FNE   | 3222 1102 ACK | 1460 |
1091	   |  8 |      | 1102 1762 ACK  | RECT  | -->   |      R.ECC=0  |      |
1092	   |  9 |      |      R.ECC=0   | <--   | RECT  | 4682 1102 ACK | 1460 |
1093	   | 10 |      |      R.ECC=0   | <--   | RECT  | 6142 1102 ACK | 1460 |
1094	   | 11 |      | 1102 3222 ACK  | RECT  | -->   |      R.ECC=0  |      |
1095	   | 12 |      |      R.ECC=0   | <--   | RECT  | 7602 1102 ACK | 1460 |
1096	   | 13 |      |      R.ECC=1   | <*-   | RECT  | 9062 1102 ACK | 1460 |
1097	   |    |      | ...            |       |       |               |      |
1098	   +----+------+----------------+-------+-------+---------------+------+

1100	                      Table 6: TCP Session Example #1

1102	   Table 6 shows an example TCP session, where the server B sets FNE on
1103	   its first and third data packets (lines 5 & 7) as well as on the
1104	   initial SYN ACK as previously described.  The left hand half of the
1105	   table shows the relevant settings of headers sent by client A in
1106	   three layers: the TCP payload size; TCP settings; then IP settings.
1107	   The right hand half gives equivalent columns for server B. The only
1108	   TCP settings shown are the sequence number (SEQ), acknowledgement
1109	   number (ACK) and the relevant control (CTL) flags that A sets in the
1110	   TCP header.  The IP columns show the setting of the extended ECN
1111	   (EECN) field.

1113	   Also shown on the receiving side of the table is the value of the
1114	   receiver's echo congestion counter (R.ECC) after processing the
1115	   incoming EECN header.  Note that, once a host sets a half-connection
1116	   into RECN mode, it MUST initialise its local value of ECC to zero.

1118	   The intuition that Appendix D gives for why a sender should set FNE
1119	   on the first and third data packets is as follows.  At line 13, a
1120	   packet sent by B is shown with an '*', which means it has been
1121	   congestion marked by an intermediate router from RECT to CE(-1).  On
1122	   receiving this CE marked packet, client A increments its ECC counter
1123	   to 1 as shown.  This was the 7th data packet B sent, but before
1124	   feedback about this event returns to B, it might well have sent many
1125	   more packets.  Indeed, during exponential slow start, about as many
1126	   packets will be in flight (unacknowledged) as have been acknowledged.
1127	   So, when the feedback from the congestion event on B's 7th segment
1128	   returns, B will have sent about 7 further packets that will still be
1129	   in flight.  At that stage, B's best estimate of the network's packet
1130	   marking fraction will be 1/7.  So, as B will have sent about 14
1131	   packets, it should have already marked 2 of them as FNE in order to
1132	   have marked 1/7; hence the need to have set the first and third data
1133	   packets to FNE.
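
   The flow-start behaviour of server B in Table 6 and the arithmetic
   above can be sketched as follows.  This is only an illustration: the
   function names are hypothetical, and the sketch deliberately ignores
   any re-echo behaviour once congestion feedback has arrived.

```python
from fractions import Fraction

# Data packets (counted from 1) on which the sender SHOULD set FNE at flow start.
FNE_DATA_PACKETS = {1, 3}


def eecn_for_data_packet(n):
    """EECN codepoint for the n-th data packet, before any feedback arrives."""
    return "FNE" if n in FNE_DATA_PACKETS else "RECT"


def fne_fraction(packets_sent):
    """Fraction of the first `packets_sent` data packets that carried FNE."""
    count = sum(
        1 for n in range(1, packets_sent + 1)
        if eecn_for_data_packet(n) == "FNE"
    )
    return Fraction(count, packets_sent)
```

   With about 14 packets sent by the time feedback on the 7th returns,
   fne_fraction(14) equals the estimated marking fraction of 1/7.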

1135	   Client A's behaviour in Table 6 also shows FNE being set on the first
1136	   SYN and the first data packet (lines 1 & 4), but in this case it
1137	   sends no more data packets, so of course, it cannot, and does not
1138	   need to, set FNE again.  Note that in the A-B direction there is no
1139	   need to set FNE on the third part of the three-way hand-shake (line
1140	   3---the ACK).

1142	   Note that in this section we have used the word SHOULD rather than
1143	   MUST when specifying how to set FNE on data segments before positive
1144	   congestion feedback arrives (but note that the word MUST was used for
1145	   FNE on the SYN and SYN ACK).  FNE is only RECOMMENDED for the first
1146	   and third data segments to entertain the possibility that the TCP
1147	   transport has the benefit of other knowledge of the path, which it
1148	   re-uses from one flow for the benefit of a newly starting flow.  For
1149	   instance, one flow can re-use knowledge of other flows between the
1150	   same hosts if using a Congestion Manager [RFC3124] or when a proxy
1151	   host aggregates congestion information for large numbers of flows.

1153	   After an idle period of more than 1 second, a re-ECN sender transport
1154	   MUST set the EECN field of the packet that resumes the connection to
1155	   FNE.  Note that this next packet may be sent a very long time later;
1156	   a packet does NOT have to be sent after 1 second of idling.  In order
1157	   that the design of network policers can be deterministic, this
1158	   specification deliberately puts an absolute lower limit on how long a
1159	   connection can be idle before the packet that resumes the connection
1160	   must be set to FNE, rather than relating it to the connection round
1161	   trip time.  We use the lower bound of the retransmission timeout
1162	   (RTO) [RFC2988], which is commonly used as the idle period before TCP
1163	   must reduce to the restart window [RFC2581].  Note our specification
1164	   of re-ECN's idle period is NOT intended to change the idle period for
1165	   TCP's restart, nor indeed for any other purposes.
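
   A minimal sketch of the idle-period rule, assuming the sender keeps a
   timestamp of its last transmission.  The names resume_codepoint and
   IDLE_THRESHOLD_S are hypothetical, and the codepoint used when the
   connection has not been idle is simply passed in by the caller.

```python
# Absolute idle limit from the text: "more than 1 second".
IDLE_THRESHOLD_S = 1.0


def resume_codepoint(last_send_time, now, normal="RECT"):
    """Return the EECN codepoint for the next packet.

    The packet that resumes a connection after more than 1 second of
    idling is set to FNE; otherwise the caller's normal choice is kept.
    Times are in seconds; the limit does not depend on the RTT.
    """
    if now - last_send_time > IDLE_THRESHOLD_S:
        return "FNE"
    return normal
```

   Note that nothing here forces a packet to be sent when the threshold
   expires; the check only runs when the transport next has a packet to
   send.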

1167	   {ToDo: Describe how the sender falls back to legacy modes if packets
1168	   don't appear to be getting through (to work round firewalls
1169	   discarding packets they consider unusual).}

1171	4.1.5.  Pure ACKS, Retransmissions, Window Probes and Partial ACKs

1173	   A re-ECN sender MUST clear the RE flag to "0" and set the ECN field
1174	   to Not-ECT in pure ACKs, retransmissions and window probes, as
1175	   specified in [RFC3168].  Our eventual goal is for all packets to be
1176	   sent with re-ECN enabled, and we believe the semantics of the ECI
1177	   field go a long way towards being able to achieve this.  However, we
1178	   have not completed a full security analysis for these cases, so for
1179	   now we merely re-state current practice.

1181	   We must also reconcile the facts that congestion marking is applied
1182	   to packets but acknowledgements cover octet ranges and acknowledged
1183	   octet boundaries need not match the transmitted boundaries.  The
1184	   general principle we work to is to remain compatible with TCP's
1185	   congestion control which is driven by congestion events at packet
1186	   granularity while at the same time aiming to blank the RE flag on at
1187	   least as many octets in a flow as have been marked CE.

1189	   Therefore, a re-ECN TCP receiver MUST increment its ECC value as many
1190	   times as CE marked packets have been received.  And that value MUST
1191	   be echoed to the sender in the first available ACK using the ECI
1192	   field.  This ensures the TCP sender's congestion control receives
1193	   timely feedback on congestion events at the same packet granularity
1194	   that they were generated on congested routers.

1196	   Then, a re-ECN sender stores the difference D between its own ECC
1197	   value and the incoming ECI field by adding D to a counter R.  R is
1198	   then decremented by 1 for each subsequent packet sent with the RE
1199	   flag blanked, until R is no longer positive.  Using this technique,
1200	   whenever a re-ECN transport sends a not re-ECN capable (NRECN) packet
1201	   (e.g. a retransmission), the remaining packets required to have the
1202	   RE flag blanked will be automatically carried over to subsequent
1203	   packets, through the variable R.
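
   The carry-over behaviour of the counter R can be sketched as below.
   This is an illustration under stated assumptions: the class and
   method names are hypothetical, the 3-bit ECI field is assumed to wrap
   modulo 8, and the codepoint names are those of Table 7.

```python
class ReEchoState:
    """Per-half-connection sender state for re-echoing congestion."""

    def __init__(self):
        self.ecc = 0  # sender's record of the last echoed counter value
        self.r = 0    # packets still owed with the RE flag blanked

    def on_ack(self, eci):
        # D is the number of new CE marks reported by the receiver.
        # Assumption: the 3-bit ECI field wraps modulo 8.
        d = (eci - self.ecc) % 8
        self.ecc = eci % 8
        self.r += d

    def next_codepoint(self, retransmission=False):
        if retransmission:
            # Sent as not re-ECN capable; R is left untouched, so the
            # debt is carried over to later packets.
            return "Not-RECT"
        if self.r > 0:
            self.r -= 1
            return "Re-Echo"  # RE flag blanked
        return "RECT"
```

   For example, after on_ack(2) the sender owes two blanked packets; a
   retransmission sent in between does not reduce that debt.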

1205	   This does not ensure precisely the same number of octets have RE
1206	   blanked as were CE marked.  But we believe positive errors will
1207	   cancel negative over a long enough period. {ToDo: However, more
1208	   research is needed to prove whether this is so.  If it is not, it may
1209	   be necessary to increment and decrement R in octets rather than
1210	   packets, by incrementing R as the product of D and the size in octets
1211	   of packets being sent (typically the MSS).}

1213	4.2.  Other Transports

1215	4.2.1.  General Guidelines for Adding Re-ECN to Other Transports

1217	   Re-ECT sender transports that have established the receiver transport
1218	   is at least ECN-capable (not necessarily re-ECN capable) MUST blank
1219	   the RE codepoint in packets carrying at least as many octets as
1220	   arrive at the receiver with the CE codepoint set.  Re-ECN-capable sender
1221	   transports should always initialise the ECN field to the ECT(1)
1222	   codepoint once a flow is established.

1224	   If the sender transport does not have sufficient feedback to even
1225	   estimate the path's CE rate, it SHOULD set FNE continuously.  If the
1226	   sender transport has some, perhaps stale, feedback to estimate that
1227	   the path's CE rate is nearly definitely less than E%, the transport
1228	   MAY blank RE in packets for E% of sent octets, and set the RECT
1229	   codepoint for the remainder.
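
   One way a transport with only a rough estimate E% might spread its
   blanked packets is sketched below.  The class name and the greedy
   scheduling rule are assumptions of this sketch, not part of the
   specification; it simply keeps the blanked octets at or above E% of
   the octets sent so far.

```python
class FractionalBlanker:
    """Blank RE on at least e_percent of sent octets (integer percent)."""

    def __init__(self, e_percent):
        if not 0 <= e_percent <= 100:
            raise ValueError("e_percent must be between 0 and 100")
        self.e_percent = e_percent
        self.sent_octets = 0
        self.blanked_octets = 0

    def codepoint(self, size):
        """Choose the codepoint for a packet of `size` octets."""
        self.sent_octets += size
        # Integer comparison avoids floating-point rounding.
        if self.blanked_octets * 100 < self.e_percent * self.sent_octets:
            self.blanked_octets += size
            return "Re-Echo"  # RE flag blanked
        return "RECT"
```

   With E = 10 and equal-sized packets, this blanks the 1st, 11th, 21st
   packets and so on, setting RECT on the remainder.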

1231	   The following sections give guidelines on how re-ECN support could be
1232	   added to RSVP or NSIS, to DCCP, and to SCTP - although separate
1233	   Internet drafts will be necessary to document the exact mechanics of
1234	   re-ECN in each of these protocols.

1236	   {ToDo: Give a brief outline of what would be expected for each of the
1237	   following:

1239	   o  UDP fire and forget (e.g.  DNS)

1241	   o  UDP streaming with no feedback

1243	   o  UDP streaming with feedback

1245	   }

1247	4.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS

1249	   A separate I-D has been submitted [Re-PCN] describing how re-ECN can
1250	   be used in an edge-to-edge rather than end-to-end scenario.  It can
1251	   then be used by downstream networks to police whether upstream
1252	   networks are blocking new flow reservations when downstream
1253	   congestion is too high, even though the congestion is in other
1254	   operators' downstream networks.  This relates to current IETF work on
1255	   Admission Control over Diffserv using Pre-Congestion Notification
1256	   (PCN)  [PCN-arch].

1258	4.2.3.  Guidelines for adding Re-ECN to DCCP

1260	   Besides adjusting the initial features negotiation sequence, operating
1261	   re-ECN in DCCP [RFC4340] could be achieved by defining a new option
1262	   to be added to acknowledgments, that would include a multibit field
1263	   where the destination could copy its ECC.

1265	4.2.4.  Guidelines for adding Re-ECN to SCTP

1267	   Annex 1 in [RFC2960] gives the specifications for SCTP to support
1268	   ECN.  Similar steps should be taken to support re-ECN.  Besides
1269	   adjusting the initial features negotiation sequence, operating re-ECN
1270	   in SCTP could be achieved by defining a new control chunk, that would
1271	   include a multibit field where the destination could copy its ECC.

1273	5.  Network Layer

1275	5.1.  Re-ECN IPv4 Wire Protocol

1277	   The wire protocol of the ECN field in the IP header remains largely
1278	   unchanged from [RFC3168].  However, an extension to the ECN field we
1279	   call the RE (re-ECN extension) flag (Section 3.2) is defined in this
1280	   document.  It doubles the extended ECN codepoint space, giving 8
1281	   potential codepoints.  The semantics of the extra codepoints are
1282	   backward compatible with the semantics of the 4 original codepoints
1283	   [RFC3168] (Section 7.1 collects together and summarises all the
1284	   changes defined in this document).

1286	   For IPv4, this document proposes that the new RE control flag will be
1287	   positioned where the `reserved' control flag was at bit 48 of the
1288	   IPv4 header (counting from 0).  Alternatively, some would call this
1289	   bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4
1290	   header (Figure 5).

1292	             0   1   2
1293	           +---+---+---+
1294	           | R | D | M |
1295	           | E | F | F |
1296	           +---+---+---+

1298	   Figure 5: New Definition of the Re-ECN Extension (RE) Control Flag at
1299	                  the Start of Byte 7 of the IPv4 Header

1301	   The semantics of the RE flag are described in outline in Section 3
1302	   and specified fully in Section 4.  The RE flag is always considered
1303	   in conjunction with the 2-bit ECN field, as if they were concatenated
1304	   together to form a 3-bit extended ECN field.  If the ECN field is set
1305	   to either the ECT(1) or CE codepoint, when the RE flag is blanked
1306	   (cleared to "0") it represents a re-echo of congestion experienced by
1307	   an earlier packet.  If the ECN field is set to the Not-ECT codepoint,
1308	   when the RE flag is set to "1" it represents the feedback not
1309	   established (FNE) codepoint, which signals that the packet was sent
1310	   without the benefit of congestion feedback.
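
   Reading the 3-bit extended ECN field from a raw IPv4 header might
   look like the sketch below.  The function name is hypothetical.  The
   RE flag is taken from bit 48 of the header (the top bit of the octet
   at offset 6, counting from 0) and the ECN field from the two
   low-order bits of the octet at offset 1 [RFC3168]; codepoint names
   follow Table 7.

```python
# Keys are (2-bit ECN field, RE flag); names follow Table 7.
EECN_NAMES = {
    (0b01, 0): "Re-Echo",
    (0b00, 1): "FNE",
    (0b11, 0): "CE(0)",
    (0b01, 1): "RECT",
    (0b11, 1): "CE(-1)",
    (0b10, 1): "CU",
    (0b10, 0): "Legacy ECN",
    (0b00, 0): "Not-RECT",
}


def eecn_codepoint(ipv4_header):
    """Return the extended ECN codepoint name for a raw IPv4 header."""
    if len(ipv4_header) < 20 or ipv4_header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    ecn = ipv4_header[1] & 0x03             # low two bits of the former TOS octet
    re_flag = (ipv4_header[6] >> 7) & 0x01  # bit 48, previously 'reserved'
    return EECN_NAMES[(ecn, re_flag)]
```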

1312	   It is believed that the FNE codepoint can simultaneously serve other
1313	   purposes, particularly where the start of a flow needs distinguishing
1314	   from packets later in the flow.  For instance it would have been
1315	   useful to identify new flows for tag switching and might enable
1316	   similar developments in the future if it were adopted.  It is similar
1317	   to the state set-up bit idea designed to protect against memory
1318	   exhaustion attacks.  This idea was proposed informally by David Clark
1319	   and documented by Handley and Greenhalgh [Steps_DoS].  The FNE
1320	   codepoint can be thought of as a `soft-state set-up flag', because it
1321	   is idempotent (i.e. one occurrence of the flag is sufficient but
1322	   further occurrences achieve the same effect if previous ones were
1323	   lost).

1325	   There will probably be other claims pending on the use of bit 48.  We
1326	   know of at least two, [ARI05] and [RFC3514], but neither has been
1327	   pursued in the IETF so far, although the present proposal would meet
1328	   the needs of the former.

1330	   The security flag proposal (commonly known as the evil bit) was
1331	   published on 1 April 2003 as Informational RFC 3514, but it was not
1332	   adopted due to confusion over whether evil-doers might set it
1333	   inappropriately.  The present proposal is backward compatible with
1334	   RFC3514 because if re-ECN compliant senders were benign they would
1335	   correctly clear the evil bit to honestly declare that they had just
1336	   received congestion feedback.  Whereas evil-doers would hide
1337	   congestion feedback by setting the evil bit continuously, or at least
1338	   more often than they should.  So, evil senders can be identified,
1339	   because they declare that they are good less often than they should.

1341	5.2.  Re-ECN IPv6 Wire Protocol

1343	   For IPv6, this document proposes that the new RE control flag will be
1344	   positioned as the first bit of the option field of a new Congestion
1345	   hop by hop option header (Figure 6).

1347	        0                   1                   2                   3
1348	        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1349	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1350	       |  Next Header  |  Hdr ext Len  |  Option Type  | Opt Length =4 |
1351	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1352	       |R|                     Reserved for future use                 |
1353	       |E|                                                             |
1354	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1356	      Figure 6: Definition of a New IPv6 Congestion Hop by Hop Option
1357	         Header containing the Re-ECN Extension (RE) Control Flag

1359	               0 1 2 3 4 5 6 7 8
1360	               +-+-+-+-+-+-+-+-+-
1361	               |AIU|C|Option ID|
1362	               +-+-+-+-+-+-+-+-+-

1364	           Figure 7: Congestion Hop by Hop Option Type Encoding

1366	   The Hop-by-Hop Options header enables packets to carry information to
1367	   be examined and processed by routers or nodes along the packet's
1368	   delivery path, including the source and destination nodes.  For re-
1369	   ECN, the two bits of the Action If Unrecognized (AIU) flag of the
1370	   Congestion extension header MUST be set to "00" meaning if
1371	   unrecognized `skip over option and continue processing the header'.
1372	   Then, any routers or a receiver not upgraded with the optional re-ECN
1373	   features described in this memo will simply ignore this header.  But
1374	   routers with these optional re-ECN features or a re-ECN policing
1375	   function, will process this Congestion extension header.

1377	   The `C' flag MUST be set to "1" to specify that the Option Data
1378	   (currently only the RE control flag) can change en-route to the
1379	   packet's final destination.  This ensures that, when an
1380	   Authentication header (AH [RFC2402]) is present in the packet, for
1381	   any option whose data may change en-route, its entire Option Data
1382	   field will be treated as zero-valued octets when computing or
1383	   verifying the packet's authenticating value.

1385	   Although the RE control flag should not be changed along the path, we
1386	   expect that the rest of this option field that is currently `Reserved
1387	   for future use' could be used for a multi-bit congestion notification
1388	   field which we would expect to change en route.  As the RE flag does
1389	   not need end-to-end authentication, we set the C flag to '1'.
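
   The layout of Figures 6 and 7 can be sketched as below.  Because no
   Option ID has been registered yet, PLACEHOLDER_OPTION_ID is a purely
   hypothetical value, and the function names are likewise assumptions
   of this sketch.  The Hdr Ext Len of 0 reflects that the whole header
   in Figure 6 occupies exactly 8 octets.

```python
import struct

# Hypothetical 5-bit Option ID; the real value is still to be registered.
PLACEHOLDER_OPTION_ID = 0x1E


def option_type(option_id=PLACEHOLDER_OPTION_ID):
    """Option Type octet: AIU = 00 (skip if unrecognized), C = 1."""
    aiu = 0b00
    c_flag = 1
    return (aiu << 6) | (c_flag << 5) | (option_id & 0x1F)


def congestion_hbh_header(next_header, re_flag, option_id=PLACEHOLDER_OPTION_ID):
    """Build the 8-octet Congestion hop-by-hop option header of Figure 6."""
    # RE is the first bit of the 4-octet option data; the rest is reserved.
    option_data = 0x80000000 if re_flag else 0
    return struct.pack(
        "!BBBBI",
        next_header,             # Next Header
        0,                       # Hdr Ext Len: no octets beyond the first 8
        option_type(option_id),  # Option Type (Figure 7)
        4,                       # Opt Length = 4
        option_data,
    )
```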

1391	   {ToDo: A Congestion Hop by Hop Option ID will need to be registered
1392	   with IANA.}

1394	5.3.  Router Forwarding Behaviour

1396	   Re-ECN works well without modifying the forwarding behaviour of any
1397	   routers.  However, below, two OPTIONAL changes to forwarding
1398	   behaviour are defined which respectively enhance performance and
1399	   improve a router's discrimination against flooding attacks.  They are
1400	   both OPTIONAL additions that we propose MAY apply by default to all
1401	   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
1402	   marking behaviours [RFC3168].  Specifications for PHBs MAY define
1403	   different forwarding behaviours from this default, but this is NOT
1404	   REQUIRED.  [Re-PCN] is one example.

1406	   FNE indicates ECT:

1408	      The FNE codepoint tells a router to assume that the packet was
1409	      sent by an ECN-capable transport (see Section 5.4).  Therefore an
1410	      FNE packet MAY be marked rather than dropped.  Note that the FNE
1411	      codepoint has been intentionally chosen so that, to legacy routers
1412	      (which do not inspect the RE flag) an FNE packet appears to be
1413	      Not-ECT so it will be dropped by legacy AQM algorithms.

1415	      A network operator MUST NOT configure a router to ECN mark rather
1416	      than drop FNE packets unless it can guarantee that FNE packets
1417	      will be rate limited, either locally or upstream.  The ingress
1418	      policers discussed in Section 6.1.5 would count as rate limiters
1419	      for this purpose.

1421	   Preferential Drop:  If a re-ECN capable router experiences very high
1422	      load so that it has to drop arriving packets (e.g. a DoS attack),
1423	      it MAY preferentially drop packets within the same Diffserv PHB
1424	      using the preference order for extended ECN codepoints given in
1425	      Table 7.  Preferential dropping can be difficult to implement on
1426	      some hardware, but if feasible it would discriminate against
1427	      attack traffic if done as part of the overall policing framework
1428	      of Section 6.1.3.  If nowhere else, routers at the egress of a
1429	      network SHOULD implement preferential drop (stronger than the MAY
1430	      above).  For simplicity, preferences 4 & 5 MAY be merged into one
1431	      preference level.

1433	   +-------+-----+------------+-------+------------+-------------------+
1434	   |  ECN  |  RE | Extended   | Worth | Drop Pref  |   Re-ECN meaning  |
1435	   | field | bit | ECN        |       | (1 = drop  |                   |
1436	   |       |     | codepoint  |       | 1st)       |                   |
1437	   +-------+-----+------------+-------+------------+-------------------+
1438	   |   01  |  0  | Re-Echo    | +1    | 5/4        |     Re-echoed     |
1439	   |       |     |            |       |            |   congestion and  |
1440	   |       |     |            |       |            |        RECT       |
1441	   |   00  |  1  | FNE        | +1    | 4          |    Feedback not   |
1442	   |       |     |            |       |            |    established    |
1443	   |   11  |  0  | CE(0)      | 0     | 3          |  Re-Echo canceled |
1444	   |       |     |            |       |            |   by congestion   |
1445	   |       |     |            |       |            |    experienced    |
1446	   |   01  |  1  | RECT       | 0     | 3          |   Re-ECN capable  |
1447	   |       |     |            |       |            |     transport     |
1448	   |   11  |  1  | CE(-1)     | -1    | 3          |     Congestion    |
1449	   |       |     |            |       |            |    experienced    |
1450	   |   10  |  1  | --CU--     | n/a   | 2          |  Currently Unused |
1451	   |   10  |  0  | ---        | n/a   | 2          |   Legacy ECN use  |
1452	   |       |     |            |       |            |        only       |
1453	   |   00  |  0  | Not-RECT   | n/a   | 1          |        Not        |
1454	   |       |     |            |       |            |   re-ECN-capable  |
1455	   |       |     |            |       |            |     transport     |
1456	   +-------+-----+------------+-------+------------+-------------------+

1458	       Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth')

1460	      The above drop preferences are arranged to preserve packets with
1461	      more positive worth (Section 3.4), given senders of positive
1462	      packets must have honestly declared downstream congestion.  This
1463	      is explained fully in Section 6 on applications, particularly when
1464	      the application of re-ECN to protect against DDoS attacks is
1465	      described.
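
   As an illustration only, the preference order of Table 7 could be
   applied to a set of queued packets as sketched below.  The function
   name is hypothetical, ties are broken by queue position purely for
   simplicity, and the Re-Echo entry of "5/4" is treated as 5 unless the
   optional merging of preferences 4 & 5 is enabled.

```python
# Keys are (2-bit ECN field, RE flag); values are drop preference
# from Table 7, where 1 means "drop first".
DROP_PREFERENCE = {
    (0b01, 0): 5,  # Re-Echo (5/4 in Table 7)
    (0b00, 1): 4,  # FNE
    (0b11, 0): 3,  # CE(0)
    (0b01, 1): 3,  # RECT
    (0b11, 1): 3,  # CE(-1)
    (0b10, 1): 2,  # Currently Unused
    (0b10, 0): 2,  # Legacy ECN use only
    (0b00, 0): 1,  # Not-RECT
}


def pick_victim(queue, merge_4_and_5=False):
    """Return the index of the packet to drop first.

    `queue` is a non-empty list of (ecn_field, re_flag) tuples.
    """
    def preference(i):
        p = DROP_PREFERENCE[queue[i]]
        return min(p, 4) if merge_4_and_5 else p

    # min() returns the first index among equal preferences.
    return min(range(len(queue)), key=preference)
```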

1467	5.4.  Justification for Setting the First SYN to FNE

1469	   Congested routers may mark an FNE packet to CE(-1) (Section 5.3), and
1470	   the initial SYN MUST be set to FNE by Re-ECT client A
1471	   (Section 4.1.4).  So an initial SYN may be marked CE(-1) rather than
1472	   dropped.  This seems dangerous, because the sender has not yet
1473	   established whether the receiver is a legacy one that does not
1474	   understand congestion marking.  It also seems to allow malicious
1475	   senders to take advantage of ECN marking to avoid so much drop when
1476	   launching SYN flooding attacks.  Below we explain the features of the
1477	   protocol design that remove both these dangers.

1479	   ECN-capable initial SYN with a Not-ECT server:  If the TCP server B
1480	      is re-ECN capable, provision is made for it to feed back a possible
1481	      congestion marked SYN in the SYN ACK (Section 4.1.4).  But if the
1482	      TCP client A finds out from the SYN ACK that the server was not
1483	      ECN-capable, the TCP client MUST consider the first SYN as
1484	      congestion marked before setting itself into Not-ECT mode.
1485	      Section 4.1.4 mandates that such a TCP client MUST also set its
1486	      initial window to 1 segment.  In this way we remove the need to
1487	      cautiously avoid setting the first SYN to Not-RECT.  This will
1488	      give worse performance while deployment is patchy, but better
1489	      performance once deployment is widespread.

1491	   SYN flooding attacks can't exploit ECN-capability:  Malicious hosts
1492	      may think they can use the advantage that ECN-marking gives over
1493	      drop in launching classic SYN-flood attacks.  But Section 5.3
1494	      mandates that a router MUST only be configured to treat packets
1495	      with the FNE codepoint as ECN-capable if FNE packets are rate
1496	      limited.  Introduction of the FNE codepoint was a deliberate move
1497	      to enable transport-neutral handling of flow-start and flow state
1498	      set-up in the IP layer where it belongs.  It then becomes possible
1499	      to protect against flooding attacks of all forms (not just SYN
1500	      flooding) without transport-specific inspection for things like
1501	      the SYN flag in TCP headers.  Then, for instance, SYN flooding
1502	      attacks using IPSec ESP encryption can also be rate limited at the
1503	      IP layer.

1505	   It might seem pedantic going to all this trouble to enable ECN on the
1506	   initial packet of a flow, but it is motivated by a much wider concern
1507	   to ensure safe congestion control will still be possible even if the
1508	   application mix evolves to the point where the majority of flows
1509	   consist of a single window or even a single packet.  It also allows
1510	   denial of service attacks to be more easily isolated and prevented.

1512	5.5.  Control and Management

1514	5.5.1.  Negative Balance Warning

1516	   A new ICMP message type is being considered so that a dropper can
1517	   warn the apparent sender of a flow that it has started to sanction
1518	   the flow.  The message would have similar semantics to the `Time
1519	   exceeded' ICMP message type.  To ensure the sender has to invest some
1520	   work before the network will generate such a message, a dropper
1521	   SHOULD only send such a message for flows that have demonstrated that
1522	   they have started correctly by establishing a positive record, but
1523	   have later gone negative.  The threshold is up to the implementation.
1524	   The purpose of the message is to distinguish the cause of drops from
1525	   other causes, such as congestion or transmission losses.  The dropper
1526	   would send the message to the sender of the flow, not the receiver.

1528	   If we did define this message type, it would be REQUIRED for all re-
1529	   ECT senders to parse and understand it.  Note that a sender MUST only
1530	   use this message to explain why losses are occurring.  A sender MUST
1531	   NOT take this message to mean that losses have occurred that it was
1532	   not aware of.  Otherwise, spoof messages could be sent by malicious
1533	   sources to slow down a sender (cf. ICMP source quench).

1535	   However, the need for this message type is not yet confirmed, as we
1536	   are considering how to prevent it being used by malicious senders to
1537	   scan for droppers and to test their threshold settings. {ToDo:
1538	   Complete this section.}

1540	5.5.2.  Rate Response Control

1542	   As discussed in Section 6.1.5 the sender's access operator will be
1543	   expected to use bulk per-user policing, but they might choose to
1544	   introduce a per-flow policer.  In cases where operators do introduce
1545	   per-flow policing, there may be a need for a sender to send a request
1546	   to the ingress policer asking for permission to apply a non-default
1547	   response to congestion (where TCP-friendly is assumed to be the
1548	   default).  This would require the sender to know what message
1549	   format(s) to use and to be able to discover how to address the
1550	   policer.  The required control protocol(s) are outside the scope of
1551	   this document, but will require definition elsewhere.

1553	   The policer is likely to be local to the sender and inline, probably
1554	   at the ingress interface to the internetwork.  So, discovery should
1555	   not be hard.  A variety of control protocols already exist for some
1556	   widely used rate-responses to congestion.  For instance DCCP
1557	   congestion control identifiers (CCIDs [RFC4340]) fulfil this role and
1558	   so does QoS signalling (e.g. an RSVP request for controlled load
1559	   service is equivalent to a request for no rate response to
1560	   congestion, but with admission control).

1562	5.6.  IP in IP Tunnels

1564	   For re-ECN to work correctly through IP in IP tunnels, it needs
1565	   slightly different tunnel handling to regular ECN [RFC3168].
1566	   Currently there is some inconsistency between how the handling of IP
1567	   in IP tunnels is defined in [RFC3168] and how it is defined in
1568	   [RFC4301], but re-ECN would work fine with the IPsec behaviour.  This
1569	   inconsistency is addressed in a new Internet Draft [ECN-tunnel] that
1570	   proposes to update RFC3168 tunnel behaviour to bring it into line
1571	   with IPsec.  Ideally, for re-ECN to work through a tunnel, the tunnel
1572	   entry should copy both the RE flag and the ECN field from the inner
1573	   to the outer IP header.  Then at the tunnel exit, any congestion
1574	   marking of the outer ECN field should overwrite the inner ECN field
1575	   (unless the inner field is Not-ECT in which case an alarm should be
1576	   raised).  The RE flag shouldn't change along a path, so the outer RE
1577	   flag should be the same as the inner.  If it isn't, a management alarm
1578	   should be raised.  This behaviour is the same as the full-
1579	   functionality variant of [RFC3168] at tunnel exit, but different at
1580	   tunnel entry.
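
   The entry and exit behaviour described above can be sketched as
   follows.  This is only an illustration, assuming a simplified header
   that holds just the ECN field and the RE flag.  The names Header,
   encapsulate and decapsulate are hypothetical and are not taken from
   [RFC3168], [RFC4301] or [ECN-tunnel].  What happens to a packet that
   triggers an alarm is left out, because the text above does not say.

```python
from dataclasses import dataclass
from typing import List, Tuple

# ECN field values as defined in [RFC3168]
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class Header:
    ecn: int   # 2-bit ECN field
    re: bool   # RE flag


def encapsulate(inner: Header) -> Header:
    """Tunnel entry: copy both the ECN field and the RE flag outwards."""
    return Header(ecn=inner.ecn, re=inner.re)


def decapsulate(outer: Header, inner: Header) -> Tuple[Header, List[str]]:
    """Tunnel exit: return the forwarded header and any alarms raised."""
    alarms: List[str] = []
    ecn = inner.ecn
    if outer.ecn == CE:
        if inner.ecn == NOT_ECT:
            # Congestion marking on a packet that never claimed ECN support.
            alarms.append("outer is CE but inner is Not-ECT")
        else:
            # Congestion marking of the outer overwrites the inner field.
            ecn = CE
    if outer.re != inner.re:
        # The RE flag is not meant to change along a path.
        alarms.append("RE flag differs between outer and inner header")
    return Header(ecn=ecn, re=inner.re), alarms
```

   For example, a packet that enters the tunnel as ECT(0) with RE set
   and is marked CE inside the tunnel leaves it as CE with RE still
   set, and no alarm is raised.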

1582	   If tunnels are left as they are specified in [RFC3168], whether the
1583	   limited or full-functionality variants are used, a problem arises
1584	   with re-ECN if a tunnel crosses an inter-domain boundary, because the
1585	   difference between positive and negative markings will not be
1586	   correctly accounted for.  In a limited functionality ECN tunnel, the
1587	   flow will appear to be legacy traffic, and therefore may be wrongly
1588	   rate limited.  In a full-functionality ECN tunnel, the result will
1589	   depend on whether the tunnel entry copies the inner RE flag to the outer
1590	   header or the RE flag in the outer header is always cleared.  If the
1591	   former, the flow will tend to be too positive when accounted for at
1592	   borders.  If the latter, it will be too negative.  If the rules set
1593	   out in [ECN-tunnel] are followed then this will not be an issue.

1595	5.7.  Non-Issues

1597	   The following issues might seem to cause unfavourable interactions
1598	   with re-ECN, but we will explain why they don't:

1600	   o  Various link layers support explicit congestion notification, such
1601	      as Frame Relay and ATM.  Explicit congestion notification is
1602	      proposed to be added to other link layers, such as Ethernet
1603	      (802.3ar Ethernet congestion management) and MPLS [ECN-MPLS];

1605	   o  Encryption and IPsec.

1607	   In the case of congestion notification at the link layer, each
1608	   particular link layer scheme either manages congestion on the link
1609	   with its own link-level feedback (the usual arrangement in the cases
1610	   of ATM and Frame Relay), or congestion notification from the link
1611	   layer is merged into congestion notification at the IP level when the
1612	   frame headers are decapsulated at the end of the link (the
1613	   recommended arrangement in the Ethernet and MPLS cases).  Given the
1614	   RE flag is not intended to change along the path, this means that
1615	   downstream congestion will still be measurable at any point where IP
1616	   is processed on the path by subtracting negative from positive
1617	   markings.

1619	   In the case of encryption, as long as the tunnel issues described in
1620	   Section 5.6 are dealt with, payload encryption itself will not be a
1621	   problem.  The design goal of re-ECN is to include downstream
1622	   congestion in the IP header so that it is not necessary to burrow into
1623	   inner headers.  Obfuscation of flow identifiers is not a problem for
1624	   re-ECN policing elements.  Re-ECN doesn't ever require flow
1625	   identifiers to be valid; it only requires them to be unique.  So if
1626	   an IPsec encapsulating security payload (ESP [RFC2406]) or an
1627	   authentication header (AH [RFC2402]) is used, the security parameters
1628	   index (SPI) will be a sufficient flow identifier, as it is intended
1629	   to be unique to a flow without revealing actual port numbers.

1631	   In general, even if endpoints use some locally agreed scheme to hide
1632	   port numbers, re-ECN policing elements can just consider the pair of
1633	   source and destination IP addresses as the flow identifier.  Re-ECN
1634	   encourages endpoints to at least tell the network layer that a
1635	   sequence of packets are all part of the same flow, if indeed they
1636	   are.  The alternative would be for the sender to make each packet
1637	   appear to be a new flow, which would require them all to be marked
1638	   FNE in order to avoid being treated with the bulk of malicious flows
1639	   at the egress dropper.  Given the FNE marking is worth +1 and
1640	   networks are likely to rate limit FNE packets, endpoints are given an
1641	   incentive not to set FNE on each packet.  But if the sender really
1642	   does want to hide the flow relationship between packets it can choose
1643	   to pay the cost of multiple FNE packets, which in the long run will
1644	   compensate for the extra memory required on network policing elements
1645	   to process each flow.
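
   The choice of flow identifier described above could look like the
   following sketch.  It assumes a policing element that has already
   parsed the fields it can see.  PacketInfo and flow_id are
   hypothetical names, and the order of preference is an illustrative
   assumption rather than something this document specifies.

```python
from typing import NamedTuple, Optional, Tuple

ESP, AH = 50, 51  # IP protocol numbers for ESP and AH


class PacketInfo(NamedTuple):
    src: str
    dst: str
    protocol: int                 # IP protocol number
    spi: Optional[int] = None     # present when ESP or AH is used
    sport: Optional[int] = None   # transport ports, if visible
    dport: Optional[int] = None


def flow_id(pkt: PacketInfo) -> Tuple:
    """Return a key that only has to be unique per flow, not meaningful."""
    if pkt.protocol in (ESP, AH) and pkt.spi is not None:
        # The SPI distinguishes flows without revealing port numbers.
        return (pkt.src, pkt.dst, pkt.protocol, pkt.spi)
    if pkt.sport is not None and pkt.dport is not None:
        return (pkt.src, pkt.dst, pkt.protocol, pkt.sport, pkt.dport)
    # Ports hidden by some locally agreed scheme: use the address pair.
    return (pkt.src, pkt.dst)
```

   Two ESP flows between the same pair of addresses get different keys
   as long as their SPIs differ, which is all that policing needs.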

1647	6.  Applications

1649	6.1.  Policing Congestion Response

1651	6.1.1.  The Policing Problem

1653	   The current Internet architecture trusts hosts to respond voluntarily
1654	   to congestion.  Limited evidence shows that the large majority of
1655	   end-points on the Internet comply with a TCP-friendly response to
1656	   congestion.  But telephony (and increasingly video) services over the
1657	   best effort Internet are attracting the interest of major commercial
1658	   operations.  Most of these applications do not respond to congestion
1659	   at all.  Those that can switch to lower rate codecs still have a
1660	   lower bound below which they must become unresponsive to congestion.

1662	   Of course, the Internet is intended to support many different
1663	   application behaviours.  But the problem is that this freedom can be
1664	   exercised irresponsibly.  The greater problem is that we will never
1665	   be able to agree on where the boundary is between responsible and
1666	   irresponsible.  Therefore re-ECN is designed to allow different
1667	   networks to set their own view of the limit to irresponsibility, and
1668	   to allow networks that choose a more conservative limit to push back
1669	   against congestion caused in more liberal networks.

1671	   As an example of the impossibility of setting a standard for
1672	   fairness, mandating TCP-friendliness would set the bar too high for
1673	   unresponsive streaming media, but still some would say the bar was
1674	   too low.  Even though all known peer-to-peer filesharing applications
1675	   are TCP-compatible, they can cause a disproportionate amount of
1676	   congestion, simply by using multiple flows and by transferring data
1677	   continuously relative to other short-lived sessions.  On the other
1678	   hand, if we swung the other way and set the bar low enough to allow
1679	   streaming media to be unresponsive, we would also allow denial of
1680	   service attacks, which are typically unresponsive to congestion and
1681	   consist of multiple continuous flows.

1683	   Applications that need (or choose) to be unresponsive to congestion
1684	   can effectively take (some would say steal) whatever share of
1685	   bottleneck resources they want from responsive flows.  Whether or not
1686	   such free-riding is common, inability to prevent it increases the
1687	   risk of poor returns for investors in network infrastructure, leading
1688	   to under-investment.  An increasing proportion of unresponsive or
1689	   free-riding demand coupled with persistent under-supply is a broken
1690	   economic cycle.  Therefore, if the current, largely co-operative
1691	   consensus continues to erode, congestion collapse could become more
1692	   common in more areas of the Internet [RFC3714].

1694	   While we have designed re-ECN so that networks can choose to deploy
1695	   stringent policing, this does not imply we advocate that every
1696	   network should introduce tight controls on those that cause
1697	   congestion.  Re-ECN has been specifically designed to allow different
1698	   networks to choose how conservative or liberal they wish to be with
1699	   respect to policing congestion.  But those that choose to be
1700	   conservative can protect themselves from the excesses that liberal
1701	   networks allow their users.

1703	6.1.2.  The Case Against Bottleneck Policing

1705	   The state of the art in rate policing is the bottleneck policer,
1706	   which is intended to be deployed at any forwarding resource that may
1707	   become congested.  Its aim is to detect flows that cause
1708	   significantly more local congestion than others.  Although operators
1709	   might solve their immediate problems by deploying bottleneck
1710	   policers, we are concerned that widespread deployment would make it
1711	   extremely hard to evolve new application behaviours.  We believe the
1712	   IETF should offer re-ECN as the preferred protocol on which to base
1713	   solutions to the policing problems of operators, because it would not
1714	   harm evolvability and, frankly, it would be far more effective (see
1715	   later for why).

1717	   Schemes like [XCHOKe] and [pBox] are useful approaches for rate
1718	   policing traffic without the benefit of whole path information (such
1719	   as could be provided by re-ECN).  But they must be deployed at
1720	   bottlenecks in order to work.  Unfortunately, a large proportion of
1721	   traffic traverses at least two bottlenecks (in two access networks),
1722	   particularly with the current traffic mix where peer-to-peer file-
1723	   sharing is prevalent.  If ECN were deployed, we believe it would be
1724	   likely that these bottleneck policers would be adapted to combine ECN
1725	   congestion marking from the upstream path with local congestion
1726	   knowledge.  But then the only useful placement for such policers
1727	   would be close to the egress of the internetwork.

1729	   But then, if these bottleneck policers were widely deployed (which
1730	   would require them to be more effective than they are now), the
1731	   Internet would find itself with one universal rate adaptation policy
1732	   (probably TCP-friendliness) embedded throughout the network.  Given
1733	   TCP's congestion control algorithm is already known to be hitting its
1734	   scalability limits and new algorithms are being developed for high-
1735	   speed congestion control, embedding TCP policing into the Internet
1736	   would make evolution to new algorithms extremely painful.  If a
1737	   source wanted to use a different algorithm, it would have to first
1738	   discover then negotiate with all the policers on its path,
1739	   particularly those in the far access network.  The IETF has already
1740	   traveled that path with the Intserv architecture and found it
1741	   constrains scalability [RFC2208].

1743	   Anyway, if bottleneck policers were ever widely deployed, they would
1744	   be likely to be bypassed by determined attackers.  They inherently
1745	   have to police fairness per flow or per source-destination pair.
1746	   Therefore they can easily be circumvented either by opening multiple
1747	   flows (by varying the end-point port number); or by spoofing the
1748	   source address but arranging with the receiver to hide the true
1749	   return address at a higher layer.

1751	6.1.3.  Re-ECN Incentive Framework

1753	   The aim is to create an incentive environment that ensures optimal
1754	   sharing of capacity despite everyone acting selfishly (including
1755	   lying and cheating).  Of course, the mechanisms put in place for this
1756	   can lie dormant wherever co-operation is the norm.

1758	   Throughout this document we focus on path congestion.  But some forms
1759	   of fairness, particularly TCP's, also depend on round trip time.  If
1760	   TCP-fairness is required, we also propose to measure downstream path
1761	   delay using re-feedback.  We give a simple outline of how this could
1762	   work in Appendix F.  However, we do not expect this to be necessary,
1763	   as researchers tend to agree that only congestion control dynamics
1764	   need to depend on RTT, not the rate that the algorithm would converge
1765	   on after a period of stability.

1767	   Figure 8 sketches the incentive framework that we will describe piece
1768	   by piece throughout this section.  We will do a first pass in
1769	   overview, then return to each piece in detail.  We re-use the earlier
1770	   example of how downstream congestion is derived by subtracting
1771	   upstream congestion from path congestion (Figure 1) but depict
1772	   multiple trust boundaries to turn it into an internetwork.  For
1773	   clarity, only downstream congestion is shown (the difference between
1774	   the two earlier plots).  The graph displays downstream path
1775	   congestion seen in a typical flow as it traverses an example path
1776	   from sender S to receiver R, across networks N1, N2 & N4.  Everyone
1777	   is shown using re-ECN correctly, but we intend to show why everyone
1778	   would /choose/ to use it correctly, and honestly.

1780	   Three main types of self-interest can be identified:

1782	   o  Users want to transmit data across the network as fast as
1783	      possible, paying as little as possible for the privilege.  In this
1784	      respect, there is no distinction between senders and receivers,
1785	      but we must be wary of potential malice by one on the other.

1787	   o  Network operators want to maximise revenues from the resources
1788	      they invest in.  They compete amongst themselves for the custom of
1789	      users.

1791	   o  Attackers (whether users or networks) want to use any opportunity
1792	      to subvert the new re-ECN system for their own gain or to damage
1793	      the service of their victims, whether targeted or random.

1795	          policer
1796	           |
1797	           |
1798	         S <-----N1----> <---N2---> <---N4--> R         domain
1799	         | :                                :
1800	       A\|/:                                :
1801	       | V :                                :
1802	    3% |---------+                          :
1803	       |   :     |                          :
1804	    2% |   :     +-----------------------+  :
1805	       |   :    downstream congestion    |  :
1806	    1% |   :                             |  :
1807	       |   :                             |  :
1808	    0% +---------------------------------+=====-->
1809	                 0                       i  ^      resource index
1810	                 |                       | /|\
1811	               1.00%                  2.00% |       marking fraction
1812	                                            |
1813	                                         dropper

1815	   Figure 8: Incentive Framework, showing creation of opposing pressures
1816	     to under-declare and over-declare downstream congestion, using a
1817	                           policer and a dropper
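
   The values plotted in Figure 8 can be reproduced with a few lines of
   arithmetic.  This sketch assumes that small marking fractions add
   approximately along the path, and it takes the two marking fractions
   from the figure (1.00% at resource 0 and 2.00% at resource i).  The
   function name is hypothetical.

```python
def downstream_congestion(marking_fractions, position):
    """Downstream congestion seen just before resource `position`.

    Path congestion is approximated by the sum of the per-resource
    marking fractions, which is reasonable while they are small.
    Downstream congestion is path congestion minus upstream congestion.
    """
    path = sum(marking_fractions)
    upstream = sum(marking_fractions[:position])
    return path - upstream


marks = [0.01, 0.02]  # resource 0 and resource i in Figure 8
for pos in range(len(marks) + 1):
    print(f"before resource {pos}: {downstream_congestion(marks, pos):.2%}")
```

   The three printed values correspond to the 3%, 2% and 0% levels of
   the downstream congestion plot.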

1819	   Source congestion control:  We want to ensure that the sender will
1820	      throttle its rate as downstream congestion increases.  Whatever
1821	      the agreed congestion response (whether TCP-compatible or some
1822	      enhanced QoS), to some extent it will always be against the
1823	      sender's interest to comply.

1825	   Ingress policing:  But it is in all the network operators' interests
1826	      to encourage fair congestion response, so that their investments
1827	      are employed to satisfy the most valuable demand.  The re-ECN
1828	      protocol ensures packets carry the necessary information about
1829	      their own expected downstream congestion so that N1 can deploy a
1830	      policer at its ingress to check that S is complying with whatever
1831	      congestion control it should be using (Section 6.1.5).  If N1 is
1832	      extremely conservative it could police each flow, but it is likely
1833	      to just police the bulk amount of congestion each customer causes
1834	      without regard to flows, or if it is extremely liberal it need not
1835	      police congestion control at all.  In any case, it is always
1836	      preferable to police traffic at the very first ingress into an
1837	      internetwork, before non-compliant traffic can cause any damage.

1839	   Edge egress dropper:  If the policer ensures the source has less
1840	      right to a high rate the higher it declares downstream congestion,
1841	      the source has a clear incentive to understate downstream
1842	      congestion.  But, if flows of packets are understated when they
1843	      enter the internetwork, they will have become negative by the time
1844	      they leave.  So, we introduce a dropper at the last network
1845	      egress, which drops packets in flows that persistently declare
1846	      negative downstream congestion (see Section 6.1.4 for details).

1848	               ..competitive routing
1849	             .'         :      '.
1850	           .'  p e n a l:t i e s '.
1851	          :           | :       \  :
1852	       A  :           | :        | :
1853	       |S <-----N1----> <---N2---> <---N4--> R         domain
1854	       |  :           | :        | :
1855	       |  V           | :        | :
1856	    3% |--------+     | :        | :
1857	       |        |     V V        V V
1858	    2% |        +-----------------------+
1859	       |       downstream congestion    |
1860	    1% |          :                     |
1861	       |          :                     |
1862	    0% +--------------------------------+=====-->
1863	                0                ^      i         resource index
1864	                |               /|\     |
1865	              1.00%              |   2.00%         marking fraction
1866	                                 |
1867	                             sanctions

1869	                 Figure 9: Incentives at Inter-domain Borders

1871	   Inter-domain traffic policing:  But next we must ask, if congestion
1872	      arises downstream (say in N4), what is the ingress network's
1873	      (N1's) incentive to police its customers' response?  If N1 turns a
1874	      blind eye, its own customers benefit while other networks suffer.
1875	      This is why all inter-domain QoS architectures (e.g. Intserv,
1876	      Diffserv) police traffic each time it crosses a trust boundary.
1877	      We have already shown that re-ECN gives a trustworthy measure of
1878	      the expected downstream congestion that a flow will cause by
1879	      subtracting negative volume from positive at any intermediate
1880	      point on a path.  N4 (say) can use this measure to police all the
1881	      responses to congestion of all the sources beyond its upstream
1882	      neighbour (N2), but in bulk with one very simple passive
1883	      mechanism, rather than per flow, as we will now explain using
1884	      Figure 9.

1886	   Emulating policing with inter-domain congestion penalties:  Between
1887	      high-speed networks, we would rather avoid per-flow policing, and
1888	      we would rather avoid holding back traffic while it is policed.
1889	      Instead, once re-ECN has arranged headers to carry downstream
1890	      congestion honestly, N2 can contract to pay N4 penalties in
1891	      proportion to a single bulk count of the congestion metrics
1892	      crossing their mutual trust boundary (Section 6.1.6).  In this
1893	      way, N4 puts pressure on N2 to suppress downstream congestion, for
1894	      every flow passing through the border interface, even though they
1895	      will all start and end in different places, and even though they
1896	      may all be allowed different responses to congestion.  The figure
1897	      depicts this downward pressure on N2 by the solid downward arrow
1898	      at the egress of N2.  Then N2 has an incentive either to police
1899	      the congestion response of its own ingress traffic (from N1) or to
1900	      emulate policing by applying penalties to N1 in turn on the basis
1901	      of congestion counted at their mutual boundary.  In this recursive
1902	      way, the incentives for each flow to respond correctly to
1903	      congestion trace back with each flow precisely to each source,
1904	      despite the mechanism not recognising flows (see Section 6.2.2).

1906	   Inter-domain congestion charging diversity:  Any two networks are
1907	      free to agree any of a range of penalty regimes between themselves
1908	      but they would only provide the right incentives if they were
1909	      within the following reasonable constraints.  N2 should expect to
1910	      have to pay penalties to N4 where penalties monotonically increase
1911	      with the volume of congestion and negative penalties are not
1912	      allowed.  For instance, they may agree an SLA with tiered
1913	      congestion thresholds, where higher penalties apply the higher the
1914	      threshold that is broken.  But the most obvious (and useful) form
1915	      of penalty is where N4 levies a charge on N2 proportional to the
1916	      volume of downstream congestion N2 dumps into N4.  In the
1917	      explanation that follows, we assume this specific variant of
1918	      volume charging between networks - charging proportionate to the
1919	      volume of congestion.

1921	      We must make clear that we are not advocating that everyone should
1922	      use this form of contract.  We are well aware that the IETF tries
1923	      to avoid standardising technology that depends on a particular
1924	      business model.  And we strongly share this desire to encourage
1925	      diversity.  But our aim is merely to show that border policing can
1926	      at least work with this one model, then we can assume that
1927	      operators might experiment with the metric in other models (see
1928	      Section 6.1.6 for examples).  Of course, operators are free to
1929	      complement this usage element of their charges with traditional
1930	      capacity charging, and we expect they will as predicted by
1931	      economics.

1933	   No congestion charging to users:  Bulk congestion penalties at trust
1934	      boundaries are passive and extremely simple, and lose none of
1935	      their per-packet precision from one boundary to the next (unlike
1936	      Diffserv all-address traffic conditioning agreements, which
1937	      dissipate their effectiveness across long topologies).  But at any
1938	      trust boundary, there is no imperative to use congestion charging.

1940	      Traditional traffic policing can be used, if the complexity and
1941	      cost is preferred.  In particular, at the boundary with end
1942	      customers (e.g. between S and N1), traffic policing will most
1943	      likely be more appropriate.  Policer complexity is less of a
1944	      concern at the edge of the network.  And end-customers are known
1945	      to be highly averse to the unpredictability of congestion
1946	      charging.

1948	   NOTE WELL:  This document neither advocates nor requires congestion
1949	      charging for end customers, and it advocates but does not require
1950	      inter-domain congestion charging.

1952	   Competitive discipline of inter-domain traffic engineering:  With
1953	      inter-domain congestion charging, a domain seems to have a
1954	      perverse incentive to fake congestion; N2's profit depends on the
1955	      difference between congestion at its ingress (its revenue) and at
1956	      its egress (its cost).  So, overstating internal congestion seems
1957	      to increase profit.  However, smart border routing [Smart_rtg] by
1958	      N1 will bias its routing towards the least cost routes.  So, N2
1959	      risks losing all its revenue to competitive routes if it
1960	      overstates congestion (see Section 6.2.3).  In other words, if N2
1961	      is the least congested route, its ability to raise excess profits
1962	      is limited by the congestion on the next least congested route.
1963	      This pressure on N2 to remain competitive is represented by the
1964	      dotted downward arrow at the ingress to N2 in Figure 9.

1966	   Closing the loop:  All the above elements conspire to trap everyone
1967	      between two opposing pressures (the downward and upward arrows in
1968	      Figure 8 & Figure 9), ensuring the downstream congestion metric
1969	      arrives at the destination neither above nor below zero.  So, we
1970	      have arrived back where we started in our argument.  The ingress
1971	      edge network can rely on downstream congestion declared in the
1972	      packet headers presented by the sender.  So it can police the
1973	      sender's congestion response accordingly.

1975	   Evolvability of congestion control:  We have seen that re-ECN enables
1976	      policing at the very first ingress.  We have also seen that, as
1977	      flows continue on their path through further networks downstream,
1978	      re-ECN removes the need for further per-domain ingress policing of
1979	      all the different congestion responses allowed to each different
1980	      flow.  This is why the evolvability of re-ECN policing is so
1981	      superior to bottleneck policing or to any policing of different
1982	      QoS for different flows.  Even if all access networks choose to
1983	      conservatively police congestion per flow, each will want to
1984	      compete with the others to allow new responses to congestion for
1985	      new types of application.  With re-ECN, each can introduce new
1986	      controls independently, without coordinating with other networks
1987	      and without having to standardise anything.  But, as we have just
1988	      seen, by making inter-domain penalties proportionate to bulk
1989	      downstream congestion, downstream networks can be agnostic to the
1990	      specific congestion response for each flow, but they can still
1991	      apply more penalty the more liberal the ingress access network has
1992	      been in the response to congestion it allowed for each flow.
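
   The bulk border accounting used in this section can be sketched as
   follows, for the charging variant assumed above (a penalty
   proportional to the volume of downstream congestion).  BorderMeter,
   penalty and margin are hypothetical names, the per-packet worth
   values (+1, 0, -1) stand for positive, neutral and negative
   markings, and the use of one price at both borders is an assumption
   made only to keep the example short.

```python
class BorderMeter:
    """Bulk, flow-agnostic count of congestion crossing one border."""

    def __init__(self):
        self.positive_octets = 0
        self.negative_octets = 0

    def count(self, size, worth):
        """Account one packet of `size` octets with worth +1, 0 or -1."""
        if worth > 0:
            self.positive_octets += size
        elif worth < 0:
            self.negative_octets += size

    def congestion_volume(self):
        """Downstream congestion volume: positive minus negative octets."""
        return self.positive_octets - self.negative_octets


def penalty(meter, price_per_octet):
    """Charge proportional to congestion volume; never negative."""
    return max(0, meter.congestion_volume()) * price_per_octet


def margin(ingress_meter, egress_meter, price_per_octet):
    """A transit network's revenue at its ingress minus cost at its egress."""
    return (penalty(ingress_meter, price_per_octet)
            - penalty(egress_meter, price_per_octet))
```

   Note that the meter keeps two counters per border interface and no
   per-flow state, and that clamping at zero reflects the constraint
   that negative penalties are not allowed.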

1994	6.1.3.1.  The Case against Classic Feedback

1996	   A system that produces an optimal outcome as a result of everyone's
1997	   selfish actions is extremely powerful, especially one that enables
1998	   evolvability of congestion control.  But why do we have to change to
1999	   re-ECN to achieve it?  Can't classic congestion feedback (as used
2000	   already by standard ECN) be arranged to provide similar incentives
2001	   and similar evolvability?  Superficially it can.  Kelly's seminal
2002	   work showed how we can allow everyone the freedom to evolve whatever
2003	   congestion control behaviour is in their application's best interest
2004	   but still optimise the whole system of networks and users by placing
2005	   a price on congestion to ensure responsible use of this
2006	   freedom [Evol_cc].  Kelly used ECN with its classic congestion
2007	   feedback model as the mechanism to convey congestion price
2008	   information.  The mechanism could be thought of as volume charging;
2009	   except only the volume of packets marked with congestion experienced
2010	   (CE) was counted.

2012	   However, below we explain why relying on classic feedback /requires/
2013	   congestion charging to be used, while re-ECN achieves the same
2014	   powerful outcome (given it is built on Kelly's foundations), but does
2015	   not /require/ congestion charging.  In brief, the problem with
2016	   classic feedback is that the incentives have to trace the indirect
2017	   path back to the sender---the long way round the feedback loop.  For
2018	   example, if classic feedback were used in Figure 8, N2 would have had
2019	   to influence N1 via all of N4, R & S rather than directly.

2021	   Inability to agree what is happening downstream:  For a network to
2022	      police its upstream neighbour's congestion response, the two
2023	      neighbours should be able to agree on the congestion to respond to.
2024	      Whatever the feedback regime, as packets change hands at each
2025	      trust boundary, any path metrics they carry are verifiable by both
2026	      neighbours.  But, with a classic path metric, they can only agree
2027	      on the /upstream/ path congestion.

2029	   Inaccessible back-channel:  The network needs a whole-path congestion
2030	      metric if it wants to control the source.  Classically, whole path
2031	      congestion emerges at the destination, to be fed back from
2032	      receiver to sender in a back-channel.  But, in any data network,
2033	      back-channels need not be visible to relays, as they are
2034	      essentially communications between the end-points.  They may be
2035	      encrypted, asymmetrically routed or simply omitted, so no network
2036	      element can reliably intercept them.  The congestion charging
2037	      literature solves this problem by charging the receiver and
2038	      assuming this will cause the receiver to refer the charges to the
2039	      sender.  But, of course, this creates unintended side-effects...

2041	   `Receiver pays' unacceptable:  In connectionless datagram networks,
2042	      receivers and receiving networks cannot prevent reception from
2043	      malicious senders, so `receiver pays' opens them to `denial of
2044	      funds' attacks.

2046	   End-user congestion charging unacceptable:  Even if 'denial of funds'
2047	      were not a problem, we know that end-users are highly averse to
2048	      the unpredictability of congestion charging and anyway, we want to
2049	      avoid restricting network operators to just one retail tariff.
2050	      But with classic feedback only an upstream metric is available, so
2051	      we cannot avoid having to wrap the `receiver pays' money flow
2052	      around the feedback loop, necessarily forcing end-users to be
2053	      subjected to congestion charging.

2055	   To summarise so far, with classic feedback, policing congestion
2056	   response without losing evolvability /requires/ congestion charging
2057	   of end-users and a `receiver pays' model, whereas, with re-ECN, it is
2058	   still possible to influence incentives using congestion charging but
2059	   using the safer `sender pays' model.  However, congestion charging is
2060	   only likely to be appropriate between domains.  So, without losing
2061	   evolvability, re-ECN enables technical policing mechanisms that are
2062	   more appropriate for end users than congestion pricing.

2064	   We now take a second pass over the incentive framework, filling in
2065	   the detail.

2067	6.1.4.  Egress Dropper

2069	   As traffic leaves the last network before the receiver (domain N4 in
2070	   Figure 8), the fraction of positive octets in a flow should match the
2071	   fraction of negative octets introduced by congestion marking, leaving
2072	   a balance of zero.  If it is less (a negative flow), it implies that
2073	   the source is understating path congestion (which will reduce the
2074	   penalties that N2 owes N4).

2076	   If flows are positive, N4 need take no action---this simply means its
2077	   upstream neighbour is paying more penalties than it needs to, and the
2078	   source is going slower than it needs to.  But, to protect itself
2079	   against persistently negative flows, N4 will need to install a
2080	   dropper at its egress.  Appendix E gives a suggested algorithm for
2081	   this dropper.  There is no intention that the dropper algorithm needs
2082	   to be standardised; it is merely provided to show that an efficient,
2083	   robust algorithm is possible.  But whatever algorithm is used must
2084	   meet the criteria below:

2086	   o  It SHOULD introduce minimal false positives for honest flows;

2088	   o  It SHOULD quickly detect and sanction dishonest flows (minimal
2089	      false negatives);

2091	   o  It MUST be invulnerable to state exhaustion attacks from malicious
2092	      sources.  For instance, if the dropper uses flow-state, it should
2093	      not be possible for a source to send numerous packets, each with a
2094	      different flow ID, to force the dropper to exhaust its memory
2095	      capacity;

2097	   o  It MUST introduce sufficient loss in goodput so that malicious
2098	      sources cannot play off losses in the egress dropper against
2099	      higher allowed throughput.  Salvatori [CLoop_pol] describes this
2100	      attack, which involves the source understating path congestion
2101	      then inserting forward error correction (FEC) packets to
2102	      compensate expected losses.

2104	   Note that the dropper operates on flows but we would like it not to
2105	   require per-flow state.  This is why we have been careful to ensure
2106	   that all flows MUST start with a packet marked with the FNE
2107	   codepoint.  If a flow does not start with the FNE codepoint, a
2108	   dropper is likely to treat it unfavourably.  This risk makes it worth
2109	   setting the FNE codepoint at the start of a flow, even though there
2110	   is a cost to the sender of setting FNE (positive `worth').  Indeed,
2111	   with the FNE codepoint, the rate at which a sender can generate new
2112	   flows can be limited (Appendix G).  In this respect, the FNE
2113	   codepoint works like Handley's state set-up bit [Steps_DoS].

2115	   Appendix E also gives an example dropper implementation that
2116	   aggregates flow state.  Dropper algorithms will often maintain a
2117	   moving average across flows of the fraction of RE blanked packets.
2118	   When maintaining an average across flows, a dropper SHOULD only allow
2119	   flows into the average if they start with FNE, but it SHOULD NOT
2120	   include packets with the FNE codepoint set in the average.  A sender
2121	   sets the FNE codepoint when it does not have the benefit of feedback
2122	   from the receiver.  So, counting packets with FNE set would be
2123	   likely to make the average unnecessarily positive, providing headroom
2124	   (or should we say footroom?) for dishonest (negative) traffic.

2126	   If the dropper detects a persistently negative flow, it SHOULD drop
2127	   sufficient negative and neutral packets to force the flow to not be
2128	   negative.  Drops SHOULD be focused on just sufficient packets in
2129	   misbehaving flows to remove the negative bias while doing minimal
2130	   extra harm.
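
   As a rough illustration of the balance test that such a dropper
   relies on (it is not the algorithm of Appendix E), the following
   Python sketch sums the worth of one flow's packets in octets.  The
   names WORTH, flow_balance and is_negative, and the representation of
   a packet as a (codepoint, size) pair, are assumptions made for this
   example only.  The sketch checks a single snapshot; judging whether a
   flow is persistently negative, and doing so without unbounded
   per-flow state, is left to the real algorithm.

```python
# Worth of each extended ECN codepoint as used in this section:
# Re-Echo and FNE are positive, CE(-1) is negative, and neutral (RECT)
# and canceled (CE(0)) packets are worth zero.
WORTH = {"Re-Echo": +1, "FNE": +1, "RECT": 0, "CE(0)": 0, "CE(-1)": -1}


def flow_balance(packets):
    """Return positive octets minus negative octets for one flow.

    `packets` is an iterable of (codepoint, size_in_octets) pairs
    (a hypothetical representation chosen for this sketch).
    """
    return sum(WORTH[codepoint] * size for codepoint, size in packets)


def is_negative(packets):
    """True if the flow's balance is below zero, i.e. the source has
    declared less congestion than the path has marked."""
    return flow_balance(packets) < 0
```

   For an honest flow the balance should hover around zero or slightly
   above it; a flow that never re-echoes congestion marks drifts
   negative by the size of each CE(-1) packet it carries.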

2132	6.1.5.  Policing

2134	   Access operators who wish to limit the congestion that a sender is
2135	   able to cause can deploy policers at the very first ingress to the
2136	   internetwork.  Re-ECN has been designed to avoid the need for
2137	   bottleneck policing so that we can avoid a future where a single rate
2138	   adaptation policy is embedded throughout the network.  Instead, re-
2139	   ECN allows the particular rate adaptation policy to be solely agreed
2140	   bilaterally between the sender and its ingress access provider
2141	   (Section 5.5.2 discusses possible ways to signal between them), which
2142	   allows congestion control to be policed, but maintains its
2143	   evolvability, requiring only a single, local box to be updated.

2145	   Appendix G gives examples of per-user policing algorithms.  But there
2146	   is no implication that these algorithms are to be standardised, or
2147	   that they are ideal.  The ingress rate policer is the part of the re-
2148	   ECN incentive framework that is intended to be the most flexible.
2149	   Once endpoint protocol handlers for re-ECN and egress droppers are in
2150	   place, operators can choose exactly which congestion response they
2151	   want to police, and whether they want to do it per user, per flow or
2152	   not at all.

2154	   The re-ECN protocol allows these ingress policers to easily perform
2155	   bulk per-user policing (Appendix G.1).  This is likely to provide
2156	   sufficient incentive to the user to correctly respond to congestion
2157	   without needing the policing function to be overly complex.  If an
2158	   access operator chose, it could use per-flow policing according to
2159	   the widely adopted TCP rate adaptation (Appendix G.2) or other
2160	   alternatives; however, this would introduce extra complexity to the
2161	   system.

2163	   If a per-flow rate policer is used, it should use path (not
2164	   downstream) congestion as the relevant metric, which is represented
2165	   by the fraction of octets in packets with positive (Re-Echo and FNE)
2166	   and canceled (CE(0)) markings.  Of course, re-ECN provides all the
2167	   information a policer needs directly in the packets being policed.
2168	   So, even policing TCP's AIMD algorithm is relatively straightforward
2169	   (Appendix G.2).

2171	   Note that we have included canceled packets in the measure of path
2172	   congestion.  Canceled packets arise when the sender re-echoes earlier
2173	   congestion, but then this Re-Echo packet just happens to be
2174	   congestion marked itself.  One would not normally expect many
2175	   canceled packets at the first ingress because one would not normally
2176	   expect much congestion marking to have been necessary that soon in
2177	   the path.  However, a home network or campus network may well sit
2178	   between the sending endpoint and the ingress policer, so some
2179	   congestion may occur upstream of the policer.  And if congestion does
2180	   occur upstream, some canceled packets should be visible, and should
2181	   be taken into account in the measure of path congestion.

2183	   But a much more important reason for including canceled packets in
2184	   the measure of path congestion at an ingress policer is that a sender
2185	   might otherwise subvert the protocol by sending canceled packets
2186	   instead of neutral (RECT) packets.  Like neutral, canceled packets
2187	   are worth zero, so the sender knows they won't be counted against any
2188	   quota it might have been allowed.  But unlike neutral packets,
2189	   canceled packets are immune to congestion marking, because they have
2190	   already been congestion marked.  So, it is both correct and useful
2191	   that canceled packets should be included in a policer's measure of
2192	   path congestion, as this removes the incentive the sender would
2193	   otherwise have to mark more packets as canceled than it should.
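
   The path congestion metric discussed above can be sketched in a few
   lines of Python.  This is only an illustration under stated
   assumptions: the function name path_congestion_fraction and the
   (codepoint, size) packet representation are invented for the example,
   and the policing response to the metric (Appendix G.2) is not shown.

```python
# Codepoints a per-flow ingress policer counts towards path congestion:
# positive markings (Re-Echo and FNE) plus canceled packets (CE(0)).
PATH_CONGESTION_CODEPOINTS = {"Re-Echo", "FNE", "CE(0)"}


def path_congestion_fraction(packets):
    """Fraction of octets carried in positive or canceled packets.

    `packets` is an iterable of (codepoint, size_in_octets) pairs
    (a hypothetical representation).  Returns 0.0 if no octets were seen.
    """
    total = 0
    counted = 0
    for codepoint, size in packets:
        total += size
        if codepoint in PATH_CONGESTION_CODEPOINTS:
            counted += size
    return counted / total if total else 0.0
```

   Because CE(0) is in the counted set, a sender gains nothing by
   substituting canceled packets for neutral ones: each substitution
   raises the measured path congestion by the size of that packet.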

2195	   An ingress policer should also ensure that flows are not already
2196	   negative when they enter the access network.  As with canceled
2197	   packets, the presence of negative packets will typically be unusual.
2198	   Therefore it will be easy to detect negative flows at the ingress by
2199	   just detecting negative packets then monitoring the flow they belong
2200	   to.

2202	   Of course, even if the sender does operate its own network, it may
2203	   arrange not to congestion mark traffic.  Whether the sender does this
2204	   or not is of no concern to anyone else except the sender.  Such a
2205	   sender will not be policed against its own network's contribution to
2206	   congestion, but the only resulting problem would be overload in the
2207	   sender's own network.

2209	   Finally, we must not forget that an easy way to circumvent re-ECN's
2210	   defences is for the source to turn off re-ECN support, by setting the
2211	   Not-RECT codepoint, implying legacy traffic.  Therefore an ingress
2212	   policer should put a general rate-limit on Not-RECT traffic, which
2213	   SHOULD be lax during early, patchy deployment, but will have to
2214	   become stricter as deployment widens.  Similarly, flows starting
2215	   without an FNE packet can be confined by a strict rate-limit used for
2216	   the remainder of flows that haven't proved they are well-behaved by
2217	   starting correctly (therefore they need not consume any flow state---
2218	   they are just confined to the `misbehaving' bin if they carry an
2219	   unrecognised flow ID).

2221	6.1.6.  Inter-domain Policing

2223	   One of the main design goals of re-ECN is for border security
2224	   mechanisms to be as simple as possible, otherwise they will become
2225	   the pinch-points that limit scalability of the whole internetwork.
2226	   We want to avoid per-flow processing at borders and to keep to
2227	   passive mechanisms that can monitor traffic in parallel to
2228	   forwarding, rather than having to filter traffic inline---in series
2229	   with forwarding.  Such passive, off-line mechanisms are essential for
2230	   future high-speed all-optical border interconnection where packets
2231	   cannot be buffered while they are checked for policy compliance.

2233	   So far, we have been able to keep the border mechanisms simple,
2234	   despite having had to harden them against some subtle attacks on the
2235	   re-ECN design.  The mechanisms are still passive and avoid per-flow
2236	   processing.

2238	   The basic accounting mechanism at each border interface simply
2239	   involves accumulating the volume of packets with positive worth (Re-
2240	   Echo and FNE), and subtracting the volume of those with negative
2241	   worth: CE(-1).  Even though this mechanism takes no regard of flows,
2242	   over an accounting period (say a month) this subtraction will account
2243	   for the downstream congestion caused by all the flows traversing the
2244	   interface, wherever they come from, and wherever they go to.  The two
2245	   networks can agree to use this metric however they wish to determine
2246	   some congestion-related penalty against the upstream network.
2247	   Although the algorithm could hardly be simpler, it is spelled out
2248	   using pseudo-code in Appendix H.1.
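
   For readers who prefer runnable code, the bulk accounting step can
   also be sketched in Python as below.  It is not a substitute for the
   pseudo-code in Appendix H.1; the class name BorderMeter, its methods
   and the treatment of unrecognised codepoints as worth zero are
   assumptions made for this example.

```python
class BorderMeter:
    """Bulk congestion-volume meter for one border interface.

    It keeps a single counter and no per-flow state.
    """

    # Worth per extended ECN codepoint, as described in this section.
    WORTH = {"Re-Echo": +1, "FNE": +1, "RECT": 0, "CE(0)": 0, "CE(-1)": -1}

    def __init__(self):
        self.balance = 0  # positive octets minus negative octets

    def observe(self, codepoint, size):
        """Account for one packet of `size` octets crossing the border."""
        # Assumption: codepoints not listed above (e.g. legacy traffic)
        # contribute nothing to the balance.
        self.balance += self.WORTH.get(codepoint, 0) * size

    def close_period(self):
        """Return the balance for the accounting period and reset it."""
        total = self.balance
        self.balance = 0
        return total
```

   The value returned by close_period() is the metric that the two
   networks can then map to whatever penalty they have agreed.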

2250	   Various attempts to subvert the re-ECN design have been made.  In all
2251	   cases their root cause is persistently negative flows.  But, after
2252	   describing these attacks we will show that we don't actually have to
2253	   get rid of all persistently negative flows in order to thwart the
2254	   attacks.

2256	   In honest flows, downstream congestion is measured as positive minus
2257	   negative volume.  So if all flows are honest (i.e. not persistently
2258	   negative), adding all positive volume and all negative volume without
2259	   regard to flows will give an aggregate measure of downstream
2260	   congestion.  But such simple aggregation is only possible if no flows
2261	   are persistently negative.  Unless persistently negative flows are
2262	   completely removed, they will reduce the aggregate measure of
2263	   congestion.  The aggregate may still be positive overall, but not as
2264	   positive as it would have been had the negative flows been removed.

2266	   In Section 6.1.4 we discussed how to sanction traffic to remove, or
2267	   at least to identify, persistently negative flows.  But, even if the
2268	   sanction for negative traffic is to discard it, unless it is
2269	   discarded at the exact point it goes negative, it will wrongly
2270	   subtract from aggregate downstream congestion, at least at any
2271	   borders it crosses after it has gone negative but before it is
2272	   discarded.

2274	   We rely on sanctions to deter dishonest understatement of congestion.
2275	   But even the ultimate sanction of discard can only be effective if
2276	   the sender is bothered about the data getting through to its
2277	   destination.  A number of attacks have been identified where a sender
2278	   gains from sending dummy traffic or it can attack someone or
2279	   something using dummy traffic even though it isn't communicating any
2280	   information to anyone:

2282	   o  A host can send traffic with no positive markings towards its
2283	      intended destination, aiming to transmit as much traffic as any
2284	      dropper will allow [Bauer06].  It may add forward error correction
2285	      (FEC) to repair as much drop as it experiences.

2287	   o  A host can send dummy traffic into the network with no positive
2288	      markings and with no intention of communicating with anyone, but
2289	      merely to cause higher levels of congestion for others who do want
2290	      to communicate (DoS).  So, to ride over the extra congestion,
2291	      everyone else has to spend more of whatever rights to cause
2292	      congestion they have been allowed.

2294	   o  A network can simply create its own dummy traffic to congest
2295	      another network, perhaps causing it to lose business at no cost to
2296	      the attacking network.  This is a form of denial of service
2297	      perpetrated by one network on another.  The preferential drop
2298	      measures in Section 5.3 provide crude protection against such
2299	      attacks, but we are not overly worried about more accurate
2300	      prevention measures, because it is already possible for networks
2301	      to DoS other networks on the general Internet, but they generally
2302	      don't because of the grave consequences of being found out.  We
2303	      are only concerned if re-ECN increases the motivation for such an
2304	      attack, as in the next example.

2306	   o  A network can just generate negative traffic and send it over its
2307	      border with a neighbour to reduce the overall penalties that it
2308	      should pay to that neighbour.  It could even initialise the TTL so
2309	      it expired shortly after entering the neighbouring network,
2310	      reducing the chance of detection further downstream.  This attack
2311	      need not be motivated by a desire to deny service and indeed need
2312	      not cause denial of service.  A network's main motivator would
2313	      most likely be to reduce the penalties it pays to a neighbour.
2314	      But, the prospect of financial gain might tempt the network into
2315	      mounting a DoS attack on the other network as well, given the gain
2316	      would offset some of the risk of being detected.

2318	   The first step towards a solution to all these problems with negative
2319	   flows is to be able to estimate the contribution they make to
2320	   downstream congestion at a border and to correct the measure
2321	   accordingly.  Although ideally we want to remove negative flows
2322	   themselves, perhaps surprisingly, the most effective first step is to
2323	   cancel out the polluting effect negative flows have on the measure of
2324	   downstream congestion at a border.  It is more important to get an
2325	   unbiased estimate of their effect, than to try to remove them all.  A
2326	   suggested algorithm to give an unbiased estimate of the contribution
2327	   from negative flows to the downstream congestion measure is given in
2328	   Appendix H.2.

2330	   Although making an accurate assessment of the contribution from
2331	   negative flows may not be easy, just the single step of neutralising
2332	   their polluting effect on congestion metrics removes all the gains
2333	   networks could otherwise make from mounting dummy traffic attacks on
2334	   each other.  This puts all networks on the same side (only with
2335	   respect to negative flows of course), rather than being pitched
2336	   against each other.  The network where this flow goes negative as
2337	   well as all the networks downstream lose out from not being
2338	   reimbursed for any congestion this flow causes.  So they all have an
2339	   interest in getting rid of these negative flows.  Networks forwarding
2340	   a flow before it goes negative aren't strictly on the same side, but
2341	   they are disinterested bystanders---they don't care that the flow
2342	   goes negative downstream, but at least they can't actively gain from
2343	   making it go negative.  The problem becomes localised so that once a
2344	   flow goes negative, all the networks from where it happens and beyond
2345	   downstream each have a small problem, each can detect it has a
2346	   problem and each can get rid of the problem if it chooses to.  But
2347	   negative flows can no longer be used for any new attacks.

2349	   Once an unbiased estimate of the effect of negative flows can be
2350	   made, the problem reduces to detecting and preferably removing flows
2351	   that have gone negative as soon as possible.  But importantly,
2352	   complete eradication of negative flows is no longer critical---best
2353	   endeavours will be sufficient.

2355	   For instance, let us consider the case where a source sends traffic
2356	   with no positive markings at all, hoping to at least get as much
2357	   traffic delivered as network-based droppers will allow.  The flow is
2358	   likely to go at least slightly negative in the first network on the
2359	   path (N1 if we use the example network layout in Figure 9).  If all
2360	   networks use the algorithm in Appendix H.2 to inflate penalties at
2361	   their border with an upstream network, they will remove the effect of
2362	   negative flows.  So, for instance, N2 will not be paying a penalty to
2363	   N1 for this flow.  Further, because the flow contributes no positive
2364	   markings at all, a dropper at the egress will completely remove it.

2366	   The remaining problem is that every network is carrying a flow that
2367	   is causing congestion to others but not being held to account for the
2368	   congestion it is causing.  Whenever the fail-safe border algorithm
2369	   (Section 6.1.7) or the border algorithm to compensate for negative
2370	   flows (Appendix H.2) detects a negative flow, it can instantiate a
2371	   focused dropper for that flow locally.  It may be some time before
2372	   the flow is detected, but the more strongly negative the flow is, the
2373	   more quickly it will be detected by the fail-safe algorithm.  But, in
2374	   the meantime, it will not be distorting border incentives.  Until it
2375	   is detected, if it contributes to drop anywhere, its packets will
2376	   tend to be dropped before others if routers use the preferential drop
2377	   rules in Section 5.3, which discriminate against non-positive
2378	   packets.  All networks below the point where a flow goes negative
2379	   (N1, N2 and N4 in this case) have an incentive to remove this flow,
2380	   but the router where it first goes negative (in N1) can of course
2381	   remove the problem for everyone downstream.

2383	   In the case of DDoS attacks, Section 6.2.1 describes how re-ECN
2384	   mitigates their force.

2386	6.1.7.  Inter-domain Fail-safes

2388	   The mechanisms described so far create incentives for rational
2389	   network operators to behave.  That is, one operator aims to make
2390	   another behave responsibly by applying penalties and expects a
2391	   rational response (i.e. one that trades off costs against benefits).
2392	   It is usually reasonable to assume that other network operators will
2393	   behave rationally (policy routing can avoid those that might not).
2394	   But this approach does not protect against the misconfigurations and
2395	   accidents of other operators.

2397	   Therefore, we propose the following two mechanisms at a network's
2398	   borders to provide "defence in depth".  Both are similar:

2400	   Highly positive flows:  A small sample of positive packets should be
2401	      picked randomly as they cross a border interface.  Then subsequent
2402	      packets matching the same source and destination address and DSCP
2403	      should be monitored.  If the fraction of positive marking is well
2404	      above a threshold (to be determined by operational practice), a
2405	      management alarm SHOULD be raised, and the flow MAY be
2406	      automatically subject to focused drop.

2408	   Persistently negative flows:  A small sample of congestion marked
2409	      (negative) packets should be picked randomly as they cross a
2410	      border interface.  Then subsequent packets matching the same
2411	      source and destination address and DSCP should be monitored.  If
2412	      the balance of positive minus negative markings is persistently
2413	      negative, a management alarm SHOULD be raised, and the flow MAY be
2414	      automatically subject to focused drop.

2416	   Both these mechanisms rely on the fact that highly positive (or
2417	   negative) flows will appear more quickly in the sample if it is
2418	   drawn randomly solely from positive (or negative) packets.
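
   A minimal sketch of the second mechanism is given below, assuming a
   Python-like environment.  The class name, the parameters
   sample_probability, alarm_octets and max_monitored, and the use of a
   fixed octet threshold as a stand-in for "persistently negative" are
   all assumptions for illustration; operational practice would have to
   determine suitable values and a better persistence test.

```python
import random

# Worth per extended ECN codepoint, as described in Section 6.1.6.
WORTH = {"Re-Echo": +1, "FNE": +1, "RECT": 0, "CE(0)": 0, "CE(-1)": -1}


class NegativeFlowFailSafe:
    """Sample negative packets, then watch the flows they belong to."""

    def __init__(self, sample_probability, alarm_octets,
                 max_monitored=1024, rng=None):
        self.sample_probability = sample_probability
        self.alarm_octets = alarm_octets    # hypothetical alarm threshold
        self.max_monitored = max_monitored  # bounds the monitoring state
        self.rng = rng if rng is not None else random.Random()
        self.monitored = {}  # (src, dst, dscp) -> balance in octets

    def observe(self, src, dst, dscp, codepoint, size):
        """Process one packet; return True if an alarm should be raised."""
        key = (src, dst, dscp)
        worth = WORTH.get(codepoint, 0)
        if key in self.monitored:
            # Subsequent packets of a sampled flow update its balance.
            self.monitored[key] += worth * size
            return self.monitored[key] <= -self.alarm_octets
        # Candidates are drawn solely from negative packets.
        if (worth < 0 and len(self.monitored) < self.max_monitored
                and self.rng.random() < self.sample_probability):
            self.monitored[key] = 0
        return False
```

   The mechanism for highly positive flows would have the same shape,
   sampling positive packets and alarming on a high positive fraction
   instead.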

2420	6.1.8.  Simulations

2422	   Simulations of policer and dropper performance done for the multi-bit
2423	   version of re-feedback have been included in section 5 "Dropper
2424	   Performance" of [Re-fb].  Simulations of policer and dropper for the
2425	   re-ECN version described in this document are work in progress.

2427	6.2.  Other Applications

2429	6.2.1.  DDoS Mitigation

2431	   A flooding attack is inherently about congestion of a resource.
2432	   Because re-ECN ensures the sources causing network congestion
2433	   experience the cost of their own actions, it acts as a first line of
2434	   defence against DDoS.  As load focuses on a victim, upstream queues
2435	   grow, requiring honest sources to pre-load packets with a higher
2436	   fraction of positive packets.  Once downstream routers are so
2437	   congested that they are dropping traffic, they will be CE marking
2438	   100% of the traffic they do forward.  Honest sources will therefore
2439	   be sending 100% Re-Echo (and therefore being severely rate-limited at
2440	   the ingress).

2442	   Senders under malicious control can either do the same as honest
2443	   sources, and be rate-limited at ingress, or they can understate
2444	   congestion by sending more neutral RECT packets than they should.  If
2445	   sources understate congestion (i.e. do not re-echo sufficient
2446	   positive packets) and the preferential drop ranking is implemented on
2447	   routers (Section 5.3), these routers will preserve positive traffic
2448	   until last.  So, the neutral traffic from malicious sources will all
2449	   be automatically dropped first.  Either way, the malicious sources
2450	   cannot send more than honest sources.

2452	   Further, hosts under malicious control will tend to be re-used for
2453	   many different attacks.  They will therefore build up a long term
2454	   history of causing congestion.  Therefore, as long as the population
2455	   of potentially compromisable hosts around the Internet is limited,
2456	   the per-user policing algorithms in Appendix G.1 will gradually
2457	   throttle down zombies and other launchpads for attacks.  Therefore,
2458	   widespread deployment of re-ECN could considerably dampen the force
2459	   of DDoS.  Certainly, zombie armies could hold their fire for long
2460	   enough to be able to build up enough credit in the per-user policers
2461	   to launch an attack.  But they would then still be limited to no more
2462	   throughput than other, honest users.

2464	   Inter-domain traffic policing (see Section 6.1.6) ensures that any
2465	   network that harbours compromised `zombie' hosts will have to bear
2466	   the cost of the congestion caused by traffic from zombies in
2467	   downstream networks.  Such networks will be incentivised to deploy
2468	   per-user policers that rate-limit hosts that are unresponsive to
2469	   congestion so they can only send very slowly into congested paths.
2470	   As well as protecting other networks, the extremely poor performance
2471	   at any sign of congestion will incentivise the zombie's owner to
2472	   clean it up.  However, the host should behave normally when using
2473	   uncongested paths.

2475	   Uniquely, re-ECN handles DDoS traffic without relying on the validity
2476	   of identifiers in packets.  Certainly the egress dropper relies on
2477	   uniqueness of flow identifiers, but not their validity.  So if a
2478	   source spoofs another address, re-ECN works just as well, as long as
2479	   the attacker cannot imitate all the flow identifiers of another
2480	   active flow passing through the same dropper (see Section 6.3).
2481	   Similarly, the ingress policer relies on uniqueness of flow IDs, not
2482	   their validity.  This is because a new flow will only be allowed any rate at
2483	   all if it starts with FNE, and the more FNE packets there are
2484	   starting new flows, the more they will be limited.  Essentially a re-
2485	   ECN policer limits the bulk of all congestion entering the network
2486	   through a physical interface; limiting the congestion caused by each
2487	   flow is merely an optional extra.

2489	6.2.2.  End-to-end QoS

2491	   {ToDo: (Section 3.3.2 of [Re-fb] entitled `Edge QoS' gives an outline
2492	   of the text that will be added here).}

2494	6.2.3.  Traffic Engineering

2496	   {ToDo: }

2498	6.2.4.  Inter-Provider Service Monitoring

2500	   {ToDo: }

2502	6.3.  Limitations

2504	   The known limitations of the re-ECN approach are:

2506	   o  We still cannot defend against the attack described in Section 10
2507	      where a malicious source sends negative traffic through the same
2508	      egress dropper as another flow and imitates its flow identifiers,
2509	      allowing a malicious source to cause an innocent flow to
2510	      experience heavy drop.

2512	   o  Re-feedback for TTL (re-TTL) would also be desirable at the same
2513	      time as re-ECN.  Unfortunately this requires a further standards
2514	      action for the mechanisms briefly described in Appendix F.

2516	   o  Traffic must be ECN-capable for re-ECN to be effective.  The only
2517	      defence against malicious users who turn off ECN capability is that
2518	      networks are expected to rate limit Not-ECT traffic and to apply
2519	      higher drop preference to it during congestion.  Although these
2520	      are blunt instruments, they at least represent a feasible scenario
2521	      for the future Internet where Not-ECT traffic co-exists with re-
2522	      ECN traffic, but as a severely hobbled under-class.  We recommend
2523	      (Section 7.1) that while accommodating a smooth initial transition
2524	      to re-ECN, policing policies should gradually be tightened to rate
2525	      limit Not-ECT traffic more strictly in the longer term.

2527	   o  When checking whether a flow is balancing positive markings with
2528	      congestion marking, re-ECN can only account for congestion
2529	      marking, not drops.  So, whenever a sender experiences drop, it
2530	      does not have to re-echo the congestion event.  Nonetheless, it is
2531	      hardly any advantage to be able to send faster than other flows
2532	      only if your traffic is dropped and the other traffic isn't.

2534	   o  We are considering the issue of whether it would be useful to
2535	      truncate rather than drop packets that appear to be malicious, so
2536	      that the feedback loop is not broken but useful data can be
2537	      removed.

2539	7.  Incremental Deployment

2541	7.1.  Incremental Deployment Features

2543	   The design of the re-ECN protocol started from the fact that the
2544	   current ECN marking behaviour of routers was sufficient and that re-
2545	   feedback could be introduced around these routers by changing the
2546	   sender behaviour but not the routers.  Otherwise, if we had required
2547	   routers to be changed, the chance of encountering a path that had
2548	   every router upgraded would be vanishingly small during early
2549	   deployment, giving no incentive to start deployment.  Also, as there
2550	   is no new forwarding behaviour, routers and hosts do not have to
2551	   signal or negotiate anything.

2553	   However, networks that choose to protect themselves using re-ECN do
2554	   have to add new security functions at their trust boundaries with
2555	   others.  They distinguish legacy traffic by its ECN field.  Traffic
2556	   from Not-ECT transports is distinguishable by its Not-RECT marking.
2557	   Traffic from legacy ECN transports is distinguished from re-ECN by
2558	   which of ECT(0) or ECT(1) is used.  We chose to use ECT(1) for re-ECN
2559	   traffic deliberately.  Existing ECN sources set ECT(0) on either 50%
2560	   (the nonce) or 100% (the default) of packets, whereas re-ECN does not
2561	   use ECT(0) at all.  We can use this distinguishing feature of legacy
2562	   ECN traffic to separate it out for different treatment at the various
2563	   border security functions: egress dropping, ingress policing and
2564	   border policing.
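
   The separation of legacy traffic described above can be sketched as
   a per-packet classifier.  The function name classify and its return
   labels are invented for this example.  It assumes that FNE is
   signalled as Not-ECT with the RE flag set and Not-RECT as Not-ECT
   with the RE flag clear, which should be checked against the
   codepoint definitions earlier in this document.  It also treats CE
   packets as undetermined, because a CE mark overwrites whichever of
   ECT(0) or ECT(1) the sender originally set.

```python
def classify(ecn_field, re_flag):
    """Classify one packet for the border security functions.

    ecn_field: "Not-ECT", "ECT(0)", "ECT(1)" or "CE" (RFC 3168 names).
    re_flag:   True if the RE flag is set.
    Returns "re-ecn", "legacy-ecn", "legacy-not-ect" or "undetermined".
    """
    if ecn_field == "ECT(0)":
        return "legacy-ecn"  # re-ECN does not use ECT(0) at all
    if ecn_field == "ECT(1)":
        # Per-packet view only: a legacy flow using the ECN nonce also
        # sends ECT(1) on about half its packets, so a real classifier
        # would need to look at more than one packet.
        return "re-ecn"
    if ecn_field == "Not-ECT":
        # Assumption: RE set means FNE, RE clear means Not-RECT.
        return "re-ecn" if re_flag else "legacy-not-ect"
    if ecn_field == "CE":
        return "undetermined"  # the original ECT codepoint is lost
    raise ValueError("unknown ECN field: %r" % (ecn_field,))
```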

2566	   The general principle we adopt is that an egress dropper will not
2567	   drop any legacy traffic, but ingress and border policers will limit
2568	   the bulk rate of legacy traffic that can enter each network.  Then,
2569	   during early re-ECN deployment, operators can set very permissive (or
2570	   non-existent) rate-limits on legacy traffic, but once re-ECN
2571	   implementations are generally available, legacy traffic can be rate-
2572	   limited increasingly harshly.  Ultimately, an operator might choose
2573	   to block all legacy traffic entering its network, or at least only
2574	   allow through a trickle.

2576	   Then, the more strictly the limits are set, the more legacy ECN
2577	   sources will gain by upgrading to re-ECN.  Thus, towards the end of
2578	   the voluntary incremental deployment period, legacy transports can be
2579	   given progressively stronger encouragement to upgrade.

2581	   The following list of minor changes brings together all the points
2582	   where Re-ECN semantics for use of the two-bit ECN field are different
2583	   compared to RFC3168:

2585	   o  A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender
2586	      sets ECT(0) by default (Section 3.3);

2588	   o  No provision is necessary for a re-ECN capable source transport to
2589	      use the ECN nonce (Section 4.1.2.1);

2591	   o  Routers MAY preferentially drop different extended ECN codepoints
2592	      (Section 5.3);

2594	   o  Packets carrying the feedback not established (FNE) codepoint MAY
2595	      be marked rather than dropped by routers, even though
2596	      their ECN field is Not-ECT (with the important caveat in
2597	      Section 5.3);

2599	   o  Packets may be dropped by policing nodes because of apparent
2600	      misbehaviour, not just because of congestion (Section 6);

2602	   o  Tunnel entry behaviour is still to be defined, but may have to be
2603	      different from RFC3168 (Section 5.6).

2605	   None of these changes require any modifications to routers.  Also,
2606	   none of these changes affect anything about end-to-end congestion
2607	   control; they are all to do with allowing networks to police that
2608	   end-to-end congestion control is well-behaved.

2610	7.2.  Incremental Deployment Incentives

2612	   It would only be worth standardising the re-ECN protocol if there
2613	   existed a coherent story for how it might be incrementally deployed.
2614	   In order for it to have a chance of deployment, everyone who needs to
2615	   act must have a strong incentive to act, and the incentives must
2616	   arise in the order that deployment would have to happen.  Re-ECN
2617	   works around unmodified ECN routers, but we can't just discuss why
2618	   and how re-ECN deployment might build on ECN deployment, because
2619	   there is precious little to build on in the first place.  Instead, we
2620	   aim to show that re-ECN deployment could carry ECN with it.  We focus
2621	   on commercial deployment incentives, although some of the arguments
2622	   apply equally to academic or government sectors.

2624	   ECN deployment:

2626	      ECN is largely implemented in commercial routers, but generally
2627	      not as a supported feature, and it has largely not been deployed
2628	      by commercial network operators.  It has been released in many
2629	      Unix-based operating systems, but not in proprietary OSs like
2630	      Windows or those in many mobile devices.  For detailed deployment
2631	      status, see [ECN-Deploy].  We believe the reason ECN deployment
2632	      has not happened is twofold:

2634	      *  ECN requires changes to both routers and hosts.  If someone
2635	         wanted to sell the improvement that ECN offers, they would have
2636	         to co-ordinate deployment of their product with others.  An ECN
2637	         server only gives any improvement on an ECN network.  An ECN
2638	         network only gives any improvement if used by ECN devices.
2639	         Deployment that requires co-ordination adds cost and delay and
2640	         tends to dilute any competitive advantage that might be gained.

2642	      *  ECN `only' gives a performance improvement.  Making a product a
2643	         bit faster (whether the product is a device or a network),
2644	         isn't usually a sufficient selling point to be worth the cost
2645	         of co-ordinating across the industry to deploy it.  Network
2646	         operators tend to avoid re-configuring a working network unless
2647	         launching a new product.

2649	   ECN and re-ECN for Edge-to-edge Assured QoS:

2651	      We believe the proposal to provide assured QoS sessions using a
2652	      form of ECN called pre-congestion notification (PCN) [PCN-arch] is
2653	      most likely to break the deadlock in ECN deployment first.  It
2654	      only requires edge-to-edge deployment so it does not require
2655	      endpoint support.  It can be deployed in a single network, then
2656	      grow incrementally to interconnected networks.  And it provides a
2657	      different `product' (internetworked assured QoS), rather than
2658	      merely making an existing product a bit faster.

2660	      Not only could this assured QoS application kick-start ECN
2661	      deployment, it could also carry re-ECN deployment with it; because
2662	      re-ECN can enable the assured QoS region to expand to a large
2663	      internetwork where neighbouring networks do not trust each other.
2664	      [Re-PCN] argues that re-ECN security should be built in to the QoS
2665	      system from the start, explaining why and how.

2667	      If ECN and re-ECN were deployed edge-to-edge for assured QoS,
2668	      operators would gain valuable experience.  They would also clear
2669	      away many technical obstacles such as firewall configurations that
2670	      block all but the legacy settings of the ECN field and the RE
2671	      flag.

2673	   ECN in Access Networks:

2675	      The next obstacle to ECN deployment would be extension to access
2676	      and backhaul networks, where considerable link layer differences
2677	      make implementation non-trivial, particularly on congested
2678	      wireless links.  ECN and re-ECN work fine during partial
2679	      deployment, but they will not be very useful if the most congested
2680	      elements in networks are the last to support them.  Access network
2681	      support is one of the weakest parts of this deployment story.  All
2682	      we can hope is that, once the benefits of ECN are better
2683	      understood by operators, they will push for the necessary link
2684	      layer implementations as deployment proceeds.

2686	   Policing Unresponsive Flows:

2688	      Re-ECN allows a network to offer differentiated quality of service
2689	      as explained in Section 6.2.2.  But we do not believe this will
2690	      motivate initial deployment of re-ECN, because the industry is
2691	      already set on alternative ways of doing QoS.  Despite being much
2692	      more complicated and expensive, the alternative approaches are
2693	      here and now.

2695	      But re-ECN is critical to QoS deployment in another respect.  It
2696	      can be used to prevent applications from taking whatever bandwidth
2697	      they choose without asking.

2699	      Currently, applications that remain resolute in their lack of
2700	      response to congestion are rewarded by other TCP applications.  In
2701	      other words, TCP is naively friendly, in that it reduces its rate
2702	      in response to congestion whether it is competing with friends
2703	      (other TCPs) or with enemies (unresponsive applications).

2705	      Therefore, those network owners that want to sell QoS will be keen
2706	      to ensure that their users can't help themselves to QoS for free.
2707	      Given the very large revenues at stake, we believe effective
2708	      policing of congestion response will become highly sought after by
2709	      network owners.

2711	      But this does not necessarily argue for re-ECN deployment.
2712	      Network owners might choose to deploy bottleneck policers rather
2713	      than re-ECN-based policing.  However, under Related Work
2714	      (Section 9) we argue that bottleneck policers are inherently
2715	      vulnerable to circumvention.

2717	      Therefore we believe there will be a strong demand from network
2718	      owners for re-ECN deployment so they can police flows that do not
2719	      ask to be unresponsive to congestion, in order to protect their
2720	      revenues from flows that do ask (QoS).  In particular, we suspect
2721	      that the operators of cellular networks will want to prevent VoIP
2722	      and video applications being used freely on their networks as a
2723	      more open market develops in GPRS and 3G devices.

2725	      Initial deployments are likely to be isolated to single cellular
2726	      networks.  Cellular operators would first place requirements on
2727	      device manufacturers to include re-ECN in the standards for mobile
2728	      devices.  In parallel, they would put out tenders for ingress and
2729	      egress policers.  Then, after a while they would start to tighten
2730	      rate limits on Not-ECT traffic from non-standard devices and they
2731	      would start policing whatever non-accredited applications people
2732	      might install on mobile devices with re-ECN support in the
2733	      operating system.  This would force even independent mobile device
2734	      manufacturers to provide re-ECN support.  Early standardisation
2735	      across the cellular operators is likely, including interconnection
2736	      agreements with penalties for excess downstream congestion.

2738	      We suspect some fixed broadband networks (whether cable or DSL)
2739	      would follow a similar path.  However, we also believe that larger
2740	      parts of the fixed Internet would not choose to police on a per-
2741	      flow basis.  Some might choose to police congestion on a per-user
2742	      basis in order to manage heavy peer-to-peer file-sharing, but it
2743	      seems likely that a sizeable majority would not deploy any form of
2744	      policing.

2746	      This hybrid situation raises the question, "How does re-ECN work for
2747	      networks that choose to use policing if they connect with others
2748	      that don't?"  Traffic from non-ECN capable sources will arrive
2749	      from other networks and cause congestion within the policed, ECN-
2750	      capable networks.  So networks that chose to police congestion
2751	      would rate-limit Not-ECT traffic throughout their network,
2752	      particularly at their borders.  They would probably also set
2753	      higher usage prices in their interconnection contracts for
2754	      incoming Not-ECT and Not-RECT traffic.  We assume that
2755	      interconnection contracts between networks in the same tier will
2756	      include congestion penalties before contracts with provider
2757	      backbones do.

2759	      A hybrid situation could remain for all time.  As was explained in
2760	      the introduction, we believe in healthy competition between
2761	      policing and not policing, with no imperative to convert the whole
2762	      world to the religion of policing.  Networks that chose not to
2763	      deploy egress droppers would leave themselves open to being
2764	      congested by senders in other networks.  But that would be their
2765	      choice.

2767	      The important aspect of the egress dropper though is that it most
2768	      protects the network that deploys it.  If a network does not
2769	      deploy an egress dropper, sources sending into it from other
2770	      networks will be able to understate the congestion they are
2771	      causing.  Whereas, if a network deploys an egress dropper, it can
2772	      know how much congestion other networks are dumping into it, and
2773	      apply penalties or charges accordingly.  So, whether or not a
2774	      network polices its own sources at ingress, it is in its interests
2775	      to deploy an egress dropper.

2777	   Host support:

2779	      In the above deployment scenario, host operating system support
2780	      for re-ECN came about through the cellular operators demanding it
2781	      in device standards (i.e. 3GPP).  Of course, increasingly, mobile
2782	      devices are being built to support multiple wireless technologies.
2783	      So, if re-ECN were stipulated for cellular devices, it would
2784	      automatically appear in those devices connected to the wireless
2785	      fringes of fixed networks if they coupled cellular with WiFi or
2786	      Bluetooth technology, for instance.  Also, once implemented in the
2787	      operating system of one mobile device, it would tend to be found
2788	      in other devices using the same family of operating system.

2790	      Therefore, whether or not a fixed network deployed ECN, or
2791	      deployed re-ECN policers and droppers, many of its hosts might
2792	      well be using re-ECN over it.  Indeed, they would be at an
2793	      advantage when communicating with hosts across re-ECN-policed
2794	      networks that rate limited Not-RECT traffic.

2796	   Other possible scenarios:

2798	      The above is thankfully not the only plausible scenario we can
2799	      think of.  One of the many clubs of operators that meet regularly
2800	      around the world might decide to act together to persuade a major
2801	      operating system manufacturer to implement re-ECN.  And they may
2802	      agree between them on an interconnection model that includes
2803	      congestion penalties.

2805	      Re-ECN provides an interesting opportunity for device
2806	      manufacturers as well as network operators.  Policers can be
2807	      configured loosely when first deployed.  Then as re-ECN take-up
2808	      increases, they can be tightened up, so that a network with re-ECN
2809	      deployed can gradually squeeze down the service provided to legacy
2810	      devices that have not upgraded to re-ECN.  Many device vendors
2811	      rely on replacement sales.  And operating system companies rely
2812	      heavily on new release sales.  Also support services would like to
2813	      be able to force stragglers to upgrade.  So, the ability to
2814	      throttle service to legacy operating systems is quite valuable.

2816	      Also, policing unresponsive sources may not be the only or even
2817	      the first application that drives deployment.  It may be policing
2818	      causes of heavy congestion (e.g. peer-to-peer file-sharing).  Or
2819	      it may be mitigation of denial of service.  Or we may be wrong in
2820	      thinking simpler QoS will not be the initial motivation for re-ECN
2821	      deployment.  Indeed, the combined pressure for all these may be
2822	      the motivator, but it seems optimistic to expect such a level of
2823	      joined-up thinking from today's communications industry.  We
2824	      believe a single application alone must be a sufficient motivator.

2826	      In short, everyone gains from adding accountability to TCP/IP,
2827	      except the selfish or malicious.  So, deployment incentives tend
2828	      to be strong.

2830	8.  Architectural Rationale

2832	   In the Internet's technical community, the danger of not responding
2833	   to congestion is well-understood, as well as its attendant risk of
2834	   congestion collapse [RFC3714].  However, one side of the Internet's
2835	   commercial community considers that the very essence of IP is to
2836	   provide open access to the internetwork for all applications.  They
2837	   see congestion as a symptom of over-conservative investment, and rely
2838	   on revising application designs to find novel ways to keep
2839	   applications working despite congestion.  They argue that the
2840	   Internet was never intended to be solely for TCP-friendly
2841	   applications.  Meanwhile, another side of the Internet's commercial
2842	   community believes that it is worthwhile providing a network for
2843	   novel applications only if it has sufficient capacity, which can
2844	   happen only if a greater share of application revenues can be
2845	   /assured/ for the infrastructure provider.  Otherwise the major
2846	   investments required would carry too much risk and wouldn't happen.

2848	   The lesson articulated in [Tussle] is that we shouldn't embed our
2849	   view on these arguments into the Internet at design time.  Instead we
2850	   should design the Internet so that the outcome of these arguments can
2851	   get decided at run-time.  Re-ECN is designed in that spirit.  Once
2852	   the protocol is available, different network operators can choose how
2853	   liberal they want to be in holding people accountable for the
2854	   congestion they cause.  Some might boldly invest in capacity and not
2855	   police its use at all, hoping that novel applications will result.
2856	   Others might use re-ECN for fine-grained flow policing, expecting to
2857	   make money selling vertically integrated services.  Yet others might
2858	   sit somewhere half-way, perhaps doing coarse, per-user policing.  All
2859	   might change their minds later.  But re-ECN always allows them to
2860	   interconnect so that the careful ones can protect themselves from the
2861	   liberal ones.

2863	   The incentive-based approach used for re-ECN is based on Gibbens and
2864	   Kelly's arguments [Evol_cc] on allowing endpoints the freedom to
2865	   evolve new congestion control algorithms for new applications.  They
2866	   ensured responsible behaviour despite everyone's self-interest by
2867	   applying pricing to ECN marking, and Kelly had proved stability and
2868	   optimality in an earlier paper.

2870	   Re-ECN keeps all the underlying economic incentives, but rearranges
2871	   the feedback.  The idea is to allow a network operator (if it
2872	   chooses) to deploy engineering mechanisms like policers at the front
2873	   of the network which can be designed to behave /as if/ they are
2874	   responding to congestion prices.  Rather than having to subject users
2875	   to congestion pricing, networks can then use more traditional
2876	   charging regimes (or novel ones).  But the engineering can constrain
2877	   the overall amount of congestion a user can cause.  This provides a
2878	   buffer against completely outrageous congestion control, but still
2879	   makes it easy for novel applications to evolve if they need different
2880	   congestion control to the norms.  It also allows novel charging
2881	   regimes to evolve.

2883	   Despite being achieved with a relatively minor protocol change, re-
2884	   ECN is an architectural change.  Previously, Internet congestion
2885	   could only be controlled by the data sender, because it was the only
2886	   one both in a position to control the load and in a position to see
2887	   information on congestion.  Re-ECN levels the playing field.  It
2888	   recognises that the network also has a role to play in moderating
2889	   (policing) congestion control.  But policing is only truly effective
2890	   at the first ingress into an internetwork, whereas path congestion
2891	   was previously only visible at the last egress.  So, re-ECN
2892	   democratises congestion information.  Then the choice over who
2893	   actually controls congestion can be made at run-time, not design
2894	   time---a bit like an aircraft with dual controls.  And different
2895	   operators can make different choices.  We believe non-architectural
2896	   approaches to this problem are unlikely to offer more than partial
2897	   solutions (see Section 9).

2899	   Importantly, re-ECN does not require assumptions about specific
2900	   congestion responses to be embedded in any network elements, except
2901	   at the first ingress to the internetwork if that level of control is
2902	   desired by the ingress operator.  But such tight policing will be a
2903	   matter of agreement between the source and its access network
2904	   operator.  The ingress operator need not police congestion response
2905	   at flow granularity; it can simply hold a source responsible for the
2906	   aggregate congestion it causes, perhaps keeping it within a monthly
2907	   congestion quota.  Or if the ingress network trusts the source, it
2908	   can do nothing.
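
   As a rough illustration of the aggregate option described above, the
   following Python sketch holds a source to a congestion quota rather
   than policing each flow.  It is a sketch under stated assumptions,
   not part of the protocol: the class name, the quota unit (bytes in
   packets that declare congestion) and the sanction of dropping all
   traffic once the quota is spent are hypothetical choices made only
   for this example.

```python
class CongestionQuotaPolicer:
    """Hypothetical per-source policer that meters declared congestion.

    Only packets that declare congestion consume the quota, so traffic
    sent over uncongested paths is not counted against the source.
    """

    def __init__(self, monthly_quota_bytes):
        self.quota = monthly_quota_bytes  # assumed unit: bytes per month
        self.used = 0

    def admit(self, size_bytes, declares_congestion):
        """Return True to forward the packet, False to drop it."""
        if declares_congestion:
            self.used += size_bytes
        # Assumed sanction: once over quota, drop everything until reset.
        return self.used <= self.quota

    def new_month(self):
        """Reset the meter at the start of a new accounting period."""
        self.used = 0
```

   Because the meter ignores packets that declare no congestion, a
   source can send any volume over an uncongested path without
   touching its quota, which is the distinction from a volume cap.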

2910	   Therefore, the aim of the re-ECN protocol is NOT solely to police
2911	   TCP-friendliness.  Re-ECN preserves IP as a generic network layer for
2912	   all sorts of responses to congestion, for all sorts of transports.
2913	   Re-ECN merely ensures truthful downstream congestion information is
2914	   available in the network layer for all sorts of accountability
2915	   applications.

2917	   The end to end design principle does not say that all functions
2918	   should be moved out of the lower layers---only those functions that
2919	   are not generic to all higher layers.  Re-ECN adds a function to the
2920	   network layer that is generic, but was omitted: accountability for
2921	   causing congestion.  Accountability is not something that an end-user
2922	   can provide to themselves.  We believe re-ECN adds no more than is
2923	   sufficient to hold each flow accountable, even if it consists of a
2924	   single datagram.

2926	   "Accountability" implies being able to identify who is responsible
2927	   for causing congestion.  However, at the network layer it would NOT
2928	   be useful to identify the cause of congestion by adding individual or
2929	   organisational identity information, NOR by using source IP
2930	   addresses.  Rather than bringing identity information to the point of
2931	   congestion, we bring downstream congestion information to the point
2932	   where the cause can be most easily identified and dealt with.  That
2933	   is, at any trust boundary congestion can be associated with the
2934	   physically connected upstream neighbour that is directly responsible
2935	   for causing it (whether intentionally or not).  A trust boundary
2936	   interface is exactly the place to police or throttle in order to
2937	   directly mitigate congestion, rather than having to trace the
2938	   (ir)responsible party in order to shut them down.

2940	   Some considered that ECN itself was a layering violation.  The
2941	   reasoning went that the interface to a layer should provide a service
2942	   to the higher layer and hide how the lower layer does it.  However,
2943	   ECN reveals the state of the network layer and below to the transport
2944	   layer.  A more positive way to describe ECN is that it is like the
2945	   return value of a function call to the network layer.  It explicitly
2946	   returns the status of the request to deliver a packet, by returning a
2947	   value representing the current risk that a packet will not be served.
2948	   Re-ECN has similar semantics, except the transport layer must try to
2949	   guess the return value, then it can use the actual return value from
2950	   the network layer to modify the next guess.
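
   The guess-then-correct loop can be sketched in a few lines of
   Python.  This is an illustrative model only, not a specification of
   sender behaviour: the class and method names are hypothetical, and
   it assumes the simplest possible rule, in which the sender marks one
   future packet as positive for every congestion mark the receiver
   reports.

```python
class ReEchoSender:
    """Hypothetical sender that re-echoes reported congestion marks."""

    def __init__(self):
        # Congestion marks fed back by the receiver but not yet declared.
        self.owed = 0

    def on_feedback(self, marks_reported):
        """The 'actual return value': marks seen on earlier packets."""
        self.owed += marks_reported

    def next_packet_worth(self):
        """The 'next guess': +1 (positive) if a mark is owed, else 0."""
        if self.owed > 0:
            self.owed -= 1
            return 1
        return 0
```

   Under this rule the number of positive packets sent eventually
   equals the number of marks reported, so the declaration carried in
   the forward direction tracks the congestion most recently observed
   on the path.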

2952	   The guiding principle behind all the discussion in Section 6.1.6 on
2953	   Policing is that any gain from subverting the protocol should be
2954	   precisely neutralised, rather than punished.  If a gain is punished
2955	   to a greater extent than is sufficient to neutralise it, it will most
2956	   likely open up a new vulnerability, where the amplifying effect of
2957	   the punishment mechanism can be turned on others.

2959	   For instance, if possible, flows should be removed as soon as they go
2960	   negative, but we do NOT RECOMMEND any attempts to discard such flows
2961	   further upstream while they are still positive.  Such over-zealous
2962	   push-back is unnecessary and potentially dangerous.  These flows have
2963	   paid their `fare' up to the point they go negative, so there is no
2964	   harm in delivering them that far.  If someone downstream asks for a
2965	   flow to be dropped as near to the source as possible, because they
2966	   say it is going to become negative later, an upstream node cannot
2967	   test the truth of this assertion.  Rather than have to authenticate
2968	   such messages, re-ECN has been designed so that flows can be dropped
2969	   solely based on locally measurable evidence.  A message hinting that
2970	   a flow should be watched closely to test for negativity is fine.  But
2971	   not a message that claims that a positive flow will go negative
2972	   later, so it should be dropped.
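
   To make the point about locally measurable evidence concrete, here
   is a minimal Python sketch of a per-flow test that a dropper could
   run on its own observations.  It assumes each packet's worth has
   already been decoded to +1, 0 or -1, and it uses a plain running sum
   with a threshold of zero; the class name and that threshold are
   hypothetical, and a practical dropper would need smoothing and
   bounded state, which this sketch omits.

```python
from collections import defaultdict


class NegativeFlowDropper:
    """Hypothetical dropper that acts only on what it has itself seen."""

    def __init__(self):
        # Running sum of packet worth per flow, starting at zero.
        self.balance = defaultdict(int)

    def forward(self, flow_id, worth):
        """Return True to forward the packet, False to drop it.

        A flow is sanctioned only once its own balance is negative,
        never on a claim that it will become negative later.
        """
        if worth not in (-1, 0, 1):
            raise ValueError("worth must be -1, 0 or +1")
        self.balance[flow_id] += worth
        return self.balance[flow_id] >= 0
```

   The decision uses no input other than the packets of the flow in
   question, so there is no message to authenticate and no way for a
   third party to trigger a drop merely by making an assertion.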

2974	9.  Related Work

2976	   {Due to lack of time, this section is incomplete.  The reader is
2977	   referred to the Related Work section of [Re-fb] for a brief selection
2978	   of related ideas.}

2980	9.1.  Policing Rate Response to Congestion

2982	   ATM network elements send congestion back-pressure
2983	   messages [ITU-T.I.371] along each connection, duplicating any end to
2984	   end feedback because they don't trust it.  On the other hand, re-ECN
2985	   ensures information in forwarded packets can be used for congestion
2986	   management without requiring a connection-oriented architecture, while
2987	   re-using the overhead of fields that are already set aside for end to
2988	   end congestion control (and routing loop detection in the case of re-
2989	   TTL in Appendix F).

2991	   We borrowed ideas from policers in the literature [pBox], [XCHOKe],
2992	   AFD etc. for our rate equation policer.  However, without the benefit
2993	   of re-ECN they don't police the correct rate for the condition of
2994	   their path.  They detect unusually high /absolute/ rates, but only
2995	   while the policer itself is congested, because they work by detecting
2996	   prevalent flows in the discards from the local RED queue.  These
2997	   policers must sit at every potential bottleneck, whereas our policer
2998	   need only be located at each ingress to the internetwork.  As Floyd &
2999	   Fall explain [pBox], the limitation of their approach is that a high
3000	   sending rate might be perfectly legitimate, if the rest of the path
3001	   is uncongested or the round trip time is short.  Commercially
3002	   available rate policers cap the rate of any one flow.  Or they
3003	   enforce monthly volume caps in an attempt to control high volume
3004	   file-sharing.  They limit the value a customer derives.  They might
3005	   also limit the congestion customers can cause, but only as an
3006	   accidental side-effect.  They actually punish traffic that fills
3007	   troughs as much as traffic that causes peaks in utilisation.  In
3008	   practice network operators need to be able to allocate service by
3009	   cost during congestion, and by value at other times.

3011	9.2.  Congestion Notification Integrity

3013	   The choice of two ECT code-points in the ECN field [RFC3168]
3014	   permitted future flexibility, optionally allowing the sender to
3015	   encode the experimental ECN nonce [RFC3540] in the packet stream.
3016	   This mechanism has since been included in the specifications of DCCP
3017	   [RFC4340].

3019	   The ECN nonce is an elegant scheme that allows the sender to detect
3020	   if someone in the feedback loop - the receiver especially - tries to
3021	   claim no congestion was experienced when in fact congestion led to
3022	   packet drops or ECN marks.  For each packet it sends, the sender
3023	   chooses between the two ECT codepoints in a pseudo-random sequence.
3024	   Then, whenever the network marks a packet with CE, if the receiver
3025	   wants to deny congestion happened, she has to guess which ECT
3026	   codepoint was overwritten.  She has only a 50:50 chance of being
3027	   correct each time she denies a congestion mark or a drop, which
3028	   ultimately will give her away.
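
   The 50:50 argument compounds quickly, as the following Python sketch
   shows.  It is a simplified model rather than the nonce procedure
   itself: it assumes each concealed mark forces one independent guess
   of a single bit and that one wrong guess exposes the receiver.  The
   function names are hypothetical.

```python
import random


def escape_probability(num_concealed):
    """Chance that every one of num_concealed guesses is correct."""
    return 0.5 ** num_concealed


def concealment_survives(num_concealed, rng):
    """Simulate one receiver hiding num_concealed congestion marks.

    Returns True only if she guesses every overwritten codepoint.
    """
    for _ in range(num_concealed):
        nonce = rng.randrange(2)  # ECT codepoint chosen by the sender
        guess = rng.randrange(2)  # receiver cannot see it, so guesses
        if guess != nonce:
            return False
    return True
```

   Under these assumptions a receiver who conceals ten marks escapes
   detection with probability 1/1024, which is why persistent cheating
   is expected to give her away.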

3030	   The purpose of a network-layer nonce should primarily be protection
3031	   of the network, while a transport-layer nonce would be better used to
3032	   protect the sender from cheating receivers.  Now, the assumption
3033	   behind the ECN nonce is that a sender will want to detect whether a
3034	   receiver is suppressing congestion feedback.  This is only true if
3035	   the sender's interests are aligned with the network's, or with the
3036	   community of users as a whole.  This may be true for certain large
3037	   senders, who are under close scrutiny and have a reputation to
3038	   maintain.  But we have to deal with a more hostile world, where
3039	   traffic may be dominated by peer-to-peer transfers, rather than
3040	   downloads from a few popular sites.  Often the `natural' self-
3041	   interest of a sender is not aligned with the interests of other
3042	   users.  It often wishes to transfer data quickly to the receiver as
3043	   much as the receiver wants the data quickly.

3045	   In contrast, the re-ECN protocol enables policing of an agreed rate-
3046	   response to congestion (e.g. TCP-friendliness) at the sender's
3047	   interface with the internetwork.  It also ensures downstream networks
3048	   can police their upstream neighbours, to encourage them to police
3049	   their users in turn.  But most importantly, it requires the sender to
3050	   declare path congestion to the network and it can remove traffic at
3051	   the egress if this declaration is dishonest.  So it can police
3052	   correctly, irrespective of whether the receiver tries to suppress
3053	   congestion feedback or whether the sender ignores genuine congestion
3054	   feedback.  Therefore the re-ECN protocol addresses a much wider range
3055	   of cheating problems, which includes the one addressed by the ECN
3056	   nonce.

3058	9.3.  Identifying Upstream and Downstream Congestion

3060	   Purple [Purple] proposes that routers should use the CWR flag in the
3061	   TCP header of ECN-capable flows to work out path congestion and
3062	   therefore downstream congestion in a similar way to re-ECN.  However,
3063	   because CWR is in the transport layer, it is not always visible to
3064	   network layer routers and policers.  Purple's motivation was to
3065	   improve AQM, not policing.  But, of course, nodes trying to avoid a
3066	   policer would not be expected to allow CWR to be visible.

3068	10.  Security Considerations

3070	   This whole memo concerns the deployment of a secure congestion
3071	   control framework.  However, below we list some specific security
3072	   issues that we are still working on:

3074	   o  Malicious users have the ability to launch dynamically changing
3075	      attacks, exploiting the time it takes to detect an attack, given
3076	      ECN marking is binary.  We are concentrating on subtle
3077	      interactions between the ingress policer and the egress dropper in
3078	      an effort to make it impossible to game the system.

3080	   o  There is an inherent need for at least some flow state at the
3081	      egress dropper given the binary marking environment, which leads
3082	      to an apparent vulnerability to state exhaustion attacks.  An
3083	      egress dropper design with bounded flow state is in write-up.

3085	   o  A malicious source can spoof another user's address and send
3086	      negative traffic to the same destination in order to fool the
3087	      dropper into sanctioning the other user's flow.  To prevent or
3088	      mitigate these two different kinds of DoS attack, against the
3089	      dropper and against given flows, we are considering various
3090	      protection mechanisms.  Section 5.5.1 discusses one of these.

3092	   o  A malicious client can send requests using a spoofed source
3093	      address to a server (such as a DNS server) that tends to respond
3094	      with single packet responses.  This server will then be tricked
3095	      into having to set FNE on the first (and only) packet of all these
3096	      wasted responses.  Given packets marked FNE are worth +1, this
3097	      will cause such servers to consume more of their allowance to
3098	      cause congestion than they would wish to.  In general, re-ECN is
3099	      deliberately designed so that single packet flows have to bear the
3100	      cost of not discovering the congestion state of their path.  One
3101	      of the reasons for introducing re-ECN is to encourage short flows
3102	      to make use of previous path knowledge by moving the cost of this
3103	      lack of knowledge to sources that create short flows.  Therefore,
3104	      in the long run we might expect services like DNS to aggregate
3105	      single packet flows into connections where it brings benefits.
3106	      However, this attack where DNS requests are made from spoofed
3107	      addresses genuinely forces the server to waste its resources.  The
3108	      only mitigating feature is that the attacker has to set FNE on
3109	      each of its requests if they are to get through an egress dropper
3110	      to a DNS server.  The attacker therefore has to consume as many
3111	      resources as the victim, which at least implies re-ECN does not
3112	      unwittingly amplify this attack.

3114	   Having highlighted outstanding security issues, we now explain the
3115	   design decisions that were taken based on a security-related
3116	   rationale.  It may seem that the six codepoints of the eight made
3117	   available by extending the ECN field with the RE flag have been used
3118	   rather wastefully to encode just five states.  In effect the RE flag
3119	   has been used as an orthogonal single bit, using up four codepoints
3120	   to encode the three states of positive, neutral and negative worth.
3121	   The mapping of the codepoints in an earlier version of this proposal
3122	   used the codepoint space more efficiently, but the scheme became
3123	   vulnerable to network operators bypassing congestion penalties by
3124	   focusing congestion marking on positive packets.  Appendix B explains
3125	   why fixing that problem while allowing for incremental deployment,
3126	   would have used another codepoint anyway.  So it was better to use
3127	   this orthogonal encoding scheme, which greatly simplified the whole
3128	   protocol and brought with it some subtle security benefits (see the
3129	   last paragraph of Appendix B).

3131	   With the scheme as now proposed, once the RE flag is set or cleared
3132	   by the sender or its proxy, it should not be written by the network,
3133	   only read.  So the endpoints can detect if any network maliciously
3134	   alters the RE flag.  IPSec AH integrity checking does not cover the
3135	   IPv4 option flags (they were considered mutable---even the one we
3136	   propose using for the RE flag that was `currently unused' when IPSec
3137	   was defined).  But it would be sufficient for a pair of endpoints to
3138	   make random checks on whether the RE flag was the same when it
3139	   reached the egress as when it left the ingress.  Indeed, if IPSec AH
3140	   had covered the RE flag, any network intending to alter sufficient RE
3141	   flags to make a gain would have focused its alterations on packets
3142	   without authenticating headers (AHs).

3144	   The security of re-ECN has been deliberately designed to not rely on
3145	   cryptography.

3147	11.  IANA Considerations

3149	   This memo includes no request to IANA (yet).

3151	   If this memo were to progress to standards track, it would list:

3153	   o  The new RE flag in IPv4 (Section 5.1) and its extension with the
3154	      ECN field to create a new set of extended ECN (EECN) codepoints;

3156	   o  The definition of the EECN codepoints for default Diffserv PHBs
3157	      (Section 3.2);

3159	   o  The new extension header for IPv6 (Section 5.2);

3161	   o  The new combinations of flags in the TCP header for capability
3162	      negotiation (Section 4.1.3);

3164	   o  The new ICMP message type (Section 5.5.1).

3166	12.  Conclusions

3168	   {ToDo:}

3170	13.  Acknowledgements

3172	   Sebastien Cazalet and Andrea Soppera contributed to the idea of re-
3173	   feedback.  All the following have given helpful comments: Andrea
3174	   Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley,
3175	   Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright,
3176	   John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru
3177	   Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd
3178	   (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark
3179	   Handley (who developed the attack with canceled packets), Adam
3180	   Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft
3181	   (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who
3182	   complemented our own dummy traffic attacks with others), Liz Maida
3183	   (MIT), and comments from participants in the CRN/CFP Broadband and
3184	   DoS-resistant Internet working groups.

3186	14.  Comments Solicited

3188	   Comments and questions are encouraged and very welcome.  They can be
3189	   addressed to the IETF Transport Area working group's mailing list,
3190	   and/or to the authors.

3192	15.  References

3194	15.1.  Normative References

3196	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
3197	              Requirement Levels", BCP 14, RFC 2119, March 1997.

3199	   [RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering,
3200	              S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G.,
3201	              Partridge, C., Peterson, L., Ramakrishnan, K., Shenker,
3202	              S., Wroclawski, J., and L. Zhang, "Recommendations on
3203	              Queue Management and Congestion Avoidance in the
3204	              Internet", RFC 2309, April 1998.

3206	   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
3207	              Control", RFC 2581, April 1999.

3209	   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
3210	              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
3211	              Zhang, L., and V. Paxson, "Stream Control Transmission
3212	              Protocol", RFC 2960, October 2000.

3214	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
3215	              of Explicit Congestion Notification (ECN) to IP",
3216	              RFC 3168, September 2001.

3218	   [RFC3390]  Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's
3219	              Initial Window", RFC 3390, October 2002.

3221	   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
3222	              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

3224	   [RFC4341]  Floyd, S. and E. Kohler, "Profile for Datagram Congestion
3225	              Control Protocol (DCCP) Congestion Control ID 2: TCP-like
3226	              Congestion Control", RFC 4341, March 2006.

3228	   [RFC4342]  Floyd, S., Kohler, E., and J. Padhye, "Profile for
3229	              Datagram Congestion Control Protocol (DCCP) Congestion
3230	              Control ID 3: TCP-Friendly Rate Control (TFRC)", RFC 4342,
3231	              March 2006.

3233	15.2.  Informative References

3235	   [ARI05]    Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the
3236	              Internet to Support Real-Time Content Supply from a Large
3237	              Fraction of Broadband Residential Users", BT Technology
3238	              Journal (BTTJ) 23(2), April 2005.

3240	   [Bauer06]  Bauer, S., Faratin, P., and R. Beverly, "Assessing the
3241	              assumptions underlying mechanism design for the Internet",
3242	              Proc. Workshop on the Economics of Networked Systems
3243	              (NetEcon06), June 2006.

3246	   [CLoop_pol]
3247	              Salvatori, A., "Closed Loop Traffic Policing", Politecnico
3248	              Torino and Institut Eurecom Masters Thesis,
3249	              September 2005.

3251	   [ECN-Deploy]
3252	              Floyd, S., "ECN (Explicit Congestion Notification) in
3253	              TCP/IP; Implementation and Deployment of ECN", Web-page,
3254	              May 2004.

3257	   [ECN-MPLS]
3258	              Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion
3259	              Marking in MPLS", draft-ietf-tsvwg-ecn-mpls-01 (work in
3260	              progress), June 2007.

3262	   [ECN-tunnel]
3263	              Briscoe, B., "Layered Encapsulation of Congestion
3264	              Notification", draft-briscoe-tsvwg-ecn-tunnel-00 (work in
3265	              progress), June 2007.

3267	   [Evol_cc]  Gibbens, R. and F. Kelly, "Resource pricing and the
3268	              evolution of congestion control", Automatica 35(12)1969--
3269	              1985, December 1999.

3272	   [I-D.ietf-tcpm-ecnsyn]
3273	              Kuzmanovic, A., "Adding Explicit Congestion Notification
3274	              (ECN) Capability to TCP's SYN/ACK  Packets",
3275	              draft-ietf-tcpm-ecnsyn-03 (work in progress),
3276	              November 2007.

3278	   [I-D.moncaster-tcpm-rcv-cheat]
3279	              Moncaster, T., "A TCP Test to Allow Senders to Identify
3280	              Receiver Non-Compliance",
3281	              draft-moncaster-tcpm-rcv-cheat-02 (work in progress),
3282	              November 2007.

3284	   [ITU-T.I.371]
3285	              ITU-T, "Traffic Control and Congestion Control in
3286	              {B-ISDN}", ITU-T Rec. I.371 (03/04), March 2004.

3288	   [Jiang02]  Jiang, H. and C. Dovrolis, "Passive Estimation of TCP
3289	              Round-Trip Times", ACM SIGCOMM CCR 32(3)75--88,
3290	              July 2002.

3293	   [Mathis97]
3294	              Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The
3295	              Macroscopic Behavior of the TCP Congestion Avoidance
3296	              Algorithm", ACM SIGCOMM CCR 27(3)67--82, July 1997.

3299	   [PCN-arch]
3300	              Eardley, P., Babiarz, J., Chan, K., Charny, A., Geib, R.,
3301	              Karagiannis, G., Menth, M., and T. Tsou, "Pre-Congestion
3302	              Notification Architecture",
3303	              draft-eardley-pcn-architecture-00 (work in progress),
3304	              June 2007.

3306	   [Purple]   Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE:
3307	              Predictive Active Queue Management Utilizing Congestion
3308	              Information", Proc. Local Computer Networks (LCN 2003),
3309	              October 2003.

3311	   [RFC2208]  Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell,
3312	              M., Romanow, A., Weinrib, A., and L. Zhang, "Resource
3313	              ReSerVation Protocol (RSVP) Version 1 Applicability
3314	              Statement Some Guidelines on Deployment", RFC 2208,
3315	              September 1997.

3317	   [RFC2402]  Kent, S. and R. Atkinson, "IP Authentication Header",
3318	              RFC 2402, November 1998.

3320	   [RFC2406]  Kent, S. and R. Atkinson, "IP Encapsulating Security
3321	              Payload (ESP)", RFC 2406, November 1998.

3323	   [RFC2475]  Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z.,
3324	              and W. Weiss, "An Architecture for Differentiated
3325	              Services", RFC 2475, December 1998.

3327	   [RFC2988]  Paxson, V. and M. Allman, "Computing TCP's Retransmission
3328	              Timer", RFC 2988, November 2000.

3330	   [RFC3124]  Balakrishnan, H. and S. Seshan, "The Congestion Manager",
3331	              RFC 3124, June 2001.

3333	   [RFC3514]  Bellovin, S., "The Security Flag in the IPv4 Header",
3334	              RFC 3514, April 2003.

3336	   [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
3337	              Congestion Notification (ECN) Signaling with Nonces",
3338	              RFC 3540, June 2003.

3340	   [RFC3714]  Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion
3341	              Control for Voice Traffic in the Internet", RFC 3714,
3342	              March 2004.

3344	   [RFC4301]  Kent, S. and K. Seo, "Security Architecture for the
3345	              Internet Protocol", RFC 4301, December 2005.

3347	   [Re-PCN]   Briscoe, B., "Emulating Border Flow Policing using Re-ECN
3348	              on Bulk Data", draft-briscoe-re-pcn-border-cheat-00 (work
3349	              in progress), July 2007.

3351	   [Re-fb]    Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C.,
3352	              Salvatori, A., Soppera, A., and M. Koyabe, "Policing
3353	              Congestion Response in an Internetwork Using Re-Feedback",
3354	              ACM SIGCOMM CCR 35(4)277--288, August 2005.

3358	   [Savage99]
3359	              Savage, S., Cardwell, N., Wetherall, D., and T. Anderson,
3360	              "TCP congestion control with a misbehaving receiver", ACM
3361	              SIGCOMM CCR 29(5), October 1999.

3364	   [Smart_rtg]
3365	              Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang,
3366	              "Optimizing Cost and Performance for Multihoming", ACM
3367	              SIGCOMM CCR 34(4)79--92, October 2004.

3370	   [Steps_DoS]
3371	              Handley, M. and A. Greenhalgh, "Steps towards a DoS-
3372	              resistant Internet Architecture", Proc. ACM SIGCOMM
3373	              workshop on Future directions in network architecture
3374	              (FDNA'04) pp 49--56, August 2004.

3376	   [Tussle]   Clark, D., Sollins, K., Wroclawski, J., and R. Braden,
3377	              "Tussle in Cyberspace: Defining Tomorrow's Internet", ACM
3378	              SIGCOMM CCR 32(4)347--356, October 2002.

3382	   [XCHOKe]   Chhabra, P., Chuig, S., Goel, A., John, A., Kumar, A.,
3383	              Saran, H., and R. Shorey, "XCHOKe: Malicious Source
3384	              Control for Congestion Avoidance at Internet Gateways",
3385	              Proceedings of IEEE International Conference on Network
3386	              Protocols (ICNP-02), November 2002.

3389	   [pBox]     Floyd, S. and K. Fall, "Promoting the Use of End-to-End
3390	              Congestion Control in the Internet", IEEE/ACM Transactions
3391	              on Networking 7(4) 458--472, August 1999.

3394	Appendix A.  Precise Re-ECN Protocol Operation

3396	   {ToDo: fix this}

3398	   The protocol operation in the middle described in Section 3.3 was an
3399	   approximation.  In fact, standard ECN router marking combines 1% and
3400	   2% marking into slightly less than 3% whole-path marking, because
3401	   routers deliberately mark CE whether or not it has already been
3402	   marked by another router upstream.  So the combined marking fraction
3403	   would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%.

3405	   To generalise this we will need some notation.

3407	   o  j represents the index of each resource (typically queues) along a
3408	      path, ranging from 0 at the first router to n-1 at the last.

3410	   o  m_j represents the fraction of octets *m*arked CE by a particular
3411	      router (whether or not they are already marked) because of
3412	      congestion of resource j.

3414	   o  u_j represents congestion *u*pstream of resource j, being the
3415	      fraction of CE marking in arriving packet headers (before
3416	      marking).

3418	   o  p_j represents *p*ath congestion, being the fraction of octets
3419	      arriving at resource j in packets with the RE flag blanked
3420	      (excluding Not-RECT packets).

3422	   o  v_j denotes expected congestion downstream of resource j, which
3423	      can be thought of as a *v*irtual marking fraction, being derived
3424	      from two other marking fractions.

3426	   Observed fractions of each particular codepoint (u, p and v) and
3427	   router marking rate m are dimensionless fractions, being the ratio of
3428	   two data volumes (marked and total) over a monitoring period.  All
3429	   measurements are in terms of octets, not packets, assuming that line
3430	   resources are more congestible than packet processing.

3432	   The path congestion (RE blanking fraction) set by the sender should
3433	   reflect the upstream congestion (CE marking fraction) fed back from
3434	   the destination.  Therefore in the steady state

3436	      p_0  = u_n
3437	           = 1 - (1 - m_0)(1 - m_1)...(1 - m_{n-1})

3439	   Similarly, at some point j in the middle of the network, if p = 1 -
3440	   (1 - u_j)(1 - v_j), then

3442	      v_j  = 1 - (1 - p)/(1 - u_j)

3444	          ~= p - u_j;                      if u_j << 100%

3446	   So, between the two routers in the example in Section 3.3, congestion
3447	   downstream is

3449	      v_1  = 100.00% - (100% - 2.98%) / (100% - 1.00%)
3450	           = 2.00%,

3452	   or a useful approximation of downstream congestion is

3454	      v_1 ~= 2.98% - 1.00%
3455	          ~= 1.98%.
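
	   The arithmetic above can be checked with a short Python sketch.  It
	   is illustrative only: the function names combined_marking and
	   downstream_congestion are invented here, and the marking fractions
	   are the 1% and 2% figures of the example in Section 3.3.

```python
def combined_marking(marks):
    """Whole-path CE fraction, 1 - (1 - m_0)(1 - m_1)..., for per-resource marks."""
    unmarked = 1.0
    for m in marks:
        unmarked *= 1.0 - m          # fraction that survives this resource unmarked
    return 1.0 - unmarked


def downstream_congestion(p, u_j):
    """Exact downstream congestion v_j = 1 - (1 - p)/(1 - u_j)."""
    return 1.0 - (1.0 - p) / (1.0 - u_j)


if __name__ == "__main__":
    p = combined_marking([0.01, 0.02])        # path congestion declared by the sender
    v_exact = downstream_congestion(p, 0.01)  # between the two routers
    v_approx = p - 0.01                       # approximation valid when u_j << 100%
    print(f"p={p:.4%} v_exact={v_exact:.4%} v_approx={v_approx:.4%}")
```

	   The gap between the exact and approximate values stays small only
	   while upstream congestion u_j remains a small fraction.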

3457	Appendix B.  Justification for Two Codepoints Signifying Zero Worth
3458	             Packets

3460	   It may seem a waste of a codepoint to set aside two codepoints of the
3461	   Extended ECN field to signify zero worth (RECT and CE(0) are both
3462	   worth zero).  The justification is subtle, but worth recording.

3464	   The original version of re-ECN ([Re-fb] and draft-00 of this memo)
3465	   used three codepoints for neutral (ECT(1)), positive (ECT(0)) and
3466	   negative (CE) packets.  The sender set packets to neutral unless re-
3467	   echoing congestion, when it set them positive, in much the same way
3468	   that it blanks the RE flag in the current protocol.  However, routers
3469	   were meant to mark congestion by setting packets negative (CE)
3470	   irrespective of whether they had previously been neutral or positive.

3472	   However, we did not arrange for senders to remember which packet had
3473	   been sent with which codepoint, or for feedback to say exactly which
3474	   packets arrived with which codepoints.  The transport was meant to
3475	   inflate the number of positive packets it sent to allow for a few
3476	   being wiped out by congestion marking.  We (wrongly) assumed that
3477	   routers would congestion mark packets indiscriminately, so the
3478	   transport could infer how many positive packets had been marked and
3479	   compensate accordingly by re-echoing.  But this created a perverse
3480	   incentive for routers to preferentially congestion mark positive
3481	   packets rather than neutral ones.

3483	   We could have removed this perverse incentive by requiring re-ECN
3484	   senders to remember which packets they had sent with which codepoint,
3485	   and feedback from the receiver to identify which packets arrived
3486	   as which.  Then, if a positive packet was congestion marked to
3487	   negative, the sender could have re-echoed twice to maintain the
3488	   balance between positive and negative at the receiver.

3490	   Instead, we chose to make re-echoing congestion (blanking RE)
3491	   orthogonal to congestion notification (marking CE), which required a
3492	   second neutral codepoint (the orthogonal scheme forms the main square
3493	   of four codepoints in Figure 2).  Then the receiver would be able to
3494	   detect and echo a congestion event even if it arrived on a packet
3495	   that had originally been positive.

3497	   If we had added extra complexity to the sender and receiver
3498	   transports to track changes to individual packets, we could have made
3499	   it work, but then routers would have had an incentive to mark
3500	   positive packets with half the probability of neutral packets.  That
3501	   in turn would have led router algorithms to become more complex.
3502	   Then senders wouldn't know whether a mark had been introduced by a
3503	   simple or a complex router algorithm.  That in turn would have
3504	   required another codepoint to distinguish between legacy ECN and new
3505	   re-ECN router marking.

3507	   Once the cost of IP header codepoint real-estate was the same for
3508	   both schemes, there was no doubt that the simpler option for
3509	   endpoints and for routers should be chosen.  The resulting protocol
3510	   also no longer needed the tricky inflation/deflation complexity of
3511	   the original (broken) scheme.  It was also much simpler to understand
3512	   conceptually.

3514	   A further advantage of the new orthogonal four-codepoint scheme was
3515	   that senders owned sole rights to change the RE flag and routers
3516	   owned sole rights to change the ECN field.  Although we still arrange
3517	   the incentives so neither party strays outside their dominion, these
3518	   clear lines of authority simplify the matter.

3520	   Finally, a little redundancy can be very powerful in a scheme such as
3521	   this.  In one flow, the proportion of packets changed to CE should be
3522	   the same as the proportion of RECT packets changed to CE(-1) and the
3523	   proportion of Re-Echo packets changed to CE(0).  Double checking
3524	   using such redundant relationships can improve the security of a
3525	   scheme (cf. double-entry book-keeping or the ECN Nonce).
3526	   Alternatively, it might be necessary to exploit the redundancy in the
3527	   future to encode an extra information channel.

3529	Appendix C.  ECN Compatibility

3531	   The rationale for choosing the particular combinations of SYN and SYN
3532	   ACK flags in Section 4.1.3 is as follows.

3534	   Choice of SYN flags:  A re-ECN sender can work with vanilla ECN
3535	      receivers so we wanted to use the same flags as would be used in
3536	      an ECN-setup SYN [RFC3168] (CWR=1, ECE=1).  But at the same time,
3537	      we wanted a server (host B) that is Re-ECT to be able to recognise
3538	      that the client (A) is also Re-ECT.  We believe also setting NS=1
3539	      in the initial SYN achieves both these objectives, as it should be
3540	      ignored by vanilla ECT receivers and by ECT-Nonce receivers.  But
3541	      senders that are not Re-ECT should not set NS=1.  At the time ECN
3542	      was defined, the NS flag was not defined, so setting NS=1 should
3543	      be ignored by existing ECT receivers (but testing against
3544	      implementations may yet prove otherwise).  The ECN Nonce
3545	      RFC [RFC3540] is silent on what the NS field might be set to in
3546	      the TCP SYN, but we believe the intent was for a nonce client to
3547	      set NS=0 in the initial SYN (again only testing will tell).
3548	      Therefore we define a Re-ECN-setup SYN as one with NS=1, CWR=1 &
3549	      ECE=1.

3551	   Choice of SYN ACK flags:  The client (A) needs to be able to
3552	      determine whether the server (B) is Re-ECT.  The
3553	      original ECN specification required an ECT server to respond to an
3554	      ECN-setup SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1.  There
3555	      is no room to modify this by setting the NS flag, as that is
3556	      already set in the SYN ACK of an ECT-Nonce server.  So we used the
3557	      only combination of CWR and ECE that would not be used by existing
3558	      TCP receivers: CWR=1 and ECE=0.  The original ECN specification
3559	      defines this combination as a non-ECN-setup SYN ACK, which remains
3560	      true for vanilla and Nonce ECTs.  But for re-ECN we define it as a
3561	      Re-ECN-setup SYN ACK.  We didn't use a SYN ACK with both CWR and
3562	      ECE cleared to 0 because that would be the likely response from
3563	      most Not-ECT receivers.  And we didn't use a SYN ACK with both CWR
3564	      and ECE set to 1 either, as at least one broken receiver
3565	      implementation echoes whatever flags were in the SYN into its SYN
3566	      ACK.  Therefore we define a Re-ECN-setup SYN ACK as one with CWR=1
3567	      & ECE=0.

3569	   Choice of two alternative SYN ACKs:  The NS flag may take either
3570	      value in a Re-ECN-setup SYN ACK.  Section 5.4 specifies that a Re-
3571	      ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK to
3572	      echo congestion experienced (CE) on the initial SYN.  Otherwise a
3573	      Re-ECN-setup SYN ACK MUST be returned with NS=0.  The only current
3574	      known use of the NS flag in a SYN ACK is to indicate support for
3575	      the ECN nonce, which will be negotiated by setting CWR=0 & ECE=1.
3576	      Given the ECN nonce MUST NOT be used for a RECN mode connection, a
3577	      Re-ECN-setup SYN ACK can use either setting of the NS flag without
3578	      any risk of confusion, because the CWR & ECE flags will be
3579	      reversed relative to those used by an ECN nonce SYN ACK.
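
	   The flag combinations discussed in this appendix can be summarised
	   as a small classifier, written from the point of view of a Re-ECT
	   host.  The Python sketch below is illustrative only: the function
	   names and the returned label strings are invented for this example,
	   and it covers only the NS, CWR and ECE bits of the SYN and SYN ACK.

```python
def classify_syn(ns, cwr, ece):
    """Interpret the NS/CWR/ECE bits of an initial SYN."""
    if cwr == 1 and ece == 1:
        # NS=1 distinguishes a Re-ECT client from a vanilla ECN client.
        return "Re-ECN-setup SYN" if ns == 1 else "ECN-setup SYN"
    return "non-ECN-setup SYN"


def classify_syn_ack(ns, cwr, ece):
    """Interpret the NS/CWR/ECE bits of a SYN ACK as seen by a Re-ECT client."""
    if cwr == 1 and ece == 0:
        # RFC 3168 hosts treat this as a non-ECN-setup SYN ACK.
        if ns == 1:
            return "Re-ECN-setup SYN ACK, CE echoed on SYN"
        return "Re-ECN-setup SYN ACK"
    if cwr == 0 and ece == 1:
        return "ECN-setup SYN ACK (nonce)" if ns == 1 else "ECN-setup SYN ACK"
    # CWR=ECE=0 (Not-ECT) or CWR=ECE=1 (e.g. a receiver echoing the SYN flags).
    return "non-ECN-setup SYN ACK"


if __name__ == "__main__":
    print(classify_syn(1, 1, 1))
    print(classify_syn_ack(0, 1, 0))
```

	   Because the CWR and ECE values of a Re-ECN-setup SYN ACK are
	   reversed relative to an ECN-setup SYN ACK, the NS bit can be
	   tested independently in each branch.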

3581	Appendix D.  Packet Marking with FNE During Flow Start

3583	   FNE (feedback not established) packets have two functions.  Their
3584	   main role is to announce the start of a new flow when feedback has
3585	   not yet been established.  However they also have the role of
3586	   balancing the expected feedback and can be used where there are
3587	   sudden changes in the rate of transmission.  Whilst such changes
3588	   should not happen under TCP, this speculative marking role underpins
3589	   the following argument for why the first and third packets should
3590	   be set to FNE.

3592	   The proportion of FNE packets in each roundtrip should be a high
3593	   estimate of the potential error in the balance of number of
3594	   congestion marked packets versus number of re-echo packets already
3595	   issued.

3597	   Let's call:

3599	      S: the number of the TCP segments sent so far

3601	      F: the number of FNE packets sent so far

3602	      R: the number of Re-Echo packets sent so far

3604	      A: the number of acknowledgments received so far

3606	      C: the number of acknowledgments echoing a CE packet

3608	   In normal operation, when we want to send packet S+1, we first need
3609	   to check that enough Re-Echo packets have been issued:

3611	   If R 1 FNE

3651	   o  if the acknowledgment doesn't echo a mark

3653	      *  for the second packet, A=F=S=1 R=C=0 ==> 1 RECT

3655	      *  for the third packet, S=2 A=F=1 R=C=0 ==> 1 FNE

3657	   o  if no acknowledgement for these two packets echoes a congestion
3658	      mark, then {A=S=3 F=2 R=C=0} which gives k<2*4/1-3, so the source
3659	      could send another 4 RECT packets. ==> 4 RECT

3660	   o  if no acknowledgement for these four packets echoes a congestion
3661	      mark, then {A=S=7 F=2 R=C=0} which gives k<2*8/1-7, so the source
3662	      could send another 8 RECT packets. ==> 8 RECT

3664	   This behaviour happens to match TCP's congestion window control in
3665	   slow start, which is why for TCP sources, only the first and third
3666	   packet need be FNE packets.

3668	   A source that would open the congestion window any quicker would have
3669	   to insert more FNE packets.  As another example a UDP source sending
3670	   VBR traffic might need to send several FNE packets ahead of the
3671	   traffic peaks it generates.

3673	Appendix E.  Example Egress Dropper Algorithm

3675	   {ToDo: Write up the basic algorithm with flow state, then the
3676	   aggregated one.}

3678	Appendix F.  Re-TTL

3680	   This Appendix gives an overview of a proposal to overload
3681	   the TTL field in the IP header to monitor downstream propagation
3682	   delay.  This is included to show that it would be possible to take
3683	   account of RTT if it were deemed desirable.

3685	   Delay re-feedback can be achieved by overloading the TTL field,
3686	   without changing IP or router TTL processing.  A target value for TTL
3687	   at the destination would need standardising, say 16.  If the path hop
3688	   count increased by more than 16 during a routing change, it would
3689	   temporarily be mistaken for a routing loop, so this target would need
3690	   to be chosen to exceed typical hop count increases.  The TCP wire
3691	   protocol and handlers would need modifying to feed back the
3692	   destination TTL and initialise it.  It would be necessary to
3693	   standardise the unit of TTL in terms of real time (as was the
3694	   original intent in the early days of the Internet).

3696	   In the longer term, precision could be improved if routers
3697	   decremented TTL to represent exact propagation delay to the next
3698	   router.  That is, for a router to decrement TTL by, say, 1.8 time
3699	   units it would alternate the decrement of every packet between 1 & 2
3700	   at a ratio of 1:4.  Although this might sometimes require a seemingly
3701	   dangerous null decrement, a packet in a loop would still decrement to
3702	   zero after 255 time units on average.  As more routers were upgraded
3703	   to this more accurate TTL decrement, path delay estimates would
3704	   become increasingly accurate despite the presence of some legacy
3705	   routers that continued to always decrement the TTL by 1.
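
	   One way to obtain such a fractional average decrement is to carry
	   the remainder from packet to packet.  The Python sketch below is
	   only an illustration under that assumption: the class name
	   FractionalTtlDecrementer is invented here, and a real router could
	   equally choose the decrement at random with the same long-run
	   ratio.

```python
from fractions import Fraction


class FractionalTtlDecrementer:
    """Yields integer TTL decrements whose long-run mean is `delay_units`."""

    def __init__(self, delay_units):
        self.delay = Fraction(delay_units)   # e.g. "1.8" time units to next router
        self.carry = Fraction(0)             # fractional remainder not yet applied

    def next_decrement(self):
        self.carry += self.delay
        whole = self.carry.numerator // self.carry.denominator
        self.carry -= whole                  # keep only the fractional part
        return whole


if __name__ == "__main__":
    router = FractionalTtlDecrementer("1.8")
    print([router.next_decrement() for _ in range(10)])
```

	   For a delay of 1.8 units, each run of five packets receives one
	   decrement of 1 and four decrements of 2.  A delay below one unit
	   produces the occasional null decrement mentioned above.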

3707	Appendix G.  Policer Designs to ensure Congestion Responsiveness

3709	G.1.  Per-user Policing

3711	   User policing requires a policer on the ingress interface of the
3712	   access router associated with the user.  At that point, the traffic
3713	   of the user hasn't diverged on different routes yet; nor has it mixed
3714	   with traffic from other sources.

3716	   In order to ensure that a user doesn't generate more congestion in
3717	   the network than her due share, a modified bulk token-bucket is
3718	   maintained with the following parameters:

3720	   o  b_0 the initial token level

3722	   o  r the filling rate

3724	   o  b_max the bucket depth

3726	   The same token bucket algorithm is used as in many areas of
3727	   networking, but how it is used is very different:

3729	   o  all traffic from a user over the lifetime of their subscription is
3730	      policed in the same token bucket.

3732	   o  only positive and canceled packets (Re-Echo, FNE and CE(0))
3733	      consume tokens

3735	   Such a policer will allow network operators to throttle the
3736	   contribution of their users to network congestion.  This will require
3737	   the appropriate contractual terms to be in place between operators
3738	   and users.  For instance: a condition for a user to subscribe to a
3739	   given network service may be that she should not cause more than a
3740	   volume C_user of congestion over a reference period T_user, although
3741	   she may carry forward up to N_user times her allowance at the end of
3742	   each period.  These terms directly set the parameters of the user
3743	   policer:

3745	   o  b_0 = C_user

3747	   o  r = C_user/T_user

3749	   o  b_max = b_0 * (N_user +1)

3751	   Besides the congestion budget policer above, another user policer may
3752	   be necessary to further rate-limit FNE packets, if they are to be
3753	   marked rather than dropped (see discussion in Section 5.3).  Rate-
3754	   limiting FNE packets will prevent high bursts of new flow arrivals,
3755	   which is a very useful feature in DoS prevention.  A condition to
3756	   subscribe to a given network service would have to be that a user
3757	   should not generate more than C_FNE FNE packets, over a reference
3758	   period T_FNE, with no option to carry forward any of the allowance at
3759	   the end of each period.  These terms directly set the parameters of
3760	   the FNE policer:

3762	   o  b_0 = C_FNE

3764	   o  r = C_FNE/T_FNE

3766	   o  b_max = b_0

3768	   T_FNE should be a much shorter period than T_user: for instance T_FNE
3769	   could be in the order of minutes and T_user could be in the order of
3770	   weeks.
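
	   The two user policers described in this section can be sketched
	   with one token-bucket class.  The Python below is illustrative
	   only: the class name TokenBucketPolicer, the admit() method and the
	   string codepoint labels are invented for this example, time is
	   passed in explicitly, and dropping is modelled simply as admit()
	   returning False when a packet would take the level below zero.

```python
# Codepoints that consume tokens in the congestion budget policer.
POSITIVE_OR_CANCELED = frozenset({"Re-Echo", "FNE", "CE(0)"})


class TokenBucketPolicer:
    """Bulk token bucket covering all traffic of one user."""

    def __init__(self, allowance, period, carry_forward=0,
                 consuming=POSITIVE_OR_CANCELED, now=0.0):
        self.rate = allowance / period                # r = C / T
        self.depth = allowance * (carry_forward + 1)  # b_max = b_0 * (N + 1)
        self.level = allowance                        # b_0 = C
        self.consuming = consuming
        self.last = now

    def _refill(self, now):
        elapsed = now - self.last
        self.level = min(self.depth, self.level + self.rate * elapsed)
        self.last = now

    def admit(self, now, cost, eecn):
        """Return True to forward the packet, False to drop it."""
        self._refill(now)
        if eecn not in self.consuming:
            return True                 # other codepoints do not consume tokens
        if cost > self.level:
            return False                # would drive the token level negative
        self.level -= cost
        return True


if __name__ == "__main__":
    # Congestion budget: 1000 octets per 100 s, one period carried forward.
    user = TokenBucketPolicer(allowance=1000, period=100.0, carry_forward=1)
    print(user.admit(0.0, 600, "Re-Echo"), user.admit(0.0, 600, "CE(0)"))
```

	   Under the same assumptions, an FNE policer would be created with
	   carry_forward=0 and consuming={"FNE"}, charging a cost of one token
	   per FNE packet rather than the packet size.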

3772	G.2.  Per-flow Rate Policing

3774	   Whilst we believe that simple per-user policing would be sufficient
3775	   to ensure senders comply with congestion control, some operators may
3776	   wish to police the rate response of each flow to congestion as well.
3777	   Although we do not believe this will be necessary, we include this
3778	   section to show how one could perform per-flow policing using
3779	   enforcement of TCP-fairness as an example.  Per-flow policing aims to
3780	   enforce congestion responsiveness on the shortest information
3781	   timescale on a network path: packet roundtrips.

3783	   This again requires that the appropriate terms be agreed between a
3784	   network operator and its users, where a congestion responsiveness
3785	   policy might be required for the use of a given network service
3786	   (perhaps unless the user specifically requests otherwise).

3788	   As an example, we describe below how a rate adaptation policer can be
3789	   designed when the applicable rate adaptation policy is TCP-
3790	   compliance.  In that context, the average throughput of a flow will
3791	   be expected to be bounded by the value of the TCP throughput during
3792	   congestion avoidance, given in Mathis' formula [Mathis97]

3794	      x_TCP = k * s / ( T * sqrt(m) )

3796	   where:

3798	   o  x_TCP is the throughput of the TCP flow in octets per second,

3800	   o  k is a constant upper-bounded by sqrt(3/2),

3802	   o  s is the average packet size of the flow,

3804	   o  T is the roundtrip time of the flow,

3806	   o  m is the congestion level experienced by the flow.

3808	   We define the marking period N=1/m which represents the average
3809	   number of packets between two positive or canceled packets.  Mathis'
3810	   formula can be re-written as:

3812	      x_TCP = k*s*sqrt(N)/T

3814	   We can then get the average inter-mark time in a compliant TCP flow,
3815	   dt_TCP, by solving (x_TCP/s)*dt_TCP = N which gives

3817	      dt_TCP = sqrt(N)*T/k
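
	   As a numerical check of this derivation, the Python sketch below
	   evaluates both expressions.  The function names and the example
	   figures (1500-octet packets, 100 ms roundtrip, 1% congestion) are
	   assumptions chosen for illustration, with k taken at its upper
	   bound of sqrt(3/2).

```python
import math

K = math.sqrt(3 / 2)   # upper bound on the constant k


def tcp_throughput(s, T, m, k=K):
    """Mathis' formula: x_TCP = k * s / (T * sqrt(m))."""
    return k * s / (T * math.sqrt(m))


def inter_mark_time(T, m, k=K):
    """Average time between marks: dt_TCP = sqrt(N) * T / k, with N = 1/m."""
    N = 1 / m
    return math.sqrt(N) * T / k


if __name__ == "__main__":
    s, T, m = 1500, 0.1, 0.01
    x = tcp_throughput(s, T, m)
    dt = inter_mark_time(T, m)
    # (x_TCP / s) * dt_TCP should recover the marking period N = 1/m.
    print(x, dt, (x / s) * dt)
```

	   With these figures N = 100, so a compliant flow would see a mark
	   roughly every 0.8 seconds.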

3819	   We rely on this equation for the design of a rate-adaptation policer
3820	   as a variation of a token bucket.  In that case a policer has to be
3821	   set up for each policed flow.  This may be triggered by FNE packets,
3822	   with the remainder of flows being all rate limited together if they
3823	   do not start with an FNE packet.

3825	   Where maintaining per flow state is not a problem, for instance on
3826	   some access routers, systematic per-flow policing may be considered.
3827	   Should per-flow state be more constrained, rate adaptation policing
3828	   could be limited to a random sample of flows exhibiting positive or
3829	   canceled packets.

3831	   As in the case of user policing, only positive or canceled packets
3832	   will consume tokens, however the amount of tokens consumed will
3833	   depend on the congestion signal.

3835	   When a new rate adaptation policer is set up for flow j, the
3836	   following state is created:

3838	   o  a token bucket b_j of depth b_max starting at level b_0

3840	   o  a timestamp t_j = timenow()

3842	   o  a counter N_j = 0

3844	   o  a roundtrip estimate T_j

3846	   o  a filling rate r

3848	   When the policing node forwards a packet of flow j with no Re-Echo:

3850	   o  the counter is incremented: N_j += 1

3852	   When the policing node forwards a packet of flow j carrying a
3853	   congestion mark (CE):

3855	   o  the counter is incremented: N_j += 1

3857	   o  the token level is adjusted: b_j += r*(timenow()-t_j) - sqrt(N_j)*
3858	      T_j/k

3860	   o  the counter is reset: N_j = 0

3862	   o  the timer is reset: t_j = timenow()
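
	   The per-flow state and update rules above can be sketched in
	   Python as follows.  This is illustrative only: the class and
	   method names are invented, time is passed in explicitly rather
	   than read from timenow(), capping the level at b_max is an
	   assumption about how the bucket depth applies, and the boolean
	   returned by on_marked() is simply whether the token level is still
	   non-negative.

```python
import math

K = math.sqrt(3 / 2)   # upper bound on the constant k in Mathis' formula


class RateAdaptationPolicer:
    """Token-bucket state for one policed flow j."""

    def __init__(self, now, rtt, b_0, b_max, r=1.0, k=K):
        self.b = b_0          # token level b_j
        self.b_max = b_max    # bucket depth
        self.t = now          # timestamp t_j
        self.n = 0            # counter N_j
        self.rtt = rtt        # roundtrip estimate T_j
        self.r = r            # filling rate
        self.k = k

    def on_unmarked(self):
        """Packet forwarded that does not trigger a token adjustment."""
        self.n += 1

    def on_marked(self, now):
        """Packet forwarded that triggers the token adjustment."""
        self.n += 1
        self.b += self.r * (now - self.t) - math.sqrt(self.n) * self.rtt / self.k
        self.b = min(self.b, self.b_max)    # assumed: level cannot exceed depth
        self.n = 0
        self.t = now
        return self.b >= 0                  # False once the level goes negative
```

	   With r = 1 token/sec, a flow whose marks arrive every
	   sqrt(N)*T/k seconds leaves the level unchanged, whereas a flow
	   sending more packets between marks in the same interval drains
	   the bucket.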

3864	   An implementation example will be given in a later draft that avoids
3865	   having to extract the square root.

3867	   Analysis: For a TCP flow, for r = 1 token/sec, on average,

3869	      r*(timenow()-t_j)-sqrt(N_j)* T_j/k = dt_TCP - sqrt(N)*T/k = 0

3871	   This means that the token level will fluctuate around its initial
3872	   level.  The depth b_max of the bucket sets the timescale on which the
3873	   rate adaptation policy is performed while the filling rate r sets the
3874	   trade-off between responsiveness and robustness:

3876	   o  the higher b_max, the longer it will take to catch greedy flows

3878	   o  the higher r, the fewer false positives (greedy verdict on
3879	      compliant flows) but the more false negatives (compliant verdict
3880	      on greedy flows)

3882	   This rate adaptation policer requires the availability of a roundtrip
3883	   estimate, which may be obtained for instance from the application of
3884	   re-feedback to the downstream delay (Appendix F) or passive
3885	   estimation [Jiang02].

3887	   When the bucket of a policer located at the access router (whether it
3888	   is a per-user policer or a per-flow policer) becomes empty, the
3889	   access router SHOULD drop at least all packets causing the token
3890	   level to become negative.  The network operator MAY take further
3891	   sanctions if the token level of the per-flow policers associated with
3892	   a user becomes negative.

3894	Appendix H.  Downstream Congestion Metering Algorithms

3896	H.1.  Bulk Downstream Congestion Metering Algorithm

3898	   To meter the bulk amount of downstream congestion in traffic crossing
3899	   an inter-domain border, an algorithm is needed that accumulates the
3900	   size of positive packets and subtracts the size of negative packets.
3901	   We maintain two counters:

3903	      V_b: accumulated congestion volume

3905	      B: total data volume (in case it is needed)

3907	   A suitable pseudo-code algorithm for a border router is as follows:

3909	   ====================================================================
3910	   V_b = 0
3911	   B   = 0
3912	   for each re-ECN-capable packet {
3913	       b = readLength(packet)      /* set b to packet size          */
3914	       B += b                      /* accumulate total volume       */
3915	       if readEECN(packet) == Re-Echo || readEECN(packet) == FNE {
3916	           V_b += b                /* increment...                  */
3917	       } elseif readEECN(packet) == CE(-1) {
3918	           V_b -= b                /* ...or decrement V_b...        */
3919	       }                           /*...depending on EECN field     */
3920	   }
3921	   ====================================================================

3923	   At the end of an accounting period this counter V_b represents the
3924	   congestion volume that penalties could be applied to, as described in
3925	   Section 6.1.6.

3927	   For instance, the accumulated volume of congestion through a border
3928	   interface over a month might be V_b = 5PB (petabyte = 10^15 bytes).
3929	   This might have resulted from an average downstream congestion level
3930	   of 1% on an accumulated total data volume of B = 500PB.
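
	   The pseudo-code above translates directly into a runnable form.
	   The following Python sketch is one possible rendering; the EECN
	   enumeration and the meter_bulk function are hypothetical names,
	   and packets are modelled as (length, codepoint) pairs rather
	   than read from an interface.

```python
from enum import Enum


class EECN(Enum):
    """Extended ECN codepoints relevant to the meter (hypothetical encoding)."""
    RE_ECHO = "Re-Echo"  # positive
    FNE = "FNE"          # positive
    CE_NEG = "CE(-1)"    # negative
    OTHER = "other"      # any other codepoint; leaves V_b unchanged


def meter_bulk(packets):
    """Return (V_b, B) for an iterable of (length_in_bytes, EECN) pairs.

    Only re-ECN-capable packets should be passed in, as in the pseudo-code.
    """
    V_b = 0  # accumulated congestion volume
    B = 0    # total data volume
    for length, eecn in packets:
        B += length
        if eecn in (EECN.RE_ECHO, EECN.FNE):
            V_b += length      # positive packet: increment V_b
        elif eecn is EECN.CE_NEG:
            V_b -= length      # negative packet: decrement V_b
    return V_b, B
```

	   Applied to the example figures, V_b / B = 5PB / 500PB = 0.01,
	   which is the 1% congestion level quoted above.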

3932	H.2.  Inflation Factor for Persistently Negative Flows

3934	   The following process is suggested to complement the simple algorithm
3935	   above in order to protect against the various attacks from
3936	   persistently negative flows described in Section 6.1.6.  As explained
3937	   in that section, the most important and first step is to estimate the
3938	   contribution of persistently negative flows to the bulk volume of
3939	   downstream pre-congestion and to inflate this bulk volume as if these
3940	   flows weren't there.  The process below has been designed to give an
3941	   unbiased estimate, but it may be possible to define other processes
3942	   that achieve similar ends.

3944	   While the above simple metering algorithm is counting the bulk of
3945	   traffic over an accounting period, the meter should also select a
3946	   subset of the whole flow ID space that is small enough to be able to
3947	   realistically measure but large enough to give a realistic sample.
3948	   Many different samples of different subsets of the ID space should be
3949	   taken at different times during the accounting period, preferably
3950	   covering the whole ID space.  During each sample, the meter should
3951	   count the volume of positive packets and subtract the volume of
3952	   negative, maintaining a separate account for each flow in the sample.
3953	   Each sample should run much longer than the large majority of flows,
3954	   to avoid a bias from missing the starts and ends of flows, which tend
3955	   to be positive and negative respectively.

3957	   Once the accounting period finishes, the meter should calculate the
3958	   total of the accounts V_{bI} for the subset of flows I in the sample,
3959	   and the total of the accounts V_{fI} excluding flows with a negative
3960	   account from the subset I.  Then the weighted mean of all these
3961	   samples should be taken:
3962	      a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}.

3964	   If V_b is the result of the bulk accounting algorithm over the
3965	   accounting period (Appendix H.1), it can be inflated by this factor
3966	   a_S to get a good unbiased estimate of the volume of downstream
3967	   congestion over the accounting period, a_S * V_b, without being
3968	   polluted by the effect of persistently negative flows.
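
	   The calculation of a_S can be sketched as follows.  This is an
	   illustration under stated assumptions: each sample is represented
	   as a mapping from flow ID to that flow's account (positive volume
	   minus negative volume), and the function names are hypothetical.
	   How the samples are selected is outside the scope of the sketch.

```python
def inflation_factor(samples):
    """Return a_S = sum_I V_fI / sum_I V_bI over all samples I.

    samples: iterable of dicts mapping flow ID -> account in bytes
    (volume of positive packets minus volume of negative packets).
    """
    sum_f = 0  # accounts excluding flows with a negative account
    sum_b = 0  # accounts of all flows in each sample
    for accounts in samples:
        sum_b += sum(accounts.values())
        sum_f += sum(v for v in accounts.values() if v >= 0)
    if sum_b <= 0:
        # Assumption: the text does not say how to handle this case.
        raise ValueError("sampled congestion volume must be positive")
    return sum_f / sum_b


def inflated_volume(V_b, samples):
    """Inflate the bulk counter V_b of Appendix H.1 by the factor a_S."""
    return inflation_factor(samples) * V_b
```

	   For instance, two samples with per-flow accounts {300, -100, 200}
	   and {100, 0} give a_S = 600/500 = 1.2, so a bulk count of
	   V_b = 1000 is inflated to 1200.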

3970	Appendix I.  Argument for holding back the ECN nonce

3972	   The ECN nonce is a mechanism that allows a /sending/ transport to
3973	   detect if drop or ECN marking at a congested router has been
3974	   suppressed by a node somewhere in the feedback loop---another router
3975	   or the receiver.

3977	   Space for the ECN nonce was set aside in [RFC3168] (currently
3978	   proposed standard) while the full nonce mechanism is specified in
3980	   [RFC3540] (currently experimental).  The specification of DCCP
3981	   [RFC4340] (currently proposed standard) requires that "Each DCCP
3982	   sender SHOULD set ECN Nonces on its packets...".  It also mandates
3983	   as a requirement for all CCID profiles that "Any newly defined
3984	   acknowledgement mechanism MUST include a way to transmit ECN Nonce
3985	   Echoes back to the sender."  Therefore:

3987	   o  The CCID profile for TCP-like Congestion Control [RFC4341]
3988	      (currently proposed standard) says "The sender will use the ECN
3989	      Nonce for data packets, and the receiver will echo those nonces in
3990	      its Ack Vectors."

3992	   o  The CCID profile for TCP-Friendly Rate Control (TFRC) [RFC4342]
3993	      recommends that "The sender [use] Loss Intervals options' ECN
3994	      Nonce Echoes (and possibly any Ack Vectors' ECN Nonce Echoes) to
3995	      probabilistically verify that the receiver is correctly reporting
3996	      all dropped or marked packets."

3998	   The primary function of the ECN nonce is to protect the integrity of
3999	   the information about congestion: ECN marks and packet drops.
4000	   However, when the nonce is used to protect the integrity of
4001	   information about packet drops, rather than ECN marks, a transport
4002	   layer nonce will always be sufficient (because a drop loses the
4003	   transport header as well as the ECN field in the network header),
4004	   which would avoid using scarce IP header codepoint space.  Similarly,
4005	   a transport layer nonce would protect against a receiver sending
4006	   early acknowledgements [Savage99].

4008	   If the ECN nonce reveals integrity problems with the information
4009	   about congestion, the sending transport can use that knowledge for
4010	   two functions:

4012	   o  to protect its own resources, by allocating them in proportion to
4013	      the rates that each network path can sustain, based on congestion
4014	      control,

4016	   o  and to protect congested routers in the network, by drastically
4017	      slowing down its connection to the destination with corrupt
4018	      congestion information.

4020	   If the sending transport chooses to act in the interests of congested
4021	   routers, it can reduce its rate if it detects that some malicious
4022	   party in the feedback loop may be suppressing ECN feedback.  But this
4023	   would only be useful to congested routers when /all/ senders using
4024	   them are trusted to act in the interest of the congested routers.

4026	   In the end, the only essential use of a network layer nonce is when
4027	   sending transports (e.g. large servers) want to allocate their /own/
4028	   resources in proportion to the rates that each network path can
4029	   sustain, based on congestion control.  In that case, the nonce allows
4030	   senders to be assured that they aren't being duped into giving more
4031	   of their own resources to a particular flow.  And if congestion
4032	   suppression is detected, the sending transport can rate limit the
4033	   offending connection to protect its own resources.  Certainly, this
4034	   is a useful function, but the IETF should carefully decide whether
4035	   such a single, very specific case warrants IP header space.

4037	   In contrast, re-ECN allows all routers to fully protect themselves
4038	   from such attacks, without having to trust anyone - senders,
4039	   receivers, neighbouring networks.  Re-ECN is therefore proposed in
4040	   preference to the ECN nonce on the basis that it addresses the
4041	   generic problem of accountability for congestion of a network's
4042	   resources at the IP layer.

4044	   Delaying the ECN nonce is justified because the applicability of the
4045	   ECN nonce seems too limited for it to consume a two-bit codepoint in
4046	   the IP header.  It therefore seems prudent to give time for an
4047	   alternative way to be found to do the one function the nonce is
4048	   essential for.

4050	   Moreover, while we have re-designed the re-ECN codepoints so that
4051	   they do not prevent the ECN nonce progressing, the same is not true
4052	   the other way round.  If the ECN nonce started to see some deployment
4053	   (perhaps because it was blessed with proposed standard status),
4054	   incremental deployment of re-ECN would effectively be impossible,
4055	   because re-ECN marking fractions at inter-domain borders would be
4056	   polluted by unknown levels of nonce traffic.

4058	   The authors are aware that re-ECN must prove it has the potential it
4059	   claims if it is to displace the nonce.  Therefore, every effort has
4060	   been made to complete a comprehensive specification of re-ECN so that
4061	   its potential can be assessed.  We therefore seek the opinion of the
4062	   Internet community on whether the re-ECN protocol is sufficiently
4063	   useful to warrant standards action.

4065	Authors' Addresses

4067	   Bob Briscoe
4068	   BT & UCL
4069	   B54/77, Adastral Park
4070	   Martlesham Heath
4071	   Ipswich  IP5 3RE
4072	   UK

4074	   Phone: +44 1473 645196
4075	   Email: bob.briscoe@bt.com
4076	   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

4078	   Arnaud Jacquet
4079	   BT
4080	   B54/70, Adastral Park
4081	   Martlesham Heath
4082	   Ipswich  IP5 3RE
4083	   UK

4085	   Phone: +44 1473 647284
4086	   Email: arnaud.jacquet@bt.com
4087	   URI:

4089	   Toby Moncaster
4090	   BT
4091	   B54/70, Adastral Park
4092	   Martlesham Heath
4093	   Ipswich  IP5 3RE
4094	   UK

4096	   Phone: +44 1473 648734
4097	   Email: toby.moncaster@bt.com

4099	   Alan Smith
4100	   BT
4101	   B54/76, Adastral Park
4102	   Martlesham Heath
4103	   Ipswich  IP5 3RE
4104	   UK

4106	   Phone: +44 1473 640404
4107	   Email: alan.p.smith@bt.com

4109	Full Copyright Statement

4111	   Copyright (C) The IETF Trust (2008).

4113	   This document is subject to the rights, licenses and restrictions
4114	   contained in BCP 78, and except as set forth therein, the authors
4115	   retain all their rights.

4117	   This document and the information contained herein are provided on an
4118	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
4119	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
4120	   THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
4121	   OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
4122	   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
4123	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

4125	Intellectual Property

4127	   The IETF takes no position regarding the validity or scope of any
4128	   Intellectual Property Rights or other rights that might be claimed to
4129	   pertain to the implementation or use of the technology described in
4130	   this document or the extent to which any license under such rights
4131	   might or might not be available; nor does it represent that it has
4132	   made any independent effort to identify any such rights.  Information
4133	   on the procedures with respect to rights in RFC documents can be
4134	   found in BCP 78 and BCP 79.

4136	   Copies of IPR disclosures made to the IETF Secretariat and any
4137	   assurances of licenses to be made available, or the result of an
4138	   attempt made to obtain a general license or permission for the use of
4139	   such proprietary rights by implementers or users of this
4140	   specification can be obtained from the IETF on-line IPR repository at
4141	   http://www.ietf.org/ipr.

4143	   The IETF invites any interested party to bring to its attention any
4144	   copyrights, patents or patent applications, or other proprietary
4145	   rights that may cover technology that may be required to implement
4146	   this standard.  Please address the information to the IETF at
4147	   ietf-ipr@ietf.org.

4149	Acknowledgments

4151	   Funding for the RFC Editor function is provided by the IETF
4152	   Administrative Support Activity (IASA).  This document was produced
4153	   using xml2rfc v1.32 (of http://xml.resource.org/) from a source in
4154	   RFC-2629 XML format.