idnits 2.17.1 

draft-briscoe-tsvwg-re-ecn-tcp-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (October 25, 2010) is 4932 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681)

  ** Obsolete normative reference: RFC 4305 (Obsoleted by RFC 4835)

  ** Obsolete normative reference: RFC 4960 (Obsoleted by RFC 9260)

  ** Downref: Normative reference to an Experimental RFC: RFC 5562

  -- Obsolete informational reference (is this intentional?): RFC 2309
     (Obsoleted by RFC 7567)

  -- Obsolete informational reference (is this intentional?): RFC 2988
     (Obsoleted by RFC 6298)


     Summary: 4 errors (**), 0 flaws (~~), 1 warning (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Transport Area Working Group                             B. Briscoe, Ed.
3	Internet-Draft                                                A. Jacquet
4	Intended status: Standards Track                                      BT
5	Expires: April 28, 2011                                     T. Moncaster
6	                                                           Moncaster.com
7	                                                                A. Smith
8	                                                                      BT
9	                                                        October 25, 2010

11	     Re-ECN: Adding Accountability for Causing Congestion to TCP/IP
12	                   draft-briscoe-tsvwg-re-ecn-tcp-09

14	Abstract

16	   This document introduces a new protocol for explicit congestion
17	   notification (ECN), termed re-ECN, which can be deployed
18	   incrementally around unmodified routers.  The protocol works by
19	   arranging an extended ECN field in each packet so that, as it crosses
20	   any interface in an internetwork, it will carry a truthful prediction
21	   of congestion on the remainder of its path.  The purpose of this
22	   document is to specify the re-ECN protocol at the IP layer and to
23	   give guidelines on any consequent changes required to transport
24	   protocols.  It includes the changes required to TCP both as an
25	   example and as a specification.  It briefly gives examples of
26	   mechanisms that can use the protocol to ensure data sources respond
27	   correctly to congestion, but these are described more fully in a
28	   companion document.

30	Status of This Memo

32	   This Internet-Draft is submitted in full conformance with the
33	   provisions of BCP 78 and BCP 79.

35	   Internet-Drafts are working documents of the Internet Engineering
36	   Task Force (IETF).  Note that other groups may also distribute
37	   working documents as Internet-Drafts.  The list of current Internet-
38	   Drafts is at http://datatracker.ietf.org/drafts/current/.

40	   Internet-Drafts are draft documents valid for a maximum of six months
41	   and may be updated, replaced, or obsoleted by other documents at any
42	   time.  It is inappropriate to use Internet-Drafts as reference
43	   material or to cite them other than as "work in progress."

45	   This Internet-Draft will expire on April 28, 2011.

47	Copyright Notice
48	   Copyright (c) 2010 IETF Trust and the persons identified as the
49	   document authors.  All rights reserved.

51	   This document is subject to BCP 78 and the IETF Trust's Legal
52	   Provisions Relating to IETF Documents
53	   (http://trustee.ietf.org/license-info) in effect on the date of
54	   publication of this document.  Please review these documents
55	   carefully, as they describe your rights and restrictions with respect
56	   to this document.  Code Components extracted from this document must
57	   include Simplified BSD License text as described in Section 4.e of
58	   the Trust Legal Provisions and are provided without warranty as
59	   described in the Simplified BSD License.

61	Table of Contents

63	   1.  Introduction
64	   2.  Requirements notation
65	   3.  Terminology
66	   4.  Protocol Overview
67	     4.1.  Simplified Re-ECN Protocol
68	       4.1.1.  Congestion Control and Policing the Protocol
69	       4.1.2.  Background and Applicability
70	     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or
71	           v6)
72	     4.3.  Re-ECN Protocol Operation
73	     4.4.  Positive and Negative Flows
74	   5.  Network Layer
75	     5.1.  Re-ECN IPv4 Wire Protocol
76	     5.2.  Re-ECN IPv6 Wire Protocol
77	     5.3.  Router Forwarding Behaviour
78	     5.4.  Justification for Setting the First SYN to FNE
79	     5.5.  Control and Management
80	       5.5.1.  Negative Balance Warning
81	       5.5.2.  Rate Response Control
82	     5.6.  IP in IP Tunnels
83	     5.7.  Non-Issues
84	   6.  Transport Layers
85	     6.1.  TCP
86	       6.1.1.  RECN mode: Full Re-ECN capable transport
87	       6.1.2.  RECN-Co mode: Re-ECT Sender with a RFC3168
88	               compliant ECN Receiver
89	       6.1.3.  Capability Negotiation
90	       6.1.4.  Extended ECN (EECN) Field Settings during Flow
91	               Start or after Idle Periods
92	       6.1.5.  Pure ACKS, Retransmissions, Window Probes and
93	               Partial ACKs
94	     6.2.  Other Transports
95	       6.2.1.  General Guidelines for Adding Re-ECN to Other
96	               Transports
97	       6.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS
98	       6.2.3.  Guidelines for adding Re-ECN to DCCP
99	       6.2.4.  Guidelines for adding Re-ECN to SCTP
100	   7.  Incremental Deployment
101	   8.  Related Work
102	     8.1.  Congestion Notification Integrity
103	   9.  Security Considerations
104	   10. IANA Considerations
105	   11. Conclusions
106	   12. Acknowledgements
107	   13. Comments Solicited
108	   14. References
109	     14.1. Normative References
110	     14.2. Informative References
111	   Appendix A.  Precise Re-ECN Protocol Operation
112	   Appendix B.  Justification for Two Codepoints Signifying Zero
113	                Worth Packets
114	   Appendix C.  ECN Compatibility
115	   Appendix D.  Packet Marking with FNE During Flow Start
116	   Appendix E.  Argument for holding back the ECN nonce
117	   Appendix F.  Alternative Terminology Used in Other Documents

119	Authors' Statement: Status (to be removed by the RFC Editor)

121	   Although the re-ECN protocol is intended to make a simple but far-
122	   reaching change to the Internet architecture, the most immediate
123	   priority for the authors is to delay any move of the ECN nonce to
124	   Proposed Standard status.  The argument for this position is
125	   developed in Appendix E.

127	Changes from previous drafts (to be removed by the RFC Editor)

129	   Full diffs from all previous versions (created using the rfcdiff
130	   tool) are available at <http://www.bobbriscoe.net/pubs.html#retcp>

132	   From -08 to -09 (current version):

134	      Re-issued to keep alive for reference by ConEx working group.

136	      Hardly any changes to content, even where it is out of date,
137	      except references updated.

139	   From -07 to -08:

141	      Minor changes and consistency checks.

143	      References updated.

145	   From -06 to -07:

147	      Major changes made following splitting this protocol document from
148	      the related motivations document [I-D.tsvwg-re-ecn-motivation].

150	      Significant re-ordering of remaining text.

152	      New terminology introduced for clarity.

154	      Minor editorial changes throughout.

156	1.  Introduction

158	   This document provides a complete specification for the addition of
159	   the re-ECN protocol to IP and guidelines on how to add it to
160	   transport layer protocols, including a complete specification of re-
161	   ECN in TCP as an example.  The motivation behind this proposal is
162	   given in [I-D.tsvwg-re-ecn-motivation], but we include a brief
163	   summary here.

165	   Re-ECN is intended to allow senders to inform the network of the
166	   level of congestion they expect their flows to see.  This information
167	   is currently only visible at the transport layer.  ECN [RFC3168]
168	   reveals the upstream congestion state of any path by monitoring the
169	   rate of CE marks.  The receiver then informs the sender when they
170	   have seen a marked packet.  Re-ECN builds on ECN by providing new
171	   codepoints that allow the sender to declare the level of congestion
172	   they expect on the forward path.  It is closely related to ECN and
173	   indeed we define a compatibility mode to allow a re-ECN sender to
174	   communicate with an ECN receiver [xref].

176	   If a sender understates expected congestion compared to actual
177	   congestion then the network could discard packets or enact some other
178	   sanction.  A policer can also be introduced at the ingress of
179	   networks that can limit the level of congestion being caused.

181	   A general statement of the problem solved by re-ECN is to provide
182	   sufficient information in each IP datagram to be able to hold senders
183	   and whole networks accountable for the congestion they cause
184	   downstream, before they cause it.  But the every-day problems that
185	   re-ECN can solve are much more recognisable than this rather generic
186	   statement: mitigating distributed denial of service (DDoS);
187	   simplifying differentiation of quality of service (QoS); policing
188	   compliance to congestion control; and so on.

190	   It is important to add a few key points.

192	   o  In any standard network it always takes one round trip before any
193	      feedback is received.  For this reason a sender must make a
194	      conservative prediction by transmitting IP packets with a special
195	      Cautious marking when it is unsure of the state of the network.

197	   o  It should be noted that the prediction is carried in-band in
198	      normal data packets and for many transports feedback can be
199	      carried in the normal acknowledgements or control packets.

201	   o  The re-ECN protocol is independent of the transport.  In TCP,
202	      acknowledgments are used to convey the feedback from receiver to
203	      sender.  This memo concentrates on TCP as an example transport
204	      protocol; however, the re-ECN protocol is compatible with any
205	      transport where feedback can be sent from receiver to sender.

207	   This document is structured as follows.  First an overview of the re-
208	   ECN protocol is given (Section 4), outlining its attributes and
209	   explaining conceptually how it works as a whole.  The two main parts
210	   of the document follow.  That is, the protocol specification divided
211	   into network (Section 5) and transport (Section 6) layers.
212	   Deployment issues discussed throughout the document are brought
213	   together in Section 7.  Related work is discussed in Section 8.

215	2.  Requirements notation

217	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
218	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
219	   document are to be interpreted as described in [RFC2119].

221	3.  Terminology

223	   The following terminology is used throughout this memo.  Some of this
224	   terminology is new and, to avoid confusion, Appendix F sets out all
225	   the alternative terminology that has been used in other re-ECN
226	   related documents.

228	   o  Neutral packet - a packet that is able to be congestion marked by
229	      an ECN or re-ECN queue.

231	   o  Negative packet - a Neutral packet that has been congestion marked
232	      by an ECN or re-ECN queue.

234	   o  Positive packet - a packet that has been marked by the sender to
235	      indicate the expected level of congestion along its path.  In
236	      general Positive packets should only be sent in response to
237	      feedback received from the receiver.*

239	   o  Cancelled packet - a Positive Packet that has been congestion
240	      marked by an ECN or re-ECN queue.

242	   o  Cautious packet - a packet that has been marked by the sender to
243	      indicate the expected level of congestion along its path.  In
244	      general Cautious packets should be used when there is insufficient
245	      feedback to be confident about the congestion state of the
246	      network.*

248	      * the difference between Positive and Cautious packets is
249	      explained in detail later in the document along with guidelines on
250	      the use of Cautious packets.

252	   All the above terms have related IP codepoints as defined in
253	   Section 5.

255	4.  Protocol Overview

257	4.1.  Simplified Re-ECN Protocol

259	   We describe here the simplified re-ECN protocol.  To simplify the
260	   description we assume packets and segments are synonymous.

262	   Packets are sent from a sender to a receiver.  In Figure 1 the queues
263	   (Q1 and Q2) are ECN enabled as per RFC 3168 [RFC3168].  If congestion
264	   occurs then packets are marked with the congestion experienced (CE)
265	   flag exactly as in the ECN protocol [RFC3168]; the routers do not
266	   need to be modified and do not need to know the re-ECN protocol.  The
267	   receiver constantly informs the sender of the current count of
268	   Negative packets it has seen.  The sender uses this information to
269	   determine how many Positive packets it must send into the network.
270	   The sender's aim is to balance the number of bytes that have been
271	   congestion marked with the number of Positive bytes it has sent.

273	          +--------- Feedback----------+
274	          |                            |
275	          v                            |
276	        +---+    +----+    +----+    +---+
277	        |   |    |    |    |    |    |   |
278	        | S |--->| Q1 |--->| Q2 |--->| R |
279	        |   |    |    |    |    |    |   |
280	        +---+    +----+    +----+    +---+

282	                          Figure 1: Simple Re-ECN

284	4.1.1.  Congestion Control and Policing the Protocol

286	   The arrangement of the protocol ensures that packets carry a
287	   declaration of the amount of congestion that will be experienced on
288	   the path.  The re-ECN protocol is orthogonal to any congestion
289	   control algorithms, but can be used to ensure that congestion control
290	   is being applied by the sender.

292	   In general we assume that there will be a policer at the network
293	   ingress which can rate limit traffic based on the amount of
294	   congestion declared.

296	   At the network egress there is a dropper which can impose sanctions on
297	   flows that incorrectly declare congestion.

299	   Policers and droppers are explained in more detail in
300	   [I-D.tsvwg-re-ecn-motivation].

302	4.1.2.  Background and Applicability

304	   The re-ECN protocol makes no changes to, and has no effect on, the
305	   TCP congestion control algorithm or other rate responses to
306	   congestion.  Re-ECN is not a new congestion control protocol, rather
307	   it is orthogonal to congestion control itself.  Re-ECN is concerned
308	   with revealing information about congestion so that users and
309	   networks can be held accountable for the congestion they cause, or
310	   allow to be caused.

312	   Re-ECN builds on ECN so we briefly recap the essentials of the ECN
313	   protocol [RFC3168].  Two bits in the IP protocol (v4 or v6) are
314	   assigned to the ECN field.  The sender clears the field to "00" (Not-
315	   ECT) if either end-point transport is not ECN-capable.  Otherwise it
316	   indicates an ECN-capable transport (ECT) using either of the two
317	   code-points "10" or "01" (ECT(0) and ECT(1) resp.).

319	   ECN-capable queues probabilistically set this field to "11" if
320	   congestion is experienced (CE).  In general this marking probability
321	   will increase with the length of the queue at its egress link
322	   (typically using the RED algorithm [RFC2309]).  However, they still
323	   drop rather than mark Not-ECT packets.  With multiple ECN-capable
324	   queues on a path, a flow of packets accumulates the fraction of CE
325	   marking that each queue adds.  The combined effect of the packet
326	   marking of all the queues along the path signals congestion of the
327	   whole path to the receiver.  So, for example, if one queue early in a
328	   path is marking 1% of packets and another later in a path is marking
329	   2%, flows that pass through both queues will experience approximately
330	   3% marking (see Appendix A for a precise treatment).
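
   As a worked illustration of the 1% and 2% example above, the
   following Python sketch compares the simple sum of the per-queue
   marking fractions with the combined fraction.  It assumes that each
   queue marks packets independently of the others, which is an
   assumption of this sketch only; the function name combined_marking
   is hypothetical.

```python
def combined_marking(fractions):
    """Fraction of packets CE-marked by at least one queue on the path.

    Assumes (for this sketch only) that every queue marks independently.
    """
    unmarked = 1.0
    for p in fractions:
        unmarked *= 1.0 - p  # packet passes this queue without a mark
    return 1.0 - unmarked


path = [0.01, 0.02]  # Q1 marks 1%, Q2 marks 2%
print(f"sum of fractions: {sum(path):.4f}")
print(f"combined marking: {combined_marking(path):.4f}")
```

   Under that assumption the combined fraction is 1 - (0.99 x 0.98) =
   2.98%, which is why the text speaks of approximately 3%.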

332	   The choice of two ECT code-points in the ECN field [RFC3168]
333	   permitted future flexibility, optionally allowing the sender to
334	   encode the experimental ECN nonce [RFC3540] in the packet stream.
335	   The nonce is designed to allow a sender to check the integrity of
336	   congestion feedback.  But Section 8.1 explains that it still gives no
337	   control over how fast the sender transmits as a result of the
338	   feedback.  On the other hand, re-ECN is designed both to ensure that
339	   congestion is declared honestly and that the sender's rate responds
340	   appropriately.

342	   Re-ECN is based on a feedback arrangement called `re-
343	   feedback' [Re-fb].  The word is short for either receiver-aligned,
344	   re-inserted or re-echoed feedback.  But it actually works even when
345	   no feedback is available.  In fact it has been carefully designed to
346	   work for single datagram flows.  It also encourages aggregation of
347	   single packet flows by congestion control proxies.  Then, even if the
348	   traffic mix of the Internet were to become dominated by short
349	   messages, it would still be possible to control congestion
350	   effectively and efficiently.

352	   Changing the Internet's feedback architecture seems to imply
353	   considerable upheaval.  But re-ECN can be deployed incrementally at
354	   the transport layer around unmodified queues using existing fields in
355	   IP (v4 or v6).  However it does also require the last undefined bit
356	   in the IPv4 header, which it uses in combination with the 2-bit ECN
357	   field to create four new codepoints.  Nonetheless, we RECOMMEND
358	   adding optional preferential drop to IP queues based on the re-ECN
359	   fields in order to improve resilience against DoS attacks.
360	   Similarly, re-ECN works best if both the sender and receiver
361	   transports are re-ECN-capable, but it can work with just sender
362	   support (Section 6.1.2).

364	4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

366	   The re-ECN wire protocol uses the two bit ECN field broadly as in
367	   RFC3168 [RFC3168] as described above, but with five differences of
368	   detail (brought together in a list in Section 7).  This specification
369	   defines a new re-ECN extension (RE) flag.  We will defer the
370	   definition of the actual position of the RE flag in the IPv4 & v6
371	   headers until Section 5.  When we don't need to choose between IPv4
372	   and v6 wire protocols it will suffice to call it the RE flag.

374	   Unlike the ECN field, the RE flag is intended to be set by the sender
375	   and SHOULD remain unchanged along the path, although it can be read
376	   by network elements that understand the re-ECN protocol.  It is
377	   feasible that a network element MAY change the setting of the RE
378	   flag, perhaps acting as a proxy for an end-point, but such a protocol
379	   would have to be defined in another specification
380	   (e.g. [I-D.re-pcn-border-cheat]).

382	   Although the RE flag is a separate, single bit field, it can be read
383	   as an extension to the two-bit ECN field; the three concatenated bits
384	   in what we will call the extended ECN field (EECN) giving eight
385	   codepoints.  We will use the RFC3168 names of the ECN codepoints to
386	   describe settings of the ECN field when the RE flag setting is "don't
387	   care", but we also define the following six extended ECN codepoint
388	   names for when we need to be more specific.

390	   One of re-ECN's codepoints is an alternative use of the codepoint set
391	   aside in RFC3168 for the ECN nonce (ECT(1)).  Transports using re-ECN
392	   do not need to use the ECN nonce as long as the sender is also
393	   checking for transport protocol compliance [tcp-rcv-cheat].  The case
394	   for doing this is given in Appendix E.  Two re-ECN codepoints are
395	   given compatible uses to those defined in RFC3168 (Not-ECT and CE).
396	   The other codepoint used by RFC3168 (ECT(0)) isn't used for re-ECN.
397	   Altogether this leaves one codepoint of the eight unused by ECN or re-
398	   ECN and available for future use.

400	   +--------+-------------+-------+-----------+------------------------+
401	   |   ECN  |   RFC3168   |   RE  |    EECN   |     re-ECN meaning     |
402	   |  field |  codepoint  |  flag | codepoint |                        |
403	   +--------+-------------+-------+-----------+------------------------+
404	   |   00   |   Not-ECT   |   0   |  Not-ECT  |   Not re-ECN-capable   |
405	   |        |             |       |           |   transport (Legacy)   |
406	   |   00   |     ---     |   1   |    FNE    |      Feedback not      |
407	   |        |             |       |           | established (Cautious) |
408	   |   01   |    ECT(1)   |   0   |  Re-Echo  |  Re-echoed congestion  |
409	   |        |             |       |           |   and RECT (Positive)  |
410	   |   01   |     ---     |   1   |    RECT   |     Re-ECN capable     |
411	   |        |             |       |           |   transport (Neutral)  |
412	   |   10   |    ECT(0)   |   0   |   ECT(0)  |  RFC3168 ECN use only  |
413	   |        |             |       |           |                        |
414	   |   10   |     ---     |   1   |   --CU--  |    Currently unused    |
415	   |        |             |       |           |                        |
416	   |   11   |      CE     |   0   |   CE(0)   |  Re-Echo cancelled by  |
417	   |        |             |       |           |     CE (Cancelled)     |
418	   |   11   |     ---     |   1   |   CE(-1)  | Congestion Experienced |
419	   |        |             |       |           |       (Negative)       |
420	   +--------+-------------+-------+-----------+------------------------+

422	                     Table 1: Extended ECN Codepoints
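
   Table 1 can be read as a lookup from the pair (ECN field, RE flag)
   to an extended ECN codepoint.  The Python sketch below transcribes
   the table and shows what happens to the extended codepoint when an
   unmodified RFC 3168 queue sets the ECN field to CE.  The names EECN,
   eecn_name and mark_ce are hypothetical, and the treatment of packets
   whose ECN field is "00" simply follows the RFC 3168 rule that such
   packets are dropped rather than marked.

```python
from typing import Optional, Tuple

# (ECN field, RE flag) -> extended ECN codepoint, transcribed from Table 1.
EECN = {
    (0b00, 0): "Not-ECT",
    (0b00, 1): "FNE",
    (0b01, 0): "Re-Echo",
    (0b01, 1): "RECT",
    (0b10, 0): "ECT(0)",
    (0b10, 1): "--CU--",
    (0b11, 0): "CE(0)",
    (0b11, 1): "CE(-1)",
}


def eecn_name(ecn_field: int, re_flag: int) -> str:
    """Name of the extended ECN codepoint for the three concatenated bits."""
    return EECN[(ecn_field & 0b11, re_flag & 1)]


def mark_ce(ecn_field: int, re_flag: int) -> Optional[Tuple[int, int]]:
    """Model an unmodified RFC 3168 queue signalling congestion.

    The queue only sees the 2-bit ECN field and never touches the RE flag.
    Returns None when the ECN field is "00", because such a packet cannot
    be marked (the queue would have to drop it instead).
    """
    if ecn_field == 0b00:
        return None
    return (0b11, re_flag)


print(eecn_name(*mark_ce(0b01, 1)))  # a Neutral (RECT) packet after marking
print(eecn_name(*mark_ce(0b01, 0)))  # a Positive (Re-Echo) packet after marking
```

   Because the queue leaves the RE flag alone, a marked Neutral packet
   becomes Negative (CE(-1)) and a marked Positive packet becomes
   Cancelled (CE(0)), without the queue knowing anything about re-ECN.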

424	4.3.  Re-ECN Protocol Operation

426	   In this section we will give an overview of the operation of the re-
427	   ECN protocol for TCP/IP, leaving a detailed specification to the
428	   following sections.  Other transports will be discussed later.

430	   In summary, the protocol adds a third `re-echo' stage to the existing
431	   TCP/IP ECN protocol.  Whenever the network adds CE congestion
432	   signalling to the IP header on the forward data path, the receiver
433	   feeds it back to the sender using TCP, then the sender re-echoes it
434	   into the forward data path using the RE flag in the next packet.

436	   Prior to receiving any feedback a sender will not know which setting
437	   of the RE flag to use, so it sends Cautious packets by setting the
438	   FNE codepoint.  The network reads the FNE codepoint conservatively as
439	   equivalent to re-echoed congestion.

441	   Specifically, once feedback from an ECN or re-ECN capable flow is
442	   established, a re-ECN sender always initialises the ECN field to
443	   ECT(1).  And it usually sets the RE flag to "1" indicating a Neutral
444	   packet.  Whenever a queue marks a packet to CE, the receiver feeds
445	   back this event to the sender.  On receiving this feedback, the re-
446	   ECN sender will clear the RE flag to "0" in the next packet it sends
447	   (indicating a Positive packet).

449	   We chose to set and clear the RE flag this way round to ease
450	   incremental deployment (see Section 7).  To avoid confusion we will
451	   use the term `blanking' (rather than marking) when the RE flag is
452	   cleared to "0".  So, over a stream of packets, we will talk of the
453	   `RE blanking fraction' as the fraction of octets in packets with the
454	   RE flag cleared to "0".
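
   The sender behaviour outlined in this section can be sketched as a
   small state machine.  This is a minimal illustration in Python, not
   the TCP specification given later: the class ReEcnSender, its method
   names and the octet counter are all hypothetical, and it assumes the
   receiver's feedback can be expressed as a number of newly
   congestion-marked octets.

```python
class ReEcnSender:
    """Hypothetical sketch of how a re-ECN sender picks a codepoint."""

    def __init__(self):
        self.feedback_established = False
        # Congestion-marked octets reported by the receiver that have not
        # yet been re-echoed. May go negative if a large packet re-echoes
        # more octets than were owed (an assumption of this sketch).
        self.owed_octets = 0

    def on_feedback(self, marked_octets: int) -> None:
        """Process feedback from the receiver (0 means no new CE marks)."""
        self.feedback_established = True
        self.owed_octets += marked_octets

    def next_codepoint(self, packet_octets: int) -> str:
        """Extended ECN codepoint for the next packet to be sent."""
        if not self.feedback_established:
            return "FNE"        # Cautious: ECN field 00, RE flag 1
        if self.owed_octets > 0:
            self.owed_octets -= packet_octets
            return "Re-Echo"    # Positive: ECT(1) with RE blanked to 0
        return "RECT"           # Neutral: ECT(1) with RE flag 1


sender = ReEcnSender()
print(sender.next_codepoint(1500))   # before any feedback
sender.on_feedback(0)
print(sender.next_codepoint(1500))   # feedback established, nothing owed
sender.on_feedback(1500)
print(sender.next_codepoint(1500))   # one marked packet to re-echo
print(sender.next_codepoint(1500))   # balance restored
```

   Counting in octets rather than packets is a design choice of the
   sketch that anticipates the octet-based balance described in
   Section 4.4.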

456	       +---+  +----+                +----+  +---+
457	       | S |--| Q1 |----------------| Q2 |--| R |
458	       +---+  +----+                +----+  +---+
459	         .      .                      .      .
460	       ^ .      .                      .      .
461	       | .      .                      .      .
462	       | .     RE blanking fraction    .      .
463	    3% |-------------------------------+=======
464	       | .      .                      |      .
465	    2% | .      .                      |      .
466	       | .      .  CE marking fraction |      .
467	    1% | .      +----------------------+      .
468	       | .      |                      .      .
469	    0% +--------------------------------------->
470	         ^          ^                      ^
471	         L          M                      N    Observation points

473	                  Figure 2: A 2-Queue Example (Imprecise)

475	   Figure 2 uses a simple network to illustrate how re-ECN allows queues
476	   to measure downstream congestion.  The receiver views a CE marking
477	   fraction of 3% which is fed back to the sender.  The sender sets an
478	   RE blanking fraction of 3% to match this.  This RE blanking fraction
479	   can be observed along the path as the RE flag is not changed by
480	   network nodes once set by the sender.  This is shown by the
481	   horizontal line at 3% in the figure.  The CE marked fraction is shown
482	   by the stepped line which rises to meet the RE blanking fraction line
483	   with steps at each queue where packets are marked.  Two queues are
484	   shown (Q1 and Q2) that are currently congested.  Each time packets
485	   pass through a queue, a fraction is marked (1% at Q1, 2% at Q2).  The
486	   approximate downstream congestion can be measured at the observation
487	   points shown along the path by subtracting the CE marking fraction
488	   from the RE blanking fraction, giving 3% at L, 2% at M and 0% at N
489	   (Appendix A derives these approximations from a precise analysis).
490	   NB due to the unary nature of ECN marking and the equivalent unary
491	   nature of re-ECN blanking, the precise fraction of marked bytes must
492	   be calculated by maintaining a moving average of the number of
493	   packets that have been marked as a proportion of the total number of
494	   packets.

496	   Along the path the fraction of packets that had their RE field
497	   cleared remains unchanged so it can be used as a reference against
498	   which to compare upstream congestion.  The difference predicts
499	   downstream congestion for the rest of the path.  Therefore, measuring
500	   the fractions of each codepoint at any point in the Internet will
501	   reveal upstream, downstream and whole path congestion.
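
   To illustrate how an observer part-way along the path could derive
   these three quantities, the Python sketch below counts octets by
   codepoint over a batch of packets.  It is a simplification under
   stated assumptions: the function path_congestion and its input
   format are hypothetical, only the four codepoints used by flows with
   established feedback are counted (FNE and non-re-ECN packets are
   skipped), and a plain ratio over one batch stands in for the moving
   average that the text calls for.

```python
# Codepoints counted by this sketch, as (ECN field, RE flag):
# RECT, Re-Echo, CE(-1) and CE(0).
RE_ECN_CODEPOINTS = {(0b01, 1), (0b01, 0), (0b11, 1), (0b11, 0)}


def path_congestion(packets):
    """Estimate (whole_path, upstream, downstream) congestion fractions.

    packets: iterable of (ecn_field, re_flag, octets) seen at one point.
    """
    total = blanked = marked = 0
    for ecn_field, re_flag, octets in packets:
        if (ecn_field, re_flag) not in RE_ECN_CODEPOINTS:
            continue                  # skipped in this simplified sketch
        total += octets
        if re_flag == 0:
            blanked += octets         # RE blanked by the sender
        if ecn_field == 0b11:
            marked += octets          # CE marked by an upstream queue
    if total == 0:
        return 0.0, 0.0, 0.0
    whole_path = blanked / total      # RE blanking fraction
    upstream = marked / total         # CE marking fraction so far
    return whole_path, upstream, whole_path - upstream


# Traffic resembling observation point M in Figure 2: 3% of octets
# RE-blanked by the sender, 1% CE-marked by Q1.
seen_at_m = [(0b01, 1, 1000)] * 96 + [(0b01, 0, 1000)] * 3 + [(0b11, 1, 1000)]
whole, up, down = path_congestion(seen_at_m)
print(f"whole path {whole:.0%}, upstream {up:.0%}, downstream {down:.0%}")
```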

503	   Note that we have introduced discussion of marking and blanking
504	   fractions solely for illustration.  We are not saying any protocol
505	   handler will work with these average fractions directly.  In fact the
506	   protocol actually requires the number of marked and blanked bytes to
507	   balance by the time the packet reaches the receiver.

509	4.4.  Positive and Negative Flows

511	   In Section 3 we introduced the terms Positive, Neutral, Negative,
512	   Cautious and Cancelled.  This terminology is based on the requirement
513	   to balance the proportion of bytes marked as CE with the proportion
514	   of bytes that are re-echo marked.  In the rest of this memo we will
515	   loosely talk of positive or negative flows, meaning flows where the
516	   moving average of the downstream congestion metric is persistently
517	   positive or negative.  A negative flow is one where more CE marked
518	   packets than re-ECN blanked packets arrive.  Likewise in positive
519	   flows more re-ECN blanked packets arrive than CE marked packets.  The
520	   notion of a negative metric arises because it is derived by
521	   subtracting one metric from another.  Of course actual downstream
522	   congestion cannot be negative, only the metric can (whether due to
523	   time lags or deliberate malice).

525	   Therefore we will talk of packets having `worth' of +1, 0 or -1,
526	   which, when multiplied by their size, indicates their contribution to
527	   the downstream congestion metric.  The worth of each type of packet
528	   is given below in Table 2.  The idea is that most flows start with
529	   zero worth.  Every time the network decrements the worth of a packet,
530	   the sender increments the worth of a later packet.  Then, over time,
531	   as many positive octets should arrive at the receiver as negative.
532	   Note we have said octets not packets, so if packets are of different
533	   sizes, the worth should be incremented on enough octets to balance
534	   the octets in negative packets arriving at the receiver.  It is this
535	   balance that will allow the network to hold the sender accountable
536	   for the congestion it causes.

538	   If a packet carrying re-echoed congestion happens to also be
539	   congestion marked, the +1 worth added by the sender will be cancelled
540	   out by the -1 network congestion marking.  Although the two worth
541	   values correctly cancel out, neither the congestion marking nor the
542	   re-echoed congestion are lost, because the RE bit and the ECN field
543	   are orthogonal.  So, whenever this happens, the receiver will
544	   correctly detect and re-echo the new congestion event as well.

546	   The table below specifies unambiguously the worth of each extended
547	   ECN codepoint.  Note the order is different from the previous table
548	   to better show how the worth increments and decrements.

550	   +---------+-------+---------------+-------+-------------------------+
551	   |   ECN   |   RE  | Extended ECN  | Worth |       Re-ECN Term       |
552	   |  field  |  bit  | codepoint     |       |                         |
553	   +---------+-------+---------------+-------+-------------------------+
554	   |    00   |   0   | Not-RECT      | ...   |           ---           |
555	   |    00   |   1   | FNE           | +1    |         Cautious        |
556	   |    01   |   0   | Re-Echo       | +1    |         Positive        |
557	   |    10   |   0   | Legacy        | ...   |   RFC3168 ECN use only  |
558	   |         |       |               |       |                         |
559	   |    11   |   0   | CE(0)         |  0    |        Cancelled        |
560	   |    01   |   1   | RECT          |  0    |         Neutral         |
561	   |    10   |   1   | --CU--        | ...   |     Currently unused    |
562	   |         |       |               |       |                         |
563	   |    11   |   1   | CE(-1)        | -1    |         Negative        |
564	   +---------+-------+---------------+-------+-------------------------+

566	                Table 2: 'Worth' of Extended ECN Codepoints
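
   The octet-based balance described in this section can be sketched as
   follows.  This is a minimal illustration only: the dictionary simply
   transcribes the worth column of Table 2, and the names WORTH and
   flow_balance are hypothetical.

```python
# Worth of each extended ECN codepoint, keyed by (ECN field, RE bit).
# None means Table 2 gives no worth ("...") for that codepoint.
WORTH = {
    (0b00, 0): None,  # Not-RECT
    (0b00, 1): +1,    # FNE
    (0b01, 0): +1,    # Re-Echo
    (0b10, 0): None,  # Legacy, RFC3168 ECN use only
    (0b11, 0): 0,     # CE(0): re-echo cancelled by a congestion mark
    (0b01, 1): 0,     # RECT
    (0b10, 1): None,  # currently unused
    (0b11, 1): -1,    # CE(-1)
}


def flow_balance(packets):
    """Sum worth * size over (ecn_field, re_bit, size_in_octets) tuples."""
    total = 0
    for ecn_field, re_bit, size in packets:
        worth = WORTH[(ecn_field, re_bit)]
        if worth is not None:
            total += worth * size
    return total


packets = [
    (0b11, 1, 1500),  # CE(-1): 1500 octets marked by the network
    (0b01, 1, 1500),  # RECT: neutral
    (0b01, 0, 1000),  # Re-Echo
    (0b01, 0, 500),   # Re-Echo
]
balance = flow_balance(packets)
```

   In this made-up trace one 1500 octet negative packet is balanced by
   two smaller positive packets, which is why the balance is counted in
   octets rather than packets.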

568	5.  Network Layer

570	5.1.  Re-ECN IPv4 Wire Protocol

572	   The wire protocol of the ECN field in the IP header remains largely
573	   unchanged from [RFC3168].  However, an extension to the ECN field we
574	   call the RE (Re-ECN extension) flag (Section 4.2) is defined in this
575	   document.  It doubles the extended ECN codepoint space, giving 8
576	   potential codepoints.  The semantics of the extra codepoints are
577	   backward compatible with the semantics of the 4 original codepoints
578	   [RFC3168] (Section 7 collects together and summarises all the changes
579	   defined in this document).

581	   For IPv4, this document proposes that the new RE control flag will be
582	   positioned where the `reserved' control flag was at bit 48 of the
583	   IPv4 header (counting from 0).  Alternatively, some would call this
584	   bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4
585	   header (Figure 3).

587	             0   1   2
588	           +---+---+---+
589	           | R | D | M |
590	           | E | F | F |
591	           +---+---+---+

593	   Figure 3: New Definition of the Re-ECN Extension (RE) Control Flag at
594	                  the Start of Byte 7 of the IPv4 Header

596	   The semantics of the RE flag are described in outline in Section 4
597	   and specified fully in Section 6.  The RE flag is always considered
598	   in conjunction with the 2-bit ECN field, as if they were concatenated
599	   together to form a 3-bit extended ECN field.  If the ECN field is set
600	   to either the ECT(1) or CE codepoint, when the RE flag is blanked
601	   (cleared to "0") it represents a re-echo of congestion experienced by
602	   an earlier packet.  If the ECN field is set to the Not-ECT
603	   codepoint, when the RE flag is set to "1" it represents the feedback
604	   not established (FNE) codepoint, which signals that the packet was
605	   sent without the benefit of congestion feedback.
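
   As a sketch of how the 3-bit extended ECN field could be read from a
   raw IPv4 header, assuming the RE flag sits at bit 48 as proposed
   above and the ECN field occupies the two least significant bits of
   the second octet [RFC3168].  The function and table names are
   hypothetical; the codepoint names are those of Table 2.

```python
CODEPOINT_NAMES = {
    (0b00, 0): "Not-RECT",
    (0b00, 1): "FNE",
    (0b01, 0): "Re-Echo",
    (0b10, 0): "Legacy",
    (0b11, 0): "CE(0)",
    (0b01, 1): "RECT",
    (0b10, 1): "--CU--",
    (0b11, 1): "CE(-1)",
}


def extended_ecn(ipv4_header):
    """Return (ecn_field, re_flag, codepoint_name) from IPv4 header bytes."""
    ecn_field = ipv4_header[1] & 0b11    # low 2 bits of the second octet
    re_flag = (ipv4_header[6] >> 7) & 1  # bit 48 of the header, from 0
    return ecn_field, re_flag, CODEPOINT_NAMES[(ecn_field, re_flag)]


# A hand-built 20 octet header with ECN = 01 and RE = 1.
header = bytearray(20)
header[0] = 0x45   # version 4, header length 5 words
header[1] = 0b01   # ECN field
header[6] = 0x80   # RE flag set, DF and MF clear
codepoint = extended_ecn(bytes(header))
```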

607	   It is believed that the FNE codepoint can simultaneously serve other
608	   purposes, particularly where the start of a flow needs distinguishing
609	   from packets later in the flow.  For instance it would have been
610	   useful to identify new flows for tag switching and might enable
611	   similar developments in the future if it were adopted.  It is similar
612	   to the state set-up bit idea designed to protect against memory
613	   exhaustion attacks.  This idea was proposed informally by David Clark
614	   and documented by Handley and Greenhalgh  [Steps_DoS].  The FNE
615	   codepoint can be thought of as a `soft-state set-up flag', because it
616	   is idempotent (i.e. one occurrence of the flag is sufficient but
617	   further occurrences achieve the same effect if previous ones were
618	   lost).

620	   There will probably be other claims pending on the use of bit 48.  We
621	   know of at least two [ARI05], [RFC3514], but neither has been pursued
622	   in the IETF so far, although the present proposal would meet the
623	   needs of the latter.

625	   The security flag proposal (commonly known as the evil bit) was
626	   published on 1 April 2003 as Informational RFC 3514, but it was not
627	   adopted due to confusion over whether evil-doers might set it
628	   inappropriately.  The present proposal is backward compatible with
629	   RFC3514 because if re-ECN compliant senders were benign they would
630	   correctly clear the evil bit to honestly declare that they had just
631	   received congestion feedback.  Whereas evil-doers would hide
632	   congestion feedback by setting the evil bit continuously, or at least
633	   more often than they should.  So, evil senders can be identified,
634	   because they declare that they are good less often than they should.

636	5.2.  Re-ECN IPv6 Wire Protocol

638	   For IPv6, this document proposes that the new RE control flag will be
639	   positioned as the first bit of the option field of a new Congestion
640	   hop by hop option header (Figure 4).

642	        0                   1                   2                   3
643	        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
644	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
645	       |  Next Header  |  Hdr ext Len  |  Option Type  | Opt Length =4 |
646	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
647	       |R|                     Reserved for future use                 |
648	       |E|                                                             |
649	       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

651	      Figure 4: Definition of a New IPv6 Congestion Hop by Hop Option
652	         Header containing the re-ECN Extension (RE) Control Flag

654	               0 1 2 3 4 5 6 7 8
655	               +-+-+-+-+-+-+-+-+-
656	               |AIU|C|Option ID|
657	               +-+-+-+-+-+-+-+-+-

659	           Figure 5: Congestion Hop by Hop Option Type Encoding

661	   The Hop-by-Hop Options header enables packets to carry information to
662	   be examined and processed by routers or nodes along the packet's
663	   delivery path, including the source and destination nodes.  For re-
664	   ECN, the two bits of the Action If Unrecognized (AIU) flag of the
665	   Congestion extension header MUST be set to "00" meaning if
666	   unrecognized `skip over option and continue processing the header'.
667	   Then, any routers or a receiver not upgraded with the optional re-ECN
668	   features described in this memo will simply ignore this header.  But
669	   routers with these optional re-ECN features, or a re-ECN policing
670	   function, will process this Congestion extension header.

672	   The `C' flag MUST be set to "1" to specify that the Option Data
673	   (currently only the RE control flag) can change en-route to the
674	   packet's final destination.  This ensures that, when an
675	   Authentication header (AH [RFC4302]) is present in the packet, for
676	   any option whose data may change en-route, its entire Option Data
677	   field will be treated as zero-valued octets when computing or
678	   verifying the packet's authenticating value.

680	   Although the RE control flag should not be changed along the path, we
681	   expect that the rest of this option field that is currently `Reserved
682	   for future use' could be used for a multi-bit congestion notification
683	   field which we would expect to change en route.  As the RE flag does
684	   not need end-to-end authentication, we set the C flag to '1'.

686	   {ToDo: A Congestion Hop by Hop Option ID will need to be registered
687	   with IANA.}
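
   The layout in Figures 4 and 5 can be sketched as below.  Because no
   Option ID has been registered, the value used here is an arbitrary
   placeholder and MUST NOT be read as an assignment.  The function
   names are hypothetical, and we assume the usual IPv6 convention
   that a Hdr Ext Len of 0 denotes an 8 octet extension header.

```python
import struct

PLACEHOLDER_OPTION_ID = 0b11110  # arbitrary: no value is registered


def congestion_option_type(option_id=PLACEHOLDER_OPTION_ID):
    """Build the Option Type octet of Figure 5."""
    aiu = 0b00  # if unrecognized, skip over option and continue
    c = 1       # Option Data may change en route
    return (aiu << 6) | (c << 5) | (option_id & 0b11111)


def congestion_hbh_header(next_header, re_flag,
                          option_id=PLACEHOLDER_OPTION_ID):
    """Build the 8 octet hop by hop header of Figure 4."""
    option_data = (re_flag & 1) << 31  # RE is the first bit, rest reserved
    return struct.pack("!BBBBI", next_header, 0,
                       congestion_option_type(option_id), 4, option_data)


hbh = congestion_hbh_header(next_header=6, re_flag=1)
```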

689	5.3.  Router Forwarding Behaviour

691	   Re-ECN works well without modifying the forwarding behaviour of any
692	   routers.  However, below, two OPTIONAL changes to forwarding
693	   behaviour are defined which respectively enhance performance and
694	   improve a router's discrimination against flooding attacks.  They are
695	   both OPTIONAL additions that we propose MAY apply by default to all
696	   Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN
697	   marking behaviours [RFC3168].  Specifications for PHBs MAY define
698	   different forwarding behaviours from this default, but this is not
699	   required.  [I-D.re-pcn-border-cheat] is one example.

701	   FNE indicates ECT:

703	      The FNE codepoint tells a router to assume that the packet was
704	      sent by an ECN-capable transport (see Section 5.4).  Therefore an
705	      FNE packet MAY be marked rather than dropped.  Note that the FNE
706	      codepoint has been intentionally chosen so that, to RFC3168
707	      compliant routers (which do not inspect the RE flag) an FNE packet
708	      appears to be Not-ECT so it will be dropped by legacy AQM
709	      algorithms.

711	      A network operator MUST NOT configure a queue to ECN mark rather
712	      than drop FNE packets unless it can guarantee that FNE packets
713	      will be rate limited, either locally or upstream.  The ingress
714	      policers discussed in [I-D.tsvwg-re-ecn-motivation] would count as
715	      rate limiters for this purpose.

717	   Preferential Drop:  If a re-ECN capable router queue experiences very
718	      high load so that it has to drop arriving packets (e.g. a DoS
719	      attack), it MAY preferentially drop packets within the same
720	      Diffserv PHB using the preference order for extended ECN
721	      codepoints given in Table 3.  Preferential dropping can be
722	      difficult to implement on some hardware, but if feasible it would
723	      discriminate against attack traffic if done as part of the overall
724	      policing framework of [I-D.tsvwg-re-ecn-motivation].  If nowhere
725	      else, routers at the egress of a network SHOULD implement
726	      preferential drop (stronger than the MAY above).  For simplicity,
727	      preferences 4 & 5 MAY be merged into one preference level.

729	      The tabulated drop preferences are arranged to preserve packets
730	      with more positive worth (Section 4.4), given senders of positive
731	      packets must have honestly declared downstream congestion.  A full
732	      treatment of this is provided in the companion document describing
733	      the motivation and architecture for re-ECN
734	      [I-D.tsvwg-re-ecn-motivation] particularly when the application of
735	      re-ECN to protect against DDoS attacks is described.

737	   +-------+-----+------------+-------+------------+-------------------+
738	   |  ECN  |  RE | Extended   | Worth | Drop Pref  |   Re-ECN meaning  |
739	   | field | bit | ECN        |       | (1 = drop  |                   |
740	   |       |     | codepoint  |       | 1st)       |                   |
741	   +-------+-----+------------+-------+------------+-------------------+
742	   |   01  |  0  | Re-Echo    | +1    | 5/4        |     Re-echoed     |
743	   |       |     |            |       |            |   congestion and  |
744	   |       |     |            |       |            |        RECT       |
745	   |   00  |  1  | FNE        | +1    | 4          |    Feedback not   |
746	   |       |     |            |       |            |    established    |
747	   |   11  |  0  | CE(0)      | 0     | 3          |  Re-Echo canceled |
748	   |       |     |            |       |            |   by congestion   |
749	   |       |     |            |       |            |    experienced    |
750	   |   01  |  1  | RECT       | 0     | 3          |   Re-ECN capable  |
751	   |       |     |            |       |            |     transport     |
752	   |   11  |  1  | CE(-1)     | -1    | 3          |     Congestion    |
753	   |       |     |            |       |            |    experienced    |
754	   |   10  |  1  | --CU--     | n/a   | 2          |  Currently Unused |
755	   |   10  |  0  | Legacy     | n/a   | 2          |  RFC3168 ECN use  |
756	   |       |     |            |       |            |        only       |
757	   |   00  |  0  | Not-RECT   | n/a   | 1          |        Not        |
758	   |       |     |            |       |            |   Re-ECN-capable  |
759	   |       |     |            |       |            |     transport     |
760	   +-------+-----+------------+-------+------------+-------------------+

762	      Table 3: Drop Preference of EECN Codepoints (Sorted by `Worth')
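
   A queue could apply Table 3 along the following lines.  This is only
   a sketch: the text above does not say how to choose among packets
   with the same preference, so dropping the most recent arrival is our
   own assumption, as are the names DROP_PREF and choose_victim.
   Re-Echo is given preference 5 here; an implementation that merges
   preferences 4 and 5 would use 4.

```python
# Drop preference per (ECN field, RE bit); 1 is dropped first (Table 3).
DROP_PREF = {
    (0b01, 0): 5,  # Re-Echo
    (0b00, 1): 4,  # FNE
    (0b11, 0): 3,  # CE(0)
    (0b01, 1): 3,  # RECT
    (0b11, 1): 3,  # CE(-1)
    (0b10, 1): 2,  # currently unused
    (0b10, 0): 2,  # RFC3168 ECN use only
    (0b00, 0): 1,  # Not-RECT
}


def choose_victim(queue):
    """Return the index of the packet to drop from a non-empty queue.

    queue is a list of (ecn_field, re_bit) pairs for packets in the same
    Diffserv PHB, oldest first.  Ties go to the latest arrival.
    """
    return min(range(len(queue)),
               key=lambda i: (DROP_PREF[queue[i]], -i))


queue = [(0b01, 1), (0b00, 0), (0b01, 0), (0b00, 0)]
victim = choose_victim(queue)
```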

764	5.4.  Justification for Setting the First SYN to FNE

766	   The initial SYN MUST be set to FNE by Re-ECT client A (Section 6.1.4)
767	   and Section 5.3 says a queue MAY optionally treat an FNE packet as
768	   ECN capable, so an initial SYN may be marked CE(-1) rather than
769	   dropped.  This seems dangerous, because the sender has not yet
770	   established whether the receiver is a RFC3168 one that does not
771	   understand congestion marking.  It also seems to allow malicious
772	   senders to take advantage of ECN marking to avoid so much drop when
773	   launching SYN flooding attacks.  Below we explain the features of the
774	   protocol design that remove both these dangers.

776	   ECN-capable initial SYN with a Not-ECT server:  If the TCP server B
777	      is re-ECN capable, provision is made for it to feedback a possible
778	      congestion marked SYN in the SYN ACK (Section 6.1.4).  But if the
779	      TCP client A finds out from the SYN ACK that the server was not
780	      ECN-capable, the TCP client MUST conservatively consider the first
781	      SYN as congestion marked before setting itself into Not-ECT mode.
782	      Section 6.1.4 mandates that such a TCP client MUST also set its
783	      initial window to 1 segment.  In this way we remove the need to
784	      cautiously set the first SYN to Not-RECT.  This will give worse
785	      performance while deployment is patchy, but better performance
786	      once deployment is widespread.

788	   SYN flooding attacks can't exploit ECN-capability:  Malicious hosts
789	      may think they can use the advantage that ECN-marking gives over
790	      drop in launching classic SYN-flood attacks.  But Section 5.3
791	      mandates that a router MUST only be configured to treat packets
792	      with the FNE codepoint as ECN-capable if FNE packets are rate
793	      limited somewhere.  Introduction of the FNE codepoint was a
794	      deliberate move to enable transport-neutral handling of flow-start
795	      and flow state set-up in the IP layer where it belongs.  It then
796	      becomes possible to protect against flooding attacks of all forms
797	      (not just SYN flooding) without transport-specific inspection for
798	      things like the SYN flag in TCP headers.  Then, for instance, SYN
799	      flooding attacks using IPSec ESP encryption can also be rate
800	      limited at the IP layer.

802	   It might seem pedantic going to all this trouble to enable ECN on the
803	   initial packet of a flow, but it is motivated by a much wider concern
804	   to ensure safe congestion control will still be possible even if the
805	   application mix evolves to the point where the majority of flows
806	   consist of a single window or even a single packet.  It also allows
807	   denial of service attacks to be more easily isolated and prevented.

809	5.5.  Control and Management

811	5.5.1.  Negative Balance Warning

813	   A new ICMP message type is being considered so that a dropper can
814	   warn the apparent sender of a flow that it has started to sanction
815	   the flow.  The message would have similar semantics to the `Time
816	   exceeded' ICMP message type.  To ensure the sender has to invest some
817	   work before the network will generate such a message, a dropper
818	   SHOULD only send such a message for flows that have demonstrated that
819	   they have started correctly by establishing a positive record, but
820	   have later gone negative.  The threshold is up to the implementation.
821	   The purpose of the message is to distinguish sanction drops from
822	   other causes, such as congestion or transmission losses.  The dropper
823	   would send the message to the sender of the flow, not the receiver.

825	   If we did define this message type, it would be REQUIRED for all re-
826	   ECT senders to parse and understand it.  Note that a sender MUST only
827	   use this message to explain why losses are occurring.  A sender MUST
828	   NOT take this message to mean that losses have occurred that it was
829	   not aware of.  Otherwise, spoof messages could be sent by malicious
830	   sources to slow down a sender (cf. ICMP source quench).

832	   However, the need for this message type is not yet confirmed, as we
833	   are considering how to prevent it being used by malicious senders to
834	   scan for droppers and to test their threshold settings. {ToDo:
835	   Complete this section.}

837	5.5.2.  Rate Response Control

839	   As discussed in [I-D.tsvwg-re-ecn-motivation] the sender's access
840	   operator will be expected to use bulk per-user policing, but they
841	   might choose to introduce a per-flow policer.  In cases where
842	   operators do introduce per-flow policing, there may be a need for a
843	   sender to send a request to the ingress policer asking for permission
844	   to apply a non-default response to congestion (where TCP-friendly is
845	   assumed to be the default).  This would require the sender to know
846	   what message format(s) to use and to be able to discover how to
847	   address the policer.  The required control protocol(s) are outside
848	   the scope of this document, but will require definition elsewhere.

850	   The policer is likely to be local to the sender and inline, probably
851	   at the ingress interface to the internetwork.  So, discovery should
852	   not be hard.  A variety of control protocols already exist for some
853	   widely used rate-responses to congestion.  For instance DCCP
854	   congestion control identifiers (CCIDs [RFC4340]) fulfil this role and
855	   so does QoS signalling (e.g. an RSVP request for controlled load
856	   service is equivalent to a request for no rate response to
857	   congestion, but with admission control).

859	5.6.  IP in IP Tunnels

861	   For re-ECN to work correctly through IP in IP tunnels, it needs
862	   slightly different tunnel handling to regular ECN [RFC3168].
863	   Currently there is some inconsistency between how the handling of IP
864	   in IP tunnels is defined in [RFC3168] and how it is defined in
865	   [RFC4301], but re-ECN would work fine with the IPsec behaviour.  This
866	   inconsistency is addressed in a new Internet Draft
867	   [I-D.ietf-tsvwg-ecn-tunnel] that proposes to update RFC3168 tunnel
868	   behaviour to bring it into line with IPsec.  Ideally, for re-ECN to
869	   work through a tunnel, the tunnel entry should copy both the RE flag
870	   and the ECN field from the inner to the outer IP header.  Then at the
871	   tunnel exit, any congestion marking of the outer ECN field should
872	   overwrite the inner ECN field (unless the inner field is Not-ECT in
873	   which case an alarm should be raised).  The RE flag shouldn't change
874	   along a path, so the outer RE flag should be the same as the inner.
875	   If it isn't, a management alarm should be raised.  This behaviour is
876	   the same as the full-functionality variant of [RFC3168] at tunnel
877	   exit, but different at tunnel entry.
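
   The ideal tunnel behaviour just described can be sketched as
   follows.  Headers are modelled as dictionaries holding only the
   'ecn' and 're' fields, and alarms are returned as strings; both are
   simplifications of our own, as are the function names.

```python
NOT_ECT = 0b00
CE = 0b11


def tunnel_entry(inner):
    """Copy both the ECN field and the RE flag to the outer header."""
    return {"ecn": inner["ecn"], "re": inner["re"]}


def tunnel_exit(inner, outer):
    """Return (forwarded inner header, list of management alarms)."""
    alarms = []
    forwarded = dict(inner)
    if outer["ecn"] == CE:
        if inner["ecn"] == NOT_ECT:
            alarms.append("outer header is CE but inner is Not-ECT")
        else:
            forwarded["ecn"] = CE  # outer marking overwrites inner
    if outer["re"] != inner["re"]:
        alarms.append("RE flag differs between outer and inner header")
    return forwarded, alarms


inner = {"ecn": 0b01, "re": 1}    # RECT
outer = tunnel_entry(inner)
outer["ecn"] = CE                 # congestion marked inside the tunnel
forwarded, alarms = tunnel_exit(inner, outer)
```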

879	   If tunnels are left as they are specified in [RFC3168], whether the
880	   limited or full-functionality variants are used, a problem arises
881	   with re-ECN if a tunnel crosses an inter-domain boundary, because the
882	   difference between positive and negative markings will not be
883	   correctly accounted for.  In a limited functionality ECN tunnel, the
884	   flow will appear to be RFC3168 compliant traffic, and therefore may
885	   be wrongly rate limited.  In a full-functionality ECN tunnel, the
886	   result will depend on whether the tunnel entry copies the inner RE
887	   flag to the outer header or the RE flag in the outer header is always
888	   cleared.  If the former, the flow will tend to be too positive when
889	   accounted for at borders.  If the latter, it will be too negative.
890	   If the rules set out in [I-D.ietf-tsvwg-ecn-tunnel] are followed then
891	   this will not be an issue.

893	5.7.  Non-Issues

895	   The following issues might seem to cause unfavourable interactions
896	   with re-ECN, but we will explain why they don't:

898	   o  Various link layers support explicit congestion notification, such
899	      as Frame Relay and ATM.  Explicit congestion notification is
900	      proposed to be added to other link layers, such as Ethernet
901	      (802.3ar Ethernet congestion management) and MPLS [RFC5129];

903	   o  Encryption and IPSec.

905	   In the case of congestion notification at the link layer, each
906	   particular link layer scheme either manages congestion on the link
907	   with its own link-level feedback (the usual arrangement in the cases
908	   of ATM and Frame Relay), or congestion notification from the link
909	   layer is merged into congestion notification at the IP level when the
910	   frame headers are decapsulated at the end of the link (the
911	   recommended arrangement in the Ethernet and MPLS cases).  Given the
912	   RE flag is not intended to change along the path, this means that
913	   downstream congestion will still be measurable at any point where IP
914	   is processed on the path by subtracting negative from positive
915	   markings.

917	   In the case of encryption, as long as the tunnel issues described in
918	   Section 5.6 are dealt with, payload encryption itself will not be a
919	   problem.  The design goal of re-ECN is to include downstream
920	   congestion in the IP header so that it is not necessary to dig into
921	   inner headers.  Obfuscation of flow identifiers is not a problem for
922	   re-ECN policing elements.  Re-ECN doesn't ever require flow
923	   identifiers to be valid, it only requires them to be unique.  So if
924	   an IPSec encapsulating security payload (ESP [RFC4305]) or an
925	   authentication header (AH [RFC4302]) is used, the security parameters
926	   index (SPI) will be a sufficient flow identifier, as it is intended
927	   to be unique to a flow without revealing actual port numbers.

929	   In general, even if endpoints use some locally agreed scheme to hide
930	   port numbers, re-ECN policing elements can just consider the pair of
931	   source and destination IP addresses as the flow identifier.  Re-ECN
932	   encourages endpoints to at least tell the network layer that a
933	   sequence of packets are all part of the same flow, if indeed they
934	   are.  The alternative would be for the sender to make each packet
935	   appear to be a new flow, which would require them all to be marked
936	   FNE in order to avoid being treated with the bulk of malicious flows
937	   at the egress dropper.  Given the FNE marking is worth +1 and
938	   networks are likely to rate limit FNE packets, endpoints are given an
939	   incentive not to set FNE on each packet.  But if the sender really
940	   does want to hide the flow relationship between packets it can choose
941	   to pay the cost of multiple FNE packets, which in the long run will
942	   compensate for the extra memory required on network policing elements
943	   to process each flow.

945	6.  Transport Layers

947	6.1.  TCP

949	   Re-ECN capability at the sender is essential.  At the receiver it is
950	   optional, as long as the receiver has a basic RFC3168-compliant ECN-
951	   capable transport (ECT) [RFC3168].  Given re-ECN is not the first
952	   attempt to define the semantics of the ECN field, we give a table
953	   below summarising what happens for various combinations of
954	   capabilities of the sender S and receiver R, as indicated in the
955	   first four columns below.  The last column gives the mode a half-
956	   connection should be in after the first two of the three TCP
957	   handshakes.

959	   +--------+--------------+------------+---------+--------------------+
960	   | Re-ECT |   ECT-Nonce  |     ECT    | Not-ECT |         S-R        |
961	   |        |   (RFC3540)  |  (RFC3168) |         |   Half-connection  |
962	   |        |              |            |         |        Mode        |
963	   +--------+--------------+------------+---------+--------------------+
964	   |   SR   |              |            |         |        RECN        |
965	   |    S   |       R      |            |         |       RECN-Co      |
966	   |    S   |              |      R     |         |       RECN-Co      |
967	   |    S   |              |            |    R    |       Not-ECT      |
968	   +--------+--------------+------------+---------+--------------------+

970	       Table 4: Modes of TCP Half-connection for Combinations of ECN
971	                  Capabilities of Sender S and Receiver R

973	   We will describe what happens in each mode, then describe how they
974	   are negotiated.  The abbreviations for the modes in the above table
975	   mean:

977	   RECN:  Full re-ECN capable transport

979	   RECN-Co:  Re-ECN sender in compatibility mode with a RFC3168
980	      compliant [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable
981	      receiver.  Implementation of this mode is OPTIONAL.

983	   Not-ECT:  Not ECN-capable transport, as defined in [RFC3168] for when
984	      at least one of the transports does not understand even basic ECN
985	      marking.

987	   Note that we use the term Re-ECT for a host transport that is re-ECN-
988	   capable but RECN for the modes of the half connections between hosts
989	   when they are both Re-ECT.  If a host transport is Re-ECT, this fact
990	   alone does NOT imply either of its half connections will necessarily
991	   be in RECN mode, at least not until it has confirmed that the other
992	   host is Re-ECT.

994	6.1.1.  RECN mode: Full Re-ECN capable transport

996	   In full RECN mode, for each half connection, both the sender and the
997	   receiver each maintain an unsigned integer counter we will call ECC
998	   (echo congestion counter).  The receiver maintains a count of how
999	   many times a CE marked packet has arrived during the half-connection.
1000	   Once a RECN connection is established, the three TCP option flags
1001	   (ECE, CWR & NS) used for ECN-related functions in other versions of
1002	   ECN are used as a 3-bit field for the receiver to repeatedly tell the
1003	   sender the current value of ECC, modulo 8, whenever it sends a TCP
1004	   ACK.  We will call this the echo congestion increment (ECI) field.
1005	   This overloaded use of these 3 option flags as one 3-bit ECI field is
1006	   shown in Figure 7.  The actual definition of the TCP header,
1007	   including the addition of support for the ECN nonce, is shown for
1008	   comparison in Figure 6.  This specification does not redefine the
1009	   names of these three TCP option flags, it merely overloads them with
1010	   another definition once a flow is established.

1012	        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1013	      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
1014	      |               |           | N | C | E | U | A | P | R | S | F |
1015	      | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
1016	      |               |           |   | R | E | G | K | H | T | N | N |
1017	      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

1019	    Figure 6: The (post-ECN Nonce) definition of bytes 13 and 14 of the
1020	                                TCP Header

1022	        0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
1023	      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
1024	      |               |           |           | U | A | P | R | S | F |
1025	      | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
1026	      |               |           |           | G | K | H | T | N | N |
1027	      +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

1029	    Figure 7: Definition of the ECI field within bytes 13 and 14 of the
1030	   TCP Header, overloading the current definitions above for established
1031	                                RECN flows.

1033	   Receiver Action in RECN Mode

1035	      Every time a CE marked packet arrives at a receiver in RECN mode,
1036	      the receiver transport increments its local value of ECC and MUST
1037	      echo its value, modulo 8, to the sender in the ECI field of the
1038	      next ACK.  It MUST repeat the same value of ECI in every
1039	      subsequent ACK until the next CE event, when it increments ECI
1040	      again.

1042	      The increment of the local ECC values is modulo 8 so the field
1043	      value simply wraps round back to zero when it overflows.  The
1044	      least significant bit is to the right (labelled bit 9).

1046	      A receiver in RECN mode MAY delay the echo of a CE to the next
1047	      delayed-ACK, which would be necessary if ACK-withholding were
1048	      implemented.
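
   The receiver behaviour above amounts to a small wrapping counter.
   The following sketch keeps ECC modulo 8 as described; the class and
   method names are hypothetical and delayed ACKs are not modelled.

```python
class RecnReceiver:
    """Receiver side of one half-connection in RECN mode."""

    def __init__(self):
        self.ecc = 0  # echo congestion counter, kept modulo 8

    def on_data_packet(self, ce_marked):
        """Count an arriving data packet if it carries a CE mark."""
        if ce_marked:
            self.ecc = (self.ecc + 1) % 8

    def eci_for_ack(self):
        """Value to place in the 3-bit ECI field of every ACK."""
        return self.ecc


receiver = RecnReceiver()
for _ in range(9):
    receiver.on_data_packet(ce_marked=True)
eci = receiver.eci_for_ack()
```

   After nine CE marked packets the field has wrapped once, so the ACK
   carries the value 1.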

1050	   Sender Action in RECN Mode

1052	      On the arrival of every ACK, the sender compares the ECI field
1053	      with its own ECC value, then replaces its local value with that
1054	      from the ACK.  The difference D (D = (ECI + 8 - ECC mod 8) mod 8)
1055	      is assumed to be the number of CE marked packets that arrived at
1056	      the receiver since it sent the previously received ACK (but see
1057	      below for the sender's safety strategy).  Whenever the ECI field
1058	      increments by D (and/or d drops are detected), the sender MUST
1059	      clear the RE flag to "0" in the IP header of the next D' data
1060	      packets it sends (where D' = D + d), effectively re-echoing each
1061	      single increment of ECI.  Otherwise the data sender MUST send all
1062	      data packets with RE set to "1".

1064	      As a general rule, once a flow is established, as well as setting
1065	      or clearing the RE flag as above, a data sender in RECN mode MUST
1066	      always set the ECN field to ECT(1).  However, the settings of the
1067	      extended ECN field during flow start are defined in Section 6.1.4.

1069	      As we have already emphasised, the re-ECN protocol makes no
1070	      changes and has no effect on the TCP congestion control algorithm.
1071	      So, the first increment of ECI (or detection of a drop) in an RTT
1072	      triggers the standard TCP congestion response, no more than one
1073	      congestion response per round trip, as usual.  However, the sender
1074	      re-echoes every increment of ECI irrespective of RTTs.

1076	      A TCP sender also acts as the receiver for the other half-
1077	      connection.  The host will maintain two ECC values S.ECC and R.ECC
1078	      as sender and receiver respectively.  Every TCP header sent by a
1079	      host in RECN mode will also repeat the prevailing value of R.ECC
1080	      in its ECI field.  If a sender in RECN mode has to retransmit a
1081	      packet due to a suspected loss, the re-transmitted packet MUST
1082	      carry the latest prevailing value of R.ECC when it is re-
1083	      transmitted, which will not necessarily be the one it carried
1084	      originally.

1086	6.1.2.  RECN-Co mode: Re-ECT Sender with an RFC3168 compliant ECN
1087	        Receiver

1089	   If the half-connection is in RECN-Co mode, ECN feedback proceeds no
1090	   differently to that of RFC3168 compliant ECN.  In other words, the
1091	   receiver sets the ECE flag repeatedly in the TCP header and the
1092	   sender responds by setting the CWR flag.  Although RECN-Co mode is
1093	   used when the receiver has not implemented the re-ECN protocol, the
1094	   sender can infer enough from its RFC3168 compliant ECN feedback to
1095	   set or clear the RE flag reasonably well.  Specifically, every time
1096	   the receiver toggles the ECE field from "0" to "1" (or a loss is
1097	   detected), as well as setting CWR in the TCP flags, the re-ECN sender
1098	   MUST blank the RE flag of the next packet to "0" as it would do in
1099	   full RECN mode.  Otherwise, the data sender SHOULD send all other
1100	   packets with RE set to "1".  Once a flow is established, a re-ECN
1101	   data sender in RECN-Co mode MUST always set the ECN field to ECT(1).
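
	   The inference described above can be sketched as counting the
	   "0" to "1" transitions of ECE seen on successive ACKs.  This is an
	   illustration under simplifying assumptions (one ECE value per ACK,
	   losses counted separately); the function name is hypothetical.

```python
def re_blank_events(ece_flags, losses=0):
    """Count packets to send with RE cleared, given the ECE flag on successive ACKs.

    Only a 0 -> 1 transition of ECE counts; repeated ECE=1 ACKs echo the same mark.
    """
    events = losses
    previous = 0
    for ece in ece_flags:
        if ece == 1 and previous == 0:
            events += 1
        previous = ece
    return events


print(re_blank_events([0, 1, 1, 1, 0, 0, 1, 0]))  # 2
```

	   A CE mark that arrives while ECE is already being echoed produces
	   no new transition, so this count can understate the number of
	   marks.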

1103	   If a CE marked packet arrives at the receiver within a round trip
1104	   time of a previous mark, the receiver will still be echoing ECE for
1105	   the last CE mark.  Therefore, such a mark will be missed by the
1106	   sender.  Of course, this isn't of concern for congestion control, but
1107	   it does mean that very occasionally the RE blanking fraction will be
1108	   understated.  Therefore flows in RECN-Co mode may occasionally be
1109	   mistaken for very lightly cheating flows and consequently might
1110	   suffer a small number of packet drops through an egress dropper.  We
1111	   expect re-ECN would be deployed for some time before policers and
1112	   droppers start to enforce it.  So, given there is not much ECN
1113	   deployment yet anyway, this minor problem may affect only a very
1114	   small proportion of flows, reducing to nothing over the years as
1115	   RFC3168 compliant ECN hosts upgrade.  The use of RECN-Co mode would
1116	   need to be reviewed in the light of experience at the time of re-ECN
1117	   deployment.

1119	   RECN-Co mode is OPTIONAL.  Re-ECN implementers who want to keep their
1120	   code simple MAY choose not to implement this mode.  If they do not,
1121	   a re-ECN sender SHOULD fall back to RFC3168 compliant ECT mode in the
1122	   presence of an ECN-capable receiver.  It MAY choose to fall back to
1123	   the ECT-Nonce mode, but if re-ECN implementers don't want to be
1124	   bothered with RECN-Co mode, they probably won't want to add an ECT-
1125	   Nonce mode either.

1127	6.1.2.1.  Re-ECN support for the ECN Nonce

1129	   A TCP half-connection in RECN-Co mode MUST NOT support the ECN
1130	   Nonce [RFC3540].  This means that the sending code of a re-ECN
1131	   implementation will never need to include ECN Nonce support.  Re-ECN
1132	   is intended to provide wider protection than the ECN nonce against
1133	   congestion control misbehaviour, and re-ECN only requires support
1134	   from the sender, therefore it is preferable to specifically rule out
1135	   the need for dual sender implementations.  As a consequence, a re-ECN
1136	   capable sender will never set ECT(0), so it will be easier for
1137	   network elements to discriminate re-ECN traffic flows from other ECN
1138	   traffic, which will always contain some ECT(0) packets.

1140	   However, a re-ECN implementation MAY include receiving
1141	   code that complies with the ECN Nonce protocol when interacting with
1142	   a sender that supports the ECN nonce (rather than re-ECN), but this
1143	   support is not required.

1145	   RFC3540 allows an ECN nonce sender to choose whether to sanction a
1146	   receiver that does not ever set the nonce sum.  Given re-ECN is
1147	   intended to provide wider protection than the ECN nonce against
1148	   congestion control misbehaviour, implementers of re-ECN receivers MAY
1149	   choose not to implement backwards compatibility with the ECN nonce
1150	   capability.  This may be because they deem that the risk of sanctions
1151	   is low, perhaps because significant deployment of the ECN nonce seems
1152	   unlikely at implementation time.

1154	6.1.3.  Capability Negotiation

1156	   During the TCP hand-shake at the start of a connection, an originator
1157	   of the connection (host A) with a re-ECN-capable transport MUST
1158	   indicate it is Re-ECT by setting the TCP flags NS=1, CWR=1 and ECE=1
1159	   in the initial SYN.

1161	   A responding Re-ECT host (host B) MUST return a SYN ACK with flags
1162	   CWR=1 and ECE=0.  The responding host MUST NOT set this combination
1163	   of flags unless the preceding SYN has already indicated Re-ECT
1164	   support as above.  Normally a Re-ECT server (B) will reply to a Re-
1165	   ECT client with NS=0, but if the initial SYN from Re-ECT client A is
1166	   marked CE(-1), a Re-ECT server B MUST increment its local value of
1167	   ECC.  But B cannot reflect the value of ECC in the SYN ACK, because
1168	   it is still using the 3 bits to negotiate connection capabilities.
1169	   So, server B MUST set the alternative TCP header flags in its SYN
1170	   ACK: NS=1, CWR=1 and ECE=0.

1172	   These handshakes are summarised in Table 5 below, with X indicating
1173	   NS can be either 0 or 1 depending on whether congestion had been
1174	   experienced.  The handshakes used for the other flavours of ECN are
1175	   also shown for comparison.  To compress the width of the table, the
1176	   headings of the first four columns have been severely abbreviated, as
1177	   follows:

1179	      R: *R*e-ECT

1181	      N: ECT-*N*once (RFC3540)

1183	      E: *E*CT (RFC3168)

1185	      I: Not-ECT (*I*mplicit congestion notification).

1187	   These correspond with the same headings used in Table 4.  Indeed, the
1188	   resulting modes in the last two columns of the table below are a more
1189	   comprehensive way of saying the same thing as Table 4.

1191	   +----+---+---+---+------------+-------------+-----------+-----------+
1192	   | R  | N | E | I |   SYN A-B  | SYN ACK B-A |  A-B Mode |  B-A Mode |
1193	   +----+---+---+---+------------+-------------+-----------+-----------+
1194	   |    |   |   |   | NS CWR ECE |  NS CWR ECE |           |           |
1195	   | AB |   |   |   |  1   1   1 |  X   1   0  |    RECN   |    RECN   |
1196	   | A  | B |   |   |  1   1   1 |  1   0   1  |  RECN-Co  | ECT-Nonce |
1197	   | A  |   | B |   |  1   1   1 |  0   0   1  |  RECN-Co  |    ECT    |
1198	   | A  |   |   | B |  1   1   1 |  0   0   0  |  Not-ECT  |  Not-ECT  |
1199	   | B  | A |   |   |  0   1   1 |  0   0   1  | ECT-Nonce |  RECN-Co  |
1200	   | B  |   | A |   |  0   1   1 |  0   0   1  |    ECT    |  RECN-Co  |
1201	   | B  |   |   | A |  0   0   0 |  0   0   0  |  Not-ECT  |  Not-ECT  |
1202	   +----+---+---+---+------------+-------------+-----------+-----------+

1204	      Table 5: TCP Capability Negotiation between Originator (A) and
1205	                               Responder (B)

1207	   As soon as a re-ECN capable TCP server receives a SYN, it MUST set
1208	   its two half-connections into the modes given in Table 5.  As soon as
1209	   a re-ECN capable TCP client receives a SYN ACK, it MUST set its two
1210	   half-connections into the modes given in Table 5.  The half-
1211	   connections will remain in these modes for the rest of the
1212	   connection, including for the third segment of TCP's three-way hand-
1213	   shake (the ACK).

1215	   {ToDo: Consider RSTs within a connection.}

1217	   Recall that, if the SYN ACK reflects the same flag settings as the
1218	   preceding SYN (because there is a broken RFC3168 compliant
1219	   implementation that behaves this way), RFC3168 specifies that the
1220	   whole connection MUST revert to Not-ECT.

1222	   Also note that, whenever the SYN flag of a TCP segment is set
1223	   (including when the ACK flag is also set), the NS, CWR and ECE flags
1224	   (i.e. the ECI field of the SYN ACK) MUST NOT be interpreted as the
1225	   3-bit ECI value, which is only set as a copy of the local ECC value
1226	   in non-SYN packets.
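
	   From the point of view of a Re-ECT originator (A), the rows of
	   Table 5 in which A is Re-ECT amount to a lookup on the flags of the
	   SYN ACK.  The sketch below is illustrative only; the function name
	   is hypothetical, and the treatment of flag combinations that do not
	   appear in Table 5 is an assumption.

```python
# (NS, CWR, ECE) flags on the SYN sent by a re-ECN capable originator A.
RE_ECN_SYN = (1, 1, 1)


def client_modes(syn_ack_flags):
    """Return (A-B mode, B-A mode) for a Re-ECT originator A.

    syn_ack_flags is the (NS, CWR, ECE) tuple carried on the SYN ACK.
    """
    ns, cwr, ece = syn_ack_flags
    if syn_ack_flags == RE_ECN_SYN:
        # A SYN ACK that reflects the SYN's flags indicates a broken
        # implementation, so the whole connection reverts to Not-ECT.
        return ("Not-ECT", "Not-ECT")
    if (cwr, ece) == (1, 0):
        # NS may be 0 or 1 here; NS=1 signals that the SYN arrived CE marked.
        return ("RECN", "RECN")
    if (ns, cwr, ece) == (1, 0, 1):
        return ("RECN-Co", "ECT-Nonce")
    if (ns, cwr, ece) == (0, 0, 1):
        return ("RECN-Co", "ECT")
    # Assumption: anything else is treated like the Not-ECT row (0, 0, 0).
    return ("Not-ECT", "Not-ECT")


print(client_modes((0, 1, 0)))  # ('RECN', 'RECN')
print(client_modes((0, 0, 1)))  # ('RECN-Co', 'ECT')
```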

1228	6.1.4.  Extended ECN (EECN) Field Settings during Flow Start or after
1229	        Idle Periods

1231	   If the originator (A) of a TCP connection supports re-ECN, it MUST set
1232	   the extended ECN (EECN) field in the IP header of the initial SYN
1233	   packet to the feedback not established (FNE) codepoint.

1235	   FNE is a new extended ECN codepoint defined by this specification
1236	   (Section 4.2).  The feedback not established (FNE) codepoint is used
1237	   when the transport does not have the benefit of ECN feedback so it
1238	   cannot decide whether to set or clear the RE flag.

1240	   If after receiving a SYN the server B has set its sending half-
1241	   connection into RECN mode or RECN-Co mode, it MUST set the extended
1242	   ECN field in the IP header of its SYN ACK to the feedback not
1243	   established (FNE) codepoint.  Note the careful wording here, which
1244	   means that Re-ECT server B MUST set FNE on a SYN ACK whether it is
1245	   responding to a SYN from a Re-ECT client or from a client that is
1246	   merely ECN-capable.  This is because FNE indicates the transport is
1247	   ECN capable.

1249	   The original ECN specification [RFC3168] required SYNs and SYN ACKs
1250	   to use the Not-ECT codepoint of the ECN field.  The aim was to
1251	   prevent well-known DoS attacks such as SYN flooding being able to
1252	   gain from the advantage that ECN capability afforded over drop at
1253	   ECN-capable routers.

1255	   For a SYN ACK, Kuzmanovic [RFC5562] has shown that this caution was
1256	   unnecessary, and allows a SYN ACK to be ECN-capable to improve
1257	   performance.  By stipulating the FNE codepoint for the initial SYN,
1258	   we comply with RFC3168 in word but not in spirit, because we have
1259	   indeed set the ECN field to Not-ECT, but we have extended the ECN
1260	   field with another bit.  And it will be seen (Section 5.3) that we
1261	   have defined one setting of that bit to mean an ECN-capable
1262	   transport.  Therefore, by proposing that the FNE codepoint MUST be
1263	   used on the initial SYN of a connection, we have gone further by
1264	   proposing to make the initial SYN ECN-capable too.  Section 5.4
1265	   justifies deciding to make the initial SYN ECN-capable.

1267	   Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will
1268	   have already been set on the initial SYN and possibly the SYN ACK as
1269	   above.  But each re-ECN sender will have to set FNE cautiously on a
1270	   few data packets as well, given a number of packets will usually have
1271	   to be sent before sufficient congestion feedback is received.  The
1272	   behaviour will be different depending on the mode of the half-
1273	   connection:

1275	   RECN mode:  Given the constraints on TCP's initial window [RFC3390]
1276	      and its exponential window increase during slow start
1277	      phase [RFC2581], it turns out that the sender SHOULD set FNE on
1278	      the first and third data packets in its flow after the initial
1279	      3-way handshake, assuming equal sized data packets once a flow is
1280	      established.  Appendix D presents the calculation that led to this
1281	      conclusion.  Below, after running through the start of an example
1282	      TCP session, we give the intuition learned from that calculation.

1284	   RECN-Co mode:  A re-ECT sender that switches into re-ECN
1285	      compatibility mode or into Not-ECT mode (because it has detected
1286	      the corresponding host is not re-ECN capable) MUST limit its
1287	      initial window to 1 segment.  The reasoning behind this constraint
1288	      is given in Section 5.4.  Having set this initial window, a re-ECN
1289	      sender in RECN-Co mode SHOULD set FNE on the first and third data
1290	      packets in a flow, as for RECN mode.

1292	   +----+------+----------------+-------+-------+---------------+------+
1293	   |    | Data | TCP A(Re-ECT)  | IP A  | IP B  | TCP B(Re-ECT) | Data |
1294	   +----+------+----------------+-------+-------+---------------+------+
1295	   |    | Byte |  SEQ  ACK CTL  | EECN  | EECN  |  SEQ  ACK CTL | Byte |
1296	   | -- | ---- | -------------  | ----- | ----- | ------------- | ---- |
1297	   |  1 |      | 0100      SYN  | FNE   | -->   |      R.ECC=0  |      |
1298	   |    |      |    CWR,ECE,NS  |       |       |               |      |
1299	   |  2 |      |      R.ECC=0   | <--   | FNE   | 0300 0101     |      |
1300	   |    |      |                |       |       |   SYN,ACK,CWR |      |
1301	   |  3 |      | 0101 0301 ACK  | RECT  | -->   |      R.ECC=0  |      |
1302	   |  4 | 1000 | 0101 0301 ACK  | FNE   | -->   |      R.ECC=0  |      |
1303	   |  5 |      |      R.ECC=0   | <--   | FNE   | 0301 1102 ACK | 1460 |
1304	   |  6 |      |      R.ECC=0   | <--   | RECT  | 1762 1102 ACK | 1460 |
1305	   |  7 |      |      R.ECC=0   | <--   | FNE   | 3222 1102 ACK | 1460 |
1306	   |  8 |      | 1102 1762 ACK  | RECT  | -->   |      R.ECC=0  |      |
1307	   |  9 |      |      R.ECC=0   | <--   | RECT  | 4682 1102 ACK | 1460 |
1308	   | 10 |      |      R.ECC=0   | <--   | RECT  | 6142 1102 ACK | 1460 |
1309	   | 11 |      | 1102 3222 ACK  | RECT  | -->   |      R.ECC=0  |      |
1310	   | 12 |      |      R.ECC=0   | <--   | RECT  | 7602 1102 ACK | 1460 |
1311	   | 13 |      |      R.ECC=1   | <*-   | RECT  | 9062 1102 ACK | 1460 |
1312	   |    |      | ...            |       |       |               |      |
1313	   +----+------+----------------+-------+-------+---------------+------+

1315	                      Table 6: TCP Session Example #1

1317	   Table 6 shows an example TCP session, where the server B sets FNE on
1318	   its first and third data packets (lines 5 & 7) as well as on the
1319	   initial SYN ACK as previously described.  The left hand half of the
1320	   table shows the relevant settings of headers sent by client A in
1321	   three layers: the TCP payload size; TCP settings; then IP settings.
1322	   The right hand half gives equivalent columns for server B.  The only
1323	   TCP settings shown are the sequence number (SEQ), acknowledgement
1324	   number (ACK) and the relevant control (CTL) flags that A sets in the
1325	   TCP header.  The IP columns show the setting of the extended ECN
1326	   (EECN) field.

1328	   Also shown on the receiving side of the table is the value of the
1329	   receiver's echo congestion counter (R.ECC) after processing the
1330	   incoming EECN header.  Note that, once a host sets a half-connection
1331	   into RECN mode, it MUST initialise its local value of ECC to zero.

1333	   The intuition that Appendix D gives for why a sender should set FNE
1334	   on the first and third data packets is as follows.  At line 13, a
1335	   packet sent by B is shown with an '*', which means it has been
1336	   congestion marked by an intermediate queue from RECT to CE(-1).  On
1337	   receiving this CE marked packet, client A increments its ECC counter
1338	   to 1 as shown.  This was the 7th data packet B sent, but before
1339	   feedback about this event returns to B, it might well have sent many
1340	   more packets.  Indeed, during exponential slow start, about as many
1341	   packets will be in flight (unacknowledged) as have been acknowledged.
1342	   So, when the feedback from the congestion event on B's 7th segment
1343	   returns, B will have sent about 7 further packets that will still be
1344	   in flight.  At that stage, B's best estimate of the network's packet
1345	   marking fraction will be 1/7.  So, as B will have sent about 14
1346	   packets, it should have already marked 2 of them as FNE in order to
1347	   have marked 1/7; hence the need to have set the first and third data
1348	   packets to FNE.
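
	   The arithmetic behind this intuition can be checked with a few
	   lines.  The variable names are hypothetical, and the figures are
	   those of the Table 6 example, not the general calculation of
	   Appendix D.

```python
from fractions import Fraction

acked_before_mark = 7                 # B's 7th data packet was CE marked
in_flight = acked_before_mark         # slow start: about as many again in flight
sent_so_far = acked_before_mark + in_flight
marking_estimate = Fraction(1, acked_before_mark)

# Number of packets that should already carry FNE to cover the estimate.
fne_needed = marking_estimate * sent_so_far
print(sent_so_far, marking_estimate, fne_needed)  # 14 1/7 2
```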

1350	   Client A's behaviour in Table 6 also shows FNE being set on the first
1351	   SYN and the first data packet (lines 1 & 4), but in this case it
1352	   sends no more data packets, so of course, it cannot, and does not
1353	   need to, set FNE again.  Note that in the A-B direction there is no
1354	   need to set FNE on the third part of the three-way hand-shake (line
1355	   3---the ACK).

1357	   Note that in this section we have used the word SHOULD rather than
1358	   MUST when specifying how to set FNE on data segments before positive
1359	   congestion feedback arrives (but note that the word MUST was used for
1360	   FNE on the SYN and SYN ACK).  FNE is only RECOMMENDED for the first
1361	   and third data segments to entertain the possibility that the TCP
1362	   transport has the benefit of other knowledge of the path, which it
1363	   re-uses from one flow for the benefit of a newly starting flow.  For
1364	   instance, one flow can re-use knowledge of other flows between the
1365	   same hosts if using a Congestion Manager [RFC3124] or when a proxy
1366	   host aggregates congestion information for large numbers of flows.

1368	   After an idle period of more than 1 second, a re-ECN sender transport
1369	   MUST set the EECN field of the packet that resumes the connection to
1370	   FNE.  Note that this next packet may be sent a very long time later;
1371	   a packet does NOT have to be sent after 1 second of idling.  In order
1372	   that the design of network policers can be deterministic, this
1373	   specification deliberately puts an absolute lower limit on how long a
1374	   connection can be idle before the packet that resumes the connection
1375	   must be set to FNE, rather than relating it to the connection round
1376	   trip time.  We use the lower bound of the retransmission timeout
1377	   (RTO) [RFC2988], which is commonly used as the idle period before TCP
1378	   must reduce to the restart window [RFC2581].  Note our specification
1379	   of re-ECN's idle period is NOT intended to change the idle period for
1380	   TCP's restart, nor indeed for any other purposes.
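
	   A minimal sketch of the idle check follows.  It assumes that idle
	   time is measured from the last packet the sender transmitted on the
	   connection, with timestamps in seconds; the constant and function
	   names are hypothetical.

```python
IDLE_THRESHOLD_SECONDS = 1.0  # absolute limit from the text, not tied to the RTT


def must_resume_with_fne(last_send_time, now):
    """Return True if the packet resuming the connection has to carry FNE."""
    return (now - last_send_time) > IDLE_THRESHOLD_SECONDS


print(must_resume_with_fne(last_send_time=10.0, now=10.4))  # False
print(must_resume_with_fne(last_send_time=10.0, now=75.0))  # True
```

	   The check is made only when the sender has a packet to send;
	   nothing is triggered by the 1 second timer expiring on its own.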

1382	   {ToDo: Describe how the sender falls back to RFC3168 modes if packets
1383	   don't appear to be getting through (to work round firewalls
1384	   discarding packets they consider unusual).}

1386	6.1.5.  Pure ACKs, Retransmissions, Window Probes and Partial ACKs

1388	   A re-ECN sender MUST clear the RE flag to "0" and set the ECN field
1389	   to Not-ECT in pure ACKs, retransmissions and window probes, as
1390	   specified in [RFC3168].  Our eventual goal is for all packets to be
1391	   sent with re-ECN enabled, and we believe the semantics of the ECI
1392	   field go a long way towards being able to achieve this.  However, we
1393	   have not completed a full security analysis for these cases,
1394	   therefore, currently we merely re-state current practice.

1396	   We must also reconcile the facts that congestion marking is applied
1397	   to packets but acknowledgements cover octet ranges and acknowledged
1398	   octet boundaries need not match the transmitted boundaries.  The
1399	   general principle we work to is to remain compatible with TCP's
1400	   congestion control which is driven by congestion events at packet
1401	   granularity while at the same time aiming to blank the RE flag on at
1402	   least as many octets in a flow as have been marked CE.

1404	   Therefore, a re-ECN TCP receiver MUST increment its ECC value as many
1405	   times as CE marked packets have been received.  And that value MUST
1406	   be echoed to the sender in the first available ACK using the ECI
1407	   field.  This ensures the TCP sender's congestion control receives
1408	   timely feedback on congestion events at the same packet granularity
1409	   that they were generated on congested queues.

1411	   A re-ECN sender then adds the difference D between the incoming ECI
1412	   field and its own ECC value to a counter R.  R is decremented by 1
1413	   for each subsequent packet that is sent with the RE flag blanked,
1414	   until R is no longer positive.  Using this technique, whenever a
1415	   re-ECN transport sends a packet that is not re-ECN capable (e.g. a
1416	   retransmission), the remaining packets required to have the RE flag
1417	   blanked are automatically carried over to subsequent packets through
1418	   the variable R.
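
	   The counter R can be sketched as below.  This is an illustration
	   only; the class and method names are hypothetical, and adding
	   detected drops to R follows the earlier definition D' = D + d
	   rather than anything stated in this section.

```python
class ReBlankingDebt:
    """Illustrative sender-side counter R (hypothetical names)."""

    def __init__(self):
        self.ecc = 0  # sender's copy of the last ECI value seen
        self.r = 0    # packets still owed with the RE flag blanked

    def on_ack(self, eci, drops=0):
        # Add the modulo-8 difference D (plus any detected drops) to R.
        self.r += (eci + 8 - self.ecc % 8) % 8 + drops
        self.ecc = eci

    def re_flag_for_next_packet(self, re_ecn_capable=True):
        # Packets sent as Not-ECT (e.g. retransmissions) are outside the
        # accounting: R is left untouched, so the debt carries over.
        if not re_ecn_capable:
            return None
        if self.r > 0:
            self.r -= 1
            return 0  # RE blanked
        return 1      # RE set


s = ReBlankingDebt()
s.on_ack(eci=2)
flags = [s.re_flag_for_next_packet(),
         s.re_flag_for_next_packet(re_ecn_capable=False),
         s.re_flag_for_next_packet(),
         s.re_flag_for_next_packet()]
print(flags)  # [0, None, 0, 1]
```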

1420	   This does not ensure precisely the same number of octets have RE
1421	   blanked as were CE marked.  But we believe positive errors will
1422	   cancel negative over a long enough period. {ToDo: However, more
1423	   research is needed to prove whether this is so.  If it is not, it may
1424	   be necessary to increment and decrement R in octets rather than
1425	   packets, by incrementing R as the product of D and the size in octets
1426	   of packets being sent (typically the MSS).}

1428	6.2.  Other Transports
1429	6.2.1.  General Guidelines for Adding Re-ECN to Other Transports

1431	   As a general rule, Re-ECT sender transports that have established the
1432	   receiver transport is at least ECN-capable (not necessarily re-ECN
1433	   capable) MUST blank the RE codepoint for at least as many octets as
1434	   arrive at the receiver with the CE codepoint set.  Re-ECN-capable
1435	   sender transports should always initialise the ECN field to the
1436	   ECT(1) codepoint once a flow is established.

1438	   If the sender transport does not have sufficient feedback to even
1439	   estimate the path's CE rate, it SHOULD set FNE continuously.  If the
1440	   sender transport has some, perhaps stale, feedback to estimate that
1441	   the path's CE rate is almost certainly less than E%, the transport
1442	   MAY blank RE in packets for E% of sent octets, and set the RECT
1443	   codepoint for the remainder.
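
	   One possible way to blank RE on E% of sent octets is a running
	   octet balance, sketched below.  This is an assumption about how a
	   transport might implement the guideline, not a mechanism defined
	   here; the function name and the "RE-blanked" label are
	   hypothetical.

```python
def choose_codepoints(packet_sizes, e_percent):
    """Pick 'RE-blanked' or 'RECT' per packet so at least e_percent of octets are blanked."""
    owed = 0  # octets still to be blanked, scaled by 100 to stay in integers
    choices = []
    for size in packet_sizes:
        owed += size * e_percent
        if owed > 0:
            # Blank early, so the blanked fraction never falls below E%.
            choices.append("RE-blanked")
            owed -= size * 100
        else:
            choices.append("RECT")
    return choices


print(choose_codepoints([1000] * 8, e_percent=25))
```

	   With eight equal-sized packets and E = 25, the first and fifth
	   packets are blanked and the rest carry RECT.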

1445	   The following sections give guidelines on how re-ECN support could be
1446	   added to RSVP or NSIS, to DCCP, and to SCTP - although separate
1447	   Internet drafts will be necessary to document the exact mechanics of
1448	   re-ECN in each of these protocols.

1450	   {ToDo: Give a brief outline of what would be expected for each of the
1451	   following:

1453	   o  UDP fire and forget (e.g.  DNS)

1455	   o  UDP streaming with no feedback

1457	   o  UDP streaming with feedback

1459	   }

1461	6.2.2.  Guidelines for adding Re-ECN to RSVP or NSIS

1463	   A separate I-D has been submitted [I-D.re-pcn-border-cheat]
1464	   describing how re-ECN can be used in an edge-to-edge rather than end-
1465	   to-end scenario.  It can then be used by downstream networks to
1466	   police whether upstream networks are blocking new flow reservations
1467	   when downstream congestion is too high, even though the congestion is
1468	   in other operators' downstream networks.  This relates to current
1469	   IETF work on Admission Control over Diffserv using Pre-Congestion
1470	   Notification (PCN) [RFC5559].

1472	6.2.3.  Guidelines for adding Re-ECN to DCCP

1474	   Besides adjusting the initial features negotiation sequence,
1475	   operating re-ECN in DCCP [RFC4340] could be achieved by defining a
1476	   new option to be added to acknowledgments that would include a
1477	   multibit field where the destination could copy its ECC.

1479	6.2.4.  Guidelines for adding Re-ECN to SCTP

1481	   Appendix A in [RFC4960] gives the specifications for SCTP to support
1482	   ECN.  Similar steps should be taken to support re-ECN.  Besides
1483	   adjusting the initial features negotiation sequence, operating re-ECN
1484	   in SCTP could be achieved by defining a new control chunk that would
1485	   include a multibit field where the destination could copy its ECC.

1487	7.  Incremental Deployment

1489	   The design of the re-ECN protocol started from the fact that the
1490	   current ECN marking behaviour of queues was sufficient and that re-
1491	   feedback could be introduced around these queues by changing the
1492	   sender behaviour but not the routers.  Otherwise, if we had required
1493	   routers to be changed, the chance of encountering a path that had
1494	   every router upgraded would be vanishingly small during early
1495	   deployment, giving no incentive to start deployment.  Also, as there
1496	   is no new forwarding behaviour, routers and hosts do not have to
1497	   signal or negotiate anything.

1499	   However, networks that choose to protect themselves using re-ECN do
1500	   have to add new security functions at their trust boundaries with
1501	   others.  They distinguish legacy traffic by its ECN field.  Traffic
1502	   from Not-ECT transports is distinguishable by its Not-ECT marking.
1503	   Traffic from RFC3168 compliant ECN transports is distinguished from
1504	   re-ECN by which of ECT(0) or ECT(1) is used.  We chose to use ECT(1)
1505	   for re-ECN traffic deliberately.  Existing ECN sources set ECT(0) on
1506	   either 50% (the nonce) or 100% (the default) of packets, whereas re-
1507	   ECN does not use ECT(0) at all.  We can use this distinguishing
1508	   feature of RFC3168 compliant ECN traffic to separate it out for
1509	   different treatment at the various border security functions: egress
1510	   dropping, ingress policing and border policing.

1512	   The general principle we adopt is that an egress dropper will not
1513	   drop any legacy traffic, but ingress and border policers will limit
1514	   the bulk rate of legacy traffic (Not-ECT, ECT(0) and those marked
1515	   with the unused codepoint) that can enter each network.  Then, during
1516	   early re-ECN deployment, operators can set very permissive (or non-
1517	   existent) rate-limits on legacy traffic, but once re-ECN
1518	   implementations are generally available, legacy traffic can be rate-
1519	   limited increasingly harshly.  Ultimately, an operator might choose
1520	   to block all legacy traffic entering its network, or at least only
1521	   allow through a trickle.

1523	   Then, the more strictly the limits are set, the more RFC3168 ECN
1524	   sources will gain by upgrading to re-ECN.  Thus, towards the end of
1525	   the voluntary incremental deployment period, RFC3168 compliant
1526	   transports can be given progressively stronger encouragement to
1527	   upgrade.

1529	   The following list of minor changes brings together all the points
1530	   where re-ECN semantics for use of the two-bit ECN field are different
1531	   compared to RFC3168:

1533	   o  A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender
1534	      sets ECT(0) by default (Section 4.3);

1536	   o  No provision is necessary for a re-ECN capable source transport to
1537	      use the ECN nonce (Section 6.1.2.1);

1539	   o  Routers MAY preferentially drop different extended ECN codepoints
1540	      (Section 5.3);

1542	   o  Packets carrying the feedback not established (FNE) codepoint MAY
1543	      be marked rather than dropped by routers, even though
1544	      their ECN field is Not-ECT (with the important caveat in
1545	      Section 5.3);

1547	   o  Packets may be dropped by policing nodes because of apparent
1548	      misbehaviour, not just because of congestion;

1550	   o  Tunnel entry behaviour is still to be defined, but may have to be
1551	      different from RFC3168 (Section 5.6).

1553	   None of these changes REQUIRE any modifications to routers.  Also
1554	   none of these changes affect anything about end to end congestion
1555	   control; they are all to do with allowing networks to police that end
1556	   to end congestion control is well-behaved.

1558	8.  Related Work

1560	8.1.  Congestion Notification Integrity

1562	   The choice of two ECT code-points in the ECN field [RFC3168]
1563	   permitted future flexibility, optionally allowing the sender to
1564	   encode the experimental ECN nonce [RFC3540] in the packet stream.
1565	   This mechanism has since been included in the specifications of DCCP
1566	   [RFC4340].

1568	   The ECN nonce is an elegant scheme that allows the sender to detect
1569	   if someone in the feedback loop - the receiver especially - tries to
1570	   claim no congestion was experienced when in fact congestion led to
1571	   packet drops or ECN marks.  For each packet it sends, the sender
1572	   chooses between the two ECT codepoints in a pseudo-random sequence.

1574	   Then, whenever the network marks a packet with CE, if the receiver
1575	   wants to deny congestion happened, she has to guess which ECT
1576	   codepoint was overwritten.  She has only a 50:50 chance of being
1577	   correct each time she denies a congestion mark or a drop, which
1578	   ultimately will give her away.
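
	   To illustrate why repeated denial gives the receiver away, the
	   sketch below computes the chance of guessing correctly on every one
	   of n denials.  It assumes each guess is an independent 50:50
	   choice; the function name is hypothetical.

```python
from fractions import Fraction


def undetected_probability(denials):
    """Chance that every one of `denials` independent 50:50 guesses is right."""
    return Fraction(1, 2) ** denials


for n in (1, 5, 10):
    print(n, undetected_probability(n))
# 1 1/2
# 5 1/32
# 10 1/1024
```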

1580	   The purpose of a network-layer nonce should primarily be protection
1581	   of the network, while a transport-layer nonce would be better used to
1582	   protect the sender from cheating receivers.  Now, the assumption
1583	   behind the ECN nonce is that a sender will want to detect whether a
1584	   receiver is suppressing congestion feedback.  This is only true if
1585	   the sender's interests are aligned with the network's, or with the
1586	   community of users as a whole.  This may be true for certain large
1587	   senders, who are under close scrutiny and have a reputation to
1588	   maintain.  But we have to deal with a more hostile world, where
1589	   traffic may be dominated by peer-to-peer transfers, rather than
1590	   downloads from a few popular sites.  Often the `natural' self-
1591	   interest of a sender is not aligned with the interests of other
1592	   users.  A sender often wants to transfer data quickly just as much
1593	   as the receiver wants to receive it quickly.

1595	   In contrast, the re-ECN protocol enables policing of an agreed rate-
1596	   response to congestion (e.g. TCP-friendliness) at the sender's
1597	   interface with the internetwork.  It also ensures downstream networks
1598	   can police their upstream neighbours, to encourage them to police
1599	   their users in turn.  But most importantly, it requires the sender to
1600	   declare path congestion to the network and it can remove traffic at
1601	   the egress if this declaration is dishonest.  So it can police
1602	   correctly, irrespective of whether the receiver tries to suppress
1603	   congestion feedback or whether the sender ignores genuine congestion
1604	   feedback.  Therefore the re-ECN protocol addresses a much wider range
1605	   of cheating problems, which includes the one addressed by the ECN
1606	   nonce.

1608	9.  Security Considerations

1610	   This whole memo concerns the deployment of a secure congestion
1611	   control framework.  However, below we list some specific security
1612	   issues that we are still working on:

1614	   o  Malicious users have the ability to launch dynamically changing
1615	      attacks, exploiting the time it takes to detect an attack, given
1616	      ECN marking is binary.  We are concentrating on subtle
1617	      interactions between the ingress policer and the egress dropper in
1618	      an effort to make it impossible to game the system.

1620	   o  There is an inherent need for at least some flow state at the
1621	      egress dropper given the binary marking environment, which leads
1622	      to an apparent vulnerability to state exhaustion attacks.  An
1623	      egress dropper design with bounded flow state is in write-up.

1625	   o  A malicious source can spoof another user's address and send
1626	      negative traffic to the same destination in order to fool the
1627	      dropper into sanctioning the other user's flow.  To prevent or
1628	      mitigate these two different kinds of DoS attack, against the
1629	      dropper and against given flows, we are considering various
1630	      protection mechanisms.

1632	   o  A malicious client can send requests using a spoofed source
1633	      address to a server (such as a DNS server) that tends to respond
1634	      with single packet responses.  This server will then be tricked
1635	      into having to set FNE on the first (and only) packet of all these
1636	      wasted responses.  Given packets marked FNE are worth +1, this
1637	      will cause such servers to consume more of their allowance to
1638	      cause congestion than they would wish to.  In general, re-ECN is
1639	      deliberately designed so that single packet flows have to bear the
1640	      cost of not discovering the congestion state of their path.  One
1641	      of the reasons for introducing re-ECN is to encourage short flows
1642	      to make use of previous path knowledge by moving the cost of this
1643	      lack of knowledge to sources that create short flows.  Therefore,
1644	      in the long run we might expect services like DNS to aggregate
1645	      single packet flows into connections where it brings benefits.
1646	      However, this attack where DNS requests are made from spoofed
1647	      addresses genuinely forces the server to waste its resources.  The
1648	      only mitigating feature is that the attacker has to set FNE on
1649	      each of its requests if they are to get through an egress dropper
1650	      to a DNS server.  The attacker therefore has to consume as many
1651	      resources as the victim, which at least implies re-ECN does not
1652	      unwittingly amplify this attack.

1654	   Having highlighted outstanding security issues, we now explain the
1655	   design decisions that were taken based on a security-related
1656	   rationale.  It may seem that the six codepoints of the eight made
1657	   available by extending the ECN field with the RE flag have been used
1658	   rather wastefully to encode just five states.  In effect the RE flag
1659	   has been used as an orthogonal single bit, using up four codepoints
1660	   to encode the three states of positive, neutral and negative worth.
1661	   The mapping of the codepoints in an earlier version of this proposal
1662	   used the codepoint space more efficiently, but the scheme became
1663	   vulnerable to network operators bypassing congestion penalties by
1664	   focusing congestion marking on positive packets.  Appendix B explains
1665	   why fixing that problem while allowing for incremental deployment,
1666	   would have used another codepoint anyway.  So it was better to use
1667	   this orthogonal encoding scheme, which greatly simplified the whole
1668	   protocol and brought with it some subtle security benefits (see the
1669	   last paragraph of Appendix B).

1671	   With the scheme as now proposed, once the RE flag is set or cleared
1672	   by the sender or its proxy, it should not be written by the network,
1673	   only read.  So the endpoints can detect if any network maliciously
1674	   alters the RE flag.  IPSec AH integrity checking does not cover the
1675	   IPv4 option flags (they were considered mutable---even the one we
1676	   propose using for the RE flag that was `currently unused' when IPSec
1677	   was defined).  But it would be sufficient for a pair of endpoints to
1678	   make random checks on whether the RE flag was the same when it
1679	   reached the egress as when it left the ingress.  Indeed, if IPSec AH
1680	   had covered the RE flag, any network intending to alter sufficient RE
1681	   flags to make a gain would have focused its alterations on packets
1682	   without authenticating headers (AHs).

1684	   The security of re-ECN has been deliberately designed to not rely on
1685	   cryptography.

1687	10.  IANA Considerations

1689	   This memo includes no request to IANA (yet).

1691	   If this memo were to progress to the standards track, it would list:

1693	   o  The new RE flag in IPv4 (Section 5.1) and its extension with the
1694	      ECN field to create a new set of extended ECN (EECN) codepoints;

1696	   o  The definition of the EECN codepoints for default Diffserv PHBs
1697	      (Section 4.2);

1699	   o  The new extension header for IPv6 (Section 5.2);

1701	   o  The new combinations of flags in the TCP header for capability
1702	      negotiation (Section 6.1.3).

1704	11.  Conclusions

1706	   {ToDo:}

1708	12.  Acknowledgements

1710	   Sebastien Cazalet and Andrea Soppera contributed to the idea of re-
1711	   feedback.  All the following have given helpful comments: Andrea
1712	   Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley,
1713	   Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright,
1714	   John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru
1715	   Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd
1716	   (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark
1717	   Handley (who developed the attack with canceled packets), Adam
1718	   Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft
1719	   (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who
1720	   complemented our own dummy traffic attacks with others), Liz Maida
1721	   (MIT), and comments from participants in the CRN/CFP Broadband and
1722	   DoS-resistant Internet working groups.  A special thank you to
1723	   Alessandro Salvatori for coming up with fiendish attacks on re-ECN.

1725	13.  Comments Solicited

1727	   Comments and questions are encouraged and very welcome.  They can be
1728	   addressed to the IETF Transport Area working group's mailing list
1729	   <tsvwg@ietf.org>, and/or to the authors.

1731	14.  References

1733	14.1.  Normative References

1735	   [I-D.ietf-tsvwg-ecn-tunnel]    Briscoe, B., "Tunnelling of Explicit
1736	                                  Congestion Notification",
1737	                                  draft-ietf-tsvwg-ecn-tunnel-10 (work
1738	                                  in progress), August 2010.

1740	   [RFC2119]                      Bradner, S., "Key words for use in
1741	                                  RFCs to Indicate Requirement Levels",
1742	                                  BCP 14, RFC 2119, March 1997.

1744	   [RFC2581]                      Allman, M., Paxson, V., and W.
1745	                                  Stevens, "TCP Congestion Control",
1746	                                  RFC 2581, April 1999.

1748	   [RFC3168]                      Ramakrishnan, K., Floyd, S., and D.
1749	                                  Black, "The Addition of Explicit
1750	                                  Congestion Notification (ECN) to IP",
1751	                                  RFC 3168, September 2001.

1753	   [RFC3390]                      Allman, M., Floyd, S., and C.
1754	                                  Partridge, "Increasing TCP's Initial
1755	                                  Window", RFC 3390, October 2002.

1757	   [RFC4302]                      Kent, S., "IP Authentication Header",
1758	                                  RFC 4302, December 2005.

1760	   [RFC4305]                      Eastlake, D., "Cryptographic Algorithm
1761	                                  Implementation Requirements for
1762	                                  Encapsulating Security Payload (ESP)
1763	                                  and Authentication Header (AH)",
1764	                                  RFC 4305, December 2005.

1766	   [RFC4340]                      Kohler, E., Handley, M., and S. Floyd,
1767	                                  "Datagram Congestion Control Protocol
1768	                                  (DCCP)", RFC 4340, March 2006.

1770	   [RFC4341]                      Floyd, S. and E. Kohler, "Profile for
1771	                                  Datagram Congestion Control Protocol
1772	                                  (DCCP) Congestion Control ID 2: TCP-
1773	                                  like Congestion Control", RFC 4341,
1774	                                  March 2006.

1776	   [RFC4342]                      Floyd, S., Kohler, E., and J. Padhye,
1777	                                  "Profile for Datagram Congestion
1778	                                  Control Protocol (DCCP) Congestion
1779	                                  Control ID 3: TCP-Friendly Rate
1780	                                  Control (TFRC)", RFC 4342, March 2006.

1782	   [RFC4960]                      Stewart, R., "Stream Control
1783	                                  Transmission Protocol", RFC 4960,
1784	                                  September 2007.

1786	   [RFC5562]                      Kuzmanovic, A., Mondal, A., Floyd, S.,
1787	                                  and K. Ramakrishnan, "Adding Explicit
1788	                                  Congestion Notification (ECN)
1789	                                  Capability to TCP's SYN/ACK Packets",
1790	                                  RFC 5562, June 2009.

1792	14.2.  Informative References

1794	   [ARI05]                        Adams, J., Roberts, L., and A.
1795	                                  IJsselmuiden, "Changing the Internet
1796	                                  to Support Real-Time Content Supply
1797	                                  from a Large Fraction of Broadband
1798	                                  Residential Users", BT Technology
1799	                                  Journal (BTTJ) 23(2), April 2005.

1801	   [I-D.re-pcn-border-cheat]      Briscoe, B., "Emulating Border Flow
1802	                                  Policing using Re-PCN on Bulk Data",
1803	                                  draft-briscoe-re-pcn-border-cheat-03
1804	                                  (work in progress), October 2009.

1806	   [I-D.tsvwg-re-ecn-motivation]  Briscoe, B., Jacquet, A., Moncaster,
1807	                                  T., and A. Smith, "Re-ECN: A Framework
1808	                                  for adding Congestion Accountability
1809	                                  to TCP/IP", draft-briscoe-tsvwg-re-
1810	                                  ecn-tcp-motivation-02 (work in
1811	                                  progress), October 2010.

1813	   [RFC2309]                      Braden, B., Clark, D., Crowcroft, J.,
1814	                                  Davie, B., Deering, S., Estrin, D.,
1815	                                  Floyd, S., Jacobson, V., Minshall, G.,
1816	                                  Partridge, C., Peterson, L.,
1817	                                  Ramakrishnan, K., Shenker, S.,
1818	                                  Wroclawski, J., and L. Zhang,
1819	                                  "Recommendations on Queue Management
1820	                                  and Congestion Avoidance in the
1821	                                  Internet", RFC 2309, April 1998.

1823	   [RFC2475]                      Blake, S., Black, D., Carlson, M.,
1824	                                  Davies, E., Wang, Z., and W. Weiss,
1825	                                  "An Architecture for Differentiated
1826	                                  Services", RFC 2475, December 1998.

1828	   [RFC2988]                      Paxson, V. and M. Allman, "Computing
1829	                                  TCP's Retransmission Timer", RFC 2988,
1830	                                  November 2000.

1832	   [RFC3124]                      Balakrishnan, H. and S. Seshan, "The
1833	                                  Congestion Manager", RFC 3124,
1834	                                  June 2001.

1836	   [RFC3514]                      Bellovin, S., "The Security Flag in
1837	                                  the IPv4 Header", RFC 3514,
1838	                                  April 2003.

1840	   [RFC3540]                      Spring, N., Wetherall, D., and D. Ely,
1841	                                  "Robust Explicit Congestion
1842	                                  Notification (ECN) Signaling with
1843	                                  Nonces", RFC 3540, June 2003.

1845	   [RFC4301]                      Kent, S. and K. Seo, "Security
1846	                                  Architecture for the Internet
1847	                                  Protocol", RFC 4301, December 2005.

1849	   [RFC5129]                      Davie, B., Briscoe, B., and J. Tay,
1850	                                  "Explicit Congestion Marking in MPLS",
1851	                                  RFC 5129, January 2008.

1853	   [RFC5559]                      Eardley, P., "Pre-Congestion
1854	                                  Notification (PCN) Architecture",
1855	                                  RFC 5559, June 2009.

1857	   [Re-fb]                        Briscoe, B., Jacquet, A., Di Cairano-
1858	                                  Gilfedder, C., Salvatori, A., Soppera,
1859	                                  A., and M. Koyabe, "Policing
1860	                                  Congestion Response in an Internetwork
1861	                                  Using Re-Feedback", ACM SIGCOMM
1862	                                  CCR 35(4)277--288, August 2005,
1864	                                  <http://www.acm.org/sigs/sigcomm/
1865	                                  sigcomm2005/techprog.html#session8>.

1867	   [Savage99]                     Savage, S., Cardwell, N., Wetherall,
1868	                                  D., and T. Anderson, "TCP congestion
1869	                                  control with a misbehaving receiver",
1870	                                  ACM SIGCOMM CCR 29(5), October 1999,
1871	                                  <http://citeseer.ist.psu.edu/
1872	                                  savage99tcp.html>.

1874	   [Steps_DoS]                    Handley, M. and A. Greenhalgh, "Steps
1875	                                  towards a DoS-resistant Internet
1876	                                  Architecture", Proc. ACM SIGCOMM
1877	                                  workshop on Future directions in
1878	                                  network architecture (FDNA'04) pp
1879	                                  49--56, August 2004.

1881	   [tcp-rcv-cheat]                Moncaster, T., Briscoe, B., and A.
1882	                                  Jacquet, "A TCP Test to Allow Senders
1883	                                  to Identify Receiver Non-Compliance",
1884	                                  draft-moncaster-tcpm-rcv-cheat-02
1885	                                  (work in progress), November 2007.

1887	Appendix A.  Precise Re-ECN Protocol Operation

1889	   {ToDo: fix this}

1891	   The protocol operation in the middle described in Section 4.3 was an
1892	   approximation.  In fact, standard ECN router marking combines 1% and
1893	   2% marking into slightly less than 3% whole-path marking, because
1894	   routers deliberately mark CE whether or not it has already been
1895	   marked by another router upstream.  So the combined marking fraction
1896	   would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%.

1898	   To generalise this we will need some notation.

1900	   o  j represents the index of each resource (typically queues) along a
1901	      path, ranging from 0 at the first router to n-1 at the last.

1903	   o  m_j represents the fraction of octets *m*arked CE by a particular
1904	      router (whether or not they are already marked) because of
1905	      congestion of resource j.

1907	   o  u_j represents congestion *u*pstream of resource j, being the
1908	      fraction of CE marking in arriving packet headers (before
1909	      marking).

1911	   o  p_j represents *p*ath congestion, being the fraction of packets
1912	      arriving at resource j with the RE flag blanked (excluding Not-
1913	      RECT packets).

1915	   o  v_j denotes expected congestion downstream of resource j, which
1916	      can be thought of as a *v*irtual marking fraction, being derived
1917	      from two other marking fractions.

1919	   Observed fractions of each particular codepoint (u, p and v) and
1920	   router marking rate m are dimensionless fractions, being the ratio of
1921	   two data volumes (marked and total) over a monitoring period.  All
1922	   measurements are in terms of octets, not packets, assuming that line
1923	   resources are more congestible than packet processing.

1925	   The path congestion (RE blanking fraction) set by the sender should
1926	   reflect the upstream congestion (CE marking fraction) fed back from
1927	   the destination.  Therefore in the steady state

1929	      p_0  = u_n
1930	           = 1 - (1 - m_0)(1 - m_1)...(1 - m_(n-1))

1932	   Similarly, at some point j in the middle of the network, if p = 1 -
1933	   (1 - u_j)(1 - v_j), then

1935	      v_j  = 1 - (1 - p)/(1 - u_j)

1937	          ~= p - u_j;                      if u_j << 100%

1939	   So, between the two routers in the example in Section 4.3, congestion
1940	   downstream is

1942	      v_1  = 100.00% - (100% - 2.98%) / (100% - 1.00%)
1943	           = 2.00%,

1945	   or a useful approximation of downstream congestion is

1947	      v_1 ~= 2.98% - 1.00%
1948	          ~= 1.98%.
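
	   The arithmetic above can be checked with a short sketch.  The
	   function names below are illustrative only and are not defined
	   anywhere in this memo; marking fractions are passed as plain
	   floating point values, and each resource is assumed to mark
	   independently of the others, as described at the start of this
	   appendix.

```python
from functools import reduce


def combined_marking(fractions):
    """Whole-path CE fraction: 1 - (1 - m_0)(1 - m_1)...

    Each resource marks whether or not the packet is already marked,
    so the unmarked fractions multiply.
    """
    unmarked = reduce(lambda acc, m: acc * (1.0 - m), fractions, 1.0)
    return 1.0 - unmarked


def downstream_congestion(p, u):
    """Exact downstream congestion v_j = 1 - (1 - p)/(1 - u_j)."""
    return 1.0 - (1.0 - p) / (1.0 - u)


# The two-router example: 1% marking followed by 2% marking.
p = combined_marking([0.01, 0.02])        # whole-path congestion
v_exact = downstream_congestion(p, 0.01)  # seen between the two routers
v_approx = p - 0.01                       # approximation for small u_j

print(f"p = {p:.2%}, v_1 = {v_exact:.2%}, v_1 ~= {v_approx:.2%}")
```

	   The gap between the exact and approximate values grows with the
	   upstream marking fraction, which is why the approximation is only
	   suggested for u_j << 100%.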

1950	Appendix B.  Justification for Two Codepoints Signifying Zero Worth
1951	             Packets

1953	   It may seem a waste of a codepoint to set aside two codepoints of the
1954	   Extended ECN field to signify zero worth (RECT and CE(0) are both
1955	   worth zero).  The justification is subtle, but worth recording.

1957	   The original version of Re-ECN ([Re-fb] and draft-00 of this memo)
1958	   used three codepoints for neutral (ECT(1)), positive (ECT(0)) and
1959	   negative (CE) packets.  The sender set packets to neutral unless re-
1960	   echoing congestion, when it set them positive, in much the same way
1961	   that it blanks the RE flag in the current protocol.  However, routers
1962	   were meant to mark congestion by setting packets negative (CE)
1963	   irrespective of whether they had previously been neutral or positive.

1965	   However, we did not arrange for senders to remember which packet had
1966	   been sent with which codepoint, or for feedback to say exactly which
1967	   packets arrived with which codepoints.  The transport was meant to
1968	   inflate the number of positive packets it sent to allow for a few
1969	   being wiped out by congestion marking.  We (wrongly) assumed that
1970	   routers would congestion mark packets indiscriminately, so the
1971	   transport could infer how many positive packets had been marked and
1972	   compensate accordingly by re-echoing.  But this created a perverse
1973	   incentive for routers to preferentially congestion mark positive
1974	   packets rather than neutral ones.

1976	   We could have removed this perverse incentive by requiring Re-ECN
1977	   senders to remember which packets they had sent with which codepoint,
1978	   and by requiring feedback from the receiver to identify which packets
1979	   arrived as which.  Then, if a positive packet was congestion marked
1980	   to negative, the sender could have re-echoed twice to maintain the
1981	   balance between positive and negative at the receiver.

1983	   Instead, we chose to make re-echoing congestion (blanking RE)
1984	   orthogonal to congestion notification (marking CE), which required a
1985	   second neutral codepoint.  Then the receiver would be able to detect
1986	   and echo a congestion event even if it arrived on a packet that had
1987	   originally been positive.

1989	   If we had added extra complexity to the sender and receiver
1990	   transports to track changes to individual packets, we could have made
1991	   it work, but then routers would have had an incentive to mark
1992	   positive packets with half the probability of neutral packets.  That
1993	   in turn would have led router algorithms to become more complex.
1994	   Then senders wouldn't know whether a mark had been introduced by a
1995	   simple or a complex router algorithm.  That in turn would have
1996	   required another codepoint to distinguish between RFC3168 ECN and new
1997	   Re-ECN router marking.

1999	   Once the cost of IP header codepoint real-estate was the same for
2000	   both schemes, there was no doubt that the simpler option for
2001	   endpoints and for routers should be chosen.  The resulting protocol
2002	   also no longer needed the tricky inflation/deflation complexity of
2003	   the original (broken) scheme.  It was also much simpler to understand
2004	   conceptually.

2006	   A further advantage of the new orthogonal four-codepoint scheme was
2007	   that senders owned sole rights to change the RE flag and routers
2008	   owned sole rights to change the ECN field.  Although we still arrange
2009	   the incentives so neither party strays outside their dominion, these
2010	   clear lines of authority simplify the matter.

2012	   Finally, a little redundancy can be very powerful in a scheme such as
2013	   this.  In one flow, the proportion of packets changed to CE should be
2014	   the same as the proportion of RECT packets changed to CE(-1) and the
2015	   proportion of Re-Echo packets changed to CE(0).  Double checking
2016	   using such redundant relationships can improve the security of a
2017	   scheme (cf. double-entry book-keeping or the ECN Nonce).
2018	   Alternatively, it might be necessary to exploit the redundancy in the
2019	   future to encode an extra information channel.
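
	   As an illustration of the redundancy check described above, the
	   following sketch compares the CE marking fraction seen on neutral
	   packets with that seen on positive packets.  It is only a sketch
	   under stated assumptions: the function names, the dictionary of
	   per-codepoint arrival counts and the tolerance value are all
	   hypothetical, FNE packets are ignored, and it assumes that packets
	   sent as RECT arrive as either RECT or CE(-1) and that packets sent
	   as Re-Echo arrive as either Re-Echo or CE(0).

```python
def marking_fractions(counts):
    """Return (fraction of neutral packets marked, fraction of positive
    packets marked) from arrival counts keyed by codepoint name.

    A fraction is None when no packets of that class were observed.
    """
    neutral_marked = counts.get("CE(-1)", 0)
    neutral_total = counts.get("RECT", 0) + neutral_marked
    positive_marked = counts.get("CE(0)", 0)
    positive_total = counts.get("Re-Echo", 0) + positive_marked

    on_neutral = neutral_marked / neutral_total if neutral_total else None
    on_positive = positive_marked / positive_total if positive_total else None
    return on_neutral, on_positive


def marking_looks_unbiased(counts, tolerance=0.05):
    """True if both classes were marked in similar proportions.

    Returns None when one class is empty and no comparison is possible.
    The tolerance is an arbitrary illustrative value.
    """
    on_neutral, on_positive = marking_fractions(counts)
    if on_neutral is None or on_positive is None:
        return None
    return abs(on_neutral - on_positive) <= tolerance


fair = {"RECT": 9700, "CE(-1)": 300, "Re-Echo": 291, "CE(0)": 9}
biased = {"RECT": 9990, "CE(-1)": 10, "Re-Echo": 150, "CE(0)": 150}
print(marking_looks_unbiased(fair), marking_looks_unbiased(biased))
```

	   A real check would need a statistical test that allows for sample
	   size; a fixed tolerance is used here only to keep the sketch short.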

2021	Appendix C.  ECN Compatibility

2023	   The rationale for choosing the particular combinations of SYN and SYN
2024	   ACK flags in Section 6.1.3 is as follows.

2026	   Choice of SYN flags:  A Re-ECN sender can work with RFC3168 compliant
2027	      ECN receivers so we wanted to use the same flags as would be used
2028	      in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1).  But at the same
2029	      time, we wanted a server (host B) that is Re-ECT to be able to
2030	      recognise that the client (A) is also Re-ECT.  We believe also
2031	      setting NS=1 in the initial SYN achieves both these objectives, as
2032	      it should be ignored by RFC3168 compliant ECT receivers and by
2033	      ECT-Nonce receivers.  But senders that are not Re-ECT should not
2034	      set NS=1.  At the time ECN was defined, the NS flag was not
2035	      defined, so setting NS=1 should be ignored by existing ECT
2036	      receivers (but testing against implementations may yet prove
2037	      otherwise).  The ECN Nonce RFC [RFC3540] is silent on what the NS
2038	      field might be set to in the TCP SYN, but we believe the intent
2039	      was for a nonce client to set NS=0 in the initial SYN (again only
2040	      testing will tell).  Therefore we define a Re-ECN-setup SYN as one
2041	      with NS=1, CWR=1 & ECE=1.

2043	   Choice of SYN ACK flags:  The client (A) needs to
2044	      be able to determine whether the server (B) is Re-ECT.  The
2045	      original ECN specification required an ECT server to respond to an
2046	      ECN-setup SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1.  There
2047	      is no room to modify this by setting the NS flag, as that is
2048	      already set in the SYN ACK of an ECT-Nonce server.  So we used the
2049	      only combination of CWR and ECE that would not be used by existing
2050	      TCP receivers: CWR=1 and ECE=0.  The original ECN specification
2051	      defines this combination as a non-ECN-setup SYN ACK, which remains
2052	      true for RFC3168 compliant and Nonce ECTs.  But for Re-ECN we
2053	      define it as a Re-ECN-setup SYN ACK.  We didn't use a SYN ACK with
2054	      both CWR and ECE cleared to 0 because that would be the likely
2055	      response from most Not-ECT receivers.  And we didn't use a SYN ACK
2056	      with both CWR and ECE set to 1 either, as at least one broken
2057	      receiver implementation echoes whatever flags were in the SYN into
2058	      its SYN ACK.  Therefore we define a Re-ECN-setup SYN ACK as one
2059	      with CWR=1 & ECE=0.

2061	   Choice of two alternative SYN ACKs:  the NS flag may take either
2062	      value in a Re-ECN-setup SYN ACK.  Section 5.4 REQUIRES that a Re-
2063	      ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK to
2064	      echo congestion experienced (CE) on the initial SYN.  Otherwise a
2065	      Re-ECN-setup SYN ACK MUST be returned with NS=0.  The only current
2066	      known use of the NS flag in a SYN ACK is to indicate support for
2067	      the ECN nonce, which will be negotiated by setting CWR=0 & ECE=1.
2068	      Given the ECN nonce MUST NOT be used for a RECN mode connection, a
2069	      Re-ECN-setup SYN ACK can use either setting of the NS flag without
2070	      any risk of confusion, because the CWR & ECE flags will be
2071	      reversed relative to those used by an ECN nonce SYN ACK.
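
	   The flag combinations discussed in this appendix can be summarised
	   by the following sketch.  The function names and returned labels
	   are illustrative only; the SYN ACK classification is written from
	   the point of view of a client that sent a Re-ECN-setup SYN, and the
	   three flags are passed as booleans rather than parsed from a TCP
	   header.

```python
def classify_syn(ns, cwr, ece):
    """Classify an initial SYN by its NS, CWR and ECE flags."""
    if cwr and ece:
        # NS=1 distinguishes a Re-ECN client from an RFC3168 client.
        return "Re-ECN-setup SYN" if ns else "ECN-setup SYN"
    return "non-ECN-setup SYN"


def classify_syn_ack(ns, cwr, ece):
    """Classify a SYN ACK as seen by a client that sent a Re-ECN-setup SYN."""
    if cwr and not ece:
        # Either NS value is valid; NS=1 echoes CE seen on the SYN.
        if ns:
            return "Re-ECN-setup SYN ACK, CE echoed"
        return "Re-ECN-setup SYN ACK"
    if ece and not cwr:
        # RFC3168 ECN-setup SYN ACK; NS=1 indicates ECN nonce support.
        return "ECN-nonce SYN ACK" if ns else "ECN-setup SYN ACK"
    # Both flags clear (Not-ECT server) or both set (flags echoed back
    # by a broken receiver): no ECN capability is assumed.
    return "non-ECN-setup SYN ACK"


print(classify_syn(ns=True, cwr=True, ece=True))
print(classify_syn_ack(ns=False, cwr=True, ece=False))
```

	   Treating CWR=1 & ECE=1 in a SYN ACK as non-ECN-setup reflects the
	   concern above about receivers that echo the SYN flags unchanged.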

2073	Appendix D.  Packet Marking with FNE During Flow Start

2075	   FNE (feedback not established) packets have two functions.  Their
2076	   main role is to announce the start of a new flow when feedback has
2077	   not yet been established.  However they also have the role of
2078	   balancing the expected feedback and can be used where there are
2079	   sudden changes in the rate of transmission.  Whilst such changes
2080	   should not happen under TCP, this speculative marking role underpins
2081	   the following argument as to why the first and third packets should
2082	   be set to FNE.

2084	   The proportion of FNE packets in each roundtrip should be a high
2085	   estimate of the potential error in the balance of number of
2086	   congestion marked packets versus number of re-echo packets already
2087	   issued.

2089	   Let's call:

2091	      S: the number of the TCP segments sent so far

2093	      F: the number of FNE packets sent so far

2095	      R: the number of Re-Echo packets sent so far

2097	      A: the number of acknowledgments received so far

2099	      C: the number of acknowledgments echoing a CE packet

2101	   In normal operation, when we want to send packet S+1, we first need
2102	   to check that enough Re-Echo packets have been issued:

2104	   If R<C, then S+1 will be a Re-echo packet

2106	   Next we need to estimate the amount of congestion observed so far.
2107	   If congestion was stationary, it could be estimated as C/A. A
2108	   pessimistic bound is (C+1)/(A+1) which assumes that the next
2109	   acknowledgment will echo a CE packet; we'll use that more pessimistic
2110	   estimate to drive the generation of FNE packets.

2112	   The number of CE packets expected when (S+1) will be acknowledged is
2113	   therefore (S+1)*(C+1)/(A+1).  Packet S+1 should be set to FNE if that
2114	   expected value exceeds the sum of FNE and Re-Echo packets sent so
2115	   far.

2117	      If  (F+R)<(S+1)*(C+1)/(A+1),
2118	        then S+1 will be set to FNE
2119	        else S+1 will be set to RECT

2121	   So the full test should be:

2123	      When packet (S+1) is about to be sent...
2124	        If R<C,
2125	           then S+1 will be set to Re-Echo
2126	        Else if  (F+R)<(S+1)*(C+1)/(A+1),
2127	          then S+1 will be set to FNE
2128	        Else S+1 will be set to RECT

2130	   This means that at any point, given A, R, F, C, the source could send
2131	   another k RECT packets, so that k < (F+R)*(A+1)/(C+1)-S
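
	   The full test and the bound on k can be sketched as follows.  The
	   function names are illustrative only; the counters S, F, R, A and C
	   are those defined above.  The comparisons are cross-multiplied so
	   that only integer arithmetic is needed, which is an implementation
	   choice of this sketch rather than anything required by the scheme.

```python
def next_marking(S, F, R, A, C):
    """Marking for packet S+1 given the counters defined in the text."""
    if R < C:
        # Not enough Re-Echo packets issued for the CE echoes received.
        return "Re-Echo"
    # (F+R) < (S+1)*(C+1)/(A+1), multiplied through by (A+1).
    if (F + R) * (A + 1) < (S + 1) * (C + 1):
        return "FNE"
    return "RECT"


def rect_budget(S, F, R, A, C):
    """Largest integer k with k < (F+R)*(A+1)/(C+1) - S, floored at 0."""
    numerator = (F + R) * (A + 1)
    denominator = C + 1
    # k + S < numerator/denominator  <=>  k + S <= (numerator - 1) // denominator
    return max(0, (numerator - 1) // denominator - S)


print(next_marking(S=0, F=0, R=0, A=0, C=0))   # first packet of a flow
print(rect_budget(S=3, F=2, R=0, A=3, C=0))
```

	   rect_budget applies the strict inequality exactly as it is stated
	   above, so it is a conservative count of the RECT packets that may
	   follow.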

2133	   The above scheme is independent of the actions of both the dropper
2134	   and policer and doesn't depend on the rate adaptation discipline of
2135	   the source.  It only defines Re-Echo packets as notification of
2136	   effective end-to-end congestion (as witnessed at the previous
2137	   roundtrip), and FNE packets as notification of speculative end-to-end
2138	   congestion based on a high estimate of congestion

2140	   In practice, for any source:

2142	   o  for the first packet, A=R=F=C=S=0 ==> 1 FNE

2144	   o  if the acknowledgment doesn't echo a mark

2146	      *  for the second packet, A=F=S=1 R=C=0 ==> 1 RECT

2148	      *  for the third packet, S=2 A=F=1 R=C=0 ==> 1 FNE

2150	   o  if no acknowledgement for these two packets echoes a congestion
2151	      mark, then {A=S=3 F=2 R=C=0} which gives k<2*4/1-3, so the source
2152	      could send another 4 RECT packets. ==> 4 RECT

2153	   o  if no acknowledgement for these four packets echoes a congestion
2154	      mark, then {A=S=7 F=2 R=C=0} which gives k<2*8/1-7, so the source
2155	      could send another 8 RECT packets. ==> 8 RECT

2157	   This behaviour happens to match TCP's congestion window control in
2158	   slow start, which is why for TCP sources, only the first and third
2159	   packet need be FNE packets.
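
	   The slow-start behaviour described above can be reproduced with a
	   small simulation.  This is a sketch under simplifying assumptions
	   that are ours: the initial window is one packet, the window doubles
	   every round trip, every packet is acknowledged individually by the
	   end of its round trip, and no acknowledgement echoes a congestion
	   mark.  The function names are illustrative only.

```python
def next_marking(S, F, R, A, C):
    """Marking for packet S+1, using the full test from this appendix."""
    if R < C:
        return "Re-Echo"
    if (F + R) * (A + 1) < (S + 1) * (C + 1):
        return "FNE"
    return "RECT"


def slow_start_markings(rounds):
    """Return the list of markings sent in each round trip."""
    S = F = R = A = C = 0
    window = 1
    per_round = []
    for _ in range(rounds):
        marks = []
        for _ in range(window):
            mark = next_marking(S, F, R, A, C)
            if mark == "FNE":
                F += 1
            elif mark == "Re-Echo":
                R += 1
            S += 1
            marks.append(mark)
        A = S          # all packets sent so far are now acknowledged
        window *= 2    # slow start: window doubles each round trip
        per_round.append(marks)
    return per_round


for number, marks in enumerate(slow_start_markings(4), start=1):
    print(number, marks)
```

	   Under these assumptions only the first and third packets are marked
	   FNE; later rounds consist entirely of RECT packets.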

2161	   A source that would open the congestion window any quicker would have
2162	   to insert more FNE packets.  As another example a UDP source sending
2163	   VBR traffic might need to send several FNE packets ahead of the
2164	   traffic peaks it generates.

2166	Appendix E.  Argument for holding back the ECN nonce

2168	   The ECN nonce is a mechanism that allows a /sending/ transport to
2169	   detect if drop or ECN marking at a congested router has been
2170	   suppressed by a node somewhere in the feedback loop---another router
2171	   or the receiver.

2173	   Space for the ECN nonce was set aside in [RFC3168] (currently
2174	   proposed standard) while the full nonce mechanism is specified in
2175	   [RFC3540] (currently experimental).  The DCCP specification [RFC4340]
2176	   (currently proposed standard) requires that "Each DCCP sender SHOULD
2177	   set ECN Nonces on its packets...".  It also mandates as a requirement
2178	   for all CCID profiles that "Any newly defined acknowledgement
2179	   mechanism MUST include a way to transmit ECN Nonce Echoes back to the
2180	   sender.", therefore:

2182	   o  The CCID profile for TCP-like Congestion Control [RFC4341]
2183	      (currently proposed standard) says "The sender will use the ECN
2184	      Nonce for data packets, and the receiver will echo those nonces in
2185	      its Ack Vectors."

2187	   o  The CCID profile for TCP-Friendly Rate Control (TFRC) [RFC4342]
2188	      recommends that "The sender [use] Loss Intervals options' ECN
2189	      Nonce Echoes (and possibly any Ack Vectors' ECN Nonce Echoes) to
2190	      probabilistically verify that the receiver is correctly reporting
2191	      all dropped or marked packets."

2193	   The primary function of the ECN nonce is to protect the integrity of
2194	   the information about congestion: ECN marks and packet drops.
2195	   However, when the nonce is used to protect the integrity of
2196	   information about packet drops, rather than ECN marks, a transport
2197	   layer nonce will always be sufficient (because a drop loses the
2198	   transport header as well as the ECN field in the network header),
2199	   which would avoid using scarce IP header codepoint space.  Similarly,
2200	   a transport layer nonce would protect against a receiver sending
2201	   early acknowledgements [Savage99].

2203	   If the ECN nonce reveals integrity problems with the information
2204	   about congestion, the sending transport can use that knowledge for
2205	   two functions:

2207	   o  to protect its own resources, by allocating them in proportion to
2208	      the rates that each network path can sustain, based on congestion
2209	      control,

2211	   o  and to protect congested routers in the network, by drastically
2212	      slowing down its connection to the destination with corrupt
2213	      congestion information.

2215	   If the sending transport chooses to act in the interests of congested
2216	   routers, it can reduce its rate if it detects that some malicious
2217	   party in the feedback loop may be suppressing ECN feedback.  But this
2218	   would only be useful to congested routers when /all/ senders using
2219	   them are trusted to act in the interests of the congested routers.

2221	   In the end, the only essential use of a network layer nonce is when
2222	   sending transports (e.g. large servers) want to allocate their /own/
2223	   resources in proportion to the rates that each network path can
2224	   sustain, based on congestion control.  In that case, the nonce allows
2225	   senders to be assured that they aren't being duped into giving more
2226	   of their own resources to a particular flow.  And if congestion
2227	   suppression is detected, the sending transport can rate limit the
2228	   offending connection to protect its own resources.  Certainly, this
2229	   is a useful function, but the IETF should carefully decide whether
2230	   such a single, very specific case warrants IP header space.

2232	   In contrast, Re-ECN allows all routers to fully protect themselves
2233	   from such attacks, without having to trust anyone: senders,
2234	   receivers or neighbouring networks.  Re-ECN is therefore proposed in
2235	   preference to the ECN nonce on the basis that it addresses the
2236	   generic problem of accountability for congestion of a network's
2237	   resources at the IP layer.

2239	   Delaying the ECN nonce is justified because the applicability of the
2240	   ECN nonce seems too limited for it to consume a two-bit codepoint in
2241	   the IP header.  It therefore seems prudent to allow time to find an
2242	   alternative way to perform the one function for which the nonce is
2243	   essential.

2245	   Moreover, while we have re-designed the Re-ECN codepoints so that
2246	   they do not prevent the ECN nonce progressing, the same is not true
2247	   the other way round.  If the ECN nonce started to see some deployment
2248	   (perhaps because it was blessed with proposed standard status),
2249	   incremental deployment of Re-ECN would effectively be impossible,
2250	   because Re-ECN marking fractions at inter-domain borders would be
2251	   polluted by unknown levels of nonce traffic.

2253	   The authors are aware that Re-ECN must prove it has the potential it
2254	   claims if it is to displace the nonce.  Therefore, every effort has
2255	   been made to complete a comprehensive specification of Re-ECN so that
2256	   its potential can be assessed.  We therefore seek the opinion of the
2257	   Internet community on whether the Re-ECN protocol is sufficiently
2258	   useful to warrant standards action.

2260	Appendix F.  Alternative Terminology Used in Other Documents

2262	   A number of alternative terms have been used in various documents
2263	   describing re-feedback and Re-ECN.  These are set out in the
2264	   following table.

2266	   +-------------------+---------------+-------------------------------+
2267	   |      Current      |      EECN     |             Colour            |
2268	   |    Terminology    |   codepoint   |                               |
2269	   +-------------------+---------------+-------------------------------+
2270	   |      Cautious     |      FNE      |             Green             |
2271	   |      Positive     |    Re-Echo    |             Black             |
2272	   |      Neutral      |      RECT     |              Grey             |
2273	   |      Negative     |     CE(-1)    |              Red              |
2274	   |     Cancelled     |     CE(0)     |           Red-Black           |
2275	   |     Legacy ECN    |     ECT(0)    |             White             |
2276	   |  Currently Unused |     --CU--    |        Currently unused       |
2277	   |                   |               |                               |
2278	   |       Legacy      |    Not-ECT    |             White             |
2279	   +-------------------+---------------+-------------------------------+

2281	                  Table 7: Alternative Re-ECN Terminology
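
   When reading older documents, the rows of Table 7 can be used as a
   simple lookup.  The sketch below is illustrative only: the names
   TERMINOLOGY and current_terms are hypothetical, and the "Currently
   Unused" row is omitted because it has no alternative name.

```python
# Rows of Table 7: (current terminology, EECN codepoint, colour).
TERMINOLOGY = [
    ("Cautious", "FNE", "Green"),
    ("Positive", "Re-Echo", "Black"),
    ("Neutral", "RECT", "Grey"),
    ("Negative", "CE(-1)", "Red"),
    ("Cancelled", "CE(0)", "Red-Black"),
    ("Legacy ECN", "ECT(0)", "White"),
    ("Legacy", "Not-ECT", "White"),
]

def current_terms(older_name):
    """Return the current terms matching an older codepoint name or colour.

    A list is returned because one colour ("White") maps to two terms.
    """
    key = older_name.strip().lower()
    return [current for current, eecn, colour in TERMINOLOGY
            if key in (eecn.lower(), colour.lower())]
```

   For example, current_terms("RECT") gives ["Neutral"], while
   current_terms("White") gives both "Legacy ECN" and "Legacy".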

2283	Authors' Addresses

2285	   Bob Briscoe (editor)
2286	   BT
2287	   B54/77, Adastral Park
2288	   Martlesham Heath
2289	   Ipswich  IP5 3RE
2290	   UK

2292	   Phone: +44 1473 645196
2293	   EMail: bob.briscoe@bt.com
2294	   URI:   http://bobbriscoe.net/

2296	   Arnaud Jacquet
2297	   BT
2298	   B54/70, Adastral Park
2299	   Martlesham Heath
2300	   Ipswich  IP5 3RE
2301	   UK

2303	   Phone: +44 1473 647284
2304	   EMail: arnaud.jacquet@bt.com
2305	   URI:

2307	   Toby Moncaster
2308	   Moncaster.com
2309	   Dukes
2310	   Layer Marney
2311	   Colchester  CO5 9UZ
2312	   UK

2314	   EMail: toby@moncaster.com

2316	   Alan Smith
2317	   BT
2318	   B54/76, Adastral Park
2319	   Martlesham Heath
2320	   Ipswich  IP5 3RE
2321	   UK

2323	   Phone: +44 1473 640404
2324	   EMail: alan.p.smith@bt.com