idnits 2.17.1 

draft-ietf-conex-tcp-modifications-10.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 542 has weird spacing: '..._flight  credi...'

  -- The document date (October 13, 2015) is 3117 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

     No issues found here.

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Congestion Exposure (ConEx)                           M. Kuehlewind, Ed.
3	Internet-Draft                                                ETH Zurich
4	Intended status: Experimental                           R. Scheffenegger
5	Expires: April 15, 2016                                     NetApp, Inc.
6	                                                        October 13, 2015

8	               TCP modifications for Congestion Exposure
9	                 draft-ietf-conex-tcp-modifications-10

11	Abstract

13	   Congestion Exposure (ConEx) is a mechanism by which senders inform
14	   the network about expected congestion based on congestion feedback
15	   from previous packets in the same flow.  This document describes the
16	   necessary modifications to use ConEx with the Transmission Control
17	   Protocol (TCP).

19	Status of This Memo

21	   This Internet-Draft is submitted in full conformance with the
22	   provisions of BCP 78 and BCP 79.

24	   Internet-Drafts are working documents of the Internet Engineering
25	   Task Force (IETF).  Note that other groups may also distribute
26	   working documents as Internet-Drafts.  The list of current Internet-
27	   Drafts is at http://datatracker.ietf.org/drafts/current/.

29	   Internet-Drafts are draft documents valid for a maximum of six months
30	   and may be updated, replaced, or obsoleted by other documents at any
31	   time.  It is inappropriate to use Internet-Drafts as reference
32	   material or to cite them other than as "work in progress."

34	   This Internet-Draft will expire on April 15, 2016.

36	Copyright Notice

38	   Copyright (c) 2015 IETF Trust and the persons identified as the
39	   document authors.  All rights reserved.

41	   This document is subject to BCP 78 and the IETF Trust's Legal
42	   Provisions Relating to IETF Documents
43	   (http://trustee.ietf.org/license-info) in effect on the date of
44	   publication of this document.  Please review these documents
45	   carefully, as they describe your rights and restrictions with respect
46	   to this document.  Code Components extracted from this document must
47	   include Simplified BSD License text as described in Section 4.e of
48	   the Trust Legal Provisions and are provided without warranty as
49	   described in the Simplified BSD License.

51	Table of Contents

53	   1.  Introduction
54	     1.1.  Requirements Language
55	   2.  Sender-side Modifications
56	   3.  Counting Congestion
57	     3.1.  Loss Detection
58	       3.1.1.  Without SACK Support
59	     3.2.  ECN
60	       3.2.1.  Accurate ECN Feedback
61	       3.2.2.  Classic ECN Support
62	   4.  Setting the ConEx Flags
63	     4.1.  Setting the E or the L Flag
64	     4.2.  Setting the Credit Flag
65	   5.  Loss of ConEx Information
66	   6.  Timeliness of the ConEx Signals
67	   7.  Open Areas for Experimentation
68	   8.  Acknowledgements
69	   9.  IANA Considerations
70	   10. Security Considerations
71	   11. References
72	     11.1.  Normative References
73	     11.2.  Informative References
74	   Appendix A.  Revision history
75	   Authors' Addresses

77	1.  Introduction

79	   Congestion Exposure (ConEx) is a mechanism by which senders inform
80	   the network about expected congestion based on congestion feedback
81	   from previous packets in the same flow.  ConEx concepts and use cases
82	   are further explained in [RFC6789].  The abstract ConEx mechanism is
83	   explained in [draft-ietf-conex-abstract-mech].  This document
84	   describes the necessary modifications to use ConEx with the
85	   Transmission Control Protocol (TCP).

87	   The markings for ConEx signaling are defined in the ConEx Destination
88	   Option (CDO) for IPv6 [draft-ietf-conex-destopt].  Specifically, the
89	   use of four flags is defined: X (ConEx-capable), L (loss
90	   experienced), E (ECN experienced) and C (credit).

92	   ConEx signaling is based on loss or Explicit Congestion Notification
93	   (ECN) marks [RFC3168] as congestion indications.  The sender collects
94	   this congestion information based on existing TCP feedback mechanisms
95	   from the receiver to the sender.  No changes are needed at the
96	   receiver to implement ConEx signaling.  Therefore no additional
97	   negotiation is needed to implement and use ConEx at the sender.  This
98	   document specifies the sender's actions that are needed to provide
99	   meaningful ConEx information to the network.

101	   Section 2 provides an overview of the modifications needed for TCP
102	   senders to implement ConEx.  First, congestion information has to be
103	   extracted from TCP's loss or ECN feedback as described in Section 3.
104	   Section 4 details how to set the CDO marking based on this congestion
105	   information.  Section 5 discusses loss of packets carrying ConEx
106	   information.  Section 6 discusses timeliness of the ConEx feedback
107	   signal, given congestion is a temporary state.

109	   This document describes congestion accounting for TCP with and
110	   without the Selective Acknowledgment (SACK) extension [RFC2018] (in
111	   section 3.1).  However, ConEx benefits from the more accurate
112	   information that SACK provides about the number of bytes dropped in
113	   the network.  It is therefore preferable to use the SACK extension
114	   when using TCP with ConEx.  The detailed mechanism to set the L flag
115	   in response to a loss-based congestion feedback signal is given in
116	   Section 4.1.

118	   While loss has to be minimized, ECN can provide more fine-grained
119	   feedback information.  ConEx-based traffic measurement or management
120	   mechanisms could benefit from this.  Unfortunately, the current ECN
121	   feedback mechanism does not reflect multiple congestion markings if
122	   they occur within the same Round-Trip Time (RTT).  A more accurate
123	   feedback extension to ECN (AccECN) is proposed in a separate document
124	   [draft-kuehlewind-tcpm-accurate-ecn], as this is also useful for
125	   other mechanisms.

127	   Congestion accounting for both classic ECN feedback and AccECN
128	   feedback is explained in detail in section 3.2.  Setting the E flag
129	   in response to ECN-based congestion feedback is again detailed in
130	   section 4.1.

132	1.1.  Requirements Language

134	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
135	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
136	   document are to be interpreted as described in [RFC2119].

138	2.  Sender-side Modifications

140	   This section gives an overview of actions that need to be taken by a
141	   TCP sender modified to use ConEx signaling.

143	   In the TCP handshake, a ConEx sender MUST negotiate for SACK and ECN,
144	   preferably with AccECN feedback.  Therefore a ConEx sender MUST also
145	   implement SACK and ECN.  Depending on the capability of the receiver,
146	   the following operation modes exist:

148	   o  SACK-accECN-ConEx (SACK and accurate ECN feedback)

150	   o  SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)

152	   o  accECN-ConEx (no SACK but accurate ECN feedback)

154	   o  ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN)

156	   o  SACK-ConEx (SACK but no ECN at all)

158	   o  Basic-ConEx (neither SACK nor ECN)

160	   A ConEx sender MUST expose all congestion information to the network
161	   according to the congestion information received by ECN or based on
162	   loss information provided by the TCP feedback loop.  A TCP sender
163	   SHOULD count congestion byte-wise (rather than packet-wise; see next
164	   paragraph).  After any congestion notification, a sender MUST mark
165	   subsequent packets with the appropriate ConEx flag in the IP header.
166	   Furthermore, a ConEx sender must send enough credit to cover all
167	   experienced congestion for the connection so far, as well as the risk
168	   of congestion for the current transmission (see Section 4.2).

170	   With SACK the number of lost payload bytes is known, but not the
171	   number of packets carrying these bytes.  With classic ECN only an
172	   indication is given that a marking occurred but not the exact number
173	   of payload bytes nor packets.  As network congestion is usually byte-
174	   congestion [RFC7141], the byte-size of a packet marked with a CDO
175	   flag is defined to represent that number of bytes of congestion
176	   signaling [draft-ietf-conex-destopt].  Therefore the exact number of
177	   bytes should be taken into account, if available, to make the ConEx
178	   signal as exact as possible.

180	   Detailed mechanisms for congestion counting in each operation mode
181	   are described in the next section.

183	3.  Counting Congestion

185	   A ConEx TCP sender maintains two counters: one that counts congestion
186	   based on the information retrieved by loss detection, and a second
187	   that accounts for ECN-based congestion feedback.  These counters hold
188	   the number of outstanding bytes that should be ConEx marked with
189	   the L flag or the E flag, respectively, in subsequent packets.

191	   The outstanding bytes for congestion indications based on loss are
192	   maintained in the loss exposure gauge (LEG), as explained in
193	   Section 3.1.

195	   The outstanding bytes counted based on ECN feedback information are
196	   maintained in the congestion exposure gauge (CEG), as explained in
197	   Section 3.2.

199	   When the sender sends a ConEx capable packet with the E or L flag
200	   set, it reduces the respective counter by the byte-size of the
201	   packet.  This is explained for both counters in Section 4.1.

203	   Note that all bytes of an IP packet must be counted in the LEG or CEG
204	   to capture the right number of bytes that should be marked.
205	   Therefore the sender SHOULD take the payload and headers into
206	   account, up to and including the IP header.  However, in TCP the
207	   information regarding how large the headers of a lost or marked
208	   packet were is usually not available, as only payload data will be
209	   acknowledged.

211	   If equal-sized packets, or at least equally distributed packet sizes,
212	   can be assumed, the sender MAY only add and subtract TCP payload
213	   bytes.  In this case there should be about the same number of ConEx
214	   marked packets as the original packets that were causing the
215	   congestion.  Thus both contain about the same number of header bytes
216	   so they will cancel out.  This case is assumed for simplicity in the
217	   following sections.

219	   Otherwise, if a sender sends different sized packets (with unequally
220	   distributed packet sizes), the sender needs to memorize or estimate
221	   the number of lost or ECN-marked packets.  If the sender has
222	   sufficient memory available, the most accurate way to reconstruct the
223	   number of lost or marked packets is to remember the sequence number
224	   of all sent but not acknowledged packets.  In this case a sender is
225	   able to reconstruct the number of packets and thus the header bytes
226	   that were sent during the last RTT.  Otherwise, if e.g. not enough
227	   memory is available, the sender should estimate the packet size, e.g.
228	   if the packet size distribution follows a certain known pattern, or
229	   by using the minimum packet size seen in the last RTT.

231	   If the number of newly sent-out packets with the ConEx L or E flag
232	   set is smaller (or larger) than this estimated number of lost/ECN-
233	   marked packets, the additional header bytes should be added to (or
234	   can be subtracted from) the respective gauge.

236	3.1.  Loss Detection

238	   This section applies whether or not SACK support is available.  The
239	   following subsection (Section 3.1.1) handles the case when SACK is
240	   not available.

242	   A TCP sender detects losses and subsequently retransmits the lost
243	   data.  Therefore, a ConEx sender can simply set the ConEx L flag on
244	   all retransmissions in order to cover at least the number of bytes
245	   lost.  If this approach is taken, no LEG is needed.

247	   However, any retransmission may be spurious.  In this case, more bytes
248	   have been marked than necessary.  To compensate for this effect, a
249	   ConEx sender can maintain a local signed counter, the LEG, that
250	   indicates the number of outstanding bytes to be sent with the ConEx L
251	   flag and that can also become negative.

253	   Using the LEG, when a TCP sender decides that a data segment needs to
254	   be retransmitted, it will increase LEG by the size of the TCP payload
255	   bytes in the retransmission (assuming equal sized segments such that
256	   the retransmitted packet will have the same number of header bytes as
257	   the original ones):

259	   For each retransmission:

261	   LEG += payload

263	   Note that Section 4 describes how the LEG is reduced when the ConEx
264	   L marking is set.

266	   Further, to accommodate spurious retransmissions, a ConEx sender
267	   SHOULD make use of heuristics to detect such spurious retransmissions
268	   (e.g., F-RTO [RFC5682], DSACK [RFC3708], and Eifel [RFC3522],
269	   [RFC4015]) if already available in a given implementation.  If no
270	   mechanism for detecting spurious retransmissions is available, the
271	   ConEx sender MAY choose to implement one of the mechanisms stated
272	   above.  However, given the inaccuracy that ConEx may have anyway and
273	   the timeliness of ConEx information, a ConEx sender MAY also choose
274	   not to compensate for spurious retransmissions.  In this case, if
275	   spurious retransmissions occur, the ConEx sender simply has sent too
276	   many ConEx signals, which e.g. would decrease the congestion
277	   allowance in a ConEx policer unnecessarily.

279	   If a heuristic method is used to detect spurious retransmission and
280	   has determined that a certain number of packets were retransmitted
281	   erroneously, the ConEx sender subtracts the payload size of these TCP
282	   packets from LEG.

284	   If a spurious retransmission is detected:

286	   LEG -= payload

288	   Note that LEG can become negative if too many L markings have already
289	   been sent.  This case is further discussed in Section 6.
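
   The LEG bookkeeping described in this section can be sketched as
   follows.  This is a minimal illustration, not part of the
   specification: the class and method names are hypothetical, and
   payload-only counting (equal-sized segments) is assumed as in the
   text above.

```python
class LossExposureGauge:
    """Signed byte counter for outstanding ConEx L markings (sketch)."""

    def __init__(self):
        self.leg = 0  # signed; may become negative

    def on_retransmission(self, payload):
        # "For each retransmission: LEG += payload"
        self.leg += payload

    def on_spurious_retransmission(self, payload):
        # "If a spurious retransmission is detected: LEG -= payload"
        self.leg -= payload

    def on_l_marked_packet_sent(self, payload):
        # Sending an L-marked packet reduces the gauge (see Section 4.1).
        self.leg -= payload


gauge = LossExposureGauge()
gauge.on_retransmission(1460)           # segment retransmitted
gauge.on_l_marked_packet_sent(1460)     # L flag sent for it
gauge.on_spurious_retransmission(1460)  # later found to be spurious
print(gauge.leg)  # -1460
```

   In this example the gauge ends up negative because the L marking
   was already sent before the retransmission turned out to be
   spurious.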

291	3.1.1.  Without SACK Support

293	   If multiple losses occur within one RTT and SACK is not used, it may
294	   take several RTTs until all lost data is retransmitted.  With the
295	   scheme described above, the ConEx information will be delayed
296	   considerably, but timeliness is important for ConEx.  For ConEx, it
297	   is important to know how much data was lost; it is not important to
298	   know which data was lost.  During the first RTT after the initial loss
299	   detection, the amount of received data and thus also the amount of
300	   lost data can be estimated based on the number of received ACKs.

302	   Therefore a ConEx sender can use the following algorithm to estimate
303	   the number of lost bytes with an additional delay of one RTT using an
304	   additional Loss Estimation Counter (LEC):

306	      flight_bytes:      current flight size in bytes
307	      retransmit_bytes:  payload size of the retransmission

309	      At the first retransmission in a congestion event LEC is set:

311	         LEC = flight_bytes - 3*SMSS

313	         (At this point of time in the transmission, in the worst case,
314	         all packets in flight minus three that triggered the dupACKs
315	         could have been lost.)

317	      Then during the first RTT of the congestion event:

319	         For each retransmission:
320	            LEG += retransmit_bytes
321	            LEC -= retransmit_bytes

323	         For each ACK:
324	            LEC -= SMSS

326	      After one RTT:

328	         LEG += LEC

330	         (The LEC now estimates the number of outstanding bytes
331	         that should be ConEx L marked.)

333	      After the first RTT for each following retransmission:

335	         if (LEC > 0): LEC -= retransmit_bytes
336	         else if (LEC==0): LEG += retransmit_bytes

338	         if (LEC < 0): LEG += -LEC

340	         (The LEG is not increased for those bytes that were
341	         already counted.)
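
   The algorithm above can be sketched in Python as shown below.  The
   class and method names are hypothetical.  Two points are
   assumptions of this sketch rather than statements of the text: the
   first retransmission both initializes the LEC and is counted as a
   retransmission, and the LEC is reset to zero once a negative value
   has been folded into the LEG (so the same excess is not added
   twice).

```python
class LossEstimator:
    """LEG/LEC bookkeeping for one congestion event without SACK (sketch)."""

    def __init__(self, smss):
        self.smss = smss
        self.leg = 0              # loss exposure gauge
        self.lec = 0              # loss estimation counter
        self.in_first_rtt = False

    def on_first_retransmission(self, flight_bytes, retransmit_bytes):
        # Worst case: everything in flight except the three segments
        # that triggered the dupACKs was lost.
        self.lec = flight_bytes - 3 * self.smss
        self.in_first_rtt = True
        self.on_retransmission(retransmit_bytes)  # assumption: counted too

    def on_retransmission(self, retransmit_bytes):
        if self.in_first_rtt:
            self.leg += retransmit_bytes
            self.lec -= retransmit_bytes
            return
        if self.lec > 0:
            self.lec -= retransmit_bytes  # bytes already counted in LEG
        elif self.lec == 0:
            self.leg += retransmit_bytes
        if self.lec < 0:
            self.leg += -self.lec
            self.lec = 0  # assumption: avoid adding the excess again

    def on_ack(self):
        if self.in_first_rtt:
            self.lec -= self.smss

    def on_first_rtt_elapsed(self):
        self.leg += self.lec
        self.in_first_rtt = False


# Ten segments of 1000 bytes in flight, two of them lost.
est = LossEstimator(smss=1000)
est.on_first_retransmission(flight_bytes=10_000, retransmit_bytes=1000)
for _ in range(5):           # five further dupACKs during the first RTT
    est.on_ack()
est.on_first_rtt_elapsed()
est.on_retransmission(1000)  # second hole, already covered by the LEC
print(est.leg, est.lec)      # 2000 0
```

   With two lost segments, the LEG reaches 2000 bytes after one RTT,
   before the second retransmission is actually sent.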

343	3.2.  ECN

345	   ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to
346	   mark packets with the Congestion Experienced (CE) mark instead of
347	   dropping them when congestion occurs.

349	   A receiver might support 'classic' ECN, the more accurate ECN
350	   feedback scheme (AccECN), or neither.  In the case that ECN is not
351	   supported for a connection, of course, no ECN marks will occur; thus
352	   the sender will never set the E flag.  Otherwise, a ConEx sender
353	   needs to maintain a signed counter, the congestion exposure gauge
354	   (CEG), for the number of outstanding bytes that have to be ConEx
355	   marked with the E flag.

357	   The CEG is increased when ECN information is received from an ECN-
358	   capable receiver supporting the 'classic' ECN scheme or the accurate
359	   ECN feedback scheme.  When the ConEx sender receives an ACK
360	   indicating one or more segments were received with a CE mark, CEG is
361	   increased by the appropriate number of bytes as described further
362	   below.

364	   Unfortunately in case of duplicate acknowledgements the number of
365	   newly acknowledged bytes will be zero even though (CE marked) data
366	   has been received.  Therefore, we increase the CEG by DeliveredData,
367	   as defined below:

369	   DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS -
370	   (is_after_dup)*num_dup*1SMSS

372	   DeliveredData covers the number of bytes that have been newly
373	   delivered to the receiver.  Therefore on each arrival of an ACK,
374	   DeliveredData will be increased by the newly acknowledged bytes
375	   (acked_bytes) as indicated by the current ACK, relative to all past
376	   ACKs.  The formula depends on whether SACK is available: if SACK is
377	   not available, SACK_diff is always zero, whereas if SACK information
378	   is available, is_dup and is_after_dup are always zero.

380	   With SACK, DeliveredData is increased by the number of bytes provided
381	   by (new) SACK information (SACK_diff).  Note that if fewer
382	   unacknowledged bytes are announced in the new SACK information than
383	   in the previous ACK, SACK_diff can be negative.  In this case, data
384	   is newly acknowledged (in acked_bytes) that has previously already
385	   been accumulated into DeliveredData based on SACK information.

387	   Otherwise, without SACK, DeliveredData is increased by 1 SMSS on
388	   duplicate acknowledgements because duplicate acknowledgements do not
389	   acknowledge any new data (and acked_bytes will be zero).  For the
390	   subsequent partial or full ACK, acked_bytes covers all newly
391	   acknowledged bytes including those already accounted for with the
392	   receipt of any duplicate acknowledgement.  Therefore DeliveredData is
393	   reduced by one SMSS for each preceding duplicate ACK.  Consequently,
394	   is_dup is one if the current ACK is a duplicate ACK without SACK,
395	   and zero otherwise.  is_after_dup is one only for the next full or
396	   partial ACK after a number of duplicate ACKs without SACK, and
397	   num_dup counts the number of duplicate ACKs in a row (which usually
398	   is 3 or more).
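
   As an illustration of the DeliveredData formula, the following
   sketch evaluates it for a non-SACK connection that sees three
   duplicate ACKs followed by a partial ACK.  The function name, the
   keyword arguments, and the SMSS value of 1460 bytes are assumptions
   made for this example only.

```python
SMSS = 1460  # assumed sender maximum segment size in bytes


def delivered_data(acked_bytes, sack_diff=0, is_dup=False,
                   is_after_dup=False, num_dup=0, smss=SMSS):
    """DeliveredData for one incoming ACK, following the formula above."""
    return (acked_bytes + sack_diff
            + int(is_dup) * smss
            - int(is_after_dup) * num_dup * smss)


total = 0
# Three duplicate ACKs: no new bytes acknowledged, 1 SMSS credited each.
for _ in range(3):
    total += delivered_data(acked_bytes=0, is_dup=True)
# The next ACK covers the retransmitted segment plus the three segments
# that triggered the duplicates; those three are subtracted again.
total += delivered_data(acked_bytes=4 * SMSS, is_after_dup=True, num_dup=3)
print(total)  # 5840
```

   The total equals the four segments actually delivered, so the
   bytes credited on duplicate ACKs are not counted twice.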

400	   With classic ECN, one congestion marked packet causes continuous
401	   congestion feedback for a whole round trip, thus hiding the arrival
402	   of any further congestion marked packets during that round trip.  A
403	   more accurate ECN feedback scheme (AccECN) is needed to ensure that
404	   feedback properly reflects the extent of congestion marking.  The two
405	   cases, with and without a receiver capable of AccECN, are discussed
406	   in the following sections.

408	3.2.1.  Accurate ECN Feedback

410	   With a more accurate ECN feedback scheme (AccECN) that is supported
411	   by the receiver, either the number of marked packets or the number of
412	   marked bytes will be fed back from the receiver to the sender and is
413	   therefore known at the sender side.  In the latter case, the CEG can
414	   directly be increased by the number of marked bytes.  Otherwise, if D
415	   is assumed to be the number of marks, the gauge (CEG) will be
416	   conservatively increased by one SMSS for each marking or at most the
417	   number of newly acknowledged bytes:

419	   CEG += min(SMSS*D, DeliveredData)
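
   A minimal sketch of this update for the packet-count case is given
   below.  The function name and the numbers are illustrative
   assumptions; DeliveredData is taken as an input computed as
   described in Section 3.2.

```python
def ceg_increase_accecn(marks, delivered_data, smss=1460):
    """CEG increment for D newly reported CE marks, capped by DeliveredData."""
    return min(smss * marks, delivered_data)


ceg = 0
# One mark reported on an ACK that delivered two segments: 1 SMSS is added.
ceg += ceg_increase_accecn(marks=1, delivered_data=2920)
# Three marks reported but only 2920 bytes delivered: the cap applies.
ceg += ceg_increase_accecn(marks=3, delivered_data=2920)
print(ceg)  # 4380
```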

421	3.2.2.  Classic ECN Support

423	   With classic ECN, as soon as a CE mark is seen at the receiver, it
424	   will feed this information back to the sender by setting the Echo
425	   Congestion Experienced (ECE) flag in the TCP header of subsequent
426	   ACKs.  Once the sender receives the first ECE of a congestion
427	   notification, it sets the CWR flag in the TCP header once.  When this
428	   packet with Congestion Window Reduced (CWR) flag in the TCP header
429	   arrives at the receiver, acknowledging its first ECE feedback, the
430	   receiver stops setting ECE.

432	   If the ConEx sender fully conforms to the semantics of ECN signaling
433	   as defined by [RFC3168], it will receive one full RTT of ACKs with
434	   the ECE flag set whenever at least one CE mark was received by the
435	   receiver.  As the sender cannot estimate how many packets have
436	   actually been CE marked during this RTT, the most conservative
437	   assumption MAY be taken, namely assuming that all packets were
438	   marked.  This can be achieved by increasing the CEG by DeliveredData
439	   for each ACK with the ECE flag:

441	   CEG += DeliveredData
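
   The conservative classic ECN accounting can be sketched as below.
   The function name and the ACK sequence are hypothetical; each ACK
   is represented as a (DeliveredData, ECE flag) pair.

```python
def ceg_after_acks(acks, ceg=0):
    """Add DeliveredData to the CEG for every ACK that carries ECE."""
    for delivered, ece in acks:
        if ece:
            ceg += delivered
    return ceg


# (DeliveredData in bytes, ECE set?) for four consecutive ACKs
acks = [(1460, False), (1460, True), (2920, True), (1460, False)]
print(ceg_after_acks(acks))  # 4380
```

   All bytes delivered while ECE is set are counted, which reflects
   the assumption that every packet in that round trip was marked.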

443	   Optionally, a ConEx sender could implement the following technique
444	   (which does not conform to [RFC3168]), called advanced compatibility
445	   mode, to considerably improve its estimate of the number of
446	   ECN-marked packets:

448	   To extract more than one ECE indication per RTT, a ConEx sender could
449	   set the CWR flag continuously to force the receiver to signal only
450	   one ECE per CE mark.  Unfortunately, the use of delayed ACKs
451	   [RFC5681] (which is common) will prevent feedback of every CE mark;
452	   if a CWR confirmation is received before the ECE can be sent out on
453	   the next ACK, ECN feedback information could get lost (depending on
454	   the actual receiver implementation).  Thus a sender SHOULD set CWR
455	   only on those data segments that will presumably trigger a (delayed)
456	   ACK.  The sender would need an additional control loop to estimate
457	   which data segments will trigger an ACK in order to extract more
458	   timely congestion notifications.  Still, the CEG SHOULD be increased
459	   by DeliveredData, as one or more CE marked packets could be
460	   acknowledged by one delayed ACK.

462	4.  Setting the ConEx Flags

464	   By setting the X flag, a packet is marked as ConEx-capable.  All
465	   packets carrying payload MUST be marked with the X flag set,
466	   including retransmissions.  Only if no congestion feedback
467	   information is (currently) available, the X flag SHOULD be zero (e.g.
468	   for control packets on a connection that has not sent any user data
469	   for some time and therefore is sending only pure ACKs that are not
470	   carrying any payload).

472	4.1.  Setting the E or the L Flag

474	   As described in Section 3, the sender needs to maintain a
475	   CEG counter and might maintain a LEG counter.  If no LEG is used, all
476	   retransmissions will be marked with the L flag.

478	   Further, as long as the LEG or CEG counter is positive, the sender
479	   marks each ConEx-capable packet with L or E respectively, and
480	   decreases the LEG or CEG counter by the TCP payload bytes carried in
481	   the marked packet (assuming headers are not being counted because
482	   packet sizes are regular).  No matter how small the value of LEG or
483	   CEG, if the value is positive the sender MUST NOT defer packet
484	   marking; this ensures ConEx signals are timely.  Therefore the value
485	   of LEG and CEG will commonly be negative.

487	   If both LEG and CEG are positive, the sender MUST mark each ConEx-
488	   capable packet with both L and E.  If a credit signal is also pending
489	   (see next section), the C flag can be set as well.
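
   The marking rule for the L and E flags can be sketched as follows.
   The names are hypothetical, only payload bytes are counted (regular
   packet sizes are assumed), and the C flag of Section 4.2 is left
   out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Gauges:
    leg: int = 0  # loss exposure gauge, in bytes
    ceg: int = 0  # congestion exposure gauge, in bytes


def flags_for_packet(g, payload):
    """Return the ConEx flags for the next data packet and update the gauges."""
    flags = {"X"}           # every packet carrying payload is ConEx-capable
    if g.leg > 0:           # any positive value triggers marking immediately
        flags.add("L")
        g.leg -= payload
    if g.ceg > 0:
        flags.add("E")
        g.ceg -= payload
    return flags


g = Gauges(leg=100, ceg=3000)
sent = [sorted(flags_for_packet(g, 1460)) for _ in range(4)]
print(sent)  # [['E', 'L', 'X'], ['E', 'X'], ['E', 'X'], ['X']]
print(g)     # Gauges(leg=-1360, ceg=-1380)
```

   Because marking is never deferred, both gauges end up negative in
   this example, as the text above anticipates.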

491	4.2.  Setting the Credit Flag

493	   The ConEx abstract mechanism [draft-ietf-conex-abstract-mech]
494	   requires that sufficient credit MUST be signaled in advance to cover
495	   the expected congestion during the feedback delay of one RTT.

497	   To monitor the credit state at the audit, a ConEx sender needs to
498	   maintain a Credit State Counter (CSC) in bytes.  If congestion
499	   occurs, credits will be consumed and the CSC is reduced by the number
500	   of bytes that were lost or estimated to be ECN-marked.  If the risk
501	   of congestion was estimated wrongly and thus too few credits were
502	   sent, the CSC becomes zero but cannot go negative.

504	   To be sure that the credit state in the audit never reaches zero, the
505	   number of credits should always equal the number of bytes in flight
506	   as all packets could potentially get lost or congestion marked.  In
507	   this case a ConEx sender also monitors the number of bytes in flight
508	   F.  If F ever becomes larger than CSC, the ConEx sender sets the C
509	   flag on each ConEx-capable packet and increases CSC by the payload
510	   size of each marked packet until CSC is no less than F again.
511	   However, a ConEx sender might also be less conservative and send
512	   fewer credits, if it e.g. assumes based on previous experience that
513	   the congestion will be low on a certain path.

515	   Recall that CSC will be decreased whenever congestion occurs;
516	   therefore CSC will need to be replenished as soon as CSC drops below
517	   F.  Also recall that the sender can set the C flag on a ConEx-capable
518	   packet whether or not the E or L flags are also set.
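
   The conservative credit rule described above (keep CSC at least as
   large as the bytes in flight F) might be implemented along the
   following lines.  The class and method names are hypothetical, and
   it is an assumption of this sketch that F already includes the
   packet being sent.

```python
class CreditState:
    """Credit State Counter (CSC) in bytes, conservative variant (sketch)."""

    def __init__(self):
        self.csc = 0

    def on_congestion(self, congested_bytes):
        # Credits are consumed by congestion; CSC cannot go negative.
        self.csc = max(0, self.csc - congested_bytes)

    def on_send(self, payload, flight_bytes):
        """Return True if the C flag should be set on this packet."""
        if flight_bytes > self.csc:
            self.csc += payload
            return True
        return False


credit = CreditState()
marks = [credit.on_send(1000, f) for f in (1000, 2000, 3000)]
credit.on_congestion(1000)                 # one segment lost or CE marked
marks.append(credit.on_send(1000, 3000))   # CSC below F: replenish
marks.append(credit.on_send(1000, 3000))   # CSC equals F again: no C flag
print(marks, credit.csc)  # [True, True, True, True, False] 3000
```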

520	   In TCP Slow Start, the congestion window might grow much larger than
521	   during the rest of the transmission.  A sender could then consider
522	   sending fewer than F credits, but risks being penalized by an audit
523	   function.  However, the credits should at least cover the increase in
524	   sending rate.  Given the exponential increase as implemented in the
525	   TCP Slow Start algorithm which means that the sending rate doubles
526	   every RTT, a ConEx sender should at least cover half the number of
527	   packets in flight by credits.

529	   Note that the number of losses or markings within one RTT does not
530	   solely depend on the sender's actions.  In general, the behavior of
531	   the cross traffic, whether Active Queue Management (AQM) is used and
532	   how it is parameterized influence how many packets might be dropped
533	   or marked.  As long as any AQM encountered is not overly aggressive
534	   with ECN marking, sending half the flight size as credits should be
535	   sufficient whether congestion is signaled by loss or ECN.

537	   To maintain half of the packets in flight as credits, also half of
538	   the packets of the initial window must be C marked.  In Slow Start,
539	   marking every fourth packet introduces the correct amount of credit
540	   as can be seen in Figure 1.

542	                                        in_flight  credits
543	                RTT1  |------XC------>|     1         1
544	                      |------X------->|     2         1
545	                      |------XC------>|     3         2
546	                      |               |
547	                RTT2  |------X------->|     3         2
548	                      |------X------->|     4         2
549	                      |------X------->|     4         2
550	                      |------XC------>|     5         3
551	                      |------X------->|     5         3
552	                      |------X------->|     6         3
553	                      |               |
554	                RTT3  |------X------->|     6         3
555	                      |------XC------>|     7         4
556	                      |------X------->|     7         4
557	                      |------X------->|     8         4
558	                      |------X------->|     8         4
559	                      |------XC------>|     9         5
560	                      |------X------->|     9         5
561	                      |------X------->|    10         5
562	                      |------X------->|    10         5
563	                      |------XC------>|    11         6
564	                      |------X------->|    11         6
565	                      |------X------->|    12         6
566	                      |      .        |
567	                      |      :        |

569	       Figure 1: Credits in Slow Start (with an initial window of 3)
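
   The marking pattern of Figure 1 can be reproduced with a small
   simulation.  The sketch below counts in packets rather than bytes
   and assumes one ACK per packet (no delayed ACKs), no congestion,
   and the rule "set C whenever the credits sent so far are less than
   half the packets in flight"; the function name is hypothetical.

```python
def slow_start_credits(initial_window, rtts):
    """Return one (in_flight, credits, c_flag) tuple per sent packet."""
    rows = []
    in_flight = 0
    credits = 0

    def send():
        nonlocal in_flight, credits
        in_flight += 1
        c_flag = 2 * credits < in_flight  # credits below half of in_flight
        if c_flag:
            credits += 1
        rows.append((in_flight, credits, c_flag))

    for _ in range(initial_window):       # first RTT: the initial window
        send()
    for _ in range(rtts - 1):
        for _ in range(in_flight):        # one ACK per packet in flight
            in_flight -= 1                # the ACKed packet leaves the network
            send()                        # Slow Start: two packets per ACK
            send()
    return rows


rows = slow_start_credits(initial_window=3, rtts=3)
for in_flight, credits, c_flag in rows:
    print("XC" if c_flag else "X ", in_flight, credits)
```

   Under these assumptions the output matches the in_flight and
   credits columns of Figure 1, with the C flag on the first and
   third packet and on every fourth packet thereafter.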

571	   It is possible that a TCP flow will encounter an audit function
572	   without relevant flow state, due to e.g. rerouting or memory
573	   limitations.  Therefore, the sender needs to detect this case and
574	   resend credits.  A ConEx sender might reset the credit counter CSC to
575	   zero if losses occur in subsequent RTTs (assuming that the sending
576	   rate was correctly reduced based on the received congestion signal
577	   and using a conservatively large RTT estimation).

579	   This section proposes a concrete algorithm for determining how much
580	   credit to signal (with a separate approach used for Slow Start).
581	   However, experimentation in credit setting algorithms is expected and
582	   encouraged.  The wider goal of ConEx is to reflect the 'cost' of the
583	   risk of causing congestion on those that contribute most to it.
584	   Thus, experimentation is encouraged to improve or maintain
585	   performance while reducing the risk of causing congestion, and
586	   therefore potentially reducing the need to signal so much credit.

588	5.  Loss of ConEx Information

590	   Packets carrying ConEx signals could themselves be discarded.  This
591	   is a second-order problem (e.g., if the loss probability is 0.1%,
592	   the probability of losing a ConEx L signal is 0.1% of 0.1% =
593	   0.0001%).  Further, the penalty an audit induces should be
594	   proportional to the mismatch between expected ConEx marks and
595	   observed congestion, so the audit might only slightly increase the
596	   loss level of this flow.  Therefore, an implementer MAY choose to
597	   ignore this problem, accepting instead the risk that an audit
598	   function might wrongly penalize a flow.

600	   Nonetheless, a ConEx sender is responsible for always signalling
601	   sufficient congestion feedback and therefore SHOULD remember which
602	   packets were marked with the L, the E, or the C flag.  If one of
603	   these packets is detected as lost, the sender SHOULD increase the
604	   respective gauge(s), LEG or CEG, by the number of lost payload bytes
605	   in addition to increasing LEG for the loss.
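   A minimal sketch of this bookkeeping follows.  The class and method
   names are hypothetical, gauges are counted in payload bytes, and the
   handling of a lost credit (C) mark is deliberately left open because
   the text above names only LEG and CEG.

```python
from dataclasses import dataclass, field


@dataclass
class ConExLossTracker:
    """Hypothetical sender-side state; all names are illustrative."""
    leg: int = 0  # loss exposure gauge, in bytes
    ceg: int = 0  # congestion (ECN) exposure gauge, in bytes
    marked: dict = field(default_factory=dict)  # seq -> flags sent

    def on_send(self, seq: int, flags: set) -> None:
        # Remember which ConEx flags ("L", "E", "C") this packet carried.
        if flags:
            self.marked[seq] = set(flags)

    def on_loss(self, seq: int, payload_bytes: int) -> None:
        # The loss itself is a congestion signal accounted for in LEG.
        self.leg += payload_bytes
        flags = self.marked.pop(seq, set())
        # The ConEx signal on the lost packet did not get through, so
        # the respective gauge is increased again by the lost bytes.
        if "L" in flags:
            self.leg += payload_bytes
        if "E" in flags:
            self.ceg += payload_bytes
        # A lost "C" mark is not mapped to a gauge in this sketch.
```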

607	6.  Timeliness of the ConEx Signals

609	   ConEx signals will only be useful to a network node within a time
610	   delay of about one RTT after the congestion occurred.  To avoid
611	   further delays, a ConEx sender SHOULD send the ConEx signaling on the
612	   next available packet.

614	   Any or all of the ConEx flags can be used in the same packet, which
615	   allows delay to be minimized when multiple signals are pending.  The
616	   need to set multiple ConEx flags at once can occur if, e.g., an
617	   ACK is received by the sender that simultaneously indicates that at
618	   least one ECN mark was received, and that one or more segments were
619	   lost.  This may happen during excessive congestion, if the queues
620	   overflow even though ECN was used and currently all forwarded packets
621	   are marked, while others have to be dropped.  Another case when this
622	   might happen is when ACKs are lost, so that a subsequent ACK carries
623	   summary information not previously available to the sender.
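   The case of several pending signals can be sketched as follows.  The
   function name and the boolean credit_due are hypothetical; the sketch
   only illustrates that all pending signals go on the same packet.

```python
def flags_for_next_packet(leg: int, ceg: int, credit_due: bool) -> set:
    """Choose the ConEx flags for the next available packet.

    Assumption: a gauge greater than zero means that signal is pending.
    """
    flags = set()
    if leg > 0:
        flags.add("L")  # loss exposure is pending
    if ceg > 0:
        flags.add("E")  # ECN exposure is pending
    if credit_due:
        flags.add("C")  # credit is pending
    return flags


if __name__ == "__main__":
    # One ACK reported both an ECN mark and a lost segment.
    print(sorted(flags_for_next_packet(leg=1448, ceg=1448, credit_due=False)))
```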

625	   If a flow becomes application-limited, there could be insufficient
626	   bytes to send to reduce the gauges to zero or below.  In such cases,
627	   the sender cannot help but delay ConEx signals.  Nonetheless, as long
628	   as the sender is marking all outgoing packets, an audit function is
629	   unlikely to penalize ConEx-marked packets.  Therefore, no matter how
630	   long a gauge has been positive, a sender MUST NOT reduce the gauge by
631	   more than the ConEx marked bytes it has sent.

633	   If the CEG or LEG counter is negative, the respective counter MAY be
634	   reset to zero within one RTT after it was decreased the last time or
635	   one RTT after recovery if no further congestion occurred.
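   The gauge rules for an application-limited flow can be sketched as
   below.  The function names are hypothetical, and the timing condition
   for the optional reset (about one RTT) is left to the caller.

```python
def send_marked(gauge: int, payload_bytes: int) -> tuple:
    """Return (marked, new_gauge) for one outgoing packet.

    The packet is marked while the gauge is positive, and the gauge is
    reduced by exactly the payload bytes that carried the mark, never
    by more.
    """
    if gauge > 0:
        return True, gauge - payload_bytes
    return False, gauge


def reset_if_negative(gauge: int) -> int:
    """Optional reset of a negative gauge to zero (MAY in the text)."""
    return max(gauge, 0)


if __name__ == "__main__":
    # Application-limited: only 1200 bytes available, gauge stays positive.
    print(send_marked(5000, 1200))
```

   With a gauge of 5000 bytes and only 1200 bytes to send, the gauge
   stays at 3800 bytes until more marked data can be sent.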

637	7.  Open Areas for Experimentation

639	   All proposed mechanisms in this document are experimental, and
640	   therefore further large-scale experimentation in the Internet is
641	   required to evaluate if the signaling provided by these mechanisms is
642	   accurate and timely enough to produce value for ConEx-based (traffic
643	   management or other) mechanisms.

645	   The current ConEx specifications assume that congestion is counted in
646	   number of bytes (including the IP header that directly encapsulates
647	   the CDO and everything that IP header encapsulates)
648	   [draft-ietf-conex-destopt].  This decision was taken because most
649	   network devices today experience byte-congestion where the memory is
650	   filled exactly with the number of bytes a packet carries [RFC7141].
651	   However, there are also devices that may allocate a certain amount of
652	   memory per packet, no matter how large a packet is.  These devices
653	   get congested based on the number of packets in their memory and
654	   therefore in this case congestion is determined by the number of
655	   packets that have been lost or marked.  Furthermore, a transport
656	   layer endpoint, such as a TCP sender or receiver, might not know the
657	   exact number of bytes that a lower layer was carrying.  Therefore a
658	   TCP endpoint may only be able to estimate the number of congested
659	   bytes (assuming that all lower-layer headers have the same
660	   length).  Whether this estimate is sufficient for the ConEx
661	   signal to work needs to be further evaluated in tests in the
662	   Internet together with different auditor implementations.
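   Such an estimate might look like the sketch below.  The function
   name and the default of 60 header bytes per packet (a 40-byte IPv6
   header plus a 20-byte TCP header without options, ignoring extension
   headers) are assumptions made only for illustration.

```python
def estimated_congested_bytes(payload_sizes, assumed_header_bytes=60):
    """Estimate congested bytes for a list of lost or marked packets.

    Each packet is counted as its TCP payload plus a fixed, assumed
    header length, since the sender may not know the real header sizes.
    """
    return sum(size + assumed_header_bytes for size in payload_sizes)


if __name__ == "__main__":
    # Three congested packets with different payload sizes.
    print(estimated_congested_bytes([1440, 1440, 500]))
```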

664	   Further, the proposed marking schemes in this document are designed
665	   under the assumption that all TCP packets of a ConEx-capable flow are
666	   of equal size or that flows have a constant mean packet size over a
667	   rather small time frame, like one RTT or less.  Most
668	   implementations are likely to make this assumption as well, and it
669	   probably holds for most traffic flows.  If this proposed scheme is
670	   used, it is necessary to evaluate how much accuracy degrades if this
671	   precondition is not met.  Evaluating with real traffic from different
672	   applications is especially important in making the decision regarding
673	   whether the proposed schemes are sufficient or whether a more complex
674	   scheme is needed.

676	   In this context the proposed scheme to set credit markings in Slow
677	   Start runs the risk of providing an insufficient number of markings,
678	   which can cause an audit function to penalize this flow.  Both the
679	   proposed credit scheme for Slow Start and the scheme in Congestion
680	   Avoidance must be evaluated together with one or more specific
681	   implementations of a ConEx auditor to ensure that both algorithms,
682	   in the sender and in the auditor, work properly together with a low
683	   risk of false positives (which would lead to penalization of an
684	   honest sender).  However, if a sender is wrongly assumed to cheat,
685	   the penalization by the audit should be adequate and should allow an
686	   honest sender using a congestion control scheme that is commonly used
687	   today to recover quickly.

689	   Another open issue is the accuracy of the ECN feedback signal.  At
690	   the time of publication of this document, no AccECN mechanism has
691	   been specified yet, and AccECN will also take some time to be
692	   widely deployed.  This document proposes an advanced compatibility
693	   mode for Classic ECN.  The proposed mechanism can provide more
694	   accurate feedback by utilizing the way Classic ECN is specified but
695	   has a higher risk of losing information.  To figure out how high this
696	   risk is in a real deployment scenario, further experimental
697	   evaluation is needed.  The following argument is intended to show
698	   that suppressing repetitions of ECE is nonetheless safe against
699	   possible congestion collapse due to lost congestion feedback; it
700	   should be further validated in experimentation:

702	   Repetition of ECE in Classic ECN is intended to ensure reliable
703	   delivery of congestion feedback.  However, with advanced
704	   compatibility mode, it is possible to miss congestion notifications.
705	   This can happen in some implementations if delayed acknowledgements
706	   are used.  Further, an ACK containing ECE can simply get lost.  If
707	   only a few CE marks are received within one congestion event (e.g.,
708	   only one), the loss of one acknowledgement due to (heavy) congestion
709	   on the reverse path can prevent any congestion notification from
710	   being received by the sender.

712	   However, if loss of feedback exacerbates congestion on the forward
713	   path, more forward packets will be CE marked, increasing the
714	   likelihood that feedback from at least one CE will get through per
715	   RTT.  As long as one ECE reaches the sender per RTT, the sender's
716	   congestion response will be the same as if CWR were not continuous.
717	   The only way that heavy congestion on the forward path could be
718	   completely hidden would be if all ACKs on the reverse path were lost.
719	   If total ACK loss persisted, the sender would time out and do a
720	   congestion response anyway.  Therefore, the problem seems confined to
721	   potential suppression of a congestion response during light
722	   congestion.
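   The likelihood argument can be made concrete with a small sketch.
   It assumes that ACK losses on the reverse path are independent, which
   is a simplification, and the function name is hypothetical.

```python
def p_feedback_seen(ack_loss_prob: float, ce_acks_per_rtt: int) -> float:
    """Probability that at least one ECE-carrying ACK arrives in an RTT.

    Assumes each of the ce_acks_per_rtt ACKs is lost independently with
    probability ack_loss_prob.
    """
    return 1.0 - ack_loss_prob ** ce_acks_per_rtt


if __name__ == "__main__":
    # More CE marks per RTT make it more likely that feedback arrives.
    for n in (1, 2, 4, 8):
        print(n, p_feedback_seen(0.5, n))
```

   Even with half of all ACKs lost, four ECE-carrying ACKs per RTT give
   a probability of 0.9375 that at least one arrives, while the
   probability is zero only if every ACK is lost.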

724	   Furthermore, even if loss of all ECN feedback leads to no congestion
725	   response, the worst that could happen would be loss instead of ECN-
726	   signaled congestion on the forward path.  Given that compatibility
727	   mode does not affect loss feedback, there would be no risk of
728	   congestion collapse.

730	8.  Acknowledgements

732	   The authors would like to thank Bob Briscoe, who contributed his
733	   initial ideas [I-D.briscoe-conex-re-ecn-tcp] and valuable feedback.
734	   Moreover, thanks to Jana Iyengar who also provided valuable feedback.

736	9.  IANA Considerations

738	   This document does not have any requests to IANA.

740	10.  Security Considerations

742	   General ConEx security considerations are covered extensively in the
743	   ConEx abstract mechanism [draft-ietf-conex-abstract-mech].  This
744	   section covers TCP-specific concerns that may occur with the addition
745	   of ConEx to TCP (while not discussing general well-known attacks
746	   against TCP).  It is assumed that any altering of ConEx information
747	   can be detected by protection mechanisms in the IP layer and is
748	   therefore not discussed here but in [draft-ietf-conex-destopt].
749	   Further, [draft-ietf-conex-destopt] describes how to use ConEx to
750	   mitigate flooding attacks by using preferential drop where the use of
751	   ConEx can even increase security.

753	   The ConEx modifications to TCP provide no mechanism for a receiver to
754	   force a sender not to use ConEx.  A receiver can degrade the accuracy
755	   of ConEx by claiming that it does not support SACK, AccECN or ECN,
756	   but the sender will never have to turn ConEx off.  Further, the
757	   receiver cannot force the sender to have to mark ConEx more
758	   conservatively, in order to cover the risk of any inaccuracy.
759	   Instead, it is always the sender's choice either to mark very
760	   conservatively, which ensures that the audit always sees enough
761	   markings to not penalize the flow, or to estimate the needed number
762	   of markings more tightly.  This second case can lead to inaccurate
763	   marking and therefore increases the likelihood of loss at an audit
764	   function, which will only harm the receiver itself.

766	   Assuming the sender is limited in some way by a congestion allowance
767	   or quota, a receiver could spoof more loss or ECN congestion feedback
768	   than it actually experiences, in an attempt to make the sender draw
769	   down its allowance faster than necessary.  However, over-declaring
770	   congestion simply makes the sender slow down.  If the receiver is
771	   interested in the content it will not want to harm its own
772	   performance.

774	   However, if the receiver is solely interested in making the sender
775	   draw down its allowance, the net effect will depend on the sender's
776	   congestion control algorithm, as continually adding more
777	   congestion feedback would cause the sender to keep reducing
778	   its sending rate.  Therefore, a receiver can only maintain a certain
779	   congestion level that corresponds to a certain sending rate.
780	   With New Reno [RFC5681], doubling congestion feedback causes the
781	   sender to reduce its sending rate such that it would only consume
782	   sqrt(2) ~= 1.4 times more congestion allowance.  However, to improve
783	   scaling, congestion control algorithms are tending towards less
784	   responsive algorithms like Cubic or Compound TCP, and ultimately to
785	   linear algorithms like DCTCP [DCTCP] that aim to maintain the same
786	   congestion level independent of the current sending rate and always
787	   reduce their sending window if the signaled congestion feedback is
788	   higher.  In each case, if the receiver doubles congestion feedback,
789	   it causes the sender to respectively consume more allowance by a
790	   factor of 1.2, 1.15 or 1, where 1 implies the attack has become
791	   completely ineffective as no further congestion allowance is consumed
792	   but the flow will decrease its sending rate to a minimum instead.
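   The factors quoted above can be reproduced with a simple model.  It
   assumes a steady-state sending rate proportional to p^(-b) for a
   congestion level p, so that allowance consumption (rate times p) is
   proportional to p^(1-b).  The exponents b used below are assumptions
   chosen for illustration and are not taken from this document.

```python
def allowance_factor(b: float, feedback_multiplier: float = 2.0) -> float:
    """Factor by which allowance consumption grows.

    Model assumption: rate ~ p**(-b), so consumption ~ p**(1 - b).
    """
    return feedback_multiplier ** (1.0 - b)


# Assumed response exponents for each congestion control algorithm.
EXPONENTS = {"New Reno": 0.5, "Cubic": 0.75, "Compound TCP": 0.8,
             "DCTCP": 1.0}

if __name__ == "__main__":
    for name, b in EXPONENTS.items():
        print(name, round(allowance_factor(b), 2))
```

   With b = 1 the consumption does not depend on p at all, which is the
   case in which spoofed feedback only slows the flow down.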

794	11.  References

796	11.1.  Normative References

798	   [draft-ietf-conex-abstract-mech]
799	              Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
800	              Concepts and Abstract Mechanism", draft-ietf-conex-
801	              abstract-mech-06 (work in progress), October 2012.

803	   [draft-ietf-conex-destopt]
804	              Krishnan, S., Kuehlewind, M., and C. Ucendo, "IPv6
805	              Destination Option for ConEx", draft-ietf-conex-destopt-04
806	              (work in progress), March 2013.

808	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
809	              Selective Acknowledgment Options", RFC 2018,
810	              DOI 10.17487/RFC2018, October 1996,
811	              <http://www.rfc-editor.org/info/rfc2018>.

813	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
814	              Requirement Levels", BCP 14, RFC 2119,
815	              DOI 10.17487/RFC2119, March 1997,
816	              <http://www.rfc-editor.org/info/rfc2119>.

818	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
819	              of Explicit Congestion Notification (ECN) to IP",
820	              RFC 3168, DOI 10.17487/RFC3168, September 2001,
821	              <http://www.rfc-editor.org/info/rfc3168>.

823	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
824	              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009,
825	              <http://www.rfc-editor.org/info/rfc5681>.

827	11.2.  Informative References

829	   [DCTCP]    Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
830	              P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP:
831	              Efficient Packet Transport for the Commoditized Data
832	              Center", Jan 2010.

834	   [draft-kuehlewind-tcpm-accurate-ecn]
835	              Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN
836	              Feedback in TCP", draft-kuehlewind-tcpm-accurate-ecn-02
837	              (work in progress), Jun 2013.

839	   [I-D.briscoe-conex-re-ecn-tcp]
840	              Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
841	              "Re-ECN: Adding Accountability for Causing Congestion to
842	              TCP/IP", draft-briscoe-conex-re-ecn-tcp-04 (work in
843	              progress), July 2014.

845	   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
846	              for TCP", RFC 3522, DOI 10.17487/RFC3522, April 2003,
847	              <http://www.rfc-editor.org/info/rfc3522>.

849	   [RFC3708]  Blanton, E. and M. Allman, "Using TCP Duplicate Selective
850	              Acknowledgement (DSACKs) and Stream Control Transmission
851	              Protocol (SCTP) Duplicate Transmission Sequence Numbers
852	              (TSNs) to Detect Spurious Retransmissions", RFC 3708,
853	              DOI 10.17487/RFC3708, February 2004,
854	              <http://www.rfc-editor.org/info/rfc3708>.

856	   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
857	              for TCP", RFC 4015, DOI 10.17487/RFC4015, February 2005,
858	              <http://www.rfc-editor.org/info/rfc4015>.

860	   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
861	              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
862	              Spurious Retransmission Timeouts with TCP", RFC 5682,
863	              DOI 10.17487/RFC5682, September 2009,
864	              <http://www.rfc-editor.org/info/rfc5682>.

866	   [RFC6789]  Briscoe, B., Ed., Woundy, R., Ed., and A. Cooper, Ed.,
867	              "Congestion Exposure (ConEx) Concepts and Use Cases",
868	              RFC 6789, DOI 10.17487/RFC6789, December 2012,
869	              <http://www.rfc-editor.org/info/rfc6789>.

871	   [RFC7141]  Briscoe, B. and J. Manner, "Byte and Packet Congestion
872	              Notification", BCP 41, RFC 7141, DOI 10.17487/RFC7141,
873	              February 2014, <http://www.rfc-editor.org/info/rfc7141>.

875	Appendix A.  Revision history

877	   RFC Editor: This section is to be removed before RFC publication.

879	   00 ... initial draft, early submission to meet deadline.

881	   01 ... refined draft, updated LEG "drain" from per-packet to RTT-
882	   based.

884	   02 ... added Section 5 and expanded discussion about ECN interaction.

886	   03 ... expanded the discussion around credit bits.

888	   04 ... review comments of Jana addressed.  (Change in full compliance
889	   mode.)

891	   05 ... changes on Loss Detection without SACK, support of classic ECN
892	   and credit handling.

894	   07 ... review feedback provided by Nandita

896	   08 ... based on Bob's feedback: Wording edits and structuring of a
897	   few paragraphs; change of SHOULD to MAY for resetting negative LEG/
898	   CEG; additional security considerations provided by Bob (thanks!).

900	   09 ... experimentation section added

902	   10 ... final review comments based on IETF last call

904	Authors' Addresses

906	   Mirja Kuehlewind (editor)
907	   ETH Zurich
908	   Switzerland

910	   Email: mirja.kuehlewind@tik.ee.ethz.ch

912	   Richard Scheffenegger
913	   NetApp, Inc.
914	   Am Euro Platz 2
915	   Vienna  1120
916	   Austria

918	   Email: rs.ietf@gmx.at