idnits 2.17.1 

draft-ietf-conex-tcp-modifications-08.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 533 has weird spacing: '..._flight  credi...'

  -- The document date (April 22, 2015) is 3292 days in the past.  Is this
     intentional?


  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

     No issues found here.

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Congestion Exposure (ConEx)                           M. Kuehlewind, Ed.
3	Internet-Draft                                                ETH Zurich
4	Intended status: Experimental                           R. Scheffenegger
5	Expires: October 24, 2015                                   NetApp, Inc.
6	                                                          April 22, 2015

8	               TCP modifications for Congestion Exposure
9	                 draft-ietf-conex-tcp-modifications-08

11	Abstract

13	   Congestion Exposure (ConEx) is a mechanism by which senders inform
14	   the network about expected congestion based on congestion feedback
15	   from previous packets in the same flow.  This document describes the
16	   necessary modifications to use ConEx with the Transmission Control
17	   Protocol (TCP).

19	Status of This Memo

21	   This Internet-Draft is submitted in full conformance with the
22	   provisions of BCP 78 and BCP 79.

24	   Internet-Drafts are working documents of the Internet Engineering
25	   Task Force (IETF).  Note that other groups may also distribute
26	   working documents as Internet-Drafts.  The list of current Internet-
27	   Drafts is at http://datatracker.ietf.org/drafts/current/.

29	   Internet-Drafts are draft documents valid for a maximum of six months
30	   and may be updated, replaced, or obsoleted by other documents at any
31	   time.  It is inappropriate to use Internet-Drafts as reference
32	   material or to cite them other than as "work in progress."

34	   This Internet-Draft will expire on October 24, 2015.

36	Copyright Notice

38	   Copyright (c) 2015 IETF Trust and the persons identified as the
39	   document authors.  All rights reserved.

41	   This document is subject to BCP 78 and the IETF Trust's Legal
42	   Provisions Relating to IETF Documents
43	   (http://trustee.ietf.org/license-info) in effect on the date of
44	   publication of this document.  Please review these documents
45	   carefully, as they describe your rights and restrictions with respect
46	   to this document.  Code Components extracted from this document must
47	   include Simplified BSD License text as described in Section 4.e of
48	   the Trust Legal Provisions and are provided without warranty as
49	   described in the Simplified BSD License.

51	Table of Contents

53	   1.  Introduction
54	     1.1.  Requirements Language
55	   2.  Sender-side Modifications
56	   3.  Counting congestion
57	     3.1.  Loss Detection
58	       3.1.1.  Without SACK Support
59	     3.2.  ECN
60	       3.2.1.  Accurate ECN feedback
61	       3.2.2.  Classic ECN support
62	   4.  Setting the ConEx Flags
63	     4.1.  Setting the E or the L Flag
64	     4.2.  Setting the Credit Flag
65	   5.  Loss of ConEx information
66	   6.  Timeliness of the ConEx Signals
67	   7.  Acknowledgements
68	   8.  IANA Considerations
69	   9.  Security Considerations
70	   10. References
71	     10.1.  Normative References
72	     10.2.  Informative References
73	   Appendix A.  Revision history
74	   Authors' Addresses

76	1.  Introduction

78	   Congestion Exposure (ConEx) is a mechanism by which senders inform
79	   the network about expected congestion based on congestion feedback
80	   from previous packets in the same flow.  ConEx concepts and use cases
81	   are further explained in [RFC6789].  The abstract ConEx mechanism is
82	   explained in [draft-ietf-conex-abstract-mech].  This document
83	   describes the necessary modifications to use ConEx with the
84	   Transmission Control Protocol (TCP).

86	   The markings for ConEx signaling are defined in the ConEx Destination
87	   Option (CDO) for IPv6 [draft-ietf-conex-destopt].  Specifically, the
88	   use of four flags is defined: X (ConEx-capable), L (loss
89	   experienced), E (ECN experienced) and C (credit).

91	   ConEx signaling is based on loss or Explicit Congestion Notification
92	   (ECN) marks [RFC3168] as congestion indications.  The sender collects
93	   this congestion information based on existing TCP feedback mechanisms
94	   from the receiver to the sender.  No changes are needed at the
95	   receiver to implement ConEx signaling.  Therefore no additional
96	   negotiation is needed to implement and use ConEx at the sender.  This
97	   document specifies the sender's actions that are needed to provide
98	   meaningful ConEx information to the network.

100	   Section 2 provides an overview of the modifications needed for TCP
101	   senders to implement ConEx.  First, congestion information has to be
102	   extracted from TCP's loss or ECN feedback as described in section 3.
103	   Section 4 details how to set the CDO marking based on this congestion
104	   information.  Section 5 discusses loss of packets carrying ConEx
105	   information.  Section 6 discusses timeliness of the ConEx feedback
106	   signal, given congestion is a temporary state.

108	   This document describes congestion accounting for TCP with and
109	   without the Selective Acknowledgment (SACK) extension [RFC2018] (in
110	   section 3.1).  However, ConEx benefits from the more accurate
111	   information that SACK provides about the number of bytes dropped in
112	   the network.  It is therefore preferable to use the SACK extension
113	   when using TCP with ConEx.  The detailed mechanism to set the L flag
114	   in response to a loss-based congestion feedback signal is given in
115	   section 4.1.

117	   Whereas loss has to be minimized, ECN can provide more fine-grained
118	   feedback information.  ConEx-based traffic measurement or management
119	   mechanisms could benefit from this.  Unfortunately, the current ECN
120	   feedback mechanism does not reflect multiple congestion markings if
121	   they occur within the same Round-Trip Time (RTT).  A more accurate
122	   feedback extension to ECN (AccECN) is proposed in a separate document
123	   [draft-kuehlewind-tcpm-accurate-ecn], as this is also useful for
124	   other mechanisms.

126	   Congestion accounting for both classic ECN feedback and AccECN
127	   feedback is explained in detail in section 3.2.  Setting the E flag
128	   in response to ECN-based congestion feedback is again detailed in
129	   section 4.1.

131	1.1.  Requirements Language

133	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
134	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
135	   document are to be interpreted as described in [RFC2119].

137	2.  Sender-side Modifications

139	   This section gives an overview of actions that need to be taken by a
140	   TCP sender modified to use ConEx signaling.

142	   In the TCP handshake, a ConEx sender MUST negotiate for SACK and ECN
143	   preferably with AccECN feedback.  Therefore a ConEx sender MUST also
144	   implement SACK and ECN.  Depending on the capability of the receiver,
145	   the following operation modes exist:

147	   o  SACK-accECN-ConEx (SACK and accurate ECN feedback)

149	   o  SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)

151	   o  accECN-ConEx (no SACK but accurate ECN feedback)

153	   o  ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN)

155	   o  SACK-ConEx (SACK but no ECN at all)

157	   o  Basic-ConEx (neither SACK nor ECN)

159	   A ConEx sender MUST expose all congestion information to the network
160	   according to the congestion information received by ECN or based on
161	   loss information provided by the TCP feedback loop.  A TCP sender
162	   SHOULD count congestion byte-wise (rather than packet-wise; see next
163	   paragraph).  After any congestion notification, a sender MUST mark
164	   subsequent packets with the appropriate ConEx flag in the IP header.
165	   Furthermore, a ConEx sender must send enough credit to cover all
166	   experienced congestion for the connection so far, as well as the risk
167	   of congestion for the current transmission (see Section 4.2).

169	   With SACK the number of lost payload bytes is known, but not the
170	   number of packets carrying these bytes.  With classic ECN only an
171	   indication is given that a marking occurred but not the exact number
172	   of payload bytes nor packets.  As network congestion is usually byte-
173	   congestion [RFC7141], the byte-size of a packet marked with a CDO
174	   flag is defined to represent that number of bytes of congestion
175	   signalling [draft-ietf-conex-destopt].  Therefore the exact number of
176	   bytes should be taken into account, if available, to make the ConEx
177	   signal as exact as possible.

179	   Detailed mechanisms for congestion counting in each operation mode
180	   are described in the next section.

182	3.  Counting congestion

184	   A ConEx TCP sender maintains two counters: one that counts congestion
185	   based on the information retrieved by loss detection, and a second
186	   that accounts for ECN based congestion feedback.  These counters hold
187	   the number of outstanding bytes that should be ConEx marked with the
188	   L flag or the E flag, respectively, in subsequent packets.

190	   The outstanding bytes for congestion indications based on loss are
191	   maintained in the loss exposure gauge (LEG), as explained in
192	   Section 3.1.

194	   The outstanding bytes counted based on ECN feedback information are
195	   maintained in the congestion exposure gauge (CEG), as explained in
196	   Section 3.2.

198	   When the sender sends a ConEx capable packet with the E or L flag set
199	   it reduces the respective counter by the byte-size of the packet.
200	   This is explained for both counters in Section 4.1.  Usually all
201	   bytes of an IP packet must be counted.  Therefore the sender SHOULD
202	   take the payload and headers into account, up to and including the IP
203	   header.

205	   If equal-sized packets, or at least equally distributed packet sizes
206	   can be assumed, the sender MAY only add and subtract TCP payload
207	   bytes.  In this case there should be about the same number of ConEx
208	   marked packets as the original packets that were causing the
209	   congestion.  Thus both contain about the same number of header bytes
210	   so they will cancel out.  This case is assumed for simplicity in the
211	   following sections.

213	   Otherwise, if a sender sends different sized packets (with unequally
214	   distributed packet sizes), the sender needs to memorize or estimate
215	   the number of lost or ECN-marked packets.  A sender might be able to
216	   reconstruct the number of packets and thus the header bytes if the
217	   packet sizes of all packets that were sent during the last RTT are
218	   known.  Otherwise, if no additional information is available, the
219	   worst case number of packets and thus header bytes should be
220	   estimated, e.g. based on the minimum packet size (of all packets sent
221	   in the last RTT).  If the number of newly sent-out packets with the
222	   ConEx L or E flag set is smaller (or larger) than this estimated
223	   number of lost/ECN-marked packets, the additional header bytes should
224	   be added to (or can be subtracted from) the respective gauge.

226	3.1.  Loss Detection

228	   This section applies whether or not SACK support is available.  The
229	   following subsection in addition handles the case when SACK is not
230	   available.

232	   A TCP sender detects losses and subsequently retransmits the lost
233	   data.  Therefore, a ConEx sender can simply set the ConEx L flag on
234	   all retransmissions in order to at least cover the amount of bytes
235	   lost.  If this approach is taken, no LEG is needed.

237	   However, any retransmission may be spurious.  In this case more bytes
238	   have been marked than necessary.  To compensate for this effect, a
239	   ConEx sender can maintain a local signed counter, the LEG, that
240	   indicates the number of outstanding bytes to be sent with the ConEx L
241	   flag and that can also become negative.  Using the LEG, when a TCP
242	   sender decides that a data segment needs to be retransmitted, it will
243	   increase the LEG by the size of the TCP payload bytes in the
244	   retransmission (assuming equal-sized segments such that the
245	   retransmitted packet will have the same number of header bytes as the
246	   original ones) and reduce the LEG as described in Section 4.
247	   Further, to accommodate spurious retransmissions, a ConEx sender
248	   SHOULD make use of heuristics to detect them (e.g., F-RTO [RFC5682],
249	   DSACK [RFC3708], and Eifel [RFC3522], [RFC4015]).  When such a
250	   heuristic has determined that a certain number of packets were
251	   retransmitted erroneously, the ConEx sender subtracts the payload
252	   size of these TCP packets from the LEG.
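
   The LEG bookkeeping described above can be sketched as follows.  This
   is only an illustration, assuming equal-sized segments so that TCP
   payload bytes alone are counted.  The class and method names are
   hypothetical and not defined by this document, and the spurious
   retransmission detector (F-RTO, DSACK or Eifel) is assumed to exist
   elsewhere and to report the affected payload size.

```python
class LossExposureGauge:
    """Signed byte counter for outstanding ConEx L markings (illustrative)."""

    def __init__(self):
        self.leg = 0  # signed: may become negative

    def on_retransmission(self, payload_bytes):
        # A segment is deemed lost and is retransmitted: expose its bytes.
        self.leg += payload_bytes

    def on_spurious_retransmission(self, payload_bytes):
        # A heuristic found the retransmission unnecessary: take the bytes back.
        self.leg -= payload_bytes

    def on_l_marked_packet_sent(self, payload_bytes):
        # Each packet sent with the L flag reduces the gauge by its payload.
        self.leg -= payload_bytes


gauge = LossExposureGauge()
gauge.on_retransmission(1460)           # loss detected, LEG = 1460
gauge.on_l_marked_packet_sent(1460)     # L-marked packet sent, LEG = 0
gauge.on_spurious_retransmission(1460)  # retransmission was spurious
print(gauge.leg)  # -1460
```

   A negative gauge means that more bytes were L-marked than necessary;
   bytes of a later loss are first offset against this surplus.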

254	3.1.1.  Without SACK Support

256	   If multiple losses occur within one RTT and SACK is not used, it may
257	   take several RTTs until all lost data is retransmitted.  With the
258	   scheme described above, the ConEx information will be delayed
259	   considerably, but timeliness is important for ConEx.  However, for
260	   ConEx it is not important to know which data got lost but only how
261	   much.  During the first RTT after the initial loss detection, the
262	   amount of received data and thus also the amount of lost data can be
263	   estimated based on the number of received ACKs.  Therefore a ConEx
264	   sender can use the following algorithm to estimate the number of
265	   lost bytes with an additional delay of one RTT using an additional
266	   Loss Estimation Counter (LEC):

268	      flight_bytes:      current flight size in bytes
269	      retransmit_bytes:  payload size of the retransmission

271	      At the first retransmission in a congestion event LEC is set:

273	         LEC = flight_bytes - 3*SMSS

275	         (At this point of time in the transmission, in the worst case,
276	         all packets in flight minus three that triggered the dupACKs
277	         could have been lost.)

279	      Then during the first RTT of the congestion event:

281	         For each retransmission:
282	            LEG += retransmit_bytes
283	            LEC -= retransmit_bytes

285	         For each ACK:
286	            LEC -= SMSS

288	      After one RTT:

290	         LEG += LEC

292	         (The LEC now estimates the number of outstanding bytes
293	         that should be ConEx L marked.)

295	      After the first RTT, for each following retransmission:

297	         if (LEC > 0): LEC -= retransmit_bytes
298	         else if (LEC==0): LEG += retransmit_bytes

300	         if (LEC < 0): LEG += -LEC

302	         (The LEG is not increased for those bytes that were
303	         already counted.)
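
   A runnable sketch of the estimation algorithm above is given below.
   It is a reading of the pseudocode under stated assumptions, not a
   definitive implementation.  The class and method names are
   hypothetical, the caller is assumed to signal when the first RTT of
   the congestion event has elapsed, and resetting LEC to zero once its
   negative excess has been moved into LEG is an assumption that the
   pseudocode does not state explicitly.

```python
class LossEstimator:
    """LEG/LEC bookkeeping for a sender without SACK (illustrative)."""

    def __init__(self, smss):
        self.smss = smss
        self.leg = 0            # loss exposure gauge, in bytes
        self.lec = 0            # loss estimation counter, in bytes
        self.in_first_rtt = False

    def start_congestion_event(self, flight_bytes):
        # Worst case: everything in flight was lost except the three
        # segments that triggered the duplicate ACKs.
        self.lec = flight_bytes - 3 * self.smss
        self.in_first_rtt = True

    def on_retransmission(self, retransmit_bytes):
        if self.in_first_rtt:
            self.leg += retransmit_bytes
            self.lec -= retransmit_bytes
            return
        if self.lec > 0:
            self.lec -= retransmit_bytes   # bytes already counted in LEG
        elif self.lec == 0:
            self.leg += retransmit_bytes   # estimate used up: count directly
        if self.lec < 0:
            self.leg += -self.lec          # excess beyond the estimate
            self.lec = 0                   # ASSUMPTION: avoid re-adding excess

    def on_ack(self):
        if self.in_first_rtt:
            self.lec -= self.smss          # one more segment was received

    def end_first_rtt(self):
        self.leg += self.lec               # expose the estimated losses
        self.in_first_rtt = False


est = LossEstimator(smss=1000)
est.start_congestion_event(flight_bytes=10000)  # LEC = 7000
est.on_retransmission(1000)                     # LEG = 1000, LEC = 6000
for _ in range(4):
    est.on_ack()                                # LEC = 2000
est.end_first_rtt()                             # LEG = 3000
est.on_retransmission(1000)                     # LEC = 1000
est.on_retransmission(1000)                     # LEC = 0
est.on_retransmission(1000)                     # LEG = 4000
print(est.leg, est.lec)  # 4000 0
```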

305	3.2.  ECN

307	   ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to
308	   mark packets with the Congestion Experienced (CE) mark instead of
309	   dropping them when congestion occurs.

311	   A receiver might support 'classic' ECN, the more accurate ECN
312	   feedback scheme (AccECN), or neither.  In the case that ECN is not
313	   supported for a connection, of course, no ECN marks will occur; thus
314	   the sender will never set the E flag.  Otherwise, a ConEx sender
315	   needs to maintain a signed counter, the congestion exposure gauge
316	   (CEG), for the number of outstanding bytes that have to be ConEx
317	   marked with the E flag.

319	   The CEG is increased when ECN information is received from an ECN-
320	   capable receiver supporting the 'classic' ECN scheme or the accurate
321	   ECN feedback scheme.  When the ConEx sender receives an ACK
322	   indicating one or more segments were received with a CE mark, CEG is
323	   increased by the appropriate number of bytes as described further
324	   below.

326	   Unfortunately in case of duplicate acknowledgements the number of
327	   newly acknowledged bytes will be zero even though (CE marked) data
328	   has been received.  Therefore, we increase the CEG by DeliveredData,
329	   as defined below:

331	   DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS -
332	                   (is_after_dup)*num_dup*1SMSS

334	   DeliveredData covers the number of bytes that have been newly
335	   delivered to the receiver.  Therefore on each arrival of an ACK,
336	   DeliveredData will be increased by the newly acknowledged bytes
337	   (acked_bytes) as indicated by the current ACK, relative to all past
338	   ACKs.  The formula depends on whether SACK is available: if SACK is
339	   not available, SACK_diff is always zero, whereas if SACK information
340	   is available, is_dup and is_after_dup are always zero.

342	   With SACK, DeliveredData is increased by the number of bytes provided
343	   by (new) SACK information (SACK_diff).  Note, if fewer unacknowledged
344	   bytes are announced in the new SACK information than in the previous
345	   ACK, SACK_diff can be negative.  In this case, data is newly
346	   acknowledged (in acked_bytes), that has previously already been
347	   accumulated into DeliveredData based on SACK information.

349	   Otherwise without SACK, DeliveredData is increased by 1 SMSS on
350	   duplicate acknowledgements as duplicate acknowledgements do not
351	   acknowledge any new data (and acked_bytes will be zero).  For the
352	   subsequent partial or full ACK, acked_bytes covers all newly
353	   acknowledged bytes including the ones that were already accounted
354	   for upon the reception of any duplicate acknowledgement.  Therefore
355	   DeliveredData is reduced by one SMSS for each preceding duplicate
356	   ACK.  Consequently, is_dup is one if the current ACK is a duplicate
357	   ACK without SACK, and zero otherwise.  is_after_dup is only one for
358	   the next full or partial ACK after a number of duplicate ACKs
359	   without SACK, and num_dup counts the number of duplicate ACKs in a
360	   row (which usually is 3 or more).
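
   The DeliveredData computation can be illustrated with a small
   function.  This is a sketch only: the function name and argument
   order are hypothetical, and the caller is assumed to have already
   derived acked_bytes, SACK_diff, is_dup, is_after_dup and num_dup
   from the incoming ACK as defined above.

```python
def delivered_data(acked_bytes, sack_diff, is_dup, is_after_dup, num_dup, smss):
    """Bytes newly delivered to the receiver, per the formula above."""
    return (acked_bytes + sack_diff
            + int(is_dup) * smss
            - int(is_after_dup) * num_dup * smss)


SMSS = 1000

# Without SACK: three duplicate ACKs, then a partial ACK covering 4 segments.
dup = delivered_data(0, 0, True, False, 0, SMSS)
after = delivered_data(4 * SMSS, 0, False, True, 3, SMSS)
print(dup, after, 3 * dup + after)  # 1000 1000 4000

# With SACK: new SACK information for one segment, then a cumulative ACK
# that covers the SACKed segment plus one more (SACK_diff turns negative).
sacked = delivered_data(0, SMSS, False, False, 0, SMSS)
cumulative = delivered_data(2 * SMSS, -SMSS, False, False, 0, SMSS)
print(sacked, cumulative)  # 1000 1000
```

   In both cases the sum over all ACKs equals the number of bytes that
   actually arrived, so no byte is counted twice.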

362	   With classic ECN, one congestion marked packet causes continuous
363	   congestion feedback for a whole round trip, thus hiding the arrival
364	   of any further congestion marked packets during that round trip.  The
365	   more accurate ECN feedback scheme (AccECN) is needed to ensure that
366	   feedback properly reflects the extent of congestion marking.  The two
367	   cases, with and without a receiver capable of AccECN, are discussed
368	   in the following sections.

370	3.2.1.  Accurate ECN feedback

372	   With a more accurate ECN feedback scheme (AccECN) either the number
373	   of marked packets or the number of marked bytes is known.  In the
374	   latter case the CEG can directly be increased by the number of marked
375	   bytes.  Otherwise if D is assumed to be the number of marks, the
376	   gauge (CEG) will be conservatively increased by one SMSS for each
377	   marking or at max the number of newly acknowledged bytes:

379	   CEG += min(SMSS*D, DeliveredData)

381	3.2.2.  Classic ECN support

383	   With classic ECN, as soon as a CE mark is seen at the receiver, it
384	   will feed this information back to the sender by setting the Echo
385	   Congestion Experienced (ECE) flag in the TCP header of subsequent
386	   ACKs.  Once the sender receives the first ECE of a congestion
387	   notification, it sets the CWR flag in the TCP header once.  When this
388	   packet with Congestion Window Reduced (CWR) flag in the TCP header
389	   arrives at the receiver, acknowledging its first ECE feedback, the
390	   receiver stops setting ECE.

392	   If the ConEx sender fully conforms to the semantics of ECN signaling
393	   as defined by [RFC3168], it will receive one full RTT of ACKs with
394	   the ECE flag set whenever at least one CE mark was received by the
395	   receiver.  As the sender cannot estimate how many packets have
396	   actually been CE marked during this RTT, the most conservative
397	   assumption MAY be taken, namely assuming that all packets were
398	   marked.  This can be achieved by increasing the CEG by DeliveredData
399	   for each ACK with the ECE flag:

401	   CEG += DeliveredData
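
   The two CEG update rules (Section 3.2.1 for AccECN and the
   conservative rule above for classic ECN) can be sketched side by
   side.  The function name and its parameters are hypothetical;
   accecn_marks stands for D, the number of newly reported CE marks,
   and is None when the receiver only supports classic ECN.

```python
def ceg_increase(delivered_data, smss, accecn_marks=None, ece=False):
    """Bytes to add to the CEG for one incoming ACK (illustrative)."""
    if accecn_marks is not None:
        # AccECN with a packet count: one SMSS per mark, capped by the
        # number of newly delivered bytes.
        return min(smss * accecn_marks, delivered_data)
    # Classic ECN: assume every byte covered by an ECE-flagged ACK was marked.
    return delivered_data if ece else 0


SMSS = 1460
print(ceg_increase(2920, SMSS, accecn_marks=1))  # 1460
print(ceg_increase(1460, SMSS, accecn_marks=2))  # 1460
print(ceg_increase(2920, SMSS, ece=True))        # 2920
print(ceg_increase(2920, SMSS, ece=False))       # 0
```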

403	   Optionally a ConEx sender could implement the following technique
404	   (which does not conform to [RFC3168]), called advanced compatibility
405	   mode, to considerably improve its estimate of the number of
406	   ECN-marked packets:

408	   To extract more than one ECE indication per RTT, a ConEx sender could
409	   set the CWR flag continuously to force the receiver to signal only
410	   one ECE per CE mark.  Unfortunately, the use of delayed ACKs
411	   [RFC5681] (which is common) will prevent feedback of every CE mark;
412	   if a CWR confirmation is received before the ECE can be sent out on
413	   the next ACK, ECN feedback information could get lost (depending on
414	   the actual receiver implementation).  Thus a sender SHOULD set CWR
415	   only on those data segments that will presumably trigger a (delayed)
416	   ACK.  The sender would need an additional control loop to estimate
417	   which data segments will trigger an ACK in order to extract more
418	   timely congestion notifications.  Still the CEG SHOULD be increased
419	   by DeliveredData, as one or more CE marked packets could be
420	   acknowledged by one delayed ACK.

422	   The following argument is intended to prove that suppressing
423	   repetitions of ECE is safe against possible congestion collapse due
424	   to lost congestion feedback:

426	   Repetition of ECE in classic ECN is intended to ensure reliable
427	   delivery of congestion feedback.  However, with advanced
428	   compatibility mode, it is possible to miss congestion notifications.
429	   This can happen in some implementations if delayed acknowledgements
430	   are used, as described above.  Further an ACK containing ECE can
431	   simply get lost.  If only a few CE marks are received within one
432	   congestion event (e.g., only one), the loss of acknowledgements due
433	   to (heavy) congestion on the reverse path can prevent any
434	   congestion notification from being received by the sender.

436	   However, if loss of feedback exacerbates congestion on the forward
437	   path, more forward packets will be CE marked, increasing the
438	   likelihood that feedback from at least one CE will get through per
439	   RTT.  As long as one ECE reaches the sender per RTT, the sender's
440	   congestion response will be the same as if CWR were not continuous.
441	   The only way that heavy congestion on the forward path could be
442	   completely hidden would be if all ACKs on the reverse path were lost.
443	   If total ACK loss persisted, the sender would time out and do a
444	   congestion response anyway.  Therefore, the problem seems confined to
445	   potential suppression of a congestion response during light
446	   congestion.

448	   Anyway, even if loss of all ECN feedback led to no congestion
449	   response, the worst that could happen would be loss instead of ECN-
450	   signalled congestion on the forward path.  Given compatibility mode
451	   does not affect loss feedback, there would be no risk of congestion
452	   collapse.

454	4.  Setting the ConEx Flags

456	   By setting the X flag, a packet is marked as ConEx-capable.  All
457	   packets carrying payload MUST be marked with the X flag set,
458	   including retransmissions.  The X flag SHOULD be zero only if no
459	   congestion feedback information is (currently) available, such as
460	   for control packets on a connection that has not sent any (user)
461	   data for some time, e.g., when sending only pure ACKs that do not
462	   carry any payload.

464	4.1.  Setting the E or the L Flag

466	   As described in Section 3, the sender needs to maintain a CEG
467	   counter and might maintain a LEG counter.  If no LEG is used, all
468	   retransmissions will be marked with the L flag.

470	   Further, as long as the LEG or CEG counter is positive, the sender
471	   marks each ConEx-capable packet with L or E respectively, and
472	   decreases the LEG or CEG counter by the TCP payload bytes carried in
473	   the marked packet (assuming headers are not being counted because
474	   packet sizes are regular).  No matter how small the value of LEG or
475	   CEG, if it is positive, the sender MUST NOT defer packet marking to
476	   ensure ConEx signals are timely.  Therefore the value of LEG and CEG
477	   will commonly be negative.

479	   If both LEG and CEG are positive, the sender MUST mark each ConEx-
480	   capable packet with both L and E.  If a credit signal is also pending
481	   (see next section), the C flag can be set as well.
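
   The marking rule of this section can be sketched as below, assuming
   regular packet sizes so that only TCP payload bytes are counted.
   The function name and the dictionary holding the two gauges are
   hypothetical, and the credit flag is left out of this sketch.

```python
def mark_packet(payload_bytes, gauges):
    """Choose ConEx flags for one data packet and update the gauges."""
    flags = {"X"}                      # every payload packet is ConEx-capable
    for gauge, flag in (("LEG", "L"), ("CEG", "E")):
        if gauges[gauge] > 0:          # never defer, however small the value
            flags.add(flag)
            gauges[gauge] -= payload_bytes
    return flags


gauges = {"LEG": 100, "CEG": 3000}
sent = [sorted(mark_packet(1460, gauges)) for _ in range(4)]
print(sent)    # [['E', 'L', 'X'], ['E', 'X'], ['E', 'X'], ['X']]
print(gauges)  # {'LEG': -1360, 'CEG': -1380}
```

   Because a whole packet is marked even when the gauge holds only a
   few bytes, both gauges end up negative in this example.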

483	4.2.  Setting the Credit Flag

485	   The ConEx abstract mechanism [draft-ietf-conex-abstract-mech]
486	   requires that sufficient credit MUST be signaled in advance to cover
487	   the expected congestion during the feedback delay of one RTT.

489	   To monitor the credit state at the audit, a ConEx sender needs to
490	   maintain a credit state counter CSC in bytes.  If congestion occurs,
491	   credits will be consumed and the CSC is reduced by the number of
492	   bytes that were lost or estimated to be ECN-marked.  If the risk of
493	   congestion was estimated wrongly and thus too few credits were sent,
494	   the CSC becomes zero but cannot go negative.

496	   To be sure that the credit state in the audit never reaches zero, the
497	   number of credits should always equal the number of bytes in flight
498	   as all packets could potentially get lost or congestion marked.  In
499	   this case a ConEx sender also monitors the number of bytes in flight
500	   F.  If F ever becomes larger than CSC, the ConEx sender sets the C
501	   flag on each ConEx-capable packet and increases CSC by the payload
502	   size of each marked packet until CSC is no less than F again.
503	   However, a ConEx sender might also be less conservative and send
504	   fewer credits, if it e.g. assumes based on previous experience that
505	   the congestion will be low on a certain path.

507	   Recall that CSC will be decreased whenever congestion occurs,
508	   therefore CSC will need to be replenished as soon as CSC drops below
509	   F.  Also recall that the sender can set the C flag on a ConEx-capable
510	   packet whether or not the E or L flags are also set.
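
   The conservative credit rule (keep CSC at or above the bytes in
   flight F) can be sketched as follows.  The class and method names
   are hypothetical, and F is assumed to already include the packet
   that is about to be sent.

```python
class CreditState:
    """Sender-side estimate of the credit held at the audit (illustrative)."""

    def __init__(self):
        self.csc = 0  # credit state counter, in bytes

    def on_congestion(self, congested_bytes):
        # Credits are consumed by loss or ECN marks; CSC cannot go negative.
        self.csc = max(0, self.csc - congested_bytes)

    def on_send(self, payload_bytes, flight_bytes):
        """Return True if the packet should carry the C flag."""
        if flight_bytes > self.csc:
            self.csc += payload_bytes
            return True
        return False


cs = CreditState()
marks = [cs.on_send(1000, 1000),   # F = 1000 > CSC = 0
         cs.on_send(1000, 2000),   # F = 2000 > CSC = 1000
         cs.on_send(1000, 2000)]   # F = 2000 == CSC, no credit needed
cs.on_congestion(1000)             # one segment lost, CSC = 1000
marks.append(cs.on_send(1000, 2000))
print(marks, cs.csc)  # [True, True, False, True] 2000
```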

512	   In TCP slow start, the congestion window might grow much larger than
513	   during the rest of the transmission.  A sender could therefore
514	   consider sending fewer than F credits, at the risk of being
515	   penalized by an audit function.  However, the credits should at
516	   least cover the increase in sending rate.  Given the sending rate
517	   doubles every RTT in Slow Start, a ConEx sender should at least
518	   cover half the number of packets in flight by credits.

520	   Note that the number of losses or markings within one RTT does not
521	   solely depend on the sender's actions.  In general, the behavior of
522	   the cross traffic, whether active queue management (AQM) is used and
523	   how it is parameterized influence how many packets might be dropped
524	   or marked.  As long as any AQM encountered is not overly aggressive
525	   with ECN marking, sending half the flight size as credits should be
526	   sufficient whether congestion is signaled by loss or ECN.

528	   To maintain half of the packets in flight as credits, of course half
529	   of the packets of the initial window must be C marked.  In Slow Start
530	   marking every fourth packet introduces the correct amount of credit
531	   as can be seen in Figure 1.

533	                                        in_flight  credits
534	                RTT1  |------XC------>|     1         1
535	                      |------X------->|     2         1
536	                      |------XC------>|     3         2
537	                      |               |
538	                RTT2  |------X------->|     3         2
539	                      |------X------->|     4         2
540	                      |------X------->|     4         2
541	                      |------XC------>|     5         3
542	                      |------X------->|     5         3
543	                      |------X------->|     6         3
544	                      |               |
545	                RTT3  |------X------->|     6         3
546	                      |------XC------>|     7         4
547	                      |------X------->|     7         4
548	                      |------X------->|     8         4
549	                      |------X------->|     8         4
550	                      |------XC------>|     9         5
551	                      |------X------->|     9         5
552	                      |------X------->|    10         5
553	                      |------X------->|    10         5
554	                      |------XC------>|    11         6
555	                      |------X------->|    11         6
556	                      |------X------->|    12         6
557	                      |      .        |
558	                      |      :        |

560	       Figure 1: Credits in Slow Start (with an initial window of 3)
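
   The marking pattern of Figure 1 can be reproduced with a short
   simulation.  This sketch counts in packets rather than bytes and
   assumes no congestion, one ACK per data packet (no delayed ACKs),
   and two packets released per ACK, of which only the second grows
   the number of packets in flight.  The function name and the rule
   "set C while twice the credits is less than the packets in flight"
   are illustrative choices, not part of this specification.

```python
def slow_start_credits(initial_window, total_packets):
    """Return (packet_no, in_flight, credits, c_flag) for each sent packet."""
    rows = []
    in_flight = 0
    credits = 0
    for n in range(1, total_packets + 1):
        if n <= initial_window:
            in_flight += 1                      # initial window burst
        elif (n - initial_window) % 2 == 0:
            in_flight += 1                      # second packet per ACK grows cwnd
        # else: first packet per ACK only replaces the acknowledged one
        c_flag = 2 * credits < in_flight        # keep credits >= half the flight
        if c_flag:
            credits += 1
        rows.append((n, in_flight, credits, c_flag))
    return rows


rows = slow_start_credits(initial_window=3, total_packets=21)
print([n for n, _, _, c in rows if c])  # [1, 3, 7, 11, 15, 19]
print(rows[-1][1:3])                    # (12, 6)
```

   After the initial window, the C flag falls on every fourth packet,
   matching the in_flight and credits columns of Figure 1.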

562	   It is possible that a TCP flow will encounter an audit function
563	   without relevant flow state, due to e.g. rerouting or memory
564	   limitations.  Therefore, the sender needs to detect this case and
565	   resend credits.  A ConEx sender might reset the credit counter CSC to
566	   zero if losses occur in subsequent RTTs (assuming that the sending
567	   rate was correctly reduced based on the received congestion signal
568	   and using a conservatively large RTT estimation).

570	   This section proposes concrete algorithms for determining how much
571	   credit to signal during congestion avoidance and slow start.
572	   However, experimentation in credit setting algorithms is expected and
573	   encouraged.  The wider goal of ConEx is to reflect the 'cost' of the
574	   risk of causing congestion on those that contribute most to it.
575	   Thus, experimentation is encouraged to improve or maintain
576	   performance while reducing the risk of causing congestion, and
577	   therefore potentially reducing the need to signal so much credit.

579	5.  Loss of ConEx information

581	   Packets carrying ConEx signals could be discarded themselves.  This
582	   will be a second-order problem (e.g. if the loss probability is 0.1%,
583	   the probability of losing a ConEx L signal will be 0.1% of 0.1% =
584	   0.0001%).  Further, the penalty an audit induces should be
585	   proportional to the mismatch of expected ConEx marks and observed
586	   congestion; therefore, the audit might only slightly increase the
587	   loss level of this flow.  Therefore, an implementer MAY choose to
588	   ignore this problem, accepting instead the risk that an audit
589	   function might wrongly penalize a flow.

591	   Nonetheless, a ConEx sender is responsible for always signaling
592	   sufficient congestion feedback and therefore SHOULD remember which
593	   packets were marked with the L, the E, or the C flag.  If one of
594	   these packets is detected as lost, the sender SHOULD increase the
595	   respective gauge(s), LEG or CEG, by the number of lost payload bytes
596	   in addition to increasing LEG for the loss.
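
   The bookkeeping described above might be sketched as follows.  This
   is a minimal illustration, not a reference implementation: the class
   and method names are hypothetical, retransmissions are ignored, and
   because the text names no gauge for lost C-marked packets, the sketch
   merely records their byte count in a separate field (an assumption).

```python
from dataclasses import dataclass, field


@dataclass
class ConExLossTracker:
    leg: int = 0          # loss exposure gauge (LEG), in bytes
    ceg: int = 0          # congestion exposure gauge (CEG), in bytes
    lost_credit: int = 0  # assumption: lost C-marked bytes, kept separately
    marked: dict = field(default_factory=dict)  # seq -> (flags, payload_len)

    def on_send(self, seq, payload_len, flags=frozenset()):
        """Remember only packets that carried at least one ConEx flag."""
        if flags:
            self.marked[seq] = (frozenset(flags), payload_len)

    def on_loss(self, seq, payload_len):
        """Account for a packet detected as lost."""
        self.leg += payload_len  # the loss itself must be signaled
        flags, marked_len = self.marked.pop(seq, (frozenset(), 0))
        if "L" in flags:
            self.leg += marked_len  # the lost L signal must be re-sent
        if "E" in flags:
            self.ceg += marked_len  # the lost E signal must be re-sent
        if "C" in flags:
            self.lost_credit += marked_len
```

   For example, losing a 1000-byte packet that carried both L and E
   raises LEG by 2000 bytes (1000 for the loss, 1000 for the lost L
   signal) and CEG by 1000 bytes.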

598	6.  Timeliness of the ConEx Signals

600	   ConEx signals will only be useful to a network node within a time
601	   delay of about one RTT after the congestion occurred.  To avoid
602	   further delays, a ConEx sender SHOULD send the ConEx signaling on the
603	   next available packet.

605	   Any or all of the ConEx flags can be used in the same packet, which
606	   allows delay to be minimised when multiple signals are pending.  The
607	   need to set multiple ConEx flags at the same time can occur if, e.g.,
608	   an ACK is received by the sender that simultaneously indicates that
609	   at least one ECN mark was received and that one or more segments
610	   were lost.  This may happen, e.g., during excessive congestion, where
611	   the queues overflow even though ECN is used: all forwarded packets
612	   are currently marked, while others still have to be dropped.  Another
613	   case in which this can happen is when ACKs are lost, so that a
614	   subsequent ACK carries summary information not previously available
615	   to the sender.

617	   If a flow becomes application-limited, there could be insufficient
618	   bytes to send to reduce the gauges to zero or below.  In such cases,
619	   the sender cannot help but delay ConEx signals.  Nonetheless, as long
620	   as the sender is marking all outgoing packets, an audit function is
621	   unlikely to penalize ConEx-marked packets.  Therefore, no matter how
622	   long a gauge has been positive, a sender MUST NOT reduce the gauge by
623	   more than the ConEx marked bytes it has sent.
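
   The marking rule of this section could look roughly like the
   following sketch.  It assumes gauges are kept in payload bytes in a
   dictionary keyed by flag name ("L" for LEG, "E" for CEG); the
   function name and this representation are hypothetical.

```python
def mark_next_packet(gauges, payload_len):
    """Pick the ConEx flags for the next outgoing packet.

    Every flag whose gauge is positive is set on the same packet, and
    each such gauge is reduced by exactly the payload bytes that carry
    the mark, never by more.  A gauge may therefore drop below zero.
    """
    flags = set()
    for flag in ("L", "E"):
        if gauges[flag] > 0:
            flags.add(flag)
            gauges[flag] -= payload_len
    return flags
```

   If the flow is application-limited and no packet is sent, the
   function is simply not called and the gauges stay positive, so the
   pending signals are delayed rather than dropped.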

625	   If the CEG or LEG counter is negative, the respective counter MAY be
626	   reset to zero within one RTT after it was decreased the last time or
627	   one RTT after recovery if no further congestion occurred.

629	7.  Acknowledgements

631	   The authors would like to thank Bob Briscoe, who contributed the
632	   initial ideas [I-D.briscoe-conex-re-ecn-tcp] and valuable feedback.
633	   Moreover, thanks to Jana Iyengar, who provided valuable feedback.

635	8.  IANA Considerations

637	   This document does not have any requests to IANA.

639	9.  Security Considerations

641	   General ConEx security considerations are covered extensively in the
642	   ConEx abstract mechanism [draft-ietf-conex-abstract-mech].  This
643	   section covers TCP-specific concerns.

645	   The ConEx modifications to TCP provide no mechanism for a receiver to
646	   force a sender not to use ConEx.  A receiver can degrade the accuracy
647	   of ConEx by claiming that it does not support SACK, AccECN or ECN,
648	   but the sender will never have to turn ConEx off.  The receiver
649	   cannot force the sender to have to mark ConEx more conservatively, in
650	   order to cover the risk of any inaccuracy.  Instead the sender can
651	   choose to mark inaccurately, which will only increase the likelihood
652	   of loss at an audit function.  Thus the receiver will only harm
653	   itself.

655	   Assuming the sender is limited in some way by a congestion allowance
656	   or quota, a receiver could spoof more loss or ECN congestion feedback
657	   than it actually experiences, in an attempt to make the sender draw
658	   down its allowance faster than necessary.  However, over-declaring
659	   congestion simply makes the sender slow down.  If the receiver is
660	   interested in the content it will not want to harm its own
661	   performance.

663	   However, if the receiver is solely interested in making the sender
664	   draw down its allowance, the net effect will depend on the sender's
665	   congestion control algorithm, as permanently adding more and more
666	   congestion feedback would cause the sender to reduce its sending
667	   rate more and more.  Therefore, a receiver can only maintain a
668	   congestion level that corresponds to a certain sending rate.
669	   With New Reno [RFC5681], doubling congestion feedback causes the
670	   sender to reduce its sending rate such that it would only consume
671	   sqrt(2), i.e. about 1.4, times the congestion allowance.  However,
672	   to improve scaling, congestion control algorithms are tending towards
673	   less responsive algorithms like Cubic or Compound TCP, and ultimately
674	   to linear algorithms like DCTCP [DCTCP] that aim to maintain the same
675	   congestion level independent of the current sending rate and always
676	   reduce their sending window if the signaled congestion feedback is
677	   higher.  In each case, if the receiver doubles congestion feedback,
678	   it causes the sender to consume more allowance by a factor of 1.2,
679	   1.15, or 1, respectively, where 1 implies the attack has become
680	   completely ineffective: no further congestion allowance is consumed,
681	   but the flow will decrease its sending rate to a minimum instead.
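
   The factors quoted above can be checked with a simple steady-state
   model.  As a sketch, assume the sending rate scales as 1/p^b for a
   congestion level p, so that the consumed allowance (congestion level
   times rate) scales as p^(1-b).  The exponents used below (0.5 for
   New Reno, 0.75 for Cubic, 0.8 for Compound TCP, 1 for DCTCP) are
   commonly cited approximations and are an assumption of this example,
   not values taken from this document.

```python
def allowance_factor(b, feedback_multiplier=2.0):
    """Factor by which consumed allowance grows when feedback is scaled.

    Assumes rate ~ 1 / p**b, hence allowance ~ p * rate ~ p**(1 - b).
    """
    return feedback_multiplier ** (1.0 - b)


# Assumed response-function exponents (illustrative approximations).
EXPONENTS = {"New Reno": 0.5, "Cubic": 0.75, "Compound TCP": 0.8, "DCTCP": 1.0}

if __name__ == "__main__":
    for name, b in EXPONENTS.items():
        print(f"{name}: {allowance_factor(b):.2f}")
```

   Under these assumptions, doubling the feedback yields factors of
   roughly 1.41, 1.19, 1.15 and 1.00, in line with the rounded values
   given in the text.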

683	10.  References

685	10.1.  Normative References

687	   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
688	              Selective Acknowledgment Options", RFC 2018, October 1996.

690	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
691	              Requirement Levels", BCP 14, RFC 2119, March 1997.

693	   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
694	              of Explicit Congestion Notification (ECN) to IP", RFC
695	              3168, September 2001.

697	   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
698	              Control", RFC 5681, September 2009.

700	   [draft-ietf-conex-abstract-mech]
701	              Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
702	              Concepts and Abstract Mechanism", draft-ietf-conex-
703	              abstract-mech-06 (work in progress), October 2012.

705	   [draft-ietf-conex-destopt]
706	              Krishnan, S., Kuehlewind, M., and C. Ucendo, "IPv6
707	              Destination Option for ConEx", draft-ietf-conex-destopt-04
708	              (work in progress), March 2013.

710	10.2.  Informative References

712	   [DCTCP]    Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
713	              P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP:
714	              Efficient Packet Transport for the Commoditized Data
715	              Center", Jan 2010.

717	   [I-D.briscoe-conex-re-ecn-tcp]
718	              Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
719	              "Re-ECN: Adding Accountability for Causing Congestion to
720	              TCP/IP", draft-briscoe-conex-re-ecn-tcp-04 (work in
721	              progress), July 2014.

723	   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
724	              for TCP", RFC 3522, April 2003.

726	   [RFC3708]  Blanton, E. and M. Allman, "Using TCP Duplicate Selective
727	              Acknowledgement (DSACKs) and Stream Control Transmission
728	              Protocol (SCTP) Duplicate Transmission Sequence Numbers
729	              (TSNs) to Detect Spurious Retransmissions", RFC 3708,
730	              February 2004.

732	   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
733	              for TCP", RFC 4015, February 2005.

735	   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
736	              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
737	              Spurious Retransmission Timeouts with TCP", RFC 5682,
738	              September 2009.

740	   [RFC6789]  Briscoe, B., Woundy, R., and A. Cooper, "Congestion
741	              Exposure (ConEx) Concepts and Use Cases", RFC 6789,
742	              December 2012.

744	   [RFC7141]  Briscoe, B. and J. Manner, "Byte and Packet Congestion
745	              Notification", BCP 41, RFC 7141, February 2014.

747	   [draft-kuehlewind-tcpm-accurate-ecn]
748	              Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN
749	              Feedback in TCP", draft-kuehlewind-tcpm-accurate-ecn-02
750	              (work in progress), Jun 2013.

752	Appendix A.  Revision history

754	   RFC Editor: This section is to be removed before RFC publication.

756	   00 ... initial draft, early submission to meet deadline.

758	   01 ... refined draft, updated LEG "drain" from per-packet to RTT-
759	   based.

761	   02 ... added Section 5 and expanded discussion about ECN interaction.

763	   03 ... expanded the discussion around credit bits.

765	   04 ... review comments of Jana addressed.  (Change in full compliance
766	   mode.)

768	   05 ... changes on Loss Detection without SACK, support of classic ECN
769	   and credit handling.

771	   07 ... review feedback provided by Nandita

773	   08 ... based on Bob's feedback: Wording edits and structuring of a
774	   few paragraphs; change of SHOULD to MAY for resetting negative LEG/
775	   CEG; additional security considerations provided by Bob (thanks!).

777	Authors' Addresses

779	   Mirja Kuehlewind (editor)
780	   ETH Zurich
781	   Switzerland

783	   Email: mirja.kuehlewind@tik.ee.ethz.ch

785	   Richard Scheffenegger
786	   NetApp, Inc.
787	   Am Euro Platz 2
788	   Vienna  1120
789	   Austria

791	   Phone: +43 1 3676811 3146
792	   Email: rs@netapp.com