Congestion Exposure (ConEx)                           M. Kuehlewind, Ed.
Internet-Draft                                                ETH Zurich
Intended status: Experimental                           R. Scheffenegger
Expires: April 15, 2016                                     NetApp, Inc.
                                                        October 13, 2015

               TCP modifications for Congestion Exposure
                 draft-ietf-conex-tcp-modifications-10

Abstract

   Congestion Exposure (ConEx) is a mechanism by which senders inform
   the network about expected congestion based on congestion feedback
   from previous packets in the same flow.  This document describes the
   necessary modifications to use ConEx with the Transmission Control
   Protocol (TCP).

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 15, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Sender-side Modifications
   3.  Counting Congestion
     3.1.  Loss Detection
       3.1.1.  Without SACK Support
     3.2.  ECN
       3.2.1.  Accurate ECN Feedback
       3.2.2.  Classic ECN Support
   4.  Setting the ConEx Flags
     4.1.  Setting the E or the L Flag
     4.2.  Setting the Credit Flag
   5.  Loss of ConEx Information
   6.  Timeliness of the ConEx Signals
   7.  Open Areas for Experimentation
   8.  Acknowledgements
   9.  IANA Considerations
   10. Security Considerations
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Revision history
   Authors' Addresses

1.  Introduction

   Congestion Exposure (ConEx) is a mechanism by which senders inform
   the network about expected congestion based on congestion feedback
   from previous packets in the same flow.  ConEx concepts and use cases
   are further explained in [RFC6789].  The abstract ConEx mechanism is
   explained in [draft-ietf-conex-abstract-mech].  This document
   describes the necessary modifications to use ConEx with the
   Transmission Control Protocol (TCP).

   The markings for ConEx signaling are defined in the ConEx Destination
   Option (CDO) for IPv6 [draft-ietf-conex-destopt].  Specifically, the
   use of four flags is defined: X (ConEx-capable), L (loss
   experienced), E (ECN experienced) and C (credit).

   ConEx signaling is based on loss or Explicit Congestion Notification
   (ECN) marks [RFC3168] as congestion indications.  The sender collects
   this congestion information based on existing TCP feedback mechanisms
   from the receiver to the sender.  No changes are needed at the
   receiver to implement ConEx signaling.  Therefore no additional
   negotiation is needed to implement and use ConEx at the sender.  This
   document specifies the sender's actions that are needed to provide
   meaningful ConEx information to the network.

   Section 2 provides an overview of the modifications needed for TCP
   senders to implement ConEx.  First, congestion information has to be
   extracted from TCP's loss or ECN feedback as described in Section 3.
   Section 4 details how to set the CDO marking based on this congestion
   information.  Section 5 discusses loss of packets carrying ConEx
   information.  Section 6 discusses timeliness of the ConEx feedback
   signal, given congestion is a temporary state.

   This document describes congestion accounting for TCP with and
   without the Selective Acknowledgment (SACK) extension [RFC2018] (in
   Section 3.1).  However, ConEx benefits from the more accurate
   information that SACK provides about the number of bytes dropped in
   the network.  It is therefore preferable to use the SACK extension
   when using TCP with ConEx.  The detailed mechanism to set the L flag
   in response to a loss-based congestion feedback signal is given in
   Section 4.1.

   While loss has to be minimized, ECN can provide more fine-grained
   feedback information.  ConEx-based traffic measurement or management
   mechanisms could benefit from this.  Unfortunately, the current ECN
   feedback mechanism does not reflect multiple congestion markings if
   they occur within the same Round-Trip Time (RTT).  A more accurate
   feedback extension to ECN (AccECN) is proposed in a separate document
   [draft-kuehlewind-tcpm-accurate-ecn], as this is also useful for
   other mechanisms.

   Congestion accounting for both classic ECN feedback and AccECN
   feedback is explained in detail in Section 3.2.  Setting the E flag
   in response to ECN-based congestion feedback is again detailed in
   Section 4.1.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Sender-side Modifications

   This section gives an overview of actions that need to be taken by a
   TCP sender modified to use ConEx signaling.

   In the TCP handshake, a ConEx sender MUST negotiate for SACK and ECN,
   preferably with AccECN feedback.  Therefore a ConEx sender MUST also
   implement SACK and ECN.  Depending on the capability of the receiver,
   the following operation modes exist:

   o  SACK-accECN-ConEx (SACK and accurate ECN feedback)

   o  SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)

   o  accECN-ConEx (no SACK but accurate ECN feedback)

   o  ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN)

   o  SACK-ConEx (SACK but no ECN at all)

   o  Basic-ConEx (neither SACK nor ECN)

   A ConEx sender MUST expose all congestion information to the network
   according to the congestion information received by ECN or based on
   loss information provided by the TCP feedback loop.  A TCP sender
   SHOULD count congestion byte-wise (rather than packet-wise; see next
   paragraph).  After any congestion notification, a sender MUST mark
   subsequent packets with the appropriate ConEx flag in the IP header.
   Furthermore, a ConEx sender must send enough credit to cover all
   experienced congestion for the connection so far, as well as the risk
   of congestion for the current transmission (see Section 4.2).

   With SACK the number of lost payload bytes is known, but not the
   number of packets carrying these bytes.  With classic ECN only an
   indication is given that a marking occurred, but not the exact number
   of payload bytes or packets.  As network congestion is usually byte-
   congestion [RFC7141], the byte-size of a packet marked with a CDO
   flag is defined to represent that number of bytes of congestion
   signaling [draft-ietf-conex-destopt].  Therefore the exact number of
   bytes should be taken into account, if available, to make the ConEx
   signal as exact as possible.

   Detailed mechanisms for congestion counting in each operation mode
   are described in the next section.

3.  Counting Congestion

   A ConEx TCP sender maintains two counters: one that counts congestion
   based on the information retrieved by loss detection, and a second
   that accounts for ECN-based congestion feedback.  These counters hold
   the number of outstanding bytes that should be ConEx marked with the
   L flag or the E flag, respectively, in subsequent packets.

   The outstanding bytes for congestion indications based on loss are
   maintained in the loss exposure gauge (LEG), as explained in
   Section 3.1.

   The outstanding bytes counted based on ECN feedback information are
   maintained in the congestion exposure gauge (CEG), as explained in
   Section 3.2.

   When the sender sends a ConEx-capable packet with the E or L flag
   set, it reduces the respective counter by the byte-size of the
   packet.  This is explained for both counters in Section 4.1.

   Note that all bytes of an IP packet must be counted in the LEG or CEG
   to capture the right number of bytes that should be marked.
   Therefore the sender SHOULD take the payload and headers into
   account, up to and including the IP header.  However, in TCP the
   information regarding how large the headers of a lost or marked
   packet were is usually not available, as only payload data will be
   acknowledged.

   If equal-sized packets, or at least equally distributed packet sizes,
   can be assumed, the sender MAY only add and subtract TCP payload
   bytes.  In this case there should be about the same number of ConEx
   marked packets as the original packets that were causing the
   congestion.  Thus both contain about the same number of header bytes,
   so they will cancel out.  This case is assumed for simplicity in the
   following sections.

   Otherwise, if a sender sends different sized packets (with unequally
   distributed packet sizes), the sender needs to memorize or estimate
   the number of lost or ECN-marked packets.  If the sender has
   sufficient memory available, the most accurate way to reconstruct the
   number of lost or marked packets is to remember the sequence number
   of all sent but not acknowledged packets.  In this case a sender is
   able to reconstruct the number of packets and thus the header bytes
   that were sent during the last RTT.  Otherwise, if e.g. not enough
   memory is available, the sender should estimate the packet size, e.g.
   if the packet size distribution follows a certain known pattern, or
   by using the minimum packet size seen in the last RTT.

   If the number of newly sent-out packets with the ConEx L or E flag
   set is smaller (or larger) than this estimated number of lost/ECN-
   marked packets, the additional header bytes should be added to (or
   can be subtracted from) the respective gauge.

3.1.  Loss Detection

   This section applies whether or not SACK support is available.  The
   following subsection (Section 3.1.1) handles the case when SACK is
   not available.

   A TCP sender detects losses and subsequently retransmits the lost
   data.  Therefore, a ConEx sender can simply set the ConEx L flag on
   all retransmissions in order to at least cover the number of bytes
   lost.  If this approach is taken, no LEG is needed.

   However, any retransmission may be spurious.  In this case more bytes
   have been marked than necessary.  To compensate for this effect, a
   ConEx sender can maintain a local signed counter, the loss exposure
   gauge (LEG), that indicates the number of outstanding bytes to be
   sent with the ConEx L flag and that can also become negative.

   Using the LEG, when a TCP sender decides that a data segment needs to
   be retransmitted, it will increase LEG by the size of the TCP payload
   bytes in the retransmission (assuming equal sized segments such that
   the retransmitted packet will have the same number of header bytes as
   the original ones):

      For each retransmission:

         LEG += payload

   Note that how the LEG is reduced when the ConEx L markings are set is
   described in Section 4.

   Further, to accommodate spurious retransmissions, a ConEx sender
   SHOULD make use of heuristics to detect such spurious retransmissions
   (e.g. F-RTO [RFC5682], DSACK [RFC3708], and Eifel [RFC3522],
   [RFC4015]) if already available in a given implementation.  If no
   mechanism for detecting spurious retransmissions is available, the
   ConEx sender MAY choose to implement one of the mechanisms stated
   above.  However, given the inaccuracy that ConEx may have anyway and
   the timeliness of ConEx information, a ConEx sender MAY also choose
   not to compensate for spurious retransmissions.  In this case, if
   spurious retransmissions occur, the ConEx sender simply has sent too
   many ConEx signals, which e.g. would decrease the congestion
   allowance in a ConEx policer unnecessarily.

   If a heuristic method is used to detect spurious retransmissions and
   has determined that a certain number of packets were retransmitted
   erroneously, the ConEx sender subtracts the payload size of these TCP
   packets from LEG.

      If a spurious retransmission is detected:

         LEG -= payload

   Note that LEG can become negative if too many L markings have already
   been sent.  This case is further discussed in Section 6.

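   The LEG bookkeeping of this section can be sketched as follows.  This
   is a minimal, non-normative illustration in Python that assumes
   equal-sized segments, so only TCP payload bytes are counted.  The
   class and method names (LossExposureGauge, on_retransmission,
   on_spurious_retransmission) are hypothetical and are not defined by
   this document.

```python
class LossExposureGauge:
    """Signed counter of bytes still to be sent with the ConEx L flag.

    Hypothetical helper for illustration; payload sizes are in bytes.
    """

    def __init__(self):
        # LEG is signed: it may drop below zero once spurious
        # retransmissions are compensated for.
        self.leg = 0

    def on_retransmission(self, payload):
        # For each retransmission: LEG += payload
        self.leg += payload

    def on_spurious_retransmission(self, payload):
        # If a spurious retransmission is detected: LEG -= payload
        self.leg -= payload


# Two segments are retransmitted; one is later found to be spurious.
gauge = LossExposureGauge()
gauge.on_retransmission(1460)
gauge.on_retransmission(1460)
gauge.on_spurious_retransmission(1460)
print(gauge.leg)
```

   In this sketch the gauge ends up holding the payload of the one
   genuine loss.  How the gauge is drained when L-marked packets are
   sent is covered in Section 4.1.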
3.1.1.  Without SACK Support

   If multiple losses occur within one RTT and SACK is not used, it may
   take several RTTs until all lost data is retransmitted.  With the
   scheme described above, the ConEx information will be delayed
   considerably, but timeliness is important for ConEx.  For ConEx, it
   is important to know how much data was lost; it is not important to
   know which data was lost.  During the first RTT after the initial
   loss detection, the amount of received data and thus also the amount
   of lost data can be estimated based on the number of received ACKs.

   Therefore a ConEx sender can use the following algorithm to estimate
   the number of lost bytes with an additional delay of one RTT using an
   additional Loss Estimation Counter (LEC):

      flight_bytes:     current flight size in bytes
      retransmit_bytes: payload size of the retransmission

      At the first retransmission in a congestion event LEC is set:

         LEC = flight_bytes - 3*SMSS

         (At this point of time in the transmission, in the worst case,
         all packets in flight minus three that triggered the dupACKs
         could have been lost.)

      Then during the first RTT of the congestion event:

         For each retransmission:
            LEG += retransmit_bytes
            LEC -= retransmit_bytes

         For each ACK:
            LEC -= SMSS

      After one RTT:

         LEG += LEC

         (The LEC now estimates the number of outstanding bytes
         that should be ConEx L marked.)

      After the first RTT, for each following retransmission:

         if (LEC > 0):       LEC -= retransmit_bytes
         else if (LEC==0):   LEG += retransmit_bytes

         if (LEC < 0):       LEG += -LEC

         (The LEG is not increased for those bytes that were
         already counted.)

3.2.  ECN

   ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to
   mark packets with the Congestion Experienced (CE) mark instead of
   dropping them when congestion occurs.

   A receiver might support 'classic' ECN, the more accurate ECN
   feedback scheme (AccECN), or neither.  In the case that ECN is not
   supported for a connection, of course, no ECN marks will occur; thus
   the sender will never set the E flag.  Otherwise, a ConEx sender
   needs to maintain a signed counter, the congestion exposure gauge
   (CEG), for the number of outstanding bytes that have to be ConEx
   marked with the E flag.

   The CEG is increased when ECN information is received from an ECN-
   capable receiver supporting the 'classic' ECN scheme or the accurate
   ECN feedback scheme.  When the ConEx sender receives an ACK
   indicating one or more segments were received with a CE mark, CEG is
   increased by the appropriate number of bytes as described further
   below.

   Unfortunately, in case of duplicate acknowledgements the number of
   newly acknowledged bytes will be zero even though (CE marked) data
   has been received.  Therefore, we increase the CEG by DeliveredData,
   as defined below:

      DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS -
                      (is_after_dup)*num_dup*1SMSS

   DeliveredData covers the number of bytes that have been newly
   delivered to the receiver.  Therefore on each arrival of an ACK,
   DeliveredData will be increased by the newly acknowledged bytes
   (acked_bytes) as indicated by the current ACK, relative to all past
   ACKs.  The formula depends on whether SACK is available: if SACK is
   not available, SACK_diff is always zero, whereas if SACK information
   is available, is_dup and is_after_dup are always zero.

   With SACK, DeliveredData is increased by the number of bytes provided
   by (new) SACK information (SACK_diff).  Note that if fewer
   unacknowledged bytes are announced in the new SACK information than
   in the previous ACK, SACK_diff can be negative.  In this case, data
   is newly acknowledged (in acked_bytes) that has previously already
   been accumulated into DeliveredData based on SACK information.

   Otherwise, without SACK, DeliveredData is increased by 1 SMSS on
   duplicate acknowledgements because duplicate acknowledgements do not
   acknowledge any new data (and acked_bytes will be zero).  For the
   subsequent partial or full ACK, acked_bytes covers all newly
   acknowledged bytes including those already accounted for with the
   receipt of any duplicate acknowledgement.  Therefore DeliveredData is
   reduced by one SMSS for each preceding duplicate ACK.  Consequently,
   is_dup is one if the current ACK is a duplicate ACK without SACK,
   and zero otherwise.  is_after_dup is only one for the next full or
   partial ACK after a number of duplicate ACKs without SACK, and
   num_dup counts the number of duplicate ACKs in a row (which usually
   is 3 or more).

   With classic ECN, one congestion marked packet causes continuous
   congestion feedback for a whole round trip, thus hiding the arrival
   of any further congestion marked packets during that round trip.  A
   more accurate ECN feedback scheme (AccECN) is needed to ensure that
   feedback properly reflects the extent of congestion marking.  The two
   cases, with and without a receiver capable of AccECN, are discussed
   in the following sections.

3.2.1.  Accurate ECN Feedback

   With a more accurate ECN feedback scheme (AccECN) that is supported
   by the receiver, either the number of marked packets or the number of
   marked bytes will be fed back from the receiver to the sender and is
   therefore known at the sender side.  In the latter case, the CEG can
   directly be increased by the number of marked bytes.  Otherwise, if D
   is assumed to be the number of marks, the gauge (CEG) will be
   conservatively increased by one SMSS for each marking or at most the
   number of newly acknowledged bytes:

      CEG += min(SMSS*D, DeliveredData)

3.2.2.  Classic ECN Support

   With classic ECN, as soon as a CE mark is seen at the receiver, it
   will feed this information back to the sender by setting the Echo
   Congestion Experienced (ECE) flag in the TCP header of subsequent
   ACKs.  Once the sender receives the first ECE of a congestion
   notification, it sets the CWR flag in the TCP header once.  When this
   packet with Congestion Window Reduced (CWR) flag in the TCP header
   arrives at the receiver, acknowledging its first ECE feedback, the
   receiver stops setting ECE.

   If the ConEx sender fully conforms to the semantics of ECN signaling
   as defined by [RFC3168], it will receive one full RTT of ACKs with
   the ECE flag set whenever at least one CE mark was received by the
   receiver.  As the sender cannot estimate how many packets have
   actually been CE marked during this RTT, the most conservative
   assumption MAY be taken, namely assuming that all packets were
   marked.  This can be achieved by increasing the CEG by DeliveredData
   for each ACK with the ECE flag:

      CEG += DeliveredData

   Optionally a ConEx sender could implement the following technique
   (that does not conform to [RFC3168]), called advanced compatibility
   mode, to considerably improve its estimate of the number of ECN-
   marked packets:

   To extract more than one ECE indication per RTT, a ConEx sender could
   set the CWR flag continuously to force the receiver to signal only
   one ECE per CE mark.  Unfortunately, the use of delayed ACKs
   [RFC5681] (which is common) will prevent feedback of every CE mark;
   if a CWR confirmation is received before the ECE can be sent out on
   the next ACK, ECN feedback information could get lost (depending on
   the actual receiver implementation).  Thus a sender SHOULD set CWR
   only on those data segments that will presumably trigger a (delayed)
   ACK.  The sender would need an additional control loop to estimate
   which data segments will trigger an ACK in order to extract more
   timely congestion notifications.  Still, the CEG SHOULD be increased
   by DeliveredData, as one or more CE marked packets could be
   acknowledged by one delayed ACK.

4.  Setting the ConEx Flags

   By setting the X flag, a packet is marked as ConEx-capable.  All
   packets carrying payload MUST be marked with the X flag set,
   including retransmissions.  Only if no congestion feedback
   information is (currently) available, the X flag SHOULD be zero (e.g.
   for control packets on a connection that has not sent any user data
   for some time and therefore is sending only pure ACKs that are not
   carrying any payload).

4.1.  Setting the E or the L Flag

   As described in Section 3, the sender needs to maintain a CEG counter
   and might maintain a LEG counter.  If no LEG is used, all
   retransmissions will be marked with the L flag.

   Further, as long as the LEG or CEG counter is positive, the sender
   marks each ConEx-capable packet with L or E respectively, and
   decreases the LEG or CEG counter by the TCP payload bytes carried in
   the marked packet (assuming headers are not being counted because
   packet sizes are regular).  No matter how small the value of LEG or
   CEG, if the value is positive the sender MUST NOT defer packet
   marking; this ensures ConEx signals are timely.  Therefore the value
   of LEG and CEG will commonly be negative.

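   The CEG updates of Section 3.2 and the marking rule above can be
   sketched together as follows.  This is a non-normative Python
   illustration that assumes regular packet sizes (payload bytes only)
   and takes DeliveredData as an input computed elsewhere.  The names
   ConExGauges, on_classic_ecn_ack, on_accecn_ack and mark_packet are
   hypothetical, and flags are modelled as a set of letters rather than
   as bits of the CDO.

```python
class ConExGauges:
    """Hypothetical holder for the LEG and CEG counters (in bytes)."""

    def __init__(self, smss):
        self.smss = smss  # sender maximum segment size in bytes
        self.leg = 0      # loss exposure gauge (signed)
        self.ceg = 0      # congestion exposure gauge (signed)

    def on_classic_ecn_ack(self, delivered_data, ece):
        # Classic ECN (Section 3.2.2): for each ACK with ECE set,
        # CEG += DeliveredData
        if ece:
            self.ceg += delivered_data

    def on_accecn_ack(self, delivered_data, marks):
        # AccECN with a mark count D (Section 3.2.1):
        # CEG += min(SMSS*D, DeliveredData)
        self.ceg += min(self.smss * marks, delivered_data)

    def mark_packet(self, payload):
        # Every packet carrying payload is ConEx-capable (X flag).
        flags = {"X"}
        # A positive gauge, however small, triggers marking at once;
        # the gauge is then reduced by the full payload and may
        # therefore become negative.
        if self.leg > 0:
            flags.add("L")
            self.leg -= payload
        if self.ceg > 0:
            flags.add("E")
            self.ceg -= payload
        return flags


gauges = ConExGauges(smss=1460)
gauges.on_classic_ecn_ack(delivered_data=100, ece=True)
first = gauges.mark_packet(1460)   # CEG was 100, so E is set
second = gauges.mark_packet(1460)  # CEG is now negative, no E
```

   The second packet is not E-marked because the first one already
   signalled more bytes than were outstanding, which is why the gauges
   are commonly negative.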
   If both LEG and CEG are positive, the sender MUST mark each ConEx-
   capable packet with both L and E.  If a credit signal is also pending
   (see next section), the C flag can be set as well.

4.2.  Setting the Credit Flag

   The ConEx abstract mechanism [draft-ietf-conex-abstract-mech]
   requires that sufficient credit MUST be signaled in advance to cover
   the expected congestion during the feedback delay of one RTT.

   To monitor the credit state at the audit, a ConEx sender needs to
   maintain a Credit State Counter (CSC) in bytes.  If congestion
   occurs, credits will be consumed and the CSC is reduced by the number
   of bytes that were lost or estimated to be ECN-marked.  If the risk
   of congestion was estimated wrongly and thus too few credits were
   sent, the CSC becomes zero but cannot go negative.

   To be sure that the credit state in the audit never reaches zero, the
   number of credits should always equal the number of bytes in flight,
   as all packets could potentially get lost or congestion marked.  In
   this case a ConEx sender also monitors the number of bytes in flight
   F.  If F ever becomes larger than CSC, the ConEx sender sets the C
   flag on each ConEx-capable packet and increases CSC by the payload
   size of each marked packet until CSC is no less than F again.
   However, a ConEx sender might also be less conservative and send
   fewer credits, if it e.g. assumes based on previous experience that
   the congestion will be low on a certain path.

   Recall that CSC will be decreased whenever congestion occurs;
   therefore CSC will need to be replenished as soon as CSC drops below
   F.  Also recall that the sender can set the C flag on a ConEx-capable
   packet whether or not the E or L flags are also set.

   In TCP Slow Start, the congestion window might grow much larger than
   during the rest of the transmission.
   A sender could therefore consider sending fewer than F credits, at
   the risk of being penalized by an audit function.  However, the
   credits should at least cover the increase in
   sending rate.  Given the exponential increase as implemented in the
   TCP Slow Start algorithm, which means that the sending rate doubles
   every RTT, a ConEx sender should at least cover half the number of
   packets in flight by credits.

   Note that the number of losses or markings within one RTT does not
   solely depend on the sender's actions.  In general, the behavior of
   the cross traffic, whether Active Queue Management (AQM) is used and
   how it is parameterized influence how many packets might be dropped
   or marked.  As long as any AQM encountered is not overly aggressive
   with ECN marking, sending half the flight size as credits should be
   sufficient whether congestion is signaled by loss or ECN.

   To maintain half of the packets in flight as credits, half of the
   packets of the initial window must also be C marked.  In Slow Start,
   marking every fourth packet introduces the correct amount of credit,
   as can be seen in Figure 1.

                                    in_flight  credits
            RTT1  |------XC------>|     1         1
                  |------X------->|     2         1
                  |------XC------>|     3         2
                  |               |
            RTT2  |------X------->|     3         2
                  |------X------->|     4         2
                  |------X------->|     4         2
                  |------XC------>|     5         3
                  |------X------->|     5         3
                  |------X------->|     6         3
                  |               |
            RTT3  |------X------->|     6         3
                  |------XC------>|     7         4
                  |------X------->|     7         4
                  |------X------->|     8         4
                  |------X------->|     8         4
                  |------XC------>|     9         5
                  |------X------->|     9         5
                  |------X------->|    10         5
                  |------X------->|    10         5
                  |------XC------>|    11         6
                  |------X------->|    11         6
                  |------X------->|    12         6
                  |       .       |
                  |       :       |

       Figure 1: Credits in Slow Start (with an initial window of 3)

   It is possible that a TCP flow will encounter an audit function
   without relevant flow state, due to e.g. rerouting or memory
   limitations.  Therefore, the sender needs to detect this case and
   resend credits.  A ConEx sender might reset the credit counter CSC to
   zero if losses occur in subsequent RTTs (assuming that the sending
   rate was correctly reduced based on the received congestion signal
   and using a conservatively large RTT estimation).

   This section proposes a concrete algorithm for determining how much
   credit to signal (with a separate approach used for Slow Start).
   However, experimentation in credit setting algorithms is expected and
   encouraged.  The wider goal of ConEx is to reflect the 'cost' of the
   risk of causing congestion on those that contribute most to it.
   Thus, experimentation is encouraged to improve or maintain
   performance while reducing the risk of causing congestion, and
   therefore potentially reducing the need to signal so much credit.

5.  Loss of ConEx Information

   Packets carrying ConEx signals could be discarded themselves.  This
   will be a second order problem (e.g. if the loss probability is 0.1%,
   the probability of losing a ConEx L signal will be 0.1% of 0.1% =
   0.0001%).  Further, the penalty an audit induces should be
   proportional to the mismatch of expected ConEx marks and observed
   congestion; therefore the audit might only slightly increase the loss
   level of this flow.  Therefore, an implementer MAY choose to ignore
   this problem, accepting instead the risk that an audit function might
   wrongly penalize a flow.

   Nonetheless, a ConEx sender is responsible for always signalling
   sufficient congestion feedback and therefore SHOULD remember which
   packet was marked with either the L, the E or the C flag.  If one of
   these packets is detected as lost, the sender SHOULD increase the
   respective gauge(s), LEG or CEG, by the number of lost payload bytes
   in addition to increasing LEG for the loss.

6.  Timeliness of the ConEx Signals

   ConEx signals will only be useful to a network node within a time
   delay of about one RTT after the congestion occurred.  To avoid
   further delays, a ConEx sender SHOULD send the ConEx signaling on the
   next available packet.

   Any or all of the ConEx flags can be used in the same packet, which
   allows delay to be minimized when multiple signals are pending.  The
   need to set multiple ConEx flags at the same time can occur if e.g.
   an ACK is received by the sender that simultaneously indicates that
   at least one ECN mark was received, and that one or more segments
   were lost.  This may happen during excessive congestion, if the
   queues overflow even though ECN was used and currently all forwarded
   packets are marked, while others have to be dropped.  Another case
   when this might happen is when ACKs are lost, so that a subsequent
   ACK carries summary information not previously available to the
   sender.

   If a flow becomes application-limited, there could be insufficient
   bytes to send to reduce the gauges to zero or below.  In such cases,
   the sender cannot help but delay ConEx signals.  Nonetheless, as long
   as the sender is marking all outgoing packets, an audit function is
   unlikely to penalize ConEx-marked packets.  Therefore, no matter how
   long a gauge has been positive, a sender MUST NOT reduce the gauge by
   more than the ConEx marked bytes it has sent.

   If the CEG or LEG counter is negative, the respective counter MAY be
   reset to zero within one RTT after it was decreased the last time or
   one RTT after recovery if no further congestion occurred.

7.  Open Areas for Experimentation

   All proposed mechanisms in this document are experimental, and
   therefore further large-scale experimentation in the Internet is
   required to evaluate if the signaling provided by these mechanisms is
   accurate and timely enough to produce value for ConEx-based (traffic
   management or other) mechanisms.

   The current ConEx specifications assume that congestion is counted in
   number of bytes (including the IP header that directly encapsulates
   the CDO and everything that IP header encapsulates)
   [draft-ietf-conex-destopt].  This decision was taken because most
   network devices today experience byte-congestion where the memory is
   filled exactly with the number of bytes a packet carries [RFC7141].
   However, there are also devices that may allocate a certain amount of
   memory per packet, no matter how large a packet is.  These devices
   get congested based on the number of packets in their memory and
   therefore in this case congestion is determined by the number of
   packets that have been lost or marked.  Furthermore, a transport
   layer endpoint, such as a TCP sender or receiver, might not know the
   exact number of bytes that a lower layer was carrying.  Therefore a
   TCP endpoint may only be able to estimate the exact number of
   congested bytes (assuming that all lower layer headers have the same
   length).  Whether this estimation is sufficient to work with needs to
   be further evaluated in tests in the Internet, together with
   different auditor implementations.

   Further, the proposed marking schemes in this document are designed
   under the assumption that all TCP packets of a ConEx-capable flow are
   of equal size or that flows have a constant mean packet size over a
   rather small time frame, like one RTT or less.  In most
   implementations this assumption might be taken as well and probably
   is true for most of the traffic flows.
If this proposed scheme is used, it is necessary to evaluate how much
accuracy degrades if this precondition is not met. Evaluating with
real traffic from different applications is especially important in
making the decision regarding whether the proposed schemes are
sufficient or whether a more complex scheme is needed.

In this context, the proposed scheme to set credit markings in Slow
Start runs the risk of providing an insufficient number of markings,
which can cause an audit function to penalize this flow. Both the
proposed credit scheme for Slow Start and the scheme in Congestion
Avoidance must be evaluated together with one or more specific
implementations of a ConEx auditor to ensure that both algorithms,
in the sender and in the auditor, work properly together with a low
risk of false positives (which would lead to penalization of an
honest sender). However, if a sender is wrongly assumed to cheat,
the penalization by the audit should be adequate and should allow an
honest sender using a congestion control scheme that is commonly used
today to recover quickly.

Another open issue is the accuracy of the ECN feedback signal. At
the time of publication of this document, there is no AccECN mechanism
specified yet, and furthermore AccECN will take some time to be
widely deployed. This document proposes an advanced compatibility
mode for Classic ECN. The proposed mechanism can provide more
accurate feedback by utilizing the way Classic ECN is specified but
has a higher risk of losing information. To figure out how high this
risk is in a real deployment scenario, further experimental
evaluation is needed.
The following argument is intended to show that suppressing
repetitions of ECE is nonetheless safe against possible congestion
collapse due to lost congestion feedback; it should be further
verified in experimentation:

Repetition of ECE in Classic ECN is intended to ensure reliable
delivery of congestion feedback. However, with advanced
compatibility mode, it is possible to miss congestion notifications.
This can happen in some implementations if delayed acknowledgements
are used. Further, an ACK containing ECE can simply get lost. If
only a few CE marks are received within one congestion event (e.g.,
only one), the loss of one acknowledgement due to (heavy) congestion
on the reverse path can prevent any congestion notification from
being received by the sender.

However, if loss of feedback exacerbates congestion on the forward
path, more forward packets will be CE marked, increasing the
likelihood that feedback from at least one CE will get through per
RTT. As long as one ECE reaches the sender per RTT, the sender's
congestion response will be the same as if CWR were not continuous.
The only way that heavy congestion on the forward path could be
completely hidden would be if all ACKs on the reverse path were lost.
If total ACK loss persisted, the sender would time out and do a
congestion response anyway. Therefore, the problem seems confined to
potential suppression of a congestion response during light
congestion.

Furthermore, even if loss of all ECN feedback leads to no congestion
response, the worst that could happen would be loss instead of
ECN-signaled congestion on the forward path. Given that compatibility
mode does not affect loss feedback, there would be no risk of
congestion collapse.

8.  Acknowledgements

The authors would like to thank Bob Briscoe, who contributed initial
ideas [I-D.briscoe-conex-re-ecn-tcp] and valuable feedback.
Moreover, thanks to Jana Iyengar, who also provided valuable feedback.

9.  IANA Considerations

This document does not have any requests to IANA.

10.  Security Considerations

General ConEx security considerations are covered extensively in the
ConEx abstract mechanism [draft-ietf-conex-abstract-mech]. This
section covers TCP-specific concerns that may occur with the addition
of ConEx to TCP (while not discussing general well-known attacks
against TCP). It is assumed that any altering of ConEx information
can be detected by protection mechanisms in the IP layer and is
therefore not discussed here but in [draft-ietf-conex-destopt].
Further, [draft-ietf-conex-destopt] describes how to use ConEx to
mitigate flooding attacks by using preferential drop where the use of
ConEx can even increase security.

The ConEx modifications to TCP provide no mechanism for a receiver to
force a sender not to use ConEx. A receiver can degrade the accuracy
of ConEx by claiming that it does not support SACK, AccECN, or ECN,
but the sender will never have to turn ConEx off. Further, the
receiver cannot force the sender to mark ConEx more conservatively
in order to cover the risk of any inaccuracy. Instead, it is always
the sender's choice to either mark very conservatively, which ensures
that the audit always sees enough markings to not penalize the flow,
or estimate the needed number of markings more tightly. This second
case leads to inaccurate marking and therefore increases the
likelihood of loss at an audit function, which will only harm the
receiver itself.
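As a rough illustration of the allowance-draining attack considered in
the remainder of this section, the following minimal sketch computes
how much more congestion allowance a sender would consume if a
receiver inflated the congestion feedback. It assumes, purely for
illustration, that the sender's steady-state rate follows a power law
rate ~ p^(-b) in the signaled congestion level p. The function names
and the exponent values are hypothetical and are not taken from this
document; the exponents were chosen only because they reproduce the
factors quoted in this section.

```python
# Sketch under stated assumptions: steady-state sending rate is modeled
# as rate ~ p**(-b), where p is the signaled congestion level and b is
# a per-algorithm response exponent (illustrative values, not normative).
RESPONSE_EXPONENT = {
    "New Reno": 0.5,
    "Cubic": 0.75,
    "Compound TCP": 0.8,
    "linear (DCTCP-like)": 1.0,
}


def allowance_factor(b: float, feedback_factor: float = 2.0) -> float:
    """Relative change in allowance consumed per unit time.

    Allowance is drawn down at rate * p ~ p**(1 - b), so multiplying p
    by feedback_factor multiplies consumption by feedback_factor**(1 - b).
    """
    return feedback_factor ** (1.0 - b)


def rate_factor(b: float, feedback_factor: float = 2.0) -> float:
    """Relative change in sending rate under the same power-law model."""
    return feedback_factor ** (-b)


if __name__ == "__main__":
    for name, b in RESPONSE_EXPONENT.items():
        print(f"{name:20s} allowance x{allowance_factor(b):.2f}"
              f"  rate x{rate_factor(b):.2f}")
```

Under this model, a receiver that doubles the feedback always lowers
its own throughput, while the extra allowance it extracts shrinks as
the response exponent approaches 1.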

Assuming the sender is limited in some way by a congestion allowance
or quota, a receiver could spoof more loss or ECN congestion feedback
than it actually experiences, in an attempt to make the sender draw
down its allowance faster than necessary. However, over-declaring
congestion simply makes the sender slow down. If the receiver is
interested in the content, it will not want to harm its own
performance.

However, if the receiver is solely interested in making the sender
draw down its allowance, the net effect will depend on the sender's
congestion control algorithm, as permanently adding more and more
congestion would cause the sender to reduce its sending rate more and
more. Therefore, a receiver can only maintain a certain congestion
level that corresponds to a certain sending rate. With New Reno
[RFC5681], doubling congestion feedback causes the sender to reduce
its sending rate such that it would only consume sqrt(2), i.e., about
1.4, times more congestion allowance. However, to improve scaling,
congestion control algorithms are tending towards less responsive
algorithms like Cubic or Compound TCP, and ultimately to linear
algorithms like DCTCP [DCTCP] that aim to maintain the same
congestion level independent of the current sending rate and always
reduce their sending window if the signaled congestion feedback is
higher. In each case, if the receiver doubles congestion feedback,
it causes the sender to consume more allowance by a factor of 1.2,
1.15, or 1, respectively, where 1 implies the attack has become
completely ineffective, as no further congestion allowance is consumed
but the flow will decrease its sending rate to a minimum instead.

11.  References

11.1.  Normative References

[draft-ietf-conex-abstract-mech]
   Mathis, M. and B.
   Briscoe, "Congestion Exposure (ConEx) Concepts and Abstract
   Mechanism", draft-ietf-conex-abstract-mech-06 (work in progress),
   October 2012.

[draft-ietf-conex-destopt]
   Krishnan, S., Kuehlewind, M., and C. Ucendo, "IPv6 Destination
   Option for ConEx", draft-ietf-conex-destopt-04 (work in progress),
   March 2013.

[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
   Selective Acknowledgment Options", RFC 2018,
   DOI 10.17487/RFC2018, October 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
   Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119,
   March 1997.

[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of
   Explicit Congestion Notification (ECN) to IP", RFC 3168,
   DOI 10.17487/RFC3168, September 2001.

[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
   Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

11.2.  Informative References

[DCTCP] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, P.,
   Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP: Efficient
   Packet Transport for the Commoditized Data Center", Jan 2010.

[draft-kuehlewind-tcpm-accurate-ecn]
   Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN Feedback
   in TCP", draft-kuehlewind-tcpm-accurate-ecn-02 (work in progress),
   Jun 2013.

[I-D.briscoe-conex-re-ecn-tcp]
   Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith, "Re-ECN:
   Adding Accountability for Causing Congestion to TCP/IP",
   draft-briscoe-conex-re-ecn-tcp-04 (work in progress), July 2014.

[RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm for
   TCP", RFC 3522, DOI 10.17487/RFC3522, April 2003.

[RFC3708] Blanton, E. and M.
   Allman, "Using TCP Duplicate Selective Acknowledgement (DSACKs)
   and Stream Control Transmission Protocol (SCTP) Duplicate
   Transmission Sequence Numbers (TSNs) to Detect Spurious
   Retransmissions", RFC 3708, DOI 10.17487/RFC3708, February 2004.

[RFC4015] Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm for
   TCP", RFC 4015, DOI 10.17487/RFC4015, February 2005.

[RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, "Forward
   RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious
   Retransmission Timeouts with TCP", RFC 5682,
   DOI 10.17487/RFC5682, September 2009.

[RFC6789] Briscoe, B., Ed., Woundy, R., Ed., and A. Cooper, Ed.,
   "Congestion Exposure (ConEx) Concepts and Use Cases", RFC 6789,
   DOI 10.17487/RFC6789, December 2012.

[RFC7141] Briscoe, B. and J. Manner, "Byte and Packet Congestion
   Notification", BCP 41, RFC 7141, DOI 10.17487/RFC7141,
   February 2014.

Appendix A.  Revision history

RFC Editor: This section is to be removed before RFC publication.

00 ... initial draft, early submission to meet deadline.

01 ... refined draft, updated LEG "drain" from per-packet to RTT-based.

02 ... added Section 5 and expanded discussion about ECN interaction.

03 ... expanded the discussion around credit bits.

04 ... review comments of Jana addressed. (Change in full compliance
mode.)

05 ... changes on Loss Detection without SACK, support of classic ECN
and credit handling.

07 ... review feedback provided by Nandita

08 ... based on Bob's feedback: Wording edits and structuring of a
few paragraphs; change of SHOULD to MAY for resetting negative
LEG/CEG; additional security considerations provided by Bob (thanks!).

09 ... experimentation section added

10 ...
final review comments based on IETF last call

Authors' Addresses

Mirja Kuehlewind (editor)
ETH Zurich
Switzerland

Email: mirja.kuehlewind@tik.ee.ethz.ch

Richard Scheffenegger
NetApp, Inc.
Am Euro Platz 2
Vienna 1120
Austria

Email: rs.ietf@gmx.at