Congestion Exposure (ConEx)                           M. Kuehlewind, Ed.
Internet-Draft                                                ETH Zurich
Intended status: Experimental                           R. Scheffenegger
Expires: October 24, 2015                                   NetApp, Inc.
                                                          April 22, 2015


               TCP modifications for Congestion Exposure
                 draft-ietf-conex-tcp-modifications-08

Abstract

   Congestion Exposure (ConEx) is a mechanism by which senders inform
   the network about expected congestion based on congestion feedback
   from previous packets in the same flow. This document describes the
   necessary modifications to use ConEx with the Transmission Control
   Protocol (TCP).

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 24, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Sender-side Modifications
   3.  Counting congestion
     3.1.  Loss Detection
       3.1.1.  Without SACK Support
     3.2.  ECN
       3.2.1.  Accurate ECN feedback
       3.2.2.  Classic ECN support
   4.  Setting the ConEx Flags
     4.1.  Setting the E or the L Flag
     4.2.  Setting the Credit Flag
   5.  Loss of ConEx information
   6.  Timeliness of the ConEx Signals
   7.  Acknowledgements
   8.  IANA Considerations
   9.  Security Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  Revision history
   Authors' Addresses

1.  Introduction

   Congestion Exposure (ConEx) is a mechanism by which senders inform
   the network about expected congestion based on congestion feedback
   from previous packets in the same flow. ConEx concepts and use cases
   are further explained in [RFC6789]. The abstract ConEx mechanism is
   explained in [draft-ietf-conex-abstract-mech]. This document
   describes the necessary modifications to use ConEx with the
   Transmission Control Protocol (TCP).

   The markings for ConEx signaling are defined in the ConEx Destination
   Option (CDO) for IPv6 [draft-ietf-conex-destopt]. Specifically, the
   use of four flags is defined: X (ConEx-capable), L (loss
   experienced), E (ECN experienced) and C (credit).

   ConEx signaling is based on loss or Explicit Congestion Notification
   (ECN) marks [RFC3168] as congestion indications. The sender collects
   this congestion information based on existing TCP feedback mechanisms
   from the receiver to the sender.
   No changes are needed at the receiver to implement ConEx signaling.
   Therefore no additional negotiation is needed to implement and use
   ConEx at the sender. This document specifies the sender's actions
   that are needed to provide meaningful ConEx information to the
   network.

   Section 2 provides an overview of the modifications needed for TCP
   senders to implement ConEx. First, congestion information has to be
   extracted from TCP's loss or ECN feedback as described in Section 3.
   Section 4 details how to set the CDO marking based on this congestion
   information. Section 5 discusses loss of packets carrying ConEx
   information. Section 6 discusses timeliness of the ConEx feedback
   signal, given congestion is a temporary state.

   This document describes congestion accounting for TCP with and
   without the Selective Acknowledgment (SACK) extension [RFC2018] (in
   Section 3.1). However, ConEx benefits from the more accurate
   information that SACK provides about the number of bytes dropped in
   the network. It is therefore preferable to use the SACK extension
   when using TCP with ConEx. The detailed mechanism to set the L flag
   in response to a loss-based congestion feedback signal is given in
   Section 4.1.

   Whereas loss has to be minimized, ECN can provide more fine-grained
   feedback information. ConEx-based traffic measurement or management
   mechanisms could benefit from this. Unfortunately, the current ECN
   feedback mechanism does not reflect multiple congestion markings if
   they occur within the same Round-Trip Time (RTT). A more accurate
   feedback extension to ECN (AccECN) is proposed in a separate document
   [draft-kuehlewind-tcpm-accurate-ecn], as this is also useful for
   other mechanisms.

   Congestion accounting for both classic ECN feedback and AccECN
   feedback is explained in detail in Section 3.2.
   Setting the E flag in response to ECN-based congestion feedback is
   again detailed in Section 4.1.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Sender-side Modifications

   This section gives an overview of actions that need to be taken by a
   TCP sender modified to use ConEx signaling.

   In the TCP handshake, a ConEx sender MUST negotiate for SACK and ECN,
   preferably with AccECN feedback. Therefore a ConEx sender MUST also
   implement SACK and ECN. Depending on the capability of the receiver,
   the following operation modes exist:

   o  SACK-accECN-ConEx (SACK and accurate ECN feedback)

   o  SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)

   o  accECN-ConEx (no SACK but accurate ECN feedback)

   o  ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN)

   o  SACK-ConEx (SACK but no ECN at all)

   o  Basic-ConEx (neither SACK nor ECN)

   A ConEx sender MUST expose all congestion information to the network
   according to the congestion information received by ECN or based on
   loss information provided by the TCP feedback loop. A TCP sender
   SHOULD count congestion byte-wise (rather than packet-wise; see next
   paragraph). After any congestion notification, a sender MUST mark
   subsequent packets with the appropriate ConEx flag in the IP header.
   Furthermore, a ConEx sender must send enough credit to cover all
   experienced congestion for the connection so far, as well as the risk
   of congestion for the current transmission (see Section 4.2).

   With SACK the number of lost payload bytes is known, but not the
   number of packets carrying these bytes. With classic ECN only an
   indication is given that a marking occurred but not the exact number
   of payload bytes nor packets.
   As network congestion is usually byte-congestion [RFC7141], the
   byte-size of a packet marked with a CDO flag is defined to represent
   that number of bytes of congestion signalling
   [draft-ietf-conex-destopt]. Therefore the exact number of bytes
   should be taken into account, if available, to make the ConEx signal
   as exact as possible.

   Detailed mechanisms for congestion counting in each operation mode
   are described in the next section.

3.  Counting congestion

   A ConEx TCP sender maintains two counters: one that counts congestion
   based on the information retrieved by loss detection, and a second
   that accounts for ECN based congestion feedback. These counters hold
   the number of outstanding bytes that should be ConEx marked with
   the L flag or the E flag, respectively, in subsequent packets.

   The outstanding bytes for congestion indications based on loss are
   maintained in the loss exposure gauge (LEG), as explained in
   Section 3.1.

   The outstanding bytes counted based on ECN feedback information are
   maintained in the congestion exposure gauge (CEG), as explained in
   Section 3.2.

   When the sender sends a ConEx capable packet with the E or L flag set
   it reduces the respective counter by the byte-size of the packet.
   This is explained for both counters in Section 4.1. Usually all
   bytes of an IP packet must be counted. Therefore the sender SHOULD
   take the payload and headers into account, up to and including the IP
   header.

   If equal-sized packets, or at least equally distributed packet sizes
   can be assumed, the sender MAY only add and subtract TCP payload
   bytes. In this case there should be about the same number of ConEx
   marked packets as the original packets that were causing the
   congestion. Thus both contain about the same number of header bytes
   so they will cancel out. This case is assumed for simplicity in the
   following sections.
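
   The two gauges can be pictured with the following minimal sketch. It
   is illustrative only and not part of the specification: it assumes
   the simplified case above (only TCP payload bytes are counted), uses
   the marking rule from Section 4.1 (a flag is set while its gauge is
   positive), and all class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ConExGauges:
    """Signed byte counters for outstanding ConEx markings (hypothetical helper)."""

    leg: int = 0  # loss exposure gauge: bytes still to be sent with L set
    ceg: int = 0  # congestion exposure gauge: bytes still to be sent with E set

    def on_loss(self, lost_payload_bytes: int) -> None:
        # Loss detected (e.g. a retransmission was scheduled): credit the LEG.
        self.leg += lost_payload_bytes

    def on_ecn_feedback(self, marked_bytes: int) -> None:
        # ECN feedback received: credit the CEG.
        self.ceg += marked_bytes

    def mark_packet(self, payload_bytes: int) -> Tuple[bool, bool]:
        """Return the (L, E) flags for the next ConEx-capable packet.

        A flag is set while its gauge is positive, and the gauge is then
        reduced by the payload size, so it may become negative.
        """
        set_l = self.leg > 0
        set_e = self.ceg > 0
        if set_l:
            self.leg -= payload_bytes
        if set_e:
            self.ceg -= payload_bytes
        return set_l, set_e


# Usage: 2000 lost payload bytes are exposed by the next two 1460-byte packets.
gauges = ConExGauges()
gauges.on_loss(2000)
first = gauges.mark_packet(1460)   # (True, False); LEG is now 540
second = gauges.mark_packet(1460)  # (True, False); LEG is now -920
third = gauges.mark_packet(1460)   # (False, False); nothing outstanding
```

   Because a whole packet is marked even when fewer bytes are
   outstanding, the gauges overshoot and go negative, which is why both
   counters are signed.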

   Otherwise, if a sender sends different sized packets (with unequally
   distributed packet sizes), the sender needs to memorize or estimate
   the number of lost or ECN-marked packets. A sender might be able to
   reconstruct the number of packets and thus the header bytes if the
   packet sizes of all packets that were sent during the last RTT are
   known. Otherwise, if no additional information is available, the
   worst case number of packets and thus header bytes should be
   estimated, e.g. based on the minimum packet size (of all packets sent
   in the last RTT). If the number of newly sent-out packets with the
   ConEx L or E flag set is smaller (or larger) than this estimated
   number of lost/ECN-marked packets, the additional header bytes should
   be added to (or can be subtracted from) the respective gauge.

3.1.  Loss Detection

   This section applies whether or not SACK support is available. The
   following subsection in addition handles the case when SACK is not
   available.

   A TCP sender detects losses and subsequently retransmits the lost
   data. Therefore, a ConEx sender can simply set the ConEx L flag on
   all retransmissions in order to at least cover the number of bytes
   lost. If this approach is taken, no LEG is needed.

   However, any retransmission may be spurious. In this case more bytes
   have been marked than necessary. To compensate for this effect a
   ConEx sender can maintain a local signed counter, the LEG, which
   indicates the number of outstanding bytes to be sent with the ConEx L
   flag and which can also become negative. Using the LEG, when a TCP
   sender decides that a data segment needs to be retransmitted, it will
   increase LEG by the size of the TCP payload bytes in the
   retransmission (assuming equal sized segments such that the
   retransmitted packet will have the same number of header bytes as the
   original ones) and reduce the LEG as described in Section 4.
   Further, to accommodate spurious retransmissions, a ConEx sender
   SHOULD make use of heuristics to detect such spurious retransmissions
   (e.g. F-RTO [RFC5682], DSACK [RFC3708], and Eifel [RFC3522],
   [RFC4015]). When such a heuristic has determined that a certain
   number of packets were retransmitted erroneously, the ConEx sender
   subtracts the payload size of these TCP packets from LEG.

3.1.1.  Without SACK Support

   If multiple losses occur within one RTT and SACK is not used, it may
   take several RTTs until all lost data is retransmitted. With the
   scheme described above, the ConEx information will be delayed
   considerably, but timeliness is important for ConEx. However, for
   ConEx it is not important to know which data got lost but only how
   much. During the first RTT after the initial loss detection, the
   amount of received data and thus also the amount of lost data can be
   estimated based on the number of received ACKs. Therefore a ConEx
   sender can use the following algorithm to estimate the number of
   lost bytes with an additional delay of one RTT using an additional
   Loss Estimation Counter (LEC):

      flight_bytes:     current flight size in bytes
      retransmit_bytes: payload size of the retransmission

      At the first retransmission in a congestion event LEC is set:

         LEC = flight_bytes - 3*SMSS

         (At this point in time in the transmission, in the worst case,
         all packets in flight minus three that triggered the dupACKs
         could have been lost.)

      Then during the first RTT of the congestion event:

         For each retransmission:
            LEG += retransmit_bytes
            LEC -= retransmit_bytes

         For each ACK:
            LEC -= SMSS

      After one RTT:

         LEG += LEC

         (The LEC now estimates the number of outstanding bytes
         that should be ConEx L marked.)

      After the first RTT, for each following retransmission:

         if (LEC > 0): LEC -= retransmit_bytes
         else if (LEC==0): LEG += retransmit_bytes

         if (LEC < 0): LEG += -LEC

         (The LEG is not increased for those bytes that were
         already counted.)

3.2.  ECN

   ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to
   mark packets with the Congestion Experienced (CE) mark instead of
   dropping them when congestion occurs.

   A receiver might support 'classic' ECN, the more accurate ECN
   feedback scheme (AccECN), or neither. In the case that ECN is not
   supported for a connection, of course, no ECN marks will occur; thus
   the sender will never set the E flag. Otherwise, a ConEx sender
   needs to maintain a signed counter, the congestion exposure gauge
   (CEG), for the number of outstanding bytes that have to be ConEx
   marked with the E flag.

   The CEG is increased when ECN information is received from an
   ECN-capable receiver supporting the 'classic' ECN scheme or the
   accurate ECN feedback scheme. When the ConEx sender receives an ACK
   indicating one or more segments were received with a CE mark, CEG is
   increased by the appropriate number of bytes as described further
   below.

   Unfortunately, in the case of duplicate acknowledgements the number
   of newly acknowledged bytes will be zero even though (CE marked) data
   has been received. Therefore, we increase the CEG by DeliveredData,
   as defined below:

      DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS -
                      (is_after_dup)*num_dup*1SMSS

   DeliveredData covers the number of bytes that have been newly
   delivered to the receiver. Therefore on each arrival of an ACK,
   DeliveredData will be increased by the newly acknowledged bytes
   (acked_bytes) as indicated by the current ACK, relative to all past
   ACKs.
   The formula depends on whether SACK is available: if SACK is not
   available, SACK_diff is always zero, whereas if SACK information is
   available, is_dup and is_after_dup are always zero.

   With SACK, DeliveredData is increased by the number of bytes provided
   by (new) SACK information (SACK_diff). Note, if fewer unacknowledged
   bytes are announced in the new SACK information than in the previous
   ACK, SACK_diff can be negative. In this case, data is newly
   acknowledged (in acked_bytes) that has previously already been
   accumulated into DeliveredData based on SACK information.

   Otherwise, without SACK, DeliveredData is increased by 1 SMSS on
   duplicate acknowledgements, as duplicate acknowledgements do not
   acknowledge any new data (and acked_bytes will be zero). For the
   subsequent partial or full ACK, acked_bytes covers all newly
   acknowledged bytes, including the ones that were already accounted
   for with the reception of any duplicate acknowledgement. Therefore
   DeliveredData is reduced by one SMSS for each preceding duplicate
   ACK. Consequently, is_dup is one if the current ACK is a duplicated
   ACK without SACK, and zero otherwise. is_after_dup is only one for
   the next full or partial ACK after a number of duplicated ACKs
   without SACK, and num_dup counts the number of duplicated ACKs in a
   row (which usually is 3 or more).

   With classic ECN, one congestion marked packet causes continuous
   congestion feedback for a whole round trip, thus hiding the arrival
   of any further congestion marked packets during that round trip. The
   more accurate ECN feedback scheme (AccECN) is needed to ensure that
   feedback properly reflects the extent of congestion marking. The two
   cases, with and without a receiver capable of AccECN, are discussed
   in the following sections.

3.2.1.  Accurate ECN feedback

   With a more accurate ECN feedback scheme (AccECN) either the number
   of marked packets or the number of marked bytes is known. In the
   latter case the CEG can directly be increased by the number of marked
   bytes. Otherwise, if D is assumed to be the number of marks, the
   gauge (CEG) will be conservatively increased by one SMSS for each
   marking, or at most by the number of newly acknowledged bytes:

      CEG += min(SMSS*D, DeliveredData)

3.2.2.  Classic ECN support

   With classic ECN, as soon as a CE mark is seen at the receiver, it
   will feed this information back to the sender by setting the Echo
   Congestion Experienced (ECE) flag in the TCP header of subsequent
   ACKs. Once the sender receives the first ECE of a congestion
   notification, it sets the CWR flag in the TCP header once. When this
   packet with the Congestion Window Reduced (CWR) flag in the TCP
   header arrives at the receiver, acknowledging its first ECE feedback,
   the receiver stops setting ECE.

   If the ConEx sender fully conforms to the semantics of ECN signaling
   as defined by [RFC3168], it will receive one full RTT of ACKs with
   the ECE flag set whenever at least one CE mark was received by the
   receiver. As the sender cannot estimate how many packets have
   actually been CE marked during this RTT, the most conservative
   assumption MAY be taken, namely assuming that all packets were
   marked. This can be achieved by increasing the CEG by DeliveredData
   for each ACK with the ECE flag:

      CEG += DeliveredData

   Optionally a ConEx sender could implement the following technique
   (which does not conform to [RFC3168]), called advanced compatibility
   mode, to considerably improve its estimate of the number of
   ECN-marked packets:

   To extract more than one ECE indication per RTT, a ConEx sender could
   set the CWR flag continuously to force the receiver to signal only
   one ECE per CE mark.
   Unfortunately, the use of delayed ACKs [RFC5681] (which is common)
   will prevent feedback of every CE mark; if a CWR confirmation is
   received before the ECE can be sent out on the next ACK, ECN feedback
   information could get lost (depending on the actual receiver
   implementation). Thus a sender SHOULD set CWR only on those data
   segments that will presumably trigger a (delayed) ACK. The sender
   would need an additional control loop to estimate which data segments
   will trigger an ACK in order to extract more timely congestion
   notifications. Still the CEG SHOULD be increased by DeliveredData,
   as one or more CE marked packets could be acknowledged by one delayed
   ACK.

   The following argument is intended to prove that suppressing
   repetitions of ECE is safe against possible congestion collapse due
   to lost congestion feedback:

   Repetition of ECE in classic ECN is intended to ensure reliable
   delivery of congestion feedback. However, with advanced
   compatibility mode, it is possible to miss congestion notifications.
   This can happen in some implementations if delayed acknowledgements
   are used, as described above. Further, an ACK containing ECE can
   simply get lost. If only a few CE marks are received within one
   congestion event (e.g., only one), the loss of acknowledgements due
   to (heavy) congestion on the reverse path can prevent any congestion
   notification from being received by the sender.

   However, if loss of feedback exacerbates congestion on the forward
   path, more forward packets will be CE marked, increasing the
   likelihood that feedback from at least one CE will get through per
   RTT. As long as one ECE reaches the sender per RTT, the sender's
   congestion response will be the same as if CWR were not continuous.
   The only way that heavy congestion on the forward path could be
   completely hidden would be if all ACKs on the reverse path were lost.
   If total ACK loss persisted, the sender would time out and do a
   congestion response anyway. Therefore, the problem seems confined to
   potential suppression of a congestion response during light
   congestion.

   Anyway, even if loss of all ECN feedback led to no congestion
   response, the worst that could happen would be loss instead of
   ECN-signalled congestion on the forward path. Given compatibility
   mode does not affect loss feedback, there would be no risk of
   congestion collapse.

4.  Setting the ConEx Flags

   By setting the X flag, a packet is marked as ConEx-capable. All
   packets carrying payload MUST be marked with the X flag set,
   including retransmissions. Only if no congestion feedback
   information is (currently) available, the X flag SHOULD be zero, such
   as for control packets on a connection that has not sent any (user)
   data for some time, e.g., sending only pure ACKs which are not
   carrying any payload.

4.1.  Setting the E or the L Flag

   As described in Section 3.1, the sender needs to maintain a
   CEG counter and might maintain a LEG counter. If no LEG is used, all
   retransmissions will be marked with the L flag.

   Further, as long as the LEG or CEG counter is positive, the sender
   marks each ConEx-capable packet with L or E respectively, and
   decreases the LEG or CEG counter by the TCP payload bytes carried in
   the marked packet (assuming headers are not being counted because
   packet sizes are regular). No matter how small the value of LEG or
   CEG, if it is positive, the sender MUST NOT defer packet marking, so
   that ConEx signals remain timely. Therefore the value of LEG and CEG
   will commonly be negative.

   If both LEG and CEG are positive, the sender MUST mark each
   ConEx-capable packet with both L and E. If a credit signal is also
   pending (see next section), the C flag can be set as well.

4.2.  Setting the Credit Flag

   The ConEx abstract mechanism [draft-ietf-conex-abstract-mech]
   requires that sufficient credit MUST be signaled in advance to cover
   the expected congestion during the feedback delay of one RTT.

   To monitor the credit state at the audit, a ConEx sender needs to
   maintain a credit state counter CSC in bytes. If congestion occurs,
   credits will be consumed and the CSC is reduced by the number of
   bytes that were lost or estimated to be ECN-marked. If the risk of
   congestion was estimated wrongly and thus too few credits were sent,
   the CSC becomes zero but cannot go negative.

   To be sure that the credit state in the audit never reaches zero, the
   number of credits should always equal the number of bytes in flight
   as all packets could potentially get lost or congestion marked. In
   this case a ConEx sender also monitors the number of bytes in flight
   F. If F ever becomes larger than CSC, the ConEx sender sets the C
   flag on each ConEx-capable packet and increases CSC by the payload
   size of each marked packet until CSC is no less than F again.
   However, a ConEx sender might also be less conservative and send
   fewer credits, if it e.g. assumes based on previous experience that
   the congestion will be low on a certain path.

   Recall that CSC will be decreased whenever congestion occurs,
   therefore CSC will need to be replenished as soon as CSC drops below
   F. Also recall that the sender can set the C flag on a ConEx-capable
   packet whether or not the E or L flags are also set.

   In TCP slow start, the congestion window might grow much larger than
   during the rest of the transmission. A sender could therefore
   consider sending fewer than F credits, at the risk of being penalized
   by an audit function. However, the credits should at least cover the
   increase in sending rate.
   Given the sending rate doubles every RTT in Slow Start, a ConEx
   sender should at least cover half the number of packets in flight by
   credits.

   Note that the number of losses or markings within one RTT does not
   solely depend on the sender's actions. In general, the behavior of
   the cross traffic, whether active queue management (AQM) is used and
   how it is parameterized influence how many packets might be dropped
   or marked. As long as any AQM encountered is not overly aggressive
   with ECN marking, sending half the flight size as credits should be
   sufficient whether congestion is signaled by loss or ECN.

   To maintain half of the packets in flight as credits, of course half
   of the packets of the initial window must be C marked. In Slow Start
   marking every fourth packet introduces the correct amount of credit
   as can be seen in Figure 1.

                                  in_flight  credits
        RTT1 |------XC------>|        1        1
             |------X------->|        2        1
             |------XC------>|        3        2
             |               |
        RTT2 |------X------->|        3        2
             |------X------->|        4        2
             |------X------->|        4        2
             |------XC------>|        5        3
             |------X------->|        5        3
             |------X------->|        6        3
             |               |
        RTT3 |------X------->|        6        3
             |------XC------>|        7        4
             |------X------->|        7        4
             |------X------->|        8        4
             |------X------->|        8        4
             |------XC------>|        9        5
             |------X------->|        9        5
             |------X------->|       10        5
             |------X------->|       10        5
             |------XC------>|       11        6
             |------X------->|       11        6
             |------X------->|       12        6
             |       .       |
             |       :       |

      Figure 1: Credits in Slow Start (with an initial window of 3)

   It is possible that a TCP flow will encounter an audit function
   without relevant flow state, due to e.g. rerouting or memory
   limitations. Therefore, the sender needs to detect this case and
   resend credits.
   A ConEx sender might reset the credit counter CSC to zero if losses
   occur in subsequent RTTs (assuming that the sending rate was
   correctly reduced based on the received congestion signal and using a
   conservatively large RTT estimation).

   This section proposes concrete algorithms for determining how much
   credit to signal during congestion avoidance and slow start.
   However, experimentation in credit setting algorithms is expected and
   encouraged. The wider goal of ConEx is to reflect the 'cost' of the
   risk of causing congestion on those that contribute most to it.
   Thus, experimentation is encouraged to improve or maintain
   performance while reducing the risk of causing congestion, and
   therefore potentially reducing the need to signal so much credit.

5.  Loss of ConEx information

   Packets carrying ConEx signals could be discarded themselves. This
   will be a second order problem (e.g. if the loss probability is 0.1%,
   the probability of losing a ConEx L signal will be 0.1% of 0.1% =
   0.0001%). Further, the penalty an audit induces should be
   proportional to the mismatch of expected ConEx marks and observed
   congestion, therefore the audit might only slightly increase the loss
   level of this flow. Therefore, an implementer MAY choose to ignore
   this problem, accepting instead the risk that an audit function might
   wrongly penalize a flow.

   Nonetheless, a ConEx sender is responsible for always signaling
   sufficient congestion feedback and therefore SHOULD remember which
   packet was marked with either the L, the E or the C flag. If one of
   these packets is detected as lost, the sender SHOULD increase the
   respective gauge(s), LEG or CEG, by the number of lost payload bytes
   in addition to increasing LEG for the loss.

6.  Timeliness of the ConEx Signals

   ConEx signals will only be useful to a network node within a time
   delay of about one RTT after the congestion occurred.
   To avoid further delays, a ConEx sender SHOULD send the ConEx
   signaling on the next available packet.

   Any or all of the ConEx flags can be used in the same packet, which
   allows delay to be minimized when multiple signals are pending.  The
   need to set multiple ConEx flags at the same time can occur if,
   e.g., an ACK is received by the sender that simultaneously indicates
   that at least one ECN mark was received and that one or more
   segments were lost.  This may happen, e.g., during excessive
   congestion, where the queues overflow even though ECN was used: all
   forwarded packets are currently marked, while others have to be
   dropped nevertheless.  Another case when this might happen is when
   ACKs are lost, so that a subsequent ACK carries summary information
   not previously available to the sender.

   If a flow becomes application-limited, there could be insufficient
   bytes to send to reduce the gauges to zero or below.  In such cases,
   the sender cannot help but delay ConEx signals.  Nonetheless, as
   long as the sender is marking all outgoing packets, an audit
   function is unlikely to penalize ConEx-marked packets.  Therefore,
   no matter how long a gauge has been positive, a sender MUST NOT
   reduce the gauge by more than the ConEx-marked bytes it has sent.

   If the CEG or LEG counter is negative, the respective counter MAY be
   reset to zero within one RTT after it was last decreased, or one RTT
   after recovery if no further congestion occurred.

7.  Acknowledgements

   The authors would like to thank Bob Briscoe, who contributed his
   initial ideas [I-D.briscoe-conex-re-ecn-tcp] and valuable feedback.
   Moreover, thanks to Jana Iyengar, who provided valuable feedback.

8.  IANA Considerations

   This document does not have any requests to IANA.

9.  Security Considerations

   General ConEx security considerations are covered extensively in the
   ConEx abstract mechanism [draft-ietf-conex-abstract-mech].  This
   section covers TCP-specific concerns.

   The ConEx modifications to TCP provide no mechanism for a receiver
   to force a sender not to use ConEx.  A receiver can degrade the
   accuracy of ConEx by claiming that it does not support SACK, AccECN,
   or ECN, but the sender will never have to turn ConEx off.  The
   receiver cannot force the sender to mark ConEx more conservatively
   in order to cover the risk of any inaccuracy.  Instead, the sender
   can choose to mark inaccurately, which will only increase the
   likelihood of loss at an audit function.  Thus, the receiver will
   only harm itself.

   Assuming the sender is limited in some way by a congestion allowance
   or quota, a receiver could spoof more loss or ECN congestion
   feedback than it actually experiences, in an attempt to make the
   sender draw down its allowance faster than necessary.  However,
   over-declaring congestion simply makes the sender slow down.  If the
   receiver is interested in the content, it will not want to harm its
   own performance.

   However, if the receiver is solely interested in making the sender
   draw down its allowance, the net effect will depend on the sender's
   congestion control algorithm, because permanently adding more and
   more congestion feedback would cause the sender to reduce its
   sending rate further and further.  Therefore, a receiver can only
   maintain a certain congestion level that corresponds to a certain
   sending rate.  With New Reno [RFC5681], doubling congestion feedback
   causes the sender to reduce its sending rate such that it would
   consume only sqrt(2) = 1.4 times as much congestion allowance.
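   As a rough illustration of the New Reno figure, the following sketch
   assumes a steady-state response function of the form
   rate ~ p^(-b), where p is the congestion level signaled to the
   sender.  The exponent b, the function name allowance_factor, and its
   parameter names are assumptions made for this example and are not
   defined by this document; b = 0.5 corresponds to the commonly cited
   inverse-square-root behavior of New Reno-style congestion control.
   Allowance is drawn down at a rate proportional to rate * p, i.e.,
   p^(1-b), so scaling the feedback by k scales consumption by k^(1-b).

```python
def allowance_factor(feedback_scale: float, b: float) -> float:
    """Relative change in congestion-allowance consumption.

    Assumes (hypothetically) a steady-state sending rate proportional
    to p**(-b) for congestion level p.  Allowance is consumed at a
    rate proportional to rate * p = p**(1 - b), so multiplying p by
    feedback_scale multiplies consumption by feedback_scale**(1 - b).
    """
    if feedback_scale <= 0:
        raise ValueError("feedback_scale must be positive")
    return feedback_scale ** (1.0 - b)


if __name__ == "__main__":
    # New Reno-style response (assumed b = 0.5): doubling the feedback
    # multiplies allowance consumption by sqrt(2).
    print(round(allowance_factor(2.0, 0.5), 2))
    # A control that holds the congestion level independent of the
    # sending rate (assumed b = 1): no extra allowance is consumed.
    print(allowance_factor(2.0, 1.0))
```

   Under these assumptions, a receiver that doubles its feedback
   against a New Reno-style sender draws down roughly 1.4 times the
   allowance, while the sending rate falls by the same factor.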
   However, to improve scaling, congestion control algorithms are
   tending towards less responsive algorithms like Cubic or Compound
   TCP, and ultimately to linear algorithms like DCTCP [DCTCP] that aim
   to maintain the same congestion level independent of the current
   sending rate and always reduce their sending window if the signaled
   congestion feedback is higher.  In each case, if the receiver
   doubles congestion feedback, it causes the sender to consume more
   allowance by a factor of 1.2, 1.15, or 1, respectively, where 1
   implies the attack has become completely ineffective, as no further
   congestion allowance is consumed but the flow will decrease its
   sending rate to a minimum instead.

10.  References

10.1.  Normative References

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October
              1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP", RFC
              3168, September 2001.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [draft-ietf-conex-abstract-mech]
              Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
              Concepts and Abstract Mechanism", draft-ietf-conex-
              abstract-mech-06 (work in progress), October 2012.

   [draft-ietf-conex-destopt]
              Krishnan, S., Kuehlewind, M., and C. Ucendo, "IPv6
              Destination Option for ConEx", draft-ietf-conex-
              destopt-04 (work in progress), March 2013.

10.2.  Informative References

   [DCTCP]    Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J.,
              Patel, P., Prabhakar, B., Sengupta, S., and M. Sridharan,
              "DCTCP: Efficient Packet Transport for the Commoditized
              Data Center", January 2010.

   [I-D.briscoe-conex-re-ecn-tcp]
              Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
              "Re-ECN: Adding Accountability for Causing Congestion to
              TCP/IP", draft-briscoe-conex-re-ecn-tcp-04 (work in
              progress), July 2014.

   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
              for TCP", RFC 3522, April 2003.

   [RFC3708]  Blanton, E. and M. Allman, "Using TCP Duplicate Selective
              Acknowledgement (DSACKs) and Stream Control Transmission
              Protocol (SCTP) Duplicate Transmission Sequence Numbers
              (TSNs) to Detect Spurious Retransmissions", RFC 3708,
              February 2004.

   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
              for TCP", RFC 4015, February 2005.

   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
              Spurious Retransmission Timeouts with TCP", RFC 5682,
              September 2009.

   [RFC6789]  Briscoe, B., Woundy, R., and A. Cooper, "Congestion
              Exposure (ConEx) Concepts and Use Cases", RFC 6789,
              December 2012.

   [RFC7141]  Briscoe, B. and J. Manner, "Byte and Packet Congestion
              Notification", BCP 41, RFC 7141, February 2014.

   [draft-kuehlewind-tcpm-accurate-ecn]
              Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN
              Feedback in TCP", draft-kuehlewind-tcpm-accurate-ecn-02
              (work in progress), June 2013.

Appendix A.  Revision history

   RFC Editor: This section is to be removed before RFC publication.

   00 ... initial draft, early submission to meet deadline.

   01 ... refined draft, updated LEG "drain" from per-packet to
   RTT-based.

   02 ... added Section 5 and expanded discussion about ECN
   interaction.

   03 ... expanded the discussion around credit bits.

   04 ... review comments of Jana addressed.  (Change in full
   compliance mode.)

   05 ... changes on Loss Detection without SACK, support of classic
   ECN and credit handling.

   07 ... review feedback provided by Nandita.

   08 ... based on Bob's feedback: wording edits and structuring of a
   few paragraphs; change of SHOULD to MAY for resetting negative
   LEG/CEG; additional security considerations provided by Bob
   (thanks!).

Authors' Addresses

   Mirja Kuehlewind (editor)
   ETH Zurich
   Switzerland

   Email: mirja.kuehlewind@tik.ee.ethz.ch

   Richard Scheffenegger
   NetApp, Inc.
   Am Euro Platz 2
   Vienna 1120
   Austria

   Phone: +43 1 3676811 3146
   Email: rs@netapp.com