Congestion Exposure (ConEx)                           M. Kuehlewind, Ed.
Internet-Draft                                   University of Stuttgart
Intended status: Experimental                           R. Scheffenegger
Expires: May 3, 2012                                        NetApp, Inc.
                                                        October 31, 2011

                      Accurate ECN Feedback in TCP
                 draft-kuehlewind-conex-accurate-ecn-01

Abstract

   Explicit Congestion Notification (ECN) is an IP/TCP mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate congestion to the end-points. An ECN-capable receiver
   feeds this information back to the sender. ECN is specified for TCP
   in such a way that only one feedback signal can be transmitted per
   Round-Trip Time (RTT). More recent TCP mechanisms such as ConEx or
   DCTCP need more accurate feedback information when more than one
   marking is received in one RTT.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 3, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1. Introduction
     1.1. Overview ECN and ECN Nonce in TCP
     1.2. Design choices
     1.3. Requirements Language
   2. Negotiation in TCP handshake
   3. Accurate Feedback
     3.1. Coding
       3.1.1. Requirements
       3.1.2. One bit feedback flag
         3.1.2.1. Discussion
       3.1.3. Three bit field with counter feedback
         3.1.3.1. Discussion
       3.1.4. Codepoints with dual counter feedback
         3.1.4.1. Implementation
         3.1.4.2. Discussion
       3.1.5. Short Summary of the Discussions
     3.2. TCP Sender
     3.3. TCP Receiver
     3.4. Advanced Compatibility Mode
   4. Acknowledgements
   5. IANA Considerations
   6. Security Considerations
   7. References
     7.1. Normative References
     7.2. Informative References
   Appendix A. Pseudo Code for the Codepoint Coding
   Authors' Addresses

1. Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is an IP/TCP
   mechanism where network nodes can mark IP packets instead of dropping
   them to indicate congestion to the end-points. An ECN-capable
   receiver feeds this information back to the sender. ECN is specified
   for TCP in such a way that only one feedback signal can be
   transmitted per Round-Trip Time (RTT). Recently proposed mechanisms
   like Congestion Exposure (ConEx) or DCTCP [Ali10] need more accurate
   feedback information when more than one marking is received in one
   RTT.

   This document discusses (and will, in a later version, specify) a
   different scheme for the ECN feedback in the TCP header that provides
   more than one feedback signal per RTT. This modification does not
   obsolete [RFC3168]. It provides an extension that requires
   additional negotiation in the TCP handshake by using the TCP nonce
   sum (NS) bit, which is currently not used when SYN is set.

   The current version of this document proposes several coding schemes
   for discussion. All proposed codings aim to cope with the given bit
   space. All schemes require the use of the NS bit at least in the TCP
   handshake. Depending on the coding scheme, the accurate ECN feedback
   extension will or will not include the ECN-Nonce integrity mechanism.
   A later version of this document will choose between the coding
   options, and remove the rationale for the choice and the specs of
   those schemes not chosen. If a scheme is chosen that does not
   include ECN Nonce, a mechanism that requires more accurate ECN
   feedback needs to provide its own method to ensure the integrity of
   the congestion feedback information, or has to cope with the
   uncertainty of this information.
   The following scenarios briefly show where the accurate feedback is
   needed or provides additional value:

   a. A standard TCP sender with the [RFC5681] congestion control
      algorithm that supports ConEx:
      In this case the congestion control algorithm still ignores
      multiple marks per RTT, while the ConEx mechanism uses the extra
      information per RTT to re-echo more precise congestion
      information.

   b. A sender using DCTCP without ConEx:
      The congestion control algorithm uses the extra information per
      RTT to perform its decrease depending on the number of congestion
      marks.

   c. A sender that uses DCTCP congestion control and supports ConEx:
      Both the congestion control algorithm and ConEx use the accurate
      ECN feedback mechanism.

   d. A standard TCP sender using the [RFC5681] congestion control
      algorithm without ConEx:
      No accurate feedback is necessary here. The congestion control
      algorithm still reacts to only one signal per RTT. But it is best
      to have one generic feedback mechanism, whether it is used or
      not.

1.1. Overview ECN and ECN Nonce in TCP

   ECN requires two bits in the IP header. The ECN capability of a
   packet is indicated when either one of the two bits is set. An ECN
   sender can set one or the other bit to indicate an ECN-capable
   transport (ECT), which results in two signals: ECT(0) and ECT(1). A
   network node can set both bits simultaneously when it experiences
   congestion. When both bits are set, the packet is regarded as
   "Congestion Experienced" (CE).

   In the TCP header two bits in byte 14 are defined for the use of ECN.
   The TCP mechanism for signaling the reception of a congestion mark
   uses the ECN-Echo (ECE) flag in the TCP header. To enable the TCP
   receiver to determine when to stop setting the ECN-Echo flag, the CWR
   flag is set by the sender upon reception of the feedback signal.
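
   The classic echo behavior described above can be sketched as
   follows. This is a minimal illustration, not taken from this
   document or from [RFC3168] verbatim: the class and method names are
   hypothetical, and the handling of a packet that carries both CWR and
   CE (CWR is processed first) is an assumption made for the sketch.

```python
from dataclasses import dataclass

# Values of the two-bit ECN field in the IP header (RFC 3168).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


@dataclass
class ClassicEcnReceiver:
    """Hypothetical model of the classic ECN echo at the data receiver."""

    echoing: bool = False  # True while ECE must be set on outgoing ACKs

    def on_data(self, ecn_field: int, cwr: bool) -> bool:
        """Process one data packet and return the ECE flag for the next ACK."""
        if cwr:
            # The sender has reacted to the feedback: stop echoing.
            self.echoing = False
        if ecn_field == CE:
            # Assumption: CE on the same packet restarts the echo.
            self.echoing = True
        return self.echoing
```

   Because ECE stays set until CWR arrives, several CE marks within the
   same RTT collapse into one signal, which is the limitation the
   accurate ECN feedback schemes in this document address.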

   ECN-Nonce [RFC3540] is an optional addition to ECN that is used to
   protect the TCP sender against accidental or malicious concealment
   of marked or dropped packets. This addition defines the last bit of
   byte 13 in the TCP header as the Nonce Sum (NS) bit. With ECN-Nonce
   a nonce sum is maintained that counts the occurrence of ECT(1)
   packets.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 1: The (post-ECN Nonce) definition of the TCP header flags

1.2. Design choices

   The idea of this document is to use the ECE, CWR and NS bits for
   additional capability negotiation during the SYN/SYN-ACK exchange,
   and then for the more accurate feedback itself on subsequent packets
   in the flow (with SYN=0).

   Alternatively, a new TCP option could be introduced to help maintain
   the accuracy and integrity of the ECN feedback between receiver and
   sender. Such an option could provide more information. For example,
   ECN for RTP/UDP explicitly provides the number of ECT(0), ECT(1), CE,
   non-ECT marked and lost packets. However, deploying new TCP options
   has its own challenges. A separate document proposes a new TCP
   option for accurate ECN feedback. This option could be used in
   addition to a more accurate ECN feedback scheme described here or in
   addition to the classic ECN, when available and needed.

   As seen in Figure 1, there are currently three unused flag bits in
   the TCP header. Any of the schemes described below could be extended
   by one or more bits to add higher resiliency against ACK loss.
   The relative gains would be proportional to each of the described
   schemes, while the respective drawbacks would remain identical. Thus
   the approach in this document is to cope with the given number of
   bits, as they seem to be already sufficient and the accurate ECN
   feedback scheme will only be used instead of the classic ECN and
   never in parallel.

1.3. Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   We use the following terminology from [RFC3168] and [RFC3540]:

   The ECN field in the IP header:

      CE: the Congestion Experienced codepoint; and

      ECT(0)/ECT(1): either one of the two ECN-Capable Transport
      codepoints.

   The ECN flags in the TCP header:

      CWR: the Congestion Window Reduced flag;

      ECE: the ECN-Echo flag; and

      NS: ECN Nonce Sum.

   In this document, we will call the ECN feedback scheme as specified
   in [RFC3168] the 'classic ECN' and our new proposal the 'accurate ECN
   feedback' scheme. A 'congestion mark' is defined as an IP packet
   where the CE codepoint is set. A 'congestion event' refers to one or
   more congestion marks belonging to the same overload situation in the
   network (usually during one RTT).

2. Negotiation in TCP handshake

   During the TCP handshake at the start of a connection, an originator
   of the connection (host A) MUST indicate a request to get more
   accurate ECN feedback by setting the TCP flags NS=1, CWR=1 and ECE=1
   in the initial SYN.

   A responding host (host B) MUST return a SYN ACK with flags CWR=1 and
   ECE=0. The responding host MUST NOT set this combination of flags
   unless the preceding SYN has already requested support for accurate
   ECN feedback as above.
   Normally a server (B) will reply to a client with NS=0, but if the
   initial SYN from client A is marked CE, server B can set the NS flag
   to 1 to indicate the congestion immediately instead of delaying the
   signal to the first acknowledgment, when the actual data transmission
   has already started. So, server B MAY set the alternative TCP header
   flags in its SYN ACK: NS=1, CWR=1 and ECE=0.

   The addition of ECN to TCP SYN/ACK packets is discussed and specified
   as experimental in [RFC5562]. The addition of ECN to the SYN packet
   is optional. The security implications of using this option are not
   discussed further here.

   These handshakes are summarized in Table 1 below, with X indicating
   that NS can be either 0 or 1 depending on whether congestion has been
   experienced. The handshakes used for the other flavors of ECN are
   also shown for comparison. To compress the width of the table, the
   headings of the first four columns have been severely abbreviated, as
   follows:

   Ac: *Ac*curate ECN Feedback

   N: ECN-*N*once (RFC3540)

   E: *E*CN (RFC3168)

   I: Not-ECN (*I*mplicit congestion notification).

   +----+---+---+---+------------+----------------+------------------+
   | Ac | N | E | I | [SYN] A->B | [SYN,ACK] B->A | Mode             |
   +----+---+---+---+------------+----------------+------------------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE     |                  |
   | AB |   |   |   | 1   1   1  | X   1   0      | accurate ECN     |
   | A  | B |   |   | 1   1   1  | 1   0   1      | ECN Nonce        |
   | A  |   | B |   | 1   1   1  | 0   0   1      | classic ECN      |
   | A  |   |   | B | 1   1   1  | 0   0   0      | Not ECN          |
   | A  |   |   | B | 1   1   1  | 1   1   1      | Not ECN (broken) |
   +----+---+---+---+------------+----------------+------------------+

   Table 1: ECN capability negotiation between Sender (A) and
   Receiver (B)

   Recall that, if the SYN ACK reflects the same flag settings as the
   preceding SYN (because there is a broken RFC3168 compliant
   implementation that behaves this way), RFC3168 specifies that the
   whole connection MUST revert to Not-ECT.

3. Accurate Feedback

   In this section we refer to the sender as the host sending data and
   to the receiver as the host that acknowledges this data. Of course,
   such a scenario describes only one half-connection of a TCP
   connection. The proposed scheme, if negotiated, will be used for
   both half-connections, as both sender and receiver need to be capable
   of echoing and understanding the accurate ECN feedback scheme.

3.1. Coding

   This section proposes three different coding schemes for discussion.
   First, requirements are listed that allow the proposed schemes to be
   evaluated against each other. A later version of this document will
   choose between the coding options, and remove the rationale for the
   choice and the specs of those schemes not chosen. The next section
   provides basically a fourth alternative to allow a compatibility mode
   when a sender needs accurate feedback but has to operate with a
   legacy [RFC3168] receiver.

3.1.1. Requirements

   The requirements of the accurate ECN feedback protocol for use by,
   e.g., ConEx or DCTCP are fairly accurate (not necessarily perfect),
   timely and protected signaling. This leads to the following
   requirements:

   Resilience
      The ECN feedback signal is implicitly carried within the TCP
      acknowledgment. TCP ACKs can get lost. Moreover, delayed ACKs
      are usually used with TCP. That means in most cases only every
      second data packet gets acknowledged. In a high congestion
      situation where most of the packets are marked with CE, an
      accurate feedback mechanism must still be able to signal
      sufficient congestion information. Thus the accurate ECN
      feedback extension has to take delayed ACKs and ACK loss into
      account.

   Timely
      The CE marking is induced by a network node on the transmission
      path and echoed by the receiver in the TCP acknowledgment. Thus
      when this information arrives at the sender, it is naturally
      already about one RTT old. With a sufficient ACK rate, a further
      delay of a small number of ACKs can be tolerated, but with large
      delays this information will be outdated due to the high dynamics
      in the network. TCP congestion control, which introduces part of
      these dynamics, operates on a time scale of one RTT. Thus the
      congestion feedback information should be delivered in a timely
      manner (within one RTT).

   Integrity
      With ECN Nonce, a misbehaving receiver can be detected with a
      certain probability. As this accurate ECN feedback might reuse
      the NS bit, it is encouraged to ensure integrity at least as good
      as ECN Nonce. If this is not possible, alternative approaches
      should be provided for how a mechanism using the accurate ECN
      feedback extension can re-ensure integrity or give strong
      incentives for the receiver and network node to cooperate
      honestly.

   Accuracy
      Classic ECN feeds back one congestion notification per RTT, as
      this is supposed to be used for TCP congestion control, which
      reduces the sending rate at most once per RTT. The accurate ECN
      feedback scheme has to ensure that, if a congestion event occurs,
      at least one congestion notification is echoed and received per
      RTT, as classic ECN would do. Of course, the goal of this
      extension is to reconstruct the number of CE markings more
      accurately. However, a sender should not assume that it gets the
      exact number of congestion markings in a high congestion
      situation.

   Complexity
      Of course, the more accurate ECN feedback can also be used even
      if only one ECN feedback signal per RTT is needed. To enable
      this proposal for a more accurate ECN feedback as the standard
      ECN feedback mechanism, the implementation should be as simple as
      possible and a minimum of additional state information should be
      needed.

3.1.2. One bit feedback flag

   Remark: In one acknowledgment all acknowledged bytes are regarded as
   congested.

   This option uses a one bit flag, namely the ECE bit, to signal more
   accurate ECN feedback. Unlike classic ECN feedback, an accurate ECN
   feedback receiver MUST set the ECE bit in only one ACK packet for
   each CE mark received. An accurate ECN feedback receiver MUST NOT
   wait for a CWR bit from the sender to reset the ECE bit.

   As the CWR would now be unused, the CWR MUST be set in the subsequent
   ACK after the ECE was set.

      CWR(t) = ECE(t-1)

   This provides some redundancy in case of ACK loss. If the sender
   knows the ACK'ing scheme of the receiver (e.g. delayed ACKs will send
   at least one ACK for every two data packets), the sender can detect
   ACK loss. If two or more subsequent ACKs are lost, the sender SHOULD
   assume congestion marks for the respective number of ack'ed bytes.
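
   The redundancy provided by CWR(t) = ECE(t-1) can be illustrated with
   the following sketch. It is not part of the proposal: the function
   names are hypothetical, a lost ACK is represented as None, and the
   sketch assumes the sender already knows which ACKs were lost (for
   example from the receiver's ACK'ing scheme, as described above).

```python
def ack_flags(ece_sequence):
    """Return one (ECE, CWR) pair per ACK, with CWR(t) = ECE(t-1)."""
    flags = []
    prev_ece = 0  # assumption: no ECE precedes the first ACK
    for ece in ece_sequence:
        flags.append((ece, prev_ece))
        prev_ece = ece
    return flags


def recover_ece(received):
    """Rebuild the ECE sequence at the sender; lost ACKs are None.

    A single lost ACK is recovered from the CWR bit of the next ACK.
    If the next ACK is lost as well (or has not arrived), the sender
    conservatively assumes a congestion mark.
    """
    recovered = []
    for t, ack in enumerate(received):
        if ack is not None:
            recovered.append(ack[0])
        elif t + 1 < len(received) and received[t + 1] is not None:
            recovered.append(received[t + 1][1])
        else:
            recovered.append(1)
    return recovered
```

   For example, losing only the second ACK of the ECE sequence
   0, 1, 0, 1 still lets the sender recover the mark, whereas losing
   two ACKs in a row forces the conservative assumption for the first
   of them.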

   Moreover, when a congestion situation occurs or stops, the receiver
   MUST immediately acknowledge the data packet and MUST NOT delay the
   acknowledgment until a further data packet has arrived. A congestion
   situation occurs when the previous data packet was CE=0 but the
   current one is CE=1. A congestion situation stops when the previous
   data packet was CE=1 and the current one is CE=0.

   The following figure shows a simple state machine to describe the
   receiver behavior.

                        Send immediate
                        ACK with ECE=0
                 .---.  .------------.  .---.
    Send 1 ACK  /    v  v            |  |    \
     for every |    .------.      .------.    | Send 1 ACK
     m packets |    | CE=0 |      | CE=1 |    | for every
    with ECE=0 |    '------'      '------'    | m packets
                \    |  |            ^  ^    /  with ECE=1
                 '---'  '------------'  '---'
                        Send immediate
                        ACK with ECE=1

   Figure 2: Two state ACK generation state machine

   Thus whenever an ACK with the ECE flag set arrives, all acknowledged
   bytes were congestion marked. This scheme provides a byte-wise ECN
   feedback. The number of CE-marked packets can be estimated by
   dividing the amount of ack'ed bytes by the Maximum Segment Size
   (MSS).

   When one ACK was lost and the ECN feedback is received based on the
   CWR bit being set, the sender SHOULD conservatively assume that all
   newly acked bytes were congestion marked.

3.1.2.1. Discussion

   ACK loss

   In low congestion situations (less than one CE mark per RTT on
   average), the loss of two subsequent ACKs would result in complete
   loss of the congestion information. The opposite would be true
   during high congestion, where the sender can incorrectly assume that
   all segments were received with the CE codepoint.

   One solution would be to carry the same information in a defined
   number of subsequent ACK packets. This would reduce the number of
   feedback signals that can be transmitted in one RTT but improve the
   integrity.
   More sophisticated solutions based on ACK loss detection might be
   possible as well.

   With DCTCP [Ali10] it was proposed to acknowledge a data packet
   directly without delay when a congestion situation occurs, as already
   described above. This scheme allows a more accurate feedback signal
   in a high congestion/marking situation. However, using delayed ACKs
   is important for a variety of reasons, including reducing the load on
   the data sender.

   As this heuristic triggers immediate ACKs whenever the received CE
   bit toggles, arbitrarily large ACK ratios are supported. However,
   the effective ACK ratio depends on the congestion state of the
   network. Thus it may collapse to 1 (one ACK for each data segment)
   when every other segment is received with CE set. More sophisticated
   solutions based on ACK loss detection might be possible as well.

   ECN Nonce

   As the ECN Nonce bit is not used otherwise, ECN Nonce [RFC3540] can
   be used in a complementary way. Network paths not supporting ECN,
   misbehaving receivers, or malicious receivers withholding ECN
   information can therefore be detected.

3.1.3. Three bit field with counter feedback

   The receiver maintains an unsigned integer counter which we call ECC
   (echo congestion counter). This counter maintains a count of how
   many times a CE marked packet has arrived during the half-connection.
   Once a TCP connection is established, the three TCP header flags
   (ECE, CWR and NS) are used as a 3-bit field for the receiver to
   continuously signal to the sender the current value of ECC, modulo 8,
   whenever it sends a TCP ACK. We will call these three bits the echo
   congestion increment (ECI) field.

   This overloaded use of these 3 flags as one 3-bit ECI field is shown
   in Figure 3. The actual definition of the TCP header, including the
   addition of support for the ECN Nonce, is shown for comparison in
   Figure 1.
   This specification does not redefine the names of these three TCP
   header flags; it merely overloads them with another definition once a
   flow with accurate ECN feedback is established.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 3: Definition of the ECI field within bytes 13 and 14 of the
   TCP Header (when SYN=0).

   Also note that, whenever the SYN flag of a TCP segment is set
   (including when the ACK flag is also set), the NS, CWR and ECE flags
   (i.e. the ECI field of the SYNACK) MUST NOT be interpreted as the
   3-bit ECI value, which is only set as a copy of the local ECC value
   in non-SYN packets.

   This scheme was first proposed in [I-D.briscoe-tsvwg-re-ecn-tcp] for
   use with re-ECN.

3.1.3.1. Discussion

   ACK loss

   As pure ACKs are not protected by TCP reliable delivery, we repeat
   the same ECI value in every ACK until it changes. Even if many ACKs
   in a row are lost, as soon as one gets through, the ECI field it
   repeats from previous ACKs that didn't get through will update the
   sender on how many CE marks arrived since the last ACK got through.

   The sender will only lose a record of the arrival of a CE mark if all
   the ACKs are lost (and all of them were pure ACKs) for a stream of
   data long enough to contain 8 or more CE marks. So, if the marking
   fraction was p, at least 8/p pure ACKs would have to be lost. For
   example, if p was 5%, a sequence of 160 pure ACKs (without delayed
   ACKs) would all have to be lost. When ACKs are delayed, this number
   is reduced by a factor of m, where m is the number of data packets
   per ACK. This would still require a sequence of 80 lost pure ACKs
   with the usual delay rate of m=2.
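
   The counter arithmetic above can be sketched as follows. This is an
   illustration only, under stated assumptions: the function names are
   hypothetical, the ECI field is handled as an integer 0..7 without
   committing to a bit order within the three flags, and
   pure_acks_to_lose() simply evaluates the 8/p estimate from the text,
   divided by the delayed ACK ratio m.

```python
ECI_MODULUS = 8  # three flag bits (NS, CWR, ECE) give eight values


def eci_field(ecc: int) -> int:
    """Receiver side: value signaled in every ACK, i.e. ECC modulo 8."""
    return ecc % ECI_MODULUS


def eci_increment(eci: int, last_eci: int) -> int:
    """Sender side: apparent number of new CE marks since the last ACK."""
    return (eci - last_eci) % ECI_MODULUS


def pure_acks_to_lose(p: float, m: int = 1) -> float:
    """Pure ACKs that must be lost in a row before the field can wrap.

    p is the marking fraction and m the number of data packets per ACK.
    """
    return ECI_MODULUS / (p * m)
```

   With ECC moving from 0 to 3 and then to 9, the receiver signals 3
   and then 1, and the sender recovers increments of 3 and 6. If eight
   marks arrive between two received ACKs, the increment appears as 0,
   which is the information loss discussed above.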

   Additionally, to protect against such extremely unlikely events, if a
   re-ECN sender detects that a sequence of pure ACKs has been lost, it
   can assume the ECI field wrapped as many times as possible within the
   sequence. E.g., if a re-ECN sender receives an ACK with an
   acknowledgement number that acknowledges L (>m) segments since the
   previous ACK but with a sequence number unchanged from the previously
   received ACK, it can conservatively assume that the ECI field
   incremented by D' = L - ((L-D) mod 8), where D is the apparent
   increase in the ECI field. For example, if the ACK arriving after 9
   pure ACK losses apparently increased ECI by 2, the assumed increment
   of ECI would still be 2. But if ECI apparently increased by 2 after
   11 pure ACK losses, ECI should be assumed to have increased by 10.

   ECN Nonce

   ECN Nonce cannot be used in parallel with this scheme. But
   mechanisms that make use of this new scheme might provide stronger
   incentives to declare congestion honestly when needed. E.g., with
   ConEx each congestion notification suppressed by the receiver should
   lead the ConEx audit function to discard an equivalent number of
   bytes, such that the receiver does not gain from suppressing
   feedback. This mechanism would even provide a stronger integrity
   mechanism than ECN-Nonce does. Without an external framework to
   discourage the withholding of ECN information, this scheme is
   vulnerable to the problems described in [RFC3540].

3.1.4. Codepoints with dual counter feedback

   As in the previous section (see Figure 3), the ECE, CWR and NS bits
   are used as one field, but here they encode 8 codepoints instead of
   a single counter value. These 8 codepoints, as shown below, encode
   either a "congestion indication" (CI) counter or an ECT(1) counter
   (E1). These counters maintain the number of CE marks or the number
   of ECT(1) signals observed at the receiver, respectively.

   +-----+----+-----+-----+------------+------------+
   | ECI | NS | CWR | ECE | CI (base5) | E1 (base3) |
   +-----+----+-----+-----+------------+------------+
   |  0  | 0  |  0  |  0  |     0      |     -      |
   |  1  | 0  |  0  |  1  |     1      |     -      |
   |  2  | 0  |  1  |  0  |     2      |     -      |
   |  3  | 0  |  1  |  1  |     3      |     -      |
   |  4  | 1  |  0  |  0  |     4      |     -      |
   |  5  | 1  |  0  |  1  |     -      |     0      |
   |  6  | 1  |  1  |  0  |     -      |     1      |
   |  7  | 1  |  1  |  1  |     -      |     2      |
   +-----+----+-----+-----+------------+------------+

   Table 2: Codepoint assignment for accurate ECN feedback

   By default an accurate ECN receiver MUST echo the CI counter (modulo
   5) with the respective codepoints. Whenever a CE mark occurs and
   thus the value of CI has changed, the receiver MUST echo the CI in
   the next ACK. Moreover, the receiver MUST repeat the codepoint that
   provides the CI counter directly on the subsequent ACK. Thus every
   value of CI will be transmitted at least twice.

   If an ECT(1) mark is received and thus E1 increases, the receiver has
   to convey that updated information to the sender as soon as possible.
   Thus on the reception of an ECT(1) marked packet, the receiver MUST
   signal the current value of the E1 counter (modulo 3) in the next
   ACK, unless a CE mark was received which has not yet been echoed
   twice. The receiver MUST also repeat every E1 value. But this
   repetition does not need to be in the subsequent ACK, as the E1 value
   will only be transmitted when no changes in the CI have occurred.
   Each E1 value will be sent exactly twice. The repetition of every
   signal provides further resilience against lost ACKs.

   As only a limited number of E1 codepoints exist and the receiver
   might not acknowledge every single data packet immediately (delayed
   ACKs), a sender SHOULD NOT mark more than 1/m of the packets with
   ECT(1), where m is the ACK ratio (e.g. 50% when every second data
   packet triggers an ACK). This constraint avoids a permanent feedback
   of E1 only.
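
   The codepoint assignment in Table 2 can be expressed compactly as
   below. This is a sketch under stated assumptions, not the pseudo
   code of Appendix A: the function names are hypothetical, and the
   sketch covers only the mapping between counter values and flag bits,
   not the rules for when each counter is echoed or repeated.

```python
CI_MODULUS = 5  # codepoints 0..4 carry the CI counter modulo 5
E1_MODULUS = 3  # codepoints 5..7 carry the E1 counter modulo 3


def encode_ci(ci: int) -> int:
    """Codepoint (ECI value) that signals the CI counter."""
    return ci % CI_MODULUS


def encode_e1(e1: int) -> int:
    """Codepoint (ECI value) that signals the E1 counter."""
    return CI_MODULUS + e1 % E1_MODULUS


def to_flags(codepoint: int) -> tuple:
    """Split a codepoint into (NS, CWR, ECE), NS being the high bit."""
    return (codepoint >> 2) & 1, (codepoint >> 1) & 1, codepoint & 1


def decode(ns: int, cwr: int, ece: int) -> tuple:
    """Return ("CI", value) or ("E1", value) for the received flags."""
    codepoint = (ns << 2) | (cwr << 1) | ece
    if codepoint < CI_MODULUS:
        return "CI", codepoint
    return "E1", codepoint - CI_MODULUS
```

   For instance, a CI counter of 7 is signaled as codepoint 2 (NS=0,
   CWR=1, ECE=0), and an E1 counter of 4 as codepoint 6 (NS=1, CWR=1,
   ECE=0), matching the rows of Table 2.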

   This requirement may conflict with delayed ACK ratios larger than
   two, given the available number of codepoints. A receiver MUST
   change the ACK'ing rate such that a sufficient rate of feedback
   signals can be sent. Details on how the change in the ACK'ing rate
   should be implemented are given in the next subsection.

3.1.4.1. Implementation

   The basic idea is for the receiver to count how many packets carry a
   congestion notification. This could, in principle, be achieved by
   increasing a "congestion indication" counter (CI.c) for every
   incoming CE marked segment. Since the space for communicating the
   information back to the sender in ACKs is limited, a "gauge" (CI.g)
   is increased instead of directly increasing this counter.

   When sending an ACK, the content of this gauge (capped by the maximum
   number that can be encoded in the ACK, e.g. 4 for CI, and 2 for E1)
   is copied to the actual counter, and CI.g is reduced by the value
   that was copied over and transmitted, unless CI.g was zero before.
   To avoid losing information, it is ensured that an ACK is sent at
   least after 5 incoming congestion marks (i.e. when CI.g exceeds 5).

   For resilience against lost ACKs, an indicator flag (CI.i) ensures
   that, whether another congestion indication arrives or not, a second
   ACK transmits the previous counter value again.

   The same counter / gauge method is used to count and feed back (using
   a different mapping) the number of incoming packets marked ECT(1)
   (called E1 in the algorithm). As fewer codepoints are available for
   conveying the E1 counter value, an immediate ACK MUST be triggered
   whenever the gauge E1.g exceeds a threshold of 3. The sender
   receives the receiver's counter values and compares them with the
   locally maintained counter.
   Any increase of these counters is added to the sender's internal
   counters, yielding a precise number of CE-marked and ECT(1) marked
   packets. Architecturally the counters never decrease during a TCP
   session. However, the signaled values wrap modulo 5 for CI and
   modulo 3 for E1.

   The following table provides an example showing a half-connection
   with a TCP sender A and receiver B. The sender maintains a counter
   CI.r to reconstruct the number of CE marks received at the receiver
   side.

   +----+------+---------------+------------+---------------+------+
   |    | Data | TCP A         | IP         | TCP B         | Data |
   +----+------+---------------+------------+---------------+------+
   |    |      | SEQ  ACK  CTL |            | SEQ  ACK  CTL |      |
   | -- |      | ------------- | ---------- | ------------- |      |
   | 1  |      | 0100      SYN | ---->      |               |      |
   |    |      | CWR,ECE,NS    |            |               |      |
   | 2  |      |               | <----      | 0300 0101 SYN |      |
   |    |      |               |            | ACK,CWR       |      |
   | 3  |      | 0101 0301 ACK | ECT0 -CE-> |               |      |
   |    |      |               |            | CI.c=0 CI.g=1 |      |
   | 4  | 100  | 0101 0301 ACK | ECT0 ----> |               |      |
   |    |      |               |            | CI.c=1 CI.g=0 |      |
   | 5  |      |               | <----      | 0301 0201 ACK |      |
   |    |      |               |            | ECI=CI.1      |      |
   |    |      | CI.r=1        |            |               |      |
   | 6  | 100  | 0201 0301 ACK | ECT0 -CE-> |               |      |
   |    |      |               |            | CI.c=1 CI.g=1 |      |
   | 7  | 100  | 0301 0301 ACK | ECT0 -CE-> |               |      |
   |    |      |               |            | CI.c=1 CI.g=2 |      |
   | 8  |      |               | XX--       | 0301 0401 ACK |      |
   |    |      |               |            | ECI=CI.1      |      |
   |    |      | CI.r=1        |            |               |      |
   | 9  | 100  | 0401 0301 ACK | ECT0 -CE-> |               |      |
   |    |      |               |            | CI.c=1 CI.g=3 |      |
   | 10 | 100  | 0501 0301 ACK | ECT0 -CE-> |               |      |
   |    |      |               |            | CI.c=5 CI.g=0 |      |
   | 11 |      |               | <----      | 0301 0601 ACK |      |
   |    |      |               |            | ECI=CI.0      |      |
   |    |      | CI.r=5        |            |               |      |
   | 12 | 100  | 0601 0301 ACK | ECT0 -CE-> |               |      |
   |    |      |               |            | CI.c=5 CI.g=1 |      |
   | 13 | 100  | 0701 0301 ACK | ECT0 -CE-> |               |      |
   |    |      |               |            | CI.c=5 CI.g=2 |      |
   | 14 |      |               | <----      | 0301 0801 ACK |      |
   |    |      |               |            | ECI=CI.0      |      |
   |    |      | CI.r=5        |            |               |      |
   +----+------+---------------+------------+---------------+------+

   Table 3: Codepoint signal example

3.1.4.2. Discussion

   ACK loss

   As this scheme sends each codepoint (of the two subsets) at least two
   times, at least one, and up to two consecutive ACKs can be lost.
   Further refinements, such as interleaving ACKs when sending
   codepoints belonging to the two subsets (e.g. CI, E1), can allow the
   loss of any two consecutive ACKs without the sender losing congestion
   information, at the cost of also reducing the ACK ratio.

   At low congestion rates, sending the current value of the CI counter
   by default allows higher numbers of consecutive ACKs to be lost
   without impacting the accuracy of the ECN signal.

   ECN Nonce

   By comparing the number of incoming ECT(1) notifications with the
   actual number of packets that were transmitted with an ECT(1) mark,
   as well as the sum of the sender's two internal counters, the sender
   can probabilistically detect a receiver that sends false marks or
   suppresses accurate ECN feedback, or a path that doesn't properly
   support ECN.

   This approach maintains a balanced selection of properties found in
   ECN Nonce, Section 3.1.3, and Section 3.1.2. A delayed ACK ratio of
   two can be sustained indefinitely even during heavy congestion, but
   not during excessive ECT(1) marking, which is under the control of
   the sender. Higher ACK ratios can be sustained when congestion is
   low, but ACKs are still needed for the E1 feedback.

3.1.5. Short Summary of the Discussions

   With the exception of the signaling scheme described in
   Section 3.1.2, all signaling may fail to work if middleboxes
   intervene and check the semantics of [RFC3168] signals.

   The scheme described in Section 3.1.4 is the most complex to
   implement, especially on a receiver, with much additional state to be
   kept there, compared to the other signaling schemes.
With the
718 advances in compute power, many more cycles are available to process
719 TCP than ever before.

721 Table 4 gives an overview of the relative implications of the
722 different proposed signaling schemes. Further discussion should be
723 included here in the next version of this document.

725 +-------------+--------+--------+-----------+----------+------------+
726 | Section     | Resi-  | Timely | Integrity | Accuracy | Complexity |
727 |             | liency |        |           |          |            |
728 +-------------+--------+--------+-----------+----------+------------+
729 | 1-bit-flag  | -      | +      | +         | -        | +          |
730 | 3-bit-field | ++     | ++     | --        | ++       | -          |
731 | Codepoints  | +      | +      | +         | ++       | --         |
732 +-------------+--------+--------+-----------+----------+------------+
733 Table 4: Overview of accurate feedback schemes

735 Whereas the first scheme is the simplest one (and also provides
736 byte-wise feedback which might be preferable), it has a drawback with
737 respect to reliability. The second one is the most reliable but does
738 not provide an integrity mechanism.

740 3.2. TCP Sender

742 This section will specify the sender-side action describing how to
743 extract the accurate number of congestion markings from the given
744 receiver feedback signal.

746 When the accurate ECN feedback scheme is supported by the receiver,
747 the receiver will maintain an echo congestion counter (ECC). The ECC
748 will hold the number of CE marks received. A sender that
749 understands the accurate ECN feedback will be able to reconstruct
750 this ECC value on the sender side by maintaining a counter ECC.r.

752 On the arrival of every ACK, the sender calculates the difference D
753 between the local ECC.r counter, and the signaled value of the
754 receiver side ECC counter. The value of ECC.r is increased by D, and
755 D is assumed to be the number of CE marked packets that arrived at
756 the receiver since it sent the previously received ACK.

758 3.3. TCP Receiver

760 This section will describe the receiver-side action to signal the
761 accurate ECN feedback back to the sender. In any case the receiver
762 will need to maintain a counter of how many CE markings have been seen
763 during a connection. Depending on the chosen coding scheme there
764 will be different actions to set the corresponding bits in the TCP
765 header. In all cases it might be helpful if the receiver is able to
766 switch from delayed ACK behavior to sending ACKs immediately after
767 the data packet reception in a high congestion situation.

769 3.4. Advanced Compatibility Mode

771 This section describes a possibility to achieve more accurate feedback
772 even when the receiver is not capable of the new accurate ECN
773 feedback scheme, with the drawback of less reliability.

775 During initial deployment, a large number of receivers will only
776 support [RFC3168] classic ECN feedback. Such a receiver will set the
777 ECE bit whenever it receives a segment with the CE codepoint set, and
778 clear the ECE bit only when it receives a segment with the CWR bit
779 set. As the CE codepoint has priority over the CWR bit (Note: the
780 wording in this regard is ambiguous in [RFC3168], but the reference
781 implementation of ECN in ns2 is clear), an [RFC3168] compliant
782 receiver will not clear the ECE bit on the reception of a segment
783 where both CE and CWR are set simultaneously. This property allows
784 the use of a compatibility mode, to extract more accurate feedback
785 from legacy [RFC3168] receivers by setting the CWR permanently.

787 Assuming a delayed ACK ratio of one, a sender can permanently set
788 the CWR bit in the TCP header, to receive a more accurate feedback of
789 the CE codepoints as seen at the receiver. This feedback signal is
790 however very brittle and any ACK loss may cause congestion
791 information to become lost. Neither delayed ACKs nor ACK loss can
792 be accounted for in a reliable way, however.
Therefore, a sender
793 would need to use heuristics to determine the current delayed ACK ratio
794 m used by the receiver (e.g. most receivers will use m=2), and also
795 the recent ACK loss ratio (l). Acknowledge Congestion Control
796 (AckCC) as defined in [RFC5690] cannot be used, as deployment of
797 this feature is only experimental.

799 Using a phase locked loop algorithm, the CWR bit can then be set only
800 on those data segments that will trigger a (delayed) ACK. Thereby,
801 no congestion information is lost, as long as the ACK carrying the
802 ECE bit is seen by the sender.

804 Whenever the sender sees an ACK with ECE set, this indicates that at
805 least one, and at most m / (m - l) data segments with the CE
806 codepoint set were seen by the receiver. The sender SHOULD react
807 as if m CE indications were reflected back to the sender by the
808 receiver, unless additional heuristics (e.g. dead time correction)
809 can determine a more accurate value of the "true" number of received
810 CE marks.

812 4. Acknowledgements

814 We want to thank Michael Welzl and Bob Briscoe for their input and
815 discussion.

817 5. IANA Considerations

819 This memo includes no request to IANA.

821 6. Security Considerations

823 For coding schemes that increase robustness for the ECN feedback,
824 similar considerations as in [RFC3540] apply for the selection of when
825 to send an ECT(1) codepoint.

827 7. References

829 7.1. Normative References

831 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
832           Requirement Levels", BCP 14, RFC 2119, March 1997.

834 [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
835           of Explicit Congestion Notification (ECN) to IP",
836           RFC 3168, September 2001.

838 [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
839           Congestion Notification (ECN) Signaling with Nonces",
840           RFC 3540, June 2003.

842 7.2. Informative References

844 [Ali10]   Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
845           P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP:
846           Efficient Packet Transport for the Commoditized Data
847           Center", Jan 2010.

849 [I-D.briscoe-tsvwg-re-ecn-tcp]
850           Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
851           "Re-ECN: Adding Accountability for Causing Congestion to
852           TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-09 (work in
853           progress), October 2010.

855 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K.
856           Ramakrishnan, "Adding Explicit Congestion Notification
857           (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
858           June 2009.

860 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
861           Control", RFC 5681, September 2009.

863 [RFC5690] Floyd, S., Arcia, A., Ros, D., and J. Iyengar, "Adding
864           Acknowledgement Congestion Control to TCP", RFC 5690,
865           February 2010.

867 Appendix A. Pseudo Code for the Codepoint Coding

869 Receiver:

871 Input signals: CE, ECT(1)
872 TCP Fields: ECI (3-bit field from CWR and ECE). CI.cm and E1.cm map
873 into these 8 codepoints (i.e. 5 and 3 codepoints)
874 These counters get tracked by the following variables:

876 CI.c (congestion indication - counter, modulo a multiple of the
877      available codepoints to represent CI.c in the ECI field.
878      Range[0..n*CI.cp-1])
879 CI.g (congestion indication - gauge, [0.."inf"])
880 CI.i (congestion indication - iteration, [0,1])
881 These are to track CE indications.

883 E1.c, E1.g and E1.i (doing the same, but for ECT(1) signals).
885 Constants:
886 CI.cp (number of codepoints available to signal)
887 CI.cm[] (codepoint mapping for CI)
888 E1.cp (number of codepoints available for E1 signal)
889 E1.cm[0..(E1.cp-1)] (codepoint mappings for E1)

891 At session initialization, all these counters are set to 0;

893 When a Segment (Data, ACK) is received,
894 perform the following steps:

896 If a CE codepoint is received,
897   Increase CI.g by 1
898 If an ECT(1) codepoint is received,
899   Increase E1.g by 1
900 If (CI.g > 5)   # When ACK rate is not sufficient to keep
901 or (E1.g > 3)   # gauge close to zero, increase ACK rate
902                 # works independent of delACK number (i.e. AckCC)
903   Cancel pending delayed ACK (ACK this segment immediately)
904   # this increases the ACK rate to a maximum of 1.5 data segments
905   # per ACK, with delACK=2,
906   # and CE mark rate exceeds 75% for a number
907   # of at least 18 segments.
908   # 5 codepoints would allow delack=2 indefinitely btw

910 When preparing an ACK to be sent:

912 If (CI.g > 0) or
913    ((E1.i != 0) and (CI.i != 0))  # E1.g = 0 is to skip this
914                                   # if only the 2nd CI.c ACK
915   # has to be sent - effectively alternating CI.c and E1.c on ACKs
916   # should give slightly better resiliency against ack losses
917   If CI.i == 0                    # updates to CI.c allowed
918      and CI.g > 0                 # update is meaningful
919     CI.i = 1                      # may be larger
920                                   # if more resiliency is reqd
921     CI.c += min(CI.cp-1,CI.g)     # CI.cp-1 is 3 for 4 codepoints,
922                                   # 4 for 5 etc
923     CI.c = CI.c modulo CI.cp*CI.cp  # using modulo the square of
924                                   # available codepoints,
925                                   # for convenience (debugging)
926     CI.g -= min(CI.cp-1,CI.g)     #
927   Else
928     CI.i--                        # just in case CI.i was set to
929                                   # more than 1 for resiliency
930   Send next ACK with ECI = CI.cm[CI.c modulo CI.cp]
931 Else
932   If (E1.g > 0) or (E1.i != 0)

934     If (E1.i == 0) and (E1.g > 0)
935       E1.i = 1
936       E1.c += min(E1.cp-1,E1.g)
937       E1.c = E1.c modulo E1.cp*E1.cp
938       E1.g -= min(E1.cp-1,E1.g)
939     Else
940       E1.i--
941     Send next ACK with ECI = E1.cm[E1.c modulo E1.cp]
942   Else
943     Send next ACK with ECI = CI.cm[CI.c modulo CI.cp] # default action

945 Sender:

947 Counters:

949 CI.r - current value of CEs seen by receiver
950 E1.s - sum of all sent ECT(1) marked packets (up to snd.nxt)
951 E1.s(t) - value of E1.s at time (in sequence space) t
952 E1.r - value signaled by receiver about received ECT(1) segments
953 E1.r(t) - value of E1.r at time (in sequence space) t
954 CI.r(t) - ditto
955 # Note: With a codepoint-implementation,
956 # a reverse table ECI[n] -> CI.r / E1.r is needed.
957 # This example is simplified with 4/4 codepoints
958 # instead of 5/3

960 If ACK with NS=0
961   CI.r += (ECI + 4 - (CI.r mod CI.cp)) mod CI.cp
962   # The wire protocol transports the absolute value
963   # of the receiver-side counter.
964   # Thus the (positive only) delta needs to be calculated,
965   # and added to the sender-side counter.
966 If ACK with NS=1
967   E1.r += (ECI + 4 - (E1.r mod E1.cp)) mod E1.cp

969 # Before CI.r or E1.r reach a (binary) rollover,
970 # they need to roll over some multiple of CI.cp
971 # and E1.cp respectively.

973 CI.r = CI.r modulo CI.cp * n_CI
974 E1.r = E1.r modulo E1.cp * n_E1

976 # (an implementation may choose to use a single constant,
977 # i.e. 3^4*5^4 for 16-bit integers,
978 # or 3^8*5^8 for 32-bit integers)

980 # The following test can (probabilistically) reveal
981 # if the receiver or path is not properly
982 # handling ECN (CE, E1) marks

984 If not E1.r(t) <= E1.s(t) <= E1.r(t) + CI.r(t)
985   # -> receiver lies (or too many ACKs got lost,
986   # which can be checked too by the sender).

988 Authors' Addresses

990 Mirja Kuehlewind (editor)
991 University of Stuttgart
992 Pfaffenwaldring 47
993 Stuttgart 70569
994 Germany

996 Email: mirja.kuehlewind@ikr.uni-stuttgart.de
997 Richard Scheffenegger
998 NetApp, Inc.
999 Am Euro Platz 2
1000 Vienna, 1120
1001 Austria

1003 Phone: +43 1 3676811 3146
1004 Email: rs@netapp.com