Congestion Exposure (ConEx)                           M. Kuehlewind, Ed.
Internet-Draft                                   University of Stuttgart
Intended status: Experimental                           R. Scheffenegger
Expires: January 5, 2012                                    NetApp, Inc.
                                                            July 4, 2011


                      Accurate ECN Feedback in TCP
                 draft-kuehlewind-conex-accurate-ecn-00

Abstract

   Explicit Congestion Notification (ECN) is an IP/TCP mechanism where
   network nodes can mark IP packets instead of dropping them to
   indicate congestion to the end-points.  An ECN-capable receiver will
   feed this information back to the sender.  ECN is specified for TCP
   in such a way that only one feedback signal can be transmitted per
   Round-Trip Time (RTT).  Recently, new TCP mechanisms such as ConEx
   or DCTCP need more accurate feedback information in the case where
   more than one marking is received in one RTT.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 5, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Overview ECN and ECN Nonce in TCP
     1.2.  Design choices
     1.3.  Requirements Language
   2.  Negotiation in TCP handshake
   3.  Accurate Feedback
     3.1.  Coding
       3.1.1.  Requirements
       3.1.2.  One bit feedback flag
         3.1.2.1.  Discussion
       3.1.3.  Three bit field with counter feedback
         3.1.3.1.  Discussion
       3.1.4.  Codepoints with dual counter feedback
         3.1.4.1.  Implementation
         3.1.4.2.  Discussion
       3.1.5.  Short Summary of the Discussions
     3.2.  TCP Sender
     3.3.  TCP Receiver
     3.4.  Advanced Compatibility Mode
   4.  Acknowledgements
   5.  IANA Considerations
   6.  Security Considerations
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Pseudo Code for the Codepoint Coding
   Authors' Addresses

1.  Introduction

   Explicit Congestion Notification (ECN) [RFC3168] is an IP/TCP
   mechanism where network nodes can mark IP packets instead of dropping
   them to indicate congestion to the end-points.  An ECN-capable
   receiver will feed this information back to the sender.  ECN is
   specified for TCP in such a way that only one feedback signal can be
   transmitted per Round-Trip Time (RTT).  Recently proposed mechanisms
   like Congestion Exposure (ConEx) or DCTCP [Ali10] need more accurate
   feedback information in the case where more than one marking is
   received in one RTT.

   This document discusses (and will in a later version specify) a
   different scheme for the ECN feedback in the TCP header to provide
   more than one feedback signal per RTT.  This modification does not
   obsolete [RFC3168].  It provides an extension that requires
   additional negotiation in the TCP handshake by using the TCP nonce
   sum (NS) bit, which is currently not used when SYN is set.

   In the current version of this document, different coding schemes
   are proposed for discussion.  All proposed codings aim to cope with
   the given bit space.  All schemes require the use of the NS bit at
   least in the TCP handshake.  Depending on the coding scheme, the
   accurate ECN feedback extension will or will not include the ECN-
   Nonce integrity mechanism.  A later version of this document will
   choose between the coding options, and remove the rationale for the
   choice and the specs of those schemes not chosen.  If a scheme is
   chosen that does not include ECN Nonce, a mechanism that requires
   more accurate ECN feedback needs to provide its own method to ensure
   the integrity of the congestion feedback information, or has to cope
   with the uncertainty of this information.

   The following scenarios briefly show where accurate feedback is
   needed or provides additional value:

   a.  A standard TCP sender with the [RFC5681] congestion control
       algorithm that supports ConEx:
       In this case the congestion control algorithm still ignores
       multiple marks per RTT, while the ConEx mechanism uses the extra
       information per RTT to re-echo more precise congestion
       information.

   b.  A sender using DCTCP without ConEx:
       The congestion control algorithm uses the extra information per
       RTT to perform its decrease depending on the number of congestion
       marks.

   c.  A sender that uses DCTCP congestion control and supports ConEx:
       Both the congestion control algorithm and ConEx use the accurate
       ECN feedback mechanism.

   d.  A standard TCP sender using the [RFC5681] congestion control
       algorithm without ConEx:
       No accurate feedback is necessary here.  The congestion control
       algorithm still reacts to only one signal per RTT.  But it is
       best to have one generic feedback mechanism, whether it is used
       or not.

1.1.  Overview ECN and ECN Nonce in TCP

   ECN requires two bits in the IP header.  The ECN capability of a
   packet is indicated when either one of the two bits is set.  An ECN
   sender can set one or the other bit to indicate an ECN-capable
   transport (ECT), which results in two signals, ECT(0) and ECT(1)
   respectively.  A network node can set both bits simultaneously when
   it experiences congestion.  When both bits are set, the packet is
   regarded as "Congestion Experienced" (CE).

   In the TCP header two bits in byte 14 are defined for the use of ECN.
   The TCP mechanism for signaling the reception of a congestion mark
   uses the ECN-Echo (ECE) flag in the TCP header.  To enable the TCP
   receiver to determine when to stop setting the ECN-Echo flag, the CWR
   flag is set by the sender upon reception of the feedback signal.
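
   As an illustration of the IP-layer codepoints described in this
   section, the following minimal Python sketch extracts and sets the
   ECN field.  The bit values follow the assignment in [RFC3168]; the
   function names are hypothetical and are not defined by this
   document.

```python
# ECN field values carried in the two low-order bits of the IPv4 TOS /
# IPv6 Traffic Class octet (assignment taken from RFC 3168).
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10
CE = 0b11


def ecn_codepoint(tos_octet):
    """Return the 2-bit ECN field of a TOS / Traffic Class octet."""
    return tos_octet & 0b11


def is_ecn_capable(tos_octet):
    """True if either ECN bit is set (ECT(0), ECT(1) or CE)."""
    return ecn_codepoint(tos_octet) != NOT_ECT


def mark_congestion(tos_octet):
    """Hypothetical helper: what a congested node does to a packet.

    Both ECN bits are set, but only if the packet is ECN-capable.
    A Not-ECT packet is returned unchanged (it would be dropped).
    """
    if is_ecn_capable(tos_octet):
        return tos_octet | CE
    return tos_octet
```

   Both ECT(0) and ECT(1) packets end up with the same CE codepoint
   after marking, which is why the receiver cannot tell from a CE
   packet which ECT value the sender originally chose.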

   ECN-Nonce [RFC3540] is an optional addition to ECN that is used to
   protect the TCP sender against accidental or malicious concealment
   of marked or dropped packets.  This addition defines the last bit of
   byte 13 in the TCP header as the Nonce Sum (NS) bit.  With ECN-
   Nonce, a nonce sum is maintained that counts the occurrence of
   ECT(1) packets.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           | N | C | E | U | A | P | R | S | F |
   | Header Length | Reserved  | S | W | C | R | C | S | S | Y | I |
   |               |           |   | R | E | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 1: The (post-ECN Nonce) definition of the TCP header flags

1.2.  Design choices

   The idea of this document is to use the ECE, CWR and NS bits for
   additional capability negotiation during the SYN/SYN-ACK exchange,
   and then for the more accurate feedback itself on subsequent packets
   in the flow (with SYN=0).

   Alternatively, a new TCP option could be introduced to help maintain
   the accuracy and integrity of the ECN feedback between receiver and
   sender.  Such an option could provide more information.  For
   example, ECN for RTP/UDP explicitly provides the number of ECT(0),
   ECT(1), CE, non-ECT marked and lost packets.  However, deploying new
   TCP options has its own challenges.

   As seen in Figure 1, there are currently three unused flag bits in
   the TCP header.  Any of the schemes described below could be
   extended by one or more bits to add higher resiliency against ACK
   loss.  The relative gains would be proportional to each of the
   described schemes, while the respective drawbacks would remain
   identical.
   Thus the approach in this document is to cope with the given number
   of bits, as they seem to be already sufficient and the accurate ECN
   feedback scheme will only be used instead of the classic ECN and
   never in parallel.

1.3.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

   We use the following terminology from [RFC3168] and [RFC3540]:

   The ECN field in the IP header:

      CE: the Congestion Experienced codepoint; and

      ECT(0)/ECT(1): either one of the two ECN-Capable Transport
      codepoints.

   The ECN flags in the TCP header:

      CWR: the Congestion Window Reduced flag;

      ECE: the ECN-Echo flag; and

      NS: ECN Nonce Sum.

   In this document, we will call the ECN feedback scheme as specified
   in [RFC3168] the 'classic ECN' and our new proposal the 'accurate ECN
   feedback' scheme.  A 'congestion mark' is defined as an IP packet
   where the CE codepoint is set.

2.  Negotiation in TCP handshake

   During the TCP handshake at the start of a connection, an originator
   of the connection (host A) MUST indicate a request to get more
   accurate ECN feedback by setting the TCP flags NS=1, CWR=1 and ECE=1
   in the initial SYN.

   A responding host (host B) MUST return a SYN ACK with flags CWR=1 and
   ECE=0.  The responding host MUST NOT set this combination of flags
   unless the preceding SYN has already requested support for accurate
   ECN feedback as above.  Normally a server (B) will reply to a client
   with NS=0, but if the initial SYN from client A is marked CE, the
   server B can set the NS flag to 1 to indicate the congestion
   immediately, instead of delaying the signal to the first
   acknowledgment when the actual data transmission has already started.
   So, server B MAY set the alternative TCP header flags in its SYN
   ACK: NS=1, CWR=1 and ECE=0.

   The addition of ECN to TCP SYN/ACK packets is discussed and specified
   as experimental in [RFC5562].  The addition of ECN to the SYN packet
   is optional.  The security implications of using this option are not
   discussed further here.

   These handshakes are summarized in Table 1 below, with X indicating
   that NS can be either 0 or 1 depending on whether congestion has
   been experienced.  The handshakes used for the other flavors of ECN
   are also shown for comparison.  To compress the width of the table,
   the headings of the first four columns have been severely
   abbreviated, as follows:

   Ac:  *Ac*curate ECN Feedback

   N:  ECN-*N*once (RFC3540)

   E:  *E*CN (RFC3168)

   I:  Not-ECN (*I*mplicit congestion notification).

   +----+---+---+---+------------+----------------+------------------+
   | Ac | N | E | I | [SYN] A->B | [SYN,ACK] B->A | Mode             |
   +----+---+---+---+------------+----------------+------------------+
   |    |   |   |   | NS CWR ECE | NS CWR ECE     |                  |
   | AB |   |   |   | 1   1   1  | X   1   0      | accurate ECN     |
   | A  | B |   |   | 1   1   1  | 1   0   1      | ECN Nonce        |
   | A  |   | B |   | 1   1   1  | 0   0   1      | classic ECN      |
   | A  |   |   | B | 1   1   1  | 0   0   0      | Not ECN          |
   | A  |   |   | B | 1   1   1  | 1   1   1      | Not ECN (broken) |
   +----+---+---+---+------------+----------------+------------------+

       Table 1: ECN capability negotiation between Sender (A) and
                              Receiver (B)

   Recall that, if the SYN ACK reflects the same flag settings as the
   preceding SYN (because there is a broken RFC3168 compliant
   implementation that behaves this way), RFC3168 specifies that the
   whole connection MUST revert to Not-ECT.

3.  Accurate Feedback

   In this section we refer to the sender as the one sending data and
   the receiver as the one that acknowledges this data.  Of course,
   such a scenario describes only one half-connection of a TCP
   connection.
   The proposed scheme, if negotiated, will be used for both
   half-connections, as both sender and receiver need to be capable of
   echoing and understanding the accurate ECN feedback scheme.

3.1.  Coding

   This section proposes three different coding schemes for discussion.
   First, requirements are listed that allow the proposed schemes to be
   evaluated against each other.  A later version of this document will
   choose between the coding options, and remove the rationale for the
   choice and the specs of those schemes not chosen.  The next section
   provides basically a fourth alternative to allow a compatibility
   mode when a sender needs accurate feedback but has to operate with a
   legacy [RFC3168] receiver.

3.1.1.  Requirements

   The requirements of the accurate ECN feedback protocol for use by,
   e.g., ConEx or DCTCP are to have fairly accurate (not necessarily
   perfect), timely and protected signaling.  This leads to the
   following requirements:

   Resilience
      The ECN feedback signal is implicitly carried within the TCP
      acknowledgment.  TCP ACKs can get lost.  Moreover, delayed ACKs
      are usually used with TCP.  That means in most cases only every
      second data packet gets acknowledged.  In a high congestion
      situation where most of the packets are marked with CE, an
      accurate feedback mechanism must still be able to signal
      sufficient congestion information.  Thus the accurate ECN feedback
      extension has to take delayed ACKs and ACK loss into account.

   Timely
      The CE marking is induced by a network node on the transmission
      path and echoed by the receiver in the TCP acknowledgment.  Thus
      when this information arrives at the sender, it is naturally
      already about one RTT old.  With a sufficient ACK rate, a further
      delay of a small number of ACKs can be tolerated, but with large
      delays this information will be outdated due to the high dynamics
      in the network.
      TCP congestion control, which introduces part of this dynamic,
      operates on a time scale of one RTT.  Thus the congestion
      feedback information should be delivered timely (within one RTT).

   Integrity
      With ECN Nonce, a misbehaving receiver can be detected with a
      certain probability.  As this accurate ECN feedback might reuse
      the NS bit, it is encouraged to ensure integrity at least as good
      as ECN Nonce.  If this is not possible, alternative approaches
      should be provided showing how a mechanism using the accurate ECN
      feedback extension can ensure integrity or give strong incentives
      for the receiver and network node to cooperate honestly.

   Accuracy
      Classic ECN feeds back one congestion notification per RTT, as
      this is supposed to be used for TCP congestion control, which
      reduces the sending rate at most once per RTT.  The accurate ECN
      feedback scheme has to ensure that, if a congestion event occurs,
      at least one congestion notification is echoed and received per
      RTT, as classic ECN would do.  Of course, the goal of this
      extension is to reconstruct the number of CE markings more
      accurately.  However, a sender should not assume that it gets the
      exact number of congestion markings in a high congestion
      situation.

   Complexity
      Of course, the more accurate ECN feedback can also be used even
      if only one ECN feedback signal per RTT is needed.  To enable
      this proposal for a more accurate ECN feedback as the standard
      ECN feedback mechanism, the implementation should be as simple as
      possible and a minimum of additional state information should be
      needed.

3.1.2.  One bit feedback flag

   This option uses a one bit flag, namely the ECE bit, to signal more
   accurate ECN feedback.  Unlike classic ECN feedback, an accurate ECN
   feedback receiver MUST set the ECE bit in N subsequent ACK packets
   (only).
   An accurate ECN feedback receiver MUST NOT wait for a CWR bit from
   the sender to reset the ECE bit.  N is not defined yet but is
   intended to be 2.

   Moreover, when a congestion situation occurs or stops, the receiver
   MUST immediately acknowledge the data packet and MUST NOT delay the
   acknowledgment until a further data packet has arrived.  A
   congestion situation occurs when the previous data packet was CE=0
   but the current one is CE=1.  A congestion situation stops when the
   previous data packet was CE=1 and the current one is CE=0.

   The following figure shows a simple state machine to describe the
   receiver behavior for N=1.

                           Send immediate
                           ACK with ECE=0
                 .---.     .------------.     .---.
    Send 1 ACK  /     v    v            |    |     \
     for every |     .------.          .------.     | Send 1 ACK
     m packets |     | CE=0 |          | CE=1 |     | for every
    with ECE=0 |     '------'          '------'     | m packets
                \     |    |            ^    ^     / with ECE=1
                 '---'     '------------'     '---'
                           Send immediate
                           ACK with ECE=1

           Figure 2: Two state ACK generation state machine

3.1.2.1.  Discussion

   ACK loss

   The simplest way to get more accurate ECN feedback, which allows
   more than one signal per RTT, is to set the ECE flag only once when a
   congestion mark occurs, instead of setting the ECE flag in every
   packet until a CWR flag is received.  This solution still only
   allows one signal per acknowledgment, which might not be sufficient
   when more than one packet is acknowledged at once (delayed ACKs).
   Even more importantly, this information can get lost with the loss
   of only one ACK packet carrying this information.  One solution
   would be to carry the same information in a defined number of
   subsequent ACK packets.  This would again reduce the number of
   feedback signals that can be transmitted in one RTT but improve the
   integrity.  More sophisticated solutions based on ACK loss detection
   might be possible as well.
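
   The receiver behavior of Figure 2 can be sketched in a few lines of
   Python.  This is only an illustration for N=1 under simplifying
   assumptions: the class and method names are hypothetical, ACKs are
   modeled as nothing more than the ECE value they carry, and the
   delayed-ACK timer is ignored.

```python
class OneBitReceiver:
    """Sketch of the two-state ACK generation of Figure 2 (N=1)."""

    def __init__(self, m=2):
        self.m = m             # delayed-ACK ratio: one ACK per m packets
        self.ce_state = False  # CE value of the previous data packet
        self.pending = 0       # data packets received since the last ACK

    def on_data(self, ce):
        """Process one data packet.

        Returns the ECE value of the ACK that is sent now, or None if
        the acknowledgment is delayed.
        """
        if ce != self.ce_state:
            # Congestion situation starts or stops: ACK immediately.
            self.ce_state = ce
            self.pending = 0
            return ce
        # No state change: send one ACK for every m packets.
        self.pending += 1
        if self.pending >= self.m:
            self.pending = 0
            return self.ce_state
        return None
```

   Feeding the CE sequence 0, 0, 1, 1, 1, 0 into a receiver with m=2
   yields delayed ACKs while the state is stable and an immediate ACK
   on each of the two transitions.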

   Note that the semantics of classic ECN are changed, and the CWR flag
   is no longer interpreted by the receiver to reset the ECE flag.  A
   simple extension of this scheme could make use of the CWR flag.  For
   example, the receiver could always repeat the value of the ECE flag
   of the predecessor ACK in the CWR flag.  However, only a single lost
   ACK can be addressed that way.  Two consecutive lost ACKs may still
   result in a loss of ECN information to the sender.

   In low congestion situations (less than one CE mark per RTT on
   average), the loss of m subsequent ACKs would result in complete loss
   of the congestion information.  The opposite would be true during
   high congestion, where the sender can incorrectly assume that all
   segments were received with the CE codepoint.

   With DCTCP [Ali10] it was proposed to acknowledge a data packet
   directly without delay when a congestion situation occurs, as already
   described above.  This scheme allows a more accurate feedback signal
   in a high congestion/marking situation.  However, using delayed ACKs
   is important for a variety of reasons, including reducing the load on
   the data sender.

   As this heuristic triggers immediate ACKs whenever the received CE
   bit toggles, arbitrarily large ACK ratios are supported.  However,
   the effective ACK ratio depends on the congestion state of the
   network.  Thus it may collapse to 1 (one ACK for each data segment)
   when every other segment is received with CE set.

   ECN Nonce

   As the ECN Nonce bit is not used otherwise, ECN Nonce [RFC3540] can
   be used in a complementary way.  Network paths that do not support
   ECN, as well as misbehaving or malicious receivers that withhold ECN
   information, can therefore be detected.

3.1.3.  Three bit field with counter feedback

   The receiver maintains an unsigned integer counter which we call ECC
   (echo congestion counter).  This counter maintains a count of how
   many times a CE marked packet has arrived during the half-connection.
   Once a TCP connection is established, the three TCP option flags
   (ECE, CWR and NS) are used as a 3-bit field for the receiver to
   permanently signal to the sender the current value of ECC, modulo 8,
   whenever it sends a TCP ACK.  We will call these three bits the echo
   congestion increment (ECI) field.

   This overloaded use of these 3 option flags as one 3-bit ECI field is
   shown in Figure 3.  The actual definition of the TCP header,
   including the addition of support for the ECN Nonce, is shown for
   comparison in Figure 1.  This specification does not redefine the
   names of these three TCP option flags; it merely overloads them with
   another definition once a flow with accurate ECN feedback is
   established.

     0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
   |               |           |           | U | A | P | R | S | F |
   | Header Length | Reserved  |    ECI    | R | C | S | S | Y | I |
   |               |           |           | G | K | H | T | N | N |
   +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

   Figure 3: Definition of the ECI field within bytes 13 and 14 of the
                       TCP Header (when SYN=0).

   Also note that, whenever the SYN flag of a TCP segment is set
   (including when the ACK flag is also set), the NS, CWR and ECE flags
   (i.e. the ECI field of the SYNACK) MUST NOT be interpreted as the
   3-bit ECI value, which is only set as a copy of the local ECC value
   in non-SYN packets.

   This scheme was first proposed in [I-D.briscoe-tsvwg-re-ecn-tcp] for
   use with re-ECN.

3.1.3.1.  Discussion

   ACK loss

   As pure ACKs are not protected by TCP reliable delivery, we repeat
   the same ECI value in every ACK until it changes.
   Even if many ACKs in a row are lost, as soon as one gets through,
   the ECI field it repeats from previous ACKs that didn't get through
   will update the sender on how many CE marks arrived since the last
   ACK got through.

   The sender will only lose a record of the arrival of a CE mark if all
   the ACKs are lost (and all of them were pure ACKs) for a stream of
   data long enough to contain 8 or more CE marks.  So, if the marking
   fraction was p, at least 8/p pure ACKs would have to be lost.  For
   example, if p was 5%, a sequence of 160 pure ACKs (without delayed
   ACKs) would all have to be lost.  When ACKs are delayed, this number
   is divided by the ACK ratio m.  This would still require a sequence
   of 80 lost pure ACKs with the usual delay rate of m=2.

   Additionally, to protect against such extremely unlikely events, if a
   re-ECN sender detects that a sequence of pure ACKs has been lost, it
   can assume the ECI field wrapped as many times as possible within the
   sequence.  E.g., if a re-ECN sender receives an ACK with an
   acknowledgement number that acknowledges L (>m) segments since the
   previous ACK but with a sequence number unchanged from the previously
   received ACK, it can conservatively assume that the ECI field
   incremented by D' = L - ((L-D) mod 8), where D is the apparent
   increase in the ECI field.  For example, if the ACK arriving after 9
   pure ACK losses apparently increased ECI by 2, the assumed increment
   of ECI would still be 2.  But if ECI apparently increased by 2 after
   11 pure ACK losses, ECI should be assumed to have increased by 10.

   ECN Nonce

   ECN Nonce cannot be used in parallel with this scheme.  However,
   mechanisms that make use of this new scheme might provide stronger
   incentives to declare congestion honestly when needed.  E.g., with
   ConEx each congestion notification suppressed by the receiver should
   lead the ConEx audit function to discard an equivalent number of
   bytes, such that the receiver does not gain from suppressing
   feedback.  This mechanism would even provide a stronger integrity
   mechanism than ECN-Nonce does.  Without an external framework to
   discourage the withholding of ECN information, this scheme is
   vulnerable to the problems described in [RFC3540].

3.1.4.  Codepoints with dual counter feedback

   In line with the definition in Figure 3 of the previous section, the
   ECE, CWR and NS bits are used as one field, but here they encode 8
   codepoints instead of a single counter.  These 8 codepoints, as
   shown below, encode either a "congestion indication" (CI) counter or
   an ECT(1) counter (E1).  These counters maintain the number of CE
   marks or the number of ECT(1) signals observed at the receiver,
   respectively.

   +-----+----+-----+-----+------------+------------+
   | ECI | NS | CWR | ECE | CI (base5) | E1 (base3) |
   +-----+----+-----+-----+------------+------------+
   |  0  | 0  |  0  |  0  |     0      |     -      |
   |  1  | 0  |  0  |  1  |     1      |     -      |
   |  2  | 0  |  1  |  0  |     2      |     -      |
   |  3  | 0  |  1  |  1  |     3      |     -      |
   |  4  | 1  |  0  |  0  |     4      |     -      |
   |  5  | 1  |  0  |  1  |     -      |     0      |
   |  6  | 1  |  1  |  0  |     -      |     1      |
   |  7  | 1  |  1  |  1  |     -      |     2      |
   +-----+----+-----+-----+------------+------------+

        Table 2: Codepoint assignment for accurate ECN feedback

   By default an accurate ECN receiver MUST echo the CI counter (modulo
   5) with the respective codepoints.  Whenever a CE mark occurs and
   thus the value of the CI has changed, the receiver MUST echo the CI
   in the next ACK.  Moreover, the receiver MUST repeat the codepoint
   that provides the CI counter directly on the subsequent ACK.  Thus
   every value of CI will be transmitted at least twice.
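
   The codepoint assignment of Table 2 can be expressed as a small
   Python sketch.  The function names are hypothetical and only mirror
   the table; the sketch does not implement the repetition rules
   described above.

```python
CI_MODULUS = 5  # ECI 0..4 carry the CI counter modulo 5
E1_MODULUS = 3  # ECI 5..7 carry the E1 counter modulo 3


def encode_ci(ci_count):
    """ECI codepoint that conveys the CI counter (Table 2, rows 0-4)."""
    return ci_count % CI_MODULUS


def encode_e1(e1_count):
    """ECI codepoint that conveys the E1 counter (Table 2, rows 5-7)."""
    return CI_MODULUS + (e1_count % E1_MODULUS)


def eci_to_flags(eci):
    """Split a 3-bit ECI value into the (NS, CWR, ECE) header flags."""
    return (eci >> 2) & 1, (eci >> 1) & 1, eci & 1


def decode(eci):
    """Return ('CI', value) or ('E1', value) for a received codepoint."""
    if eci < CI_MODULUS:
        return "CI", eci
    return "E1", eci - CI_MODULUS
```

   For instance, an E1 counter of 4 is conveyed as ECI 6, that is
   NS=1, CWR=1 and ECE=0, and decodes to the E1 value 1.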

   If an ECT(1) mark is received and thus E1 increases, the receiver has
   to convey that updated information to the sender as soon as possible.
   Thus on the reception of an ECT(1) marked packet, the receiver MUST
   signal the current value of the E1 counter (modulo 3) in the next
   ACK, unless a CE mark was received that has not yet been echoed
   twice.  The receiver MUST also repeat every E1 value.  But this
   repetition does not need to be in the subsequent ACK, as the E1
   value will only be transmitted when no changes in the CI have
   occurred.  Each E1 value will be sent exactly twice.  The repetition
   of every signal will provide further resilience against lost ACKs.

   As only a limited number of E1 codepoints exist and the receiver
   might not acknowledge every single data packet immediately (delayed
   ACKs), a sender SHOULD NOT mark more than 1/m of the packets with
   ECT(1), where m is the ACK ratio (e.g. 50% when every second data
   packet triggers an ACK).  This constraint will avoid a permanent
   feedback of E1 only.

   This requirement may conflict with delayed ACK ratios larger than
   two, using the available number of codepoints.  A receiver MUST
   change the ACK'ing rate such that a sufficient rate of feedback
   signals can be sent.  Details on how the change in the ACK'ing rate
   should be implemented are given in the next subsection.

3.1.4.1.  Implementation

   The basic idea is for the receiver to count how many packets carry a
   congestion notification.  This could, in principle, be achieved by
   increasing a "congestion indication" counter (CI.c) for every
   incoming CE marked segment.  Since the space for communicating the
   information back to the sender in ACKs is limited, a "gauge" (CI.g)
   is increased instead of directly increasing this counter.

   When sending an ACK, the content of this gauge (capped by the maximum
   number that can be encoded in the ACK, e.g. 4 for CI, and 2 for E1)
   is copied to the actual counter, and CI.g is reduced by the value
   that was copied over and transmitted, unless CI.g was zero before.
   To avoid losing information, it is ensured that an ACK is sent at
   least after 5 incoming congestion marks (i.e. when CI.g exceeds 5).

   For resilience against lost ACKs, an indicator flag (CI.i) ensures
   that, whether another congestion indication arrives or not, a second
   ACK transmits the previous counter value again.

   The same counter / gauge method is used to count and feed back (using
   a different mapping) the number of incoming packets marked ECT(1)
   (called E1 in the algorithm).  As fewer codepoints are available for
   conveying the E1 counter value, an immediate ACK MUST be triggered
   whenever the gauge E1.g exceeds a threshold of 3.  The sender
   receives the receiver's counter values and compares them with the
   locally maintained counter.  Any increase of these counters is added
   to the sender's internal counters, yielding a precise number of CE-
   marked and ECT(1) marked packets.  Architecturally the counters never
   decrease during a TCP session.  However, any overflow must be modulo
   5 for CI, and modulo 3 for E1.

   The following table provides an example showing a half-connection
   with a TCP sender A and receiver B.  The sender maintains a counter
   CI.r to reconstruct the number of CE marks received at the receiver
   side.
625     +----+------+---------------+------------+---------------+------+
626     |    | Data | TCP A         | IP         | TCP B         | Data |
627     +----+------+---------------+------------+---------------+------+
628     |    |      | SEQ  ACK  CTL |            | SEQ  ACK  CTL |      |
629     | -- |      | ------------- | ---------- | ------------- |      |
630     | 1  |      | 0100      SYN | ---->      |               |      |
631     |    |      |    CWR,ECE,NS |            |               |      |
632     | 2  |      |               | <----      | 0300 0101 SYN |      |
633     |    |      |               |            |       ACK,CWR |      |
634     | 3  |      | 0101 0301 ACK | ECT0 -CE-> |               |      |
635     |    |      |               |            | CI.c=0 CI.g=1 |      |
636     | 4  | 100  | 0101 0301 ACK | ECT0 ----> |               |      |
637     |    |      |               |            | CI.c=1 CI.g=0 |      |
638     | 5  |      |               | <----      | 0301 0201 ACK |      |
639     |    |      |               |            | ECI=CI.1      |      |
640     |    |      | CI.r=1        |            |               |      |
641     | 6  | 100  | 0201 0301 ACK | ECT0 -CE-> |               |      |
642     |    |      |               |            | CI.c=1 CI.g=1 |      |
643     | 7  | 100  | 0301 0301 ACK | ECT0 -CE-> |               |      |
644     |    |      |               |            | CI.c=1 CI.g=2 |      |
645     | 8  |      |               | XX--       | 0301 0401 ACK |      |
646     |    |      |               |            | ECI=CI.1      |      |
647     |    |      | CI.r=1        |            |               |      |
648     | 9  | 100  | 0401 0301 ACK | ECT0 -CE-> |               |      |
649     |    |      |               |            | CI.c=1 CI.g=3 |      |
650     | 10 | 100  | 0501 0301 ACK | ECT0 -CE-> |               |      |
651     |    |      |               |            | CI.c=5 CI.g=0 |      |
652     | 11 |      |               | <----      | 0301 0601 ACK |      |
653     |    |      |               |            | ECI=CI.0      |      |
654     |    |      | CI.r=5        |            |               |      |
655     | 12 | 100  | 0601 0301 ACK | ECT0 -CE-> |               |      |
656     |    |      |               |            | CI.c=5 CI.g=1 |      |
657     | 13 | 100  | 0701 0301 ACK | ECT0 -CE-> |               |      |
658     |    |      |               |            | CI.c=5 CI.g=2 |      |
659     | 14 |      |               | <----      | 0301 0801 ACK |      |
660     |    |      |               |            | ECI=CI.0      |      |
661     |    |      | CI.r=5        |            |               |      |
662     +----+------+---------------+------------+---------------+------+

664                    Table 3: Codepoint signal example

666  3.1.4.2.  Discussion

668     ACK loss

670     As this scheme sends each codepoint (of the two subsets) at least
671     twice, at least one, and up to two, consecutive ACKs can be lost.
672     Further refinements, such as interleaving ACKs when sending
673     codepoints belonging to the two subsets (e.g. CI, E1), can allow the
674     loss of any two consecutive ACKs without the sender losing
675     congestion information, at the cost of also reducing the ACK ratio.
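
The counter / gauge mechanism of Section 3.1.4.1 and its behaviour under
ACK loss can be illustrated with a small simulation. The following is a
minimal Python sketch, not a normative part of the scheme. It assumes
five CI codepoints, tracks only the CI subset (E1, the immediate-ACK
threshold and the wrap of CI.c are left out), and mirrors the CI branch
of the pseudo code in Appendix A. The class, method and function names
are hypothetical and chosen for illustration only. The schedule replays
the CE marks and ACKs of Table 3, including the lost second ACK.

```python
CP = 5  # assumed number of CI codepoints (the 5/3 split of the draft)


class Receiver:
    """CI-only receiver state: counter c, gauge g, repeat indicator i."""

    def __init__(self):
        self.c = 0
        self.g = 0
        self.i = 0

    def on_ce(self):
        # Every CE-marked segment only increases the gauge.
        self.g += 1

    def make_ack(self):
        # Mirrors the CI branch of Appendix A with E1 left out.
        if self.g > 0:
            if self.i == 0:
                step = min(CP - 1, self.g)  # at most CP-1 new marks per ACK
                self.c += step
                self.g -= step
                self.i = 1                  # this value must be sent once more
            else:
                self.i -= 1                 # this ACK repeats the old value
        return self.c % CP                  # codepoint index carried in ECI


class Sender:
    def __init__(self):
        self.r = 0  # CI.r, the reconstructed number of CE marks

    def on_ack(self, eci):
        # Positive-only delta between the signalled value and the local count.
        self.r += (eci - self.r) % CP


def replay_table3():
    rx, tx = Receiver(), Sender()
    # (CE marks arriving before the ACK, whether the ACK reaches the sender)
    schedule = [(1, True), (2, False), (2, True), (2, True)]
    acks = []
    for marks, delivered in schedule:
        for _ in range(marks):
            rx.on_ce()
        eci = rx.make_ack()
        acks.append(eci)
        if delivered:
            tx.on_ack(eci)
    return acks, tx.r, rx.c, rx.g
```

Calling replay_table3() returns the ECI values of the four ACKs, the
sender's CI.r, and the receiver's CI.c and CI.g. With this schedule the
lost ACK costs the sender nothing, because it only repeated a value that
had already been delivered; the sender ends with CI.r equal to the
receiver's CI.c, while two further marks still wait in the gauge.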
677     At low congestion rates, sending the current value of the CI
678     counter by default allows a higher number of consecutive ACKs to be
679     lost without impacting the accuracy of the ECN signal.

681     ECN Nonce

683     By comparing the number of incoming ECT(1) notifications with the
684     actual number of packets that were transmitted with an ECT(1) mark as
685     well as the sum of the sender's two internal counters, the sender can
686     probabilistically detect a receiver that sends false marks or
687     suppresses accurate ECN feedback, or a path that doesn't properly
688     support ECN.

690     This approach maintains a balanced selection of properties found in
691     ECN Nonce, Section 3.1.3, and Section 3.1.2. A delayed ACK ratio of
692     two can be sustained indefinitely even during heavy congestion, but
693     not during excessive ECT(1) marking, which is under the control of
694     the sender. Higher ACK ratios can be sustained when
695     congestion is low, subject to the needs of the E1 feedback.

697  3.1.5.  Short Summary of the Discussions

699     With the exception of the signaling scheme described in
700     Section 3.1.2, all signaling schemes may fail to work if middleboxes
701     intervene and check the semantics of [RFC3168] signals.

703     The scheme described in Section 3.1.4 is the most complex to
704     implement, especially on the receiver, which has to keep much more
705     state than with the other signaling schemes. Given the
706     advances in compute power, however, many more cycles are available
707     to process TCP than ever before.

709     Table 4 gives an overview of the relative implications of the
710     different proposed signaling schemes. Further discussion should be
711     included here in the next version of this document.
713     +-------------+--------+--------+-----------+----------+------------+
714     | Scheme      | Resi-  | Timely | Integrity | Accuracy | Complexity |
715     |             | liency |        |           |          |            |
716     +-------------+--------+--------+-----------+----------+------------+
717     | 1-bit-flag  | -      | +      | +         | -        | +          |
718     | 3-bit-field | ++     | ++     | --        | ++       | -          |
719     | Codepoints  | +      | +      | +         | ++       | --         |
720     +-------------+--------+--------+-----------+----------+------------+
721              Table 4: Overview of accurate feedback schemes

723  3.2.  TCP Sender

725     This section will specify the sender-side action, describing how to
726     extract the accurate number of congestion markings from the
727     receiver's feedback signal.

729  3.3.  TCP Receiver

731     This section will describe the receiver-side action to signal the
732     accurate ECN feedback back to the sender. In any case, the receiver
733     will need to maintain a counter of how many CE markings have been seen
734     during a connection. Depending on the chosen coding scheme, there
735     will be different actions to set the corresponding bits in the TCP
736     header. In all cases, it might be helpful if the receiver is able to
737     switch from delayed ACK behavior to sending ACKs immediately after
738     data packet reception in a high-congestion situation.

740  3.4.  Advanced Compatibility Mode

742     This section describes a possibility to achieve more accurate feedback
743     even when the receiver is not capable of the new accurate ECN
744     feedback scheme, with the drawback of less reliability.

746     During initial deployment, a large number of receivers will only
747     support [RFC3168] classic ECN feedback. Such a receiver will set the
748     ECE bit whenever it receives a segment with the CE codepoint set, and
749     clear the ECE bit only when it receives a segment with the CWR bit
750     set.
        As the CE codepoint has priority over the CWR bit (Note: the
751     wording in this regard is ambiguous in [RFC3168], but the reference
752     implementation of ECN in ns2 is clear), an [RFC3168]-compliant
753     receiver will not clear the ECE bit on reception of a segment
754     in which both CE and CWR are set. This property allows
755     the use of a compatibility mode that extracts more accurate feedback
756     from legacy [RFC3168] receivers by setting the CWR bit permanently.

758     Assuming a delayed ACK ratio of one, a sender can permanently set
759     the CWR bit in the TCP header to receive more accurate feedback of
760     the CE codepoints as seen at the receiver. This feedback signal is,
761     however, very brittle, and any ACK loss may cause congestion
762     information to be lost. Neither delayed ACKs nor ACK loss can
763     be accounted for in a reliable way. Therefore, a sender
764     would need to use heuristics to determine the current delayed ACK ratio
765     m used by the receiver (e.g. most receivers will use m=2), and also
766     the recent ACK loss ratio (l). Acknowledgement Congestion Control
767     (AckCC) as defined in [RFC5690] cannot be used, as deployment of
768     this feature is only experimental.

770     Using a phase locked loop algorithm, the CWR bit can then be set only
771     on those data segments that will trigger a (delayed) ACK. Thereby,
772     no congestion information is lost, as long as the ACK carrying the
773     ECE bit is seen by the sender.

775     Whenever the sender sees an ACK with ECE set, this indicates that at
776     least one, and at most m / (m - l), data segments with the CE
777     codepoint set were seen by the receiver. The sender SHOULD react
778     as if m CE indications were reflected back to the sender by the
779     receiver, unless additional heuristics (e.g. dead time correction)
780     can determine a more accurate value of the "true" number of received
781     CE marks.

783  4.
Acknowledgements

785     We want to thank Michael Welzl and Bob Briscoe for their input and
786     discussion.

788  5.  IANA Considerations

790     This memo includes no request to IANA.

792  6.  Security Considerations

794     For coding schemes that increase robustness for the ECN feedback,
795     similar considerations as in [RFC3540] apply to the selection of when
796     to send an ECT(1) codepoint.

798  7.  References

800  7.1.  Normative References

802     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
803                Requirement Levels", BCP 14, RFC 2119, March 1997.

805     [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
806                of Explicit Congestion Notification (ECN) to IP",
807                RFC 3168, September 2001.

809     [RFC3540]  Spring, N., Wetherall, D., and D. Ely, "Robust Explicit
810                Congestion Notification (ECN) Signaling with Nonces",
811                RFC 3540, June 2003.

813  7.2.  Informative References

815     [Ali10]    Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
816                P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP:
817                Efficient Packet Transport for the Commoditized Data
818                Center", Jan 2010.

820     [I-D.briscoe-tsvwg-re-ecn-tcp]
821                Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
822                "Re-ECN: Adding Accountability for Causing Congestion to
823                TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-09 (work in
824                progress), October 2010.

826     [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
827                Ramakrishnan, "Adding Explicit Congestion Notification
828                (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
829                June 2009.

831     [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
832                Control", RFC 5681, September 2009.

834     [RFC5690]  Floyd, S., Arcia, A., Ros, D., and J. Iyengar, "Adding
835                Acknowledgement Congestion Control to TCP", RFC 5690,
836                February 2010.

838  Appendix A.  Pseudo Code for the Codepoint Coding

840     Receiver:

842       Input signals: CE, ECT(1)
843       TCP Fields: ECI (3-bit field from CWR and ECE). CI.cm and E1.cm map
844       into these 8 codepoints (i.e.
          5 and 3 codepoints)

846       These counters get tracked by the following variables:

848       CI.c (congestion indication - counter, modulo a multiple of the
849             available codepoints to represent CI.c in the ECI field.
850             Range [0..n*CI.cp-1])
851       CI.g (congestion indication - gauge, [0.."inf"])
852       CI.i (congestion indication - iteration, [0,1])
853       These are to track CE indications.

855       E1.c, E1.g and E1.i (doing the same, but for ECT(1) signals).

857       Constants:
858       CI.cp (number of codepoints available to signal CI)
859       CI.cm[0..(CI.cp-1)] (codepoint mappings for CI)
860       E1.cp (number of codepoints available for E1 signal)
861       E1.cm[0..(E1.cp-1)] (codepoint mappings for E1)

863       At session initialization, all these counters are set to 0.

865       When a Segment (Data, ACK) is received,
866       perform the following steps:

868       If a CE codepoint is received,
869         Increase CI.g by 1
870       If an ECT(1) codepoint is received,
871         Increase E1.g by 1
872       If (CI.g > 5)     # When ACK rate is not sufficient to keep
873         or (E1.g > 3)   # gauge close to zero, increase ACK rate
874                         # works independent of delACK number (i.e. AckCC)
875         Cancel pending delayed ACK (ACK this segment immediately)
876         # this increases the ACK rate to a maximum of 1.5 data segments
877         # per ACK, with delACK=2,
878         # when the CE mark rate exceeds 75% over a run
879         # of at least 18 segments.
880         # (5 codepoints would allow delACK=2 indefinitely)

882       When preparing an ACK to be sent:

884       If (CI.g > 0) or
885          ((E1.i != 0) and (CI.i != 0))  # E1.g = 0 is to skip this
886                                         # if only the 2nd CI.c ACK
887         # has to be sent - effectively alternating CI.c and E1.c on ACKs
888         # should give slightly better resiliency against ACK losses
889         If CI.i == 0                    # updates to CI.c allowed
890            and CI.g > 0                 # update is meaningful
891           CI.i = 1                      # may be larger
892                                         # if more resiliency is required
893           CI.c += min(CI.cp-1,CI.g)     # CI.cp-1 is 3 for 4 codepoints,
894                                         # 4 for 5 etc.
895           CI.c = CI.c modulo (CI.cp*CI.cp)  # using modulo the square of
896                                         # available codepoints,
897                                         # for convenience (debugging)
898           CI.g -= min(CI.cp-1,CI.g)     # amount just added to CI.c
899         Else
900           CI.i--                        # just in case CI.i was set to
901                                         # more than 1 for resiliency
902         Send next ACK with ECI = CI.cm[CI.c modulo CI.cp]
903       Else
904         If (E1.g > 0) or (E1.i != 0)

906           If (E1.i == 0) and (E1.g > 0)
907             E1.i = 1
908             E1.c += min(E1.cp-1,E1.g)
909             E1.c = E1.c modulo (E1.cp*E1.cp)
910             E1.g -= min(E1.cp-1,E1.g)
911           Else
912             E1.i--
913           Send next ACK with ECI = E1.cm[E1.c modulo E1.cp]
914         Else
915           Send next ACK with ECI = CI.cm[CI.c modulo CI.cp] # default action

917     Sender:

919       Counters:

921       CI.r    - current value of CEs seen by receiver
922       E1.s    - sum of all sent ECT(1) marked packets (up to snd.nxt)
923       E1.s(t) - value of E1.s at time (in sequence space) t
924       E1.r    - value signaled by receiver about received ECT(1) segments
925       E1.r(t) - value of E1.r at time (in sequence space) t
926       CI.r(t) - value of CI.r at time (in sequence space) t
927       # Note: With a codepoint implementation,
928       # a reverse table ECI[n] -> CI.r / E1.r is needed.
929       # This example is simplified with 4/4 codepoints
930       # instead of 5/3.

932       If ACK with NS=0
933         CI.r += (ECI + 4 - (CI.r mod CI.cp)) mod CI.cp
934       # The wire protocol transports the absolute value
935       # of the receiver-side counter.
936       # Thus the (positive only) delta needs to be calculated,
937       # and added to the sender-side counter.
938       If ACK with NS=1
939         E1.r += (ECI + 4 - (E1.r mod E1.cp)) mod E1.cp

941       # Before CI.r or E1.r reach a (binary) rollover,
942       # they need to roll over some multiple of CI.cp
943       # and E1.cp respectively.

945       CI.r = CI.r modulo (CI.cp * n_CI)
946       E1.r = E1.r modulo (E1.cp * n_E1)

948       # (an implementation may choose to use a single constant,
949       # i.e. 3^4*5^4 for 16-bit integers,
950       # or 3^8*5^8 for 32-bit integers)

952       # The following test can (probabilistically) reveal
953       # if the receiver or path is not properly
954       # handling ECN (CE, E1) marks

956       If not E1.r(t) <= E1.s(t) <= E1.r(t) + CI.r(t)
957         # -> receiver lies (or too many ACKs got lost,
958         # which can also be checked by the sender).

960  Authors' Addresses

962     Mirja Kuehlewind (editor)
963     University of Stuttgart
964     Pfaffenwaldring 47
965     Stuttgart  70569
966     Germany

968     Email: mirja.kuehlewind@ikr.uni-stuttgart.de
969     Richard Scheffenegger
970     NetApp, Inc.
971     Am Euro Platz 2
972     Vienna,   1120
973     Austria

975     Phone: +43 1 3676811 3146
976     Email: rs@netapp.com