idnits 2.17.1

draft-kuehlewind-conex-tcp-modifications-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (October 31, 2011) is 4559 days in the past.  Is this
     intentional?

  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Unused Reference: 'I-D.briscoe-tsvwg-re-ecn-tcp' is defined on line 452,
     but no explicit reference was found in the text

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   Congestion Exposure (ConEx)                           M. Kuehlewind, Ed.
3   Internet-Draft                                   University of Stuttgart
4   Intended status: Experimental                           R. Scheffenegger
5   Expires: May 3, 2012                                        NetApp, Inc.
6                                                           October 31, 2011

8                 TCP modifications for Congestion Exposure
9                draft-kuehlewind-conex-tcp-modifications-01

11  Abstract

13     Congestion Exposure (ConEx) is a mechanism by which senders inform
14     the network about the congestion encountered by previous packets on
15     the same flow.  This document describes the necessary modifications
16     to use ConEx with the Transmission Control Protocol (TCP).

18  Status of this Memo

20     This Internet-Draft is submitted in full conformance with the
21     provisions of BCP 78 and BCP 79.

23     Internet-Drafts are working documents of the Internet Engineering
24     Task Force (IETF).  Note that other groups may also distribute
25     working documents as Internet-Drafts.  The list of current Internet-
26     Drafts is at http://datatracker.ietf.org/drafts/current/.

28     Internet-Drafts are draft documents valid for a maximum of six months
29     and may be updated, replaced, or obsoleted by other documents at any
30     time.  It is inappropriate to use Internet-Drafts as reference
31     material or to cite them other than as "work in progress."

33     This Internet-Draft will expire on May 3, 2012.

35  Copyright Notice

37     Copyright (c) 2011 IETF Trust and the persons identified as the
38     document authors.  All rights reserved.

40     This document is subject to BCP 78 and the IETF Trust's Legal
41     Provisions Relating to IETF Documents
42     (http://trustee.ietf.org/license-info) in effect on the date of
43     publication of this document.  Please review these documents
44     carefully, as they describe your rights and restrictions with respect
45     to this document.  Code Components extracted from this document must
46     include Simplified BSD License text as described in Section 4.e of
47     the Trust Legal Provisions and are provided without warranty as
48     described in the Simplified BSD License.

50  Table of Contents

52     1.  Introduction
53       1.1.  Requirements Language
54     2.  Sender-side Modifications
55     3.  Accounting congestion
56       3.1.  ECN
57         3.1.1.  Accurate ECN feedback
58         3.1.2.  Classic ECN support
59       3.2.  Loss Detection with/without SACK
60     4.  Setting the ConEx IPv6 Bits
61       4.1.  Setting the E and the L Bit
62       4.2.  Credit Bits
63     5.  Timeliness of the ConEx Signals
64     6.  Acknowledgements
65     7.  IANA Considerations
66     8.  Security Considerations
67     9.  References
68       9.1.  Normative References
69       9.2.  Informative References
70     Authors' Addresses

72  1.  Introduction

74     Congestion Exposure (ConEx) is a mechanism by which senders inform
75     the network about the congestion encountered by previous packets on
76     the same flow.  This document describes the necessary modifications
77     to use ConEx with the Transmission Control Protocol (TCP).  The ConEx
78     signal is based on loss or ECN marks [RFC3168] as a congestion
79     indication.

81     With standard TCP without Selective Acknowledgments (SACK) [RFC2018]
82     the actual number of losses is hard to detect; we therefore recommend
83     enabling SACK when using ConEx.  However, we discuss both cases, with
84     and without SACK support, later on.

86     Explicit Congestion Notification (ECN) is defined in such a way that
87     only a single congestion signal is guaranteed to be delivered per
88     Round-trip Time (RTT).  For ConEx a more accurate feedback signal
89     would be beneficial.  Such an extension to ECN is defined in a
90     separate document [draft-kuehlewind-conex-accurate-ecn], as it can
91     also be useful for other mechanisms, e.g. [DCTCP], or whenever the
92     congestion control reaction should be proportional to the experienced
93     congestion.

95     ConEx is currently being defined as a destination option for IPv6.
96     Four bits have been defined, namely the X (ConEx-capable), the L
97     (loss experienced), the E (ECN experienced) and the C (credit) bit.

99  1.1.  Requirements Language

101    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
102    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
103    document are to be interpreted as described in [RFC2119].

105 2.  Sender-side Modifications

107    A ConEx sender MUST negotiate for both SACK and the more accurate
108    ECN feedback in the TCP handshake if these TCP extensions are
109    available at the sender.  Depending on the capability of the
110    receiver, the following operation modes exist:

112    o  Full-ConEx (SACK and accurate ECN feedback)

114    o  accECN-ConEx (no SACK but accurate ECN feedback)

116    o  ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN)
117    o  SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN)

119    o  SACK-ConEx (SACK but no ECN at all)

121    o  Basic-ConEx (neither SACK nor ECN)

123    A ConEx sender MUST expose congestion to the network according to the
124    congestion information received by ECN or based on loss provided by
125    the TCP feedback loop.  A TCP sender MUST account congestion byte-
126    wise (and not packet-wise) and MUST mark the respective number of
127    payload bytes in subsequent packets (after the congestion
128    notification) with the respective ConEx bit in the IP header.  The
129    congestion accounting for the different operation modes is described
130    in the next section, and the handling of the IPv6 bits themselves in
131    the section after that.

133 3.  Accounting congestion

135    A TCP sender MUST account congestion byte-wise (and not packet-wise)
136    based on the congestion information received by ECN or loss detection
137    provided by TCP.  For this purpose a TCP sender will maintain two
138    different counters for the number of outstanding bytes that need to
139    be ConEx marked with either the E bit or the L bit.
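
   The two counters described above can be pictured as a small
   per-connection record.  The following is a minimal sketch in Python
   and not part of the draft: the class name ExposureGauges and the
   methods account_ecn and account_loss are hypothetical, and the field
   names follow the gauge names CEG and LEG that this document uses for
   the E-bit and L-bit counters.  It only shows byte-wise accumulation;
   how the gauges are reduced when marked packets are sent is left out.

```python
from dataclasses import dataclass


@dataclass
class ExposureGauges:
    """Per-connection ConEx byte counters (hypothetical layout)."""

    ceg: int = 0  # bytes that still have to be sent with the E bit set
    leg: int = 0  # bytes that still have to be sent with the L bit set

    def account_ecn(self, marked_bytes: int) -> None:
        # Congestion learned from ECN feedback is counted in bytes, not packets.
        self.ceg += marked_bytes

    def account_loss(self, lost_bytes: int) -> None:
        # Congestion learned from loss detection is counted in bytes as well.
        self.leg += lost_bytes


# Example: one ECN-marked segment and two lost segments of 1460 bytes each.
gauges = ExposureGauges()
gauges.account_ecn(1460)
gauges.account_loss(2 * 1460)
```

   Keeping both gauges in bytes rather than in packets is what lets the
   sender later mark "the respective number of payload bytes" regardless
   of how the outgoing data is segmented.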

141    The outstanding bytes accounted based on ECN feedback information are
142    maintained in the congestion exposure gauge (CEG).  The accounting of
143    these bytes from the ECN feedback is explained in more detail next.

145    The outstanding bytes for congestion indications based on loss are
146    maintained in the loss exposure gauge (LEG), and their accounting is
147    explained after the CEG accounting.

149    The subtraction of bytes which have been ConEx marked from both
150    counters is explained in the next section.

152    Usually all bytes of an IP packet must be accounted.  If we assume
153    equal sized packets or at least equally distributed packet sizes, the
154    sender MAY only account the TCP payload bytes, as the ConEx marked
155    packets as well as the original packets causing the congestion will
156    both contain about the same number of headers.  Otherwise the sender
157    MUST take the headers into account.  A sender which sends different
158    sized packets with unequally distributed packet sizes should know
159    the reason for doing so and thus may be able to reconstruct the exact
160    number of headers based on this information.  Otherwise, if no
161    additional information is available, the worst case number of headers
162    SHOULD be estimated in a conservative way based on a minimum packet
163    size (of all packets sent in the last RTT).

165 3.1.  ECN

167    A receiver can support the accurate ECN feedback scheme, the
168    'classic' ECN or neither.  In the case ECN is not supported at all,
169    the transport is not ECN-capable and no ECN marks will occur, thus
170    the E bit will never be set.  In the other cases a ConEx sender MUST
171    maintain a gauge for the number of outstanding bytes that have to be
172    ConEx marked with the E bit, the congestion exposure gauge (CEG).

174    The CEG is increased when ECN information is received from an ECN-
175    capable receiver supporting the 'classic' ECN scheme or the accurate
176    ECN feedback scheme.  When the ConEx sender receives an ACK
177    indicating one or more segments were received with a CE mark, CEG is
178    increased by the appropriate number of bytes.  The two cases,
179    depending on the receiver capability, are discussed in the following
180    sections.

182 3.1.1.  Accurate ECN feedback

184    With a more accurate ECN feedback scheme either the number of marked
185    packets/received CE marks is known or the number of marked bytes
186    directly.  In the latter case the CEG can directly be increased by
187    the number of marked bytes.  Otherwise, when the accurate ECN feedback
188    scheme is supported by the receiver, the receiver will maintain an
189    echo congestion counter (ECC).  The ECC will hold the number of CE
190    marks received.  A sender that understands the accurate ECN
191    feedback will be able to reconstruct this ECC value on the sender
192    side by maintaining a counter ECC.r.

194    On the arrival of every ACK, the sender calculates the difference D
195    between the local ECC.r counter and the signaled value of the
196    receiver side ECC counter.  The value of ECC.r is increased by D, and
197    D is assumed to be the number of CE marked packets that arrived at
198    the receiver since it sent the previously received ACK.

200    Whenever the counter ECC.r is increased, the gauge CEG has to be
201    increased by the number of bytes sent which were marked:

203       CEG += min( SMSS*D, acked_bytes )

205 3.1.2.  Classic ECN support

207    A ConEx sender that communicates with a classic ECN receiver
208    (conforming to [RFC3168] or [RFC5562]) MAY run in one of these modes:

210    o  Full compliance mode:

212       The ConEx sender fully conforms to all the semantics of the ECN
213       signaling as defined by [RFC5562].  In this mode, only a single
214       congestion indication can be signaled by the receiver per RTT.
215       Whenever the ECE flag toggles from "0" to "1", the gauge CEG is
216       increased by the SMSS:

218          CEG += SMSS

220       Note that under severe congestion, a session adhering to these
221       semantics may not provide enough ConEx marks.  This may cause
222       appropriate sanctions by an audit device in a ConEx enabled
223       network.

225    o  Simple compatibility mode:

227       The sender will set the CWR flag permanently to force the receiver
228       to signal only one ECE per CE mark.  Unfortunately, in a high
229       congestion situation where all packets are CE marked over a
230       certain period of time, the use of delayed ACKs, as is usually
231       done today, will prevent feedback of every CE mark.  With an ACK
232       rate of M, about (M-1)/M of the CE indications will not be signaled
233       back by the receiver (e.g. 50% with M=2).  Thus, in this mode the
234       ConEx sender MUST increase CEG by a count of M*SMSS for each
235       received ECE signal:

237          CEG += M*SMSS

239       In case of a congestion event with low congestion (that is, when
240       only a very small number of packets get marked), the sender
241       might miss the whole congestion event.  On average the sender will
242       send sufficient ConEx marks due to the scheme proposed above, but
243       these ConEx marks might be shifted in time.  Regarding congestion
244       control it is not a general problem to miss a congestion event, as
245       by chance a marking scheme in the network node might also miss a
246       certain flow.  Even if then no other flow is reacting, the
247       congestion level will increase and it will get more likely that
248       the congestion feedback is delivered.  But to provide a fair share
249       over time, a TCP sender could react more strongly when receiving an
250       ECN feedback signal.  This of course depends on the congestion
251       control used.  A TCP sender using this scheme MUST take the impact
252       on congestion control into account.

254    o  Advanced compatibility mode:

256       More sophisticated heuristics, such as a phase locked loop, to set
257       CWR only on those data segments that will actually trigger a
258       (delayed) ACK, could extract congestion notifications in a more
259       timely manner.  A ConEx sender MAY choose to implement such a
260       heuristic.  In addition, further heuristics SHOULD be implemented
261       to determine the value of each ECE notification.  E.g. for each
262       consecutive ACK received with the ECE flag set, CEG should be
263       increased by min( M*SMSS, acked_bytes ).  Else, if the predecessor
264       ACK was received with the ECE flag cleared, CEG need only be
264       increased by one SMSS:

266          if previous_marked:  CEG += min( M*SMSS, acked_bytes )
267          else:                CEG += SMSS

269       This heuristic is conservative during more serious congestion, and
270       more relaxed at low congestion levels.

272 3.2.  Loss Detection with/without SACK

274    For all the data segments that are determined by a ConEx sender as
275    lost, an identical number of IP bytes MUST be sent with the ConEx
276    L bit set.  Loss detection typically happens by use of duplicate
277    ACKs, or the firing of the retransmission timer.  A ConEx sender MUST
278    maintain a loss exposure gauge (LEG), indicating the number of
279    outstanding bytes that must be sent with the ConEx L bit.  When a
280    data segment is retransmitted, LEG will be increased by the size of
281    the TCP payload of the packet containing the retransmission, assuming
282    equal sized segments such that the retransmitted packet will have the
283    same number of headers as the original one.  When sending subsequent
284    segments (including TCP control segments), the ConEx L bit is set as
285    long as LEG is positive, and LEG is decreased by the size of the sent
286    TCP payload with the ConEx L bit set.

288    Any retransmission may be spurious.  To accommodate that, a ConEx
289    sender SHOULD make use of heuristics to detect such spurious
290    retransmissions (e.g. F-RTO [RFC5682], DSACK [RFC3708], and Eifel
291    [RFC3522], [RFC4015]).  When such a heuristic has determined that a
292    certain number of packets were retransmitted erroneously, the ConEx
293    sender should subtract the payload size of these TCP packets from
294    LEG.

296    Note that the above heuristics delay the ConEx signal by one
297    segment, and also decouple it from the retransmissions themselves,
298    as some control packets (e.g. pure ACKs, window probes, or window
299    updates) may be sent in between data segment retransmissions.  A
300    simpler approach would be to set the ConEx signal for each
301    retransmitted data segment.  However, it is important to remember
302    that a ConEx signal and TCP segments do not natively belong together.

304 4.  Setting the ConEx IPv6 Bits

306    ConEx is currently being defined as a destination option for IPv6.
307    Four bits have been defined, namely the X (ConEx-capable), the L
308    (loss experienced), the E (ECN experienced) and the C (credit) bit.

310    By setting the X bit a packet is marked as ConEx-capable.  All
311    packets carrying payload MUST be marked with the X bit set, including
312    retransmissions.  For control packets such as pure ACKs, which do not
313    carry any payload, no congestion feedback information is available;
314    thus these packets should not be taken into account when determining
315    ConEx information.  These packets MUST carry a ConEx Destination
316    Option with the X bit unset.

318 4.1.  Setting the E and the L Bit

320    As long as the CEG/LEG is positive, ConEx-capable packets MUST be
321    marked with E or L, respectively, and the CEG/LEG is decreased by the
322    TCP payload bytes carried in this packet.  If the CEG/LEG is negative,
323    the CEG/LEG is drained by one byte with every packet sent out, as
324    ConEx information is only meaningful for a certain time:

326       if CEG > 0: CEG -= TCPpayload.length else: CEG--
327       if LEG > 0: LEG -= TCPpayload.length else: LEG--

329 4.2.  Credit Bits

331    The ConEx abstract mechanism requires that the transport SHOULD
332    signal sufficient credit in advance to cover any reasonably expected
333    congestion during its feedback delay.  To be very conservative the
334    number of credits would need to equal the number of packets in
335    flight, as every packet could get lost or congestion marked.  With a
336    more moderate view, only an increase in the sending rate should cause
337    congestion.

339    For a TCP sender using the [RFC5681] congestion control algorithm, we
340    recommend sending credit only in Slow Start, as in Congestion
341    Avoidance an increase of one segment per RTT should only cause a
342    minor amount of congestion marks (usually at most one).  If a more
343    aggressive congestion control is used, a sufficient amount of credits
344    needs to be set.

346    In TCP Slow Start the sending rate will increase exponentially,
347    that is, double every RTT.  Thus the number of credits should equal
348    half the number of packets in flight in every RTT.  Under the
349    assumption that no marks become invalid during the whole Slow
350    Start phase, marks of a previous RTT have to be summed up.  Thus the
351    marking of every fourth packet will allow sufficient credits in Slow
352    Start.

354       RTT1  |------XC------>|
355             |------X------->|
356             |------X------->|   credit=1 in_flight=3
357             |               |
358       RTT2  |------X------->|
359             |------XC------>|
360             |------X------->|
361             |------X------->|
362             |------X------->|
363             |------XC------>|   credit=3 in_flight=6
364             |               |
365       RTT3  |------X------->|
366             |------X------->|
367             |------X------->|
368             |------XC------>|
369             |------X------->|
370             |------X------->|
371             |------X------->|
372             |------XC------>|
373             |------X------->|
374             |------X------->|
375             |------X------->|
376             |------XC------>|   credit=6 in_flight=12
377             |       .       |
378             |       :       |

380       Figure 1: Credits in Slow Start (with an initial window of 3)

382    If a ConEx sender detects an increasing number of losses even though
383    the sender reduced the sending rate, the sender SHOULD assume that
384    those losses are introduced by an audit device and thus should send
385    further credits.  Up to now it is not clear whether the credits stay
386    valid as long as the connection is established or whether an
387    expiration of the credits needs to be assumed by the sender.

389 5.  Timeliness of the ConEx Signals

391    ConEx signals will anyway be evaluated with a slight time delay of
392    about one RTT by a network node.  Therefore, it would not be
393    absolutely necessary to immediately signal ConEx bits when they
394    become known (e.g. L and E bits), but a sender SHOULD send the ConEx
395    signaling with the next available packet.  If there are cases
396    where it is preferable to slightly delay the ConEx signal, the sender
397    MUST NOT delay the ConEx signal more than one RTT.

399    Multiple ConEx bits may become available for signaling at the same
400    time, for example when an ACK is received by the sender that
401    indicates that at least one segment has been lost, and that one or
402    more ECN marks were received at the same time.  This may happen
403    during excessive congestion, where buffer queues overflow and some
404    packets are marked, while others have to be dropped nevertheless.
405    Another case in which this may happen is lost ACKs, so that a
406    subsequent ACK carries summary information not previously available
407    to the sender.

409    It is important to remember that ConEx bits and TCP retransmissions
410    do not interact with each other.  However, a retransmission should be
411    accompanied by one ConEx L bit in close proximity nevertheless.  This
412    does not mean that TCP retransmissions may never contain ConEx
413    marks.  In a typical scenario using SACK, the first retransmission
414    would not carry a ConEx L bit, while subsequent retransmissions in
415    the same recovery episode would be marked with the ConEx L bit.
416    Spreading the ConEx bits over a small number of segments increases
417    the likelihood that most devices along the path will see some ConEx
418    marks even during heavy congestion.

420 6.  Acknowledgements

422 7.  IANA Considerations

424 8.  Security Considerations

426 9.  References

428 9.1.  Normative References

430    [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
431               Selective Acknowledgment Options", RFC 2018, October 1996.

433    [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
434               Requirement Levels", BCP 14, RFC 2119, March 1997.

436    [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
437               of Explicit Congestion Notification (ECN) to IP",
438               RFC 3168, September 2001.

440    [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
441               Ramakrishnan, "Adding Explicit Congestion Notification
442               (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562,
443               June 2009.

445 9.2.  Informative References

447    [DCTCP]    Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
448               P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP:
449               Efficient Packet Transport for the Commoditized Data
450               Center", Jan 2010.

452    [I-D.briscoe-tsvwg-re-ecn-tcp]
453               Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
454               "Re-ECN: Adding Accountability for Causing Congestion to
455               TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-09 (work in
456               progress), October 2010.

458    [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
459               for TCP", RFC 3522, April 2003.

461    [RFC3708]  Blanton, E. and M. Allman, "Using TCP Duplicate Selective
462               Acknowledgement (DSACKs) and Stream Control Transmission
463               Protocol (SCTP) Duplicate Transmission Sequence Numbers
464               (TSNs) to Detect Spurious Retransmissions", RFC 3708,
465               February 2004.

467    [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
468               for TCP", RFC 4015, February 2005.

470    [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
471               Control", RFC 5681, September 2009.

473    [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
474               "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
475               Spurious Retransmission Timeouts with TCP", RFC 5682,
476               September 2009.

478    [draft-kuehlewind-conex-accurate-ecn]
479               Kuehlewind, M. and R. Scheffenegger, "Accurate ECN
480               Feedback in TCP", draft-kuehlewind-conex-accurate-ecn-00
481               (work in progress), Jun 2011.

483 Authors' Addresses

485    Mirja Kuehlewind (editor)
486    University of Stuttgart
487    Pfaffenwaldring 47
488    Stuttgart  70569
489    Germany

491    Email: mirja.kuehlewind@ikr.uni-stuttgart.de

493    Richard Scheffenegger
494    NetApp, Inc.
495    Am Euro Platz 2
496    Vienna,   1120
497    Austria

499    Phone: +43 1 3676811 3146
500    Email: rs@netapp.com