idnits 2.17.1

draft-zimmermann-tcpm-reordering-reaction-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document date (May 20, 2014) is 3626 days in the past.  Is this
     intentional?

  Checking references for intended status: Experimental
  ----------------------------------------------------------------------------

  == Outdated reference: A later version (-02) exists of
     draft-zimmermann-tcpm-reordering-detection-01

  ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293)

  -- Obsolete informational reference (is this intentional?): RFC 896
     (Obsoleted by RFC 7805)

  -- Obsolete informational reference (is this intentional?): RFC 2861
     (Obsoleted by RFC 7661)

  -- Obsolete informational reference (is this intentional?): RFC 2960
     (Obsoleted by RFC 4960)

     Summary: 1 error (**), 0 flaws (~~), 3 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

TCP Maintenance and Minor Extensions                       A. Zimmermann
(TCPM) WG                                                   NetApp, Inc.
Internet-Draft                                                L. Schulte
Intended status: Experimental                           Aalto University
Expires: November 21, 2014                                      C. Wolff
                                                            A. Hannemann
                                                           credativ GmbH
                                                            May 20, 2014


          Making TCP Adaptively Robust to Non-Congestion Events
              draft-zimmermann-tcpm-reordering-reaction-01

Abstract

   This document specifies an adaptive Non-Congestion Robustness (aNCR)
   mechanism for TCP.  In the absence of explicit congestion
   notification from the network, TCP uses only packet loss as an
   indication of congestion.  One of the signals TCP uses to determine
   loss is the arrival of three duplicate acknowledgments.  However,
   this heuristic is not always correct, notably in the case when paths
   reorder packets.  This results in degraded performance.

   TCP-aNCR is designed to mitigate this performance degradation by
   adaptively increasing the number of duplicate acknowledgments
   required to trigger loss recovery, based on the current state of the
   connection, in an effort to better disambiguate true segment loss
   from segment reordering.  This document specifies the changes to TCP
   and TCP-NCR (on which this specification is built) and discusses
   the costs and benefits of these modifications.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 21, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Basic Concept
   4.  Appropriate Detection and Quantification Algorithms
   5.  The TCP-aNCR Algorithm
     5.1.  Initialization during Connection Establishment
     5.2.  Initializing Extended Limited Transmit
     5.3.  Executing Extended Limited Transmit
     5.4.  Terminating Extended Limited Transmit
     5.5.  Entering Loss Recovery
     5.6.  Reordering Extent
     5.7.  Retransmission Timeout
   6.  Protocol Steps in Detail
   7.  Discussion of TCP-aNCR
     7.1.  Variable Duplicate Acknowledgment Threshold
     7.2.  Relative Reordering Extent
     7.3.  Reordering during Slow Start
     7.4.  Preventing Bursts
     7.5.  Persistent receiving of Selective Acknowledgments
   8.  Interoperability Issues
     8.1.  Early Retransmit
     8.2.  Congestion Window Validation
     8.3.  Reactive Response to Packet Reordering
     8.4.  Buffer Auto-Tuning
   9.  Related Work
   10. IANA Considerations
   11. Security Considerations
   12. Acknowledgments
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Appendix A.  Changes from previous versions of the draft
     A.1.  Changes from draft-zimmermann-tcpm-reordering-reaction-00
   Authors' Addresses

1.  Introduction

   One strength of the Transmission Control Protocol (TCP) [RFC0793]
   lies in its ability to adjust its sending rate according to the
   perceived congestion in the network [RFC5681].  In the absence of
   explicit notification of congestion from the network, TCP uses
   segment loss as an indication of congestion (i.e., assuming queue
   overflow).  A TCP receiver sends cumulative acknowledgments (ACKs)
   indicating the next sequence number expected from the sender for
   arriving segments [RFC0793].  When segments arrive out of order,
   duplicate ACKs are generated.
   As specified in [RFC5681], a TCP
   sender uses the arrival of three duplicate ACKs as an indication of
   segment loss.  The TCP sender retransmits the segment assumed lost
   and reduces the sending rate, based on the assumption that the loss
   was caused by resource contention on the path.  The TCP sender does
   not assume loss on the first or second duplicate ACK, but waits for
   three duplicate ACKs to account for minor packet reordering.
   However, the use of this constant threshold of duplicate ACKs leads
   to performance degradation if the extent of the packet reordering in
   the network increases [RFC4653].

   Whenever interoperability with the TCP congestion control and loss
   recovery standard [RFC5681] is a prerequisite, increasing the
   duplicate acknowledgment threshold (DupThresh) is the method of
   choice to a priori prevent any negative impact - in particular, a
   spurious Fast Retransmit and Fast Recovery phase - that packet
   reordering has on TCP.  However, this procedure also delays a Fast
   Retransmit by increasing the DupThresh, and therefore has costs and
   risks, too.  According to [Zha+03], these are: (1) a delayed response
   to congestion in the network, (2) a potential expiration of the
   retransmission timer, and (3) a significant increase in the
   end-to-end delay for lost segments.

   In the current TCP standard, congestion control and loss recovery are
   tightly coupled: when the oldest outstanding segment is declared
   lost, a retransmission is triggered, and the sending rate is reduced
   on the assumption that the loss is due to resource contention
   [RFC5681].  Therefore, any change to DupThresh causes not only a
   change to the loss recovery, but also to the congestion control
   response.  TCP-NCR [RFC4653] addresses this problem by defining two
   extensions to TCP's Limited Transmit [RFC3042] scheme: Careful and
   Aggressive Extended Limited Transmit.

   The first variant of the two, Careful Limited Transmit, sends one
   previously unsent segment in response to duplicate acknowledgments
   for every two segments that are known to have left the network.  This
   effectively halves the sending rate, since normal TCP operation sends
   one new segment for every segment that has left the network.
   Further, the halving starts immediately and is not delayed until a
   retransmission is triggered.  In the case of packet reordering (i.e.,
   not segment loss), TCP-NCR restores the congestion control state to
   its previous state after the event.

   The second variant, Aggressive Limited Transmit, transmits one
   previously unsent data segment in response to duplicate
   acknowledgments for every segment known to have left the network.
   With this variant, while waiting to disambiguate the loss from a
   reordering event, ACK-clocked transmission continues at roughly the
   same rate as before the event started.  Retransmission and the
   sending rate reduction happen per [RFC5681] [RFC6675], albeit after a
   delay caused by the increased DupThresh.  Although this approach
   delays legitimate rate reductions (possibly slightly, and temporarily
   aggravating overall congestion on the network), the scheme has the
   advantage of not reducing the transmission rate in the face of packet
   reordering.

   A basic requirement for preventing an avoidable expiration of the
   retransmission timer is to generally ensure that an increased
   DupThresh can potentially be reached in time so that Fast Retransmit
   is triggered and Fast Recovery is completed before the RTO expires.
   Simply increasing DupThresh before retransmitting a segment can make
   TCP brittle to packet or ACK loss, since such loss reduces the number
   of duplicate ACKs that will arrive at the sender from the receiver.
   For instance, if cwnd is 10 segments and one segment is lost, a
   DupThresh of 10 will never be met, because duplicate ACKs
   corresponding to at most 9 segments will arrive at the sender.  To
   mitigate this issue, the TCP-NCR [RFC4653] modification makes two
   fundamental changes to the way [RFC5681] [RFC6675] currently
   operates.

   First, as mentioned above, TCP-NCR [RFC4653] extends TCP's Limited
   Transmit [RFC3042] scheme to allow for the sending of new data
   segments while the TCP sender stays in the 'disorder' state and
   disambiguates loss from reordering.  This new data serves to increase
   the likelihood that enough duplicate ACKs arrive at the sender to
   trigger loss recovery, if it is appropriate.  Second, DupThresh is
   increased from the current fixed value of three [RFC5681] to a value
   indicating that approximately a congestion window's worth of data has
   left the network.  Since cwnd represents the amount of data a TCP
   sender can transmit in one round-trip time (RTT), this corresponds to
   approximately the largest amount of time a TCP sender can wait before
   the costly retransmission timeout may be triggered.

   Of vital importance is that TCP-NCR [RFC4653] does not hold DupThresh
   constant, but dynamically adjusts it on each SACK to the current
   amount of outstanding data, which depends not only on the congestion
   window, but also on the receiver's advertised window.  Thus, it is
   guaranteed that the outstanding data generates a sufficient number of
   duplicate ACKs for reaching DupThresh and a transition to the
   'recovery' state.  This is important in cases where there is no new
   data available to send.

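   The reachability argument above (a cwnd of 10 segments with one loss
   can never yield 10 duplicate ACKs) can be checked with a short
   sketch.  It is illustrative only: it counts whole segments, assumes
   one duplicate ACK per delivered segment and no ACK loss, and the
   function names are hypothetical rather than taken from [RFC4653].

```python
def max_dup_acks(outstanding_segments: int, lost_segments: int) -> int:
    """Upper bound on the duplicate ACKs a sender can observe.

    Assumption: every delivered segment yields one duplicate ACK and
    no ACK is lost, so only lost segments reduce the count.
    """
    return max(outstanding_segments - lost_segments, 0)


def threshold_reachable(dup_thresh: int, outstanding_segments: int,
                        lost_segments: int, new_segments_sent: int = 0) -> bool:
    """True if enough duplicate ACKs can arrive to reach dup_thresh.

    new_segments_sent models data sent while waiting (as Extended
    Limited Transmit does); each such segment can add one more ACK.
    """
    possible = max_dup_acks(outstanding_segments + new_segments_sent,
                            lost_segments)
    return possible >= dup_thresh


# cwnd of 10 segments, one lost, fixed DupThresh of 10, no new data.
print(threshold_reachable(10, 10, 1))                        # False
# Same situation, but 9 new segments are sent while waiting.
print(threshold_reachable(10, 10, 1, new_segments_sent=9))   # True
```

   The second call shows why sending new data during the 'disorder'
   state matters: without it, a threshold close to cwnd cannot be
   reached once a single segment is lost.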
   Regarding the problem of packet reordering, TCP-NCR's [RFC4653]
   decision of waiting to receive notice that cwnd bytes have left the
   network before deciding whether the root cause is loss or reordering
   is essentially a trade-off between making the best decision regarding
   the cause of the duplicate ACKs and responsiveness, and represents a
   good compromise between avoiding spurious Fast Retransmits and
   avoiding unnecessary RTOs.  On the other hand, if there is no visible
   packet reordering on the network path - which today is the rule and
   not the exception - or the delay caused by the reordering is very
   low, delaying Fast Retransmit is unnecessary in the case of
   congestion, and data is delivered to the application up to one RTT
   later.  Especially for delay-sensitive applications, such as a
   terminal session over SSH, this is generally undesirable.  By
   dynamically adapting DupThresh not only to the amount of outstanding
   data but also to the perceived packet reordering on the network path,
   this issue can be offset.  This is the key idea behind the TCP-aNCR
   algorithm.

   This document specifies a set of TCP modifications to provide an
   adaptive Non-Congestion Robustness (aNCR) mechanism for TCP.  The
   TCP-aNCR modifications lend themselves to incremental deployment.
   Only the TCP implementation on the sender side requires modification.
   The changes themselves are modest.  TCP-aNCR is built on top of the
   TCP Selective Acknowledgments Option [RFC2018] and the SACK-based
   loss recovery scheme given in [RFC6675] and represents an enhancement
   of the original TCP-NCR mechanism [RFC4653].  Currently, TCP-aNCR is
   an independent approach to making TCP more robust to packet
   reordering.  It is not yet clear whether TCP-aNCR, in upcoming
   versions of this draft, will obsolete TCP-NCR.

   It should be noted that the TCP-aNCR algorithm in this document could
   be easily adapted to the Stream Control Transmission Protocol (SCTP)
   [RFC2960], since SCTP uses congestion control algorithms similar to
   TCP (and thus has the same reordering robustness issues).

   The remainder of this document is organized as follows.  Section 3
   provides a high-level description of the TCP-aNCR mechanism.
   Section 4 defines TCP-aNCR's requirements for an appropriate
   detection and quantification algorithm.  Section 5 specifies the
   TCP-aNCR algorithm and Section 6 discusses each step of the algorithm
   in detail.  Section 7 provides a discussion of several design
   decisions behind TCP-aNCR.  Section 8 discusses interoperability
   issues related to introducing TCP-aNCR.  Finally, related work is
   presented in Section 9 and security concerns in Section 11.

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   The reader is expected to be familiar with the TCP state variables
   described in [RFC0793] (SND.NXT), [RFC5681] (cwnd, rwnd, ssthresh,
   FlightSize, IW), [RFC6675] (pipe, DupThresh, SACK scoreboard), and
   [RFC6582] (recover).  Further, the term 'acceptable acknowledgment'
   is used as defined in [RFC0793].  That is, an ACK that increases the
   connection's cumulative ACK point by acknowledging previously
   unacknowledged data.  The term 'duplicate acknowledgment' is used as
   defined in [RFC6675], which is different from the definition of
   duplicate acknowledgment in [RFC5681].

   This specification defines the four TCP sender states 'open',
   'disorder', 'recovery', and 'loss' as follows.  As long as no
   duplicate ACK is received and no segment is considered lost, the TCP
   sender is in the 'open' state.
   Upon the reception of the first
   consecutive duplicate ACK, TCP will enter the 'disorder' state.
   After receiving DupThresh duplicate ACKs, the TCP sender switches to
   the 'recovery' state and executes standard loss recovery procedures
   like Fast Retransmit and Fast Recovery [RFC5681].  Upon a
   retransmission timeout, the TCP sender enters the 'loss' state.  The
   'recovery' state can only be reached by a transition from the
   'disorder' state; the 'loss' state can be reached from any other
   state.

   The following specification depends on the standard TCP congestion
   control and loss recovery algorithms and the SACK-based loss recovery
   scheme given in [RFC5681] and [RFC6675], respectively.  The algorithm
   represents an enhancement of TCP-NCR [RFC4653].  The reader is
   assumed to be familiar with the algorithms specified in these
   documents.

3.  Basic Concept

   The general idea behind the TCP-aNCR algorithm is to extend the
   TCP-NCR algorithm [RFC4653], so that - based on an appropriate packet
   reordering detection and quantification algorithm (see Section 4) -
   TCP congestion control and loss recovery [RFC5681] is adaptively
   adjusted to the actual perceived packet reordering on the network
   path.

   TCP-NCR [RFC4653] increases DupThresh from the current fixed value of
   three duplicate ACKs [RFC5681] to a value indicating that
   approximately a congestion window's worth of data has left the
   network.  Since cwnd represents the
   amount of data a TCP sender can transmit in one RTT, the choice to
   trigger a retransmission only after a cwnd's worth of data is known
   to have left the network represents roughly the largest amount of
   time a TCP sender can wait before the RTO may be triggered.  The
   approach chosen in TCP-aNCR is to take TCP-NCR's DupThresh as an
   upper bound for an adjustment of the DupThresh that is adaptive to
   the actual packet reordering on the network path.

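   As a rough illustration of this upper-bound idea, the following
   sketch computes DupThresh with the formulas given in steps (I.4) and
   (I.5) of Section 5.2.  The function and constant names are
   hypothetical.  Fraction is used only to avoid floating-point
   rounding; this document does not state how a fractional threshold is
   rounded, so the sketch leaves the value unrounded.

```python
from fractions import Fraction

LT_F_CAREFUL = Fraction(2, 3)      # step (C.1), Careful variant
LT_F_AGGRESSIVE = Fraction(1, 2)   # step (C.1), Aggressive variant
NCR_ONLY = -1                      # ReorExtR value that disables adaptation


def dup_thresh(flight_size: int, smss: int, lt_f: Fraction,
               reor_ext_r) -> Fraction:
    """DupThresh in segments, following steps (I.4) and (I.5)."""
    segments = Fraction(flight_size, smss)
    # Step (I.4): TCP-NCR's threshold, never below the standard three.
    ncr_thresh = max(lt_f * segments, Fraction(3))
    if reor_ext_r == NCR_ONLY:
        return ncr_thresh
    # Step (I.5): the measured relative reordering extent, capped by
    # TCP-NCR's threshold and floored at three.
    adaptive = Fraction(reor_ext_r) * segments
    return max(min(ncr_thresh, adaptive), Fraction(3))


flight = 30 * 1460  # 30 full-sized segments outstanding, SMSS = 1460
print(dup_thresh(flight, 1460, LT_F_AGGRESSIVE, 0))               # 3
print(dup_thresh(flight, 1460, LT_F_AGGRESSIVE, Fraction(1, 5)))  # 6
print(dup_thresh(flight, 1460, LT_F_AGGRESSIVE, 1))               # 15
print(dup_thresh(flight, 1460, LT_F_AGGRESSIVE, NCR_ONLY))        # 15
```

   With no measured reordering the threshold stays at the standard
   three duplicate ACKs, and even the largest possible reordering
   extent cannot push it past the TCP-NCR value.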
   Using TCP-NCR's DupThresh as an upper bound decouples the avoidance
   of spurious Fast Retransmits from the avoidance of unnecessary
   retransmission timeouts.  Therefore, the adaptive adjustment of the
   DupThresh to current perceived packet reordering can be conducted
   without taking any retransmission timeout avoidance strategy into
   account.  This independence allows TCP-aNCR to quickly respond to
   perceived packet reordering by setting its DupThresh so that it
   always corresponds to the minimum of the maximum possible (TCP-NCR's
   DupThresh) and the maximum measured reordering extent since the last
   RTO.  The reordering extent used by TCP-aNCR is by itself not a
   static absolute reordering extent, but a relative reordering extent
   (see Section 4).

4.  Appropriate Detection and Quantification Algorithms

   If the TCP-aNCR algorithm is implemented at the TCP sender, it MUST
   be implemented together with an appropriate packet reordering
   detection and quantification algorithm that is specified in a
   standards track or experimental RFC.

   Designers of reordering detection algorithms who want their
   algorithms to work together with the TCP-aNCR algorithm SHOULD reuse
   the variable 'ReorExtR' (relative reordering extent) with the
   semantics and defined values specified in
   [I-D.zimmermann-tcpm-reordering-detection].  A 'ReorExtR' given by
   the detection algorithm holds a value ranging from 0 to 1 that
   expresses the newly measured reordering sample as a fraction of the
   data in flight.  TCP-aNCR then saves this new fraction if it is
   greater than the current value.

5.  The TCP-aNCR Algorithm

   When both the Nagle algorithm [RFC0896] [RFC1122] and the TCP
   Selective Acknowledgment Option [RFC2018] are enabled for a
   connection, a TCP sender MAY employ the following TCP-aNCR algorithm
   to dynamically adapt TCP's congestion control and loss recovery
   [RFC5681] to the currently perceived packet reordering on the network
   path.

   Without the Nagle algorithm, there is no straightforward way to
   accurately calculate the number of outstanding segments in the
   network (and, therefore, no good way to derive an appropriate
   DupThresh) without adding state to the TCP sender.  A TCP connection
   that does not use the Nagle algorithm SHOULD NOT use TCP-aNCR.  The
   adaptation of TCP-aNCR to an implementation that carefully tracks the
   sequence numbers transmitted in each segment is considered future
   work.

   A necessary prerequisite for TCP-aNCR's adaptability is that a TCP
   sender has enabled an appropriate detection and quantification
   algorithm that complies with the requirements defined in Section 4.
   If such an algorithm is either non-existent or not used, the behavior
   of TCP-aNCR is completely analogous to the TCP-NCR algorithm as
   defined in [RFC4653].  If a TCP sender does implement TCP-aNCR, the
   implementation MUST follow the various specifications provided in
   Sections 5.1 to 5.7.

5.1.  Initialization during Connection Establishment

   After the completion of the TCP connection establishment, the
   following state constants and variables MUST be initialized in the
   TCP transmission control block for the given TCP connection:

   (C.1)  Depending on which variant of Extended Limited Transmit should
          be executed, the constant LT_F MUST be initialized as follows.
          For Careful Extended Limited Transmit:

             LT_F = 2/3

          For Aggressive Extended Limited Transmit:

             LT_F = 1/2

          This constant reflects the fraction of outstanding data
          (including data sent during Extended Limited Transmit) that
          must be SACKed before a retransmission is at the latest
          triggered.

   (C.2)  If TCP-aNCR should adaptively adjust the DupThresh to the
          current perceived packet reordering on the network path, then
          the variable 'ReorExtR', which stores the maximum relative
          reordering extent, MUST be initialized as:

             ReorExtR = 0

          Otherwise the dynamic adaptation of TCP-aNCR SHOULD be
          disabled by setting

             ReorExtR = -1

          A relative reordering extent of 0 results in the standard
          DupThresh of three duplicate ACKs, as defined in [RFC5681].  A
          fixed relative reordering extent of -1 results in the TCP-NCR
          behavior from [RFC4653].

5.2.  Initializing Extended Limited Transmit

   If the SACK scoreboard is empty upon the receipt of a duplicate ACK
   (i.e., the TCP sender has received no SACK information from the
   receiver), a TCP sender MUST enter Extended Limited Transmit by
   initializing the following five state variables in the TCP
   Transmission Control Block:

   (I.1)  The TCP sender MUST save the current outstanding data:

             FlightSizePrev = FlightSize

   (I.2)  The TCP sender MUST save the highest sequence number
          transmitted so far:

             recover = SND.NXT - 1

          Note: The state variable 'recover' from [RFC6582] can be
          reused, since NewReno TCP uses 'recover' at the initialization
          of a loss recovery procedure, whereas TCP-aNCR uses 'recover'
          *before* loss recovery.

   (I.3)  The TCP sender MUST initialize the variable 'skipped' that
          tracks the number of segments for which an ACK does not
          trigger a transmission during Careful Limited Transmit:

             skipped = 0

          During Aggressive Limited Transmit, 'skipped' is not used.

   (I.4)  The TCP sender MUST set DupThresh based on the current
          FlightSize:

             DupThresh = max (LT_F * (FlightSize / SMSS), 3)

          The lower bound of DupThresh = 3 is kept from [RFC5681]
          [RFC6675].

   (I.5)  If (ReorExtR != -1) holds, then the TCP sender MUST set
          DupThresh based on the relative reordering extent 'ReorExtR':

             DupThresh = max (min (DupThresh,
                         ReorExtR * (FlightSize / SMSS)), 3)

   In addition to the above steps, the incoming ACK MUST be processed
   with the (E) series of steps in Section 5.3.

5.3.  Executing Extended Limited Transmit

   On each ACK that a) arrives after TCP-aNCR has entered the Extended
   Limited Transmit phase (as outlined in Section 5.2) *and* b) carries
   new SACK information, *and* c) does *not* advance the cumulative ACK
   point, the TCP sender MUST use the following procedure.

   (E.1)  The TCP sender MUST update the SACK scoreboard and use the
          SetPipe() procedure from [RFC6675] to set the 'pipe' variable
          (which represents the number of bytes still considered "in the
          network").  Note: the current value of DupThresh MUST be used
          by SetPipe() to produce an accurate assessment of the amount
          of data still considered in the network.

   (E.2)  The TCP sender MUST initialize the variable 'burst' that
          tracks the number of segments that can at most be sent per ACK
          to the size of the Initial Window (IW) [RFC5681]:

             burst = IW

   (E.3)  If a) (cwnd - pipe - skipped >= 1 * SMSS) holds, *and* b) the
          receive window (rwnd) allows SMSS bytes of previously unsent
          data to be sent, *and* c) there are SMSS bytes of previously
          unsent data available for transmission, then the TCP sender
          MUST transmit one segment of SMSS bytes.  Otherwise, the TCP
          sender MUST skip to step (E.7).

   (E.4)  The TCP sender MUST increment 'pipe' by SMSS bytes and MUST
          decrement 'burst' by SMSS bytes to reflect the newly
          transmitted segment:

             pipe = pipe + SMSS
             burst = burst - SMSS

   (E.5)  If Careful Limited Transmit is used, 'skipped' MUST be
          incremented by SMSS bytes to ensure that the next SMSS bytes
          of SACKed data processed do not trigger a Limited Transmit
          transmission.

             skipped = skipped + SMSS

   (E.6)  If (burst > 0) holds, the TCP sender MUST return to step (E.3)
          to ensure that as many bytes as appropriate are transmitted.
          Otherwise, if more than IW bytes were SACKed by a single ACK,
          the TCP sender MUST skip to step (E.7).  The additional amount
          of data becomes available again by the next received duplicate
          ACK and the re-execution of SetPipe().

   (E.7)  The TCP sender MUST save the maximum amount of data that is
          considered to have been in the network during the last RTT:

             pipe_max = max (pipe, pipe_max)

   (E.8)  The TCP sender MUST set DupThresh based on the current
          FlightSize:

             DupThresh = max (LT_F * (FlightSize / SMSS), 3)

          The lower bound of DupThresh = 3 is kept from [RFC5681]
          [RFC6675].

   (E.9)  If (ReorExtR != -1) holds, then the TCP sender MUST set
          DupThresh based on the relative reordering extent 'ReorExtR':

             DupThresh = max (min (DupThresh,
                         ReorExtR * (FlightSize / SMSS)), 3)

5.4.  Terminating Extended Limited Transmit

   On the receipt of a duplicate ACK that a) arrives after TCP-aNCR has
   entered the Extended Limited Transmit phase (as outlined in
   Section 5.2) *and* b) advances the cumulative ACK point, the TCP
   sender MUST use the following procedure.

   The arrival of an acceptable ACK that advances the cumulative ACK
   point while in Extended Limited Transmit, but before loss recovery is
   triggered, signals that a series of duplicate ACKs was caused by
   reordering and not congestion.
   Therefore, Extended Limited Transmit
   will be either terminated or re-entered.

   (T.1)  If the received ACK extends not only the cumulative ACK point,
          but *also* carries new SACK information (i.e., the ACK is both
          an acceptable ACK and a duplicate ACK), the TCP sender MUST
          restart Extended Limited Transmit and MUST go to step (T.2).
          Otherwise, the TCP sender MUST terminate it and MUST skip to
          step (T.3).

   (T.2)  If the Cumulative Acknowledgment field of the received ACK
          covers more than 'recover' (i.e., SEG.ACK > recover), Extended
          Limited Transmit has transmitted one cwnd worth of data
          without any losses and the TCP sender MUST update the
          following state variables by

             FlightSizePrev = pipe_max
             pipe_max = 0

          and MUST go to step (I.2) to re-start Extended Limited
          Transmit.  Otherwise, if (SEG.ACK <= recover) holds, the TCP
          sender MUST go to step (I.3).  This ensures that in the event
          of a loss the cwnd reduction is based on a current value of
          FlightSizePrev.

   The following steps are executed only if the received ACK does *not*
   carry SACK information.  Extended Limited Transmit will be
   terminated.

   (T.3)  A TCP sender MUST set ssthresh to:

             ssthresh = max (cwnd, ssthresh)

          This step provides TCP-aNCR with a sense of "history".  If the
          next step (T.4) reduces the congestion window, this step
          ensures that TCP-aNCR will slow-start back to the operating
          point that was in effect before Extended Limited Transmit.

   (T.4)  A TCP sender MUST reset cwnd to:

             cwnd = FlightSize + SMSS

          This step ensures that cwnd is not significantly larger than
          the amount of data outstanding, a situation that would cause a
          line rate burst.

   (T.5)  A TCP sender is now permitted to transmit previously unsent
          data as allowed by cwnd, FlightSize, application data
          availability, and the receiver's advertised window.

5.5.  Entering Loss Recovery

   The receipt of an ACK that results in deeming the oldest outstanding
   segment lost via the algorithms in [RFC6675] terminates Extended
   Limited Transmit and initializes the loss recovery according to
   [RFC6675].  One slight change to [RFC6675] MUST be made, however.

   (Ret)  In Section 5, step (4.2) of [RFC6675] MUST be changed to:

             ssthresh = cwnd = (FlightSizePrev / 2)

          This ensures that the congestion control modifications are
          made with respect to the amount of data in the network before
          FlightSize was increased by Extended Limited Transmit.

   Once the algorithm in [RFC6675] takes over from Extended Limited
   Transmit, the DupThresh value MUST be held constant until the loss
   recovery phase terminates.

5.6.  Reordering Extent

   Whenever the additional detection and quantification algorithm (see
   Section 4) detects and quantifies a new reordering event, the TCP
   sender MUST update the state variable 'ReorExtR'.

   (Ext)  Let 'ReorExtR_New' be the newly determined relative reordering
          extent:

             ReorExtR = min (max (ReorExtR, ReorExtR_New), 1)

5.7.  Retransmission Timeout

   The expiration of the retransmission timer SHOULD be interpreted as
   an indication of a path characteristics change, and the TCP sender
   SHOULD reset DupThresh to the default value of three.

   (RTO)  If an RTO occurs and (ReorExtR != -1) (i.e., TCP-aNCR is used
          and not TCP-NCR), then a TCP sender SHOULD reset 'ReorExtR':

             ReorExtR = 0

6.  Protocol Steps in Detail

   Upon the receipt of the first duplicate ACK in the 'open' state (the
   SACK scoreboard is empty), the TCP sender starts to execute TCP-aNCR
   by entering the 'disorder' state and the initialization of Extended
   Limited Transmit.  First, the TCP sender saves the current amount of
   outstanding data as well as the highest sequence number transmitted
   so far (SND.NXT - 1) (steps (I.1) and (I.2)).
   In addition, if the TCP connection uses the Careful variant of
   Extended Limited Transmit (step (C.1)), the 'skipped' variable, which
   tracks the number of segments for which an ACK does not trigger a
   transmission during Careful Limited Transmit, is initialized to zero
   (step (I.3)).  The last step during the initialization is the
   determination of DupThresh.  Depending on whether TCP-aNCR has been
   configured during connection establishment to adaptively adjust to
   the currently perceived packet reordering on the path (step (C.2)),
   DupThresh is determined either exclusively based on the current
   FlightSize (as TCP-NCR [RFC4653] does) or additionally based on the
   relative reordering extent (steps (I.4) and (I.5)).

   Depending on which variant of Extended Limited Transmit should be
   executed, the constant LT_F must be set accordingly (step (C.1)).
   This constant reflects the fraction of outstanding data (including
   data sent during Extended Limited Transmit) that must be SACKed
   before a retransmission is triggered at the latest (which is the case
   when a DupThresh that is based on the relative reordering extent is
   larger than TCP-NCR's DupThresh).  Since Aggressive Limited Transmit
   sends a new segment for every segment known to have left the network,
   a total of approximately cwnd segments will be sent, and therefore
   ideally a total of approximately 2*cwnd segments will be outstanding
   when a retransmission is finally triggered.  DupThresh is then set to
   LT_F = 1/2 of 2*cwnd (or about 1 RTT's worth of data) (see step
   (I.4)).  The factor is different for Careful Limited Transmit,
   because the sender only transmits one new segment for every two
   segments that are SACKed and therefore will ideally have a maximum of
   1.5*cwnd segments outstanding when the retransmission is triggered.
   Hence, the required threshold is LT_F = 2/3 of 1.5*cwnd to delay the
   retransmission by roughly 1 RTT.

   For each duplicate ACK received in the 'disorder' state that is not
   an acceptable ACK, i.e., that carries new SACK information but does
   not advance the cumulative ACK point, Extended Limited Transmit is
   executed.  First, the SACK scoreboard is updated and, based on the
   current value of DupThresh, the amount of outstanding data is
   determined (step (E.1)).  Furthermore, the state variable 'burst',
   which indicates the maximum number of segments that can be sent for
   each received ACK, is initialized to the size of the initial window
   [RFC6928] (step (E.2)).  If more than IW bytes were SACKed by a
   single ACK, the additional amount of data becomes available again
   with the next received duplicate ACK and the re-execution of
   SetPipe() (step (E.1)).

   Next, if new data is available for transmission and both the
   congestion window and the receiver window allow SMSS bytes of
   previously unsent data to be sent, a segment of SMSS bytes is sent
   (step (E.3)).  Subsequently, the corresponding state variables
   'pipe', 'burst' and - optionally - 'skipped' are updated (steps (E.4)
   and (E.5)).  If no further segment may be sent, either due to the
   current size of the congestion and receiver windows (step (E.2)) or
   due to the current value of 'burst' (step (E.5)), the processing of
   the ACK is terminated.  Provided that the amount of data that is
   currently considered to be in the network is greater than the
   previously stored value, this new value is stored for later use (step
   (E.7)).  Finally, to take into account the new data sent, DupThresh
   is updated (steps (E.6) and (E.7)).

   The arrival of an acceptable ACK in the 'disorder' state that
   advances the cumulative ACK point during Extended Limited Transmit
   signals that a series of duplicate ACKs was caused by reordering and
   not congestion.
   Therefore, the receipt of an acceptable ACK that does not carry any
   SACK information terminates Extended Limited Transmit (step (T.1)).
   The slow start threshold is set to the maximum of its current value
   and the current value of cwnd (step (T.3)).  The cwnd itself is set
   to the current value of FlightSize plus one segment (step (T.4)).  As
   a result, the congestion window is not significantly larger than the
   current amount of outstanding data, so that a burst of data is
   effectively prevented.  If new data is available for transmission and
   the new values of cwnd and rwnd both allow SMSS bytes of previously
   unsent data to be sent, a segment is sent (step (T.5)).

   On the other hand, if the received ACK acknowledges new data not only
   cumulatively but also selectively - the ACK carries new SACK
   information - Extended Limited Transmit is not terminated but
   re-entered (step (T.1)).  If the Cumulative Acknowledgment field of
   the received ACK covers more than 'recover', one cwnd's worth of data
   has been transmitted during Extended Limited Transmit without any
   packet loss.  Therefore, FlightSizePrev, the amount of outstanding
   data saved at the beginning of Extended Limited Transmit (step
   (I.1)), is considered outdated and is updated (step (T.2)).  This
   step ensures that in the event of packet loss, the reduction of the
   cwnd is based on an up-to-date value, which reflects the number of
   bytes outstanding in the network (see Section 7).  Finally,
   regardless of whether or not 'recover' is covered, Extended Limited
   Transmit is re-entered.

   The second case that leads to a termination of Extended Limited
   Transmit is the receipt of an ACK that signals via the algorithm in
   [RFC6675] that the oldest outstanding segment is considered lost.
   If DupThresh or more duplicate ACKs are received, or the oldest
   outstanding segment is deemed lost via the function IsLost() of
   [RFC6675], Extended Limited Transmit is terminated and SACK-based
   loss recovery is entered [RFC6675].  Once the algorithm in [RFC6675]
   takes over from Extended Limited Transmit, the DupThresh value MUST
   be held constant until loss recovery is terminated.  The process of
   loss recovery itself is not changed by TCP-aNCR.  The only exception
   is a slight change to step (4.2) of RFC 6675 [RFC6675], which ensures
   that the adjustment made by the congestion control - halving the
   congestion window - is made with respect to the initial amount of
   outstanding data while Extended Limited Transmit is executed (step
   (Ret)).  The use of FlightSize at this point would no longer be valid
   since the amount of outstanding data may double by executing Extended
   Limited Transmit.

7.  Discussion of TCP-aNCR

   The specification of TCP-aNCR represents an incremental update of RFC
   4653 [RFC4653].  All changes made by TCP-aNCR can be divided into two
   categories.  On the one hand, there are changes that implement
   TCP-aNCR's ability to dynamically adapt TCP congestion control and
   loss recovery [RFC5681] to the currently perceived packet reordering
   on the network path.  These include the use of a variable DupThresh
   and the use of a relative reordering extent.  On the other hand,
   there are changes that correct weaknesses of the original TCP-NCR
   algorithm and that are independent of TCP-aNCR's adaptability.  These
   concern packet reordering during slow start, the prevention of
   bursts, and the persistent receipt of SACKs.

7.1.  Variable Duplicate Acknowledgment Threshold

   The central point of the TCP-aNCR algorithm is the usage of a
   DupThresh that is adaptable to the perceived packet reordering on the
   network path.
   Based on the actual amount of outstanding data, TCP-NCR's DupThresh
   represents roughly the largest amount of time a Fast Retransmit can
   safely be delayed before a costly retransmission timeout may be
   triggered.  Therefore, to avoid an RTO, TCP-aNCR's reordering-aware
   DupThresh is bounded from above by the one calculated in TCP-NCR
   (steps (I.5) and (E.9)).  This decouples the avoidance of spurious
   Fast Retransmits from the avoidance of RTOs.  It allows TCP-aNCR to
   react quickly and efficiently to packet reordering.  The DupThresh
   always corresponds to the minimum of the largest possible and the
   largest detected reordering.  With constant packet reordering in
   terms of rate and delay, TCP-aNCR gives a DupThresh based on the
   relative reordering extent with an optimal delay for every bandwidth-
   delay product.  If TCP-aNCR should not adaptively adjust the
   DupThresh to the currently perceived packet reordering on the network
   path (for example, because an appropriate detection and
   quantification algorithm is not implemented), the dynamic adaptation
   of TCP-aNCR can be disabled, so that TCP-aNCR behaves like TCP-NCR
   [RFC4653].

7.2.  Relative Reordering Extent

   Whenever a new reordering event is detected and presented to TCP-aNCR
   in the form of a relative reordering extent 'ReorExtR', TCP-aNCR
   saves and uses the new 'ReorExtR' if it is larger than the old one
   (step (Ext)).  The upper bound of 1 ensures that no excessively large
   value is used.  A 'ReorExtR' larger than one means that more than
   FlightSize bytes would have been received out of order before the
   reordered segment is received.  The delay caused by the reordering is
   thus longer than the RTT of the TCP connection.  Since the RTT is
   roughly the time a Fast Retransmit can safely be delayed before the
   retransmission has to be sent to avoid an RTO, a maximum 'ReorExtR'
   of one seems to be a suitable value.
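
   The update and invalidation rules for 'ReorExtR' (step (Ext) in
   Section 5.6 and step (RTO) in Section 5.7) are small enough to sketch
   directly.  The following Python fragment is an illustrative sketch
   only and not part of the specification: the function names, the
   constant NCR_ONLY, and the sample measurements are hypothetical, and
   the sketch assumes that step (Ext) is only invoked while adaptation
   is enabled.

```python
NCR_ONLY = -1.0  # hypothetical name for the "-1" value tested in step (RTO)


def update_reor_ext(reor_ext_r: float, reor_ext_r_new: float) -> float:
    """Step (Ext): keep the largest relative extent seen, capped at 1."""
    return min(max(reor_ext_r, reor_ext_r_new), 1.0)


def on_rto(reor_ext_r: float) -> float:
    """Step (RTO): reset the stored extent unless adaptation is disabled."""
    return 0.0 if reor_ext_r != NCR_ONLY else reor_ext_r


reor_ext_r = 0.2
for sample in (0.5, 0.3, 1.7):          # hypothetical measurements
    reor_ext_r = update_reor_ext(reor_ext_r, sample)
    print(reor_ext_r)                   # 0.5, 0.5, 1.0
reor_ext_r = on_rto(reor_ext_r)         # back to 0.0 after a timeout
```

   In this run the second sample (0.3) leaves the stored value
   unchanged, and the third sample (1.7) is clamped to the upper bound
   of 1.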

   The expiration of the retransmission timer is interpreted by TCP-aNCR
   as an indication of a change in path characteristics.  Hence, the
   saved 'ReorExtR' is assumed to be outdated and is invalidated (step
   (RTO)).  As a consequence, the relative reordering extent 'ReorExtR'
   increases monotonically between two successive retransmission
   timeouts and corresponds to the maximum measured reordering extent
   since the last RTO.  Other approaches would be an exponentially-
   weighted moving average (EWMA) or a histogram of the last n
   reordering extents.  The main drawback of an EWMA, however, is that
   on average half of the detected reordering events would be larger
   than the saved reordering extent.  Thus, only half of the spurious
   retransmits could be avoided.  Applying a histogram could largely
   avoid the disadvantages of an EWMA, but it would result in an
   unacceptable increase in memory usage.

   In combination with the invalidation after an RTO, the advantage of
   using the maximum is its low complexity as well as its fast
   convergence to the actual maximum reordering on the network path.  As
   a result, the negative impact that packet reordering has on TCP's
   congestion control and loss recovery can be avoided.  A disadvantage
   of using a maximum is that if the delay caused by the reordering
   decreases over the lifetime of the TCP connection, a Fast Retransmit
   is delayed for an unnecessarily long time.  Nevertheless, since the
   negative impact reordering has on TCP's congestion control and loss
   recovery is more substantial than the disadvantage of a longer delay,
   a decrease of the ReorExtR between RTOs is considered inappropriate.

7.3.  Reordering during Slow Start

   The arrival of an acceptable ACK during Extended Limited Transmit
   signals that previously received duplicate ACKs are the result of
   packet reordering and not congestion, so that Extended Limited
   Transmit is completed accordingly.  Upon the termination of Extended
   Limited Transmit, and especially when using the Careful variant,
   TCP-NCR (as well as TCP-aNCR) may be in a situation where the entire
   cwnd is not being utilized.  Therefore, to mitigate a potential burst
   of segments, in step (T.2) TCP-NCR sets the slow start threshold to
   the FlightSize that was saved at the beginning of Extended Limited
   Transmit [RFC4653].  This step is intended to ensure that TCP-NCR
   slow starts back to the operating point in use before Extended
   Limited Transmit.

   Unfortunately, the assignment in step (T.2) is only correct if the
   TCP sender was already in congestion avoidance at the time Extended
   Limited Transmit was entered.  Otherwise, if the TCP sender was
   instead in slow start, the value of ssthresh is greater than the
   saved FlightSize so that slow start concludes prematurely.  This
   behavior can leave much of the network resources idle, and a long
   time may be needed in order to use the full capacity.  To mitigate
   this issue, TCP-aNCR sets the slow start threshold to the maximum of
   its current value and the current cwnd (step (T.3)).  This continues
   slow start after a reordering event that happens during slow start.

7.4.  Preventing Bursts

   In cases where a single new SACK covers more than one segment - this
   can happen either due to packet loss or packet reordering on the ACK
   path - TCP-NCR [RFC4653] sends an undesirable burst of data.
   TCP-aNCR solves this problem by limiting the burst size - the maximum
   amount of data that can be sent in response to a single SACK - to the
   Initial Window [RFC5681] while executing Extended Limited Transmit
   (steps (E.2), (E.4), and (E.6)).
   Since IW represents the amount of data that a TCP sender is able to
   send into the network safely without knowing its characteristics, it
   is a reasonable value for the burst size, too.  If more than IW bytes
   were SACKed by a single ACK, the additional amount of data becomes
   available again with the next received duplicate ACK.  Thus, the
   transmission of new segments is spread over the next received ACKs,
   so that micro bursts - a characteristic of packet reordering in the
   reverse path - are largely compensated.

   Another situation that causes undesired bursts of segments with
   TCP-NCR is the receipt of an acceptable ACK during Careful Extended
   Limited Transmit.  If multiple segments from a single window of data
   are delayed by packet reordering, the first acceptable ACK after
   entering the 'disorder' state typically acknowledges data not only
   cumulatively but also selectively.  Hence, Extended Limited Transmit
   is not terminated but restarted.  If the segments are delayed by the
   reordering for almost one RTT, then the amount of outstanding data in
   the network ('pipe') is approximately half the amount of data saved
   at the beginning of Extended Limited Transmit (FlightSizePrev).  If
   the sequence numbers of the delayed segments are close to each other
   in the sequence number space, the acceptable ACK acknowledges only a
   small amount of data, so that FlightSize is still large.  As a
   result, TCP-NCR sets the cwnd to FlightSizePrev in step (T.1).  Since
   'pipe' is only half of FlightSizePrev due to Careful Extended Limited
   Transmit, TCP-NCR sends a burst of almost half a cwnd's worth of data
   in the subsequent step (T.3).
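
   As a rough illustration of the burst just described, the following
   Python sketch plugs hypothetical numbers into the relationship
   between cwnd and 'pipe'.  It is not taken from either specification:
   the function names, the segment size, the window of 40 segments, and
   the burst limit of 10 segments are assumptions chosen for the
   example, and the capped variant deliberately ignores 'skipped' and
   the receiver window.

```python
SMSS = 1460  # bytes; hypothetical sender maximum segment size


def ncr_restart_burst(flight_size_prev: int, pipe: int) -> int:
    """Bytes that may leave back-to-back once cwnd is set to FlightSizePrev
    while only 'pipe' bytes are considered to be in the network."""
    cwnd = flight_size_prev          # case described above: cwnd = FlightSizePrev
    return max(cwnd - pipe, 0)


def capped_per_ack(cwnd: int, pipe: int, burst_limit: int) -> int:
    """Simplified per-ACK send allowance with a burst cap (an assumption,
    not the exact equations of the TCP-aNCR steps)."""
    return min(max(cwnd - pipe, 0), burst_limit)


flight_size_prev = 40 * SMSS         # hypothetical window of 40 segments
pipe = flight_size_prev // 2         # about half, as with Careful Limited Transmit
print(ncr_restart_burst(flight_size_prev, pipe) // SMSS)          # 20 segments
print(capped_per_ack(flight_size_prev, pipe, 10 * SMSS) // SMSS)  # 10 segments
```

   With these numbers the uncapped case releases 20 segments at once,
   whereas a per-ACK cap of 10 segments spreads the same data over at
   least two ACKs.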

   Note: Even if the sequence numbers of the delayed segments are not
   close to each other in the sequence number space and cwnd is set in
   step (T.1) to FlightSize + SMSS, a burst of data will emerge due to
   re-entering Extended Limited Transmit, because TCP-NCR sets 'skipped'
   to zero in step (I.2) and uses FlightSizePrev in step (E.2).

   TCP-aNCR prevents such a burst by making a clear differentiation
   between terminating Extended Limited Transmit and restarting Extended
   Limited Transmit (step (T.1)).  Only the first case causes the
   congestion window to be set to the current FlightSize plus one
   segment.  In the latter case, when re-entering Extended Limited
   Transmit, the congestion window is not adjusted and the original step
   (T.1) of the TCP-NCR specification is omitted.  The transmission of
   new data is then only performed after re-entering Extended Limited
   Transmit in step (E.2) of the TCP-aNCR specification, where the
   actual burst mitigation takes place.

7.5.  Persistent receiving of Selective Acknowledgments

   In some inconvenient cases it could happen that a TCP sender
   persistently receives SACK information due to reordering on the
   network path, e.g., if the segments are delayed frequently and/or for
   a long time by the packet reordering.  With TCP-NCR, the persistent
   reception of SACKs causes Extended Limited Transmit to be entered
   with the first received duplicate ACK but never to be terminated if
   no packet loss occurs - for every received ACK, TCP-NCR either
   follows steps (E.1) to (E.6) or steps (T.1) to (T.4).  In particular,
   TCP-NCR a) executes step (T.4) for every acceptable ACK and b) never
   executes step (I.1) again.  Hence, the amount of outstanding data
   saved at the beginning of Extended Limited Transmit, FlightSizePrev,
   is never updated.

   A problem that emerges in this context is that during Extended
   Limited Transmit TCP-NCR determines the transmission of new segments
   in step (E.2) solely on the basis of FlightSizePrev, so that an
   interim increase of the cwnd is not considered (according to
   [RFC5681], the congestion window is increased for every received
   acceptable ACK that advances the cumulative ACK point, no matter
   whether it carries SACK information or not).  As a result, TCP-NCR
   can only very slowly determine the available capacity of the
   communication path.

   TCP-aNCR addresses this problem by limiting the amount of data that
   is allowed to be sent into the network during Extended Limited
   Transmit not on the basis of FlightSizePrev, but on the size of the
   congestion window.  The equation in step (E.3) of the TCP-aNCR
   specification is therefore equal to the one used in [RFC6675] (except
   for the 'skipped' variable).  If an acceptable ACK is received during
   the execution of Extended Limited Transmit, re-entering Extended
   Limited Transmit makes any increase in cwnd immediately available.
   Hence, even when SACKs are received persistently, the available
   capacity of the communication path can be determined quickly.

   Another problem that results from persistently receiving SACKs, and
   that is related to the increase in cwnd in response to received
   acceptable ACKs, is the reduction of cwnd due to a packet loss.  When
   a packet is considered lost, the congestion control adjustment is
   done with respect to the amount of outstanding data at the beginning
   of Extended Limited Transmit, FlightSizePrev (step (Ret)).  As in the
   previous case, an increase in cwnd is again not taken into account.
   A simple solution to the problem would be to perform the window
   reduction not on the basis of FlightSizePrev but, analogous to step
   (E.2), based on the current size of cwnd.
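
   To make the difference concrete, the following Python sketch
   compares the reduction of step (Ret) with the simple cwnd-based
   alternative just mentioned.  It is a worked example under stated
   assumptions, not part of the specification: the function names and
   the numbers are hypothetical, the alternative is read here as a plain
   halving of cwnd, and integer division stands in for the division in
   step (Ret).

```python
SMSS = 1460  # bytes; hypothetical sender maximum segment size


def reduce_per_step_ret(flight_size_prev: int) -> int:
    """Step (Ret): ssthresh = cwnd = FlightSizePrev / 2."""
    return flight_size_prev // 2


def reduce_from_cwnd(cwnd: int) -> int:
    """Simple alternative (assumed reading): halve the current cwnd."""
    return cwnd // 2


flight_size_prev = 20 * SMSS   # saved when Extended Limited Transmit started
cwnd = 28 * SMSS               # grown meanwhile by acceptable ACKs

print(reduce_per_step_ret(flight_size_prev) // SMSS)  # 10 segments
print(reduce_from_cwnd(cwnd) // SMSS)                 # 14 segments
```

   With these numbers a stale FlightSizePrev reduces the window to 10
   segments, whereas the simple cwnd-based alternative would reduce it
   to 14 segments.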

   A problem with this solution is that cwnd can potentially be
   increased although the TCP connection is limited by the application
   and not by cwnd.  Although [RFC2861] specifies that an increase of
   cwnd is only applicable if cwnd is fully utilized, this behavior is
   not specified by any standards track document.  But even this
   conservative increase behavior is not guaranteed to be conservative
   enough.  If, from a single window of data, some segments are delayed
   and others are lost, cwnd would first be increased in response to
   each received acceptable ACK and subsequently be reduced due to the
   lost segments, which would no longer result in a halving of the cwnd.

   The solution proposed by TCP-aNCR reuses the state variable 'recover'
   from [RFC6582] and adapts the approach taken by NewReno TCP and SACK
   TCP to properly detect, with the help of this state variable, the end
   of one loss recovery phase, which allows multiple losses from a
   single window of data to be recovered efficiently.  Therefore, when
   entering the 'disorder' state and starting Extended Limited Transmit,
   TCP-aNCR saves the highest sequence number sent so far in 'recover'.
   If a received acceptable ACK covers more than 'recover', one cwnd's
   worth of data has been transmitted during Extended Limited Transmit
   without any packet loss.  Hence, FlightSizePrev can be updated to
   'pipe_max', which reflects the maximum amount of data that is
   considered to have been in the network during the last RTT.  This
   update takes an interim increase in cwnd into account, so that in
   case of packet loss, the reduction in cwnd can be based on the
   current value of FlightSizePrev.

8.  Interoperability Issues

   TCP-aNCR requires that both the TCP Selective Acknowledgment Option
   [RFC2018] and a SACK-based loss recovery scheme compatible with the
   one given in [RFC6675] are used by the TCP sender.
   Hence, compatibility with both specifications is REQUIRED.

8.1.  Early Retransmit

   The specification of TCP-aNCR in this document and the Early
   Retransmit algorithm specified in [RFC5827] define orthogonal methods
   to modify DupThresh.  Early Retransmit allows the TCP sender to
   reduce the number of duplicate ACKs required to trigger a Fast
   Retransmit below the standard DupThresh of three, if FlightSize is
   less than 4*SMSS and no new segment can be sent.  In contrast,
   TCP-aNCR allows DupThresh to be increased, starting from the minimum
   of three duplicate ACKs, beyond the standard of three duplicate ACKs
   to make TCP more robust to packet reordering, if the amount of
   outstanding data is sufficient to reach the increased DupThresh to
   trigger Fast Retransmit and Fast Recovery.

8.2.  Congestion Window Validation

   The increase of the congestion window during application-limited
   periods can lead to an invalidation of the congestion window, in that
   it no longer reflects current information about the state of the
   network, if the congestion window might never have been fully
   utilized during the last RTT.  According to [RFC2861], the congestion
   window should, first, only be increased during slow start or
   congestion avoidance if the cwnd has been fully utilized by the TCP
   sender and, second, gradually be reduced during each RTT in which the
   cwnd was not fully used.

   A problem that arises in this context is that during Careful Extended
   Limited Transmit, cwnd is not fully utilized due to the variable
   'skipped' (see step (E.3)), so that - strictly following [RFC2861] -
   the congestion window should not be increased upon the receipt of an
   acceptable ACK.  A trivial solution to this problem is to include the
   variable 'skipped' in the calculation of [RFC2861] to determine
   whether the congestion window is fully utilized or not.

8.3.  Reactive Response to Packet Reordering

   As a proactive scheme that aims to prevent a priori the negative
   impact that packet reordering has on TCP, TCP-aNCR can conceptually
   be combined with any reactive response to packet reordering, which
   attempts to mitigate the negative effects of reordering a posteriori.
   This is because the modifications of TCP-aNCR to the standard TCP
   congestion control and loss recovery [RFC6675] are implemented in the
   'disorder' state and are performed by the TCP sender before it enters
   loss recovery, while reactive responses to packet reordering
   generally operate after entering loss recovery, by undoing the
   unnecessary changes to the congestion control state.

   If unnecessary changes to the congestion control state are undone
   after loss recovery, which is typically the case if a spurious Fast
   Retransmit is detected based on the DSACK option [RFC3708][RFC4015],
   since the first ACK carrying a DSACK option usually arrives at a TCP
   sender only after loss recovery has already terminated, it might
   happen that the original value of the congestion window is restored
   at a time at which the TCP sender is already back in the 'disorder'
   state and executing Extended Limited Transmit.  While this is
   basically compatible with the TCP-aNCR specification - the undo
   simply represents an increase of the congestion window - some care
   must be taken that the combination of the algorithms does not lead to
   unwanted behavior.

8.4.  Buffer Auto-Tuning

   Although all modifications of the TCP-aNCR algorithm are implemented
   in the TCP sender, the receiver also potentially has a part to play.
   If some segments from a single window of data are delayed by the
   packet reordering in the network, all segments that are received out
   of order have to be queued in the receive buffer until the holes in
   the sequence number space have been closed and the data can be
   delivered to the receiving application.  In the worst case, which
   occurs if the TCP sender uses Aggressive Limited Transmit and the
   reordering delay is close to the RTT, TCP-aNCR increases the
   receiver's buffering requirement by up to an extra cwnd.  Therefore,
   to maximize the benefits from TCP-aNCR, receivers should advertise a
   large window - ideally by using buffer auto-tuning algorithms - to
   absorb the extra out-of-order data.  In the case that the additional
   buffer requirements are not met, the use of the above algorithm takes
   into account the reduced advertised window - with a corresponding
   loss in robustness to packet reordering.

9.  Related Work

   Over the past few years, several solutions have been proposed to
   improve the performance of TCP in the face of packet reordering.
   These schemes generally fall into one of two categories (with some
   overlap): mechanisms that try to prevent spurious retransmits from
   happening (proactive schemes) and mechanisms that try to detect
   spurious retransmits and undo the needless congestion control state
   changes that have been made (reactive schemes).

   [I-D.blanton-tcp-reordering], [Zha+03] and [LM05] attempt to prevent
   packet reordering from triggering spurious retransmits by using
   various algorithms to approximate the DupThresh required to
   disambiguate loss and reordering over a given network path at a given
   time.  This basic principle is also used in TCP-aNCR.
   While [I-D.blanton-tcp-reordering] describes four basic approaches on
   how to increase the DupThresh and discusses the pros and cons of
   these approaches, [Zha+03] presents a relatively complex algorithm
   that saves the reordering extents in a histogram and calculates the
   DupThresh in such a way that a certain percentage of samples is
   smaller than the DupThresh.  [LM05] uses an EWMA for the same
   purpose.  By design, neither algorithm prevents all spurious
   retransmissions.

   In contrast to the above-mentioned algorithms, Linux [Linux]
   implements a proactive scheme that sets the DupThresh to the highest
   detected reordering and resets it only upon an RTO.  To avoid a
   costly retransmission timeout due to the increased DupThresh, Linux
   first implements an extension of the Limited Transmit algorithm,
   second limits the DupThresh to an upper bound of 127 duplicate ACKs,
   and third prematurely enters loss recovery if too few segments are in
   flight to reach the DupThresh and no additional segments can be sent.
   The last change in particular is commendable since, besides TCP-NCR,
   none of the algorithms described in this section mentions a similar
   concern.

   [Boh+06] and [Bha+04] present proactive schemes based on timers by
   which the DupThresh is ignored altogether.  After the timer has
   expired, TCP initiates loss recovery.  In [Bha+04] this timer has a
   length of one RTT and is started when the first duplicate ACK is
   received, whereas the approach taken in [Boh+06] relies solely on
   timers to detect packet loss without taking into account any other
   congestion signals such as duplicate ACKs.  It assigns a timestamp to
   each segment sent and retransmits the segment if the corresponding
   timer fires.

   TCP-NCR [RFC4653] tries to prevent spurious retransmits in a way
   similar to [I-D.blanton-tcp-reordering] or [Zha+03], as it delays a
   retransmission to disambiguate loss and reordering.  However, TCP-NCR
   takes a simplified approach by simply delaying a retransmission by an
   amount based on the current cwnd (in comparison to standard TCP),
   while the other schemes use relatively complex algorithms in an
   attempt to derive a more precise value for DupThresh that depends on
   the current patterns of packet reordering.  Many of the features
   offered by TCP-NCR have been taken into account while designing
   TCP-aNCR.

   Besides the proactive schemes, several other schemes have been
   developed to detect and mitigate needless retransmissions after the
   fact.  The Eifel detection algorithm [RFC3522], the detection based
   on DSACKs [RFC3708], and the F-RTO scheme [RFC5682] represent
   approaches to detect spurious retransmissions, while the Eifel
   response algorithm [RFC4015], [I-D.blanton-tcp-reordering], and Linux
   [Linux] present or implement algorithms to mitigate the changes these
   events made to the congestion control state.  As discussed in
   Section 8.3, TCP-aNCR could be used in conjunction with these
   algorithms, with TCP-aNCR attempting to prevent spurious retransmits
   and some other scheme kicking in if the prevention failed.

10.  IANA Considerations

   This memo includes no request to IANA.

11.  Security Considerations

   By taking dedicated actions so that the packet reordering in the
   network, as perceived through the relative and absolute reordering
   extent, is either underestimated or overestimated, an attacker or
   misbehaving TCP receiver has, with regard to TCP's congestion
   control, two options to bias a TCP-aNCR sender.
   An underestimation of the present packet reordering in the network
   occurs if, for example, a misbehaving TCP receiver acknowledges
   segments while they are actually still in flight, causing holes in
   the sequence number space of the SACK scoreboard to be closed
   prematurely.  With regard to TCP-aNCR, the result of an
   underestimated packet reordering is a DupThresh that is too small,
   resulting in a premature execution of loss recovery.  In the context
   of TCP's congestion control, the effects of such attacks are limited
   since the lower bound of TCP-aNCR's DupThresh is the default value of
   three duplicate ACKs [RFC5681], so that in the worst case TCP-aNCR
   behaves like TCP SACK [RFC6675].

   In contrast to an underestimation, an overestimation of the packet
   reordering in the network occurs if, for example, a misbehaving TCP
   receiver continues to send SACKs for subsequent segments before it
   sends an acceptable ACK for the delayed segment that it has actually
   already received, so that the hole in the sequence number space of
   the SACK scoreboard is closed later.  In the context of TCP-aNCR, the
   result of such an overestimation is a DupThresh that is too large, so
   that in the case of a packet loss TCP's loss recovery is executed
   later than necessary.  Similar to the previous case, the effects of a
   delayed entry into loss recovery are limited.  On the one hand,
   TCP-NCR's DupThresh is used as an upper bound for TCP-aNCR's variable
   DupThresh, so that the entry into loss recovery and the adaptation of
   the congestion window may be delayed by at most one RTT.  On the
   other hand, such a limited delay of the congestion control adjustment
   has, even in the worst case, only a limited impact on the performance
   of a TCP connection and has generally been regarded as safe for use
   on the Internet [Ban+01].

12.  Acknowledgments

   The authors would like to thank Daniel Slot for his TCP-NCR
   implementation in Linux.  We also thank the flowgrind [Flowgrind]
   authors and contributors for their performance measurement tool,
   which gives us a powerful tool to analyze TCP's congestion control
   and loss recovery behavior in detail.

13.  References

13.1.  Normative References

   [I-D.zimmermann-tcpm-reordering-detection]
              Zimmermann, A., Schulte, L., Wolff, C., and A. Hannemann,
              "Detection and Quantification of Packet Reordering with
              TCP", draft-zimmermann-tcpm-reordering-detection-01 (work
              in progress), November 2013.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3042]  Allman, M., Balakrishnan, H., and S. Floyd, "Enhancing
              TCP's Loss Recovery Using Limited Transmit", RFC 3042,
              January 2001.

   [RFC4653]  Bhandarkar, S., Reddy, A., Allman, M., and E. Blanton,
              "Improving the Robustness of TCP to Non-Congestion
              Events", RFC 4653, August 2006.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.

   [RFC6582]  Henderson, T., Floyd, S., Gurtov, A., and Y. Nishida, "The
              NewReno Modification to TCP's Fast Recovery Algorithm",
              RFC 6582, April 2012.

   [RFC6675]  Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M.,
              and Y. Nishida, "A Conservative Loss Recovery Algorithm
              Based on Selective Acknowledgment (SACK) for TCP",
              RFC 6675, August 2012.

   [RFC6928]  Chu, J., Dukkipati, N., Cheng, Y., and M. Mathis,
              "Increasing TCP's Initial Window", RFC 6928, April 2013.

13.2.  Informative References

   [Ban+01]   Bansal, D., Balakrishnan, H., Floyd, S., and S. Shenker,
              "Dynamic Behavior of Slowly Responsive Congestion Control
              Algorithms", Proceedings of the Conference on
              Applications, Technologies, Architectures, and Protocols
              for Computer Communication (SIGCOMM'01) pp. 263-274,
              September 2001.

   [Bha+04]   Bhandarkar, S., Sadry, N., Reddy, A., and N. Vaidya, "TCP-
              DCR: A Novel Protocol for Tolerating Wireless Channel
              Errors", IEEE Transactions on Mobile Computing vol. 4, no.
              5, pp. 517-529, September 2005.

   [Boh+06]   Bohacek, S., Hespanha, J., Lee, J., Lim, C., and K.
              Obraczka, "A New TCP for Persistent Packet Reordering",
              IEEE/ACM Transactions on Networking vol. 2, no. 14, pp.
              369-382, April 2006.

   [Flowgrind]
              "Flowgrind Home Page", .

   [I-D.blanton-tcp-reordering]
              Blanton, E., Dimond, R., and M. Allman, "Practices for TCP
              Senders in the Face of Segment Reordering",
              draft-blanton-tcp-reordering-00 (work in progress),
              February 2003.

   [LM05]     Leung, C. and C. Ma, "Enhancing TCP Performance to
              Persistent Packet Reordering", KICS Journal of
              Communications and Networks vol. 7, no. 3, pp. 385-393,
              September 2005.

   [Linux]    "The Linux Project", .

   [RFC0896]  Nagle, J., "Congestion control in IP/TCP internetworks",
              RFC 896, January 1984.

   [RFC1122]  Braden, R., "Requirements for Internet Hosts -
              Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC2861]  Handley, M., Padhye, J., and S. Floyd, "TCP Congestion
              Window Validation", RFC 2861, June 2000.

   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
              Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
              Zhang, L., and V. Paxson, "Stream Control Transmission
              Protocol", RFC 2960, October 2000.

   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
              for TCP", RFC 3522, April 2003.

   [RFC3708]  Blanton, E. and M. Allman, "Using TCP Duplicate Selective
              Acknowledgement (DSACKs) and Stream Control Transmission
              Protocol (SCTP) Duplicate Transmission Sequence Numbers
              (TSNs) to Detect Spurious Retransmissions", RFC 3708,
              February 2004.

   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
              for TCP", RFC 4015, February 2005.

   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
              Spurious Retransmission Timeouts with TCP", RFC 5682,
              September 2009.

   [RFC5827]  Allman, M., Avrachenkov, K., Ayesta, U., Blanton, J., and
              P. Hurtig, "Early Retransmit for TCP and Stream Control
              Transmission Protocol (SCTP)", RFC 5827, May 2010.

   [Zha+03]   Zhang, M., Karp, B., Floyd, S., and L. Peterson, "RR-TCP:
              A Reordering-Robust TCP with DSACK", Proceedings of the
              11th IEEE International Conference on Network Protocols
              (ICNP'03) pp. 95-106, November 2003.

Appendix A.  Changes from previous versions of the draft

   This appendix should be removed by the RFC Editor before publishing
   this document as an RFC.

A.1.  Changes from draft-zimmermann-tcpm-reordering-reaction-00

   o  Improved the wording throughout the document.

   o  Replaced and updated some references.

Authors' Addresses

   Alexander Zimmermann
   NetApp, Inc.
1292 Sonnenallee 1 1293 Kirchheim 85551 1294 Germany 1296 Phone: +49 89 900594712 1297 Email: alexander.zimmermann@netapp.com 1299 Lennart Schulte 1300 Aalto University 1301 Otakaari 5 A 1302 Espoo 02150 1303 Finland 1305 Phone: +358 50 4355233 1306 Email: lennart.schulte@aalto.fi 1308 Carsten Wolff 1309 credativ GmbH 1310 Hohenzollernstrasse 133 1311 Moenchengladbach 41061 1312 Germany 1314 Phone: +49 2161 4643 182 1315 Email: carsten.wolff@credativ.de 1317 Arnd Hannemann 1318 credativ GmbH 1319 Hohenzollernstrasse 133 1320 Moenchengladbach 41061 1321 Germany 1323 Phone: +49 2161 4643 134 1324 Email: arnd.hannemann@credativ.de