idnits 2.17.1 draft-zimmermann-tcpm-reordering-detection-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (November 18, 2013) is 3784 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Outdated reference: A later version (-02) exists of draft-zimmermann-tcpm-reordering-reaction-00 ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 1323 (Obsoleted by RFC 7323) -- Obsolete informational reference (is this intentional?): RFC 896 (Obsoleted by RFC 7805) -- Obsolete informational reference (is this intentional?): RFC 2960 (Obsoleted by RFC 4960) Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 TCP Maintenance and Minor Extensions A. Zimmermann 3 (TCPM) WG NetApp, Inc. 4 Internet-Draft L. Schulte 5 Intended status: Experimental Aalto University 6 Expires: May 22, 2014 C. Wolff 7 A. Hannemann 8 credativ GmbH 9 November 18, 2013 11 Detection and Quantification of Packet Reordering with TCP 12 draft-zimmermann-tcpm-reordering-detection-00 14 Abstract 16 This document specifies an algorithm for the detection and 17 quantification of packet reordering for TCP. In the absence of 18 explicit congestion notification from the network, TCP uses only 19 packet loss as an indication of congestion. One of the signals TCP 20 uses to determine loss is the arrival of three duplicate 21 acknowledgments. However, this heuristic is not always correct, 22 notably in the case when paths reorder packets. This results in 23 degraded performance. 25 The algorithm for the detection and quantification of reordering in 26 this document uses information gathered from the TCP Timestamps 27 Option, the TCP SACK Option and its DSACK extension. When a 28 reordering event is detected, the algorithm calculates a reordering 29 extent by determining the number of segments the reordered segment 30 was late with respect to its position in the sequence number space. 31 Additionally, the algorithm computes a second reordering extent that 32 is relative to the amount of outstanding data and thus provides a 33 better estimation of the reordering delay when other sender state 34 changes. 36 Status of this Memo 38 This Internet-Draft is submitted in full conformance with the 39 provisions of BCP 78 and BCP 79. 41 Internet-Drafts are working documents of the Internet Engineering 42 Task Force (IETF). Note that other groups may also distribute 43 working documents as Internet-Drafts. The list of current Internet- 44 Drafts is at http://datatracker.ietf.org/drafts/current/. 46 Internet-Drafts are draft documents valid for a maximum of six months 47 and may be updated, replaced, or obsoleted by other documents at any 48 time. It is inappropriate to use Internet-Drafts as reference 49 material or to cite them other than as "work in progress." 51 This Internet-Draft will expire on May 22, 2014. 53 Copyright Notice 55 Copyright (c) 2013 IETF Trust and the persons identified as the 56 document authors. All rights reserved. 58 This document is subject to BCP 78 and the IETF Trust's Legal 59 Provisions Relating to IETF Documents 60 (http://trustee.ietf.org/license-info) in effect on the date of 61 publication of this document. Please review these documents 62 carefully, as they describe your rights and restrictions with respect 63 to this document. Code Components extracted from this document must 64 include Simplified BSD License text as described in Section 4.e of 65 the Trust Legal Provisions and are provided without warranty as 66 described in the Simplified BSD License. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 71 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 6 72 3. Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . 7 73 4. The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 7 74 4.1. Initialization During Connection Establishment . . . . . . 8 75 4.2. Receiving Acknowledgments . . . . . . . . . . . . . . . . 8 76 4.3. Receiving Acknowledgment Closing Hole . . . . . . . . . . 9 77 4.4. Receiving Duplicate Selective Acknowledgment . . . . . . . 9 78 4.5. Reordering Extent Computation . . . . . . . . . . . . . . 10 79 4.6. Retransmitting Segment . . . . . . . . . . . . . . . . . . 10 80 4.7. Placeholder for Response Algorithm . . . . . . . . . . . . 11 81 4.8. Retransmission Timeout . . . . . . . . . . . . . . . . . . 11 82 5. Protocol Steps in Detail . . . . . . . . . . . . . . . . . . . 11 83 6. Discussion of the Algorithm . . . . . . . . . . . . . . . . . 13 84 6.1. Calculation of the Relative Reordering Extent . . . . . . 13 85 6.2. Reordering Delay Longer than RTT . . . . . . . . . . . . . 14 86 6.3. Persistent Reception of Selective Acknowledgments . . . . 14 87 6.4. Packet Duplication . . . . . . . . . . . . . . . . . . . . 16 88 7. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 17 89 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 18 90 9. Security Considerations . . . . . . . . . . . . . . . . . . . 18 91 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 18 92 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 19 93 11.1. Normative References . . . . . . . . . . . . . . . . . . . 19 94 11.2. Informative References . . . . . . . . . . . . . . . . . . 19 95 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 22 97 1. Introduction 99 When the Transmission Control Protocol (TCP) [RFC0793] decides that 100 the oldest outstanding segment is lost, it performs a retransmission 101 and changes the sending rate [RFC5681]. This occurs when a 102 Retransmission Timeout (RTO) occurs for a segment, or when three 103 duplicate acknowledgments (ACKs) for a segment have been received 104 (Fast Retransmit). The assumption behind Fast Retransmit is that 105 non-congestion events that can cause duplicate ACKs to be generated 106 (packet duplication, packet reordering and packet corruption) are 107 infrequent. However, a number of Internet measurement studies have 108 shown that packet reordering is not a rare phenomenon [Pax97], 109 [BPS99], [BS02], [ZM04], [GPL04], [Jai+07] and has negative 110 performance implications on TCP [BA02], [Zha+03]. 112 The impact that packet reordering has on TCP can be classified by the 113 type of reordering: forward-path versus reverse-path reordering. 115 From TCP's perspective, the result of packet reordering on the 116 forward-path is the reception of out-of-order segments by the TCP 117 receiver. In response to every received out-of-order segment, the 118 TCP receiver immediately sends a duplicate ACK. (Note: [RFC5681] 119 recommends that delayed ACKs not be used when the ACK is triggered by 120 an out-of-order segment.) The sender side, if the number of 121 consecutively received duplicate ACKs exceeds the duplicate 122 acknowledgment threshold (DupThresh), retransmits the first 123 unacknowledged segment [RFC5681] and continues with a loss recovery 124 algorithm such as NewReno [RFC6582] or the Selective Acknowledgment 125 (SACK) based loss recovery [RFC6675]. If a segment arrives late at 126 the receiver because of reordering by more than three segments (the 127 default value of DupThresh [RFC5681]), the TCP sender is not able to 128 distinguish this reordering event from a segment loss, resulting in 129 an unnecessary retransmission and rate reduction. 131 Packet reordering may not only cause data segments to arrive out-of- 132 order but also ACKs at the receiver. This reordering on the reverse 133 path also has a negative an impact on TCP performance, by causing a 134 degradation of TCP's self-clocking property. In steady state, 135 depending on whether the TCP receiver delays an ACK or not [RFC1122], 136 one or two segments are acknowledged per ACK. If, due to reordering 137 on the reverse path, ACKs arrive at the TCP sender in a different 138 order than they were sent in by the TCP receiver, in-order ACKs 139 acknowledge several segments together rather than only one or two, 140 while disordered ACKs arrive either out-of-order or out-of-window and 141 are ignored. (Note: according to [RFC6675], an ACK only counts as a 142 duplicate if it carries a SACK block that identifies previously 143 unacknowledged and un-SACKed data.) Overall, this leads to a bursty 144 transmission pattern as well as outdated SACK and DSACK information. 146 Since DupThresh is defined in segments rather than bytes [RFC5681], 147 TCP usually quantifies packet reordering in terms of segments. 148 Informally, the reordering extent [RFC4737] is defined as the maximum 149 distance in segments between the reception of a reordered segment and 150 the earliest segment received with a larger sequence number. If a 151 segment is received in-order, its reordering extent is undefined 152 [RFC4737]. On the basis of the reordering extent, a mechanism to 153 make TCP robust to packet reordering can be achieved by directly 154 applying the reordering extent as and DupThresh. A problem that 155 arises with this way of quantifying reordering is that even in the 156 presence of constant reordering, reordering extents may vary if the 157 transmission rate of the TCP sender changes. Therefore, by using a 158 DupThresh that directly reflects the measured reordering extent, 159 spurious retransmissions cannot be fully avoided. 161 The following example illustrates this issue. Assume a path with a 162 reordering probability of 1%, a reordering delay of 20 ms, and a 163 bottleneck bandwidth of 3 Mb/s. Because segments that are delayed by 164 reordering arrive 20 ms too late, the TCP receiver can receive a 165 maximum of ((20 * 3 * 10^3) / 8) = 7500 bytes out-of-order before the 166 reordered segment arrives. Hence, with a Sender Maximum Segment Size 167 (SMSS) of 1460 bytes, the largest possible reordering extent is close 168 to 5 segments. If the bottleneck bandwidth changes from 3 Mb/s to 4 169 Mb/s, the maximum reordering extent will increase to 7 segments, 170 although the reordering delay remains constant. 172 This simple example shows that even with constant reordering, 173 spurious retransmissions cannot be completely avoided if DupThresh 174 directly reflects the reordering extent. On the other hand, the 175 reordering extent and the resulting DupThresh can sometimes also be 176 much too high and do not correspond to the actual packet reordering 177 present on the path. For example, a slow start overshoot [Hoe96], 178 [MM96], [Mat+97] at the end of slow start might induce such a 179 problem. 181 An obvious solution to the problem would be to quantify packet 182 reordering not by calculating a reordering extent, but by using the 183 reordering late time offset [RFC4737]. Since the reordering late 184 time offset is not specified in segments but captures the difference 185 between the expected and actual reception time of a reordered 186 segment, this way of quantifying reordering is independent of the 187 current transmission rate. Disadvantages of this approach are 188 however a higher complexity and a worse integration into the TCP 189 specification, since an implementation would require additional 190 timers, whereas TCP itself is self-clocked. 192 The approach taken by this specification quantifies the reordering 193 extend for a packet not only through an absolute value, but also 194 through a measure that is relative to the amount of outstanding data, 195 in an attempt to approximate a time-based measure. The presented 196 scheme can thereby easily be adapted to the Stream Control 197 Transmission Protocol (SCTP) [RFC2960], since SCTP uses congestion 198 control algorithms similar to TCP. 200 The remainder of this document is organized as follows. Section 3 201 provides a high-level description of the packet reordering detection 202 mechanisms. In Section 4, the algorithm is specified. In Section 5, 203 each step of the algorithm is discussed in detail. Section 6 204 provides a discussion of several design decisions of the algorithm. 205 Section 7 discusses related work. Section 9 discusses security 206 concerns. 208 2. Terminology 210 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 211 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 212 document are to be interpreted as described [RFC2119]. 214 The reader is expected to be familiar with the TCP state variables 215 given in [RFC0793] (SEG.SEQ), [RFC5681] (FlightSize), and [RFC6675] 216 (DupThresh, SACK scoreboard). SND.FACK (forward acknowledgment) is 217 used to describe the highest sequence number - plus one - that has 218 been either cumulatively or selectively acknowledged by the receiver 219 and subsequently seen by the sender [MM96]. Further, the term 220 'acceptable acknowledgment' is used as defined in [RFC0793]. That 221 is, an ACK that increases the connection's cumulative ACK point by 222 acknowledging previously unacknowledged data. The term 'duplicate 223 acknowledgment' is used as defined in [RFC6675], which is different 224 from the definition of duplicate acknowledgment in [RFC5681]. 226 This specification defines the four TCP sender states 'open', 227 'disorder', 'recovery', and 'loss' as follows. As long as no 228 duplicate ACK is received and no segment is considered lost, the TCP 229 sender is in the 'open' state. Upon the reception of the first 230 consecutive duplicate ACK, TCP will enter the 'disorder' state. 231 After receiving DupThresh duplicate ACKs, the TCP sender switches to 232 the 'recovery' state and executes standard loss recovery procedures 233 like Fast Retransmit and Fast Recovery [RFC5681]. Upon a 234 retransmission timeout, the TCP sender enters the 'loss' state. The 235 'recovery' state can only be reached by a transition from the 236 'disorder' state, the 'loss' state can be reached from any other 237 state. 239 3. Basic Concepts 241 The following specification depends on the TCP Timestamps [RFC1323] 242 and the TCP Selective Acknowledgment (SACK) [RFC2018] options and the 243 latter's Duplicate Selective Acknowledgment (DSACK) extension 244 [RFC2883]. The reader is assumed to be familiar with the algorithms 245 specified in these documents. 247 Reordering is quantified by an absolute and a relative reordering 248 extent. If a hole in the SACK scoreboard of the TCP sender is closed 249 either cumulatively by an acceptable ACK or selectively by a new 250 SACK, then the absolute reordering extent is computed as the number 251 of segments in the SACK scoreboard between the sequence number of the 252 reordered segment and the highest selectively or cumulatively 253 acknowledged sequence number. The relative reordering extent is then 254 computed as the ratio between the absolute reordering extent and the 255 FlightSize stored when entering the 'disorder' state. 257 If the hole that was closed in the SACK scoreboard corresponds to a 258 segment that was not retransmitted, or if the retransmission of such 259 a segment can be determined as a spurious retransmission by means of 260 the Eifel detection algorithm [RFC3522], then the calculated 261 reordering extent is immediately valid. Otherwise, if the 262 verification of the Eifel detection algorithm has not been possible, 263 the reordering extent is stored for a possibly subsequent DSACK. If 264 no such DSACK is received in the next two round-trip times (RTTs), 265 the reordering extent is discarded. 267 4. The Algorithm 269 Given that usually both the Nagle algorithm [RFC0896] [RFC1122] and 270 the TCP Selective Acknowledgment Option [RFC2018] are enabled, a TCP 271 sender MAY employ the following algorithm to detect and quantify the 272 current perceived packet reordering in the network. 274 Without the Nagle algorithm, there is no straight way to accurately 275 calculate the number of outstanding segments in the network (and, 276 therefore, no good way to derive an appropriate reordering extent) 277 without adding state to the TCP sender. A TCP connection that does 278 not employ the Nagle algorithm SHOULD NOT use this methodology. 280 If a TCP sender implements the following algorithm, the 281 implementation MUST follow the various specifications provided in 282 Sections 4.1 to 4.8. The algorithm MUST be executed *before* the 283 Transmission Control Block or the SACK scoreboard have been updated 284 by another loss recovery algorithm. 286 4.1. Initialization During Connection Establishment 288 After the completion of the TCP connection establishment, the 289 following state variables MUST be initialized in the TCP transmission 290 control block: 292 (C.1) The variable Dsack, which indicates whether a DSACK has been 293 received so far, and the data structure Samples, which stores 294 the computed reordering extents, MUST be initialized as: 296 Dsack = false 297 Samples = [] 299 (C.2) If the TCP Timestamps option [RFC1122] has been negotiated, 300 then the variable Timestamps MUST be activated and the data 301 structure Retrans_TS, which stores the value of the TSval 302 field of the retransmissions sent during Fast Recovery, MUST 303 be initialized as: 305 Timestamps = true 306 Retrans_TS = [] 308 Otherwise, the Timestamps-based detection SHOULD be 309 deactivated: 311 Timestamps = false 313 4.2. Receiving Acknowledgments 315 For each received ACK that either a) carries SACK information, *or* 316 b) is a full ACK that terminates the current fast recovery procedure, 317 *or* c) is an acceptable ACK that is received immediately after a 318 duplicate ACK, execute steps (A.1) to (A.4), otherwise skip to step 319 (A.4). 321 (A.1) If a) the ACK carries new SACK information, *and* b) the SACK 322 scoreboard is empty (i.e., the TCP sender has received no SACK 323 information from the receiver), then the TCP sender MUST save 324 the amount of current outstanding data: 326 FlightSizePrev = FlightSize 328 (A.2) If the received ACK either a) cumulatively acknowledges at 329 most SMSS bytes, *or* b) selectively acknowledges at most SMSS 330 bytes in the sequence number space in the SACK scoreboard, 331 then: 333 The TCP sender MUST execute steps (S.1) to (S.4) 335 (A.3) If a) Timestamps == false *and* b) the received ACK carries a 336 DSACK option [RFC2883] and the segment identified by the DSACK 337 option can be marked according to step (A.1) to (A.4) of 338 [RFC3708] as a valid duplicate, then: 340 The TCP sender MUST execute steps (D.1) to (D.3) 342 (A.4) The TCP sender MUST terminate the processing of the ACK by 343 this algorithm and MUST continue with the default processing 344 of the ACK. 346 4.3. Receiving Acknowledgment Closing Hole 348 (S.1) If (a) the newly cumulatively or selectively acknowledged 349 segment SEG is a retransmission *and* b) both equations Dsack 350 == false and Timestamps == false hold, then the TCP sender 351 MUST skip to step (A.4). 353 (S.2) Compute the relative and absolute reordering extent ReorExtR, 354 ReorExtA: 356 The TCP sender MUST execute steps (E.1) to (E.4) 358 (S.3) If a) the newly acknowledged segment SEG was not retransmitted 359 before *or* b) both equations Timestamps == true and 360 Retrans_TS[SEG.SEQ] > ACK.TSecr hold, i.e., the ACK 361 acknowledges the original transmission and not a 362 retransmission, then hand over the reordering extents to an 363 additional reaction algorithm: 365 The TCP sender MUST execute step (P) 367 (S.4) If a) the previous step (S.3) was not executed *and* b) both 368 equations Dsack == true and Timestamps == false hold, save the 369 reordering extents for the newly acknowledged segment SEG for 370 at least two RTTs: 372 Samples[SEG.SEQ].ReorExtR = ReorExtR 373 Samples[SEG.SEQ].ReorExtA = ReorExtA 375 4.4. Receiving Duplicate Selective Acknowledgment 376 (D.1) If no DSACK has been received so far, the sender MUST set: 378 Dsack = true 380 (D.2) If a) the previous step (D.1) was not executed *and* a 381 reordering extent was calculated for the segment SEG 382 identified by the DSACK option, then the TCP sender MUST 383 restore the values of the variables ReorExtR and ReorExtA and 384 delete the corresponding entries in the data structure: 386 ReorExtR = Samples[SEG.SEQ].ReorExtR 387 ReorExtA = SAMPLES[SEG.SEQ].ReorExtA 389 (D.3) Hand the newly restored reordering extents over to an 390 additional reaction algorithm: 392 The TCP sender MUST execute step (P) 394 4.5. Reordering Extent Computation 396 (E.1) SEG.SEQ is the sequence number of the newly cumulatively or 397 selectively acknowledged segment SEG. 399 (E.2) SND.FACK is the highest either cumulatively or selectively 400 acknowledged sequence number so far plus one. 402 (E.3) The TCP sender MUST compute the absolute reordering extent 403 ReorExtA as 405 ReorExtA = (SND.FACK - SEG.SEQ) / SMSS 407 (E.4) The TCP sender MUST compute the relative reordering extent 408 ReorExtR as 410 ReorExtR= ReorExtA * (SMSS / FlightSizePrev) 412 4.6. Retransmitting Segment 414 If the TCP Timestamps option [RFC1323] is used to detect packet 415 reordering, the TCP sender must save the TCP Timestamps option of all 416 retransmitted segments during fast recovery. 418 (RET) If a) a segment SEG is retransmitted during Fast Recovery, 419 *and* b) the equation Timestamps = true holds, the TCP sender 420 MUST save the value of the TSval field of the retransmitted 421 segment: 423 Retrans_TS[SEG.SEQ] = SEG.TSval 425 4.7. Placeholder for Response Algorithm 427 (P) This is a placeholder for an additional reaction algorithm 428 that takes further action using the results of this algorithm, 429 for example, the adjustment of the DupThresh based on relative 430 and absolute reordering extent ReorExtR and ReorExtA. 432 4.8. Retransmission Timeout 434 The expiration of the retransmission timer should be interpreted as 435 an indication of a change in path characteristics, and the TCP sender 436 should consider all saved reordering extents as outdated and delete 437 them. 439 (RTO) If an retransmission timeout (RTO) occurs, a TCP sender SHOULD 440 reset the following variables: 442 Samples = [] 443 Retrans_TS = [] 444 FlightSizePrev = 0 446 5. Protocol Steps in Detail 448 The reception of an ACK represents the starting point for the 449 detection scheme above. For each received SACK, DSACK or acceptable 450 ACK that prompts the TCP sender to enter the 'disorder' state, to 451 remain in the 'disorder' state or to leave the 'disorder' or 452 'recovery' states towards the 'open' state, steps (A.1) to (A.4) are 453 performed. All other received ACKs are not relevant for the 454 detection of packet reordering and can be ignored. If the TCP sender 455 changes from the 'open' to the 'disorder' state due to the reception 456 of a duplicate ACK (i.e., the SACK scoreboard is empty and an ACK 457 arrives carrying new SACK information), the current amount of 458 outstanding data, FlightSize, is stored for the subsequent 459 calculation of the relative reordering extent (step (A.1)). 461 Whenever a received acceptable ACK or SACK closes a hole in the 462 sequence number space of the SACK scoreboard either partially or 463 completely, this is an indication of packet reordering in the network 464 (step (A.2)). The prerequisite for an accurate quantification of the 465 reordering is that only one segment is newly acknowledged (maximum 466 SMSS bytes of data). If more than one segment per ACK is 467 acknowledged, either by reordering on the reverse path or the loss of 468 ACKs, the order in which the segments have been received by the TCP 469 receiver is no longer accurately determinable so that in this case a 470 reordering extent is not calculated. Finally, if the received ACK 471 carries a DSACK option that identifies a segment that was 472 retransmitted only once, then this is sufficient to conclude 473 reordering (step (A.3)), so that a previously calculated reordering 474 extent can be passed to another algorithm (steps (D.3) and (P)). 476 With just the information provided by the ACK field or SACK 477 information above SND.UNA , the TCP sender is unable to distinguish 478 whether the ACK that finally acknowledges retransmitted data (either 479 cumulatively or selectively) was sent in response to the original 480 segment or a retransmission of the segment. This is described as the 481 retransmission ambiguity problem in [KP87]. Therefore, the detection 482 and quantification of reordering depends on other means to 483 distinguish between acknowledgments for transmission and 484 retransmission to detect if a retransmission was spurious. If 485 neither a DSACK has been received (Dsack == false) nor the TCP 486 Timestamps option has been enabled on connection establishment 487 (Timestamps == false) then there is no possibility for the TCP sender 488 to identify spurious retransmissions. Hence, the processing of the 489 received ACK by the detection algorithm must be terminated for 490 retransmitted segments (step (S.1)). Otherwise, if the segment that 491 corresponds to the closed hole in the sequence number space of the 492 SACK scoreboard has not been retransmitted or the retransmission can 493 be identified by the Eifel detection algorithm [RFC3522] as a 494 spurious retransmission, the previously calculated reordering extent 495 is valid (step (S.2)) and an additional reaction algorithm can be 496 executed (step (S.3) and (P)). 498 For the use of the Eifel detection it is necessary to store the TCP 499 Timestamps option of all retransmissions sent during Fast Recovery 500 (step (Ret)). However, if the use of the Eifel detection algorithm 501 is not possible (Timestamps == false), the extent of a possible 502 reordering is stored for the possibility of a subsequent arrival of a 503 DSACK (step (P.4)). If no such DSACK is received in the next two 504 round-trip times, the reordering extent is discarded. Since the 505 DSACK extension is not negotiated during connection establishment 506 [RFC2883], the reordering extent is only stored if a DSACK was 507 previously received for the TCP connection (DSACK == true, step 508 (D.1)). 510 Regardless of whether packet reordering is detected by using the 511 SACK-based methodology, the DSACK-based methodology, or the TCP 512 Timestamps option, quantification of the reordering will always be 513 done when closing a hole in the sequence number space of the SACK 514 scoreboard (step (A.2), step (P.2)). The only difference is the time 515 of detection, which is in the case of DSACK-based methodology at 516 least one RTT after the time of the quantification. The absolute 517 reordering extent ReorExtA results from the number of segments in the 518 SACK scoreboard between the sequence number of the newly acknowledged 519 segment and the highest either cumulatively or selectively 520 acknowledged sequence number so far plus one (SND.FACK) (step (E.3)). 522 It is worth noting that the absolute reordering extent includes all 523 segments (bytes) between the closed hole and the highest acknowledged 524 sequence number so far, i.e., it also includes segments (bytes) that 525 are not selectively acknowledged. The reason is that if packet 526 reordering is considered from a temporal perspective, it is 527 irrelevant whether there are lost segments or not. The important 528 fact is that the lost segments have been sent after the delayed 529 segment and before the highest acknowledged segment, which is 530 expressed by the metric. In step (E.4), the relative reordering 531 extent ReorExtR is then calculated by the ratio between the absolute 532 reordering extent ReorExtA and the amount of outstanding data stored 533 by step (A.1). 535 6. Discussion of the Algorithm 537 The focus of the following discussion is on the quantification of 538 reordering by the relative reordering extent and to elaborate on 539 possible sources of error, which may lead to an inaccurate detection 540 and quantification of reordering in the network. 542 6.1. Calculation of the Relative Reordering Extent 544 Generally, the characteristics of a relative reordering extent should 545 be that if packet reordering on a path is constant in terms of rate 546 and delay, the relative reordering extent should also be constant, 547 regardless of the current transmission rate of the TCP sender. The 548 scheme proposed in this document is to calculate the relative 549 reordering by getting the ratio between absolute reordering (the 550 amount of data the reordered segment was received too late) and the 551 amount of outstanding data stored when TCP sender was entering the 552 'disorder' state (the maximum amount of data a reordered segment can 553 be received too late). (Note: A segment can theoretically be delayed 554 for an arbitrarily long period during reordering. However, the 555 scheme proposed in this document does not capture such kind of 556 reordering. See Section 6.2.) Therefore, the relative reordering 557 extent expresses the portion of currently outstanding data that is 558 selectively acknowledged before the reordered segment is cumulatively 559 acknowledged. If the transmission rate changes, the absolute 560 reordering extent changes as well, but together with the amount of 561 outstanding data, and hence the relative reordering extent stays 562 constant. 564 A characteristic of the calculation of the relative reordering extent 565 on the basis of currently outstanding amount of data is that the 566 FlightSize reflects the bandwidth-delay-product and not the 567 transmission rate. As a consequence, the relative reordering extent 568 is not independent of the RTT. If the RTT of the communication path 569 changes, the amount of outstanding data changes as well, but the 570 absolute reordering extent remains constant. Hence, the relative 571 reordering extent adapts. In principle it is possible to design an 572 algorithm to compute the relative reordering extent independently of 573 the RTT and to reflect only the characteristics of packet reordering 574 of the path. But since the calculation would be far from trivial and 575 introducing more complexity, this is considered to be future 576 research. 578 6.2. Reordering Delay Longer than RTT 580 The quantification of packet reordering always takes place at the 581 time when a hole in sequence number space in the SACK scoreboard is 582 closed. In consequence, if a reordered packet is delayed by more 583 than the measured RTT, the amount of reordering on the path is 584 underestimated, since the ACK that closes the hole in the sequence 585 number space was not sent in response to the original transmission, 586 but to the retransmission. Hence, in order to detect and quantify 587 these kinds of events, the reordering extent would have to be 588 calculated when the ACK for the transmission is received and not at 589 the time the hole in the sequence number space is closed. Although 590 it would be possible to take these events into account by evaluating 591 DSACKs and the TCP Timestamps option simultaneously, the scheme 592 proposed in this document refrains from qualifying a reordering delay 593 longer then RTT. A reordering delay of this magnitude is very 594 unlikely, and would lead to a significant overhead in memory usage 595 and complexity. Additionally, taking it into account would result in 596 other problems, especially a potential expiration of the RTO 597 [I-D.zimmermann-tcpm-reordering-reaction]. 599 6.3. Persistent Reception of Selective Acknowledgments 601 Especially on paths with a high bandwidth-delay-product, it is 602 possible that even with a minor packet reordering, several segments 603 in a single window of data are delayed. If, in addition, the 604 sequence numbers of those segments are widely spaced in the sequence 605 number space and the delay caused by packet reordering is 606 sufficiently high, this might lead to a constant reception of out-of- 607 order data. Hence, for each received segment, regardless of whether 608 a hole in the sequence number space of the receive window is closed 609 or not, an ACK is sent that carries SACK information. From TCP 610 sender's perspective, this persistent receiving of new SACK 611 information leads to the situation that the TCP sender enters the 612 'disorder' state when receiving the first SACK and never leaves it 613 again during the connection lifetime if no segment is lost in 614 between. 616 In case of the above reordering detection and quantification scheme, 617 the persistent reception of SACK blocks causes the amount of 618 outstanding data, which is stored when the TCP sender enters the 619 'disorder' state, to never be updated, since FlightSize is only saved 620 in step (A.1) when the SACK scoreboard is empty. If the transmission 621 rate of the TCP sender, and therefore also the maximum amount of data 622 a reordered segment can be received too late, changes significantly 623 during its stay in the 'disorder' state, the actual amount of 624 reordering is not accurately determined by the relative reordering 625 extent. A decrease of the transmission rate would result in an 626 overestimation of the reordering extent and vice versa. 628 A simple solution to the problem would be to store the maximum offset 629 in terms of sequence number space by which a reordered segment can be 630 received too late only when entering the 'disorder' state, but 631 individually for every potentially reordered segment, that is, for 632 every hole in the sequence space of the SACK scoreboard. (Note: The 633 maximum offset in terms of sequence number space by which a reordered 634 segment can be received too late is strictly speaking the amount of 635 data that have been transmitted later than the reordered segment. 636 This amount of data can only be expressed by FlightSize within the 637 'open' state and not within the 'disorder' state, since the 638 cumulative ACK point may not advance). 640 The problem with this simple idea is that for a new hole in the SACK 641 scoreboard, it is not possible to determine whether it is a result of 642 packet reordering or loss, and therefore it results in increased 643 memory usage (to store the amount of data for each hole.) 644 Additionally, the packet reordering would be inaccurately quantified 645 if the transmission rate changes significantly for a short amount of 646 time. For example, if the amount of outstanding data is low when 647 entering the 'disorder' state is entered, the execution of Careful 648 Extended Limited Transmit (as described in 649 [I-D.zimmermann-tcpm-reordering-reaction] [RFC4653]) leads to a 650 significant short-term change of the transmission rate. When the 651 amount of data by which the reordering segment can be delayed is 652 determined individually for every new hole, it leads to an 653 overestimation of the relative reordering extent, since the maximum 654 amount of data possible is 'artificially' reduced by Careful Extended 655 Limited Transmit. 657 A solution to this problem is to store the maximum offset in terms of 658 sequence number space by which a reordered segment can be received 659 too late not for every segment individually (which does not guarantee 660 an accurate calculation of the relative reordering extent) but only 661 sufficiently often, e.g., once per RTT. The identification of what 662 frequency would be adequate, though, is neither trivial nor 663 universally applicable, since a concrete solution depends on the 664 transmission behavior of the used TCP in the 'disorder' state and 665 whether it is more beneficial for an additional reordering response 666 algorithm to over- or underestimate the packet reordering on the 667 path. If, for example, TCP-aNCR 668 [I-D.zimmermann-tcpm-reordering-reaction] is used as additional 669 reordering response algorithm, the maximum offset in terms of 670 sequence number space by which a reordered segment can be received 671 too late is not only stored when entering the 'disorder' state but 672 also updated every RTT (every cwnd worth of data transmitted without 673 a loss) while the TCP sender stays in the 'disorder' state. 675 6.4. Packet Duplication 677 Although the problem of packet duplication in today's Internet 678 [Jai+07], [Mel+08] is negligible, it may happen in rare cases that 679 segments on the path to the TCP receiver are duplicated. If a 680 segment is duplicated on the path, the first incoming segment causes 681 the receiver to send either an acceptable ACK or a SACK, depending on 682 whether the segment is the next expected one or not. Each subsequent 683 identical segment then causes either a duplicate ACK or a DSACK, 684 respectively, depending on whether the DSACK extension [RFC3708] is 685 implemented or not. 687 If by a combination of packet loss and packet duplication the case 688 occurs that a Fast Retransmit for a lost segment is duplicated on the 689 path, the TCP sender is not able to distinguish this from packet 690 reordering. The first received ACK closes a hole in the sequence 691 number space of the SACK scoreboard, while the second received ACK is 692 a valid DSACK. Although both cases are indistinguishable from a 693 theoretical point of view, the TCP sender can take measures to ensure 694 as far as possible that the DSACK received was not the result of 695 packet duplication. 697 For this purpose, step (A.3) of the above detection method checks via 698 the steps (A.1) to (A.4) of [RFC3708] whether the segment identified 699 by the DSACK option is marked as a valid duplicate. Unfortunately, 700 the steps of [RFC3708] do not check that more DSACKs have been 701 received than retransmissions have been sent, which is a 702 characteristic of suffering both packet reordering and packet 703 duplication at the same time. By simply counting the received 704 DSACKs, for example, as additional step (A.5) in [RFC3708], this 705 corner case can be covered as well. 707 7. Related Work 709 Because of retransmission ambiguity problem [KP87], which describes 710 TCP sender's inability to to distinguish whether the first acceptable 711 ACK that arrives after a retransmit was sent in response to the 712 original transmit or the retransmit, two different approaches can 713 generally be taken to detect and quantify packet reordering. First, 714 for transmissions (non-retransmitted segments), the detection is 715 usually conducted by detecting a closed hole in sequence number space 716 of the SACK scoreboard. Second, for retransmissions, the detection 717 of packet reordering is accompanied by the detection of the spurious 718 fast retransmits. 720 Within the IETF, several proposals have been published in the RFC 721 series to detect and quantify packet reordering. With [RFC4737] the 722 IPPM Working Group [IPPM] defines several metrics to evaluate whether 723 a network path has maintained packet order on a packet-by-packet 724 basis. [RFC4737] gives concrete, well-defined metrics, along with a 725 methodology for applying the metric to a generic packet stream. The 726 metric discussed in this document has a different purpose from the 727 IPPM metrics; this document discusses a TCP specific reordering 728 metric calculated on the TCP sender's SACK scoreboard. 730 Besides the IPPM work, several other proposals have been developed to 731 detect spurious retransmissions with TCP. The Eifel detection 732 algorithm [RFC3522] uses the TCP Timestamps option to determine 733 whether the ACK for a given retransmit is for the original 734 transmission or a retransmission. The F-RTO scheme [RFC5682] 735 slightly alters TCP's sending pattern immediately following a 736 retransmission timeout to indicate whether the retransmitted segment 737 was needed. Finally, the DSACK-based algorithm [RFC3708] uses the 738 TCP SACK option [RFC2018] with the DSACK extension [RFC2883] to 739 identify unnecessary retransmissions. The mechanism for detecting 740 packet reordering outlined in this document rely on the detection 741 schemes of those documents (except F-RTO that only works for spurious 742 retransmits triggered by TCP's retransmission timer), although they 743 do not provide metrics for the reordering extent whereas the 744 algorithm described in this document does. 746 RR-TCP [Zha+03] describes a reordering detection and quantification 747 scheme that is also based on holes in the sequence number space of 748 the SACK scoreboard and the reception of DSACKs. For every hole in 749 the SACK scoreboard, RR-TCP calculates a reordering extent. If the 750 segment was retransmitted before an ACK was received, it waits for a 751 DSACK that proves that the segment was spuriously retransmitted. The 752 reordering sample in such a case is the mean between the sample 753 calculated due to the hole in the sequence number space and the 754 sample calculated in responding to the received DSACK. In contrast, 755 the methodology in this document is always to quantify the reordering 756 at the time of a closed hole and to not take packet reordering with a 757 delay larger than one RTT into account (see Section 6.2. 759 The Linux kernel [Linux] implements a reordering detection based on 760 SACK, DSACK and TCP Timestamps option as well. The detection and 761 quantification of non-retransmitted segments with SACK or for 762 retransmitted segments with TCP Timestamps option operates much like 763 the scheme described in this document, with the exception of the 764 DSACK detection. First, Linux does not store any information (e.g., 765 reordering extent) below the cumulative ACK point, so that DSACKs 766 below the cumulative ACK point are ignored (for the purpose for 767 reordering quantification). Second, Linux also does not store any 768 information about a possible reordering event when a hole in the 769 sequence number space of the SACK scoreboard is closed. Therefore, 770 for a DSACK reporting a duplicate above the cumulative ACK, Linux 771 needs to approximate the reordering on arrival of a DSACK by the 772 distance between the DSACK and the highest selectively acknowledged 773 segment. 775 8. IANA Considerations 777 This memo includes no request to IANA. 779 9. Security Considerations 781 The described algorithm neither improves nor degrades the current 782 security of TCP, since this document only detects and quantifies 783 reordering and does not change the TCP behavior. General security 784 considerations for SACK based loss recovery are outlined in 785 [RFC6675]. 787 10. Acknowledgments 789 The authors thank the flowgrind [Flowgrind] authors and contributors 790 for their performance measurement tool, which give us a powerful tool 791 to analyze TCP's congestion control and loss recovery behavior in 792 detail. 794 11. References 795 11.1. Normative References 797 [I-D.zimmermann-tcpm-reordering-reaction] 798 Zimmermann, A., Schulte, L., Wolff, C., and A. Hannemann, 799 "An adaptive Robustness of TCP to Non-Congestion Events", 800 draft-zimmermann-tcpm-reordering-reaction-00 (work 801 in progress), November 2013. 803 [MM96] Mathis, M. and J. Mahdavi, "Forward Acknowledgement: 804 Refining TCP Congestion Control", ACM SIGCOMM 1996 805 Proceedings, in ACM Computer Communication Review 26 (4), 806 pp. 281-292, October 1996. 808 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 809 RFC 793, September 1981. 811 [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions 812 for High Performance", RFC 1323, May 1992. 814 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 815 Selective Acknowledgment Options", RFC 2018, October 1996. 817 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 818 Requirement Levels", BCP 14, RFC 2119, March 1997. 820 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 821 Extension to the Selective Acknowledgement (SACK) Option 822 for TCP", RFC 2883, July 2000. 824 [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm 825 for TCP", RFC 3522, April 2003. 827 [RFC3708] Blanton, E. and M. Allman, "Using TCP Duplicate Selective 828 Acknowledgement (DSACKs) and Stream Control Transmission 829 Protocol (SCTP) Duplicate Transmission Sequence Numbers 830 (TSNs) to Detect Spurious Retransmissions", RFC 3708, 831 February 2004. 833 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 834 Control", RFC 5681, September 2009. 836 11.2. Informative References 838 [BA02] Blanton, E. and M. Allman, "On Making TCP More Robust to 839 Packet Reordering", ACM Computer Communication 840 Review vol.32, no. 1, pp. 20-30, January 2002. 842 [BPS99] Bennett, J., Partridge, C., and N. Shectman, "Packet 843 reordering is not pathological network behavior", IEEE/ACM 844 Transactions on Networking vol. 7, no. 6, pp. 789-798, 845 December 1999. 847 [BS02] Bellardo, J. and S. Partridge, "Measuring Packet 848 Reordering", Proceedings of the 2nd ACM SIGCOMM Workshop 849 on Internet Measurment (IMW'02) pp. 97-105, November 2002. 851 [Flowgrind] 852 "Flowgrind Home Page", 853 . 855 [GPL04] Gharai, L., Perkins, C., and T. Lehman, "Packet 856 Reordering, High Speed Networks and Transport Protocol 857 Performance", Proceedings of the 13th IEEE International 858 Conference on Computer Communications and Networks 859 (ICCCN'04) pp. 73-78, October 2004. 861 [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion 862 Control Scheme for TCP", Proceedings of the Conference on 863 Applications, Technologies, Architectures, and Protocols 864 for Computer Communication (SIGCOMM'96) pp. 270-280, 865 August 1996. 867 [IPPM] "IP Performance Metrics (IPPM) Working Group", 868 . 870 [Jai+07] Jaiswal, S., Iannaccone, G., Diot, C., Kurose, J., and D. 871 Towsley, "Measurement and Classification of Out-of- 872 Sequence Packets in a Tier-1 IP Backbone", IEEE/ACM 873 Transactions on Networking vol. 15, no. 1, pp. 54-66, 874 February 2007. 876 [KP87] Karn, P. and C. Partridge, "Improving Round-Trip Time 877 Estimates in Reliable Transport Protocols", ACM SIGCOMM 878 Computer Communication Review vol. 17, no. 5, pp. 2-7, 879 November 1987. 881 [Linux] "The Linux Project", . 883 [Mat+97] Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The 884 Macroscopic Behavior of the TCP Congestion Avoidance 885 Algorithm", ACM SIGCOMM Computer Communication Review vol. 886 27, no. 3, pp. 67-82, July 1997. 888 [Mel+08] Mellia, M., Meo, M., Muscariello, L., and D. Rossi, 889 "Passive analysis of TCP anomalies", Computer 890 Networks vol. 52, no. 14, pp. 2663-2676, October 2008. 892 [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics", IEEE/ 893 ACM Transactions on Networking vol. 7, no.3, pp. 277-292, 894 June 1997. 896 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 897 RFC 896, January 1984. 899 [RFC1122] Braden, R., "Requirements for Internet Hosts - 900 Communication Layers", STD 3, RFC 1122, October 1989. 902 [RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., 903 Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., 904 Zhang, L., and V. Paxson, "Stream Control Transmission 905 Protocol", RFC 2960, October 2000. 907 [RFC4653] Bhandarkar, S., Reddy, A., Allman, M., and E. Blanton, 908 "Improving the Robustness of TCP to Non-Congestion 909 Events", RFC 4653, August 2006. 911 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 912 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 913 November 2006. 915 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 916 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 917 Spurious Retransmission Timeouts with TCP", RFC 5682, 918 September 2009. 920 [RFC6582] Henderson, T., Floyd, S., Gurtov, A., and Y. Nishida, "The 921 NewReno Modification to TCP's Fast Recovery Algorithm", 922 RFC 6582, April 2012. 924 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 925 and Y. Nishida, "A Conservative Loss Recovery Algorithm 926 Based on Selective Acknowledgment (SACK) for TCP", 927 RFC 6675, August 2012. 929 [ZM04] Zhou, X. and P. Mieghem, "Reordering of IP Packets in 930 Internet", Lecture Notes in Computer Science vol. 3015, 931 pp. 237-246, April 2004. 933 [Zha+03] Zhang, M., Karp, B., Floyd, S., and L. Peterson, "RR-TCP: 934 A Reordering-Robust TCP with DSACK", Proceedings of the 935 11th IEEE International Conference on Network Protocols 936 (ICNP'03) pp. 95-106, November 2003. 938 Authors' Addresses 940 Alexander Zimmermann 941 NetApp, Inc. 942 Sonnenallee 1 943 Kirchheim 85551 944 Germany 946 Phone: +49 89 900594712 947 Email: alexander.zimmermann@netapp.com 949 Lennart Schulte 950 Aalto University 951 Otakaari 5 A 952 Espoo 02150 953 Finland 955 Phone: +358 50 4355233 956 Email: lennart.schulte@aalto.fi 958 Carsten Wolff 959 credativ GmbH 960 Hohenzollernstrasse 133 961 Moenchengladbach 41061 962 Germany 964 Phone: +49 2161 4643 182 965 Email: carsten.wolff@credativ.de 967 Arnd Hannemann 968 credativ GmbH 969 Hohenzollernstrasse 133 970 Moenchengladbach 41061 971 Germany 973 Phone: +49 2161 4643 134 974 Email: arnd.hannemann@credativ.de