idnits 2.17.1 draft-zimmermann-tcpm-reordering-detection-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (May 20, 2014) is 3627 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- == Outdated reference: A later version (-02) exists of draft-zimmermann-tcpm-reordering-reaction-01 ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 1323 (Obsoleted by RFC 7323) -- Obsolete informational reference (is this intentional?): RFC 896 (Obsoleted by RFC 7805) -- Obsolete informational reference (is this intentional?): RFC 2960 (Obsoleted by RFC 4960) Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 TCP Maintenance and Minor Extensions A. Zimmermann 3 (TCPM) WG NetApp, Inc. 4 Internet-Draft L. Schulte 5 Intended status: Experimental Aalto University 6 Expires: November 21, 2014 C. Wolff 7 A. Hannemann 8 credativ GmbH 9 May 20, 2014 11 Detection and Quantification of Packet Reordering with TCP 12 draft-zimmermann-tcpm-reordering-detection-01 14 Abstract 16 This document specifies an algorithm for the detection and 17 quantification of packet reordering for TCP. In the absence of 18 explicit congestion notification from the network, TCP uses only 19 packet loss as an indication of congestion. One of the signals TCP 20 uses to determine loss is the arrival of three duplicate 21 acknowledgments. However, this heuristic is not always correct, 22 notably in the case when paths reorder packets. This results in 23 degraded performance. 25 The algorithm for the detection and quantification of reordering in 26 this document uses information gathered from the TCP Timestamps 27 Option, the TCP SACK Option and its DSACK extension. When a 28 reordering event is detected, the algorithm calculates a reordering 29 extent by determining the number of segments the reordered segment 30 was late with respect to its position in the sequence number space. 31 Additionally, the algorithm computes a second reordering extent that 32 is relative to the amount of outstanding data and thus provides a 33 better estimation of the reordering delay when other sender state 34 changes. 36 Status of this Memo 38 This Internet-Draft is submitted in full conformance with the 39 provisions of BCP 78 and BCP 79. 41 Internet-Drafts are working documents of the Internet Engineering 42 Task Force (IETF). Note that other groups may also distribute 43 working documents as Internet-Drafts. The list of current Internet- 44 Drafts is at http://datatracker.ietf.org/drafts/current/. 46 Internet-Drafts are draft documents valid for a maximum of six months 47 and may be updated, replaced, or obsoleted by other documents at any 48 time. It is inappropriate to use Internet-Drafts as reference 49 material or to cite them other than as "work in progress." 51 This Internet-Draft will expire on November 21, 2014. 53 Copyright Notice 55 Copyright (c) 2014 IETF Trust and the persons identified as the 56 document authors. All rights reserved. 58 This document is subject to BCP 78 and the IETF Trust's Legal 59 Provisions Relating to IETF Documents 60 (http://trustee.ietf.org/license-info) in effect on the date of 61 publication of this document. Please review these documents 62 carefully, as they describe your rights and restrictions with respect 63 to this document. Code Components extracted from this document must 64 include Simplified BSD License text as described in Section 4.e of 65 the Trust Legal Provisions and are provided without warranty as 66 described in the Simplified BSD License. 68 Table of Contents 70 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 71 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 6 72 3. Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . 7 73 4. The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 7 74 4.1. Initialization During Connection Establishment . . . . . . 8 75 4.2. Receiving Acknowledgments . . . . . . . . . . . . . . . . 8 76 4.3. Receiving Acknowledgment Closing Hole . . . . . . . . . . 9 77 4.4. Receiving Duplicate Selective Acknowledgment . . . . . . . 9 78 4.5. Reordering Extent Computation . . . . . . . . . . . . . . 10 79 4.6. Retransmitting Segment . . . . . . . . . . . . . . . . . . 10 80 4.7. Placeholder for Response Algorithm . . . . . . . . . . . . 11 81 4.8. Retransmission Timeout . . . . . . . . . . . . . . . . . . 11 82 5. Protocol Steps in Detail . . . . . . . . . . . . . . . . . . . 11 83 6. Discussion of the Algorithm . . . . . . . . . . . . . . . . . 13 84 6.1. Calculation of the Relative Reordering Extent . . . . . . 13 85 6.2. Reordering Delay Longer than RTT . . . . . . . . . . . . . 14 86 6.3. Persistent Reception of Selective Acknowledgments . . . . 14 87 6.4. Packet Duplication . . . . . . . . . . . . . . . . . . . . 16 88 7. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 17 89 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 18 90 9. Security Considerations . . . . . . . . . . . . . . . . . . . 18 91 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 18 92 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 19 93 11.1. Normative References . . . . . . . . . . . . . . . . . . . 19 94 11.2. Informative References . . . . . . . . . . . . . . . . . . 19 95 Appendix A. Changes from previous versions of the draft . . . . . 22 96 A.1. Changes from 97 draft-zimmermann-tcpm-reordering-detection-00 . . . . . . 22 98 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 22 100 1. Introduction 102 When the Transmission Control Protocol (TCP) [RFC0793] decides that 103 the oldest outstanding segment is lost, it performs a retransmission 104 and changes the sending rate [RFC5681]. This occurs either when the 105 Retransmission Timeout (RTO) timer expires for a segment [RFC6298], 106 or when three duplicate acknowledgments (ACKs) for a segment have 107 been received (Fast Retransmit/Fast Recovery) [RFC5681]. The 108 assumption behind Fast Retransmit is that non-congestion events that 109 can cause duplicate ACKs to be generated (packet duplication, packet 110 reordering and packet corruption) are infrequent. However, a number 111 of Internet measurement studies have shown that packet reordering is 112 not a rare phenomenon [Pax97], [BPS99], [BS02], [ZM04], [GPL04], 113 [Jai+07] and has negative performance implications on TCP [BA02], 114 [Zha+03]. 116 The impact that packet reordering has on TCP can be classified by the 117 type of reordering: forward-path versus reverse-path reordering. 119 From TCP's perspective, the result of packet reordering on the 120 forward-path is the reception of out-of-order segments by the TCP 121 receiver. In response to every received out-of-order segment, the 122 TCP receiver immediately sends a duplicate ACK. (Note: [RFC5681] 123 recommends that delayed ACKs not be used when the ACK is triggered by 124 an out-of-order segment.) The sender side, if the number of 125 consecutively received duplicate ACKs exceeds the duplicate 126 acknowledgment threshold (DupThresh), retransmits the first 127 unacknowledged segment [RFC5681] and continues with a loss recovery 128 algorithm such as NewReno [RFC6582] or the Selective Acknowledgment 129 (SACK) based loss recovery [RFC6675]. If a segment arrives due to 130 reordering more than three segments (the default value of DupThresh 131 [RFC5681]) too late at the TCP receiver, the sender is not able to 132 distinguish this reordering event from a segment loss, resulting in 133 an unnecessary retransmission and rate reduction. 135 Packet reordering may not only cause data segments to arrive out-of- 136 order at the receiver but also ACKs ot the sender. This reordering 137 on the reverse path also has a negative an impact on TCP performance, 138 by causing a degradation of TCP's self-clocking property. In steady 139 state, depending on whether the TCP receiver delays an ACK or not 140 [RFC1122], one or two segments are acknowledged per ACK. If, due to 141 reordering on the reverse path, ACKs arrive at the TCP sender in a 142 different order than they were sent in by the TCP receiver, in-order 143 ACKs acknowledge several segments together rather than only one or 144 two, while disordered ACKs arrive either out-of-order or out-of- 145 window and are ignored. (Note: according to [RFC6675], an ACK only 146 counts as a duplicate if it carries a SACK block that identifies 147 previously unacknowledged and un-SACKed data.) Overall, this leads 148 to a bursty transmission pattern as well as outdated SACK and DSACK 149 information. 151 Since DupThresh is defined in segments rather than bytes [RFC5681], 152 TCP usually quantifies packet reordering in terms of segments. 153 Informally, the reordering extent [RFC4737] is defined as the maximum 154 distance in segments between the reception of a reordered segment and 155 the earliest segment received with a larger sequence number. If a 156 segment is received in-order, its reordering extent is undefined 157 [RFC4737]. On the basis of the reordering extent, a simple 158 robustness of TCP to packet reordering can be achieved by directly 159 applying the reordering extent as DupThresh. A problem that arises 160 with this way of quantifying reordering is that even in the presence 161 of constant reordering, reordering extents may vary if the 162 transmission rate of the TCP sender changes. Therefore, by using a 163 DupThresh that directly reflects the measured reordering extent, 164 spurious retransmissions cannot be fully avoided. 166 The following example illustrates this issue. Assume a path with a 167 reordering probability of 1%, a reordering delay of 20 ms, and a 168 bottleneck bandwidth of 3 Mb/s. Because segments that are delayed by 169 reordering arrive 20 ms too late, the TCP receiver can receive a 170 maximum of ((20 * 3 * 10^3) / 8) = 7500 bytes out-of-order before the 171 reordered segment arrives. Hence, with a Sender Maximum Segment Size 172 (SMSS) of 1460 bytes, the largest possible reordering extent is close 173 to 5 segments. If the bottleneck bandwidth changes from 3 Mb/s to 4 174 Mb/s, the maximum reordering extent will increase to 7 segments, 175 although the reordering delay remains constant. 177 This simple example shows that even with constant reordering, 178 spurious retransmissions cannot be completely avoided if the 179 DupThresh directly reflects the reordering extent. On the other 180 hand, the reordering extent and the resulting DupThresh can sometimes 181 also be much too high and do not correspond to the actual packet 182 reordering present on the path. For example, a slow start overshoot 183 [Hoe96], [MM96], [Mat+97] at the end of slow start might induce such 184 a problem. 186 An obvious solution to the problem would be to quantify packet 187 reordering not by calculating a reordering extent, but by using the 188 reordering late time offset [RFC4737]. Since the reordering late 189 time offset is not specified in segments but captures the difference 190 between the expected and actual reception time of a reordered 191 segment, this way of quantifying reordering is independent of the 192 current transmission rate. Disadvantages of this approach are 193 however a higher complexity and a worse integration into the TCP 194 specification, since an implementation would require additional 195 timers, whereas TCP itself is self-clocked. 197 The approach taken by this specification quantifies the reordering 198 extend for a packet not only through an absolute value, but also 199 through a measure that is relative to the amount of outstanding data, 200 in an attempt to approximate a time-based measure. The presented 201 scheme can thereby easily be adapted to the Stream Control 202 Transmission Protocol (SCTP) [RFC2960], since SCTP uses congestion 203 control algorithms similar to TCP. 205 The remainder of this document is organized as follows. Section 3 206 provides a high-level description of the packet reordering detection 207 mechanisms. In Section 4, the algorithm is specified. In Section 5, 208 each step of the algorithm is discussed in detail. Section 6 209 provides a discussion of several design decisions of the algorithm. 210 Section 7 discusses related work. Section 9 discusses security 211 concerns. 213 2. Terminology 215 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 216 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 217 document are to be interpreted as described [RFC2119]. 219 The reader is expected to be familiar with the TCP state variables 220 given in [RFC0793] (SEG.SEQ, SND.UNA), [RFC5681] (FlightSize), and 221 [RFC6675] (DupThresh, SACK scoreboard). SND.FACK (forward 222 acknowledgment) is used to describe the highest sequence number - 223 plus one - that has been either cumulatively or selectively 224 acknowledged by the receiver and subsequently seen by the sender 225 [MM96]. Further, the term 'acceptable acknowledgment' is used as 226 defined in [RFC0793]. That is, an ACK that increases the 227 connection's cumulative ACK point by acknowledging previously 228 unacknowledged data. The term 'duplicate acknowledgment' is used as 229 defined in [RFC6675], which is different from the definition of 230 duplicate acknowledgment in [RFC5681]. 232 This specification defines the four TCP sender states 'open', 233 'disorder', 'recovery', and 'loss' as follows. As long as no 234 duplicate ACK is received and no segment is considered lost, the TCP 235 sender is in the 'open' state. Upon the reception of the first 236 consecutive duplicate ACK, TCP will enter the 'disorder' state. 237 After receiving DupThresh duplicate ACKs, the TCP sender switches to 238 the 'recovery' state and executes standard loss recovery procedures 239 like Fast Retransmit and Fast Recovery [RFC5681]. Upon a 240 retransmission timeout, the TCP sender enters the 'loss' state. The 241 'recovery' state can only be reached by a transition from the 242 'disorder' state, the 'loss' state can be reached from any other 243 state. 245 3. Basic Concepts 247 The following specification depends on the TCP Timestamps option 248 [RFC1323], the TCP Selective Acknowledgment (SACK) [RFC2018] option 249 and the latter's Duplicate Selective Acknowledgment (DSACK) extension 250 [RFC2883]. The reader is assumed to be familiar with the algorithms 251 specified in these documents. 253 Reordering is quantified by an absolute and a relative reordering 254 extent. If a hole in the SACK scoreboard of the TCP sender is closed 255 either cumulatively by an acceptable ACK or selectively by a new 256 SACK, then the absolute reordering extent is computed as the number 257 of segments in the SACK scoreboard between the sequence number of the 258 reordered segment and the highest selectively or cumulatively 259 acknowledged sequence number. The relative reordering extent is then 260 computed as the ratio between the absolute reordering extent and the 261 FlightSize stored when entering the 'disorder' state. 263 If the hole that was closed in the SACK scoreboard corresponds to a 264 segment that was not retransmitted, or if the retransmission of such 265 a segment can be determined as a spurious retransmission by means of 266 the Eifel detection algorithm [RFC3522], then the calculated 267 reordering extent is immediately valid. Otherwise, if the 268 verification of the Eifel detection algorithm has not been possible, 269 the reordering extent is stored for a possibly subsequent DSACK. If 270 no such DSACK is received in the next two round-trip times (RTTs), 271 the reordering extents are discarded. 273 4. The Algorithm 275 Given that usually both the Nagle algorithm [RFC0896] [RFC1122] and 276 the TCP Selective Acknowledgment Option [RFC2018] are enabled, a TCP 277 sender MAY employ the following algorithm to detect and quantify the 278 current perceived packet reordering in the network. 280 Without the Nagle algorithm, there is no straight forward way to 281 accurately calculate the number of outstanding segments in the 282 network (and, therefore, no good way to derive an appropriate 283 reordering extent) without adding state to the TCP sender. A TCP 284 connection that does not employ the Nagle algorithm SHOULD NOT use 285 this methodology. 287 If a TCP sender implements the following algorithm, the 288 implementation MUST follow the various specifications provided in 289 Sections 4.1 to 4.8. The algorithm MUST be executed *before* the 290 Transmission Control Block or the SACK scoreboard have been updated 291 by another loss recovery algorithm. 293 4.1. Initialization During Connection Establishment 295 After the completion of the TCP connection establishment, the 296 following state variables MUST be initialized in the TCP transmission 297 control block: 299 (C.1) The variable Dsack, which indicates whether a DSACK has been 300 received so far, and the data structure Samples, which stores 301 the computed reordering extents, MUST be initialized as: 303 Dsack = false 304 Samples = [] 306 (C.2) If the TCP Timestamps option [RFC1122] has been negotiated, 307 then the variable Timestamps MUST be activated and the data 308 structure Retrans_TS, which stores the value of the TSval 309 field of the retransmissions sent during Fast Recovery, MUST 310 be initialized as: 312 Timestamps = true 313 Retrans_TS = [] 315 Otherwise, the Timestamps-based detection SHOULD be 316 deactivated: 318 Timestamps = false 320 4.2. Receiving Acknowledgments 322 For each received ACK that either a) carries SACK information, *or* 323 b) is a full ACK that terminates the current Fast Recovery procedure, 324 *or* c) is an acceptable ACK that is received immediately after a 325 duplicate ACK, execute steps (A.1) to (A.4), otherwise skip to step 326 (A.4). 328 (A.1) If a) the ACK carries new SACK information, *and* b) the SACK 329 scoreboard is empty (i.e., the TCP sender has received no SACK 330 information from the receiver), then the TCP sender MUST save 331 the amount of current outstanding data: 333 FlightSizePrev = FlightSize 335 (A.2) If the received ACK either a) cumulatively acknowledges at 336 most SMSS bytes, *or* b) selectively acknowledges at most SMSS 337 bytes in the sequence number space in the SACK scoreboard, 338 then: 340 The TCP sender MUST execute steps (S.1) to (S.4) 342 (A.3) If a) Timestamps == false *and* b) the received ACK carries a 343 DSACK option [RFC2883] and the segment identified by the DSACK 344 option can be marked according to step (A.1) to (A.4) of 345 [RFC3708] as a valid duplicate, then: 347 The TCP sender MUST execute steps (D.1) to (D.3) 349 (A.4) The TCP sender MUST terminate the processing of the ACK by 350 this algorithm and MUST continue with the default processing 351 of the ACK. 353 4.3. Receiving Acknowledgment Closing Hole 355 (S.1) If (a) the newly cumulatively or selectively acknowledged 356 segment SEG is a retransmission *and* b) both equations Dsack 357 == false and Timestamps == false hold, then the TCP sender 358 MUST skip to step (A.4). 360 (S.2) Compute the relative and absolute reordering extent ReorExtR, 361 ReorExtA: 363 The TCP sender MUST execute steps (E.1) to (E.4) 365 (S.3) If a) the newly acknowledged segment SEG was not retransmitted 366 before *or* b) both equations Timestamps == true and 367 Retrans_TS[SEG.SEQ] > ACK.TSecr hold, i.e., the ACK 368 acknowledges the original transmission and not a 369 retransmission, then hand over the reordering extents to an 370 additional reaction algorithm: 372 The TCP sender MUST execute step (P) 374 (S.4) If a) the previous step (S.3) was not executed *and* b) both 375 equations Dsack == true and Timestamps == false hold, save the 376 reordering extents for the newly acknowledged segment SEG for 377 at least two RTTs: 379 Samples[SEG.SEQ].ReorExtR = ReorExtR 380 Samples[SEG.SEQ].ReorExtA = ReorExtA 382 4.4. Receiving Duplicate Selective Acknowledgment 383 (D.1) If no DSACK has been received so far, the sender MUST set: 385 Dsack = true 387 (D.2) If a) the previous step (D.1) was not executed *and* a 388 reordering extent was calculated for the segment SEG 389 identified by the DSACK option, then the TCP sender MUST 390 restore the values of the variables ReorExtR and ReorExtA and 391 delete the corresponding entries in the data structure: 393 ReorExtR = Samples[SEG.SEQ].ReorExtR 394 ReorExtA = SAMPLES[SEG.SEQ].ReorExtA 396 (D.3) Hand the newly restored reordering extents over to an 397 additional reaction algorithm: 399 The TCP sender SHOULD execute step (P) 401 4.5. Reordering Extent Computation 403 (E.1) SEG.SEQ is the sequence number of the newly cumulatively or 404 selectively acknowledged segment SEG. 406 (E.2) SND.FACK is the highest either cumulatively or selectively 407 acknowledged sequence number so far plus one. 409 (E.3) The TCP sender MUST compute the absolute reordering extent 410 ReorExtA as 412 ReorExtA = (SND.FACK - SEG.SEQ) / SMSS 414 (E.4) The TCP sender MUST compute the relative reordering extent 415 ReorExtR as 417 ReorExtR= ReorExtA * (SMSS / FlightSizePrev) 419 4.6. Retransmitting Segment 421 If the TCP Timestamps option [RFC1323] is used to detect packet 422 reordering, the TCP sender must save the TCP Timestamps option of all 423 retransmitted segments during Fast Recovery. 425 (RET) If a) a segment SEG is retransmitted during Fast Recovery, 426 *and* b) the equation Timestamps = true holds, the TCP sender 427 MUST save the value of the TSval field of the retransmitted 428 segment: 430 Retrans_TS[SEG.SEQ] = SEG.TSval 432 4.7. Placeholder for Response Algorithm 434 (P) This is a placeholder for an additional reaction algorithm 435 that takes further action using the results of this algorithm, 436 for example, the adjustment of the DupThresh based on relative 437 and absolute reordering extent ReorExtR and ReorExtA. 439 4.8. Retransmission Timeout 441 The expiration of the retransmission timer should be interpreted as 442 an indication of a change in path characteristics, and the TCP sender 443 should consider all saved reordering extents as outdated and delete 444 them. 446 (RTO) If an retransmission timeout (RTO) occurs, a TCP sender SHOULD 447 reset the following variables: 449 Samples = [] 450 Retrans_TS = [] 452 5. Protocol Steps in Detail 454 The reception of an ACK represents the starting point for the 455 detection scheme above. For each received SACK, DSACK or acceptable 456 ACK that prompts the TCP sender to enter the 'disorder' state, to 457 remain in the 'disorder' state or to leave either the 'disorder' or 458 'recovery' states towards the 'open' state, steps (A.1) to (A.4) are 459 performed. All other received ACKs are not relevant for the 460 detection of packet reordering and can be ignored. If the TCP sender 461 changes from the 'open' to the 'disorder' state due to the reception 462 of a duplicate ACK (i.e., the SACK scoreboard is empty and an ACK 463 arrives carrying new SACK information), the current amount of 464 outstanding data, FlightSize, is stored for the subsequent 465 calculation of the relative reordering extent (step (A.1)). 467 Whenever a received acceptable ACK or SACK closes a hole in the 468 sequence number space of the SACK scoreboard either partially or 469 completely, this is an indication of packet reordering in the network 470 (step (A.2)). The prerequisite for an accurate quantification of the 471 reordering is that only one segment is newly acknowledged (maximum 472 SMSS bytes of data). If more than one segment per ACK is 473 acknowledged, either by reordering on the reverse path or the loss of 474 ACKs, the order in which the segments have been received by the TCP 475 receiver is no longer accurately determinable so that in this case a 476 reordering extent is not calculated. Finally, if the received ACK 477 carries a DSACK option that identifies a segment that was 478 retransmitted only once, then this is sufficient to conclude 479 reordering (step (A.3)), so that a previously calculated reordering 480 extent can be passed to another algorithm (steps (D.3) and (P)). 482 With just the information provided by the ACK field or SACK 483 information above SND.UNA, the TCP sender is unable to distinguish 484 whether the ACK that finally acknowledges retransmitted data (either 485 cumulatively or selectively) was sent in response to the original 486 segment or a retransmission of the segment. This is described as the 487 retransmission ambiguity problem in [KP87]. Therefore, the detection 488 and quantification of reordering depends on other means to 489 distinguish between acknowledgments for transmission and 490 retransmission to detect if a retransmission was spurious. If 491 neither a DSACK has been received (Dsack == false) so far nor the TCP 492 Timestamps option has been enabled on connection establishment 493 (Timestamps == false) then there is no possibility for the TCP sender 494 to identify spurious retransmissions. Hence, the processing of the 495 received ACK by the detection algorithm must be terminated for 496 retransmitted segments (step (S.1)). Otherwise, if the segment that 497 corresponds to the closed hole in the sequence number space of the 498 SACK scoreboard has not been retransmitted or the retransmission can 499 be identified by the Eifel detection algorithm [RFC3522] as a 500 spurious retransmission, the previously calculated reordering extent 501 is valid (step (S.2)) and an additional reaction algorithm can be 502 executed (steps (S.3) and (P)). 504 For the use of the Eifel detection it is necessary to store the TCP 505 Timestamps option of all retransmissions sent during Fast Recovery 506 (step (Ret)). However, if the use of the Eifel detection algorithm 507 is not possible (Timestamps == false), the extent of a possible 508 reordering is stored for the possibility of a subsequent arrival of a 509 DSACK (step (P.4)). If no such DSACK is received in the next two 510 round-trip times, the reordering extent is discarded. Since the 511 DSACK extension is not negotiated during connection establishment 512 [RFC2883], the reordering extent is only stored if a DSACK was 513 previously received for the TCP connection (DSACK == true, step 514 (D.1)). 516 Regardless of whether packet reordering is detected by using the 517 SACK-based methodology, the DSACK-based methodology, or the TCP 518 Timestamps option, quantification of the reordering will always be 519 done when closing a hole in the sequence number space of the SACK 520 scoreboard (step (A.2), step (P.2)). The only difference is the time 521 of detection, which is in the case of DSACK-based methodology at 522 least one RTT after the time of the quantification. The absolute 523 reordering extent ReorExtA results from the number of segments in the 524 SACK scoreboard between the sequence number of the newly acknowledged 525 segment and the highest either cumulatively or selectively 526 acknowledged sequence number so far plus one (SND.FACK) (step (E.3)). 528 It is worth noting that the absolute reordering extent includes all 529 segments (bytes) between the closed hole and the highest acknowledged 530 sequence number so far, i.e., it also includes segments (bytes) that 531 are not selectively acknowledged. The reason is that if packet 532 reordering is considered from a temporal perspective, it is 533 irrelevant whether there are lost segments or not. The important 534 fact is that the lost segments have been sent after the delayed 535 segment and before the highest acknowledged segment, which is 536 expressed by the metric. In step (E.4), the relative reordering 537 extent ReorExtR is then calculated by the ratio between the absolute 538 reordering extent ReorExtA and the amount of outstanding data stored 539 by step (A.1). 541 6. Discussion of the Algorithm 543 The focus of the following discussion is on the quantification of 544 reordering by the relative reordering extent and to elaborate on 545 possible sources of error, which may lead to an inaccurate detection 546 and quantification of reordering in the network. 548 6.1. Calculation of the Relative Reordering Extent 550 Generally, the characteristics of a relative reordering extent should 551 be that if packet reordering on a path is constant in terms of rate 552 and delay, the relative reordering extent should also be constant, 553 regardless of the current transmission rate of the TCP sender. The 554 scheme proposed in this document is to calculate the relative 555 reordering by getting the ratio between absolute reordering (the 556 amount of data the reordered segment was received too late) and the 557 amount of outstanding data stored when TCP sender was entering the 558 'disorder' state (the maximum amount of data a reordered segment can 559 be received too late). (Note: A segment can theoretically be delayed 560 for an arbitrarily long period during reordering. However, the 561 scheme proposed in this document does not capture such kind of 562 reordering. See Section 6.2.) Therefore, the relative reordering 563 extent expresses the portion of currently outstanding data that is 564 selectively acknowledged before the reordered segment is cumulatively 565 acknowledged. If the transmission rate changes, the absolute 566 reordering extent changes as well, but together with the amount of 567 outstanding data, and hence the relative reordering extent stays 568 constant. 570 A characteristic of the calculation of the relative reordering extent 571 on the basis of currently outstanding amount of data is that the 572 FlightSize reflects the bandwidth-delay-product and not the 573 transmission rate. As a consequence, the relative reordering extent 574 is not independent of the RTT. If the RTT of the communication path 575 changes, the amount of outstanding data changes as well, but the 576 absolute reordering extent remains constant. Hence, the relative 577 reordering extent adapts. In principle it is possible to design an 578 algorithm to compute the relative reordering extent independently of 579 the RTT and to reflect only the characteristics of packet reordering 580 of the path. But since the calculation would be far from trivial and 581 introducing more complexity, this is considered to be future 582 research. 584 6.2. Reordering Delay Longer than RTT 586 The quantification of packet reordering always takes place at the 587 time when a hole in sequence number space in the SACK scoreboard is 588 closed. In consequence, if a reordered packet is delayed by more 589 than the measured RTT, the amount of reordering on the path is 590 underestimated, since the ACK that closes the hole in the sequence 591 number space was not sent in response to the original transmission, 592 but to the retransmission. Hence, in order to detect and quantify 593 these kinds of events, the reordering extent would have to be 594 calculated when the ACK for the transmission is received and not at 595 the time the hole in the sequence number space is closed. Although 596 it would be possible to take these events into account by evaluating 597 DSACKs and the TCP Timestamps option simultaneously, the scheme 598 proposed in this document refrains from qualifying a reordering delay 599 longer then RTT. A reordering delay of this magnitude is very 600 unlikely, and would lead to a significant overhead in memory usage 601 and complexity. Additionally, taking it into account would result in 602 other problems, especially a potential expiration of the RTO 603 [I-D.zimmermann-tcpm-reordering-reaction]. 605 6.3. Persistent Reception of Selective Acknowledgments 607 Especially on paths with a high bandwidth-delay-product, it is 608 possible that even with a minor packet reordering, several segments 609 in a single window of data are delayed. If, in addition, the 610 sequence numbers of those segments are widely spaced in the sequence 611 number space and the delay caused by packet reordering is 612 sufficiently high, this might lead to a constant reception of out-of- 613 order data. Hence, for each received segment, regardless of whether 614 a hole in the sequence number space of the receive window is closed 615 or not, an ACK is sent that carries SACK information. From TCP 616 sender's perspective, this persistent receiving of new SACK 617 information leads to the situation that the TCP sender enters the 618 'disorder' state when receiving the first SACK and never leaves it 619 again during the connection lifetime if no segment is lost in 620 between. 622 In case of the above reordering detection and quantification scheme, 623 the persistent reception of SACK blocks causes the amount of 624 outstanding data, which is stored when the TCP sender enters the 625 'disorder' state, to never be updated, since FlightSize is only saved 626 in step (A.1) when the SACK scoreboard is empty. If the transmission 627 rate of the TCP sender, and therefore also the maximum amount of data 628 a reordered segment can be received too late, changes significantly 629 during its stay in the 'disorder' state, the actual amount of 630 reordering is not accurately determined by the relative reordering 631 extent. A decrease of the transmission rate would result in an 632 overestimation of the reordering extent and vice versa. 634 A simple solution to the problem would be to store the maximum offset 635 in terms of sequence number space by which a reordered segment can be 636 received too late only when entering the 'disorder' state, but 637 individually for every potentially reordered segment, that is, for 638 every hole in the sequence number space of the SACK scoreboard. 639 (Note: The maximum offset in terms of sequence number space by which 640 a reordered segment can be received too late is strictly speaking the 641 amount of data that have been transmitted later than the reordered 642 segment. This amount of data can only be expressed by FlightSize 643 within the 'open' state and not within the 'disorder' state, since 644 the cumulative ACK point may not advance). 646 The problem with this simple idea is that for a new hole in the SACK 647 scoreboard, it is not possible to determine whether it is a result of 648 packet reordering or loss, and therefore it results in increased 649 memory usage (to store the amount of data for each hole). 650 Additionally, the packet reordering would be inaccurately quantified 651 if the transmission rate changes significantly for a short amount of 652 time. For example, if the amount of outstanding data is low when 653 entering the 'disorder' state is entered, the execution of Careful 654 Extended Limited Transmit (as described in 655 [I-D.zimmermann-tcpm-reordering-reaction] [RFC4653]) leads to a 656 significant short-term change of the transmission rate. When the 657 amount of data by which the reordering segment can be delayed is 658 determined individually for every new hole, it leads to an 659 overestimation of the relative reordering extent, since the maximum 660 amount of data possible is 'artificially' reduced by Careful Extended 661 Limited Transmit. 663 A solution to this problem is to store the maximum offset in terms of 664 sequence number space by which a reordered segment can be received 665 too late not for every segment individually (which does not guarantee 666 an accurate calculation of the relative reordering extent) but only 667 sufficiently often, e.g., once per RTT. The identification of what 668 frequency would be adequate, though, is neither trivial nor 669 universally applicable, since a concrete solution depends on the 670 transmission behavior of the used TCP in the 'disorder' state and 671 whether it is more beneficial for an additional reordering response 672 algorithm to over- or underestimate the packet reordering on the 673 path. If, for example, TCP-aNCR 674 [I-D.zimmermann-tcpm-reordering-reaction] is used as additional 675 reordering response algorithm, the maximum offset in terms of 676 sequence number space by which a reordered segment can be received 677 too late is not only stored when entering the 'disorder' state but 678 also updated every RTT (every cwnd worth of data transmitted without 679 a loss) while the TCP sender stays in the 'disorder' state. 681 6.4. Packet Duplication 683 Although the problem of packet duplication in today's Internet 684 [Jai+07], [Mel+08] is negligible, it may happen in rare cases that 685 segments on the path to the TCP receiver are duplicated. If a 686 segment is duplicated on the path, the first incoming segment causes 687 the receiver to send either an acceptable ACK or a SACK, depending on 688 whether the segment is the next expected one or not. Each subsequent 689 identical segment then causes either a duplicate ACK or a DSACK, 690 respectively, depending on whether the DSACK extension [RFC3708] is 691 implemented or not. 693 If by a combination of packet loss and packet duplication the case 694 occurs that a Fast Retransmit for a lost segment is duplicated on the 695 path, the TCP sender is not able to distinguish this from packet 696 reordering. The first received ACK closes a hole in the sequence 697 number space of the SACK scoreboard, while the second received ACK is 698 a valid DSACK. Although both cases are indistinguishable from a 699 theoretical point of view, the TCP sender can take measures to ensure 700 as far as possible that the DSACK received was not the result of 701 packet duplication. 703 For this purpose, step (A.3) of the above detection method checks via 704 the steps (A.1) to (A.4) of [RFC3708] whether the segment identified 705 by the DSACK option is marked as a valid duplicate. Unfortunately, 706 the steps of [RFC3708] do not check that more DSACKs have been 707 received than retransmissions have been sent, which is a 708 characteristic of suffering both packet reordering and packet 709 duplication at the same time. By simply counting the received 710 DSACKs, for example, as additional step (A.5) in [RFC3708], this 711 corner case can be covered as well. 713 7. Related Work 715 Because of retransmission ambiguity problem [KP87], which describes 716 TCP sender's inability to to distinguish whether the first acceptable 717 ACK that arrives after a retransmit was sent in response to the 718 original transmit or the retransmit, two different approaches can 719 generally be taken to detect and quantify packet reordering. First, 720 for transmissions (non-retransmitted segments), the detection is 721 usually conducted by detecting a closed hole in sequence number space 722 of the SACK scoreboard. Second, for retransmissions, the detection 723 of packet reordering is accompanied by the detection of the spurious 724 Fast Retransmits. 726 Within the IETF, several proposals have been published in the RFC 727 series to detect and quantify packet reordering. With [RFC4737] the 728 IPPM Working Group [IPPM] defines several metrics to evaluate whether 729 a network path has maintained packet order on a packet-by-packet 730 basis. [RFC4737] gives concrete, well-defined metrics, along with a 731 methodology for applying the metric to a generic packet stream. The 732 metric discussed in this document has a different purpose from the 733 IPPM metrics; this document discusses a TCP specific reordering 734 metric calculated on the TCP sender's SACK scoreboard. 736 Besides the IPPM work, several other proposals have been developed to 737 detect spurious retransmissions with TCP. The Eifel detection 738 algorithm [RFC3522] uses the TCP Timestamps option to determine 739 whether the ACK for a given retransmit is for the original 740 transmission or a retransmission. The F-RTO scheme [RFC5682] 741 slightly alters TCP's sending pattern immediately following a 742 retransmission timeout to indicate whether the retransmitted segment 743 was needed. Finally, the DSACK-based algorithm [RFC3708] uses the 744 TCP SACK option [RFC2018] with the DSACK extension [RFC2883] to 745 identify unnecessary retransmissions. The mechanism for detecting 746 packet reordering outlined in this document rely on the detection 747 schemes of those documents (except F-RTO that only works for spurious 748 retransmits triggered by TCP's retransmission timer), although they 749 do not provide metrics for the reordering extent whereas the 750 algorithm described in this document does. 752 RR-TCP [Zha+03] describes a reordering detection and quantification 753 scheme that is also based on holes in the sequence number space of 754 the SACK scoreboard and the reception of DSACKs. For every hole in 755 the SACK scoreboard, RR-TCP calculates a reordering extent. If the 756 segment was retransmitted before an ACK was received, it waits for a 757 DSACK that proves that the segment was spuriously retransmitted. The 758 reordering sample in such a case is the mean between the sample 759 calculated due to the hole in the sequence number space and the 760 sample calculated in responding to the received DSACK. In contrast, 761 the methodology in this document is always to quantify the reordering 762 at the time of a closed hole and to not take packet reordering with a 763 delay larger than one RTT into account (see Section 6.2. 765 The Linux kernel [Linux] implements a reordering detection based on 766 SACK, DSACK and TCP Timestamps option as well. The detection and 767 quantification of non-retransmitted segments with SACK or for 768 retransmitted segments with TCP Timestamps option operates much like 769 the scheme described in this document, with the exception of the 770 DSACK detection. First, Linux does not store any information (e.g., 771 reordering extent) below the cumulative ACK point, so that DSACKs 772 below the cumulative ACK point are ignored (for the purpose for 773 reordering quantification). Second, Linux also does not store any 774 information about a possible reordering event when a hole in the 775 sequence number space of the SACK scoreboard is closed. Therefore, 776 for a DSACK reporting a duplicate above the cumulative ACK, Linux 777 needs to approximate the reordering on arrival of a DSACK by the 778 distance between the DSACK and the highest selectively acknowledged 779 segment. 781 8. IANA Considerations 783 This memo includes no request to IANA. 785 9. Security Considerations 787 The described algorithm neither improves nor degrades the current 788 security of TCP, since this document only detects and quantifies 789 reordering and does not change the TCP behavior. General security 790 considerations for SACK based loss recovery are outlined in 791 [RFC6675]. 793 10. Acknowledgments 795 The authors thank the flowgrind [Flowgrind] authors and contributors 796 for their performance measurement tool, which give us a powerful tool 797 to analyze TCP's congestion control and loss recovery behavior in 798 detail. 800 11. References 801 11.1. Normative References 803 [I-D.zimmermann-tcpm-reordering-reaction] 804 Zimmermann, A., Schulte, L., Wolff, C., and A. Hannemann, 805 "An adaptive Robustness of TCP to Non-Congestion Events", 806 draft-zimmermann-tcpm-reordering-reaction-01 (work 807 in progress), November 2013. 809 [MM96] Mathis, M. and J. Mahdavi, "Forward Acknowledgement: 810 Refining TCP Congestion Control", ACM SIGCOMM 1996 811 Proceedings, in ACM Computer Communication Review 26 (4), 812 pp. 281-292, October 1996. 814 [RFC0793] Postel, J., "Transmission Control Protocol", STD 7, 815 RFC 793, September 1981. 817 [RFC1323] Jacobson, V., Braden, B., and D. Borman, "TCP Extensions 818 for High Performance", RFC 1323, May 1992. 820 [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP 821 Selective Acknowledgment Options", RFC 2018, October 1996. 823 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 824 Requirement Levels", BCP 14, RFC 2119, March 1997. 826 [RFC2883] Floyd, S., Mahdavi, J., Mathis, M., and M. Podolsky, "An 827 Extension to the Selective Acknowledgement (SACK) Option 828 for TCP", RFC 2883, July 2000. 830 [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm 831 for TCP", RFC 3522, April 2003. 833 [RFC3708] Blanton, E. and M. Allman, "Using TCP Duplicate Selective 834 Acknowledgement (DSACKs) and Stream Control Transmission 835 Protocol (SCTP) Duplicate Transmission Sequence Numbers 836 (TSNs) to Detect Spurious Retransmissions", RFC 3708, 837 February 2004. 839 [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion 840 Control", RFC 5681, September 2009. 842 11.2. Informative References 844 [BA02] Blanton, E. and M. Allman, "On Making TCP More Robust to 845 Packet Reordering", ACM Computer Communication 846 Review vol.32, no. 1, pp. 20-30, January 2002. 848 [BPS99] Bennett, J., Partridge, C., and N. Shectman, "Packet 849 reordering is not pathological network behavior", IEEE/ACM 850 Transactions on Networking vol. 7, no. 6, pp. 789-798, 851 December 1999. 853 [BS02] Bellardo, J. and S. Partridge, "Measuring Packet 854 Reordering", Proceedings of the 2nd ACM SIGCOMM Workshop 855 on Internet Measurment (IMW'02) pp. 97-105, November 2002. 857 [Flowgrind] 858 "Flowgrind Home Page", . 860 [GPL04] Gharai, L., Perkins, C., and T. Lehman, "Packet 861 Reordering, High Speed Networks and Transport Protocol 862 Performance", Proceedings of the 13th IEEE International 863 Conference on Computer Communications and Networks 864 (ICCCN'04) pp. 73-78, October 2004. 866 [Hoe96] Hoe, J., "Improving the Start-up Behavior of a Congestion 867 Control Scheme for TCP", Proceedings of the Conference on 868 Applications, Technologies, Architectures, and Protocols 869 for Computer Communication (SIGCOMM'96) pp. 270-280, 870 August 1996. 872 [IPPM] "IP Performance Metrics (IPPM) Working Group", 873 . 875 [Jai+07] Jaiswal, S., Iannaccone, G., Diot, C., Kurose, J., and D. 876 Towsley, "Measurement and Classification of Out-of- 877 Sequence Packets in a Tier-1 IP Backbone", IEEE/ACM 878 Transactions on Networking vol. 15, no. 1, pp. 54-66, 879 February 2007. 881 [KP87] Karn, P. and C. Partridge, "Improving Round-Trip Time 882 Estimates in Reliable Transport Protocols", ACM SIGCOMM 883 Computer Communication Review vol. 17, no. 5, pp. 2-7, 884 November 1987. 886 [Linux] "The Linux Project", . 888 [Mat+97] Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The 889 Macroscopic Behavior of the TCP Congestion Avoidance 890 Algorithm", ACM SIGCOMM Computer Communication Review vol. 891 27, no. 3, pp. 67-82, July 1997. 893 [Mel+08] Mellia, M., Meo, M., Muscariello, L., and D. Rossi, 894 "Passive analysis of TCP anomalies", Computer 895 Networks vol. 52, no. 14, pp. 2663-2676, October 2008. 897 [Pax97] Paxson, V., "End-to-End Internet Packet Dynamics", IEEE/ 898 ACM Transactions on Networking vol. 7, no.3, pp. 277-292, 899 June 1997. 901 [RFC0896] Nagle, J., "Congestion control in IP/TCP internetworks", 902 RFC 896, January 1984. 904 [RFC1122] Braden, R., "Requirements for Internet Hosts - 905 Communication Layers", STD 3, RFC 1122, October 1989. 907 [RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., 908 Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., 909 Zhang, L., and V. Paxson, "Stream Control Transmission 910 Protocol", RFC 2960, October 2000. 912 [RFC4653] Bhandarkar, S., Reddy, A., Allman, M., and E. Blanton, 913 "Improving the Robustness of TCP to Non-Congestion 914 Events", RFC 4653, August 2006. 916 [RFC4737] Morton, A., Ciavattone, L., Ramachandran, G., Shalunov, 917 S., and J. Perser, "Packet Reordering Metrics", RFC 4737, 918 November 2006. 920 [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, 921 "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting 922 Spurious Retransmission Timeouts with TCP", RFC 5682, 923 September 2009. 925 [RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, 926 "Computing TCP's Retransmission Timer", RFC 6298, 927 June 2011. 929 [RFC6582] Henderson, T., Floyd, S., Gurtov, A., and Y. Nishida, "The 930 NewReno Modification to TCP's Fast Recovery Algorithm", 931 RFC 6582, April 2012. 933 [RFC6675] Blanton, E., Allman, M., Wang, L., Jarvinen, I., Kojo, M., 934 and Y. Nishida, "A Conservative Loss Recovery Algorithm 935 Based on Selective Acknowledgment (SACK) for TCP", 936 RFC 6675, August 2012. 938 [ZM04] Zhou, X. and P. Mieghem, "Reordering of IP Packets in 939 Internet", Lecture Notes in Computer Science vol. 3015, 940 pp. 237-246, April 2004. 942 [Zha+03] Zhang, M., Karp, B., Floyd, S., and L. Peterson, "RR-TCP: 943 A Reordering-Robust TCP with DSACK", Proceedings of the 944 11th IEEE International Conference on Network Protocols 945 (ICNP'03) pp. 95-106, November 2003. 947 Appendix A. Changes from previous versions of the draft 949 This appendix should be removed by the RFC Editor before publishing 950 this document as an RFC. 952 A.1. Changes from draft-zimmermann-tcpm-reordering-detection-00 954 o Improved the wording throughout the document. 955 o Replaced and updated some references. 957 Authors' Addresses 959 Alexander Zimmermann 960 NetApp, Inc. 961 Sonnenallee 1 962 Kirchheim 85551 963 Germany 965 Phone: +49 89 900594712 966 Email: alexander.zimmermann@netapp.com 968 Lennart Schulte 969 Aalto University 970 Otakaari 5 A 971 Espoo 02150 972 Finland 974 Phone: +358 50 4355233 975 Email: lennart.schulte@aalto.fi 977 Carsten Wolff 978 credativ GmbH 979 Hohenzollernstrasse 133 980 Moenchengladbach 41061 981 Germany 983 Phone: +49 2161 4643 182 984 Email: carsten.wolff@credativ.de 985 Arnd Hannemann 986 credativ GmbH 987 Hohenzollernstrasse 133 988 Moenchengladbach 41061 989 Germany 991 Phone: +49 2161 4643 134 992 Email: arnd.hannemann@credativ.de