idnits 2.17.1 draft-mathis-tcp-ratehalving-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 1 longer page, the longest (page 1) being 971 lines Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The document seems to lack separate sections for Informative/Normative References. All references will be assumed normative when checking for downward references. ** There are 3 instances of too long lines in the document, the longest one being 2 characters in excess of 72. ** The abstract seems to contain references ([Hoe95]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 116: '...nd. At any given time, a TCP MUST NOT...' RFC 2119 keyword, line 838: '...afe to relax the SHOULD in RFC2018, su...' Miscellaneous warnings: ---------------------------------------------------------------------------- -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (August 1999) is 9020 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Possible downref: Non-RFC (?) normative reference: ref. 'Hoe95' -- Possible downref: Non-RFC (?) normative reference: ref. 'Hoe96' -- Possible downref: Non-RFC (?) normative reference: ref. 'MM96' -- Possible downref: Non-RFC (?) normative reference: ref. 'MM97' -- Possible downref: Non-RFC (?) normative reference: ref. 'NS' ** Obsolete normative reference: RFC 793 (Obsoleted by RFC 9293) ** Obsolete normative reference: RFC 2481 (Obsoleted by RFC 3168) ** Obsolete normative reference: RFC 2581 (Obsoleted by RFC 5681) ** Obsolete normative reference: RFC 2582 (Obsoleted by RFC 3782) Summary: 13 errors (**), 0 flaws (~~), 2 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Engineering Task Force Matt Mathis 2 INTERNET DRAFT Jeff Semke 3 draft-mathis-tcp-ratehalving-00.txt PSC 4 Jamshid Mahdavi 5 Novell 6 August 1999 7 Expires: February 2000 9 The Rate-Halving Algorithm for TCP Congestion Control 11 Status of this Memo 13 This document is an Internet-Draft and is in full conformance with 14 all provisions of Section 10 of RFC2026. Internet-Drafts are working 15 documents of the Internet Engineering Task Force (IETF), its areas, 16 and its working groups. Note that other groups may also distribute 17 working documents as Internet-Drafts. 19 Internet-Drafts are draft documents valid for a maximum of six months 20 and may be updated, replaced, or obsoleted by other documents at any 21 time. It is inappropriate to use Internet- Drafts as reference 22 material or to cite them other than as "work in progress." 24 The list of current Internet-Drafts can be accessed at 25 http://www.ietf.org/ietf/1id-abstracts.txt 27 The list of Internet-Draft Shadow Directories can be accessed at 28 http://www.ietf.org/shadow.html. 30 Abstract 32 This draft provides a detailed description of the Rate-Halving 33 algorithm. As specified by RFC2581, Fast Recovery adjusts the 34 congestion window (cwnd) by transmitting new data only during the 35 second half of the recovery interval. The Rate-Halving algorithm 36 adjusts the congestion window by spacing transmissions at the rate of 37 one data segment per two segments acknowledged over the entire 38 recovery period, thereby sustaining the self-clocking of TCP and 39 avoiding a burst. Since it is largely independent of the details of 40 the data retransmission strategy, the Rate-Halving algorithm can be 41 used with several standard and experimental TCP implementations: 42 NewReno, SACK, and ECN. 44 This algorithm was first proposed by Janey Hoe [Hoe95]. 46 1. Introduction 47 =============== 49 All "Reno TCP" implementations include TCP Fast Retransmit and Fast 50 Recovery algorithms [RFC2581]. Fast retransmit relies on three 51 duplicate acknowledgements to trigger the retransmission of a single 52 lost segment. Once the Fast Retransmit has occurred, TCP then waits 53 for enough additional duplicate ACKs to arrive, indicating that half 54 of the data in flight has left the network. Only when this has 55 occurred will TCP send additional new data. 57 The consequence of this delay is that the entire new window of data 58 is transmitted in one half of one Round Trip Time (RTT). This burst 59 can cause repeated bursts in successive RTTs following the recovery, 60 which can result in overall additional burstiness on the network. 62 Hoe [Hoe95] suggested that during Fast Recovery the TCP data sender 63 space out retransmissions and new data on alternate acknowledgements 64 across the entire recovery RTT. (Note that this eliminates the half 65 RTT lull in sending which occurs in Reno TCP.) 67 This document describes a Rate-Halving algorithm which implements 68 Hoe's idea. It explains how to implement the algorithm under 69 NewReno, SACK, and ECN-style TCP implementations. This document does 70 not discuss the other suggestions in [Hoe95] and [Hoe96], such as a 71 change to the ssthresh parameter during Slow-Start, or the NewReno 72 proposal [RFC2582]. 74 Rate-Halving has a number of other useful properties as well. It 75 results in a slightly lower final value for cwnd following recovery, 76 which has been suggested by Floyd and others as the more correct 77 value. The Rate-Halving algorithm provides proper adjustments to the 78 congestion window in response to congestion signals such as a lost 79 segment or an ECN-Echo bit [RFC2481]. These adjustments are largely 80 independent of the strategy used to retransmit missing segments, 81 allowing Rate-Halving to be extended for use in other TCP 82 implementations or even non-TCP transport protocols. 84 Definitions of terms and variables that are used throughout the rest 85 of the document can be found in Section 2. Section 3 provides an 86 overview of the Rate-Halving algorithm. Section 4 presents a 87 detailed specification of the algorithm. Section 5 contains related 88 commentary, and Section 6 discusses interactions between Rate-Halving 89 and other components of TCP. 91 2. Definitions 92 ============== 94 The following terms are used throughout the document, and are defined 95 here for clarity. The definitions within quotations in this section 96 are included from the cited documents. 98 SENDER MAXIMUM SEGMENT SIZE (SMSS): "The SMSS is the size of the 99 largest segment that the sender can transmit. This value can be 100 based on the maximum transmission unit of the network, the path 101 MTU discovery [RFC1191] algorithm, RMSS, or other factors. The 102 size does not include the TCP/IP headers and options." from 103 [RFC2581] 105 ADJUSTMENT INTERVAL: The time during which TCP reduces the 106 congestion window after experiencing congestion. 108 REPAIR INTERVAL: The time during which TCP retransmits lost 109 segments. 111 RATE-HALVING STATE (rh_state): A state variable indicating what 112 form of adjustments are being performed during adjustment 113 intervals. The Rate-Halving states are described in Section 4.1. 115 CONGESTION WINDOW (cwnd): "A TCP state variable that limits the 116 amount of data a TCP can send. At any given time, a TCP MUST NOT 117 send data with a sequence number higher than the sum of the 118 highest acknowledged sequence number and the minimum of cwnd and 119 rwnd." from [RFC2581]. Cwnd, as used by Reno and NewReno TCPs, is 120 used as an offset from snd_una (the highest-sequence segment 121 acknowledged by the receiver, indicating that all segments less 122 than snd_una have been received). During recovery, Reno and 123 NewReno artificially inflate cwnd as dupacks are received. 125 RATE-HALVING CONGESTION WINDOW (rhcwnd): In order to differentiate 126 the usage of cwnd within Rate-Halving from the cwnd used by Reno 127 and NewReno [RFC2582], we introduce a new but comparable 128 congestion window variable called rhcwnd. Rhcwnd is not inflated 129 by dupacks during congestion episodes. It represents the amount 130 of data that the sender is allowed to have outstanding. It is not 131 considered an offset to snd_una, but rather a count of the data in 132 flight, regardless of whether the segments contain new or 133 retransmitted data. 135 PRIOR_RHCWND (prior_rhcwnd): The value of rhcwnd saved when the 136 adjustment interval begins. 138 LOSS WINDOW (LW): "The loss window is the size of the congestion 139 window after a TCP sender detects loss using its retransmission 140 timer." from [RFC2581] 142 MAX_SEQ (max_seq): The maximum sequence number transmitted for 143 the current connection. 145 PRIOR_MAX_SEQ (prior_max_seq): The maximum sequence number 146 transmitted before the start of an adjustment interval. 148 FORWARD ACKNOWLEDGEMENT (fack): The highest sequence number known to 149 have reached the receiver, plus one. (This includes SACK 150 information, and is different than the variable snd.una when 151 losses have occurred). The use of the fack state variable (and 152 the retran_data state variable, below) was originally described by 153 Mathis [MM96]. The details for setting fack can be found in 154 Section 6.1.2. 156 DUPLICATE ACKNOWLEDGMENTS (dupacks): Acknowledgments that carry the 157 same sequence number as an earlier acknowledgment. In order to 158 qualify as a dupack, the ack must carry no data, and must not be a 159 window update. Dupacks provide an indication that the segment 160 following the sequence number in the dupack may have been lost, 161 but that some future segment has been received. The state 162 variable "dupacks" is a count of the current number of 163 "outstanding" dupacks. By "outstanding", we mean dupacks which 164 have been received, and which represent data not yet covered by 165 snd.una. 167 PARTIAL ACKNOWLEDGMENTS: Following congestion, these are "ACKs which 168 cover new data, but not all the data outstanding when loss was 169 detected" from [RFC2582]. 171 FULL ACKNOWLEDGMENT: Following congestion, an ACK which covers all 172 data outstanding when a loss was detected (i.e. an ack that is 173 greater than or equal to prior_max_seq.) 175 RETRANSMITTED DATA (retran_data): The amount of retransmitted data 176 outstanding in the network. 178 NUM_RETRANS (num_retrans): The number of bytes retransmitted during 179 the current repair interval. 181 ECN-ECHO ACK PACKET: "If the sender receives an ECN-Echo ACK packet 182 (that is, an ACK packet with the ECN-Echo flag set in the TCP 183 header), then the sender knows that congestion was encountered in 184 the network on the path from the sender to the receiver. The 185 indication of congestion should be treated just as a congestion 186 loss in non-ECN-Capable TCP." from [RFC2481]. 188 In addition to the above definitions, we use the following commonly- 189 used TCP state variables and terms: ssthresh [RFC2581], snd.una, 190 snd.nxt [RFC793], and segment [RFC2581]. 192 3. Overview of the Rate-Halving Algorithm 193 ========================================= 195 The Rate-Halving (RH) algorithm is designed to reduce the congestion 196 window following the detection of congestion in the network. 197 Rate-Halving reduces rhcwnd over one round trip by transmitting one 198 new segment for every two segments acknowledged by the receiver. The 199 Rate-Halving algorithm also detects the end of this round trip, 200 employing several different checks to perform this detection most 201 accurately. The choice of which segment to send at any given point 202 in time is completely independent of the Rate-Halving algorithm, 203 allowing Rate-Halving to be used simultaneously with ECN (where no 204 retransmissions occur) [RFC2481], NewReno (where retransmissions are 205 spread out one per RTT) [RFC2582], or SACK (where holes are filled 206 soon after they are detected) [RFC2018]. 208 As a result of the separation of window reduction from data 209 retransmission, the term "recovery" ceases to have a clear meaning. 210 We replace it with two terms: "adjustment interval" (referring to the 211 round trip containing the window adjustment) and "repair interval" 212 (referring to the retransmission of missing data). The adjustment 213 interval is entered immediately upon indication of the possible onset 214 of congestion. Two possible indications of congestion are presence 215 of the ECN-Echo bit, or indications of a packet loss via duplicate 216 ACK or via presence of a SACK block. 218 During the adjustment interval, the window is reduced by sending one 219 data segment for each two segments which are acknowledged, as per the 220 Rate-Halving algorithm (described in detail below). At the end of 221 the adjustment interval, additional checks are made to update 222 ssthresh and to ensure that the final value of rhcwnd is appropriate. 223 Due to the additional window reduction for the lost data, the final 224 value for rhcwnd is 1/2 of the correctly delivered portion of the 225 window of data which was in the network at the time of detection. 227 4. Complete Rate-Halving Algorithm 228 ================================== 230 This section describes the complete Rate-Halving algorithm. Most of 231 the complexity of the specification is a result of the need for 232 backwards compatibility with hosts that do not support SACK or ECN. 234 For all subsections of this section, additional discussion and 235 explanation may be found in Section 5 under the same subsection 236 number. 238 4.1 Rate-Halving states 240 Rate-Halving is specified as a state machine with the following 241 states: 243 RH_INCR - Rate-Halving is not performing any downward window 244 adjustment. This corresponds to TCP being in either Slow-start 245 or the linear increase region. 247 RH_EXACT - Rate-Halving is reducing the window with accurate knowledge 248 of what data has left the network. This state normally follows 249 congestion indicated by ECN or SACK. 251 RH_EST - Rate-Halving is reducing the window while estimating the 252 amount of data that has left the network. This state is 253 analagous to Reno's Fast Recovery. 255 RH_EST_REPAIR - Rate-Halving is holding the window constant while 256 repairing multiple holes in a manner similar to NewReno on 257 successive round trips after the window adjustment has completed. 259 +----------------+ +----------------+ 260 | RH_INCR |-------------------->| RH_EXACT | 261 | | 4.4 New SACK or ECN | | 262 | 4.2 Grow rhcwnd| | 4.6 Reduce | 263 | |<--------------------| rhcwnd | 264 | 4.3 Transmit | 4.8 Minor reorder | | 265 | | | 4.3 Transmit | 266 | |<--------------------| | 267 | | 4.10 New data ACKed | | 268 | | 4.14 Bound | | 269 | | | | 270 | |<--------------------| | 271 | | 4.15 Retransmit | | 272 | | Timeout | | 273 +----------------+ +----------------+ 274 ^ ^ ^ ^ | | 275 | | | | | | 276 | | | | | v 277 | | | | | | 278 4.14| | | | v +-----------+ 279 Bound| | | | | | 4.5 Dupacks 280 | | | | +---------------------------+ 281 4.13| | | | 4.5 Dupacks | 282 Adjust| | | +---------------+ | 283 | | | | | 284 4.11| | +---------------+ | | 285 Full| | | | | 286 Ack| +---------------+ | | | 287 | | | | v 288 +----------------+ | | | +----------------+ 289 | RH_EST_REPAIR | | | +-<---------| RH_EST | 290 | | | | 4.8 Reorder | | 291 | 4.9 Partial ACK| | | | 4.7 Reduce | 292 | | | +-<------------| rhcwnd | 293 | 4.3 Transmit | | 4.11 Full Ack| | 294 | | | 4.13 Adjust | 4.3 Transmit | 295 | | ^ 4.14 Bound | | 296 | | | | 4.9 Partial ACK| 297 | |->-+---------<-------| | 298 | | 4.15 Timeout | | 299 | | | | 300 | |<--------------------| | 301 | | 4.12 Exit if rhcwnd | | 302 | | <= prior_rhcwnd/2 | | 303 | | | | 304 | |-------------------->| | 305 | | 4.5 Re-enter | | 306 +----------------+ +----------------+ 308 4.2 Increasing Rhcwnd 310 Upon receipt of a new ACK in RH_INCR, apply Congestion Avoidance or 311 Slow-start per RFC2581 only when: 313 snd.nxt - fack + retran_data + SMSS >= rhcwnd 315 4.3 Data Transmission 317 While honoring all other sending constraints (such as receiver window), 318 in any state, transmit data of length "len" if 320 (snd.nxt - fack + retran_data + len) < rhcwnd 322 If the data is a retransmission, 324 retran_data += len 326 4.4 Entering RH_EXACT State 328 Transition from RH_INCR to RH_EXACT if the ACK either carries an 329 ECN-Echo bit, or indicates a new hole via a SACK option. 331 Store the following information: 333 prior_rhcwnd = rhcwnd 334 prior_max_seq = max_seq 336 4.5 Entering RH_EST State 338 From either RH_INCR or RH_EXACT, transition to RH_EST upon receipt of 339 a dupack with no SACK options. 341 From the RH_EST_REPAIR state, transition to RH_EST upon receipt of an 342 ACK with the ECN-Echo bit set. 344 If the transition is from RH_INCR or RH_EST_REPAIR to RH_EST, store 345 the following information: 347 prior_rhcwnd = rhcwnd 348 prior_max_seq = max_seq 350 4.6 Reduction of Rhcwnd in RH_EXACT State 352 For each ACK received while in state RH_EXACT, reduce rhcwnd by 1/2 353 the distance fack advances plus 1/2 the size of any new holes. 355 4.7 Reduction of Rhcwnd in RH_EST State 357 For each dupack received while in state RH_EST: 359 rhcwnd -= SMSS/2 360 fack = min((fack + SMSS), max_seq) 361 dupacks += 1 363 4.8 Detection of Reordering; Return to RH_INCR 365 Reordering detection applies only if an ECN-Echo has not been received 366 and a retransmission has not been sent since the adjustment interval 367 began. 369 In RH_EXACT, reordering has occurred if an ACK arrives carrying no 370 ECN-Echo bit and no SACK blocks. 372 In RH_EST, reordering has occurred if a partial or full ACK arrives. 374 If reordering is detected, transition to RH_INCR and restore the 375 value of rhcwnd, without performing the bounding checks of 4.14: 377 rhcwnd = prior_rhcwnd; 379 4.9 Processing Partial Acks in RH_EST / RH_EST_REPAIR 381 For each partial acknowledgement received in RH_EST or RH_EST_REPAIR: 383 1. Set retran_data to 0. 384 2. Reduce dupacks by [(bytes of data acked)/SMSS - 1]. 385 3. For the purpose of data transmission, note that a new hole exists 386 at snd.una. The new hole is eligible for retransmission if 387 dupacks > 3, after being reduced in step 2. 389 4.10 Completion of RH_EXACT; Return to RH_INCR 391 In RH_EXACT, Rate Halving is complete if one of the following occurs 392 (indicating that a round trip time has passed): 394 1. An ACK or SACK arrives which acknowledges data beyond 395 prior_max_seq. 397 2. An ACK or SACK arrives which acknowledges a segment retransmitted 398 after entering the RH_EXACT state. 400 Transition to state RH_INCR and perform the bounding checks in 4.14. 402 4.11 Completion of RH_EST and RH_EST_REPAIR; Return to RH_INCR 404 In state RH_EST or RH_EST_REPAIR, adjustment and repair are both 405 complete when a full acknowledgement is received (i.e. snd.una >= 406 prior_max_seq). 408 When this occurs, transition to state RH_INCR, perform the estimation 409 adjustment in 4.13, followed by the bounding checks in 4.14. 411 4.12 "NewReno" Case: Transition from RH_EST to RH_EST_REPAIR 413 In state RH_EST, adjustment (but not repair) is complete when 415 rhcwnd <= prior_rhcwnd / 2 417 Transition from state RH_EST to state RH_EST_REPAIR. 419 4.13 Corrections for Estimation upon return to RH_INCR 421 When adjustment and repair are complete after an estimated recovery, 422 perform the following adjustment to compute the correct value for 423 rhcwnd: 425 rhcwnd = (prior_rhcwnd - num_retrans) / 2; 427 4.14 Application of Bounding Paremeters 429 When transitioning back to RH_INCR, update the TCP state variables 430 as follows: 432 if (rhcwnd > prior_rhcwnd/2) rhcwnd = prior_rhcwnd/2; 433 ssthresh = rhcwnd; 434 if (ssthresh < prior_rhcwnd/4) ssthresh = prior_rhcwnd/4; 436 Check for a new condition that would cause entry into another 437 adjustment interval (e.g. the presence of SACK blocks reporting a new 438 hole (beyond prior_max_seq). 440 4.15 Retransmission Timeouts 442 If a timeout occurs while in a non-RH_INCR state, set 444 ssthresh = prior_rhcwnd / 2 445 rhcwnd = LW 447 and return to the RH_INCR. 449 Otherwise, if a timeout occurs while in RH_INCR, set 451 ssthresh = rhcwnd /2 452 rhcwnd = LW 454 5. Discussion of Rate-Halving Algorithm 455 ======================================= 457 Each of the subsections of Section 5 provide discussion for the 458 corresponding subsection of Section 4. 460 5.1 Rate-Halving States 462 The Rate-Halving algorithm may be viewed as a finite state machine 463 with four states. These states are independent of whether the TCP 464 connection is in Congestion Avoidance or Slow-start. 466 5.2 Increasing Rhcwnd 468 The left hand term in 4.2 includes all outstanding transmitted and 469 retransmitted segments that are still in the network, plus one 470 additional segment to account for the effects of rounding due to 471 segmentation. 473 This restriction assures that rhcwnd only grows as long as TCP 474 actually succeeds in injecting enough data into the network to test 475 the path. If TCP is throttled by something other than rhcwnd, then 476 there is no assurance that rhcwnd of data will actually fit into the 477 network. 479 This test should be done prior to taking any data which was ACKed or 480 SACKed in the incoming segment into account. This may be 481 accomplished either by performing this check immediately upon 482 receiving an ACK, or by remembering the total amount of new and 483 retransmitted data ACKed or SACKed by the segment, and adding this to 484 the term on the left hand side. 486 Since it is never safe to open the window while estimating what data 487 has left the network, Reno and NewReno can never open the window 488 while dupacks are present - e.g during any adjustment or repair 489 interval. 491 5.3 Data Transmission 493 Rhcwnd always specifies an amount of data allowed to be in flight in 494 the network. The expression (snd.nxt - fack + retran_data + len) is a 495 measure of the outstanding data. Data may be transmitted whenever 496 the amount of outstanding data is less than the amount allowed by 497 rhcwnd. 499 Contrast this to Reno, where cwnd is used as an offset to snd.una, 500 specifying which sequence numbers may be sent. As a consequence the 501 cwnd variable is overloaded during recovery (e.g. setting cwnd=SMSS 502 to force Fast Retransmit.) 504 To illustrate that the new semantics for rhcwnd in Rate-Halving are 505 compatible with those of cwnd in Reno, observe that in Reno the 506 boundary between being able to send data and not send data is: 508 snd.nxt = snd.una + cwnd 510 where 511 cwnd = rhcwnd - retran_data + dupacks*SMSS 512 and 513 retran_data = 0 or 1 depending on whether a retransmission has 514 been sent. 516 Substituting and rearranging the Reno equations yields equation 1: 518 (1) rhcwnd = snd.nxt - snd.una - dupacks*SMSS + retran_data 520 In Rate-Halving, the boundary occurs at: 522 awnd = rhcwnd, 524 where 525 awnd = snd.nxt - fack + retran_data 526 and 527 fack = snd.una + losses + SACKed data 529 Substituting and rearranging the Rate-Halving equations yields: 531 (2) rhcwnd = snd.nxt - snd.una - losses - SACKed data + 532 retran_data 534 It can be seen that equations 1 and 2 are quite similar. The losses 535 known by Rate-Halving (via SACK options) are not known to Reno. In 536 addition, the amount of data that has left the network (indicated by 537 SACK options to Rate-Halving) must be estimated by Reno as 538 dupacks*SMSS. Reno only allows one retransmitted segment per round 539 trip, while Rate-Halving permits (and keeps track of) more. 541 When SACK options are not used by the receiver, the Rate-Halving 542 sender must estimate by using equation 1. 544 Similar analysis can be applied to The "pipe" algorithm used for SACK 545 in the "ns" [NS] simulator and some SACK implementations. 547 5.4 Entering RH_EXACT State 549 Note that the sender can infer congestion by receiving a segment with 550 the ECN-Echo bit set, or by receiving an Ack carrying a SACK option 551 which indicates a new hole. In both of these cases, the amount of 552 data that has left the network is known, either because no data was 553 lost, or the amount of data that was lost is reported by SACK 554 options. 556 In the case of a suspected loss, enter the adjustment interval 557 immediately. While this may seem to be a major change to TCP, in 558 fact we retain the "3 Dupacks" check (and make it more robust) which 559 is included in Reno TCP. Segment retransmissions are determined by 560 the SACK scoreboard, as discussed in section 6.2.1. During the early 561 portion of the adjustment interval, one new segment will generally be 562 transmitted prior to retransmitting the lost segment which triggered 563 the adjustment. 565 The algorithm will detect if the segments have merely been reordered, 566 and the adjustment is terminated (see section 5.8 below). 568 Prior_rhcwnd and prior_max_seq will be used for end-of-adjustment 569 checks. 571 5.5 Entering RH_EST State 573 The significance of this state is that the amount of data in the 574 network is estimated, since exact information is not available 575 through SACK options. 577 If the prior state was RH_EST_REPAIR, then this is a new reduction 578 triggered by detecting a new ECN prior to the completion of a 579 NewReno-style repair. 581 The sender can also infer congestion by receiving duplicate 582 acknowledgements. In this case, it is known that a segment was lost 583 or misordered, but the number of missing segments is not known. The 584 number can only be estimated by counting the number of duplicate ACKs 585 received. Even if RH_EXACT was entered with an ECN-Echo, the RH_EST 586 state can be entered from RH_EXACT if a dupack is later received. 588 5.6 Reduction of Rhcwnd in RH_EXACT State 590 Rather than counting received ACKs, the distance (in bytes) that fack 591 advances is explicitly used. This ensures that the window is 592 properly reduced even when the receiver is sending delayed ACKs 593 during congestion episodes (such as with ECN). 595 Each SACK option that indicates a new hole triggers a reduction of 596 rhcwnd by all of data that was lost plus 1/2 the data that was 597 received. 599 During RH, we reduce rhcwnd as data leaves the network. The net 600 effect is to cause the transmission of one new segment for every two 601 segments reported by the receiver. 603 5.7 Reduction of Rhcwnd in RH_EST State 605 Assume that each dupack is for a segment of size SMSS, and that the 606 rate is being halved. This estimate is designed to result in rhcwnd 607 and fack values that are reasonable, given that more exact 608 information is not available about which segments have left the 609 network. These estimates are comparable to (and no worse than) the 610 estimates that Reno uses during Fast Recovery. When the repair 611 interval ends, final adjustments recalibrate rhcwnd and fack to their 612 correct values. 614 5.8 Detection of Reordering; Return to RH_INCR 616 Note that there is a choice of what do with the congestion window 617 when segment reordering is detected. In principle, rhcwnd could be 618 set to an arbitrary function of prior_rhcwnd. 620 For example, one could choose to penalize networks which reorder 621 segments by forcibly reducing rhcwnd: 623 rhcwnd = prior_rhcwnd / 2; 625 A gentler form of reordering penalty would be to leave rchwnd as 626 updated by Rate-Halving (Reduce rhcwnd as indicated in section 4.6 or 627 4.7 until the reordering has been detected). 629 The choice to restore rhcwnd results in no penalty for reordering: 631 rhcwnd = prior_rhcwnd; 633 This is functionally equivalent to today's Reno implementations, and 634 may be less bursty in some cases. (RH is less bursty in cases in 635 which a single new segment is transmitted prior to detecting the 636 reordering, resulting in a burst which is one segment smaller than 637 the equivalent Reno burst). 639 The optimal action is probably between these extremes, and should be 640 the subject of future research. 642 5.9 Processing Partial Acks in RH_EST / RH_EST_REPAIR 644 A partial ACK indicates that one hole has been filled (since it is 645 now covered by an ACK) and that another one exists. In step 1, 646 retran_data is cleared since the retransmitted segments left the 647 network when the partial ACK was generated. 649 Some of the segments that generated duplicate ACKs are acknowledged 650 by the partial ACK. These segments must be accounted for in the 651 count of dupacks, so step 2 reduces dupacks by the amount of data 652 that has just been acknowledged. 654 A new hole is indicated by the partial ack one round trip after the 655 retransmission for the previous hole was sent. The only way in which 656 the segment indicated by the partial ack would be falsely 657 retransmitted would be if the segment was reordered (delayed) by more 658 than a round trip time. 660 If the previous hole was caused by a reordered segment, then the 661 partial ack may appear without sending a retransmission (and with 662 dupacks < 3). In this case, the new hole is not eligible for 663 retransmission until dupacks >= 3. 665 5.10 Completion of RH_EXACT; Return to RH_INCR 667 Under severe reordering (after retransmission) condition 2 may cause 668 a premature exit from the adjustment interval. This would leave 669 rhcwnd too large. However the bounding check in 4.14 will force the 670 required 1/2 window reduction. 672 Case 2 might be particularly tricky to implement for cases in which 673 SACK, ECN, and NewReno are all supported. Here are some guidelines 674 for implementation: 676 ACKs or SACKs which show the arrival of retransmitted data may not be 677 reliable indicators of case 2. This is because it is possible (and 678 even likely) for the data repair interval to continue well beyond the 679 end of the adjustment interval. Should a second adjustment interval 680 begin as a result of subsequent lost data, it is possible for holes 681 to be filled which don't satisfy case 2. Therefore, a more rigorous 682 test is needed for retransmitted segments. 684 Assuming that there is a SACK scoreboard, one possible implementation 685 is to store an adjustment interval ID number with each 686 retransmission. This ID will clearly indicate during which 687 adjustment interval a retransmission occurred. (The ID may also be 688 "zero" indicating the retransmissions occurred outside of an 689 adjustment interval). ACKs and SACKs which are received for 690 retransmissions can then be definitively tested against condition 2. 692 5.11 Completion of RH_EST and RH_EST_REPAIR; Return to RH_INCR 694 A full ack indicates that all segments that were outstanding at the 695 first indication of congestion have been successfully received. 696 Therefore, the adjustment and repair intervals should be completed, 697 but final adjustments and checks must be made to ensure the validity 698 of state variables. 700 5.12 "NewReno" Case: Transition from RH_EST to RH_EST_REPAIR 702 When the window has been reduced by half, but a full ack has not yet 703 been received, it is necessary to end the adjustment interval, 704 holding the window constant while the repair interval completes. 706 In the RH_EST_REPAIR state, rhcwnd is not increased or decreased. 707 While in this state, retransmissions will be sent once per round trip 708 time, as in NewReno until a full ACK ends the repair interval. 710 5.13 Corrections for Estimation upon return to RH_INCR 712 The final adjustment ensures that the resulting congestion window is 713 half of the data that successfully passed through the network during 714 the congested round trip. 716 5.14 Application of Bounding Paremeters 718 At the end of the RH_EXACT state, rhcwnd has been set to 719 (prior_rhcwnd-loss)/2, where loss is the number of bytes of data that 720 have been lost during the current adjustment interval. 722 In some cases, it is possible for rhcwnd to be set to an invalid 723 result (such as severe reordering mentioned above or improper 724 acknowledgement strategies of the receiver). 726 Rate-Halving includes a set of Bounding Parameters which are used to 727 guarantee that the Rate-Halving algorithm will make appropriate 728 adjustments to the window. The prior_rhcwnd/2 check guarantees that, 729 no matter what else happens, rhcwnd will always be reduced by a 730 factor of two. 732 The prior_rhcwnd/4 check ensures that the TCP connection is not 733 unduly harmed by extreme network conditions. The choice to set 734 ssthresh (rather than rhcwnd) to prior_rhcwnd/4 prevents sending a 735 burst into the network as a result of suddenly increasing rhcwnd. A 736 short period of Slow-start will follow, allowing rhcwnd to quickly 737 reach the bounding parameter of prior_rhcwnd/4. 739 Note that it is possible to exit the adjustment interval on an ACK, 740 and re-enter a new adjustment interval as a result of the same 741 acknowledgement. For example, an ACK may carry both acknowledgement 742 of retransmitted data, causing an exit per 4.10 case 2, and an 743 ECN-Echo bit (indicating that the retransmitted segment was marked), 744 causing a new adjustment interval per 4.4. 746 5.15 Retransmission Timeouts 748 Timeout behavior is nearly identical to Reno. Since rhcwnd is 749 reduced during non-RH_INCR states, it is necessary to use 750 prior_rhcwnd to prevent an overly agressive adjustment in the 751 ssthresh calculations. 753 See section 6.2.2 for information about SACK reneging (SACK receiver 754 discards data that it has already sent SACK blocks for [RFC2018]). 756 Do not clear the exponential backoff multiplier after the timeout if 757 the ECN-Echo bit is set on the ACK that results from the 758 retransmission. 760 6. Additional Implementation Notes 761 ================================== 763 6.1 ACK processing 765 6.1.1 Order of actions to be taken 767 Upon receiving an acknowledgement, the following actions should be 768 taken: 770 1. Save a copy of the following state variables for use during 771 processing of this ACK: snd.una, fack, retran_data. 772 2. Update the scoreboard, snd.una, fack, and retran_data. 773 3. Note if the ECN-Echo bit is set. 774 4. Check for exit conditions from RH states (4.8, 4.10, 4.11). 775 5. Check for transition to RH_EST_REPAIR state (4.12). 776 6. Process partial ACKs (4.9). 777 7. Check for entrance conditions for RH (4.4, 4.5). 778 8. Adjust the window using state variables as they existed at the 779 time the ACK was received (4.2, 4.6, 4.7). 780 9. If the window allows (4.3), send a new segment or a retransmission 781 as determined by the scoreboard (6.2.1). 783 6.1.2 Setting FACK 785 When in the RH_EXACT or RH_INCR state: 786 When each ACK is received, set the fack state variable to be 787 the maximum of: (1) the highest sequence number which has been 788 acknowledged plus one, and (2) the highest sequence number 789 which has appeared in a SACK block plus one. 791 When in the RH_EST or RH_EST_REPAIR state: 792 When each ACK is received, set the fack state variable to be 793 snd_una + (1 + dupacks) * SMSS. 795 6.2 Data Recovery 797 6.2.1 SACK Scoreboard 799 A SACK scoreboard keeps track of data blocks that have been received 800 after some data has been lost, as indicated by SACK options. The 801 data transmission rule of 4.3 indicates only WHEN to send data (based 802 on congestion control), not WHAT data to send (based on reliable 803 delivery). The scoreboard keeps track of what data has been 804 received, and indicates what data is eligible for retransmission. 806 Reno TCP uses a threshold of 3 dupacks to determine that a missing 807 segment has been lost, not just reordered. By immediately entering 808 the adjustment interval, not only can a Rate-Halving implementation 809 apply the dupack threshold to the first loss, but also to all 810 subsequent losses. Mathis and Mahdavi presented a method for 811 implementing this logic, termed thresholded retransmissions, in the 812 scoreboard [MM97]. A data block is eligible for retransmission only 813 after the hole has been observed at least three times, as indicated 814 by dupacks, or SACK options with sequence numbers that are higher 815 than the data block in question. 817 If no data is eligible for retransmission, new data may be sent when 818 allowed by the congestion window and the receiver's advertised window 819 (4.3). 821 When no SACK information is available, such as in states RH_EST or 822 RH_EST_REPAIR, a single-element scoreboard may be used to store loss 823 and retransmission state information. 825 6.2.2 Retransmission timeout/SACK reneging 827 RFC2018 recommends (with a SHOULD) that the scoreboard be cleared 828 when a timeout is experienced, since the timeout might indicate that 829 the SACK receiver may have reneged (discarded data that it has 830 already sent a SACK option for). If this RFC2018 recommendation is 831 followed, then perform the following actions in addition to the 832 actions specified in 4.15: 834 snd.nxt = snd.una 835 fack = snd.una 836 retran_data = 0 838 It is believed to be safe to relax the SHOULD in RFC2018, such that 839 when a retransmission timeout is experienced, the scoreboard can be 840 kept in order to retain the information about which segments have 841 already been received. When allowing the scoreboard to be maintained 842 through a timeout (conflicting with RFC2018), the following actions 843 must be performed in addition to the actions specified in 4.15: 845 snd.nxt = fack 846 retran_data = 0 848 In this case, snd.nxt is pulled back to fack, which is the highest 849 sequence number known to have reached the receiver. Normal 850 transmissions will begin at this sequence number, and retransmissions 851 will continue to be issued by the scoreboard. (Following the 852 timeout, fack will advance to the highest ACK or SACK block received. 853 snd.nxt should be pulled ahead any time that fack > snd.nxt.) 855 If the scoreboard is not discarded during a timeout, the sender must 856 be able to detect that the receiver has reneged. If snd.una advances 857 to a block stored in the scoreboard, the sender knows that a SACK 858 renege has occurred, and must clear the scoreboard, setting 860 fack = snd.una 861 snd.nxt = snd.una 862 retran_data = 0 864 There are three options for what should happen to the window when the 865 sender detects that the receiver has reneged: 1) force a retransmit 866 timeout, 2) cut the window in half, or 3) do nothing to the window. 867 More experimentation is needed to determine the correct approach. 869 7. Acknowledgements 870 ==================== 872 The authors would like to acknowledge Janie Hoe's work, which 873 introduced the concepts of sending new data during recovery and 874 spacing out data segments during the round trip following a loss. 875 The authors would also like to thank Van Jacobson and Sally Floyd for 876 inspirational conversations about many of the ideas in this document. 877 In addition, the authors are grateful to Mark Allman, Jitendra 878 Padhye, and Khrys Myrddin for their constructive feedback about 879 earlier drafts of this document. 881 8. References 882 ============== 884 [Hoe95] J. Hoe, Startup Dynamics of TCP's Congestion Control and 885 Avoidance Schemes. Master's Thesis, MIT, 1995. URL "http://ana- 886 www.lcs.mit.edu/anaweb/ps-papers/hoe-thesis.ps". 888 [Hoe96] J. Hoe, Improving the Start-up Behavior of a Congestion 889 Control Scheme for TCP. In ACM SIGCOMM, August 1996. URL 890 "http://www.acm.org/sigcomm/sigcomm96/program.html". 892 [MM96] M. Mathis, J. Mahdavi, "Forward Acknowledgement: Refining TCP 893 Congestion Control," SIGCOMM 96, August 1996. 895 [MM97] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding 896 Parameters," December 1997. 897 URL "http://www.psc.edu/networking/papers/FACKnotes/current/". 899 [NS] The UCB/LBNL/VINT Network Simulator (NS). URL "http://www- 900 mash.cs.berkeley.edu/ns/". 902 [RFC793] J. Postel, "Transmission Control Protocol," RFC793, 903 September 1981. 905 [RFC1191] J. Mogul, and S. Deering, "Path MTU Discovery," RFC1191, 906 November 1990. 908 [RFC2018] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP 909 Selective Acknowledgment Options," RFC2018, October 1996. 911 [RFC2481] K. Ramakrishnan, and S. Floyd, "A Proposal to add Explicit 912 Congestion Notification (ECN) to IP," RFC2481, January 1999. 914 [RFC2581] M. Allman, V. Paxson, and W. Stevens, "TCP Congestion 915 Control," RFC2581, April 1999. 917 [RFC2582] S. Floyd, and T. Henderson, "The NewReno Modification to 918 TCP's Fast Recovery Algorithm," RFC2582, April 1999. 920 9. Security Considerations 921 ========================== 923 RFC 2581 discusses general security considerations concerning TCP 924 congestion control. This document describes a specific algorithm 925 that conforms with the congestion control requirements of RFC 2581, 926 and so those considerations apply to this algorithm, too. There are 927 no known additional security concerns for this specific algorithm. 929 AUTHORS' ADDRESSES 931 Matthew Mathis and Jeffrey Semke 932 Pittsburgh Supercomputing Center 933 4400 Fifth Avenue 934 Pittsburgh, PA 15213 936 EMail: mathis@psc.edu, semke@psc.edu 938 Jamshid Mahdavi 939 Novell, Inc. 940 Mailstop: SJF-B492 941 2211 North First Street 942 San Jose, CA 95131 944 EMail: mahdavi@novell.com, kml@novell.com