Internet Engineering Task Force                           Yogesh Swami
INTERNET DRAFT                                                Khiem Le
File: draft-swami-tsvwg-tcp-dclor-00.txt         Nokia Research Center
                                                                Dallas
                                                         November 2002
Expires: Apr 2003

          DCLOR: De-correlated Loss Recovery using SACK option
                         for spurious timeouts

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Abstract

   A spurious timeout in TCP forces the sender to unnecessarily
   retransmit one complete congestion window of data into the network.
   In addition, TCP uses the rate of arrival of ACKs as the basic
   criterion for congestion control, on the assumption that the rate
   at which ACKs are received reflects the end-to-end congestion state
   of the network.
   After a spurious timeout, however, the ACKs do not reflect the
   end-to-end congestion state of the network, but only a part of it.
   In these cases, the slow start performed after a timeout can add to
   network congestion instead of relieving it.

   In this draft we propose changes to the TCP sender (no change is
   needed at the TCP receiver) that solve both problems--redundant
   retransmission and network congestion--after a spurious timeout.
   These changes preserve the sanctity of congestion control
   principles and are conservative in nature. The proposed algorithm,
   called DCLOR, separates congestion control from loss recovery and
   uses the TCP SACK option to achieve this. DCLOR can also be used
   purely as the congestion control response after a spurious timeout,
   in combination with other spurious timeout detection mechanisms
   such as the Eifel detection scheme.

Table of Contents

   Abstract
   1. Introduction
   2. Problem Description
      2.1 Redundant Data Retransmission
      2.2 Can Slow Start add to Congestion after Spurious Timeout
   3. Sources of spurious timeout
      3.1 Spurious timeout due to Excessive delay at a router
      3.2 Spurious timeout due to change of subnets
      3.3 Spurious Timeout due to Protocol Inefficiencies
   4. De-correlated Loss Recovery (DCLOR)
      4.1 Probe phase after a timeout
      4.2 Congestion Control After the probe phase
      4.3 Timeout-Recovery: recovering lost packets after timeout
   5. Data Delivery To Upper Layers
   6. Sack Reneging
   7. DCLOR Examples
      7.1 Timeout due to congestion
      7.2 Timeout due to pure packet stalling
      7.3 Comprehensive Scenario: Timeout due to stalling and loss
   8. Security Considerations
   9. References
   10. IPR Statement
   Author's Address
   Appendix - 1

1. Introduction

   The response of a TCP sender after a retransmission timeout is
   governed by the underlying assumption that a mid-stream timeout can
   occur only if there is heavy congestion--manifested as packet
   loss--in the network. Even though loss is often caused by
   congestion, the loss recovery algorithm itself should only answer
   the question of "what" data (i.e., which sequence numbers) to send,
   while the congestion control algorithm should answer the question
   of "how much" data to send. After a timeout, however, TCP addresses
   loss recovery and congestion control with a single mechanism: send
   one segment per retransmission timeout (RTO) until an
   acknowledgment is received (this answers the "how much" question).
   The single segment sent is always the first unacknowledged
   outstanding packet in the retransmission queue (this answers the
   "what" question).
   Since the present TCP's loss recovery and congestion control
   algorithms are coupled together, we call this "Correlated Loss
   Recovery (CLOR)."

   Although the assumption that a timeout can occur only if there is
   severe congestion is valid for traditional wire-line networks, it
   does not hold for some other types of networks--networks where
   packets can be stalled "in the network" for a significant duration
   without being discarded. Typical examples of such networks are
   cellular networks. In cellular networks, the link layer can
   experience a relatively long disruption due to errors, and the link
   layer protocol keeps the affected packets buffered for as long as
   the disruption lasts.

   In this document we present an alternative approach to loss
   recovery and congestion control that "de-correlates" loss recovery
   from congestion control and allows an independent choice of which
   TCP sequence number to send, without compromising the congestion
   control principles of [1][2][3][4]. In doing so, the algorithm
   mitigates the ill effects of spurious retransmission timeouts.
   Spurious timeouts are not only detrimental to the throughput of the
   defaulting connection, but they also add to the overall network
   congestion--paradoxically enough, due to the subsequent slow start.

   Although several drafts [7][8] have been presented on this topic in
   the IETF, we believe that none of them fully considers all the
   problems associated with spurious timeouts. In the following
   section we describe these problems in more detail; Section-4 then
   describes the DCLOR mechanism.

2. Problem Description

   Let us assume that a TCP sender has sent N packets, p(1) ... p(N),
   into the network and is waiting for the ACK of p(1) (Figure-1). Due
   to bad network conditions or some other problem, these packets are
   excessively delayed at some intermediary node NDN. Unlike standard
   IP routers, the NDN keeps these packets buffered for a relatively
   long period of time until they can be forwarded to their intended
   recipient. This excessive delay forces the TCP sender to time out
   and enter slow start.

   Figure-1

   TCP-Sender            NDN         TCP-Receiver

   ..... |----p(1)------>|               |
     ^   |----p(2)------>|               |
     :   |       .       |               |
   RTT=D |       .       |               |
     :   |       .       |               |
   ..... |----p(N)------>|               |
         |   ^           |               |
         |   :           |               |
         |  RTO          |               |
         |   :           |               |
         |   V           |----p(1)------>|
     ... |----p1(1)----->|<---a(1)-------|...
      L  |               |               |
     ... |<----a(1)------|----p(2)------>| Inter-ACK arrival time (IAT)
         |->p1(2),p1(3)->|<---a(2)-------|...
         |       .       |       .       |
         |       .       |       .       |
         |       .       |       .       |
         |               |<---a(N)-------|
         |               |---p1(1)------>|
         |               |<---a(N)-------|
         |               |               |

   As far as the TCP sender is concerned, a timeout is always
   interpreted as heavy congestion. The TCP sender therefore assumes
   that all packets between p(1) and p(N) were lost in the network. To
   recover from this misconstrued loss, the TCP sender retransmits
   p1(1) (px(k) denotes the xth retransmission of the packet with
   sequence number k) and waits for the ACK a(1).

   After some period of time the network conditions at NDN become
   amicable again, and the packets queued at NDN are finally
   dispatched to their intended recipient; the TCP receiver responds
   by generating the ACK a(1).
   When the TCP sender receives a(1), it is fooled into believing that
   this ACK was generated in response to the retransmitted packet
   p1(1), while in reality a(1) was generated in response to the
   originally transmitted packet p(1). When the sender receives a(1),
   it increases its congestion window to two and retransmits p1(2) and
   p1(3). As the sender receives more acknowledgments, it continues
   with retransmissions and finally starts sending new data.

   The following two subsections examine the problems associated with
   the above-mentioned TCP behavior.

2.1 Redundant Data Retransmission

   The obvious and relatively easy-to-solve inefficiency of the above
   algorithm is that an entire congestion window worth of data is
   unnecessarily retransmitted. Although such retransmissions are
   harmless to high-bandwidth, well-provisioned backbone links (so
   long as they are infrequent), they can severely degrade the
   performance of slow links.

   In cases where bandwidth is a commodity at a premium (e.g.,
   cellular networks), unnecessary retransmission can also be costly.

2.2 Can Slow Start add to Congestion after Spurious Timeout

   Paradoxically, slow start--the epitome of congestion control--can
   itself add to network congestion after a spurious timeout. Going
   back to the previous example, when the TCP sender receives a(1), it
   falsely assumes that a(1) was triggered by p1(1). But in reality,
   a(1) was triggered by p(1), not p1(1). This ambiguity exists
   because of the cumulative acknowledgment property of TCP. In this
   case, if the TCP sender were to take RTT samples (which it does not
   take because of the ACK ambiguity), it would register a value of L
   (Figure-1) instead of the real RTT, D. In addition, when the TCP
   sender sends p1(2) and p1(3), it will receive the ACKs a(2) and
   a(3) right after the inter-ACK arrival time IAT. Thereafter, the
   TCP sender will keep increasing its data rate every IAT as new ACKs
   arrive. Mathematically, one can show that after a spurious timeout
   the rate at which data is injected into the network is given by
   (1). But if we want to preserve the network probing capability [2]
   of a TCP sender, the rate should instead be given by (2):

       r(t) = 2 * ceil((t - t0)/IAT)    ... (1)

       r(t) = 2^ceil((t - t0)/D)        ... (2)

   where t > t0, t0 is the time when the first ACK after the spurious
   timeout is received, and IAT and D are related by (3):

       IAT <= D/N                       ... (3)

   (N is the flight size at the time of the spurious timeout.)
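   To make the difference between (1) and (2) concrete, the following
   minimal sketch (not part of the specification) evaluates both
   formulas for hypothetical example values of N, D, and t0; the
   function names r_stale_acks and r_probing are ours:

      # Illustrative comparison of formulas (1) and (2) above.
      # N, D, and t0 are hypothetical example values.
      import math

      N = 20          # flight size at the time of the spurious timeout
      D = 1.0         # round trip time, seconds
      IAT = D / N     # inter-ACK arrival time, upper bound from (3)
      t0 = 0.0        # arrival time of the first ACK after the timeout

      def r_stale_acks(t):
          """Formula (1): growth clocked by compressed stale ACKs."""
          return 2 * math.ceil((t - t0) / IAT)

      def r_probing(t):
          """Formula (2): growth clocked by the real RTT, D."""
          return 2 ** math.ceil((t - t0) / D)

      for t in (0.25, 0.5, 0.75, 1.0):
          print(f"t={t:5}  (1): {r_stale_acks(t):3}  (2): {r_probing(t):3}")

   Within the first quarter of an RTT the stale-ACK clock of (1) has
   already injected five times as much data as the RTT-paced probing
   of (2), which is exactly the burst analyzed next.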
   Figure-2
                  C(1)  C(2)  ...  C(M)
                   |     |    ...   |
                   V     V    ...   V
                    \     \        /
                     \     \      /
               +------X--X--X----+      +------------------+
   Defaulting  |                 |      |                  |
   C(0) ------>|   Bottleneck    |----->| Buffered packets |--->
   connection  |     router      |      |                  |
               +-----X--X---X----+      +------------------+
                     |  |    |
                     V  V    V
                  C(1) C(2) ... C(M)

   We now compute the worst-case packet loss due to this data burst.
   After the timeout, the TCP sender sets its SS_THRESH to N/2
   (packets-in-flight/2). Therefore, for the first N/2 ACKs received
   (i.e., ACKs a(1) to a(N/2)), the TCP sender grows its congestion
   window by one per ACK and reaches the SS_THRESH value of N/2. For
   each of these ACKs the TCP sender also sends 2 packets, so by the
   end of slow start it has sent 2*(N/2) = N packets into the network.
   For the remaining N/2 ACKs (i.e., ACKs a(N/2+1) to a(N)) the TCP
   sender remains in the congestion avoidance phase and sends one
   packet per ACK--N/2 more data segments. The net amount of data sent
   is therefore N + N/2 = 3N/2. Please note that the entire 3N/2
   packets are injected into the network within a time period less
   than or equal to one RTT, while only N data segments left the
   network during this time. There is therefore a high probability
   that N/2 of the 3N/2 packets will be lost. These N/2 lost packets,
   however, need not come from the same connection, and such a data
   burst will unnecessarily penalize all the competing TCP connections
   that share the same bottleneck router.

   Going further, let us assume there are M competing TCP connections
   that share the same bottleneck router as the defaulting connection
   C(0) (Figure-2). During the period while C(0) is stalled, the TCP
   sender of C(0) does not use its network resources--the buffer
   space--on the bottleneck router(s). The competing connections,
   C(1) ... C(M), however, see this lack of activity as resource
   availability and grow their windows by at least one segment per RTT
   during this period (by virtue of the linear window increase of the
   congestion avoidance phase). If, for simplicity, we assume that
   each of these TCP connections has the same round trip time RTT, and
   the idle time for C(0) is k*RTT (where k > RTO/RTT), then each of
   the competing connections will increase its congestion window by k
   segments. Therefore the number of packets lost in the network can
   be as high as:

       N/2 + M*k                        ... (4)

   where the first term is the packet loss due to slow start and the
   second term is the loss due to the window growth of competing
   connections.

   Please note that even if a TCP sender restores its congestion
   window [7] (or halves it [8]) to avoid slow start and redundant
   retransmission, it still cannot avoid the loss of the M*k segments
   (M*k - N/2 segments in case the window was halved) that were added
   to the network because of the sender's inactivity. Since a TCP
   sender knows neither the number of connections it is competing with
   nor the duration for which packets may be stalled in the network,
   the number of segments lost due to a spurious timeout can be large.

   In the following sections we describe an algorithm that solves both
   problems--redundant retransmission and packet loss--after a
   spurious timeout.

3. Sources of spurious timeout

   Since the response algorithm after a timeout depends on the event
   that triggered it, it is important to know the factors that can
   cause such timeouts (in addition to the timeouts caused by
   congestion). The following list enumerates a few broad categories
   of events that can cause spurious timeouts in TCP.

3.1 Spurious timeout due to Excessive delay at a router

   Many link layer technologies have their own loss recovery schemes
   that use link layer feedback mechanisms to recover from bit errors.
   These loss recovery schemes are often unaware of the transport
   layer loss recovery scheme. Therefore, if there is a burst of
   errors at the link layer, the TCP sender may time out while the
   link layer loss recovery scheme is still at work.
3.2 Spurious timeout due to change of subnets

   A TCP receiver or sender may change subnets due to sender or
   receiver mobility (Figure-3). While the change of subnets is taking
   place, a router may keep the data packets buffered until a new
   route is established (i.e., until Path-3 and Path-2 are established
   in Figure-3). Since establishing a new route can take a long time,
   a TCP sender may time out.

   Figure-3
                          <.... Path of new data and ACKs ....>
                              Path-2
                       +-----------> After
   +-------------+     |       ^     Subnet Change
   |             +-----+       .
   | TCP Sender  |             . Path-3
   |             +-----+       .
   +-------------+     |       .
                       +-----------> Before
                          Path-1     Subnet Change

   Once the new path is established, the router may forward all
   buffered data packets to the TCP receiver using Path-3. However,
   the ACKs triggered by these packets will reach the TCP sender over
   Path-2, and the TCP sender will send new data in response to these
   ACKs over Path-2 as well.

   When there is a change of subnets, the entire end-to-end path
   between the sender and receiver may change completely (for example,
   Path-1 and Path-2 may not have any routers in common). A complete
   change in the end-to-end path can take place, for example, if a TCP
   receiver uses Mobile-IPv6 with route optimization to move from one
   subnet to another. Since the TCP connection does not have any
   information about the bandwidth-delay product (BDP) of the new path
   (Path-2), the TCP connection should start with the same window size
   as a new connection, i.e., with a window of two segments.

   Please note that of all possible causes of spurious timeout, a
   change of subnets has the maximum impact on network congestion.
   Since the BDP of the new path can be orders of magnitude higher or
   lower than the BDP of the old path, an inappropriate congestion
   control scheme can severely add to network congestion on the new
   path (Path-2). For example, let us assume there are 100 TCP
   connections sharing the same bottleneck router, the average number
   of subnet changes per second per TCP receiver is 1/50, and the
   average congestion window of each TCP connection at the time of the
   spurious timeout is 5 segments. Then, in the worst case, the
   network may drop as many as (1/50)*(5/2)*100 = 5 segments every
   second if the congestion window is restored as done in [7].

3.3 Spurious Timeout due to Protocol Inefficiencies

   A TCP sender may also time out because of inherent protocol
   inefficiencies in TCP. For example, a spurious timeout may occur if
   multiple packets are lost and the fast-retransmit/fast-recovery
   algorithms cannot recover the lost packets within one RTO. A
   spurious timeout may also occur if so few packets are in flight
   that, in case of a packet loss, the receiver cannot generate 3
   duplicate ACKs.

4. De-correlated Loss Recovery (DCLOR)

   De-correlated loss recovery remedies the problems described in
   Section-2 by separating congestion control from the loss recovery
   mechanism. In doing so, we make the decision of "which packets to
   send" independent of "how much" data to send. In addition, the rate
   at which data is injected into the network preserves the congestion
   control principles described in [1][2][3].
   The basic idea behind DCLOR is to send a new data segment from
   outside the sender's retransmission queue--the queue of
   unacknowledged data segments--and wait for the ACK or SACK of this
   new data before initiating the response algorithm. Unlike slow
   start, where the response algorithm starts immediately after
   receiving the first ACK, DCLOR waits for the ACK/SACK of the new
   data sent after the timeout before initiating loss recovery. The
   SACK block for the new data contains sufficient information to
   determine all the packets that were lost in the network. Once the
   sequence numbers of the lost packets are determined, the TCP sender
   grows its congestion window as in slow start, but tries to recover
   only those segments that were actually lost. This not only allows
   the TCP sender to avoid the go-back-N problem, but also prevents
   the superfluous congestion window growth seen with slow start
   (Section-2.2).

   Before describing the algorithm itself, we first enumerate the
   design philosophy underlying DCLOR, as it gives more insight into
   the workings of the algorithm:

   1. A spurious timeout should not degrade the performance of other
      TCP connections sharing the same network path. As pointed out in
      Section-2, the slow start subsequent to a spurious timeout not
      only affects the performance of the defaulting connection, but
      can also confer data loss on other competing connections.

   2. Congestion control decisions should be made independent of loss
      recovery decisions. In other words, the congestion control
      algorithm should only determine the amount of data to be
      injected into the network, while the loss recovery algorithm
      should determine the sequence numbers of the data to be
      recovered.

   3. Since the state of the network can change dramatically after a
      long interruption--long with respect to RTT--(Section-3), the
      response algorithm should be conservative in the amount of data
      injected into the network after a timeout. This avoids packet
      loss (and congestion) due to an inappropriately inflated
      congestion window after a spurious timeout (Section-2.2).

   4. The TCP sender's estimate of network capacity--SS_THRESH--should
      be updated if and only if packets are lost in the network. A
      spurious timeout without packet loss should not affect
      SS_THRESH, since SS_THRESH is a measure of the network's
      buffering capacity and there is no need to update it unless
      packet loss is detected.

   5. If there is packet loss in addition to packet stalling, all the
      lost packets within that window should be recovered without any
      need for further timeouts. In other words, the algorithm should
      not require the TCP sender to time out again due to protocol
      limitations.

   We now describe the algorithm in more detail, keeping the
   above-mentioned points in mind. Please see Figure-4 for the
   relevant parameters.

4.1 Probe phase after a timeout

   The following steps describe the response of a TCP sender to a
   timeout:

   1. If the timeout occurs before the 3-way handshake is complete,
      the TCP sender's behavior is unchanged.

   2. After a timeout, the TCP sender MUST set its congestion window
      to ZERO (CWND = 0) and store the number of outstanding packets
      (N) in a state variable for future use (step-9). The value of
      SS_THRESH MUST be left UNCHANGED at this point.
   3. The TCP sender SHOULD also reset all the SACK tag bits in its
      retransmission queue (essentially reneging all the SACK
      information, in accordance with the recommendation in [5];
      please also see Section-6 on SACK reneging for more
      information).

   4. Instead of sending the first unacknowledged packet P1 after the
      timeout, the TCP sender disregards its congestion window and
      sends ONE NEW MSS-sized data packet, Pn+1.

   Figure-4
                             .  DCLOR:
                             .  a) At timeout send Pn+1 instead of P1;
                             .     store SS_PTR = seq-num(Pn+1); set CWND=0
   <.....CWND.....>          .  b) Don't update CWND for ACK/SACK <= SS_PTR
   +--+--+-...-+--+----+     .  c) Once ACK/SACK > SS_PTR is received,
   |P1|P2|     |Pn|Pn+1|     .     update SS_THRESH if a SACK was received
   +--+--+-...-+--+----+     .  d) Update lost-packet information
                    ^        .  e) Set CWND=2 and start loss recovery, or
                    |        .     start sending new data in case a pure
                  SS_PTR     .     ACK greater than SS_PTR is received

       TCP Sender's Retransmission Queue

      The TCP sender also stores the sequence number of this new data
      packet in a new state variable called SS_PTR (slow start
      pointer).

      If the sender does not have any new data outside its
      retransmission queue, or if the receiver's flow control window
      cannot sustain any new data, the TCP sender should send the
      highest sequence-numbered MSS-sized data chunk from its
      retransmission queue (i.e., it should send the last packet in
      its retransmission queue).

   The idea behind sending one packet on a timeout is to probe the
   network and determine whether the network is congested. From the
   network's point of view it does not matter what the sequence number
   of the data sent was; what matters is the "amount" of data sent
   into the network. Sending a new packet from outside the
   retransmission queue therefore has the same effect as sending an
   old packet, and this behavior is in accordance with the TCP
   congestion control and avoidance mechanisms described in [1].

   5. A TCP sender MUST repeat step-2 until it enters the
      Timeout-Recovery state as described in step-6.

4.2 Congestion Control After the probe phase

   6. After the timeout, for each ACK received with an ACK sequence
      number less than SS_PTR, the ACKed packet is removed from the
      retransmission queue. If the ACK contains a new SACK block, the
      SACK tag is set on the corresponding data packet. However, no
      data is sent in response to these ACKs.

   7. The TCP sender MUST NOT update its congestion window from the
      value of ZERO until an ACK sequence number greater than SS_PTR,
      or a SACK block containing SS_PTR, is received. In other words,
      until the ACK for Pn+1 is received, or a SACK block containing
      information about Pn+1 is received, NO new data segments (i.e.,
      segments Pn+2, ...) are added to the network. This allows the
      TCP sender to discard all the "stale ACKs"--the ACKs generated
      for stalled packets--and prevents superfluous growth of the
      congestion window.

      The TCP sender MUST NOT take RTT samples from stale ACKs. The
      SACK information present in stale ACKs SHOULD, however, be used
      to set SACK tags on the retransmission queue. A TCP sender MAY
      reset its retransmission timer with every stale ACK received.

      If the sender receives a SACK block containing SS_PTR, i.e., if
      there was packet loss in the stalled window, it SHOULD go to
      step-8.

      If the sender receives an ACK that acknowledges SS_PTR, i.e., if
      no packets were lost from the stalled window, it SHOULD go to
      step-10.
   The rationale for not sending any data segments in response to
   stale ACKs is to minimize congestion after a timeout. Since the
   duration of packet stalling can be large, the congestion state of
   the network may have changed dramatically. Not using the stale ACKs
   to send data therefore allows the TCP sender to recompute its
   congestion state from scratch, and also minimizes packet loss.

   NOTE-1: In our experiments, we have tried sending new data for
   stale ACKs in the spirit of reverting the congestion window as
   described in [7]. We have also tried setting the congestion window
   to half the packets-in-flight (pipe) as described in [8]. But we
   have found that not sending any new data, as described above, is
   the best scheme whenever there are multiple competing connections
   in the system. The results of these experiments, along with the
   experimental setup, are listed in Appendix-1.

   If a TCP sender has not received any ACK within an RTT, sending
   more than a few data segments (>4) will add to network congestion
   in one form or another (in varying degrees), and the overall
   performance of the system as a whole degrades on average. In
   addition, the DCLOR scheme works well in the case of a subnet
   change, where there can be orders of magnitude difference in BDP.
   DCLOR also has minimal effect on other competing connections.

   Based on our experiments, we do not recommend sending new data for
   stale ACKs, but it is possible to deploy a variety of congestion
   control schemes within the DCLOR framework. For example, on
   reception of an old ACK a TCP sender MAY revert (or halve) its
   congestion window to the old value of packets-in-flight (pipe) and
   MAY send new data as that congestion window allows. But the TCP
   sender SHOULD initiate its loss recovery ONLY after SS_PTR has been
   ACKed or SACKed. In other words, if the TCP sender receives
   duplicate ACKs with an ACK sequence number less than SS_PTR, it
   SHOULD NOT use these duplicate ACKs to initiate Fast-Retransmit and
   Fast-Recovery. Loss recovery SHOULD be initiated only after the
   ACK/SACK for SS_PTR is received. After that, the sender should
   follow the same steps as described in step-8, step-9, and step-10
   (with a different congestion window, of course).

   NOTE-2: For the congestion response of DCLOR, we have also
   considered an adaptive update of the congestion window after each
   timeout. In this scheme a TCP sender reduces its congestion window
   by a factor of 2 for every RTO period of inactivity. The SS_THRESH
   is not updated until a loss is detected. At the time of this
   writing we are still experimenting with the usefulness of such a
   scheme. It is, however, not well suited for spurious timeouts
   caused by a change of subnets as described in Section-3.2.

4.3 Timeout-Recovery: recovering lost packets after timeout

   8. The TCP sender traverses the retransmission queue and marks all
      the packets without a SACK tag as lost. The TCP sender also
      updates its packets-in-flight (pipe) based on the SACK tags and
      the lost-segment information (the packets-in-flight (pipe)
      should be ZERO after the update).

   Please note that unlike Fast-Retransmit and Fast-Recovery, DCLOR
   uses only the one SACK block containing SS_PTR to mark packets as
   lost. This is because we do not expect packet reordering to persist
   over a period as long as the RTO.
   9. The TCP sender updates its SS_THRESH as follows:

        SS_THRESH = packets-in-flight (pipe) at the time of timeout / 2

      Please note that the TCP sender stored the value of
      packets-in-flight at the time of the timeout in step-2.

   10. The TCP sender sets its congestion window to 2. If packets were
       lost in the network (i.e., if a SACK for SS_PTR was received),
       the TCP sender should start by sending the lowest
       sequence-numbered lost packets; otherwise it should continue by
       sending new data.

   Please note that on a pure ACK acknowledging SS_PTR, the TCP sender
   does not update the SS_THRESH value (it enters step-10 directly
   from step-7). This prevents a TCP sender from setting its SS_THRESH
   to a very small value if the spurious timeout occurs at the start
   of the connection.

   The rationale for not updating SS_THRESH when no loss is detected
   is that SS_THRESH represents the sender's estimate of the network
   capacity--the BDP. Unless a loss is detected, there is no reason to
   update this estimate.
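   As an illustration of steps 1 through 10, the following sketch
   shows one possible structure for a DCLOR sender's post-timeout
   state machine. It is an illustrative model only, not the Linux
   implementation mentioned in Appendix-1; the names DclorSender,
   on_timeout, and on_ack are ours, and segments are identified by
   packet number rather than byte sequence number for simplicity:

      # Illustrative sketch of the DCLOR post-timeout state machine
      # (steps 1-10). Packet numbers stand in for TCP sequence numbers.

      class DclorSender:
          def __init__(self, flight):
              self.flight = list(flight)   # unacknowledged packet numbers
              self.sacked = set()          # packets carrying a SACK tag
              self.next_new = max(self.flight) + 1 if self.flight else 1

          def on_timeout(self):
              # Steps 2-4: remember the old pipe, set CWND to zero,
              # renege SACK tags, and probe with ONE new segment whose
              # packet number becomes SS_PTR.
              self.old_pipe = len(self.flight)
              self.cwnd = 0
              self.sacked.clear()
              self.ss_ptr = self.next_new
              self.next_new += 1
              return self.ss_ptr              # the probe segment to send

          def on_ack(self, ack, sack_blocks=()):
              # Step 6: every ACK trims the queue and sets SACK tags.
              self.flight = [p for p in self.flight if p > ack]
              for lo, hi in sack_blocks:
                  self.sacked.update(range(lo, hi + 1))
              if ack < self.ss_ptr and self.ss_ptr not in self.sacked:
                  return []                   # step 7: stale ACK, send nothing
              if self.ss_ptr in self.sacked:
                  # SACK for SS_PTR: loss occurred in the stalled window.
                  lost = [p for p in self.flight if p not in self.sacked]
                  self.ss_thresh = self.old_pipe // 2   # steps 8-9
                  self.cwnd = 2                         # step 10
                  return lost[:self.cwnd]     # retransmit lost data first
              # Pure ACK covering SS_PTR: no loss, SS_THRESH unchanged.
              self.cwnd = 2                             # step 10
              out = [self.next_new, self.next_new + 1]  # continue with new data
              self.next_new += 2
              return out

   Note how the congestion control decision (CWND, SS_THRESH) and the
   loss recovery decision (which packet numbers to send) are made at
   different points, which is the de-correlation DCLOR is named after.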
5. Data Delivery To Upper Layers

   If a TCP sender loses its entire congestion window worth of data to
   congestion, sending new data after the timeout prevents the TCP
   receiver from forwarding that new data to the upper layers
   immediately. Only once the SACK for this new data is received does
   the TCP sender send the first lost segment. This essentially means
   that data delivery to the upper layers can be delayed by ONE RTT
   when all the packets are lost in the network.

   This, however, does not affect the throughput of the connection in
   any way. If a timeout has occurred, then data delivery to the upper
   layers has already been excessively delayed; delaying it by another
   round trip is not a serious problem. Please note that reliability
   and timeliness are conflicting goals, and one cannot gain on one
   without sacrificing something on the other.

6. Sack Reneging

   TCP SACK information is meant to be advisory, and a TCP receiver is
   allowed--though strongly discouraged--to discard data blocks it has
   already SACKed [5]. Please note, however, that even if the TCP
   receiver discards a data block it received, it MUST still send the
   SACK block for at least the most recent data received. Therefore,
   in spite of SACK reneging, DCLOR works without deadlock.

   A SACK implementation is also allowed not to send SACK blocks even
   though the TCP sender and receiver agreed on the SACK-Permitted
   option at the start of the connection. In that case, however, if
   the receiver sends one SACK block, it must send SACK blocks for the
   rest of the connection. Because of this leniency in implementation,
   it is possible for a TCP receiver to agree to the SACK-Permitted
   option and yet never send any SACK blocks. To make DCLOR robust
   under these circumstances, DCLOR SHOULD NOT be invoked unless the
   sender has seen at least one SACK block before the timeout. We,
   however, believe that once the SACK-Permitted option is accepted,
   the TCP receiver MUST send SACK blocks--even though the SACKed data
   might eventually be discarded; otherwise the SACK-Permitted option
   is completely redundant and serves little purpose. To the best of
   our knowledge, almost all SACK implementations send SACK blocks if
   they have accepted the SACK-Permitted option.

7. DCLOR Examples

   We now demonstrate the working of DCLOR with a few concrete
   examples. We first take the case in which the timeout is caused by
   congestion. We then consider the case in which all the packets are
   stalled in the network but no packet is lost. Finally, we take the
   example in which packets are both lost and stalled at the same
   time.

   In all these examples, the TCP sender has 20 MSS-sized packets in
   flight (pipe) at the time the timeout occurs, and each packet is
   denoted by P(i). (Note that we use packet numbers instead of TCP
   sequence numbers to keep the presentation simple; the examples can
   be modified to use TCP sequence numbers as well.) An ACK is denoted
   by A(i)[x,y][z,w], where i is the highest packet number
   acknowledged and the ordered pairs [x,y] and [z,w] represent the
   SACK blocks present in that ACK (note that i < x,y,z,w, and x=y is
   possible if just one packet is being SACKed).

7.1 Timeout due to congestion

   In this case, we assume that all packets P(1) to P(20) are lost in
   the network. The following time-line shows the sequence of events:

   TCP Sender                 Network           TCP Receiver
   ----------                 -------           -----------
   P(1)-----------X               (Highest in-order seq = 0)
    .
    .                       LOST IN NETWORK
    .                       DUE TO CONGESTION
   P(20)----------X
     ^
     :
    RTO
     :
     V
   preserve old_pkt_in_flight (pipe) = 20
   send P(21)--------------------------> rec P(21)
   set SS_PTR = 21
   rcv A(0)[21,21] <-------------------- send A(0)[21,21]

   Since the SACK contains SS_PTR ==>
     Tag P(1) to P(20) as LOST
     Tag P(21) as SACKED
     Update pkt_in_flight (pipe) = 21 - 20 - 1 = 0
     Since a SACK was received,
       set SS_THRESH to old_pkt_in_flight/2 = 10
     set CWND = 2
     Restart RTO timer

   Send P(1)--------------------------->
   Send P(2)--------------------------->

7.2 Timeout due to pure packet stalling

   In this case, we assume that all the packets P(1) to P(20) are
   stalled in the network. The following time-line shows the sequence
   of events:

   TCP Sender                 Network           TCP Receiver
   ----------                 -------           -----------
   P(1)-----------+               (Highest in-order seq = 0)
    .             |
    .             |
   P(20)----------+
     ^            |
    RTO           |   Packets delayed
     V            |
   old_pkt_in_flight (pipe) = 20
   send P(21)----------+
   set SS_PTR = 21     |
   CWND = 0            +---------------> rec P(1)
   rcv A(1) <--------------------------- send A(1)
   SS_PTR > A(1) ==>   |
     no change in CWND = 0
     Remove P(1) from  .
     retransmission queue
     No RTT sample taken   {1 < i <= 20}
                       +---------------> rec P(i)
   rcv A(i) <--------------------------- send A(i)
   SS_PTR > A(i) ==>   |
     no change in CWND = 0
     Remove P(i) from  |
     retransmission queue
     No RTT sample taken
                       +---------------> rec P(21)
   rcv A(21) <-------------------------- send A(21)
   SS_PTR = A(21) ==>
     pure ACK received
     Remove P(21) from retransmission queue
     Update pkt_in_flight (pipe) = 0
     Since a pure ACK was received,
       SS_THRESH remains unchanged
     set CWND = 2
     Restart RTO timer
   Send P(22)--------------------------->
   Send P(23)--------------------------->
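   Assuming the illustrative DclorSender sketch from Section-4 is
   saved in a hypothetical module dclor.py, the two traces above can
   be reproduced as follows:

      from dclor import DclorSender

      # Example 7.1: every packet in flight is lost; the probe's SACK
      # reveals the loss and retransmission starts at P(1).
      s = DclorSender(range(1, 21))
      print("probe:", s.on_timeout())          # -> 21 (SS_PTR)
      print("send:", s.on_ack(0, [(21, 21)]))  # -> [1, 2]
      print("ss_thresh:", s.ss_thresh)         # -> 10

      # Example 7.2: packets were only stalled; stale ACKs send
      # nothing, and the pure ACK of SS_PTR resumes with new data.
      s = DclorSender(range(1, 21))
      s.on_timeout()
      for i in range(1, 21):
          assert s.on_ack(i) == []             # stale ACKs, CWND stays 0
      print("send:", s.on_ack(21))             # -> [22, 23], SS_THRESH unchanged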
7.3 Comprehensive Scenario: Timeout due to stalling and loss

   Building upon the previous two examples, we now assume that, in
   addition to the data stalling, P(10) was lost in the network. The
   following time-line shows the sequence of events:

   TCP Sender                 Network           TCP Receiver
   ----------                 -------           -----------
   P(1)------------+              (Highest in-order seq = 0)
    .              |
   P(10)----X Lost |
    .              |
   P(20)-----------+
     ^             |
    RTO            |   Packets delayed
     V             |
   old_pkt_in_flight (pipe) = 20
   send P(21)-----------+
   set SS_PTR = 21      |
   CWND = 0             +---------------> rec P(1)
   rcv A(1) <---------------------------- send A(1)
   SS_PTR > A(1) ==>    |
     no change in CWND = 0
     Remove P(1) from   .
     retransmission queue
     No RTT sample taken    {1 < i < 10}
                        +---------------> rec P(i)
   rcv A(i) <---------------------------- send A(i)
   SS_PTR > A(i) ==>    |
     no change in CWND = 0
     Remove P(i) from   |
     retransmission queue
     No RTT sample taken    {10 < j <= 20}
                        +---------------> rec P(j)
   rcv A(9)[11,j] <---------------------- send A(9)[11,j]
   SS_PTR > A(9) and    |
   SS_PTR outside [11,j] ==>
     Tag P(j) as SACKED |
     NO change in CWND = 0
                        +---------------> rec P(21)
   rcv A(9)[11,21] <--------------------- send A(9)[11,21]
   Since the SACK contains SS_PTR = 21 ==>
     Tag P(10) as LOST
     Update pkt_in_flight (pipe) = 11 - 10 - 1 = 0
     (Note: P(1) ... P(9) were already removed
      from the retransmission queue)
     Since a SACK was received,
       set SS_THRESH = 20/2 = 10
     set CWND = 2
     Restart RTO timer
   Recover the lost packet first:
   Send P(10)--------------------------->
   Send P(22)--------------------------->

8. Security Considerations

   No new security threats are introduced by the use of DCLOR.

9. References

   [1] M. Allman, V. Paxson, W. Stevens, "TCP Congestion Control",
       RFC 2581.

   [2] S. Floyd, "Congestion Control Principles", RFC 2914.

   [3] M. Handley, J. Padhye, S. Floyd, "TCP Congestion Window
       Validation", RFC 2861.

   [4] E. Blanton, M. Allman, K. Fall, "A Conservative SACK-based
       Loss Recovery Algorithm for TCP",
       draft-allman-tcp-sack-10.txt, work in progress.

   [5] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective
       Acknowledgment Options", RFC 2018.

   [6] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, "An Extension to
       the Selective Acknowledgement (SACK) Option for TCP", RFC 2883.

   [7] R. Ludwig, M. Meyer, "The Eifel Detection Algorithm for TCP",
       draft-ietf-tsvwg-tcp-eifel-alg-05.txt, work in progress.

   [8] P. Sarolahti, M. Kojo, "F-RTO: A TCP RTO Recovery Algorithm
       for Avoiding Unnecessary Retransmissions",
       draft-sarolahti-tsvwg-tcp-frto-01.txt, work in progress.

   [9] V. Paxson, M. Allman, "Computing TCP's Retransmission Timer",
       RFC 2988.

10. IPR Statement

   The IETF has been notified of intellectual property rights claimed
   in regard to some or all of the specification contained in this
   document. For more information, consult the on-line list of claimed
   rights at http://www.ietf.org/ipr.

Author's Address

   Yogesh Prem Swami
   Nokia Research Center
   6000 Connection Drive
   Irving, TX 75063
   USA

   Phone: +1 972-374-0669
   Email: yogesh.swami@nokia.com

   Khiem Le
   Nokia Research Center
   6000 Connection Drive
   Irving, TX 75063
   USA

   Phone: +1 972-894-4882
   Email: khiem.le@nokia.com

Appendix - 1

   In this section we describe our test setup and provide test results
   comparing DCLOR with standard TCP, Eifel, and F-RTO.

App-1: Testing Methodology

   Please see Figure-5 for the test setup.
   The TCP sender and TCP receiver are connected through an
   intermediary "spurious timeout and network impairment simulator"
   (STIS) that simulates packet delay and certain network impairments.
   The STIS reads all TCP packets received on any of its interfaces
   and can process packets coming from multiple TCP connections. Each
   TCP packet is subjected to a series of network impairments of the
   kind expected in a typical IP network. The specifics of these
   impairments, along with the models used to simulate them, are
   enumerated below.

   Figure-5

   +---------------+    +--------------------+    +---------------+
   |  TCP Sender   |    | Spurious timeout & |    |               |
   |               +----+ network impairment +----+ TCP Receiver  |
   | Linux-2.5.40  |    |     simulator      |    |               |
   +---------------+    +--------------------+    +---------------+

   1. Bottleneck Router

      The STIS is capable of simulating a bottleneck router on the
      end-to-end path of a TCP connection. For TCP, the bottleneck
      router is the one with the minimum ratio of buffer space to
      difference in link speeds on the end-to-end path (i.e., if b(i)
      is the buffer space of router i, and a(i) and d(i) are the
      incoming and outgoing link speeds of that router for a
      particular TCP connection, then the bottleneck router for that
      connection is the one with min{ b(i)/[d(i)-a(i)] } over all
      routers i on the end-to-end path).

      The bottleneck router is simulated by specifying a buffer space
      and an outgoing link speed (the arrival rate in our simulation
      was fixed by the link speed of the interface on which the packet
      arrived; this speed was a fixed value of 10 Mbps). To simulate
      the outgoing link, the STIS enqueues each TCP packet and
      schedules it based on the queuing delay and link delay (service
      time) of the packets in the queue (see item-2 for details). A
      packet received on any of the interfaces is processed only if
      the total bytes of enqueued data in the router is less than the
      specified buffer space. A packet is dropped if the buffer space
      is full. Please note that the buffer space is allocated for, and
      shared by, all TCP connections going through the STIS.

   2. Link Speed Simulation

      Although all TCP connections share the same buffer space, each
      TCP connection has its own logical queue for each direction. The
      departure time of a new packet is computed by adding the link
      delay (packet-size / outgoing-link-speed) of the new packet to
      the queuing delay of all packets ahead of it. A TCP packet is
      scheduled on the appropriate interface once its departure time
      expires.

   3. Fixed Network Delay

      The wired-network delay seen by an individual packet is the sum
      of the queuing delay and link delay of that packet at each
      router. If we assume that there are infinitely many routers on
      the end-to-end path, and that the variance of the delay is
      bounded (which is the case with finite queue sizes), then by the
      law of large numbers the delay seen by each packet will be a
      constant. Although real networks do not have infinitely many
      routers, to keep our model simple we assume that the delay added
      to each packet on the end-to-end path is a constant. This delay
      is added on top of the queuing delay incurred by the link speed
      simulation.
      (Note that the fixed network delay tries to capture the queuing
      delay on the high-speed links, while the delay incurred by the
      link speed simulation is the delay due to a slow last-hop link.)

      The fixed network delay is simulated by adding a fixed delay to
      the departure time of each packet.

   4. Prolonged packet stalling

      In addition to the delay models described above, the STIS
      simulates prolonged packet stalling in the network. Although the
      end result of packet stalling is a delay spike, we do not simply
      add a long delay burst to a few packets; instead we use the
      Markov model shown in Figure-6 (sketched in code after this
      list). This model is based on the two different kinds of delay
      spikes seen in many cellular networks.

      Every second, a random number r (< 1) is generated for each TCP
      connection (say at time T) and compared with p1 and p2. If r is
      less than p1 or p2, a state transition is made into state-2 or
      state-3 respectively. While a TCP connection is in state-2 or
      state-3, each packet arriving in the time interval [T, T+D]
      (where D = d1 or d2 depending on the state) is treated as if it
      arrived at time T+D. By altering the arrival time of each
      packet, this model simulates packet stalling in the network.

      Figure-6

                              p = 1-p1-p2
                               +------+
                                \    /
                         +-------V-------+
            p=1          | No Delay Spike|          p=1
        +--------------->|   (state-1)   |<---------------+
        |                +--+---------+--+                |
        |                   |         |                   |
        |              p=p1 |         | p=p2              |
        |                   V         V                   |
   +----+-----------------+             +-----------------+----+
   | Moderate delay spike |             |  Large delay spike   |
   |  delay=d1 (state-2)  |             |  delay=d2 (state-3)  |
   +----------------------+             +----------------------+

      Please note that in a real network, when packet stalling
      occurs--for whatever reason--all packets remain queued in the
      network until they can be forwarded again. The effect of packet
      stalling is therefore different from adding a long burst of
      delay to a few of the packets; with this model, the delay
      incurred by individual packets will not all be the same.

      Please also note that the state transitions are based on time.
      The STIS therefore keeps entering state-2 and state-3 whether or
      not there are any TCP packets to be forwarded.

   5. Packet Reordering

      Since the Internet is known to reorder 12% of packets (a study
      by Vern Paxson of UC Berkeley), we try to simulate this behavior
      as well. To simulate packet reordering we emulate the network
      behavior that causes it--that is, we emulate route failure or
      load balancing.

      A route failure is modeled using a two-state Markov model in
      which a transition from one state to the other represents a
      route failure. In each state, the TCP connection incurs a
      different fixed network delay, as described in item-3 above.
      This model is based on the assumption that when a route failure
      takes place, the delay on the new path differs from the delay on
      the old path.
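   The following sketch shows the packet-stalling model of item-4 (the
   three-state Markov chain of Figure-6). It is our illustrative
   reconstruction, not the STIS source code; the parameter values are
   taken from Table-1 below, and the function names are ours:

      import random

      # Illustrative reconstruction of the Figure-6 stalling model.
      # Parameters follow Table-1: p1/p2 are per-second transition
      # probabilities, d1/d2 are the stall durations in seconds.
      P1, D1 = 0.05, 5.0     # moderate delay spike (state-2)
      P2, D2 = 0.005, 8.0    # large delay spike (state-3)

      def stall_windows(duration_s, seed=0):
          """Return (start, end) intervals during which packets stall.

          Once per simulated second, draw r and enter state-2 or
          state-3; every packet arriving inside [T, T+D] is released
          at T+D. (Simplified: overlapping windows are allowed and the
          first matching window wins.)
          """
          rng = random.Random(seed)
          windows = []
          t = 0
          while t < duration_s:
              r = rng.random()
              if r < P1:
                  windows.append((t, t + D1))   # state-2 spike
              elif r < P1 + P2:
                  windows.append((t, t + D2))   # state-3 spike
              t += 1          # transitions are clocked in time, not packets
          return windows

      def release_time(arrival, windows):
          """A packet arriving inside a stall window leaves at its end."""
          for start, end in windows:
              if start <= arrival < end:
                  return end
          return arrival

      windows = stall_windows(600)
      print(len(windows), "stall events in 10 minutes")
      print("packet at t=100.0 released at", release_time(100.0, windows))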
App-2: Test setup, parameters, and Results

   Based on the models described above, the STIS uses the following
   parameters for testing:

   Table-1: STIS parameters

   +---------------------------------------+------------------+
   | Parameter Name                        | Parameter Value  |
   +---------------------------------------+------------------+
   | Path MTU                              | 1500 Bytes       |
   | Bottleneck Router's Capacity          | 74 KBytes        |
   | Outgoing link data rate               | 50 Kbits/sec     |
   | Fixed Network Delay (one way)         | 200 ms           |
   | Pkt Stalling delay in state-2 (d1)    | 5 sec            |
   | Pkt Stalling delay in state-3 (d2)    | 8 sec            |
   | Pkt Stalling prob to state-2 (p1)     | 0.05 per-sec     |
   | Pkt Stalling prob to state-3 (p2)     | 0.005 per-sec    |
   | Pkt reordering probability            | 0.12             |
   | Pkt reordering delay                  | 20 ms            |
   +---------------------------------------+------------------+

   On the receiving host, multiple TCP receiver processes run
   simultaneously and download files of different sizes--each
   connection competing to download a file of a given size. In
   addition, each process waits for a random time interval and then
   iterates, downloading the same file over and over. The following
   table shows the traffic mix used in our experiments.

   Table-2: Traffic Mix

   +---------------+------------------+--------------+
   | File Size     | # simultaneous   | # iterations |
   | (KB)          | connections      |              |
   +---------------+------------------+--------------+
   | 5             | 6                | 2000         |
   | 10            | 5                | 1000         |
   | 100           | 5                | 100          |
   | 1000          | 3                | 10           |
   | 10000         | 1                | 1            |
   +---------------+------------------+--------------+

   Based on the STIS parameters and traffic mix, we generated, for
   each file size, the cumulative probability distribution of download
   time, i.e., the number of downloads completing within a given time.
   (For lack of a suitable plain-text format for these plots, we
   cannot include them in this document; we can provide them on
   request.)

   Tables 3, 4, and 5 show the mean and variance (in seconds) of the
   download time for different file sizes using DCLOR, Eifel, F-RTO,
   and standard TCP. Please note that not every connection was
   subjected to a spurious timeout, and therefore the mean value alone
   is not a good measure of performance. The variance captures the
   impact of the spurious timeouts (the full probability distribution
   is the best representation of how the individual schemes perform
   over different download-time ranges). As can be seen from the
   tables, Eifel has a large variance because it restores the
   congestion window. The variance for standard TCP is lower than for
   Eifel or F-RTO because the first N packets sent by standard TCP
   after a timeout are redundant: even if they are lost to an error
   burst, the defaulting connection does not incur a severe penalty,
   and the variance stays small. For Eifel and F-RTO, the N packets
   sent after a spurious timeout are new data, so the loss of these
   packets has a large impact on the variance.
   Table-3: Download Time Statistics for 5K files

   +---------+--------+-------+--------+-------+
   | (sec)   | DCLOR  | EIFEL |STD_TCP | FRTO  |
   +---------+--------+-------+--------+-------+
   |Mean     | 2.3869 | 2.4162| 2.3962 | 2.4049|
   |Variance | 3.2473 |10.2579| 3.2164 | 4.0466|
   +---------+--------+-------+--------+-------+

   Table-4: Download Time Statistics for 10K files

   +---------+-------+--------+--------+------+
   | (sec)   | DCLOR | EIFEL  |STD_TCP | FRTO |
   +---------+-------+--------+--------+------+
   |Mean     |3.4547 | 4.0560 | 3.7314 |3.7317|
   |Variance |4.7452 |19.7828 | 7.6378 |8.4576|
   +---------+-------+--------+--------+------+

   Table-5: Download Time Statistics for 100K files

   +---------+---------+---------+---------+---------+
   | (sec)   | DCLOR   | EIFEL   | STD_TCP | FRTO    |
   +---------+---------+---------+---------+---------+
   |Mean     | 24.6297 | 26.1017 | 26.7425 | 25.5325 |
   |Variance | 66.0804 | 98.4933 | 98.9363 | 78.1667 |
   +---------+---------+---------+---------+---------+

   In addition to the mean and variance of the download times, we also
   calculated the "spectral efficiency" of the system, computed as:

              redundant_bytes_received
       SE = ----------------------------
               mean(cwnd) * MSS

   The spectral efficiency measures the usefulness of a particular
   scheme in terms of the fraction of extra bytes sent due to spurious
   timeouts for every byte of average network capacity; the smaller
   this number, the better the scheme. The numerator is the amount of
   unnecessary data, while the denominator is the average network
   capacity experienced by the connection. Since some schemes (Eifel,
   for example) require more TCP option space, the MSS was used
   instead of the MTU in the denominator. Table-6 shows these results
   for all four TCP implementations. Please note that F-RTO behaves
   like standard TCP if there is no new data to send or if the
   congestion window is as big as the flow control window; this is one
   of the reasons why F-RTO's performance is worse than Eifel's or
   DCLOR's.

   Table-6: Spectral Efficiency

   +-----------+----------+----------+----------+----------+
   | File Size | DCLOR    | EIFEL    | STD_TCP  | FRTO     |
   |           |(MSS=1460)| (1448)   | (1460)   | (1460)   |
   +-----------+----------+----------+----------+----------+
   | 5K        | 0.004042 | 0.004454 | 0.092714 | 0.071336 |
   | 10k       | 0.005249 | 0.012293 | 0.078977 | 0.052581 |
   | 100k      | 0.017124 | 0.036748 | 0.624361 | 0.079285 |
   +-----------+----------+----------+----------+----------+

   The DCLOR algorithm for all these test cases was implemented in
   Linux kernel 2.5.40 (the source code for DCLOR can be provided on
   request). The same kernel version was used for all the other TCP
   enhancements as well.
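   To make the SE computation concrete, here is a minimal sketch with
   hypothetical measurement values (the raw byte counts behind Table-6
   are not published in this document):

      # Hypothetical example of the SE computation defined above.
      MSS = 1460                # bytes, as in Table-6 for DCLOR
      mean_cwnd = 8.0           # assumed average congestion window (segments)
      redundant_bytes = 47      # assumed unnecessarily retransmitted bytes

      se = redundant_bytes / (mean_cwnd * MSS)
      print(f"SE = {se:.6f}")   # extra bytes per byte of average capacity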