Internet Engineering Task Force                            Urtzi Ayesta
Internet Draft                                       France Telecom R&D
Document: draft-ayesta-to-short-tcp-00.txt       Konstantin Avrachenkov
Expires: October 2002                                             INRIA
                                                           October 2002

   On reducing the number of TimeOuts for short-lived TCP connections

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026 [1]. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document shows that short TCP sessions are prone to timeouts. In
particular, a single segment loss will cause TCP to time out if the
document size is below a certain threshold.
This document analyzes the
benefit of TCP modifications such as the Limited Transmit Algorithm
[RFC3042] and Increasing Initial Window [RFC2414] in the context of
short-lived TCP transfers. However, TCP remains vulnerable to losses
at the very end of the transmission. Therefore we suggest complementary
modifications to the Limited Transmit Algorithm to recover effectively
from losses at the end of the TCP transfer.

Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC-2119 [2].

1. Introduction

A TCP sender requires the reception of three duplicate acknowledgements
(ACKs) to recover from a segment loss without timing out. Consequently,
losses at the very end of the transmission will inevitably provoke a
timeout. This might especially degrade the TCP performance of
short-lived sessions. This document analyzes possible modifications to
reduce the timeout probability.

RFC 2988 [RFC2988] defines the standard algorithm to compute the
retransmission timeout (RTO). In particular, RFC 2988 [RFC2988]
recommends rounding this timer up to 1 second to avoid retransmitting
segments that are only delayed and not lost. Because of this
conservative RTO definition, it is important for TCP senders to detect
and recover from as many losses as possible without a timeout.

The TCP loss recovery mechanism has undergone several modifications
in recent years. The fast retransmission algorithm, which was
developed in Tahoe TCP [Jac88], retransmits an unacknowledged segment
upon reception of three duplicate ACKs, sets the congestion window to
one, sets the slow-start threshold to half of the current congestion
window and begins slow start.
In the fast recovery algorithm proposed
in the Reno version of TCP [FF96], after receiving three duplicate
ACKs, the congestion window is halved and Congestion Avoidance replaces
slow start. TCP's selective acknowledgement (SACK) [RFC2018] option
permits the receiver to inform the sender about the data blocks that
were successfully received.

Recently two new modifications have been proposed: Increasing the
Initial Window (IW) [RFC2414] and the Limited Transmit Algorithm (LT)
[RFC3042]. According to the IW proposal the initial size of the
congestion window is increased from one or two segment(s) to roughly 4K
bytes (never more than four segments). This modification benefits the
individual connection in several ways [RFC2414]. In the particular case
of short-lived TCP we note that it reduces the transfer time by
several round trip times (RTT) and it makes TCP more robust against
segments lost at the very beginning of the connection. With LT, the TCP
sender sends a new data segment in response to each of the first two
duplicate ACKs. Eventually it will receive a third duplicate ACK, which
will trigger the fast retransmission and fast recovery phases. Clearly,
transmitting these new data segments increases the probability that TCP
can recover from lost segment(s) without timing out (see [Flo01] for
simulation examples of LT).

In the literature it has been reported that many timeouts occur
because fast recovery is not triggered. In [LK98] the authors analyzed
part of the traces collected by Paxson [Pax97] and found that 85% of
the timeouts were due to this reason. [BPS+97] found that almost 50% of
the losses required a timeout to recover. In addition, only 4% of
them could have been avoided with the TCP selective acknowledgement
(SACK) mechanism and 25% using LT. Unfortunately, to the best of our
Unfortunately, to the best of our 100 knowledge some important questions remain open: Why do TCP senders 101 receive not enough duplicate ACKs? Is this because of the small size of 102 the congestion window or because of burstyness of the segment loss 103 process? 105 So far the same TCP algorithm is used regardless the size of the file 106 to be transmitted. It is known (see, e.g. [FBP+01]) that a TCP session 107 typically belongs to one of the following two kinds: "mice" or 108 "elephants". Most TCP sessions are "mice" with a small size, but a 109 small amount of "elephants" (in terms of flows) is responsible for the 110 largest amount of transferred data (in terms of bytes) (approximately 111 80% according to [GM01]). In [TMW97] the authors based on measurements 112 in the backbone found that the average size of flows was 10Kbytes. More 113 recent measurements on the Sprint IP backbone network [FMD+01] show 114 that around 70% of the flows carry fewer than 1Kbyte and 90% of the 115 flows carry fewer than 10Kbytes. 117 Note: The values of [FMD+01] do not correspond only to TCP, but to all 118 transport protocols. Still the authors report that over 90% of the 119 traffic is transmitted with TCP, even on the links with a significant 120 percentage of streaming media. In [TMW97] authors have reported that 121 TCP carries 95% of the bytes, 90% of the segments and 80% of the flows 122 on the link. 124 Therefore, it seems feasible to modify TCP and improve the performance 125 of short-lived TCP flows without significant increase of the overall 126 network load. 128 The rest of the paper is organized as follows, Section 2 analyzes in 129 detail the performance of TCP focusing on short-lived TCP sessions. In 130 Section 3 some possible TCP modifications are discussed and simulation 131 results are presented. The last section is conclusions. 133 2. What causes short-lived TCP to timeout? 

When a loss occurs, the congestion window of the sender will continue
sliding forward until the lost segment reaches the leftmost position.
If the value of the congestion window is less than four segments (two
with LT) the TCP session will time out. There is yet another situation
in which the sender will inevitably time out. Namely, if a loss occurs
when the remaining amount of data is less than three segments, no
matter what the actual value of the congestion window is, the sender
will not receive three duplicate ACKs and will have to rely on a
timeout to detect the loss.

As a consequence, one can identify three situations where TCP sessions
are prone to timeouts. The first case corresponds to the beginning of
the session, when the congestion window is below 4 (2 with LT) segments.
The second case corresponds to the middle part of the transfer, when the
congestion window is small. This happens, for example, when the limit
imposed by the receiver's advertised window is small, when the link has
a small bandwidth-delay product, or after a loss recovery phase. The
third case corresponds to the very end of the transmission. Namely, if
any of the last three segments is lost the sender will not receive
three duplicate ACKs and it will inevitably time out. IW helps in the
first case; LT helps in the first and second cases. However, neither
of them helps at the end of the transmission.

Note: At the end of the transmission, the use of LT does not make any
difference, since it only sends new data upon reception of the first
two duplicate ACKs and it does not make the decision to retransmit a
segment until three duplicate ACKs are received. To the best of our
knowledge this case was first observed in [AA02].

The third case might not be of crucial importance for long-lived TCP
flows, but it may have a significant effect on the transfer time of
short-lived TCP.
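
The three situations above can be illustrated with a minimal Python
sketch. It is not part of the proposal and rests on simplifying
assumptions that are ours: a single loss, no delayed ACKs, no ACK
losses, and a receiver window that never limits the sender. The
function and parameter names (duplicate_acks,
single_loss_forces_timeout, cwnd, segments_after_loss) are
hypothetical and chosen only for illustration.

```python
def duplicate_acks(cwnd, segments_after_loss, limited_transmit=False):
    """Count the duplicate ACKs a sender receives after one lost segment.

    cwnd: congestion window (in segments) at the moment the lost
        segment is the oldest outstanding one (assumed >= 1).
    segments_after_loss: segments of the file that follow the lost one.
    Assumptions (illustrative only): single loss, no delayed ACKs,
    no ACK loss, receiver window not limiting.
    """
    # Segments already in flight behind the lost one; each arrival
    # makes the receiver emit one duplicate ACK.
    pending = min(cwnd - 1, segments_after_loss)
    unsent = segments_after_loss - pending
    received = 0
    while pending > 0:
        pending -= 1
        received += 1
        # Limited Transmit: each of the first two duplicate ACKs
        # releases one new segment, if there is still data to send.
        if limited_transmit and received <= 2 and unsent > 0:
            unsent -= 1
            pending += 1
    return received


def single_loss_forces_timeout(cwnd, segments_after_loss,
                               limited_transmit=False):
    """True if fewer than three duplicate ACKs arrive, so that only
    the retransmission timer can recover the loss."""
    return duplicate_acks(cwnd, segments_after_loss, limited_transmit) < 3
```

Under these assumptions, a window of three segments without LT or a
window of one segment with LT leads to a timeout, while a loss among
the last three segments of the file leads to a timeout whatever the
window size and whether or not LT is used.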

One can define a threshold on the file size (TO-THRESH), such that if
the file size is less than this threshold a single loss will inevitably
lead to a timeout. TO-THRESH is given by the sum of the number of
segments that have to be transmitted to reach a congestion window of
size 4 (two with LT) and three segments corresponding to the end of the
file. When the receiver does not employ delayed ACKs we get the
following values (the contribution of the two intervals is shown in
parentheses):

   Initial        TCP        TCP with
   Window                    Limited Transmit
      1          6(3+3)         4(1+3)
      2          5(2+3)         3(0+3)
      3          4(1+3)         3(0+3)
      4          3(0+3)         3(0+3)

When TCP employs delayed ACKs we get the following values:

                 Delayed ACK employed
   Initial        TCP        TCP with
   Window                    Limited Transmit
      1          7(4+3)         5(2+3)
      2          6(3+3)         4(1+3)
      3          4(1+3)         3(0+3)
      4          3(0+3)         3(0+3)

The values presented in the tables above, along with the statistics
reported on the file size of TCP flows [GM01,TMW97,FMD+01], suggest
that TO-THRESH is of the same order as the size of a major portion of
the TCP flows. Clearly, this implies that TCP's loss recovery mechanism
does not work well for "mice" type TCP flows. Balakrishnan et al.
[BPS+97] concluded from measurements on an Internet server that TCP's
loss recovery performance is poor when it comes to short Web transfers.
We presume that the end-of-file effect might have had an impact on the
measurements and was not identified by the authors.

In [AA02] we look at the expected TCP transfer time conditioned on the
number of losses. Via simulations and the theoretical model we observed
a very interesting phenomenon: the non-monotonicity of the expected
conditional transfer time. That is, given that certain segment(s)
are lost, it turns out that on average it may take less time to
transmit a larger file. For instance, in the case of one loss, the
plot of transfer time vs.
file size shows a unique peak at TO-THRESH.
First the transfer time increases as long as the file size is smaller
than TO-THRESH. Then the transfer time starts to decrease, and only
beyond a certain file size does it start to increase again. This
behavior is due to the conservative duration of the retransmission
timer, typically several times greater than an average round trip time
(RTT).

3. TCP modifications to improve the performance of short-lived TCP
transfers.

From the previous section we know that short-lived TCP flows are
particularly vulnerable to segment losses, since in most situations
they will have to rely on an RTO to recover from them. The use of the
LT algorithm reduces the value of TO-THRESH and hence alleviates the
outlined problem. However, if there is no new data to send (at the end
of the file) LT does not help. Thus, at the end of the TCP transfer it
might be useful to retransmit early.

Paxson [Pax97] affirms that the TCP fast retransmission threshold could
be safely lowered from 3 duplicate ACKs to 2 by introducing a 20 msec
waiting time before retransmitting. This strategy, that is, waiting for
some time before deciding to retransmit a segment, could also be
adopted when one duplicate ACK is received and no further data is
queued to send.

With early retransmission only the loss of the last segment will force
the sender to time out. To overcome this, TCP could send an extra
segment at the end of the session (containing no data, of course). This
segment would not be sent reliably and its only goal would be to avoid
a timeout when the last segment is dropped.

On the other hand, this modification may degrade the performance of the
network because of the early retransmission of segments that are only
reordered and not lost, and may lead to an increase of the loss rate.
Several authors have studied the phenomenon of segment reordering.
Paxson [Pax97]
transmitted 100 Kbytes between 35 computers and measured that 0.1%-2%
of all segments (data and ACK) experienced reordering and 12%-35% of
the flows (depending on the data set) experienced at least one
reordered segment. Bennett et al. [BPS99] sent ICMP probes to the
MAE-East Internet exchange point and found that the probability of a
session experiencing reordering was over 90%. They conjecture that
reordering is a function of network load and consider it a result of
the use of parallelism in network devices. Iannaccone et al. In [BS02]
the authors develop three techniques to measure one-way segment
reordering and perform a test over a 20-day period. They establish that
over 40% of the paths tested experience some reordering during the
test.

Note: ACKs that acknowledge new data are the only ones that make the
sender increase the congestion window. If we consider a TCP receiver
that implements a delayed ACK algorithm with more than 50 ms idle time,
a reordered segment with a segment lag of 1 and a time lag of less than
50 ms would not affect the number and the rate of ACKs acknowledging
new sequence numbers sent by the receiver to the sender.

Clearly, there is variability in the reported values of reordering,
and it is not possible to conclude whether this variability comes from
differences in the procedure used to collect and analyze the data or
from changes in the network (for example, a different degree of
parallelism in the switches). [JID+02] is in our opinion the most
comprehensive study carried out until now, but its results have to be
confirmed by other studies before concluding that reordering is not
significant in today's Internet. To explain the differences among the
cited papers, it is worth noting that [Pax97] focuses on long-lived TCP
flows while in [JID+02] the authors deal with the usual mix of
"elephant" and "mice" type flows. Bennett et al.
[BPS99] study is based on measurements taken
at a particular switch that is known to induce a high level of
reordering, while [JID+02] is based on flows from a large and diverse
range of sources and destinations.

We have implemented this modification in the ns simulator [NS] (early
retransmission at the end of the transfer on top of LT) to evaluate the
reduction of the transfer time. We have not investigated the impact of
segment reordering because no appropriate models exist.
We compute the conditional expected transfer time given that the flow
experiences at least one loss. We compare the values of the conditional
expected transfer time for TCP without LT, TCP with LT, and TCP with LT
and early retransmission at the end of the file. In the particular case
of RTT=100ms we found that LT decreases the conditional expected
transfer time of a 6-segment file by 10% and our proposal by 45%. The
reduction in conditional expected transfer time decreases as the file
size increases. This demonstrates that our modification benefits
short-lived TCP transfers.

One may expect that the increase of the network load because of
spurious retransmissions is proportional to the number of spurious
retransmissions induced by the modification. If the measurements of the
flow size distribution [FBP+01,GM01,TMW97,FMD+01] and the segment
reordering rates [Pax97,BPS99,JID+02] are in agreement with the real
Internet, one expects that our modification will not lead to a
significant increase of the load and loss rates, particularly if we
note that some of the reorderings are invisible to the sender due to
the small time lag [JID+02].

4. Conclusion.

This document analyzes the impact of timeouts on the
performance of short-lived TCP flows.
The document proposes a
modification of TCP on top of the LT algorithm to avoid timeouts
and hence to reduce the transfer time.

Security Considerations

This document proposes a modification of TCP on top of the Limited
Transmit Algorithm. Security considerations concerning the Limited
Transmit Algorithm are discussed in RFC 3042 and they also apply to
this algorithm. Secondly, when duplicate ACKs are received and there is
no more data to send, this document proposes that TCP retransmit
immediately to avoid timeouts. This modification does not raise any
known security issue.

References

[AA02] Urtzi Ayesta, Konstantin Avrachenkov, "The Effect of the
Initial Window Size and Limited Transmit Algorithm on the Transient
Behavior of TCP Transfers", In Proc. of the 15th ITC Internet
Specialist Seminar, Wurzburg, July 2002.

[BPS+97] Hari Balakrishnan, Venkata Padmanabhan, Srinivasan
Seshan, Mark Stemm and Randy Katz, "TCP Behavior of a Busy Web Server:
Analysis and Improvements", Proc. IEEE INFOCOM, San Francisco, CA,
March 1998.

[BPS99] J.C.R. Bennett, C. Partridge and N. Shectman, "Packet
Reordering is Not Pathological Network Behavior", IEEE/ACM Transactions
on Networking, Vol. 7, No. 6, December 1999.

[BS02] John Bellardo, Stefan Savage, "Measuring Packet Reordering",
ACM SIGCOMM Internet Measurement Workshop 2002, Marseille, France,
November 2002.

[CSA00] Neal Cardwell, Stefan Savage, Thomas Anderson, "Modeling TCP
latency", in Proc. IEEE INFOCOM 2000, Tel-Aviv, Israel, March 2000.

[CMT98] K. Claffy, Greg Miller, and Kevin Thompson, "The nature of the
beast. Recent traffic measurements from an Internet backbone", In
Proceedings of INET '98, July 1998.

[FF96] Kevin Fall, Sally Floyd, "Simulation-based Comparisons of
Tahoe, Reno and SACK TCP", Computer Communication Review, July 1996.

[Flo01] Floyd, S.
"A Report on Some Recent Developments in TCP
Congestion Control", IEEE Communications Magazine, April 2001.

[FMD+01] C. Fraleigh, S. Moon, C. Diot, B. Lyles, F. Tobagi,
"Packet-Level Traffic Measurements from a Tier-1 IP Backbone", Sprint
ATL Technical Report TR01-ATL-110101, November 2001.

[FBP+01] S. Ben Fredj, T. Bonald, A. Proutiere, G. Regnie, J. Roberts,
"Statistical Bandwidth Sharing: A Study of Congestion at Flow Level",
SIGCOMM 2001.

[GM01] Liang Guo, Ibrahim Matta, "The War Between Mice and Elephants",
Proc. 9th IEEE International Conference on Network Protocols (ICNP'01),
Riverside, CA, November 2001.

[Jac88] Jacobson, V., "Congestion Avoidance and Control", SIGCOMM 1988,
Stanford, CA, August 1988.

[JID+02] S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose, D. Towsley,
"Measurement and Classification of Out-of-Sequence Packets in a Tier-1
IP Backbone", ACM SIGCOMM Internet Measurement Workshop 2002,
Marseille, France, November 2002. Extended version available as: UMass
CMPSCI Technical Report TR 02-17.

[NS] Ns network simulator. URL: http://www.isi.edu/nsnam/.

[LK98] Lin, D., and Kung, H.T., "TCP Fast Recovery Strategies: Analysis
and Improvements", In Proc. of INFOCOM 98, San Francisco, CA, March
1998.

[Pax97] Vern Paxson, "End-to-End Internet Packet Dynamics", ACM
SIGCOMM, Cannes, France, September 1997.

[RFC1122] Braden, R., "Requirements for Internet Hosts -- Communication
Layers", STD 3, RFC 1122, October 1989.

[RFC2018] Mathis M., Mahdavi J., Floyd S., Romanow A., "TCP Selective
Acknowledgement Options", RFC 2018.

[RFC2414] M. Allman, S. Floyd, C. Partridge, "Increasing TCP's Initial
Window", RFC 2414, September 1998. A small modification of RFC 2414 has
been approved by the IESG to go to Proposed Standard on August 28,
2002.

[RFC2581] M. Allman, V. Paxson, W. Stevens, "TCP Congestion Control",
RFC 2581, April 1999.

[RFC2988] Vern Paxson, Mark Allman, "Computing TCP's Retransmission
Timer", RFC 2988, November 2000.

[RFC3042] Mark Allman, Hari Balakrishnan, Sally Floyd, "Enhancing
TCP's Loss Recovery Using Limited Transmit", RFC 3042, January 2001.

[TMW97] Kevin Thompson, Gregory J. Miller, and Rick Wilder, "Wide-area
Internet traffic patterns and characteristics", IEEE Network, 11(6),
November 1997.

Acknowledgments

Author's Addresses

Urtzi Ayesta
France Telecom R&D
905 rue Albert Einstein
06921 Sophia Antipolis
France
Email: Urtzi.Ayesta@francetelecom.com

Konstantin Avrachenkov
INRIA
2004 route des Lucioles, B.P.93
06902, Sophia Antipolis
France
Phone: 00 33 492 38 7751
Email: k.avrachenkov@inria.fr

[1] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9,
RFC 2026, October 1996.

[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.