Internet Engineering Task Force                                Phil Karn
INTERNET DRAFT                                                Aaron Falk
                                                               Joe Touch
                                                    Marie-Jose Montpetit
                                                         Jamshid Mahdavi
                                                      Gabriel Montenegro

File: draft-ietf-pilc-link-design-01.txt                   October, 1999
Expires: April, 2000

                Advice for Internet Subnetwork Designers

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document provides advice to the designers of digital
   communication equipment, link layer protocols, and packet-switched
   subnetworks (collectively referred to as subnetworks) who wish to
   support the Internet protocols but who may be unfamiliar with the
   architecture of the Internet and the implications of their design
   choices on the performance and efficiency of the Internet.

   This document represents an evolving consensus of the members of the
   IETF Performance Implications of Link Characteristics (PILC) working
   group.

Introduction and Overview

   The Internet Protocol [RFC791] is the core protocol of the
   world-wide Internet that defines a simple "connectionless"
   packet-switched network.
   The success of the Internet is largely attributed to the simplicity
   of IP, the "end-to-end principle" on which the Internet is based,
   and the resulting ease of carrying IP on a wide variety of
   subnetworks not necessarily designed with IP in mind.

   But while many subnetworks carry IP, they do not necessarily do so
   with maximum efficiency, minimum complexity, or minimum cost.  Nor
   do they necessarily implement the features needed to efficiently
   support newer Internet capabilities of increasing importance, such
   as multicasting or quality of service.

   With the explosive growth of the Internet, IP is an increasingly
   large fraction of the traffic carried by the world's
   telecommunications networks.  It therefore makes sense to optimize
   both existing and new subnetwork technologies for IP as much as
   possible.

   Optimizing a subnetwork for IP involves three complementary
   considerations:

   1. Providing functionality sufficient to carry IP.

   2. Eliminating unnecessary functions that increase cost or
      complexity.

   3. Choosing subnetwork parameters that maximize the performance of
      the Internet protocols.

   Because IP is so simple, consideration 2 is more of an issue than
   consideration 1; i.e., subnetwork designers make many more errors of
   commission than errors of omission.  But certain enhanced Internet
   features, such as multicasting and quality-of-service, rely on
   support from the underlying subnetworks beyond that necessary to
   carry "traditional" unicast, best-effort IP.

   A major consideration in the efficient design of any layered
   communication network is the choice of the appropriate layer(s) in
   which to implement a given feature.  This issue was first addressed
   in the seminal paper "End-to-End Arguments in System Design"
   [SRC81].  This paper argued that many -- if not most -- network
   functions are best implemented on an end-to-end basis, i.e., at the
   higher protocol layers.  Duplicating these functions at the lower
   levels is usually redundant, and can even be harmful.  However,
   certain low level functions can sometimes be justified as a
   performance enhancement.  An example would be link layer
   retransmission on an unusually lossy channel, e.g., mobile radio.

   The architecture of the Internet was heavily influenced by the
   end-to-end principle, and in our view it was crucial to the
   Internet's success.

   The remainder of this document discusses the various subnetwork
   design issues that the authors consider relevant to efficient IP
   support.

Maximum Transmission Units (MTUs) and IP Fragmentation

   IP packets (datagrams) vary in size from 20 bytes (the size of the
   IP header alone) to a maximum of 65535 bytes.  Subnetworks need not
   support maximum-sized (64KB) IP packets, as IP provides a scheme
   that breaks packets that are too large for a given subnetwork into
   fragments that travel as independent packets and are reassembled at
   the destination.  The maximum packet size supported by a subnetwork
   is known as its Maximum Transmission Unit (MTU).
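   As a concrete illustration of this scheme, the sketch below (Python
   is used for illustrations throughout this document) computes the
   fragments a router would emit for a datagram larger than the
   next-hop MTU.  The 4000-byte datagram and 1500-byte MTU are example
   values, and IP options are assumed absent.

   # Sketch: IPv4 fragmentation of one datagram to fit a smaller MTU.
   # Fragment offsets are carried in units of 8 bytes, so every
   # fragment except the last carries a multiple of 8 payload bytes.

   IP_HEADER = 20  # bytes, assuming no IP options

   def fragment(total_len, mtu):
       """Yield (offset in 8-byte units, payload bytes, more-fragments)."""
       payload = total_len - IP_HEADER
       per_frag = (mtu - IP_HEADER) // 8 * 8  # round down to 8-byte multiple
       offset = 0
       while payload > 0:
           chunk = min(per_frag, payload)
           payload -= chunk
           yield offset // 8, chunk, payload > 0
           offset += chunk

   # A 4000-byte datagram entering a subnetwork with a 1500-byte MTU:
   for off, length, mf in fragment(4000, 1500):
       print("offset=%4d payload=%5d MF=%d" % (off, length, mf))
   # offset=   0 payload= 1480 MF=1
   # offset= 185 payload= 1480 MF=1
   # offset= 370 payload= 1020 MF=0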
   Subnetworks may, but are not required to, indicate the lengths of
   the packets they carry.  One example is Ethernet with the widely
   used DIX (not IEEE 802.3) header, which lacks a length field to
   indicate the true data length when the packet is padded to the
   60-byte minimum.  This is not a problem for uncompressed IP because
   it carries its own length field.

   If optional header compression [RFC1144] [RFC2507] [RFC2508] is
   used, however, the link framing is required to indicate the frame
   length, as it is needed for the reconstruction of the original
   header.

   In IP version 4 (the current IP), fragmentation can occur at either
   the sending host or in an intermediate router, and fragments can be
   further fragmented at subsequent routers if necessary.

   In IP version 6, fragmentation can occur only at the sending host;
   it cannot occur in a router.

   Both IPv4 and IPv6 provide a "Path MTU Discovery" procedure
   [RFC1191] [RFC1435] [RFC1981] that allows the sending host to avoid
   fragmentation by discovering the minimum MTU along a given path and
   reducing its packet sizes accordingly.  This procedure is optional
   in IPv4 but mandatory in IPv6, where there is no router
   fragmentation.

   The Path MTU Discovery procedure (and the deletion of router
   fragmentation in IPv6) reflects a consensus of the Internet
   technical community that IP fragmentation is best avoided.  This
   requires that subnetworks support MTUs that are "reasonably" large.
   The smallest MTU that IPv4 can use is 28 bytes, but this is clearly
   unreasonable; because each IP header is 20 bytes, only 8 bytes per
   packet would be available to carry transport headers and
   application data.

   If a subnetwork cannot directly support a "reasonable" MTU with
   native framing mechanisms, it should internally fragment.  That is,
   it should transparently break IP packets into internal data
   elements and reassemble them at the other end of the subnetwork.

   This leaves the question of what is a "reasonable" MTU.  Ethernet
   (10 and 100 Mb/s) has an MTU of 1500 bytes, and because of its
   ubiquity few Internet paths have MTUs larger than this value.  This
   severely limits the utility of larger MTUs provided by other
   subnetworks.  But larger MTUs are increasingly desirable on high
   speed subnetworks to reduce the per-packet processing overhead in
   host computers, and implementers are encouraged to provide them
   even though they may not be usable when Ethernet is also in the
   path.

Choosing the MTU in Slow Networks [Stevens94, RFC1144]

   In slow networks, the time required to transmit the largest
   possible packet may be considerable.  Interactive response time
   should not exceed the well-known human factors limit of 100 to 200
   ms.  This includes all sources of delay: electromagnetic
   propagation delay, queueing delay, and the store-and-forward time,
   i.e., the time to transmit a packet at link speed.

   At low link speeds, store-and-forward delays can dominate total
   end-to-end delay, and these are in turn directly influenced by the
   maximum transmission unit (MTU).  Even when an interactive packet
   is given a higher queuing priority, it may have to wait for a large
   bulk transfer packet to finish transmission.  This worst-case wait
   can be set by an appropriate choice of MTU.

   For example, if the MTU is set to 1500 bytes, then an MTU-sized
   packet will take about 8 milliseconds to send on a T1 (1.536 Mb/s)
   link.  But if the link speed is 19.2 kb/s, then the transmission
   time becomes 625 ms -- well above our 100-200 ms limit.  A 256-byte
   MTU would lower this delay to a little over 100 ms.  However, care
   should be taken not to lower the MTU excessively, as this will
   increase header overhead and trigger IP fragmentation (if Path MTU
   Discovery is not in use).
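   These store-and-forward figures follow directly from packet size
   and link rate, as a small helper shows:

   # Worst-case wait behind one maximum-sized packet, in milliseconds.
   def tx_time_ms(packet_bytes, link_bps):
       return packet_bytes * 8.0 / link_bps * 1000

   print(tx_time_ms(1500, 1536000))  # T1: ~7.8 ms
   print(tx_time_ms(1500, 19200))    # 19.2 kb/s link: 625 ms
   print(tx_time_ms(256, 19200))     # 256-byte MTU at 19.2 kb/s: ~107 ms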
   One way to limit delay for interactive traffic without imposing a
   small MTU is to preempt (abort) the transmission of a lower
   priority packet when a higher priority packet arrives in the queue.
   However, the link resources used to send the aborted packet are
   lost, and overall throughput will decrease.

   Another way is to implement a link-level multiplexing scheme that
   allows several packets to be in progress simultaneously, with
   transmission priority given to segments of higher priority IP
   packets.  ATM (Asynchronous Transfer Mode) is an example of this
   technique.  However, ATM is generally used on high speed links
   where the store-and-forward delays are already minimal, and it
   introduces significant (~9%) additional overhead due to the
   addition of 5-byte frame headers to each 48-byte data frame.

   To summarize, there is a fundamental tradeoff between efficiency
   and latency in the design of a subnetwork, and the designer should
   keep this in mind.

Framing on Connection-Oriented Subnetworks

   IP needs a way to mark the beginning and end of each
   variable-length, asynchronous IP packet.  Some examples of links
   and subnetworks that do not provide this as an intrinsic feature
   include:

   1. leased lines carrying a synchronous bit stream;

   2. ISDN B-channels carrying a synchronous octet stream;

   3. dialup telephone modems carrying an asynchronous octet stream;
      and

   4. Asynchronous Transfer Mode (ATM) networks carrying an
      asynchronous stream of fixed-sized "cells".

   The Internet community has defined packet framing methods for all
   these subnetworks.  The Point-To-Point Protocol (PPP) [RFC1661] is
   applicable to bit synchronous, octet synchronous, and octet
   asynchronous links (i.e., examples 1-3 above).  ATM has its own
   framing methods, described in [RFC2684] [RFC2364].

   At high speeds, a subnetwork should provide a framed interface
   capable of carrying asynchronous, variable-length IP datagrams.
   The maximum packet size supported by this interface is discussed
   above in the MTU/Fragmentation section.  The subnetwork may
   implement this facility in any convenient manner.

   In particular, IP packet boundaries may, but need not, coincide
   with any framing or synchronization mechanisms internal to the
   subnetwork.  When the subnetwork implements variable sized data
   units, the most straightforward approach is to place exactly one IP
   packet into each subnetwork data unit (SDU), and to rely on the
   subnetwork's existing ability to delimit SDUs to also delimit IP
   packets.  A good example is Ethernet.  But some subnetworks have
   SDUs of one or more fixed sizes, as dictated by switching, forward
   error correction and/or interleaving considerations.  Examples of
   such subnetworks include ATM, with a single frame size of 48 bytes
   plus a 5-byte header, and IS-95 digital cellular, with two "rate
   sets" of four fixed frame sizes each that may be selected on 20
   millisecond boundaries.

   Because IP packets are variable sized, they may not necessarily fit
   into an integer multiple of fixed-sized SDUs.  An "adaptation
   layer" is needed to convert IP packets into SDUs while marking the
   boundary between each IP packet in some manner.

   There are two traditional approaches to the problem.  The first is
   to encode each IP packet into one or more SDUs, with no SDU
   containing pieces of more than one IP packet, and padding out the
   last SDU of the packet as needed.  Bits in a control header added
   to each SDU indicate where it belongs in the IP packet.  If the
   subnetwork provides in-order, at-most-once delivery, the header can
   be as simple as a pair of bits indicating whether the SDU is the
   first and/or the last in the IP packet.  Or only the last SDU of
   the packet could be marked, as this would implicitly mark the next
   SDU as the first in a new IP packet.  The AAL5 (ATM Adaptation
   Layer 5) scheme used with ATM is an example of this approach,
   though it adds other features, including a payload length field and
   a payload CRC.
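   To illustrate the first approach, the sketch below frames IP
   packets into fixed 48-byte SDUs behind a two-bit first/last control
   header, assuming in-order, at-most-once delivery.  The SDU size and
   header layout are illustrative, not any standard format; as noted,
   stripping the padding on reassembly would additionally require a
   length field, which is one of the features AAL5 adds.

   SDU_PAYLOAD = 48  # bytes per SDU, an ATM-like example value

   def packet_to_sdus(packet):
       """Split one IP packet into SDUs, each prefixed by 2 control bits."""
       chunks = [packet[i:i + SDU_PAYLOAD]
                 for i in range(0, len(packet), SDU_PAYLOAD)]
       sdus = []
       for i, chunk in enumerate(chunks):
           first = 2 if i == 0 else 0
           last = 1 if i == len(chunks) - 1 else 0
           padded = chunk + b'\0' * (SDU_PAYLOAD - len(chunk))
           sdus.append(bytes([first | last]) + padded)
       return sdus

   def sdus_to_packets(sdus):
       """Reassemble; removing padding would need a length field (cf. AAL5)."""
       buf, packets = b'', []
       for sdu in sdus:
           header, chunk = sdu[0], sdu[1:]
           if header & 2:               # "first" bit: start a new packet
               buf = b''
           buf += chunk
           if header & 1:               # "last" bit: packet complete
               packets.append(buf)
       return packets

   print(len(packet_to_sdus(b'x' * 100)))  # 3 SDUs, the last one padded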
   The second approach is to insert a special flag sequence into the
   data stream between each IP packet, and to pack the resulting data
   stream into SDUs without regard to SDU boundaries.  The flag
   sequence can also pad unused space at the end of an SDU.  If the
   special flag appears in the user data, it is escaped to an
   alternate sequence (usually larger than a flag) to avoid being
   misinterpreted as a flag.  The HDLC-based framing schemes used in
   PPP are all examples of this approach.

   Both adaptation schemes introduce overhead; how much depends on the
   distribution of IP packet sizes, the size(s) of the SDUs, and, in
   the HDLC-like approaches, the content of the IP packet (since flags
   occurring in the packet must be escaped, which expands them).  The
   designer must also weigh implementation complexity in the choice
   and design of an adaptation layer.

Connection-Oriented Subnetworks

   IP has no notion of a "connection"; it is a purely connectionless
   protocol.  When a connection is required by an application, it is
   usually provided by TCP, the Transmission Control Protocol, running
   atop IP on an end-to-end basis.

   Connection-oriented subnetworks can be (and are) widely used to
   carry IP, but often with considerable complexity.  Subnetworks with
   a few nodes can simply open a permanent connection between each
   pair of nodes, as is frequently done with ATM.  But the number of
   connections grows with the square of the number of nodes, so this
   is clearly impractical for large subnetworks.  A "shim" layer
   between IP and the subnetwork is therefore required to manage
   connections in the latter.

   These shim layers typically open subnetwork connections as needed
   when an IP packet is queued for transmission and close them after
   an idle timeout.  There is no relation between subnetwork
   connections and any connections that may exist at higher layers
   (e.g., TCP).

   Because Internet traffic is typically bursty and
   transaction-oriented, it is often difficult to pick an optimal idle
   timeout.  If the timeout is too short, subnetwork connections are
   opened and closed rapidly, possibly over-stressing the subnetwork's
   call management system (especially if it was designed for voice
   traffic holding times).  If the timeout is too long, subnetwork
   connections sit idle much of the time, wasting any resources
   dedicated to them by the subnetwork.
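   A minimal sketch of such a shim layer, under the idle-timeout
   policy just described.  The subnetwork calls (open_vc, send_on_vc,
   close_vc) are hypothetical placeholders, stubbed out here, for
   whatever connection API a real subnetwork offers.

   import time

   def open_vc(next_hop):        # placeholder: may be slow (call set-up)
       return ("vc", next_hop)

   def send_on_vc(vc, packet):   # placeholder: transmit on the connection
       pass

   def close_vc(vc):             # placeholder: connection tear-down
       pass

   IDLE_TIMEOUT = 60.0  # seconds; as noted above, hard to choose well
   connections = {}     # next-hop address -> (vc, time of last use)

   def send_packet(next_hop, packet):
       """Open a subnetwork connection on demand and send one packet."""
       entry = connections.get(next_hop)
       vc = entry[0] if entry else open_vc(next_hop)
       connections[next_hop] = (vc, time.monotonic())
       send_on_vc(vc, packet)

   def reap_idle_connections():
       """Run periodically: close connections idle past the timeout."""
       now = time.monotonic()
       for next_hop, (vc, last_used) in list(connections.items()):
           if now - last_used > IDLE_TIMEOUT:
               close_vc(vc)
               del connections[next_hop]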
   The ideal subnetwork for IP is connectionless.  Connection-oriented
   networks that dedicate minimal resources to each connection (e.g.,
   ATM) are a distant second, and connection-oriented networks that
   dedicate a fixed amount of bandwidth to each connection (e.g., the
   PSTN, including ISDN) are the least efficient.  If such subnetworks
   must be used to carry IP, their call-processing systems should be
   capable of rapid call set-up and tear-down.

Bandwidth on Demand (BoD) Subnets (Aaron Falk)

   Wireless networks, including both satellite and terrestrial, may
   use Bandwidth on Demand (BoD).  Bandwidth on demand, which is
   implemented at the link layer by Demand Assignment Multiple Access
   (DAMA) in TDMA systems, is currently one of the proposed mechanisms
   to efficiently share limited spectrum resources amongst a large
   number of users.

   The design parameters for BoD are similar to those in
   connection-oriented subnetworks; however, the implementations may
   be very different.  In BoD, the user typically requests access to
   the shared channel for some duration.  Access may be allocated in
   terms of a period of time at a specific rate, a certain number of
   packets, or until the user chooses to release the channel.  Access
   may be coordinated through a central management entity or through a
   distributed algorithm amongst the users.  The resource shared may
   be a terrestrial wireless hop, a satellite uplink, or an end-to-end
   satellite channel.

   Long-delay BoD subnets pose problems similar to those of
   connection-oriented networks in terms of anticipating traffic
   arrivals.  While connection-oriented subnets hold idle channels
   open expecting new data to arrive, BoD subnets request channel
   access based on buffer occupancy (or expected buffer occupancy) on
   the sending port.  Poor performance will likely result if the
   sender does not anticipate additional traffic arriving at that port
   during the time it takes to grant a transmission request.  It is
   recommended that the algorithm be able to extend a hold on the
   channel for data that has arrived after the original request was
   generated; this may be done by piggybacking new requests on user
   data, as sketched below.

   There are a wide variety of BoD protocols available, but there has
   been relatively little comprehensive research on the interactions
   between BoD mechanisms and Internet protocol performance.  A
   tradeoff exists between the time a user is allowed to hold a
   channel to drain port buffers and the additional latency imposed on
   other users who are forced to wait for access to the channel.  It
   is desirable to design mechanisms that constrain the BoD-imposed
   latency variation, as this helps prevent spurious TCP timeouts.
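   The sketch below illustrates the piggybacking idea only; it is not
   modeled on any specific DAMA protocol.  The sender sizes its
   initial request from buffer occupancy, then attaches a fresh
   request to every data burst so that late-arriving data can extend
   the hold without another signalling round trip.

   class BodSender:
       def __init__(self):
           self.buffer = []          # queued packets (byte strings)

       def enqueue(self, packet):
           self.buffer.append(packet)

       def backlog(self):
           return sum(len(p) for p in self.buffer)

       def initial_request(self):
           """Sent on the signalling channel to ask for capacity."""
           return {"request_bytes": self.backlog()}

       def transmit(self, grant_bytes):
           """Drain up to the grant; piggyback a request for new backlog."""
           burst = []
           while self.buffer and len(self.buffer[0]) <= grant_bytes:
               grant_bytes -= len(self.buffer[0])
               burst.append(self.buffer.pop(0))
           # The piggybacked request covers data that arrived after the
           # original request, extending the hold on the channel.
           return burst, {"request_bytes": self.backlog()}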
Reliability and Error Control

   In the Internet architecture, the ultimate responsibility for error
   recovery is at the end points.  The Internet may occasionally drop,
   corrupt, duplicate, or reorder packets, and the transport protocol
   (e.g., TCP) or application (e.g., if UDP is used) must recover from
   these errors on an end-to-end basis.  Error recovery in the
   subnetwork is therefore justified only to the extent that it can
   enhance overall performance.  It is important to recognize that a
   subnetwork can go too far in attempting to provide error recovery
   services in the Internet environment.  Subnet reliability should be
   "lightweight", i.e., it only has to be "good enough", *not*
   perfect.

   In this section we discuss how to analyze the characteristics of a
   subnetwork to determine what is "good enough".  The discussion
   below focuses on TCP, which is the most widely used transport
   protocol in the Internet.  It is widely believed (and is in fact a
   stated goal within the IETF community) that non-TCP transport
   protocols should attempt to be "TCP-friendly" and have many of the
   same performance characteristics.  Thus, the discussion below
   should be applicable even to portions of the Internet where TCP may
   not be the predominant protocol.

How TCP Works

   One of TCP's functions is end-host based congestion control for the
   Internet.  This is a critical part of the overall stability of the
   Internet, so it is important that link layer designers understand
   TCP's congestion control algorithms.

   TCP assumes that, at the most abstract level, the network consists
   of links and queues.  Queues provide output-buffering on links that
   are momentarily oversubscribed.  They smooth instantaneous traffic
   bursts to fit the link bandwidth.

   When demand exceeds link capacity long enough to fill the queue,
   packets must be dropped.  The traditional action of dropping the
   most recent packet ("tail dropping") is no longer recommended (see
   [RED93]), but it is still widely practiced.

   TCP uses sequence numbering and acknowledgements (ACKs) on an
   end-to-end basis to provide reliable, sequenced, once-only
   delivery.  TCP ACKs are cumulative, i.e., each one implicitly ACKs
   every segment received so far.  If a packet is lost, the cumulative
   ACK will cease to advance.

   Since the most common cause of packet loss is congestion, TCP
   treats packet loss as a network congestion indicator.  This happens
   automatically, and the subnetwork need not know anything about IP
   or TCP.  It simply drops packets whenever it must, though the RED
   work [RED93] shows that some packet-dropping strategies are more
   fair than others.

   TCP recovers from packet losses in two different ways.  The most
   important is the retransmission timeout.  If an ACK fails to arrive
   after a certain period of time, TCP retransmits the oldest unacked
   packet.  Taking this as a hint that the network is congested, TCP
   waits for the retransmission to be ACKed before it continues, and
   it gradually increases the number of packets in flight as long as a
   timeout does not occur again.

   A retransmission timeout can impose a significant performance
   penalty, as the sender is idle during the timeout interval and
   restarts with a congestion window of 1 following the timeout.  To
   allow faster recovery from the occasional lost packet in a bulk
   transfer, an alternate scheme known as "fast recovery" was
   introduced [Jac90, RFC2581].

   Fast recovery relies on the fact that when a single packet is lost
   in a bulk transfer, the receiver continues to return ACKs in
   response to subsequent data packets, but they do not actually ACK
   any new data.  These are known as "duplicate acknowledgments" or
   "dupacks".  The sending TCP can use dupacks as a hint that a packet
   has been lost, and it can retransmit it without waiting for a
   timeout.  Dupacks effectively constitute a negative acknowledgement
   (NAK) for the packet whose sequence number is equal to the
   acknowledgement field in the incoming TCP packet.
   TCP currently waits until a certain number of dupacks (currently 3)
   are seen prior to assuming a loss has occurred; this helps avoid an
   unnecessary retransmission in the face of out-of-sequence delivery.

   A new technique called "Explicit Congestion Notification" (ECN)
   allows routers to directly signal congestion to hosts without
   dropping packets.  This is done by setting a bit in the IP header.
   Since this is currently an optional behavior (and, longer term,
   there will always be the possibility of congestion in portions of
   the network which don't support ECN), the lack of an ECN bit MUST
   NEVER be interpreted as a lack of congestion.  Thus, for the
   foreseeable future, TCP MUST interpret a lost packet as a signal of
   congestion.

   The TCP "congestion avoidance" [RFC2581] algorithm is the
   end-system congestion control algorithm used by TCP.  This
   algorithm maintains a congestion window (cwnd), which controls the
   amount of data which TCP may have in flight at any given point in
   time.  Reducing cwnd reduces the overall bandwidth obtained by the
   connection; similarly, raising cwnd increases performance, up to
   the limit of the available bandwidth.

   TCP probes for available network bandwidth by setting cwnd at one
   packet and then increasing it by one packet for each ACK returned
   from the receiver.  This is TCP's "slow start" mechanism.  When a
   packet loss is detected (or congestion is signalled by other
   mechanisms), cwnd is set back to one and the slow start process is
   repeated until cwnd reaches one half of its setting before the
   loss.  Cwnd continues to increase past this point, but at a much
   slower rate than before.  If no further losses occur, cwnd will
   ultimately reach the window size advertised by the receiver.

   This is referred to as an "Additive Increase, Multiplicative
   Decrease" (AIMD) algorithm.  The steep decrease in response to
   congestion provides for network stability; the AIMD algorithm also
   provides for fairness between long running TCP connections sharing
   the same path.
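   The resulting sawtooth can be seen in a deliberately simplified
   model of these dynamics (fast retransmit/recovery omitted, losses
   supplied as external events, cwnd in packets, time in round trips):

   def cwnd_trace(rtts, losses, rwnd=64):
       """Return cwnd per RTT: slow start, timeout restart, AIMD."""
       cwnd, ssthresh, trace = 1, rwnd, []
       for t in range(rtts):
           trace.append(cwnd)
           if t in losses:
               ssthresh = max(cwnd // 2, 2)  # remember half the window
               cwnd = 1                      # restart with a window of one
           elif cwnd < ssthresh:
               cwnd = min(cwnd * 2, rwnd)    # slow start: ~doubles per RTT
           else:
               cwnd = min(cwnd + 1, rwnd)    # congestion avoidance: +1/RTT
       return trace

   print(cwnd_trace(16, losses={6}))
   # [1, 2, 4, 8, 16, 32, 64, 1, 2, 4, 8, 16, 32, 33, 34, 35]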
TCP Performance Characteristics

Caveat

   In this section, we present the current "state-of-the-art"
   understanding of TCP performance.  This analysis attempts to
   characterize the performance of TCP connections over links of
   varying characteristics.

   Link designers may wish to use the techniques in this section to
   predict what performance TCP/IP may achieve over a new link layer
   design.  Such analysis is encouraged.  Because this analysis is
   relatively new, and the theory is based on single-stream TCP
   connections under "ideal" conditions, it should be recognized that
   the results of such analysis may differ from actual performance in
   the Internet.  That being said, we have done our best to provide
   information which will help designers get an accurate picture of
   the capabilities and limitations of TCP under various conditions.

The Formulae

   The performance of TCP's AIMD Congestion Avoidance algorithm has
   been extensively analyzed.  The current best formula for the
   performance of the specific algorithms used by Reno TCP is given by
   Padhye, et al. [PFTK98].  This formula is:

                                   MSS
      BW = --------------------------------------------------------
           RTT*sqrt(1.33*p) + RTO*p*[1+32*p^2]*min[1,3*sqrt(.75*p)]

   In this formula, the variables are as follows:

      MSS is the segment size being used by the connection
      RTT is the end-to-end round trip time of the TCP connection
      RTO is the packet timeout (based on RTT)
      p   is the packet loss rate for the path
          (i.e., .01 if there is 1% packet loss)

   This is currently considered to be the best approximate formula for
   Reno TCP performance.  A further simplification of this formula is
   generally made by assuming that RTO is approximately 5*RTT.

   TCP is constantly being improved.  A simpler formula, which gives
   an upper bound on the performance of any AIMD algorithm which is
   likely to be implemented in TCP in the future, was derived by Ott,
   et al. [MSMO97] [OKM96]:

                MSS      1
      BW = 0.93 --- * -------
                RTT   sqrt(p)

Assumptions of these formulae

   Both of these formulae assume that the TCP receiver window is not
   limiting the performance of the connection in any way.  Because the
   receiver window is entirely determined by end-hosts, we assume that
   hosts will maximize the announced receiver window in order to
   maximize their network performance.

   Both of these formulae allow BW to become infinite if there is no
   loss.  This is because an Internet path will drop packets at
   bottleneck queues if the load is too high.  Thus, a completely
   lossless TCP/IP network can never occur (unless the network is
   being underutilized).

   The RTT used is the average RTT, including queuing delays.

   The formulae are calculations for a single TCP connection.  If a
   path carries many TCP connections, each will follow the formulae
   above independently.

   The formulae assume long-running TCP connections.  For connections
   which are extremely short (<10 packets) and don't lose any packets,
   performance is driven by the TCP slow start algorithm.  For
   connections of medium length, where on average only a few segments
   are lost, single connection performance will actually be slightly
   better than given by the formulae above.

   The difference between the simple and complex formulae above is
   that the complex formula includes the effects of TCP retransmission
   timeouts.  For very low levels of packet loss (significantly less
   than 1%), timeouts are unlikely to occur, and the formulae lead to
   very similar results.  At higher packet losses (1% and above), the
   complex formula gives a more accurate estimate of performance
   (which will always be significantly lower than the result from the
   simple formula).

   Note that these formulae break down as p approaches 100%.

Analysis of Link Layer Effects on TCP Performance

   Link layer designers who are interested in understanding the
   performance of TCP over their links can use these formulae to do
   so.  Consider the following example:

   A designer invents a new wireless link layer which, on average,
   loses 1% of IP packets.  The link layer supports packets of up to
   1040 bytes, and has a one-way delay of 20 msec.

   If this link layer were used in the Internet, on a path which
   otherwise had a round trip time of 80 msec, an upper bound on the
   performance could be computed as follows:

   For MSS, use 1000 bytes (remove the 40 bytes for TCP/IP headers,
   which do not contribute to performance).

   For RTT, use 120 msec (80 msec for the Internet part, plus 20 msec
   each way for the new wireless link).

   For p, use .01.  For the constant in the simple formula (0.93
   above), assume 1.

   The simple formula gives:

      BW = (1000 * 8 bits) / (.120 sec * sqrt(.01)) = 666 kbit/sec

   The more complex formula gives:

      BW = 402.9 kbit/sec

   If this were a 2 Mb/s wireless LAN, the designers might be somewhat
   disappointed.
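   Both formulae, and the worked example, are easy to reproduce in
   executable form.  The sketch below takes RTO as 5*RTT per the
   simplification above; since the complex formula is sensitive to the
   RTO assumption, its output should be read as an estimate of the
   same order as the 402.9 kbit/sec figure quoted above rather than an
   exact reproduction.

   from math import sqrt

   def bw_simple(mss_bytes, rtt_sec, p, c=0.93):
       """Upper bound for any AIMD TCP [MSMO97][OKM96], in bits/sec."""
       return c * mss_bytes * 8 / (rtt_sec * sqrt(p))

   def bw_reno(mss_bytes, rtt_sec, p, rto_sec=None):
       """Approximate Reno TCP throughput [PFTK98], in bits/sec."""
       rto = 5 * rtt_sec if rto_sec is None else rto_sec
       denom = (rtt_sec * sqrt(1.33 * p) +
                rto * p * (1 + 32 * p ** 2) * min(1, 3 * sqrt(0.75 * p)))
       return mss_bytes * 8 / denom

   # The worked example: MSS 1000 bytes, RTT 120 msec, 1% packet loss.
   print(bw_simple(1000, 0.120, 0.01, c=1))  # ~666,666 bit/s
   print(bw_reno(1000, 0.120, 0.01))         # a few hundred kbit/s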
   Some observations on performance:

   1. We have assumed that the packet losses on the link layer are
   interpreted as congestion by TCP.  This is a "fact of life" which
   must be accepted.

   2. Note that the equations for TCP performance are all expressed in
   terms of packet loss.  Many link-layer designers think in terms of
   bit-error rate.  *If* there were a uniform random distribution of
   errors, then the probability of a packet being corrupted would be:

      p = 1 - ([1 - BER]^[MSS * 8])

   (Here we assume MSS is expressed in bytes.)  If the inequality

      BER * MSS * 8 << 1

   holds, p can be approximated by:

      p = BER * MSS * 8

   These equations can be used to apply BER to the performance
   equations above.

   Note that links with Forward Error Correction (FEC) generally have
   very non-uniform bit error distributions, and the distribution is a
   strong function of the types and combinations of FEC algorithms
   used.  In such cases these equations cannot be used to apply BER to
   the performance equations above.  If the distribution of errors
   under the FEC scheme is known, one could apply the same type of
   analysis as above, using the correct distribution function for the
   BER.  It is more likely in these FEC cases, however, that empirical
   methods will be needed to determine the actual packet loss rate.
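   The uniform-error conversion in observation 2 is mechanical:

   # Packet loss probability under uniform random bit errors, and the
   # linear approximation that holds while BER * MSS * 8 << 1.
   def loss_from_ber(ber, mss_bytes):
       return 1 - (1 - ber) ** (mss_bytes * 8)

   print(loss_from_ber(1e-6, 1000))  # ~0.00797
   print(1e-6 * 1000 * 8)            # linear approximation: 0.008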
   3. Note that the packet size plays an important role.  Larger
   packet sizes allow improved performance at the same *packet loss*
   rate.  Assuming constant, uniform bit-errors (instead of packet
   errors), and assuming that the BER is small enough for the
   approximation [p = BER*MSS*8] to apply, a simple derivation shows
   that larger packet sizes still result in increased TCP performance.
   For this reason (and others) it is advisable to support larger
   packet sizes where possible.

   To derive this, simply plug p = BER*MSS*8 into the simple formula
   for performance.  The result is BW = O(sqrt(MSS)), i.e., higher
   performance for larger packet sizes.

   If the approximation p = BER*MSS*8 breaks down, and in particular
   if the BER is high enough that BER*MSS*8 approaches (or exceeds) 1,
   the packet loss rate p will tend to 100%, resulting in zero
   throughput.

   4. We have chosen a specific RTT which might occur on a wide-area
   Internet path within the USA.  In the Internet, it is important to
   recognize that RTT varies considerably.

   For example, in a wired LAN environment, RTTs are typically less
   than 10 msec.  International connections (between hosts in
   different countries) may have RTTs of 200 msec or more.  Modems and
   other low-capacity links can add considerable delay to the overall
   RTTs experienced by the end hosts due to their long packet
   transmission times.

   Links running over geostationary repeater satellites have one-way
   times of around 250 ms (125 ms up to the satellite, 125 ms down),
   so the RTT of an end-to-end TCP connection that includes such a
   link can be expected to be greater than 250 ms.

   Heavily congested links may have queues which back up, increasing
   RTTs.  Finally, VPNs and other forms of encryption and tunneling
   can add significant end-to-end delay to network connections.

   Increased delay decreases the overall performance of TCP at a given
   loss rate.  A good rule of thumb is to recognize that nothing can
   be done about the laws of physics, so the propagation delay cannot
   be changed.  Many link layer designers are likely to face the
   following tradeoff: adding delay (through FEC, ARQ, or other
   methods) in order to reduce the probability of packet loss.
   Increasing the delay somewhat in order to decrease packet loss is
   probably a worthwhile investment: up to a doubling of the delay,
   or, in the case of very low delay links, an extra 10-20 msec, won't
   have much effect on a typical Internet path.

Quality of Service, Fairness vs Performance, Congestion signalling

   [subnet hooks for QOS bits]

Delay Characteristics

   [self clocking TCP, (re)transmission shaping]

Bandwidth Asymmetries

   Some subnetworks may provide asymmetric bandwidth, and the Internet
   protocol suite will generally still work fine.  However, there is a
   case where such a scenario reduces TCP performance.  Since TCP data
   segments are "clocked" out by returning acknowledgments, TCP
   senders are limited by the rate at which ACKs can be returned
   [BPK98].  Therefore, when the ratio of the bandwidth of the
   subnetwork carrying the data to the bandwidth of the subnetwork
   carrying the acknowledgments is too large, the slow return of the
   ACKs directly impacts performance, as the sketch at the end of this
   section illustrates.  Since ACKs are generally smaller than data
   segments, TCP can tolerate some asymmetry, but as a general rule
   designers of subnetworks should avoid large differences in the
   incoming and outgoing bandwidth.

   One way to cope with asymmetric subnetworks is to increase the size
   of the data segments as much as possible.  This allows more data to
   be sent per ACK, and therefore mitigates the slow flow of ACKs.
   Using the delayed acknowledgment mechanism [Bra89], which reduces
   the number of ACKs transmitted by the receiver by roughly half, can
   also improve performance by reducing congestion on the ACK channel.
   These mechanisms should be employed in asymmetric networks.

   Several researchers have introduced strategies for coping with
   bandwidth asymmetry.  These mechanisms generally attempt to reduce
   the number of ACKs being transmitted over the low bandwidth channel
   by limiting the ACK frequency or by filtering out ACKs at an
   intermediate router [BPK98].  While these solutions mitigate the
   performance problems caused by asymmetric subnetworks, they do have
   some cost; therefore, as suggested above, bandwidth asymmetry
   should be minimized whenever possible when designing subnetworks.
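   A back-of-the-envelope bound shows how severe the ACK-clocking
   limit can be.  The 40-byte ACK size assumes no header compression
   on the return path, and one ACK per two data segments models
   delayed acknowledgments [Bra89]; both are illustrative assumptions.

   def max_forward_bps(reverse_bps, mss_bytes, ack_bytes=40,
                       acked_per_ack=2):
       """Forward data rate that the returning ACK stream can clock out."""
       acks_per_sec = reverse_bps / (ack_bytes * 8.0)
       return acks_per_sec * acked_per_ack * mss_bytes * 8

   # A fast downlink paired with a 9600 bit/s return channel:
   print(max_forward_bps(9600, 1460))  # ~700 kbit/s, no matter how
                                       # fast the forward link is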
Buffering, flow & congestion control

   [atm dropping individual cells in a packet means the entire packet
   must be dropped unless EPD/PPD is used]

Compression

   User data compression is a function that can usually be omitted at
   the subnetwork layer.  The endpoints typically have more CPU and
   memory resources to run a compression algorithm and a better
   understanding of what is being compressed.  End-to-end compression
   benefits every network element in the path, while subnetwork-layer
   compression, by definition, benefits only a single subnetwork.

   Data presented to the subnetwork layer may already be in a
   compressed format (e.g., a JPEG file), compressed at the
   application layer (e.g., the optional "gzip", "compress", and
   "deflate" compression in HTTP/1.1 [RFC2616]), or compressed at the
   IP layer (the IP Payload Compression Protocol [RFC2393] supports
   DEFLATE [RFC2394] and LZS [RFC2395]).  In any of these cases,
   compression in the subnetwork is of no benefit.

   The subnetwork may also process data that has been encrypted at the
   application protocol layer (OpenPGP [RFC2440] or S/MIME
   [RFCs-2630-2634]), the transport layer (SSL, TLS [RFC2246]), or the
   IP layer (IPSEC ESP [RFC2406]).  Ciphers generate random-looking
   bit streams lacking any patterns that can be exploited by a
   compression algorithm.

   If a subnetwork decides to implement user data compression, it must
   detect when the data is encrypted or already compressed and
   transmit it without further compression.  This is important because
   most compression algorithms increase the size of encrypted data or
   data that has already been compressed.

   In contrast to user data compression, subnetworks that operate at
   low speed or with small packet size limits are encouraged to
   compress IP and transport-level headers (TCP and UDP).  An
   uncompressed 40-byte TCP/IP header takes about 33 milliseconds to
   send at 9600 bps.  "VJ" TCP/IP header compression [RFC1144]
   compresses most headers to 3-5 bytes, reducing transmission time to
   several milliseconds.  This is especially beneficial for small,
   latency-sensitive packets, such as those in interactive sessions.

   Designers should consider the effect of the subnetwork error rate
   on performance when considering header compression.  TCP ordinarily
   recovers from lost packets by retransmitting only those packets
   that were actually lost; packets arriving correctly after a packet
   loss are kept on a resequencing queue and do not need to be
   retransmitted.  In VJ TCP/IP header compression [RFC1144], however,
   the receiver cannot explicitly notify a sender about data
   corruption and the subsequent loss of synchronization between
   compressor and decompressor.  It relies instead on TCP
   retransmission to resynchronize the decompressor.  After a packet
   is lost, the decompressor must discard every subsequent packet,
   even if the subnetwork makes no further errors, until the sending
   TCP retransmits to resynchronize the decompressor.  This effect can
   substantially magnify the effect of subnetwork packet losses if the
   sending TCP window is large, as it will often be on a path with a
   large bandwidth*delay product.

   Alternative header compression schemes, such as those described in
   [RFC2507], include an explicit request for retransmission of an
   uncompressed packet to allow decompressor resynchronization without
   waiting for a TCP retransmission.  However, these schemes are not
   yet in widespread use.
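   The latency benefit for interactive traffic is easy to quantify;
   taking 4 bytes as a representative compressed header in the 3-5
   byte range mentioned above:

   # Time to send a one-byte keystroke packet at 9600 bps.
   def keystroke_ms(header_bytes, payload_bytes=1, link_bps=9600):
       return (header_bytes + payload_bytes) * 8.0 / link_bps * 1000

   print(keystroke_ms(40))  # uncompressed TCP/IP header: ~34 ms
   print(keystroke_ms(4))   # VJ-compressed header: ~4 ms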
Packet Reordering

   The Internet architecture does not guarantee that packets will
   arrive in the same order in which they were originally transmitted,
   and transport protocols like TCP must take this into account.
   However, we recommend that subnetworks not gratuitously deliver
   packets out of sequence.  Since TCP returns a cumulative
   acknowledgment (ACK) indicating the last in-order segment that has
   arrived, out-of-order segments cause a TCP receiver to transmit a
   duplicate acknowledgment.  When the TCP sender notices three
   duplicate acknowledgments, it assumes that a segment was dropped by
   the network and uses the fast retransmit algorithm [Jac90, APS99]
   to resend the segment.  In addition, the congestion window is
   reduced by half, effectively halving TCP's sending rate.  If a
   subnetwork re-orders segments badly enough that three duplicate
   ACKs are generated, the TCP sender needlessly reduces the
   congestion window, and therefore performance.

Mobility

   [best provided at a higher layer, for performance and flexibility
   reasons, but some subnet mobility can be a convenience as long as
   it's not too inefficient with routing]

Multicasting

   As with broadcast and discovery, multicast is more efficient on
   shared links where it is supported natively.  Native multicast
   support requires a reasonable number (?? - over 10, under 1000?) of
   separate link-layer broadcast addresses.  One such address SHOULD
   be reserved for native link broadcast; other addresses SHOULD be
   provided to support separate multicast groups (and there SHOULD be
   at least 10?? such addresses).

   The other criterion for native multicast is a link-layer filter,
   which can select individual broadcast addresses or sets of them.
   Such link filters avoid having every host parse every multicast
   message in the driver; a host receives, at the network layer, only
   those packets that pass its configured link filters.  A shared link
   SHOULD support multiple, programmable link filters, to support
   efficient native multicast.

   [Multicasting can be simulated over unicast subnets by sending
   multiple copies of packets, but this is wasteful.  If the subnet
   can support native multicasting in an efficient way, it should do
   so]

Broadcasting and Discovery

   Link layers fall into two categories: point-to-point and shared
   link.  A point-to-point link has exactly two endpoint components
   (hosts or gateways); a shared link has more than two, either on an
   inherently broadcast medium (e.g., Ethernet, radio) or on a
   switching layer hidden from the network layer (switched Ethernet,
   Myrinet, ATM).

   There are a number of Internet protocols which make use of link
   layer broadcast capabilities.  These include link layer address
   lookup (ARP), auto-configuration (RARP, BOOTP, DHCP), and routing
   (RIP).  These protocols require broadcast-capable links.  Shared
   links SHOULD support native, link layer subnet broadcast.

   The lack of broadcast can impede the performance of these
   protocols, or in some cases render them inoperable.  ARP-like link
   address lookup can be provided by a centralized database, rather
   than by owner response to broadcast queries.  This comes at the
   expense of potentially higher response latency and the need for
   explicit knowledge of the ARP server address (no automatic ARP
   discovery).

   For other protocols, if a link does not support broadcast, the
   protocol is inoperable.  This is the case for DHCP, for example.

Routing

   [what is the proper division between routing at the Internet layer
   and routing in the subnet? Is it useful or helpful to Internet
   routing to have subnetworks that provide their own internal
   routing?]

Security

   [Security mechanisms should be placed as close as possible to the
   entities that they protect.
   E.g., mechanisms that protect host computers or users should be
   implemented at the higher layers and operate on an end-to-end basis
   under control of the users.  This makes subnet security mechanisms
   largely redundant unless they are to protect the subnet itself,
   e.g., against unauthorized use.]

References

   References of the form RFCnnnn are Internet Request for Comments
   (RFC) documents available online at www.rfc-editor.org.

   [APS99] M. Allman, V. Paxson, W. R. Stevens. "TCP Congestion
   Control". RFC 2581, April 1999.

   [BPK98] H. Balakrishnan, V. Padmanabhan, R. H. Katz. "The Effects
   of Asymmetry on TCP Performance". ACM Mobile Networks and
   Applications (MONET), 1998.

   [Bra89] R. Braden, Ed. "Requirements for Internet Hosts --
   Communication Layers". RFC 1122, October 1989.

   [Jac90] V. Jacobson. "Modified TCP Congestion Avoidance Algorithm".
   Email to the end2end-interest mailing list, April 1990. URL:
   ftp://ftp.ee.lbl.gov/email/vanj.90apr30.txt.

   [SRC81] J. H. Saltzer, D. P. Reed, D. D. Clark. "End-to-End
   Arguments in System Design". Second International Conference on
   Distributed Computing Systems (April 1981), pages 509-512.
   Published with minor changes in ACM Transactions on Computer
   Systems 2, 4, November 1984, pages 277-288. Reprinted in Craig
   Partridge, editor, "Innovations in Internetworking", Artech House,
   Norwood, MA, 1988, pages 195-206, ISBN 0-89006-337-0. Also
   scheduled to be reprinted in Amit Bhargava, editor, "Integrated
   Broadband Networks", Artech House, Boston, 1991, ISBN
   0-89006-483-0. http://people.qualcomm.com/karn/library.html.

   [RFC791] J. Postel. "Internet Protocol". September 1981.

   [RFC1144] V. Jacobson. "Compressing TCP/IP Headers for Low-Speed
   Serial Links". February 1990.

   [RFC1191] J. Mogul, S. Deering. "Path MTU Discovery". November
   1990.

   [RFC1435] S. Knowles. "IESG Advice from Experience with Path MTU
   Discovery". March 1993.

   [RFC1577] M. Laubach. "Classical IP and ARP over ATM". January
   1994.

   [RFC1661] W. Simpson. "The Point-to-Point Protocol (PPP)". July
   1994.

   [RFC1981] J. McCann, S. Deering, J. Mogul. "Path MTU Discovery for
   IP version 6". August 1996.

   [RFC2364] G. Gross et al. "PPP Over AAL5". July 1998.

   [RFC2393] A. Shacham et al. "IP Payload Compression Protocol
   (IPComp)". December 1998.

   [RFC2394] R. Pereira. "IP Payload Compression Using DEFLATE".
   December 1998.

   [RFC2395] R. Friend, R. Monsour. "IP Payload Compression Using
   LZS". December 1998.

   [RFC2440] J. Callas et al. "OpenPGP Message Format". November 1998.

   [RFC2246] T. Dierks, C. Allen. "The TLS Protocol Version 1.0".
   January 1999.

   [RFC2507] M. Degermark, B. Nordgren, S. Pink. "IP Header
   Compression". February 1999.

   [RFC2508] S. Casner, V. Jacobson. "Compressing IP/UDP/RTP Headers
   for Low-Speed Serial Links". February 1999.

   [RFC2581] M. Allman, V. Paxson, W. Stevens. "TCP Congestion
   Control". April 1999.

   [RFC2406] S. Kent, R. Atkinson. "IP Encapsulating Security Payload
   (ESP)". November 1998.

   [RFC2616] R. Fielding et al. "Hypertext Transfer Protocol --
   HTTP/1.1". June 1999.

   [RFC2684] D. Grossman, J. Heinanen. "Multiprotocol Encapsulation
   over ATM Adaptation Layer 5". September 1999.

   [PFTK98] J. Padhye, V. Firoiu, D. Towsley, J. Kurose. "Modeling TCP
   Throughput: A Simple Model and its Empirical Validation". UMASS
   CMPSCI Tech Report TR98-008, February 1998.

   [MSMO97] M. Mathis, J. Semke, J. Mahdavi, T. Ott. "The Macroscopic
   Behavior of the TCP Congestion Avoidance Algorithm". Computer
   Communication Review, volume 27, number 3, July 1997.
Ott, "The Macroscopic 949 Behavior of the TCP Congestion Avoidance Algorithm",Computer 950 Communication Review, volume 27, number 3, July 1997. 952 [OKM96] T. Ott, J.H.B. Kemperman, M. Mathis, The Stationary Behavior 953 of Ideal TCP Congestion Avoidance. 954 ftp://ftp.bellcore.com/pub/tjo/TCPwindow.ps 956 [RED93] S. Floyd, V. Jacobson, "Random Early Detection gateways for 957 Congestion Avoidance", IEEE/ACM Transactions in Networking, V.1 N.4, 958 August 1993, http://www.aciri.org/floyd/papers/red/red.html 960 [Stevens94] R. Stevens, "TCP/IP Illustrated, Volume 1," Addison- 961 Wesley, 1994 (section 2.10). 963 Security Considerations 965 [comment here] 967 Authors' Addresses: 969 Phil Karn (karn@qualcomm.com) 970 Aaron Falk (afalk@panamsat.com) 971 Joe Touch (touch@isi.edu) 972 Marie-Jose Montpetit (marie@teledesic.com) 973 Jamshid Mahdavi (mahdavi@novell.com) 974 Gabriel Montenegro (Gabriel.Montenegro@eng.sun.com)