Network Working Group                                          S. Bryant
Internet-Draft                                                  B. Davie
Intended status: Standards Track                              L. Martini
Expires: April 18, 2007                                         E. Rosen
                                                     Cisco Systems, Inc.
                                                        October 15, 2006

               Pseudowire Congestion Control Framework
                  draft-rosen-pwe3-congestion-04.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 18, 2007.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   Given that pseudowires may be used to carry non-TCP data flows, it is
   necessary to provide pseudowire-specific congestion control
   procedures.  These procedures should ensure that pseudowire traffic
   is "TCP-compatible", as defined in RFC 2914.  This document attempts
   to lay out the issues which must be considered when defining such
   procedures.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Pseudowires and Congestion in IP Networks  . . . . . . . .  3
     1.2.  Arguments Against PW Congestion as a Practical Problem . .  4
     1.3.  Goals of PW-specific Congestion Control  . . . . . . . . .  6
     1.4.  Challenges for PW Congestion . . . . . . . . . . . . . . .  7
       1.4.1.  Scale  . . . . . . . . . . . . . . . . . . . . . . . .  7
       1.4.2.  Interaction among control loops  . . . . . . . . . . .  8
       1.4.3.  Constant Bit Rate PWs  . . . . . . . . . . . . . . . .  8
   2.  Detecting Congestion . . . . . . . . . . . . . . . . . . . . .  9
     2.1.  Using Sequence Numbers to Detect Congestion  . . . . . . . 10
     2.2.  Using VCCV to Detect Congestion  . . . . . . . . . . . . . 11
     2.3.  Explicit Congestion Notification . . . . . . . . . . . . . 12
   3.  Feedback from Receiver to Transmitter  . . . . . . . . . . . . 13
     3.1.  Control Plane Feedback . . . . . . . . . . . . . . . . . . 13
     3.2.  Using Reverse Data Packets for Feedback  . . . . . . . . . 14
     3.3.  Reverse VCCV Traffic . . . . . . . . . . . . . . . . . . . 14
   4.  Responding to Congestion . . . . . . . . . . . . . . . . . . . 15
     4.1.  Interaction with TCP . . . . . . . . . . . . . . . . . . . 16
   5.  Rate Control per Tunnel vs. per PW . . . . . . . . . . . . . . 16
   6.  Constant Bit Rate Services . . . . . . . . . . . . . . . . . . 17
   7.  Mandatory vs. Optional . . . . . . . . . . . . . . . . . . . . 17
   8.  Related Work: Pre-Congestion Notification  . . . . . . . . . . 18
   9.  Informative References . . . . . . . . . . . . . . . . . . . . 18
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20
   Intellectual Property and Copyright Statements . . . . . . . . . . 21

1.  Introduction

1.1.  Pseudowires and Congestion in IP Networks

   Congestion in an IP network occurs when the amount of traffic that
   needs to use a particular network resource exceeds the capacity of
   that resource.  This results first in long queues within the network,
   and then in packet loss.  If the amount of traffic is not then
   reduced, the packet loss rate will climb, potentially until it
   reaches 100%.

   To prevent this sort of "congestive collapse", there must be
   congestion control: a feedback loop by which the presence of
   congestion somewhere in the network forces the transmitters to reduce
   the amount of traffic being sent.  As a connectionless protocol, IP
   has no way to push back directly on the originator of the traffic.
   Procedures for (a) detecting congestion, (b) providing the necessary
   feedback to the transmitters, and (c) adjusting the transmission
   rates, are thus left to higher protocol layers such as TCP.

   The vast majority of traffic in IP networks is currently TCP traffic.
   TCP includes an elaborate congestion control mechanism which causes
   the end systems to reduce their transmission rates when congestion
   occurs.  For those readers not intimately familiar with the details
   of TCP congestion control, we give below a brief summary, greatly
   simplified and not entirely accurate, of TCP's very complicated
   feedback mechanism.  The details of TCP congestion control can be
   found in [RFC2581].  [RFC2001] is an earlier but more accessible
   discussion.  [RFC2914] articulates a number of general principles
   governing congestion control in the Internet.

   In TCP congestion control, a lost packet is considered to be an
   indication of congestion.  Roughly, TCP considers a given packet to
   be lost if that packet is not acknowledged within a specified time,
   or if three subsequent packets arrive at the receiver before the
   given packet.  The latter condition manifests itself at the
   transmitter as the arrival of three duplicate acks in a row.
   The algorithm by which TCP detects congestion is thus highly
   dependent on the mechanisms used by TCP to ensure reliable and
   sequential delivery.

   Once a TCP transmitter becomes aware of congestion, it halves its
   transmission rate.  If congestion still occurs at the new rate, the
   rate is halved again.  When a rate is found at which congestion no
   longer occurs, the rate is increased by one MSS ("Maximum Segment
   Size") per RTT ("Round Trip Time").  The rate is increased each RTT
   until congestion is encountered again, or until something else limits
   it (e.g., the flow control window is exhausted, the application is
   transmitting at its maximum desired rate, or the line rate is
   reached).

   This sort of mechanism is known as an "Additive Increase,
   Multiplicative Decrease" (AIMD) mechanism.  Congestion causes
   relatively rapid decreases in the transmission rate, while the
   absence of congestion causes relatively slow increases in the allowed
   transmission rate.
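   As a purely illustrative sketch of the AIMD behavior just described
   (not part of any proposed mechanism), the following Python fragment
   adjusts a rate expressed in segments per RTT.  The class name,
   variable names, and initial values are invented for exposition.

      # Illustrative AIMD rate adjustment, in segments per RTT.
      # All names and constants are hypothetical.

      class AimdRate:
          def __init__(self, initial_rate=10.0, max_rate=1000.0):
              self.rate = initial_rate    # segments per RTT
              self.max_rate = max_rate    # e.g., window or line-rate limit

          def on_rtt_elapsed(self, congestion_detected):
              if congestion_detected:
                  # Multiplicative decrease: halve the rate.
                  self.rate = max(self.rate / 2.0, 1.0)
              else:
                  # Additive increase: one more segment (MSS) per RTT,
                  # unless something else limits the rate.
                  self.rate = min(self.rate + 1.0, self.max_rate)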
   As noted above, traffic in IP networks is currently predominantly TCP
   traffic.  Even the layer 2 tunneled traffic (e.g., PPP frames
   tunneled through L2TP) is predominantly TCP traffic from the end-
   users.  If pseudowires (PWs) [RFC3985] were to be used only for
   carrying TCP flows, there would be no need for any PW-specific
   congestion mechanisms.  The existing TCP congestion control
   mechanisms would be all that is needed, since any loss of packets on
   the PW would be detected as loss of packets on a TCP connection, and
   the TCP congestion control mechanisms would ensure a reduction of
   transmission rate.  However, if a PW is carrying non-TCP traffic,
   then there is no feedback mechanism to cause the end-systems to
   reduce their transmission rates in response to congestion.  When
   congestion occurs, any TCP traffic that is sharing the congested
   resource with the non-TCP traffic will be throttled, and the non-TCP
   traffic may "starve" the TCP traffic.  If there is enough non-TCP
   traffic to congest the network all by itself, there is nothing to
   prevent congestive collapse.

   The non-TCP traffic in a PW can belong to any higher layer
   whatsoever, and there is no way to ensure that TCP-like congestion
   control mechanisms will be used by all those layers.  Hence it
   appears that there is a need for an edge-to-edge (i.e., PE-to-PE)
   feedback mechanism which forces a transmitting PE to reduce its
   transmission rate in the face of network congestion.

   As TCP uses window-based flow control, controlling the rate is really
   a matter of limiting the amount of traffic which can be "in flight"
   (i.e., transmitted but not yet acknowledged) at any one time.
   Obviously a different technique needs to be used to control the
   transmission rate of the non-windowed protocols used for transmitting
   data on PWs.

1.2.  Arguments Against PW Congestion as a Practical Problem

   One may argue that congestion due to non-TCP PW traffic is only a
   theoretical problem.

   o  "99.9% of all the traffic in PWs is really IP traffic"

      If this is the case, then the traffic is either TCP traffic, which
      is already congestion-controlled, or "other" IP traffic.  While
      the congestion control issue may exist for the "other" IP traffic,
      it is a general issue which is not specific to PWs.

      Unfortunately, we cannot be sure that this is the case.  It may
      well be the case for the PW offerings of certain providers, but
      perhaps not for others.  It does appear that many providers want
      to be able to use PWs for transporting "legacy traffic" of various
      non-IP protocols.  Constant bit-rate services are an example of
      this, and raise particular issues for congestion control
      (discussed below).

   o  "PW traffic usually stays within one SP's network, and an SP
      always engineers its network carefully enough so that congestion
      is an impossibility"

      Perhaps this will be true of "most" PWs, but inter-provider PWs
      are certainly expected to have a significant presence.

      Even within a single provider's network, the provider might
      consider whether he is so confident of his network engineering
      that he does not need a feedback loop to reduce the transmission
      rate in response to congestion.

      There is also the issue of keeping the network running (i.e., out
      of congestive collapse) after an unexpected reduction of capacity.

   o  "If one provider accepts PW traffic from another, policing will be
      done at the entry point to the second provider's network, so that
      the second provider is sure that the first provider is not sending
      too much traffic.  This policing, together with the second
      provider's careful network engineering, makes congestion an
      impossibility"

      This could be the case given carefully controlled bilateral
      peering arrangements.  Note though that if the second provider is
      merely providing transit services for a PW whose endpoints are in
      other providers' networks, it may be difficult for the transit
      provider to tell which traffic is the PW traffic and which is
      "ordinary" IP traffic.

   o  "The only time we really need a general congestion control
      mechanism is when traffic goes through the public Internet.
      Obviously this will never be the case for PW traffic."

      It is not at all difficult to imagine someone using an IPsec
      tunnel across the public Internet to transport a PW from one
      private IP network to another.

      Nor is it difficult to imagine some enterprise implementing a PW
      and transporting it across some SP's backbone, e.g., if that SP is
      providing VPN service to that enterprise.

   The arguments that non-TCP traffic in PWs will never make any
   significant contribution to congestion thus do not seem to be totally
   compelling.

1.3.  Goals of PW-specific Congestion Control

   [RFC2914] defines the notion of a "TCP-compatible flow":

   "A TCP-compatible flow is responsive to congestion notification, and
   in steady-state uses no more bandwidth than a conformant TCP running
   under comparable conditions (drop rate, RTT [round trip time], MTU
   [maximum transmission unit], etc.)"

   TCP-compatible flows respond to congestion in much the way TCP does,
   so that they do not starve the TCP flows or otherwise obtain an
   unfair advantage.  [RFC2914] further points out:

   "any form of congestion control that successfully avoids a high
   sending rate in the presence of a high packet drop rate should be
   sufficient to avoid congestion collapse from undelivered packets."

   "This does not mean, however, that concerns about congestion collapse
   and fairness with TCP necessitate that all best-effort traffic deploy
   congestion control based on TCP's Additive-Increase Multiplicative-
   Decrease (AIMD) algorithm of reducing the sending rate in half in
   response to each packet drop."
259 "However, the list of TCP-compatible congestion control procedures is 260 not limited to AIMD with the same increase/ decrease parameters as 261 TCP. Other TCP-compatible congestion control procedures include 262 rate-based variants of AIMD; AIMD with different sets of increase/ 263 decrease parameters that give the same steady-state behavior; 264 equation-based congestion control where the sender adjusts its 265 sending rate in response to information about the long-term packet 266 drop rate ... and possibly other forms that we have not yet begun to 267 consider." 269 The AIMD procedures are not mandated for non-TCP traffic, and might 270 not be optimal for non-TCP PW traffic. Choosing a proper set of 271 procedures which are TCP-compatible while being optimized for a 272 particular type of traffic is no simple task. [RFC3448], "TCP 273 Friendly Rate Control (TFRC)" provides an alternative: 275 "TFRC is designed to be reasonably fair when competing for bandwidth 276 with TCP flows, where a flow is "reasonably fair" if its sending rate 277 is generally within a factor of two of the sending rate of a TCP flow 278 under the same conditions. However, TFRC has a much lower variation 279 of throughput over time compared with TCP, which makes it more 280 suitable for applications such as telephony or streaming media where 281 a relatively smooth sending rate is of importance." 283 "For its congestion control mechanism, TFRC directly uses a 284 throughput equation for the allowed sending rate as a function of the 285 loss event rate and round-trip time. In order to compete fairly with 286 TCP, TFRC uses the TCP throughput equation, which roughly describes 287 TCP's sending rate as a function of the loss event rate, round-trip 288 time, and packet size." 290 "Generally speaking, TFRC's congestion control mechanism works as 291 follows: 293 o The receiver measures the loss event rate and feeds this 294 information back to the sender. 296 o The sender also uses these feedback messages to measure the round- 297 trip time (RTT). 299 o The loss event rate and RTT are then fed into TFRC's throughput 300 equation, giving the acceptable transmit rate. 302 o The sender then adjusts its transmit rate to match the calculated 303 rate." 305 Note that the TFRC procedures require the transmitter to calculate a 306 throughput equation. For these procedures to be feasible as a means 307 of PW congestion control, they must be computationally efficient. 308 Section 8 of [RFC3448] describes an implementation technique that 309 appears to make it efficient to calculate the equation. It is not 310 clear whether this is the case; this is an area for further 311 consideration. 313 1.4. Challenges for PW Congestion 315 1.4.1. Scale 317 It might appear at first glance that an easy solution to PW 318 congestion control would be to run the PWs through a TCP connection. 319 This would provide congestion control automatically. However, the 320 overhead is prohibitive for the PW application. The PWE3 data plane 321 may be implemented in a microcoded hardware engine which needs to 322 support thousands of PWs, and needs to do as little as possible for 323 each data packet; running a TCP state machine, and implementing TCP's 324 flow control procedures, would impose too high a cost in this 325 environment. Nor do we want to add the large overhead of TCP to the 326 PWs -- the large headers, the plethora of small acks in the reverse 327 direction, etc., etc. In fact, we want to avoid acknowledgments 328 altogether. 
1.4.  Challenges for PW Congestion

1.4.1.  Scale

   It might appear at first glance that an easy solution to PW
   congestion control would be to run the PWs through a TCP connection.
   This would provide congestion control automatically.  However, the
   overhead is prohibitive for the PW application.  The PWE3 data plane
   may be implemented in a microcoded hardware engine which needs to
   support thousands of PWs, and needs to do as little as possible for
   each data packet; running a TCP state machine, and implementing TCP's
   flow control procedures, would impose too high a cost in this
   environment.  Nor do we want to add the large overhead of TCP to the
   PWs -- the large headers, the plethora of small acks in the reverse
   direction, and so on.  In fact, we want to avoid acknowledgments
   altogether.  These same considerations lead us away from using, e.g.,
   DCCP [RFC4340].  Therefore we will investigate some PW-specific
   solutions for congestion control.

   We also want to minimize the amount of interaction between the data
   processing path (which is likely to be distributed among a set of
   line cards) and the control path; we need to be especially careful of
   interactions which might require atomic read/modify/write operations
   from the control path, or which might require atomic read/modify/
   write operations between different processors in a multiprocessing
   implementation, as such interactions can cause scaling problems.

   Thus, feasible solutions for PW-specific congestion control will
   require scalable means to detect congestion and to reduce the amount
   of traffic sent into the network when congestion is detected.  These
   topics are discussed in more detail in subsequent sections.

1.4.2.  Interaction among control loops

   As noted above, much of the traffic that is carried on PWs is likely
   to be TCP traffic, and will therefore be subject to the congestion
   control mechanisms of TCP.  It will typically be difficult for a PW
   endpoint to tell whether or not this is the case.  Thus there is a
   risk that the PE-to-PE congestion control mechanisms applied over the
   PW may interact in undesirable ways with the end-to-end congestion
   control mechanisms of TCP.  The PW-specific congestion control
   mechanisms should be designed to minimize the negative impact of such
   interaction.

1.4.3.  Constant Bit Rate PWs

   Some types of PW, for example SAToP (Structure-Agnostic TDM over
   Packet) [RFC4553], CESoPSN (Circuit Emulation Service over Packet
   Switched Network) [I-D.ietf-pwe3-cesopsn], TDM over IP
   [I-D.ietf-pwe3-tdmoip], SONET/SDH [I-D.ietf-pwe3-sonet], and constant
   bit-rate ATM PWs, represent an inelastic constant bit-rate (CBR)
   flow.  Such PWs cannot respond to congestion in the TCP-friendly
   manner prescribed by [RFC2914]; the total amount of bandwidth
   consumed by such a PW remains constant.  AIMD or even the more
   gradual TFRC techniques are clearly not applicable to such services;
   it is not feasible to reduce the rate of a CBR service without
   violating the service definition.  Such services are also frequently
   more sensitive to packet loss than connectionless packet PWs.  Given
   that CBR services are not greedy (in the sense of trying to increase
   their share of a link, as TCP does), there may be a case for allowing
   them greater latitude during congestion peaks.  However, if some CBR
   PWs are not able to endure any significant packet loss or reduction
   in rate without compromising the transported service, such PWs must
   be shut down when the level of congestion becomes excessive.  At
   suitably low levels of congestion they may be allowed to continue to
   offer traffic to the network.

   Some CBR services may be carried over connectionless packet PWs.  An
   example of such a case would be a CBR MPEG-2 video stream carried
   over an Ethernet PW.  One could argue that such a service - provided
   the rate was policed at the ingress PE - should be offered the same
   latitude as a PW that explicitly provides a CBR service.  Likewise,
   there may not be much value in trying to throttle such a service
   rather than cutting it off completely during severe congestion.
   However, this clearly raises the issue of how to know that a PW is
   indeed carrying a CBR service.
2.  Detecting Congestion

   In TCP, congestion is detected by the transmitter; the receipt of
   three successive duplicate TCP acks is taken to be indicative of
   congestion.  What this actually means is that several packets in a
   row were received at the remote end, none of which had the next
   expected sequence number.  This is interpreted as meaning that the
   packet with the next expected sequence number was lost in the
   network, and the loss of a single packet in the network is taken as a
   sign of congestion.  (Naturally, the presence of congestion is also
   inferred if TCP has to retransmit a packet.)  Note that it is
   possible for mis-ordered packets to be misinterpreted as lost
   packets, if they do not arrive "soon enough".

   In TCP, a time-out while awaiting an ack is also interpreted as a
   sign of congestion.

   Since there are no acknowledgments on a PW, the PW-specific
   congestion control mechanism obviously cannot be based on either the
   presence or the absence of acknowledgments.  Some types of pseudowire
   (the CBR PWs) have a single bit that indicates that a preset amount
   of data has been lost, but this is a non-quantitative indicator.  CBR
   PWs have the advantage that there is a constant two-way data flow,
   while other PW types do not have the constant symmetric flow of
   payload on which to piggyback the congestion notification.  Most PW
   types therefore provide no way for a transmitter to determine (or
   even to make an educated guess as to) whether any data has been lost.

   Thus we need to add a mechanism for determining whether data packets
   on a PW have been lost.  There are several possible methods for doing
   this:

   o  Detect congestion using PW sequence numbers

   o  Detect congestion using modified VCCV packets [I-D.ietf-pwe3-vccv]

   o  Rely on Explicit Congestion Notification (ECN) [RFC3168]

   We discuss each option in turn in the following sections.

2.1.  Using Sequence Numbers to Detect Congestion

   When the optional sequencing feature is in use on a PW [RFC4385], it
   is necessary for the receiver to maintain a "next expected sequence
   number" for the PW.  If a packet arrives with a sequence number that
   is earlier than the next expected (a "mis-ordered packet"), the
   packet is discarded; if it arrives with a sequence number that is
   greater than or equal to the next expected, the packet is delivered,
   and the next expected sequence number becomes the sequence number of
   the current packet plus 1.

   It is easy to tell when one or more packets are missing (i.e., there
   is a "gap" in the sequence space) -- that is the case when a packet
   arrives whose sequence number is greater than the next expected.
   What is difficult to tell is whether any misordered packets that
   arrive after the gap are indeed the missing packets.  One could
   imagine that the receiver remembers the sequence number of each
   missing packet for a period of time, and then checks off each such
   sequence number if a misordered packet carrying that sequence number
   later arrives.  The difficulty is doing this in a manner which is
   efficient enough to be done by the microcoded hardware handling the
   PW data path.  This approach does not really seem feasible.

   One could make certain simplifying assumptions, such as assuming that
   the presence of any gaps at all indicates congestion.  While this
   assumption makes it feasible to use the sequence numbers to "detect
   congestion", it also throttles the PW unnecessarily if there is
   really just misordering and no congestion.  Such an approach would be
   considerably more likely to misinterpret misordering as congestion
   than TCP's approach would be.

   An intermediate approach would be to keep track of the number of
   missing packets and the number of misordered packets for each PW, as
   sketched at the end of this section.  One could "detect congestion"
   if the number of missing packets is significantly larger than the
   number of misordered packets over some sampling period.  However,
   gaps occurring near the end of a sampling period would tend to result
   in false indications of congestion.  To avoid this one might try to
   smooth the results over several sampling periods; while this would
   tend to decrease the responsiveness, it is inevitable that there will
   be a trade-off between the rapidity of responsiveness and the rate of
   false alarms.

   One would not expect the hardware or microcode to keep track of the
   sampling period; presumably software would read the necessary
   counters from hardware at the necessary intervals.

   Such a scheme would have the advantage of being based on existing PW
   mechanisms.  However, it has the disadvantage of requiring
   sequencing, and it also introduces a fairly complicated interaction
   between the control processing and the data path.
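   The following Python sketch illustrates the intermediate approach
   described above; it is exposition only, not a specification.  The
   per-PW counters would in practice live in the data-path hardware,
   with software polling them each sampling period.  The names, the
   threshold, and the omission of sequence-number wraparound and of
   smoothing across periods are all simplifications of our own.

      class PwSeqCounters:
          """Per-PW counters nominally maintained by the data path."""
          def __init__(self):
              self.next_expected = 0
              self.missing = 0      # packets skipped over by gaps
              self.misordered = 0   # late arrivals with old numbers

          def on_packet(self, seq):
              if seq < self.next_expected:
                  self.misordered += 1   # mis-ordered; discarded
              else:
                  self.missing += seq - self.next_expected
                  self.next_expected = seq + 1

      def congested(counters, threshold=3):
          """Software-side check, run once per sampling period."""
          missing, misordered = counters.missing, counters.misordered
          counters.missing = counters.misordered = 0
          # "Significantly larger" is modeled here by a fixed ratio.
          return missing > 0 and missing > threshold * misordered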
2.2.  Using VCCV to Detect Congestion

   It is reasonable to suppose that the hardware keeps counts of the
   number of packets sent and received on each PW.  Suppose that the PW
   uses MPLS, and that the transmitter periodically inserts VCCV packets
   into the PW data stream, where each VCCV packet carries:

   o  A sequence number, increasing by 1 for each successive VCCV
      packet;

   o  The current value of the transmission counter for the PW.

   We assume that the size of the counter is such that it cannot wrap
   during the interval between n VCCV packets, for some n > 1.

   When the receiver gets one of these VCCV packets on a PW, it inserts
   its count of received packets for that PW, and delivers the packet to
   the software.  The receiving software can now compute, for the
   inter-VCCV intervals, the count of packets transmitted and the count
   of packets received.  The presence of congestion can be inferred if
   the count of packets transmitted is significantly greater than the
   count of packets received during the most recent interval.  Even the
   loss rate could be calculated.  The loss rate calculated in this way
   could be used as input to the TFRC rate equation.
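   The receiving software's calculation might look like the following
   Python sketch.  This is illustrative only; the tuple layout, the
   function name, and the "significance" threshold are our own
   inventions, and counter wraparound is ignored.

      def interval_loss_rate(prev, curr, significance=0.001):
          """Loss rate over one inter-VCCV interval, or None.

          prev, curr -- (vccv_seq, tx_count, rx_count) tuples taken
                        from two successive VCCV packets on one PW.
          """
          if curr[0] != prev[0] + 1:
              return None          # a VCCV packet was itself lost
          sent = curr[1] - prev[1]
          received = curr[2] - prev[2]
          if sent <= 0:
              return None
          loss = (sent - received) / sent
          # Ignore small differences that may just be misordering
          # across the interval boundary (see below).
          return loss if loss > significance else 0.0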
   VCCV messages would not need to be sent on a PW (for the purpose of
   detecting congestion) in the absence of traffic on that PW.

   Of course, misordered packets that are sent during one interval but
   arrive during the next will throw off the loss rate calculation;
   hence the difference between sent traffic and received traffic should
   be "significant" before the presence of congestion is inferred.  The
   value of "significance" can be made larger or smaller depending on
   the probability of misordering.

   Note that congestion can cause a VCCV packet to go missing, and
   anything that misorders packets can misorder a VCCV packet as well as
   any other.  One may not want to infer the presence of congestion if a
   single VCCV packet does not arrive when expected, as it may just be
   delayed in the network, even if it hasn't been misordered.  However,
   failure to receive a VCCV packet after a certain amount of time has
   elapsed since the last VCCV was received (on a particular PW) may be
   taken as evidence of congestion.

   This scheme has the disadvantage of requiring periodic VCCV packets,
   and it requires VCCV packet formats to be modified to include the
   necessary counts.  However, the interaction between the control path
   and the data path is very simple, as there is no polling of counters,
   no need for timers in the data path, and no need for the control path
   to do read-modify-write operations on the data path hardware.  A
   bigger disadvantage may arise from the possible inability to ensure
   that the transmit counts in the VCCVs are exactly correct.  The
   transmitting hardware may not be able to insert a packet count in the
   VCCV IMMEDIATELY before transmission of the VCCV on the wire, and if
   it cannot, the count of transmitted packets will only be approximate.

   Neither scheme can provide the same type of continuous feedback that
   TCP gets.  TCP gets a continuous stream of acknowledgments, whereas
   the PW congestion detection mechanism would only be able to say
   whether congestion occurred during a particular interval.  If the
   interval is about 1 RTT, the PW congestion control would be
   approximately as responsive as TCP congestion control, and there does
   not seem to be any advantage to making it smaller.  However, sampling
   at an interval of 1 RTT might generate excessive amounts of overhead.
   Sampling at longer intervals would reduce responsiveness to
   congestion but would not necessarily render the congestion control
   mechanism "TCP-unfriendly".

2.3.  Explicit Congestion Notification

   In networks that support Explicit Congestion Notification (ECN)
   [RFC3168], the ECN notification provides congestion information to
   the PEs before the onset of congestion discard.  This is particularly
   useful to PWs that are sensitive to packet loss, since it gives the
   PE the opportunity to intelligently reduce the offered load.  The ECN
   marking rate of packets received on a PW could be used to calculate
   the TFRC rate for that PW.  However, ECN is not widely deployed at
   the time of writing; hence it seems that PEs must also be capable of
   operating in a network where packet loss is the only indicator of
   congestion.
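   As one possible illustration (assuming an IP PSN, and assuming the
   egress PE can read the ECN field -- the two low-order bits of the IP
   TOS/Traffic Class byte -- of each received packet), the sketch below
   computes a per-interval marking fraction that might be used in place
   of a loss event rate in the TFRC equation.  All names are
   hypothetical, and using the raw marking fraction as a TFRC input is
   an approximation, not a procedure defined by [RFC3168] or [RFC3448].

      ECN_CE = 0b11    # "Congestion Experienced" codepoint [RFC3168]

      class EcnMonitor:
          def __init__(self):
              self.packets = 0
              self.ce_marked = 0

          def on_packet(self, ecn_bits):
              self.packets += 1
              if ecn_bits == ECN_CE:
                  self.ce_marked += 1

          def marking_rate(self):
              """Fraction of CE-marked packets this sampling period."""
              if self.packets == 0:
                  return 0.0
              rate = self.ce_marked / self.packets
              self.packets = self.ce_marked = 0
              return rate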
3.  Feedback from Receiver to Transmitter

   Given that the receiver can tell, for each sampling interval, whether
   or not a PW's traffic has encountered congestion, the receiver must
   provide this information as feedback to the transmitter, so that the
   transmitter can adjust its transmission rate appropriately.  The
   feedback could be as simple as a bit stating whether or not there was
   any packet loss during the specified interval.  Alternatively, the
   actual loss rate could be provided in the feedback, if that
   information turns out to be useful to the transmitter (e.g., to
   enable it to calculate a TCP-friendly rate at which to send).  There
   are a number of possible ways in which the feedback can be provided:
   control plane, reverse data traffic, or VCCV messages.  We discuss
   each in turn below.

3.1.  Control Plane Feedback

   A control message can be sent periodically to indicate the presence
   or absence of congestion.  For example, when LDP is the control
   protocol [RFC4447], the control message would of course be delivered
   reliably by TCP.  (The same considerations apply to any protocol
   which has a reliable control channel.)  When congestion is detected,
   a control message can be sent indicating that fact.  No further
   congestion control messages would need to be sent until congestion is
   no longer detected.  If the loss rate is being sent, changes in the
   loss rate would need to be sent as well.  When there is no longer any
   congestion, a message indicating the absence of congestion would have
   to be sent.

   Since congestion in the reverse direction can prevent the delivery of
   these control messages, periodic "no congestion detected" messages
   would need to be sent whenever there is no congestion.  Failure to
   receive these in a timely manner would lead the control protocol peer
   to infer that there is congestion.  (Actually, there might or might
   not be congestion in the transmitting direction, but in the absence
   of any feedback one cannot assume that everything is fine.)  If
   control messages really cannot get through at all, control protocol
   keepalives will fail and the control connection will go down anyway.

   If the control messages simply say whether or not congestion was
   detected, then given a reliable control channel, periodic messages
   are not needed during periods of congestion.  Of course, if the
   control messages carry more data, such as the loss rate, then they
   need to be sent whenever that data changes.

   If it is desired to control congestion on a per-tunnel basis, these
   control messages will simply say that there was congestion on some PW
   (one or more) within the tunnel.  If it is desired to control
   congestion on a per-PW basis, the control message can list the PWs
   which have experienced congestion, most likely by listing the
   corresponding labels.  If the VCCV method of detecting congestion is
   used, one could even include the sent/received statistics for
   particular VCCV intervals.

   This method is very simple, as one does not have to worry about the
   congestion control messages themselves getting lost or arriving out
   of sequence.  Feedback traffic is minimized, as a single control
   message relays feedback about an entire tunnel.
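   The "absence of news is bad news" inference described above might
   look like the following Python sketch.  It is illustrative only; the
   message names, the holddown value, and the structure are invented,
   and do not correspond to any defined LDP messages.

      import time

      class CongestionFeedbackWatchdog:
          """Transmitter-side view of periodic feedback messages."""
          def __init__(self, holddown=3.0):
              self.holddown = holddown          # seconds between
              self.last_ok = time.monotonic()   # expected "no
              self.congested = False            # congestion" messages

          def on_control_message(self, msg_type):
              if msg_type == "NO_CONGESTION_DETECTED":
                  self.last_ok = time.monotonic()
                  self.congested = False
              elif msg_type == "CONGESTION_DETECTED":
                  self.congested = True

          def assume_congested(self):
              """True if told so, or if the feedback has gone stale."""
              return (self.congested or
                      time.monotonic() - self.last_ok > self.holddown)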
3.2.  Using Reverse Data Packets for Feedback

   If a receiver detects congestion on a particular PW, it can set a bit
   in the data packets that are traveling on that PW in the reverse
   direction; when no congestion is detected, the bit would be clear.
   The bit would be ignored on any packet which is received out of
   sequence, of course.  There are several disadvantages to this
   technique:

   o  There may be no (or insufficient) data traffic in the reverse
      direction.

   o  Sequencing of the data stream is required.

   o  The transmission of the congestion indications is not reliable.

   o  The most one could hope to convey is one bit of information per PW
      (if there is even a bit available in the encapsulation).

3.3.  Reverse VCCV Traffic

   Congestion indications for a particular PW could be carried in VCCV
   packets traveling in the reverse direction on that PW.  Of course,
   this would require that the VCCV packets be sent periodically in the
   reverse direction whether or not there is reverse direction traffic.
   For congestion feedback purposes they might need to be sent more
   frequently than they would need to be sent for OAM purposes.  It
   would also be necessary for the VCCVs to be sequenced (with respect
   to each other, not necessarily with respect to the data stream).
   Since VCCV transmission is unreliable, one would want to send
   multiple VCCVs within whatever period one wants to be able to respond
   in.  Further, this method provides no means of aggregating congestion
   information into information about the tunnel.

4.  Responding to Congestion

   In TCP, one tends to think of the transmission rate in terms of MTUs
   per RTT, which defines the maximum number of unacknowledged packets
   that TCP is allowed to maintain "in flight".  Upon detection of a
   lost packet, this rate is halved ("multiplicative decrease").  It
   will be halved again approximately every RTT until the missing data
   gets through.  Once all missing data has gotten through, the
   transmission rate is increased by one MTU per RTT.  Every time a new
   acknowledgment (i.e., not a duplicate acknowledgment) is received,
   the rate is similarly increased ("additive increase").  Thus TCP can
   adjust its transmit rate very rapidly, i.e., it responds on the order
   of an RTT.  By contrast, TCP-friendly rate control adjusts its rate
   rather more gradually.

   For simplicity, this discussion only covers the "congestion
   avoidance" phase of TCP congestion control.  An analog of TCP's "slow
   start" phase would also be needed.

   TCP can easily estimate the RTT, since all its transmissions are
   acknowledged.  In PWE3, the best way to estimate the RTT might be via
   the control protocol.  In fact, if the control protocol is TCP-based,
   getting the RTT estimate from TCP might be a good option.

   TCP's rate control is window-based, expressed as a number of bytes
   that can be in flight.  PWE3's rate control would need to be rate-
   based.  The TFRC specification [RFC3448] provides the equation for
   the TCP-friendly rate for a given loss rate, RTT, and MTU.  Given
   some means of determining the loss rate, as described in Section 2,
   the TCP-friendly rate for a PW or a tunnel can be calculated at the
   ingress PE.

   If the congestion detection mechanism only produces an approximate
   result, the probability of a "false alarm" (thinking that there is
   congestion when there really is not) for some interval becomes
   significant.  It would be better then to have some algorithm which
   smooths the result over several intervals.  The TFRC procedures,
   which tend to generate a smoother and less abrupt change in the
   transmission rate than the AIMD procedures, may also be more
   appropriate in this case.

   Once a PE has determined the appropriate rate at which to transmit
   traffic on a given PW or tunnel, it needs some means to enforce that
   rate via policing, shaping, or selective shutting down of PWs.  There
   are tradeoffs to be made among these options, depending on various
   factors including the higher layer service that is carried.  The
   effect of different mechanisms when the higher layer traffic is
   already using TCP is discussed below.
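   Of the enforcement options above, shaping can be illustrated with the
   following Python sketch, which delays packets rather than dropping
   them and releases them at the computed rate.  The class name, the
   queue bound, and the scheduling model are our own illustrative
   assumptions, not a proposed implementation.

      import collections, time

      class TunnelShaper:
          """Pace a tunnel's packets at a computed byte rate."""
          def __init__(self, rate_bytes_per_s, max_queue=1000):
              self.rate = rate_bytes_per_s   # e.g., from the TFRC
              self.queue = collections.deque()  # equation
              self.max_queue = max_queue
              self.next_send = time.monotonic()

          def set_rate(self, rate_bytes_per_s):
              self.rate = rate_bytes_per_s   # updated on new feedback

          def enqueue(self, packet):
              if len(self.queue) >= self.max_queue:
                  return False               # queue overflow: drop
              self.queue.append(packet)
              return True

          def dequeue_ready(self, now):
              """Return packets eligible to be sent at time 'now';
              'packet' is assumed to be a bytes-like object."""
              out = []
              while self.queue and now >= self.next_send:
                  pkt = self.queue.popleft()
                  out.append(pkt)
                  self.next_send = (max(self.next_send, now) +
                                    len(pkt) / self.rate)
              return out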
4.1.  Interaction with TCP

   Ideally, no PW-specific congestion control mechanism would be used
   when the higher layer traffic is already running over TCP and is thus
   subject to TCP's existing congestion control.  However, it may be
   difficult to determine what the higher layer is on any given PW.
   Thus, interaction between PW-specific congestion control and TCP's
   congestion control needs to be considered.

   As noted in Section 1.4.2, a PW-specific congestion control mechanism
   may interact poorly with the "outer" control loop of TCP if the PW
   carries TCP traffic.  A well-documented example of such poor
   interaction is a token bucket policer that drops packets outside the
   token bucket.  TCP has difficulty finding the "bottleneck" bandwidth
   in such an environment and tends to overshoot, incurring heavy losses
   and a consequent loss of throughput.

   A shaper that queues packets at the PE and only injects them into the
   network at the appropriate "TCP-friendly" rate may be a better
   choice, but may still interact unpredictably with the "outer control
   loop" of TCP flows that happen to traverse the PW.  This issue
   warrants further study.

   Another possibility is simply to shut down a PW when the rate of
   traffic on the PW significantly exceeds the "TCP-friendly" rate that
   has been determined for the PW.  While this might be viewed as
   draconian, it does ensure that any PW that is allowed to stay up will
   behave in a predictable manner.  Note that this would also be the
   most likely choice of action for CBR PWs (as discussed in Section 6).
   Thus all PWs would be treated alike, and there would be no need to
   try to determine what sort of upper layer payload a PW is carrying.

5.  Rate Control per Tunnel vs. per PW

   Rate controls can be applied on a per-tunnel basis or on a per-PW
   basis.  Applying them on a per-tunnel basis (and obtaining congestion
   feedback on a per-tunnel basis) would seem to provide the most
   efficient and most scalable system.  Achieving fairness among the PWs
   then becomes a local issue for the transmitter.  However, if the
   different PWs follow different paths through the network (e.g.,
   because of ECMP over the tunnel), it is possible that some PWs will
   encounter congestion while others will not.  If rate controls are
   applied on a per-tunnel basis, then whenever any PW in a tunnel is
   affected by congestion, all the PWs in the tunnel will be throttled.
   While this is sub-optimal, it is not clear that this would be a
   significant problem in practice, and it may still be the best
   trade-off.

   Per-tunnel rate control also has some desirable properties if the
   action taken during congestion is to selectively shut down certain
   PWs.  Since a tunnel will typically carry many PWs, it will be
   possible to make relatively small adjustments in the total bandwidth
   consumed by the tunnel by selectively shutting down or bringing up
   one or more PWs.
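   The selective-shutdown adjustment just described could be driven by
   local policy along the lines of the following Python sketch.  The
   preference model and all names are hypothetical; the point is only
   that the choice of victims is a local matter for the transmitter.

      def pws_to_shut_down(pws, allowed_rate):
          """Choose PWs to shut so the tunnel fits the allowed rate.

          pws -- list of (pw_id, measured_rate, preference) tuples;
                 higher preference = more important to keep up.
          """
          total = sum(rate for _, rate, _ in pws)
          victims = []
          # Sacrifice the least-preferred PWs first.
          for pw_id, rate, _ in sorted(pws, key=lambda p: p[2]):
              if total <= allowed_rate:
                  break
              victims.append(pw_id)
              total -= rate
          return victims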
6.  Constant Bit Rate Services

   As noted above, some PW services may require a fixed rate of
   transmission, and it may be impossible to provide the service while
   throttling the transmission rate.  To provide such services, the
   network paths must be engineered so that congestion is impossible;
   providing such services over the Internet is thus not very likely.
   In fact, as congestion control cannot be applied to such services, it
   may be necessary to prohibit these services from being provided in
   the Internet, except in the case where the payload is known to
   consist of TCP connections or other traffic that is congestion-
   controlled by the end-points.  It is not clear how such a prohibition
   could be enforced.

   The only feasible mechanism for handling congestion affecting CBR
   services would appear to be to selectively turn off PWs when
   congestion occurs.  Clearly it is important to avoid "false alarms"
   in this case.  It is also important to avoid bringing PWs back up too
   quickly and re-introducing congestion.

   The idea of controlling the rate per tunnel rather than per PW,
   discussed above, seems particularly attractive when some of the PWs
   are CBR.  First, it provides the possibility that non-CBR PWs could
   be throttled before it is necessary to shut down the CBR PWs.
   Second, with the aggregation of multiple PWs on a single rate-
   controlled tunnel, it becomes possible to gradually increase or
   decrease the total offered load on the tunnel by selectively bringing
   up or shutting down PWs.  As noted above, local policies at a PE
   could be used to determine which PWs to shut down or bring up first.
   Similar approaches would apply if the CBR PW offers a channelized
   service, with selected channels being shut down and brought up to
   control the total rate of the PW.

7.  Mandatory vs. Optional

   As discussed in Section 1, there is a significant set of scenarios in
   which PW-specific congestion control is not necessary.  One might
   therefore argue that it makes little sense to require PW-specific
   congestion control to be used on all PWs at all times.  On the other
   hand, if the option of turning off PW-specific congestion control is
   available, there is nothing to stop a provider from turning it off in
   inappropriate situations.  As this may contribute to congestive
   collapse outside the provider's own network, it may not be advisable
   to allow this.

8.  Related Work: Pre-Congestion Notification

   It has been suggested that Pre-Congestion Notification (PCN)
   [I-D.briscoe-tsvwg-cl-architecture][I-D.briscoe-tsvwg-cl-phb] might
   provide a basis for addressing the PW congestion control problem.
   Using PCN, it would potentially be possible to determine whether the
   level of congestion currently existing between an ingress and an
   egress PE was sufficiently low to safely allow a new PW to be
   established.  PCN's pre-emption mechanisms could be used to notify a
   PE that one or more PWs need to be brought down, which again could be
   coupled with local policies to determine exactly which PWs should be
   shut down first.  This approach certainly merits further examination,
   but we note that PCN is considerably further away from deployment in
   the Internet than ECN, and thus cannot be considered as a near-term
   solution to the problem of PW-induced congestion in the Internet.

9.  Informative References

   [I-D.briscoe-tsvwg-cl-architecture]
              Briscoe, B., "An edge-to-edge Deployment Model for Pre-
              Congestion Notification: Admission Control over a
              DiffServ Region", draft-briscoe-tsvwg-cl-architecture-03
              (work in progress), June 2006.

   [I-D.briscoe-tsvwg-cl-phb]
              Briscoe, B., "Pre-Congestion Notification marking",
              draft-briscoe-tsvwg-cl-phb-02 (work in progress),
              June 2006.

   [I-D.ietf-pwe3-cesopsn]
              Vainshtein, S., "Structure-aware TDM Circuit Emulation
              Service over Packet Switched Network (CESoPSN)",
              draft-ietf-pwe3-cesopsn-07 (work in progress), May 2006.
   [I-D.ietf-pwe3-sonet]
              Malis, A., "SONET/SDH Circuit Emulation over Packet
              (CEP)", draft-ietf-pwe3-sonet-13 (work in progress),
              June 2006.

   [I-D.ietf-pwe3-tdmoip]
              Stein, Y., "TDM over IP", draft-ietf-pwe3-tdmoip-05 (work
              in progress), June 2006.

   [I-D.ietf-pwe3-vccv]
              Nadeau, T., "Pseudo Wire Virtual Circuit Connectivity
              Verification (VCCV)", draft-ietf-pwe3-vccv-11 (work in
              progress), October 2006.

   [RFC2001]  Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
              Retransmit, and Fast Recovery Algorithms", RFC 2001,
              January 1997.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
              Control", RFC 2581, April 1999.

   [RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41,
              RFC 2914, September 2000.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001.

   [RFC3448]  Handley, M., Floyd, S., Padhye, J., and J. Widmer, "TCP
              Friendly Rate Control (TFRC): Protocol Specification",
              RFC 3448, January 2003.

   [RFC3985]  Bryant, S. and P. Pate, "Pseudo Wire Emulation Edge-to-
              Edge (PWE3) Architecture", RFC 3985, March 2005.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

   [RFC4385]  Bryant, S., Swallow, G., Martini, L., and D. McPherson,
              "Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for
              Use over an MPLS PSN", RFC 4385, February 2006.

   [RFC4447]  Martini, L., Rosen, E., El-Aawar, N., Smith, T., and G.
              Heron, "Pseudowire Setup and Maintenance Using the Label
              Distribution Protocol (LDP)", RFC 4447, April 2006.

   [RFC4553]  Vainshtein, A. and YJ. Stein, "Structure-Agnostic Time
              Division Multiplexing (TDM) over Packet (SAToP)",
              RFC 4553, June 2006.

Authors' Addresses

   Stewart Bryant
   Cisco Systems, Inc.
   250 Longwater
   Green Park, Reading  RG2 6GB
   U.K.

   Email: stbryant@cisco.com

   Bruce Davie
   Cisco Systems, Inc.
   1414 Mass. Ave.
   Boxborough, MA  01719
   USA

   Email: bsd@cisco.com

   Luca Martini
   Cisco Systems, Inc.
   9155 East Nichols Avenue, Suite 400
   Englewood, CO  80112
   USA

   Email: lmartini@cisco.com

   Eric Rosen
   Cisco Systems, Inc.
   1414 Mass. Ave.
   Boxborough, MA  01719
   USA

   Email: erosen@cisco.com

Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).