Network Working Group                                          S. Bryant
Internet-Draft                                                  B. Davie
Intended status: Standards Track                              L. Martini
Expires: April 18, 2007                                         E. Rosen
                                                     Cisco Systems, Inc.
                                                        October 15, 2006

               Pseudowire Congestion Control Framework
                  draft-rosen-pwe3-congestion-04.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 18, 2007.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   Given that pseudowires may be used to carry non-TCP data flows, it is
   necessary to provide pseudowire-specific congestion control
   procedures.  These procedures should ensure that pseudowire traffic
   is "TCP-compatible", as defined in RFC 2914.  This document attempts
   to lay out the issues which must be considered when defining such
   procedures.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
     1.1.  Pseudowires and Congestion in IP Networks  . . . . . . . .  3
     1.2.  Arguments Against PW Congestion as a Practical Problem . .  4
     1.3.  Goals of PW-specific Congestion Control  . . . . . . . . .  6
     1.4.  Challenges for PW Congestion . . . . . . . . . . . . . . .  7
       1.4.1.  Scale  . . . . . . . . . . . . . . . . . . . . . . . .  7
       1.4.2.  Interaction among control loops  . . . . . . . . . . .  8
       1.4.3.  Constant Bit Rate PWs  . . . . . . . . . . . . . . . .  8
   2.  Detecting Congestion . . . . . . . . . . . . . . . . . . . . .  9
     2.1.  Using Sequence Numbers to Detect Congestion  . . . . . . . 10
     2.2.  Using VCCV to Detect Congestion  . . . . . . . . . . . . . 11
     2.3.  Explicit Congestion Notification . . . . . . . . . . . . . 12
   3.  Feedback from Receiver to Transmitter  . . . . . . . . . . . . 13
     3.1.  Control Plane Feedback . . . . . . . . . . . . . . . . . . 13
     3.2.  Using Reverse Data Packets for Feedback  . . . . . . . . . 14
     3.3.  Reverse VCCV Traffic . . . . . . . . . . . . . . . . . . . 14
   4.  Responding to Congestion . . . . . . . . . . . . . . . . . . . 15
     4.1.  Interaction with TCP . . . . . . . . . . . . . . . . . . . 16
   5.  Rate Control per Tunnel vs. per PW . . . . . . . . . . . . . . 16
   6.  Constant Bit Rate Services . . . . . . . . . . . . . . . . . . 17
   7.  Mandatory vs. Optional . . . . . . . . . . . . . . . . . . . . 17
   8.  Related Work: Pre-Congestion Notification  . . . . . . . . . . 18
   9.  Informative References . . . . . . . . . . . . . . . . . . . . 18
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20
   Intellectual Property and Copyright Statements . . . . . . . . . . 21

1.  Introduction

1.1.  Pseudowires and Congestion in IP Networks

   Congestion in an IP network occurs when the amount of traffic that
   needs to use a particular network resource exceeds the capacity of
   that resource.  This results first in long queues within the network,
   and then in packet loss.  If the amount of traffic is not then
   reduced, the packet loss rate will climb, potentially until it
   reaches 100%.

   To prevent this sort of "congestive collapse", there must be
   congestion control: a feedback loop by which the presence of
   congestion somewhere in the network forces the transmitters to reduce
   the amount of traffic being sent.  As a connectionless protocol, IP
   has no way to push back directly on the originator of the traffic.
   Procedures for (a) detecting congestion, (b) providing the necessary
   feedback to the transmitters, and (c) adjusting the transmission
   rates, are thus left to higher protocol layers such as TCP.

   The vast majority of traffic in IP networks is currently TCP traffic.
   TCP includes an elaborate congestion control mechanism which causes
   the end systems to reduce their transmission rates when congestion
   occurs.  For those readers not intimately familiar with the details
   of TCP congestion control, we give below a brief summary, greatly
   simplified and not entirely accurate, of TCP's very complicated
   feedback mechanism.  The details of TCP congestion control can be
   found in [RFC2581].  [RFC2001] is an earlier but more accessible
   discussion.  [RFC2914] articulates a number of general principles
   governing congestion control in the Internet.

   In TCP congestion control, a lost packet is considered to be an
   indication of congestion.  Roughly, TCP considers a given packet to
   be lost if that packet is not acknowledged within a specified time,
   or if three subsequent packets arrive at the receiver before the
   given packet.  The latter condition manifests itself at the
   transmitter as the arrival of three duplicate acks in a row.
   The algorithm by which TCP detects congestion is thus highly
   dependent on the mechanisms used by TCP to ensure reliable and
   sequential delivery.

   Once a TCP transmitter becomes aware of congestion, it halves its
   transmission rate.  If congestion still occurs at the new rate, the
   rate is halved again.  When a rate is found at which congestion no
   longer occurs, the rate is increased by one MSS ("Maximum Segment
   Size") per RTT ("Round Trip Time").  The rate is increased each RTT
   until congestion is encountered again, or until something else limits
   it (e.g., the flow control window is exhausted, the application is
   transmitting at its maximum desired rate, or the line rate is
   reached).

   This sort of mechanism is known as an "Additive Increase,
   Multiplicative Decrease" (AIMD) mechanism.  Congestion causes
   relatively rapid decreases in the transmission rate, while the
   absence of congestion causes relatively slow increases in the allowed
   transmission rate.
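   As a purely illustrative sketch of the AIMD behavior just described
   (not part of any proposed mechanism), the following Python fragment
   adjusts a rate expressed in segments per RTT.  The class name,
   variable names, and initial values are invented for exposition.

      # Illustrative AIMD rate adjustment, in segments per RTT.
      # All names and constants are hypothetical.

      class AimdRate:
          def __init__(self, initial_rate=10.0, max_rate=1000.0):
              self.rate = initial_rate    # segments per RTT
              self.max_rate = max_rate    # e.g., window or line-rate limit

          def on_rtt_elapsed(self, congestion_detected):
              if congestion_detected:
                  # Multiplicative decrease: halve the rate.
                  self.rate = max(self.rate / 2.0, 1.0)
              else:
                  # Additive increase: one more segment (MSS) per RTT,
                  # unless something else limits the rate.
                  self.rate = min(self.rate + 1.0, self.max_rate)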
   As noted above, traffic in IP networks is currently predominantly TCP
   traffic.  Even the layer 2 tunneled traffic (e.g., PPP frames
   tunneled through L2TP) is predominantly TCP traffic from the end-
   users.  If pseudowires (PWs) [RFC3985] were to be used only for
   carrying TCP flows, there would be no need for any PW-specific
   congestion mechanisms.  The existing TCP congestion control
   mechanisms would be all that is needed, since any loss of packets on
   the PW would be detected as loss of packets on a TCP connection, and
   the TCP congestion control mechanisms would ensure a reduction of
   transmission rate.  However, if a PW is carrying non-TCP traffic,
   then there is no feedback mechanism to cause the end-systems to
   reduce their transmission rates in response to congestion.  When
   congestion occurs, any TCP traffic that is sharing the congested
   resource with the non-TCP traffic will be throttled, and the non-TCP
   traffic may "starve" the TCP traffic.  If there is enough non-TCP
   traffic to congest the network all by itself, there is nothing to
   prevent congestive collapse.

   The non-TCP traffic in a PW can belong to any higher layer
   whatsoever, and there is no way to ensure that TCP-like congestion
   control mechanisms will be used by all those layers.  Hence it
   appears that there is a need for an edge-to-edge (i.e., PE-to-PE)
   feedback mechanism which forces a transmitting PE to reduce its
   transmission rate in the face of network congestion.

   As TCP uses window-based flow control, controlling the rate is really
   a matter of limiting the amount of traffic which can be "in flight"
   (i.e., transmitted but not yet acknowledged) at any one time.
   Obviously a different technique needs to be used to control the
   transmission rate of the non-windowed protocols used for transmitting
   data on PWs.

1.2.  Arguments Against PW Congestion as a Practical Problem

   One may argue that congestion due to non-TCP PW traffic is only a
   theoretical problem.

   o  "99.9% of all the traffic in PWs is really IP traffic"

      If this is the case, then the traffic is either TCP traffic, which
      is already congestion-controlled, or "other" IP traffic.  While
      the congestion control issue may exist for the "other" IP traffic,
      it is a general issue which is not specific to PWs.

      Unfortunately, we cannot be sure that this is the case.  It may
      well be the case for the PW offerings of certain providers, but
      perhaps not for others.  It does appear that many providers want
      to be able to use PWs for transporting "legacy traffic" of various
      non-IP protocols.  Constant bit-rate services are an example of
      this, and raise particular issues for congestion control
      (discussed below).

   o  "PW traffic usually stays within one SP's network, and an SP
      always engineers its network carefully enough so that congestion
      is an impossibility"

      Perhaps this will be true of "most" PWs, but inter-provider PWs
      are certainly expected to have a significant presence.

      Even within a single provider's network, the provider might
      consider whether he is so confident of his network engineering
      that he does not need a feedback loop to reduce the transmission
      rate in response to congestion.

      There is also the issue of keeping the network running (i.e., out
      of congestive collapse) after an unexpected reduction of capacity.

   o  "If one provider accepts PW traffic from another, policing will be
      done at the entry point to the second provider's network, so that
      the second provider is sure that the first provider is not sending
      too much traffic.  This policing, together with the second
      provider's careful network engineering, makes congestion an
      impossibility"

      This could be the case given carefully controlled bilateral
      peering arrangements.  Note though that if the second provider is
      merely providing transit services for a PW whose endpoints are in
      other providers' networks, it may be difficult for the transit
      provider to tell which traffic is the PW traffic and which is
      "ordinary" IP traffic.

   o  "The only time we really need a general congestion control
      mechanism is when traffic goes through the public Internet.
      Obviously this will never be the case for PW traffic."

      It is not at all difficult to imagine someone using an IPsec
      tunnel across the public Internet to transport a PW from one
      private IP network to another.

      Nor is it difficult to imagine some enterprise implementing a PW
      and transporting it across some SP's backbone, e.g., if that SP is
      providing VPN service to that enterprise.

   The arguments that non-TCP traffic in PWs will never make any
   significant contribution to congestion thus do not seem to be totally
   compelling.

1.3.  Goals of PW-specific Congestion Control

   [RFC2914] defines the notion of a "TCP-compatible flow":

   "A TCP-compatible flow is responsive to congestion notification, and
   in steady-state uses no more bandwidth than a conformant TCP running
   under comparable conditions (drop rate, RTT [round trip time], MTU
   [maximum transmission unit], etc.)"

   TCP-compatible flows respond to congestion in much the way TCP does,
   so that they do not starve the TCP flows or otherwise obtain an
   unfair advantage.  [RFC2914] further points out:

   "any form of congestion control that successfully avoids a high
   sending rate in the presence of a high packet drop rate should be
   sufficient to avoid congestion collapse from undelivered packets."

   "This does not mean, however, that concerns about congestion collapse
   and fairness with TCP necessitate that all best-effort traffic deploy
   congestion control based on TCP's Additive-Increase Multiplicative-
   Decrease (AIMD) algorithm of reducing the sending rate in half in
   response to each packet drop."
259 "However, the list of TCP-compatible congestion control procedures is 260 not limited to AIMD with the same increase/ decrease parameters as 261 TCP. Other TCP-compatible congestion control procedures include 262 rate-based variants of AIMD; AIMD with different sets of increase/ 263 decrease parameters that give the same steady-state behavior; 264 equation-based congestion control where the sender adjusts its 265 sending rate in response to information about the long-term packet 266 drop rate ... and possibly other forms that we have not yet begun to 267 consider." 269 The AIMD procedures are not mandated for non-TCP traffic, and might 270 not be optimal for non-TCP PW traffic. Choosing a proper set of 271 procedures which are TCP-compatible while being optimized for a 272 particular type of traffic is no simple task. [RFC3448], "TCP 273 Friendly Rate Control (TFRC)" provides an alternative: 275 "TFRC is designed to be reasonably fair when competing for bandwidth 276 with TCP flows, where a flow is "reasonably fair" if its sending rate 277 is generally within a factor of two of the sending rate of a TCP flow 278 under the same conditions. However, TFRC has a much lower variation 279 of throughput over time compared with TCP, which makes it more 280 suitable for applications such as telephony or streaming media where 281 a relatively smooth sending rate is of importance." 283 "For its congestion control mechanism, TFRC directly uses a 284 throughput equation for the allowed sending rate as a function of the 285 loss event rate and round-trip time. In order to compete fairly with 286 TCP, TFRC uses the TCP throughput equation, which roughly describes 287 TCP's sending rate as a function of the loss event rate, round-trip 288 time, and packet size." 290 "Generally speaking, TFRC's congestion control mechanism works as 291 follows: 293 o The receiver measures the loss event rate and feeds this 294 information back to the sender. 296 o The sender also uses these feedback messages to measure the round- 297 trip time (RTT). 299 o The loss event rate and RTT are then fed into TFRC's throughput 300 equation, giving the acceptable transmit rate. 302 o The sender then adjusts its transmit rate to match the calculated 303 rate." 305 Note that the TFRC procedures require the transmitter to calculate a 306 throughput equation. For these procedures to be feasible as a means 307 of PW congestion control, they must be computationally efficient. 308 Section 8 of [RFC3448] describes an implementation technique that 309 appears to make it efficient to calculate the equation. It is not 310 clear whether this is the case; this is an area for further 311 consideration. 313 1.4. Challenges for PW Congestion 315 1.4.1. Scale 317 It might appear at first glance that an easy solution to PW 318 congestion control would be to run the PWs through a TCP connection. 319 This would provide congestion control automatically. However, the 320 overhead is prohibitive for the PW application. The PWE3 data plane 321 may be implemented in a microcoded hardware engine which needs to 322 support thousands of PWs, and needs to do as little as possible for 323 each data packet; running a TCP state machine, and implementing TCP's 324 flow control procedures, would impose too high a cost in this 325 environment. Nor do we want to add the large overhead of TCP to the 326 PWs -- the large headers, the plethora of small acks in the reverse 327 direction, etc., etc. In fact, we want to avoid acknowledgments 328 altogether. 
1.4.  Challenges for PW Congestion

1.4.1.  Scale

   It might appear at first glance that an easy solution to PW
   congestion control would be to run the PWs through a TCP connection.
   This would provide congestion control automatically.  However, the
   overhead is prohibitive for the PW application.  The PWE3 data plane
   may be implemented in a microcoded hardware engine which needs to
   support thousands of PWs, and needs to do as little as possible for
   each data packet; running a TCP state machine, and implementing TCP's
   flow control procedures, would impose too high a cost in this
   environment.  Nor do we want to add the large overhead of TCP to the
   PWs -- the large headers, the plethora of small acks in the reverse
   direction, and so on.  In fact, we want to avoid acknowledgments
   altogether.  These same considerations lead us away from using, e.g.,
   DCCP [RFC4340].  Therefore we will investigate some PW-specific
   solutions for congestion control.

   We also want to minimize the amount of interaction between the data
   processing path (which is likely to be distributed among a set of
   line cards) and the control path; we need to be especially careful of
   interactions which might require atomic read/modify/write operations
   from the control path, or which might require atomic read/modify/
   write operations between different processors in a multiprocessing
   implementation, as such interactions can cause scaling problems.

   Thus, feasible solutions for PW-specific congestion control will
   require scalable means to detect congestion and to reduce the amount
   of traffic sent into the network when congestion is detected.  These
   topics are discussed in more detail in subsequent sections.

1.4.2.  Interaction among control loops

   As noted above, much of the traffic that is carried on PWs is likely
   to be TCP traffic, and will therefore be subject to the congestion
   control mechanisms of TCP.  It will typically be difficult for a PW
   endpoint to tell whether or not this is the case.  Thus there is a
   risk that the PE-to-PE congestion control mechanisms applied over the
   PW may interact in undesirable ways with the end-to-end congestion
   control mechanisms of TCP.  The PW-specific congestion control
   mechanisms should be designed to minimize the negative impact of such
   interaction.

1.4.3.  Constant Bit Rate PWs

   Some types of PW, for example SAToP (Structure-Agnostic TDM over
   Packet) [RFC4553], CESoPSN (Circuit Emulation Service over Packet
   Switched Network) [I-D.ietf-pwe3-cesopsn], TDM over IP
   [I-D.ietf-pwe3-tdmoip], SONET/SDH [I-D.ietf-pwe3-sonet], and constant
   bit-rate ATM PWs, represent an inelastic constant bit-rate (CBR)
   flow.  Such PWs cannot respond to congestion in the TCP-friendly
   manner prescribed by [RFC2914]; the total amount of bandwidth
   consumed by such a PW remains constant.  AIMD or even the more
   gradual TFRC techniques are clearly not applicable to such services;
   it is not feasible to reduce the rate of a CBR service without
   violating the service definition.  Such services are also frequently
   more sensitive to packet loss than connectionless packet PWs.  Given
   that CBR services are not greedy (in the sense of trying to increase
   their share of a link, as TCP does), there may be a case for allowing
   them greater latitude during congestion peaks.  However, if some CBR
   PWs are not able to endure any significant packet loss or reduction
   in rate without compromising the transported service, such PWs must
   be shut down when the level of congestion becomes excessive.  At
   suitably low levels of congestion they may be allowed to continue to
   offer traffic to the network.

   Some CBR services may be carried over connectionless packet PWs.  An
   example of such a case would be a CBR MPEG-2 video stream carried
   over an Ethernet PW.  One could argue that such a service - provided
   the rate was policed at the ingress PE - should be offered the same
   latitude as a PW that explicitly provides a CBR service.  Likewise,
   there may not be much value in trying to throttle such a service
   rather than cutting it off completely during severe congestion.
   However, this clearly raises the issue of how to know that a PW is
   indeed carrying a CBR service.
2.  Detecting Congestion

   In TCP, congestion is detected by the transmitter; the receipt of
   three successive duplicate TCP acks is taken to be indicative of
   congestion.  What this actually means is that several packets in a
   row were received at the remote end, none of which had the next
   expected sequence number.  This is interpreted as meaning that the
   packet with the next expected sequence number was lost in the
   network, and the loss of a single packet in the network is taken as a
   sign of congestion.  (Naturally, the presence of congestion is also
   inferred if TCP has to retransmit a packet.)  Note that it is
   possible for mis-ordered packets to be misinterpreted as lost
   packets, if they do not arrive "soon enough".

   In TCP, a time-out while awaiting an ack is also interpreted as a
   sign of congestion.

   Since there are no acknowledgments on a PW, the PW-specific
   congestion control mechanism obviously cannot be based on either the
   presence or the absence of acknowledgments.  Some types of pseudowire
   (the CBR PWs) have a single bit that indicates that a preset amount
   of data has been lost, but this is a non-quantitative indicator.  CBR
   PWs have the advantage that there is a constant two-way data flow,
   while other PW types do not have the constant symmetric flow of
   payload on which to piggyback the congestion notification.  Most PW
   types therefore provide no way for a transmitter to determine (or
   even to make an educated guess as to) whether any data has been lost.

   Thus we need to add a mechanism for determining whether data packets
   on a PW have been lost.  There are several possible methods for doing
   this:

   o  Detect congestion using PW sequence numbers

   o  Detect congestion using modified VCCV packets [I-D.ietf-pwe3-vccv]

   o  Rely on Explicit Congestion Notification (ECN) [RFC3168]

   We discuss each option in turn in the following sections.

2.1.  Using Sequence Numbers to Detect Congestion

   When the optional sequencing feature is in use on a PW [RFC4385], it
   is necessary for the receiver to maintain a "next expected sequence
   number" for the PW.  If a packet arrives with a sequence number that
   is earlier than the next expected (a "mis-ordered packet"), the
   packet is discarded; if it arrives with a sequence number that is
   greater than or equal to the next expected, the packet is delivered,
   and the next expected sequence number becomes the sequence number of
   the current packet plus 1.

   It is easy to tell when one or more packets are missing (i.e., there
   is a "gap" in the sequence space) -- that is the case when a packet
   arrives whose sequence number is greater than the next expected.
   What is difficult to tell is whether any misordered packets that
   arrive after the gap are indeed the missing packets.  One could
   imagine that the receiver remembers the sequence number of each
   missing packet for a period of time, and then checks off each such
   sequence number if a misordered packet carrying that sequence number
   later arrives.  The difficulty is doing this in a manner which is
   efficient enough to be done by the microcoded hardware handling the
   PW data path.  This approach does not really seem feasible.

   One could make certain simplifying assumptions, such as assuming that
   the presence of any gaps at all indicates congestion.  While this
   assumption makes it feasible to use the sequence numbers to "detect
   congestion", it also throttles the PW unnecessarily if there is
   really just misordering and no congestion.  Such an approach would be
   considerably more likely to misinterpret misordering as congestion
   than TCP's approach would be.

   An intermediate approach would be to keep track of the number of
   missing packets and the number of misordered packets for each PW, as
   sketched at the end of this section.  One could "detect congestion"
   if the number of missing packets is significantly larger than the
   number of misordered packets over some sampling period.  However,
   gaps occurring near the end of a sampling period would tend to result
   in false indications of congestion.  To avoid this one might try to
   smooth the results over several sampling periods; while this would
   tend to decrease the responsiveness, it is inevitable that there will
   be a trade-off between the rapidity of responsiveness and the rate of
   false alarms.

   One would not expect the hardware or microcode to keep track of the
   sampling period; presumably software would read the necessary
   counters from hardware at the necessary intervals.

   Such a scheme would have the advantage of being based on existing PW
   mechanisms.  However, it has the disadvantage of requiring
   sequencing, and it also introduces a fairly complicated interaction
   between the control processing and the data path.
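   The following Python sketch illustrates the intermediate approach
   described above; it is exposition only, not a specification.  The
   per-PW counters would in practice live in the data-path hardware,
   with software polling them each sampling period.  The names, the
   threshold, and the omission of sequence-number wraparound and of
   smoothing across periods are all simplifications of our own.

      class PwSeqCounters:
          """Per-PW counters nominally maintained by the data path."""
          def __init__(self):
              self.next_expected = 0
              self.missing = 0      # packets skipped over by gaps
              self.misordered = 0   # late arrivals with old numbers

          def on_packet(self, seq):
              if seq < self.next_expected:
                  self.misordered += 1   # mis-ordered; discarded
              else:
                  self.missing += seq - self.next_expected
                  self.next_expected = seq + 1

      def congested(counters, threshold=3):
          """Software-side check, run once per sampling period."""
          missing, misordered = counters.missing, counters.misordered
          counters.missing = counters.misordered = 0
          # "Significantly larger" is modeled here by a fixed ratio.
          return missing > 0 and missing > threshold * misordered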
2.2.  Using VCCV to Detect Congestion

   It is reasonable to suppose that the hardware keeps counts of the
   number of packets sent and received on each PW.  Suppose that the PW
   uses MPLS, and that the transmitter periodically inserts VCCV packets
   into the PW data stream, where each VCCV packet carries:

   o  A sequence number, increasing by 1 for each successive VCCV
      packet;

   o  The current value of the transmission counter for the PW.

   We assume that the size of the counter is such that it cannot wrap
   during the interval between n VCCV packets, for some n > 1.

   When the receiver gets one of these VCCV packets on a PW, it inserts
   its count of received packets for that PW, and delivers the packet to
   the software.  The receiving software can now compute, for the
   inter-VCCV intervals, the count of packets transmitted and the count
   of packets received.  The presence of congestion can be inferred if
   the count of packets transmitted is significantly greater than the
   count of packets received during the most recent interval.  Even the
   loss rate could be calculated.  The loss rate calculated in this way
   could be used as input to the TFRC rate equation.
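   The receiving software's calculation might look like the following
   Python sketch.  This is illustrative only; the tuple layout, the
   function name, and the "significance" threshold are our own
   inventions, and counter wraparound is ignored.

      def interval_loss_rate(prev, curr, significance=0.001):
          """Loss rate over one inter-VCCV interval, or None.

          prev, curr -- (vccv_seq, tx_count, rx_count) tuples taken
                        from two successive VCCV packets on one PW.
          """
          if curr[0] != prev[0] + 1:
              return None          # a VCCV packet was itself lost
          sent = curr[1] - prev[1]
          received = curr[2] - prev[2]
          if sent <= 0:
              return None
          loss = (sent - received) / sent
          # Ignore small differences that may just be misordering
          # across the interval boundary (see below).
          return loss if loss > significance else 0.0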
   VCCV messages would not need to be sent on a PW (for the purpose of
   detecting congestion) in the absence of traffic on that PW.

   Of course, misordered packets that are sent during one interval but
   arrive during the next will throw off the loss rate calculation;
   hence the difference between sent traffic and received traffic should
   be "significant" before the presence of congestion is inferred.  The
   value of "significance" can be made larger or smaller depending on
   the probability of misordering.

   Note that congestion can cause a VCCV packet to go missing, and
   anything that misorders packets can misorder a VCCV packet as well as
   any other.  One may not want to infer the presence of congestion if a
   single VCCV packet does not arrive when expected, as it may just be
   delayed in the network, even if it hasn't been misordered.  However,
   failure to receive a VCCV packet after a certain amount of time has
   elapsed since the last VCCV was received (on a particular PW) may be
   taken as evidence of congestion.

   This scheme has the disadvantage of requiring periodic VCCV packets,
   and it requires VCCV packet formats to be modified to include the
   necessary counts.  However, the interaction between the control path
   and the data path is very simple, as there is no polling of counters,
   no need for timers in the data path, and no need for the control path
   to do read-modify-write operations on the data path hardware.  A
   bigger disadvantage may arise from the possible inability to ensure
   that the transmit counts in the VCCVs are exactly correct.  The
   transmitting hardware may not be able to insert a packet count in the
   VCCV IMMEDIATELY before transmission of the VCCV on the wire, and if
   it cannot, the count of transmitted packets will only be approximate.

   Neither scheme can provide the same type of continuous feedback that
   TCP gets.  TCP gets a continuous stream of acknowledgments, whereas
   the PW congestion detection mechanism would only be able to say
   whether congestion occurred during a particular interval.  If the
   interval is about 1 RTT, the PW congestion control would be
   approximately as responsive as TCP congestion control, and there does
   not seem to be any advantage to making it smaller.  However, sampling
   at an interval of 1 RTT might generate excessive amounts of overhead.
   Sampling at longer intervals would reduce responsiveness to
   congestion but would not necessarily render the congestion control
   mechanism "TCP-unfriendly".

2.3.  Explicit Congestion Notification

   In networks that support Explicit Congestion Notification (ECN)
   [RFC3168], the ECN notification provides congestion information to
   the PEs before the onset of congestion discard.  This is particularly
   useful to PWs that are sensitive to packet loss, since it gives the
   PE the opportunity to intelligently reduce the offered load.  The ECN
   marking rate of packets received on a PW could be used to calculate
   the TFRC rate for that PW.  However, ECN is not widely deployed at
   the time of writing; hence it seems that PEs must also be capable of
   operating in a network where packet loss is the only indicator of
   congestion.
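   As one possible illustration (assuming an IP PSN, and assuming the
   egress PE can read the ECN field -- the two low-order bits of the IP
   TOS/Traffic Class byte -- of each received packet), the sketch below
   computes a per-interval marking fraction that might be used in place
   of a loss event rate in the TFRC equation.  All names are
   hypothetical, and using the raw marking fraction as a TFRC input is
   an approximation, not a procedure defined by [RFC3168] or [RFC3448].

      ECN_CE = 0b11    # "Congestion Experienced" codepoint [RFC3168]

      class EcnMonitor:
          def __init__(self):
              self.packets = 0
              self.ce_marked = 0

          def on_packet(self, ecn_bits):
              self.packets += 1
              if ecn_bits == ECN_CE:
                  self.ce_marked += 1

          def marking_rate(self):
              """Fraction of CE-marked packets this sampling period."""
              if self.packets == 0:
                  return 0.0
              rate = self.ce_marked / self.packets
              self.packets = self.ce_marked = 0
              return rate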
3.  Feedback from Receiver to Transmitter

   Given that the receiver can tell, for each sampling interval, whether
   or not a PW's traffic has encountered congestion, the receiver must
   provide this information as feedback to the transmitter, so that the
   transmitter can adjust its transmission rate appropriately.  The
   feedback could be as simple as a bit stating whether or not there was
   any packet loss during the specified interval.  Alternatively, the
   actual loss rate could be provided in the feedback, if that
   information turns out to be useful to the transmitter (e.g., to
   enable it to calculate a TCP-friendly rate at which to send).  There
   are a number of possible ways in which the feedback can be provided:
   control plane, reverse data traffic, or VCCV messages.  We discuss
   each in turn below.

3.1.  Control Plane Feedback

   A control message can be sent periodically to indicate the presence
   or absence of congestion.  For example, when LDP is the control
   protocol [RFC4447], the control message would of course be delivered
   reliably by TCP.  (The same considerations apply to any protocol
   which has a reliable control channel.)  When congestion is detected,
   a control message can be sent indicating that fact.  No further
   congestion control messages would need to be sent until congestion is
   no longer detected.  If the loss rate is being sent, changes in the
   loss rate would need to be sent as well.  When there is no longer any
   congestion, a message indicating the absence of congestion would have
   to be sent.

   Since congestion in the reverse direction can prevent the delivery of
   these control messages, periodic "no congestion detected" messages
   would need to be sent whenever there is no congestion.  Failure to
   receive these in a timely manner would lead the control protocol peer
   to infer that there is congestion.  (Actually, there might or might
   not be congestion in the transmitting direction, but in the absence
   of any feedback one cannot assume that everything is fine.)  If
   control messages really cannot get through at all, control protocol
   keepalives will fail and the control connection will go down anyway.

   If the control messages simply say whether or not congestion was
   detected, then given a reliable control channel, periodic messages
   are not needed during periods of congestion.  Of course, if the
   control messages carry more data, such as the loss rate, then they
   need to be sent whenever that data changes.

   If it is desired to control congestion on a per-tunnel basis, these
   control messages will simply say that there was congestion on some PW
   (one or more) within the tunnel.  If it is desired to control
   congestion on a per-PW basis, the control message can list the PWs
   which have experienced congestion, most likely by listing the
   corresponding labels.  If the VCCV method of detecting congestion is
   used, one could even include the sent/received statistics for
   particular VCCV intervals.

   This method is very simple, as one does not have to worry about the
   congestion control messages themselves getting lost or arriving out
   of sequence.  Feedback traffic is minimized, as a single control
   message relays feedback about an entire tunnel.
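   The "absence of news is bad news" inference described above might
   look like the following Python sketch.  It is illustrative only; the
   message names, the holddown value, and the structure are invented,
   and do not correspond to any defined LDP messages.

      import time

      class CongestionFeedbackWatchdog:
          """Transmitter-side view of periodic feedback messages."""
          def __init__(self, holddown=3.0):
              self.holddown = holddown          # seconds between
              self.last_ok = time.monotonic()   # expected "no
              self.congested = False            # congestion" messages

          def on_control_message(self, msg_type):
              if msg_type == "NO_CONGESTION_DETECTED":
                  self.last_ok = time.monotonic()
                  self.congested = False
              elif msg_type == "CONGESTION_DETECTED":
                  self.congested = True

          def assume_congested(self):
              """True if told so, or if the feedback has gone stale."""
              return (self.congested or
                      time.monotonic() - self.last_ok > self.holddown)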
3.2.  Using Reverse Data Packets for Feedback

   If a receiver detects congestion on a particular PW, it can set a bit
   in the data packets that are traveling on that PW in the reverse
   direction; when no congestion is detected, the bit would be clear.
   The bit would be ignored on any packet which is received out of
   sequence, of course.  There are several disadvantages to this
   technique:

   o  There may be no (or insufficient) data traffic in the reverse
      direction.

   o  Sequencing of the data stream is required.

   o  The transmission of the congestion indications is not reliable.

   o  The most one could hope to convey is one bit of information per PW
      (if there is even a bit available in the encapsulation).

3.3.  Reverse VCCV Traffic

   Congestion indications for a particular PW could be carried in VCCV
   packets traveling in the reverse direction on that PW.  Of course,
   this would require that the VCCV packets be sent periodically in the
   reverse direction whether or not there is reverse direction traffic.
   For congestion feedback purposes they might need to be sent more
   frequently than they would need to be sent for OAM purposes.  It
   would also be necessary for the VCCVs to be sequenced (with respect
   to each other, not necessarily with respect to the data stream).
   Since VCCV transmission is unreliable, one would want to send
   multiple VCCVs within whatever period one wants to be able to respond
   in.  Further, this method provides no means of aggregating congestion
   information into information about the tunnel.

4.  Responding to Congestion

   In TCP, one tends to think of the transmission rate in terms of MTUs
   per RTT, which defines the maximum number of unacknowledged packets
   that TCP is allowed to maintain "in flight".  Upon detection of a
   lost packet, this rate is halved ("multiplicative decrease").  It
   will be halved again approximately every RTT until the missing data
   gets through.  Once all missing data has gotten through, the
   transmission rate is increased by one MTU per RTT.  Every time a new
   acknowledgment (i.e., not a duplicate acknowledgment) is received,
   the rate is similarly increased ("additive increase").  Thus TCP can
   adjust its transmit rate very rapidly, i.e., it responds on the order
   of an RTT.  By contrast, TCP-friendly rate control adjusts its rate
   rather more gradually.

   For simplicity, this discussion only covers the "congestion
   avoidance" phase of TCP congestion control.  An analog of TCP's "slow
   start" phase would also be needed.

   TCP can easily estimate the RTT, since all its transmissions are
   acknowledged.  In PWE3, the best way to estimate the RTT might be via
   the control protocol.  In fact, if the control protocol is TCP-based,
   getting the RTT estimate from TCP might be a good option.

   TCP's rate control is window-based, expressed as a number of bytes
   that can be in flight.  PWE3's rate control would need to be rate-
   based.  The TFRC specification [RFC3448] provides the equation for
   the TCP-friendly rate for a given loss rate, RTT, and MTU.  Given
   some means of determining the loss rate, as described in Section 2,
   the TCP-friendly rate for a PW or a tunnel can be calculated at the
   ingress PE.

   If the congestion detection mechanism only produces an approximate
   result, the probability of a "false alarm" (thinking that there is
   congestion when there really is not) for some interval becomes
   significant.  It would be better then to have some algorithm which
   smooths the result over several intervals.  The TFRC procedures,
   which tend to generate a smoother and less abrupt change in the
   transmission rate than the AIMD procedures, may also be more
   appropriate in this case.

   Once a PE has determined the appropriate rate at which to transmit
   traffic on a given PW or tunnel, it needs some means to enforce that
   rate via policing, shaping, or selective shutting down of PWs.  There
   are tradeoffs to be made among these options, depending on various
   factors including the higher layer service that is carried.  The
   effect of different mechanisms when the higher layer traffic is
   already using TCP is discussed below.
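   Of the enforcement options above, shaping can be illustrated with the
   following Python sketch, which delays packets rather than dropping
   them and releases them at the computed rate.  The class name, the
   queue bound, and the scheduling model are our own illustrative
   assumptions, not a proposed implementation.

      import collections, time

      class TunnelShaper:
          """Pace a tunnel's packets at a computed byte rate."""
          def __init__(self, rate_bytes_per_s, max_queue=1000):
              self.rate = rate_bytes_per_s   # e.g., from the TFRC
              self.queue = collections.deque()  # equation
              self.max_queue = max_queue
              self.next_send = time.monotonic()

          def set_rate(self, rate_bytes_per_s):
              self.rate = rate_bytes_per_s   # updated on new feedback

          def enqueue(self, packet):
              if len(self.queue) >= self.max_queue:
                  return False               # queue overflow: drop
              self.queue.append(packet)
              return True

          def dequeue_ready(self, now):
              """Return packets eligible to be sent at time 'now';
              'packet' is assumed to be a bytes-like object."""
              out = []
              while self.queue and now >= self.next_send:
                  pkt = self.queue.popleft()
                  out.append(pkt)
                  self.next_send = (max(self.next_send, now) +
                                    len(pkt) / self.rate)
              return out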
4.1.  Interaction with TCP

   Ideally, no PW-specific congestion control mechanism would be used
   when the higher layer traffic is already running over TCP and is thus
   subject to TCP's existing congestion control.  However, it may be
   difficult to determine what the higher layer is on any given PW.
   Thus, interaction between PW-specific congestion control and TCP's
   congestion control needs to be considered.

   As noted in Section 1.4.2, a PW-specific congestion control mechanism
   may interact poorly with the "outer" control loop of TCP if the PW
   carries TCP traffic.  A well-documented example of such poor
   interaction is a token bucket policer that drops packets outside the
   token bucket.  TCP has difficulty finding the "bottleneck" bandwidth
   in such an environment and tends to overshoot, incurring heavy losses
   and a consequent loss of throughput.

   A shaper that queues packets at the PE and only injects them into the
   network at the appropriate "TCP-friendly" rate may be a better
   choice, but may still interact unpredictably with the "outer control
   loop" of TCP flows that happen to traverse the PW.  This issue
   warrants further study.

   Another possibility is simply to shut down a PW when the rate of
   traffic on the PW significantly exceeds the "TCP-friendly" rate that
   has been determined for the PW.  While this might be viewed as
   draconian, it does ensure that any PW that is allowed to stay up will
   behave in a predictable manner.  Note that this would also be the
   most likely choice of action for CBR PWs (as discussed in Section 6).
   Thus all PWs would be treated alike, and there would be no need to
   try to determine what sort of upper layer payload a PW is carrying.

5.  Rate Control per Tunnel vs. per PW

   Rate controls can be applied on a per-tunnel basis or on a per-PW
   basis.  Applying them on a per-tunnel basis (and obtaining congestion
   feedback on a per-tunnel basis) would seem to provide the most
   efficient and most scalable system.  Achieving fairness among the PWs
   then becomes a local issue for the transmitter.  However, if the
   different PWs follow different paths through the network (e.g.,
   because of ECMP over the tunnel), it is possible that some PWs will
   encounter congestion while others will not.  If rate controls are
   applied on a per-tunnel basis, then whenever any PW in a tunnel is
   affected by congestion, all the PWs in the tunnel will be throttled.
   While this is sub-optimal, it is not clear that this would be a
   significant problem in practice, and it may still be the best
   trade-off.

   Per-tunnel rate control also has some desirable properties if the
   action taken during congestion is to selectively shut down certain
   PWs.  Since a tunnel will typically carry many PWs, it will be
   possible to make relatively small adjustments in the total bandwidth
   consumed by the tunnel by selectively shutting down or bringing up
   one or more PWs.
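   The selective-shutdown adjustment just described could be driven by
   local policy along the lines of the following Python sketch.  The
   preference model and all names are hypothetical; the point is only
   that the choice of victims is a local matter for the transmitter.

      def pws_to_shut_down(pws, allowed_rate):
          """Choose PWs to shut so the tunnel fits the allowed rate.

          pws -- list of (pw_id, measured_rate, preference) tuples;
                 higher preference = more important to keep up.
          """
          total = sum(rate for _, rate, _ in pws)
          victims = []
          # Sacrifice the least-preferred PWs first.
          for pw_id, rate, _ in sorted(pws, key=lambda p: p[2]):
              if total <= allowed_rate:
                  break
              victims.append(pw_id)
              total -= rate
          return victims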
6.  Constant Bit Rate Services

   As noted above, some PW services may require a fixed rate of
   transmission, and it may be impossible to provide the service while
   throttling the transmission rate.  To provide such services, the
   network paths must be engineered so that congestion is impossible;
   providing such services over the Internet is thus not very likely.
   In fact, as congestion control cannot be applied to such services, it
   may be necessary to prohibit these services from being provided in
   the Internet, except in the case where the payload is known to
   consist of TCP connections or other traffic that is congestion-
   controlled by the end-points.  It is not clear how such a prohibition
   could be enforced.

   The only feasible mechanism for handling congestion affecting CBR
   services would appear to be to selectively turn off PWs when
   congestion occurs.  Clearly it is important to avoid "false alarms"
   in this case.  It is also important to avoid bringing PWs back up too
   quickly and re-introducing congestion.

   The idea of controlling the rate per tunnel rather than per PW,
   discussed above, seems particularly attractive when some of the PWs
   are CBR.  First, it provides the possibility that non-CBR PWs could
   be throttled before it is necessary to shut down the CBR PWs.
   Second, with the aggregation of multiple PWs on a single rate-
   controlled tunnel, it becomes possible to gradually increase or
   decrease the total offered load on the tunnel by selectively bringing
   up or shutting down PWs.  As noted above, local policies at a PE
   could be used to determine which PWs to shut down or bring up first.
   Similar approaches would apply if the CBR PW offers a channelized
   service, with selected channels being shut down and brought up to
   control the total rate of the PW.

7.  Mandatory vs. Optional

   As discussed in Section 1, there is a significant set of scenarios in
   which PW-specific congestion control is not necessary.  One might
   therefore argue that it makes little sense to require PW-specific
   congestion control to be used on all PWs at all times.  On the other
   hand, if the option of turning off PW-specific congestion control is
   available, there is nothing to stop a provider from turning it off in
   inappropriate situations.  As this may contribute to congestive
   collapse outside the provider's own network, it may not be advisable
   to allow this.

8.  Related Work: Pre-Congestion Notification

   It has been suggested that Pre-Congestion Notification (PCN)
   [I-D.briscoe-tsvwg-cl-architecture][I-D.briscoe-tsvwg-cl-phb] might
   provide a basis for addressing the PW congestion control problem.
   Using PCN, it would potentially be possible to determine whether the
   level of congestion currently existing between an ingress and an
   egress PE was sufficiently low to safely allow a new PW to be
   established.  PCN's pre-emption mechanisms could be used to notify a
   PE that one or more PWs need to be brought down, which again could be
   coupled with local policies to determine exactly which PWs should be
   shut down first.  This approach certainly merits further examination,
   but we note that PCN is considerably further away from deployment in
   the Internet than ECN, and thus cannot be considered as a near-term
   solution to the problem of PW-induced congestion in the Internet.

9.  Informative References

   [I-D.briscoe-tsvwg-cl-architecture]
              Briscoe, B., "An edge-to-edge Deployment Model for Pre-
              Congestion Notification: Admission Control over a
              DiffServ Region", draft-briscoe-tsvwg-cl-architecture-03
              (work in progress), June 2006.

   [I-D.briscoe-tsvwg-cl-phb]
              Briscoe, B., "Pre-Congestion Notification marking",
              draft-briscoe-tsvwg-cl-phb-02 (work in progress),
              June 2006.

   [I-D.ietf-pwe3-cesopsn]
              Vainshtein, S., "Structure-aware TDM Circuit Emulation
              Service over Packet Switched Network (CESoPSN)",
              draft-ietf-pwe3-cesopsn-07 (work in progress), May 2006.
   [I-D.ietf-pwe3-sonet]
              Malis, A., "SONET/SDH Circuit Emulation over Packet
              (CEP)", draft-ietf-pwe3-sonet-13 (work in progress),
              June 2006.

   [I-D.ietf-pwe3-tdmoip]
              Stein, Y., "TDM over IP", draft-ietf-pwe3-tdmoip-05 (work
              in progress), June 2006.

   [I-D.ietf-pwe3-vccv]
              Nadeau, T., "Pseudo Wire Virtual Circuit Connectivity
              Verification (VCCV)", draft-ietf-pwe3-vccv-11 (work in
              progress), October 2006.

   [RFC2001]  Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
              Retransmit, and Fast Recovery Algorithms", RFC 2001,
              January 1997.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion
              Control", RFC 2581, April 1999.

   [RFC2914]  Floyd, S., "Congestion Control Principles", BCP 41,
              RFC 2914, September 2000.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, September 2001.

   [RFC3448]  Handley, M., Floyd, S., Padhye, J., and J. Widmer, "TCP
              Friendly Rate Control (TFRC): Protocol Specification",
              RFC 3448, January 2003.

   [RFC3985]  Bryant, S. and P. Pate, "Pseudo Wire Emulation Edge-to-
              Edge (PWE3) Architecture", RFC 3985, March 2005.

   [RFC4340]  Kohler, E., Handley, M., and S. Floyd, "Datagram
              Congestion Control Protocol (DCCP)", RFC 4340, March 2006.

   [RFC4385]  Bryant, S., Swallow, G., Martini, L., and D. McPherson,
              "Pseudowire Emulation Edge-to-Edge (PWE3) Control Word for
              Use over an MPLS PSN", RFC 4385, February 2006.

   [RFC4447]  Martini, L., Rosen, E., El-Aawar, N., Smith, T., and G.
              Heron, "Pseudowire Setup and Maintenance Using the Label
              Distribution Protocol (LDP)", RFC 4447, April 2006.

   [RFC4553]  Vainshtein, A. and YJ. Stein, "Structure-Agnostic Time
              Division Multiplexing (TDM) over Packet (SAToP)",
              RFC 4553, June 2006.

Authors' Addresses

   Stewart Bryant
   Cisco Systems, Inc.
   250 Longwater
   Green Park, Reading  RG2 6GB
   U.K.

   Email: stbryant@cisco.com

   Bruce Davie
   Cisco Systems, Inc.
   1414 Mass. Ave.
   Boxborough, MA  01719
   USA

   Email: bsd@cisco.com

   Luca Martini
   Cisco Systems, Inc.
   9155 East Nichols Avenue, Suite 400
   Englewood, CO  80112
   USA

   Email: lmartini@cisco.com

   Eric Rosen
   Cisco Systems, Inc.
   1414 Mass. Ave.
   Boxborough, MA  01719
   USA

   Email: erosen@cisco.com

Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgment

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).