<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-ietf-pwe3-congestion-frmwk-02.txt"
     ipr="trust200902">
  <front>
    <title abbrev="PWE3 Congestion">Pseudowire Congestion Control
    Framework</title>

    <author fullname="Stewart Bryant" initials="S." surname="Bryant">
      <organization>Cisco Systems, Inc.</organization>

      <address>
        <postal>
          <street>250 Longwater</street>

          <city>Green Park</city>

          <region>Reading</region>

          <code>RG2 6GB</code>

          <country>U.K.</country>
        </postal>

        <phone></phone>

        <facsimile></facsimile>

        <email>stbryant@cisco.com</email>

        <uri></uri>
      </address>
    </author>

    <author fullname="Bruce Davie" initials="B." surname="Davie">
      <organization>Cisco Systems, Inc.</organization>

      <address>
        <postal>
          <street>1414 Mass. Ave.</street>

          <city>Boxborough</city>

          <region>MA</region>

          <code>01719</code>

          <country>USA</country>
        </postal>

        <email>bsd@cisco.com</email>
      </address>
    </author>

    <author fullname="Luca Martini" initials="L." surname="Martini">
      <organization>Cisco Systems, Inc.</organization>

      <address>
        <postal>
          <street>9155 East Nichols Avenue, Suite 400.</street>

          <city>Englewood</city>

          <region>CO</region>

          <code>80112</code>

          <country>USA</country>
        </postal>

        <email>lmartini@cisco.com</email>
      </address>
    </author>

    <author fullname="Eric Rosen" initials="E." surname="Rosen">
      <organization>Cisco Systems, Inc.</organization>

      <address>
        <postal>
          <street>1414 Mass. Ave.</street>

          <city>Boxborough</city>

          <region>MA</region>

          <code>01719</code>

          <country>USA</country>
        </postal>

        <email>erosen@cisco.com</email>
      </address>
    </author>

    <date year="2009" />

    <abstract>
      <t>Pseudowires are sometimes used to carry non-TCP data flows. In these
      circumstances the service payload will not react to network congestion
      by reducing its offered load. Pseudowires should therefore reduce their
      network bandwidth demands in the face of significant packet loss,
      including, if necessary, completely ceasing transmission. Since it is
      difficult to determine a priori the number of equivalent TCP flows that
      a pseudowire represents, a suitably "fair" rate of back-off cannot be
      pre-determined. This document describes the pseudowire congestion
      problem and provides guidance on the development of suitable
      solutions.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>Pseudowires are sometimes used to emulate services that carry
      non-TCP data flows. In these circumstances the service payload will not
      react to network congestion by reducing its offered load. In order to
      protect the network, pseudowires (PWs) should therefore reduce their
      network bandwidth demands in the face of significant packet loss,
      including, if necessary, completely ceasing transmission.</t>

      <t>Pseudowires are commonly deployed as part of the managed
      infrastructure operated by a service provider. It is likely in these
      cases that the PW will be used to carry many TCP-equivalent flows. In
      these environments the operator needs to make a judgement call on a fair
      estimate of the number of equivalent flows that the PW represents. The
      counterpoint to a well-managed network is a plug-and-play network, of
      the type typically found in a consumer environment. Whilst PWs have been
      designed to provide an infrastructure tool for the service provider, we
      cannot be sure that they will not find some consumer or similar
      plug-and-play application. In this environment a conservative approach
      is needed, with PW congestion avoidance enabled by default, and the
      throughput set by default to the equivalent of a single TCP flow.</t>

      <t>This framework first considers the context in which PWs operate, and
      hence the context in which PW congestion control needs to operate. It
      then explores the issues of detection, feedback and load control in PW
      context, and the special issues that apply to constant bit rate
      services. Finally it considers the applicability of congestion control
      in various network environments.</t>

      <t>This document defines the framework to be used in designing a
      congestion control mechanism for PWs. It does not prescribe a solution
      for any particular PW type, and it is possible that more than one
      mechanism may be required, depending on the PW type and the environment
      in which it is to be deployed.</t>
    </section>

    <section anchor="Context" title="PW Congestion Context">
      <t />

      <section title="Congestion in IP Networks">
        <t>Congestion in an IP or MPLS enabled IP network occurs when the
        amount of traffic that needs to use a particular network resource
        exceeds the capacity of that resource. This results first in long
        queues within the network, and then in packet loss. If the amount of
        traffic is not then reduced, the packet loss rate will climb,
        potentially until it reaches 100%. If the situation is left unabated
        the network enters a state where it is no longer able to sustain any
        productive information flows and the result is "congestive
        collapse".</t>

        <t>To prevent this sort of "congestive collapse", there must be
        congestion control: a feedback loop by which the presence of
        congestion anywhere on the path from the transmitter to the receiver
        forces the transmitters to reduce the amount of traffic being sent. As
        a connectionless protocol, IP has no way to push back directly on the
        originator of the traffic. Procedures for (a) detecting congestion,
        (b) providing the necessary feedback to the transmitters, and (c)
        adjusting the transmission rates, are thus left to the higher protocol
        layers such as TCP.</t>
      </section>

      <section title="A Short Tutorial on TCP Congestion Control">
        <t>TCP includes an elaborate congestion control mechanism that causes
        the end systems to reduce their transmission rates when congestion
        occurs. For those readers not intimately familiar with the details of
        TCP congestion control, we give below a brief summary, greatly
        simplified and not entirely accurate, of TCP's very complicated
        feedback mechanism. The details of TCP congestion control can be found
        in <xref target="RFC2581" />. <xref target="RFC2001" /> is an earlier
        but more accessible discussion. <xref target="RFC2914" /> articulates
        a number of general principles governing congestion control in the
        Internet.</t>

        <t>In TCP congestion control, a lost packet is considered to be an
        indication of congestion. Roughly, TCP considers a given packet to be
        lost if that packet is not acknowledged within a specified time, or if
        three subsequent packets arrive at the receiver before the given
        packet. The latter condition manifests itself at the transmitter as
        the arrival of three duplicate acknowledgements in a row. The
        algorithm by which TCP detects congestion is thus highly dependent on
        the mechanisms used by TCP to ensure reliable and sequential
        delivery.</t>

        <t>Once a TCP transmitter becomes aware of congestion, it halves its
        transmission rate. If congestion still occurs at the new rate, the
        rate is halved again. When a rate is found at which congestion no
        longer occurs, the rate is increased by one MSS ("Maximum Segment
        Size") per RTT ("Round Trip Time"). The rate is increased each RTT
        until congestion is encountered again, or until something else limits
        it (e.g., the flow control window is reached, the application is
        transmitting at its maximum desired rate, or the line rate is
        reached).</t>

        <t>This sort of mechanism is known as an "Additive Increase,
        Multiplicative Decrease" (AIMD) mechanism. Congestion causes
        relatively rapid decreases in the transmission rate, while the absence
        of congestion causes relatively slow increases in the allowed
        transmission rate.</t>
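        <t>The AIMD rule described above can be sketched in a few lines. The
        following fragment is illustrative only (not part of any
        specification); the MSS and RTT values are hypothetical, and the rule
        is expressed in rate terms rather than TCP's actual window
        arithmetic:</t>

        <figure>
          <artwork><![CDATA[
```python
# Illustrative AIMD sketch: halve the rate on congestion, otherwise
# add one MSS worth of throughput per RTT. Values are hypothetical.

def aimd_update(rate, mss, rtt, congested):
    """Return the new sending rate (bytes/sec) after one RTT."""
    if congested:
        return rate / 2.0        # multiplicative decrease
    return rate + mss / rtt      # additive increase: one MSS per RTT

rate = 1_000_000.0               # bytes/sec
rate = aimd_update(rate, 1460, 0.5, congested=True)   # -> 500000.0
rate = aimd_update(rate, 1460, 0.5, congested=False)  # -> 502920.0
```
]]></artwork>
        </figure>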
      </section>

      <section title="Alternative Approaches to Congestion Control">
        <t><xref target="RFC2914" /> defines the notion of a "TCP-compatible
        flow":</t>

        <t>"A TCP-compatible flow is responsive to congestion notification,
        and in steady state uses no more bandwidth than a conformant TCP
        running under comparable conditions (drop rate, RTT [round trip time],
        MTU [maximum transmission unit], etc.)"</t>

        <t>TCP-compatible flows respond to congestion in much the way TCP
        does, so that they do not starve the TCP flows or otherwise obtain an
        unfair advantage. <xref target="RFC2914" /> further points out:</t>

        <t>"any form of congestion control that successfully avoids a high
        sending rate in the presence of a high packet drop rate should be
        sufficient to avoid congestion collapse from undelivered packets."</t>

        <t>"This does not mean, however, that concerns about congestion
        collapse and fairness with TCP necessitate that all best-effort
        traffic deploy congestion control based on TCP's Additive-Increase
        Multiplicative-Decrease (AIMD) algorithm of reducing the sending rate
        in half in response to each packet drop."</t>

        <t>"However, the list of TCP-compatible congestion control procedures
        is not limited to AIMD with the same increase/ decrease parameters as
        TCP. Other TCP-compatible congestion control procedures include
        rate-based variants of AIMD; AIMD with different sets of
        increase/decrease parameters that give the same steady-state behavior;
        equation-based congestion control where the sender adjusts its sending
        rate in response to information about the long-term packet drop rate
        ... and possibly other forms that we have not yet begun to
        consider."</t>

        <t>The AIMD procedures are not mandated for non-TCP traffic, and might
        not be optimal for non-TCP PW traffic. Choosing a proper set of
        procedures which are TCP-compatible while being optimized for a
        particular type of traffic is no simple task.</t>

        <t><xref target="RFC3448" />, "TCP Friendly Rate Control (TFRC)",
        provides an alternative: "TFRC is designed to be reasonably fair when
        competing for bandwidth with TCP flows, where a flow is "reasonably
        fair" if its sending rate is generally within a factor of two of the
        sending rate of a TCP flow under the same conditions. However, TFRC
        has a much lower variation of throughput over time compared with TCP,
        which makes it more suitable for applications such as telephony or
        streaming media where a relatively smooth sending rate is of
        importance."</t>

        <t>"For its congestion control mechanism, TFRC directly uses a
        throughput equation for the allowed sending rate as a function of the
        loss event rate and round-trip time. In order to compete fairly with
        TCP, TFRC uses the TCP throughput equation, which roughly describes
        TCP's sending rate as a function of the loss event rate, round-trip
        time, and packet size."</t>

        <t>"Generally speaking, TFRC's congestion control mechanism works as
        follows:</t>

        <t>
          <list style="symbols">
            <t>The receiver measures the loss event rate and feeds this
            information back to the sender.</t>

            <t>The sender also uses these feedback messages to measure the
            round-trip time (RTT).</t>

            <t>The loss event rate and RTT are then fed into TFRC's throughput
            equation, giving the acceptable transmit rate.</t>

            <t>The sender then adjusts its transmit rate to match the
            calculated rate."</t>
          </list>
        </t>

        <t>Note that the TFRC procedures require the transmitter to calculate
        a throughput equation. For these procedures to be feasible as a means
        of PW congestion control, they must be computationally efficient.
        Section 8 of <xref target="RFC3448" /> describes an implementation
        technique that appears to make the equation efficient to
        calculate.</t>
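        <t>As an illustration of what calculating the throughput equation
        involves, the following sketch implements the TCP throughput equation
        used by TFRC (Section 3.1 of <xref target="RFC3448" />), with the
        simplifications b = 1 and t_RTO = 4R suggested there. It is a sketch,
        not a conformant TFRC implementation:</t>

        <figure>
          <artwork><![CDATA[
```python
import math

def tfrc_rate(s, rtt, p, b=1):
    """Allowed sending rate in bytes/second (RFC 3448, Section 3.1).

    s: packet size in bytes; rtt: round-trip time in seconds;
    p: loss event rate (0 < p <= 1); b: packets per acknowledgement.
    """
    t_rto = 4 * rtt  # simplification recommended by RFC 3448
    denom = (rtt * math.sqrt(2 * b * p / 3)
             + t_rto * 3 * math.sqrt(3 * b * p / 8) * p * (1 + 32 * p ** 2))
    return s / denom

# A higher loss event rate yields a lower allowed sending rate.
```
]]></artwork>
        </figure>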

        <t>In the context of pseudowires, we note that TFRC aims to achieve
        comparable throughput in the long term to a single TCP connection
        experiencing similar loss and round trip time. As noted above, this
        may not be an appropriate rate for a PW that is carrying many
        flows.</t>
      </section>

      <section title="Pseudowires and TCP">
        <t>Currently, traffic in IP and MPLS networks is predominantly TCP
        traffic. Even the layer 2 tunnelled traffic (e.g., PPP frames
        tunnelled through L2TP) is predominantly TCP traffic from the
        end-users. If pseudowires (PWs) <xref target="RFC3985" /> were to be
        used only for carrying TCP flows, there would be no need for any
        PW-specific congestion mechanisms. The existing TCP congestion control
        mechanisms would be all that is needed, since any loss of packets on
        the PW would be detected as loss of packets on a TCP connection, and
        the TCP flow control mechanisms would ensure a reduction of
        transmission rate.</t>

        <t>If a PW is carrying non-TCP traffic, then there is no feedback
        mechanism to cause the end-systems to reduce their transmission rates
        in response to congestion. When congestion occurs, any TCP traffic
        that is sharing the congested resource with the non-TCP traffic will
        be throttled, and the non-TCP traffic may "starve" the TCP traffic. If
        there is enough non-TCP traffic to congest the network all by itself,
        there is nothing to prevent congestive collapse.</t>

        <t>The non-TCP traffic in a PW can belong to any higher layer
        whatsoever, and there is no way to ensure that a TCP-like congestion
        control mechanisms will be present to regulate the traffic rate. Hence
        it appears that there is a need for an edge-to-edge (i.e., PE-to-PE)
        feedback mechanism which forces a transmitting PE to reduce its
        transmission rate in the face of network congestion.</t>

        <t>As TCP uses window-based flow control, controlling the rate is
        really a matter of limiting the amount of traffic which can be "in
        flight" (i.e., transmitted but not yet acknowledged) at any one time.
        Where a non-windowed protocol is used for transmitting data on a PW a
        different technique is obviously required to control the transmission
        rate.</t>
      </section>

      <section title="Arguments Against PW Congestion as a Practical Problem">
        <t>One may argue that congestion due to non-TCP PW traffic is only a
        theoretical problem and that no congestion control mechanism is
        needed. For example, the following cases have been put forward:</t>

        <t>
          <list style="symbols">
            <t>"99.9% of all the traffic in PWs is really IP traffic" <vspace
            blankLines="1" />If this is the case, then the traffic is either
            TCP traffic, which is already congestion-controlled, or "other" IP
            traffic. While the congestion control issue may exist for the
            "other" IP traffic, this is a general issue and hence is outside
            the scope of the pseudowire design. <vspace
            blankLines="1" />Unfortunately, we cannot be sure that this is the
            case. It may well be the case for the PW offerings of certain
            providers, but perhaps not for others. It does appear that many
            providers want to be able to use PWs for transporting "legacy
            traffic" of various non-IP protocols. Constant bit-rate services
            are an example of this, and raise particular issues for
            congestion control (discussed below).</t>

            <t>"PW traffic usually stays within one SP's network, and an SP
            always engineers its network carefully enough so that congestion
            is an impossibility" <vspace blankLines="1" />Perhaps this will be
            true of "most" PWs, but inter-provider PWs are certainly expected
            to have a significant presence. <vspace blankLines="1" />Even
            within a single provider's network, the provider might consider
            whether he is so confident of his network engineering that he
            does not need a feedback loop reducing the transmission rate in
            response to congestion. <vspace blankLines="1" />There is also
            the issue of keeping the network running (i.e., out of congestive
            collapse) after an unexpected reduction of capacity. <vspace
            blankLines="1" />Finally, the PW may be deployed on the Internet
            rather than in an SP environment.</t>

            <t>"If one provider accepts PW traffic from another, policing
            will be done at the entry point to the second provider's network,
            so that the second provider is sure that the first provider is
            not sending too much traffic. This policing, together with the
            second provider's careful network engineering, makes congestion
            an impossibility" <vspace blankLines="1" />This could be the case
            given carefully controlled bilateral peering arrangements. Note
            though that if the second provider is merely providing transit
            services for a PW whose endpoints are in other providers'
            networks, it may be difficult for the transit provider to tell
            which traffic is the PW traffic and which is "ordinary" IP
            traffic.</t>

            <t>"The only time we really need a general congestion control
            mechanism is when traffic goes through the public Internet.
            Obviously this will never be the case for PW traffic." <vspace
            blankLines="1" />It is not at all difficult to imagine someone
            using an IPsec tunnel across the public Internet to transport a
            PW from one private IP network to another. <vspace
            blankLines="1" />Nor is it difficult to imagine some enterprise
            implementing a PW and transporting it across some SP's backbone,
            e.g., if that SP is providing VPN service to that enterprise.</t>
          </list>
        </t>

        <t>The arguments that non-TCP traffic in PWs will never make any
        significant contribution to congestion thus do not seem to be totally
        compelling.</t>
      </section>

      <section title="Goals and non-goals of this draft">
        <t>As a framework, this draft aims to explore the issues surrounding
        PW congestion and to lay out some of the design trade offs that will
        need to be made if a solution is to be developed. It does not intend
        to propose a particular solution, but it does point out some of the
        problems that will arise with certain solution approaches.</t>

        <t>The over-riding technical goal of this work is to avoid scenarios
        in which the deployment of PW technology leads to congestive collapse
        of the PSN over which the PWs are tunnelled. It is a non-goal of this
        work to ensure that PWs receive a particular level of Quality of
        Service (QoS) in order to protect them from congestion (<xref
        target="Valuable" />). While such an outcome may be desirable, it is
        beyond the scope of this draft.</t>
      </section>
    </section>

    <section title="Challenges for PW Congestion Management">
      <section title="Scale">
        <t>It might appear at first glance that an easy solution to PW
        congestion control would be to run the PWs through a TCP connection.
        This would provide congestion control automatically. However, the
        overhead is prohibitive for the PW application. The PWE3 data plane
        may be implemented in a micro-coded or hardware engine which needs to
        support thousands of PWs, and needs to do as little as possible for
        each data packet; running a TCP state machine, and implementing TCP's
        flow control procedures, would impose too high a processing and memory
        cost in this environment. Nor do we want to add the large overhead of
        TCP to the PWs -- the large headers, the plethora of small
        acknowledgements in the reverse direction, and so on. In fact, we need
        to avoid acknowledgements altogether. These same considerations lead
        us away from using, e.g., DCCP <xref target="RFC4340" /> or SCTP <xref
        target="RFC4960" />, even in its partially reliable mode <xref
        target="RFC3758" />. Therefore we will investigate some PW-specific
        solutions for congestion control.</t>

        <t>The PW design also needs to minimise the amount of interaction
        between the data processing path (which is likely to be distributed
        among a set of line cards) and the control path. It must be
        especially careful of interactions which might require atomic
        read/modify/write operations from the control path, or which might
        require atomic read/modify/write operations between different
        processors in a multiprocessing implementation, as such interactions
        can cause scaling problems.</t>

        <t>Thus, feasible solutions for PW-specific congestion will require
        scalable means to detect congestion and to reduce the amount of
        traffic sent into the network when congestion is detected. These
        topics are discussed in more detail in subsequent sections.</t>
      </section>

      <section anchor="cloops" title="Interaction among control loops">
        <t>As noted above, much of the traffic that is carried on PWs is
        likely to be TCP traffic, and will therefore be subject to the congestion
        control mechanisms of TCP. It will typically be difficult for a PW
        endpoint to tell whether or not this is the case. Thus there is the
        risk that the PE-PE congestion control mechanisms applied over the PW
        may interact in undesirable ways with the end-to-end congestion
        control mechanisms of TCP. The PW-specific congestion control
        mechanisms should be designed to minimise the negative impact of such
        interaction.</t>
      </section>

      <section title="Determining the Appropriate Rate">
        <t>TCP tends to share the bandwidth of a bottleneck among flows
        somewhat evenly, although flows with shorter round-trip times and
        larger MSS values will tend to get more throughput in the steady state
        than those with longer RTTs or smaller MSS. TFRC simply tries to
        deliver the same rate to a flow that a TCP flow would obtain in the
        steady state under similar conditions (loss rate, RTT and MSS). The
        challenge in the PW environment is to determine what constitutes a
        "flow". While it is tempting to consider a single PW to be equivalent
        to a "flow" for the purposes of fair rate estimation, it is likely in
        many cases that one PW will be carrying a large number of flows, in
        which case it would seem quite unfair to throttle that PW down to the
        same rate that a single TCP flow would obtain.</t>

        <t>The issue of what constitutes fairness and the perils of using the
        TCP flow as the basic unit of fairness have been explored at some
        length in <xref target="I-D.briscoe-tsvarea-fair" />.</t>

        <t>In the PW environment some estimate (measured or configured) of the
        number of flows in the PW needs to be applied to the estimate of fair
        throughput as part of the process of determining appropriate
        congestion behavior.</t>
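        <t>A minimal sketch of this adjustment follows. It assumes a
        hypothetical single-flow fair-rate estimate (e.g., from a TFRC-style
        computation) and an operator-configured or measured flow count; the
        function name and interface are illustrative, not taken from any
        standard:</t>

        <figure>
          <artwork><![CDATA[
```python
# Hypothetical sketch: scale a single-flow fair-rate estimate by a
# measured or configured count of the equivalent flows in the PW.

def pw_target_rate(fair_rate_one_flow, n_flows=1):
    """Aggregate target rate for a PW carrying n_flows equivalent flows."""
    if n_flows < 1:
        raise ValueError("a PW is assumed to carry at least one flow")
    return fair_rate_one_flow * n_flows

# Conservative plug-and-play default: n_flows=1 (a single TCP flow).
```
]]></artwork>
        </figure>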
      </section>

      <section anchor="CBRPW" title="Constant Bit Rate PWs">
        <t>Some types of PW, for example SAToP (Structure Agnostic TDM over
        Packet) <xref target="RFC4553" />, CESoPSN (Circuit Emulation over
        Packet Switched Networks) <xref target="RFC5086" />, TDM over IP <xref
        target="RFC5087" /><xref target="RFC4842" />, SONET/SDH and Constant
        Bit Rate ATM PWs represent an inelastic constant bit-rate (CBR) flow.
        Such PWs cannot respond to congestion in a TCP-friendly manner
        prescribed by <xref target="RFC2914" />. However this inability to
        respond well to congestion by reducing the amount of total bandwidth
        consumed if offset by the fact that such a PW maintains a constant
        bandwidth demand rather than being greedy (in the sense of trying to
        increase their share of a link, as TCP does). AIMD or even more
        gradual TFRC techniques are clearly not applicable to such services;
        it is not feasible to reduce the rate of a CBR service without
        violating the service definition. Such services are also frequently
        more sensitive to packet loss than connectionless packet PWs. Given
        that CBR services are not greedy, there may be a case for allowing
        them greater latitude during congestion peaks. If some CBR PWs are not
        able to endure any significant packet loss or reduction in rate
        without compromising the transported service, such PWs must be shut
        down when the level of congestion becomes excessive, but at
        suitably low levels of congestion they should be allowed to continue
        to offer traffic to the network.</t>

        <t>Some CBR services may be carried over connectionless packet PWs. An
        example of such a case would be a CBR MPEG-2 video stream carried over
        an Ethernet PW. One could argue that such a service - provided the
        rate was policed at the ingress PE - should be offered the same
        latitude as a PW that explicitly provided a CBR service. Likewise,
        there may not be much value in trying to throttle such a service
        rather than cutting it off completely during severe congestion.
        However, this clearly raises the issue of how to know that a PW is
        indeed carrying a CBR service.</t>
      </section>


      <section anchor="Valuable" title="Valuable and Vulnerable PWs">
        <t>Some PWs are a premium service, for which the user has paid the
        provider for higher availability, a lower packet loss rate, etc. Some
        PWs, for example the CBR services described in <xref
        target="CBRPW" />, are particularly vulnerable to packet loss. In
        both of these cases it may be tempting to relax the congestion
        considerations so that these services can press on as best they can
        in the event of congestion. However, a more appropriate strategy is
        to engineer the network paths, QoS parameters, queue sizes and
        scheduling parameters so that these services do not suffer congestion
        discard. If, despite this network engineering, these services still
        experience congestion, that is an indication that the network is
        having difficulty servicing their needs, and the services, like any
        other service, should reduce their network load.</t>
      </section>
    </section>

    <section title="Congestion Control Mechanisms">
      <t>There are three components of the congestion control mechanism that
      we need to consider:<list style="numbers">
          <t>Congestion state detection</t>

          <t>Feedback from the receiver to the transmitter</t>

          <t>Method used by the transmitter to respond to congestion state</t>
        </list></t>

      <t>References to the congestion state apply both to the detection of
      the onset of congestion, and to the detection of the return of the
      network to a state in which greater or normal bandwidth is
      available.</t>

      <t>We discuss the design framework for each of these components in the
      sections below.</t>
    </section>

    <section anchor="detect" title="Detecting Congestion">
      <t>In TCP, congestion is detected by the transmitter; the receipt of
      three successive duplicate TCP acknowledgements is taken to be
      indicative of congestion. What this actually means is that several
      packets in a row were received at the remote end, such that none of
      those packets had the next expected sequence number. This is interpreted
      as meaning that the packet with the next expected sequence number was
      lost in the network, and the loss of a single packet in the network is
      taken as a sign of congestion. (Naturally, the presence of congestion is
      also inferred if TCP has to retransmit a packet.) Note that it is
      possible for misordered packets to be misinterpreted as lost packets, if
      they do not arrive "soon enough".</t>

      <t>In TCP, a time-out while awaiting an acknowledgement is also
      interpreted as a sign of congestion.</t>

      <t>Since there are normally no acknowledgements on a PW (the only PW
      design that has so far attempted a sophisticated flow control is <xref
      target="I-D.ietf-pwe3-fc-flow"></xref>), the PW-specific congestion
      control mechanism cannot normally be based on either the presence of or
      the absence of acknowledgements. Some types of pseudowire (the CBR PWs)
      have a single bit that indicates that a preset amount of data has been
      lost, but this is a non-quantitative indicator. CBR PWs have the
      advantage that there is a constant two-way data flow, while other PW
      types do not have the constant symmetric flow of payload on which to
      piggyback the congestion notification. Most PW types therefore provide
      no way for a transmitter to determine (or even to make an educated guess
      as to) whether any data has been lost.</t>

      <t>Thus we need to add a mechanism for determining whether data packets
      on a PW have become lost. There are several possible methods for doing
      this:<list style="symbols">
          <t>Detect Congestion Using PW Sequence Numbers</t>

          <t>Detect Congestion Using Modified VCCV Packets <xref
          target="RFC5085"></xref></t>

          <t>Rely on Explicit Congestion Notification (ECN) <xref
          target="RFC3168"></xref></t>
        </list>We discuss each option in turn in the following sections.</t>

      <section title="Using Sequence Numbers to Detect Congestion">
        <t>When the optional sequencing feature is in use on a PW <xref
        target="RFC4385"></xref>, it is necessary for the receiver to maintain
        a "next expected sequence number" for the PW. If a packet arrives with
        a sequence number that is earlier than the next expected (a
        "misordered packet"), the packet is discarded; if it arrives with a
        sequence number that is greater than or equal to the next expected,
        the packet is delivered, and the next expected sequence number becomes
        the sequence number of the current packet plus 1.</t>
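        <t>A minimal sketch of these receive-side rules, assuming a simple
        integer sequence space (a real implementation must also handle
        sequence-number wraparound using modular comparison, which is
        omitted here):</t>

        <figure><artwork><![CDATA[
```python
# Sketch of the receiver's "next expected sequence number" rules
# described above.  Simplification: no sequence-number wraparound.
def process_packet(state, seq_num):
    """state holds 'next_expected'; return True if the packet is delivered."""
    if seq_num < state["next_expected"]:
        return False                      # misordered packet: discard
    state["next_expected"] = seq_num + 1  # in order, or after a gap: deliver
    return True
```
]]></artwork></figure>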

        <t>It is easy to tell when one or more packets are missing (i.e.,
        there is a "gap" in the sequence space) -- that is the case when a
        packet arrives whose sequence number is greater than the next
        expected. What is difficult to tell is whether any misordered
        packets that arrive after the gap are in fact the missing packets.
        One could imagine that the receiver remembers the sequence number of
        each missing packet for a period of time, and checks off each such
        sequence number if a misordered packet carrying that sequence number
        later arrives. The difficulty lies in doing this efficiently enough
        to be handled by the microcoded hardware on the PW data path. This
        approach does not seem feasible.</t>

        <t>One could make certain simplifying assumptions, such as assuming
        that the presence of any gaps at all indicates congestion. While this
        assumption makes it feasible to use the sequence numbers to "detect
        congestion", it also throttles the PW unnecessarily if there is really
        just misordering and no congestion. Such an approach would be
        considerably more likely to misinterpret misordering as congestion
        than would TCP's approach.</t>

        <t>An intermediate approach would be to keep track of the number of
        missing packets and the number of misordered packets for each PW. One
        could "detect congestion" if the number of missing packets is
        significantly larger than the number of misordered packets over some
        sampling period. However, gaps occurring near the end of a sampling
        period would tend to result in false indications of congestion. To
        avoid this, one might smooth the results over several sampling
        periods. While this would tend to decrease responsiveness, there is
        an inevitable trade-off between the rapidity of response and the
        rate of false alarms.</t>
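        <t>A sketch of this intermediate approach follows. The per-PW
        counters, the "significance" factor of two, and the smoothing across
        several sampling periods are all illustrative assumptions, not
        recommended values:</t>

        <figure><artwork><![CDATA[
```python
# Infer congestion if, summed over the recent sampling periods supplied,
# missing packets significantly outnumber misordered packets.  The
# threshold of 2.0 is an arbitrary illustrative choice.
def congested(missing_counts, misordered_counts, threshold=2.0):
    missing = sum(missing_counts)
    misordered = sum(misordered_counts)
    return missing > threshold * max(misordered, 1)
```
]]></artwork></figure>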

        <t>One would not expect the hardware or microcode to keep track of the
        sampling period; presumably software would read the necessary counters
        from hardware at the necessary intervals.</t>

        <t>Such a scheme would have the advantage of being based on existing
        PW mechanisms. However, it has the disadvantage of requiring
        sequencing, which introduces a fairly complicated interaction between
        the control processing and the data path, and precludes the use of
        some pipelined forwarding designs.</t>
      </section>

      <section title="Using VCCV to Detect Congestion">
        <t>It is reasonable to suppose that the hardware keeps counts of the
        number of packets sent and received on each PW. Suppose that the
        transmitting PE periodically inserts VCCV packets into the PW data
        stream, where each VCCV packet carries:</t>

        <t><list style="symbols">
            <t>A sequence number, increasing by 1 for each successive VCCV
            packet;</t>

            <t>The current value of the transmission counter for the PW</t>
          </list> We assume that the size of the counter is such that it
        cannot wrap during the interval between n VCCV packets, for some n
        &gt; 1.</t>

        <t>When the receiver gets one of these VCCV packets on a PW, it
        inserts the count of received packets for that PW into the packet
        (or into the packet metadata), and then delivers the VCCV packet to
        the software. The receiving software can now compute, for the inter-VCCV
        intervals, the count of packets transmitted and the count of packets
        received. The presence of congestion can be inferred if the count of
        packets transmitted is significantly greater than the count of packets
        received during the most recent interval. Even the loss rate could be
        calculated. The loss rate calculated in this way could be used as
        input to the TFRC rate equation.</t>
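        <t>Under this scheme the per-interval loss rate follows directly
        from the counter deltas carried in (and stamped into) successive
        VCCV packets; a sketch, with illustrative parameter names:</t>

        <figure><artwork><![CDATA[
```python
# Loss rate over one inter-VCCV interval, computed from the transmit
# counter carried in the VCCV and the receive counter stamped in by the
# receiving hardware.  Counter wrap is assumed not to occur within the
# interval, per the text above.
def interval_loss_rate(prev_tx, prev_rx, cur_tx, cur_rx):
    sent = cur_tx - prev_tx
    received = cur_rx - prev_rx
    if sent <= 0:
        return 0.0
    return max(sent - received, 0) / sent
```
]]></artwork></figure>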

        <t>VCCV messages would not need to be sent on a PW (for the purpose of
        detecting congestion) in the absence of traffic on that PW.</t>

        <t>Of course, misordered packets that are sent during one interval but
        arrive during the next will throw off the loss rate calculation; hence
        the difference between sent traffic and received traffic should be
        "significant" before the presence of congestion is inferred. The value
        of "significance" can be made larger or smaller depending on the
        probability of misordering.</t>

        <t>Note that congestion can cause a VCCV packet to go missing, and
        anything that misorders packets can misorder a VCCV packet as well as
        any other. One may not want to infer the presence of congestion if a
        single VCCV packet does not arrive when expected, as it may just be
        delayed in the network, even if it hasn't been misordered. However,
        failure to receive a VCCV packet after a certain amount of time has
        elapsed since the last VCCV was received (on a particular PW) may be
        taken as evidence of congestion. This scheme has the disadvantage of
        requiring periodic VCCV packets, and it requires VCCV packet formats
        to be modified to include the necessary counts. However, the
        interaction between the control path and the data path is very simple,
        as there is no polling of counters, no need for timers in the data
        path, and no need for the control path to do read-modify-write
        operations on the data path hardware. A bigger disadvantage may arise
        from the possible inability to ensure that the transmit counts in the
        VCCVs are exactly correct. The transmitting hardware may not be able
        to insert a packet count in the VCCV immediately before transmission
        of the VCCV on the wire, and if it cannot, the count of transmitted
        packets will only be approximate.</t>

        <t>Neither scheme can provide the same type of continuous feedback
        that TCP gets. TCP gets a continuous stream of acknowledgments,
        whereas this type of PW congestion detection mechanism would only be
        able to say whether congestion occurred during a particular interval.
        If the interval is about 1 RTT, the PW congestion control would be
        approximately as responsive as TCP congestion control, and there does
        not seem to be any advantage to making it smaller. However, sampling
        at an interval of 1 RTT might generate excessive amounts of overhead.
        Sampling at longer intervals would reduce responsiveness to congestion
        but would not necessarily render the congestion control mechanism
        "TCP-unfriendly".</t>
      </section>

      <section title="Explicit Congestion Notification">
        <t>In networks that support explicit congestion notification (ECN)
        <xref target="RFC3168"></xref> the ECN notification provides
        congestion information to the PEs before the onset of congestion
        discard. This is particularly useful to PWs that are sensitive to
        packet loss, since it gives the PE the opportunity to intelligently
        reduce the offered load. ECN marking rates of packets received on a PW
        could be used to calculate the TFRC rate for a PW. However ECN is not
        widely deployed at the time of writing; hence it seems that PEs must
        also be capable of operating in a network where packet loss is the
        only indicator of congestion.</t>
      </section>
    </section>

    <section title="Feedback from Receiver to Transmitter">
      <t>Given that the receiver can tell, for each sampling interval, whether
      or not a PW's traffic has encountered congestion, the receiver must
      provide this information as feedback to the transmitter, so that the
      transmitter can adjust its transmission rate appropriately. The feedback
      could be as simple as a bit stating whether or not there was any packet
      loss during the specified interval. Alternatively, the actual loss rate
      could be provided in the feedback, if that information turns out to be
      useful to the transmitter (e.g. to enable it to calculate an appropriate
      rate at which to send). There are a number of possible ways in which the
      feedback can be provided: control plane, reverse data traffic, or VCCV
      messages. We discuss each in turn below.</t>

      <section title="Control Plane Feedback">
        <t>A control message can be sent periodically to indicate the presence
        or absence of congestion. For example, when LDP is the control
        protocol <xref target="RFC4447"></xref>, the control message would of
        course be delivered reliably by TCP. (The same considerations apply
        for any protocol which has a reliable control channel.) When
        congestion is detected, a control message can be sent indicating that
        fact. No further congestion control messages would need to be sent
        until congestion is no longer detected. If the loss rate is being
        sent, changes in the loss rate would need to be sent as well. When
        there is no longer any congestion, a message indicating the absence of
        congestion would have to be sent.</t>

        <t>Since congestion in the reverse direction can prevent the delivery
        of these control messages, periodic "no congestion detected" messages
        would need to be sent whenever there is no congestion. Failure to
        receive these in a timely manner would lead the control protocol peer
        to infer that there is congestion. (Actually, there might or might not
        be congestion in the transmitting direction, but in the absence of any
        feedback one cannot assume that everything is fine.) If control
        messages really cannot get through at all, control protocol
        keep-alives will fail and the control connection will go down
        anyway.</t>

        <t>If the control messages simply say whether or not congestion was
        detected, then given a reliable control channel, periodic messages are
        not needed during periods of congestion. Of course, if the control
        messages carry more data, such as the loss rate, then they need to be
        sent whenever that data changes.</t>

        <t>If it is desired to control congestion on a per-tunnel basis, these
        control messages will simply say that there was congestion on some PW
        (one or more) within the tunnel. If it is desired to control
        congestion on a per-PW basis, the control message can list the PWs
        which have experienced congestion, most likely by listing the
        corresponding labels. If the VCCV method of detecting congestion is
        used, one could even include the sent/received statistics for
        particular VCCV intervals.</t>

        <t>This method is very simple, as one does not have to worry about the
        congestion control messages themselves getting lost or out of
        sequence. Feedback traffic is minimized, as a single control message
        relays feedback about an entire tunnel.</t>
      </section>

      <section title="Using Reverse Data Packets for Feedback">
        <t>If a receiver detects congestion on a particular PW, it can set a
        bit in the data packets that are travelling on that PW in the reverse
        direction; when no congestion is detected, the bit would be clear. The
        bit would be ignored on any packet that is received out of sequence,
        of course. There are several disadvantages to this technique:<list
            style="symbols">
            <t>There may be no (or insufficient) data traffic in the reverse
            direction</t>

            <t>Sequencing of the data stream is required</t>

            <t>The transmission of the congestion indications is not
            reliable</t>

            <t>The most one could hope to convey is one bit of information per
            PW (if there is even a bit available in the encapsulation).</t>
          </list></t>
      </section>

      <section title="Reverse VCCV Traffic">
        <t>Congestion indications for a particular PW could be carried in VCCV
        packets travelling in the reverse direction on that PW. Of course,
        this would require that the VCCV packets be sent periodically in the
        reverse direction whether or not there is reverse direction traffic.
        For congestion feedback purposes they might need to be sent more
        frequently than they'd need to be sent for OAM purposes. It would also
        be necessary for the VCCVs to be sequenced (with respect to each
        other, not necessarily with respect to the data stream). Since VCCV
        transmission is unreliable, one would want to send multiple VCCVs
        within whatever period we want to be able to respond in. Further, this
        method provides no means of aggregating congestion information into
        information about the tunnel.</t>
      </section>
    </section>

    <section title="Responding to Congestion">
      <t>In TCP, one tends to think of the transmission rate in terms of MTUs
      per RTT, which defines the maximum number of unacknowledged packets that
      TCP is allowed to maintain "in flight". Upon detection of a lost packet,
      this rate is halved ("multiplicative decrease"). It will be halved again
      approximately every RTT until the missing data gets through. Once all
      missing data has gotten through, the transmission rate is increased by
      one MTU per RTT. Every time a new acknowledgment (i.e., not a duplicate
      acknowledgment) is received, the rate is similarly increased (additive
      increase). Thus TCP can adjust its transmit rate very rapidly, i.e.,
      it responds on the order of an RTT. By contrast, TCP-friendly rate
      control adjusts its rate rather more gradually.</t>
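      <t>Recast as a rate (since, as discussed below, PWE3 rate control
      would be rate-based rather than window-based), the
      additive-increase/multiplicative-decrease behaviour can be sketched
      as follows; the parameters are illustrative only:</t>

      <figure><artwork><![CDATA[
```python
# One AIMD adjustment step, applied roughly once per RTT.
# rate is in bytes/s, mtu in bytes, rtt in seconds.
def aimd_update(rate, loss_detected, mtu, rtt):
    if loss_detected:
        return rate / 2.0        # multiplicative decrease
    return rate + mtu / rtt      # additive increase: one MTU per RTT
```
]]></artwork></figure>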

      <t>For simplicity, this discussion only covers the "congestion
      avoidance" phase of TCP congestion control. The analogy of TCP's "slow
      start phase" would also be needed.</t>

      <t>TCP can easily estimate the RTT, since all its transmissions are
      acknowledged. In PWE3, the best way to estimate the RTT might be via the
      control protocol. In fact, if the control protocol is TCP-based, getting
      the RTT estimate from TCP might be a good option.</t>

      <t>TCP's rate control is window-based, expressed as a number of bytes
      that can be in flight. PWE3's rate control would need to be rate-based.
      The TFRC specification <xref target="RFC3448"></xref> provides the
      equation for the TCP-friendly rate for a given loss rate, RTT, and MTU.
      Given some means of determining the loss rate, as described in <xref
      target="detect"></xref>, the TCP friendly rate for a PW or a tunnel can
      be calculated at the ingress PE. However, as we noted earlier, a PW
      may be carrying many flows, in which case the use of a single-flow
      TCP rate would be a significant underestimate of the true fair rate
      for the aggregate, and hence damaging to the operation of the PW.</t>
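      <t>For reference, the TFRC throughput equation of RFC 3448 (with b,
      the number of packets acknowledged per ACK, set to 1) gives the
      allowed rate X in bytes per second from the segment size s, the
      round-trip time R and the loss event rate p. A sketch, taking the
      retransmission timeout t_RTO as 4R as that specification
      suggests:</t>

      <figure><artwork><![CDATA[
```python
import math

# TFRC throughput equation (RFC 3448, with b = 1):
#   X = s / (R*sqrt(2p/3) + t_RTO * 3*sqrt(3p/8) * p * (1 + 32p^2))
def tfrc_rate(s, R, p):
    if p <= 0:
        raise ValueError("loss event rate must be positive")
    t_rto = 4 * R
    denom = (R * math.sqrt(2 * p / 3)
             + t_rto * 3 * math.sqrt(3 * p / 8) * p * (1 + 32 * p ** 2))
    return s / denom
```
]]></artwork></figure>

      <t>As the text notes, a PW carrying many flows might fairly be
      allowed a multiple of this single-flow rate.</t>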

      <t>If the congestion detection mechanism only produces an approximate
      result, the probability of a "false alarm" (thinking that there is
      congestion when there really is not) for some interval becomes
      significant. It would be better then to have some algorithm which
      smoothes the result over several intervals. The TFRC procedures, which
      tend to generate a smoother and less abrupt change in the transmission
      rate than the AIMD procedures, may also be more appropriate in this
      case.</t>

      <t>Once a PE has determined the appropriate rate at which to transmit
      traffic on a given PW or tunnel, it needs some means to enforce that
      rate via policing, shaping, or selective shutting down of PWs. There are
      tradeoffs to be made among these options, depending on various factors
      including the higher layer service that is carried. The effect of
      different mechanisms when the higher layer traffic is already using TCP
      is discussed below.</t>

      <section title="Interaction with TCP">
        <t>Ideally there should be no PW-specific congestion control mechanism
        used when the higher layer traffic is already running over TCP and is
        thus subject to TCP's existing congestion control. However it may be
        difficult to determine what the higher layer is on any given PW. Thus,
        interaction between PW-specific congestion control and TCP's
        congestion control needs to be considered.</t>

        <t>As noted in <xref target="cloops"></xref>, a PW-specific congestion
        control mechanism may interact poorly with the "outer" control loop of
        TCP if the PW carries TCP traffic. A well-documented example of such
        poor interaction is a token bucket policer that drops packets outside
        the token bucket. TCP has difficulty finding the "bottleneck"
        bandwidth in such an environment and tends to overshoot, incurring
        heavy losses and consequent loss of throughput.</t>

        <t>A shaper that queues packets at the PE and only injects them into
        the network at the appropriate rate may be a better choice, but may
        still interact unpredictably with the "outer control loop" of TCP
        flows that happen to traverse the PW. This issue warrants further
        study.</t>

        <t>Another possibility is simply to shut down a PW when the rate of
        traffic on the PW significantly exceeds the appropriate rate that has
        been determined for the PW. While this might be viewed as draconian,
        it does ensure that any PW that is allowed to stay up will behave in a
        predictable manner. Note that this would also be the most likely
        choice of action for CBR PWs (as discussed in <xref
        target="cbr"></xref>). Thus all PWs would be treated alike and there
        would be no need to try to determine what sort of upper layer payload
        a PW is carrying.</t>
      </section>
    </section>

    <section title="Rate Control per Tunnel vs. per PW">
      <t>Rate controls can be applied on a per-tunnel basis or on a per-PW
      basis. Applying them on a per-tunnel basis (and obtaining congestion
      feedback on a per-tunnel basis) would seem to provide the most efficient
      and most scalable system. Achieving fairness among the PWs then becomes
      a local issue for the transmitter. However, if the different PWs follow
      different paths through the network (e.g. because of ECMP over the
      tunnel), it is possible that some PWs will encounter congestion while
      some will not. If rate controls are applied on a per-tunnel basis, then
      if any PW in a tunnel is affected by congestion, all the PWs in the
      tunnel will be throttled. While this is sub-optimal, it is not clear
      that this would be a significant problem in practice, and it may still
      be the best trade-off.</t>

      <t>Per-tunnel rate control also has some desirable properties if the
      action taken during congestion is to selectively shut down certain PWs.
      Since a tunnel will typically carry many PWs, it will be possible to
      make relatively small adjustments in the total bandwidth consumed by the
      tunnel by selectively shutting down or bringing up one or more PWs.</t>
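      <t>As an illustration of making such small adjustments, a PE might
      select PWs to shut down until the excess rate on the tunnel is
      covered. The greedy smallest-first policy below is purely an
      assumption, standing in for whatever local policy the transmitter
      applies:</t>

      <figure><artwork><![CDATA[
```python
# Choose PWs to shut down so that their combined rate covers the
# tunnel's excess over its target rate, preferring the smallest PWs
# (an assumed policy; a real PE would apply local policy such as
# per-PW priority).
def pws_to_shut_down(pw_rates, excess):
    victims = []
    for pw_id, rate in sorted(pw_rates.items(), key=lambda kv: kv[1]):
        if excess <= 0:
            break
        victims.append(pw_id)
        excess -= rate
    return victims
```
]]></artwork></figure>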

      <t>Note again the issue of estimating the correct rate: we know how
      many PWs there are per tunnel, but we do not know how many flows there
      are per PW.</t>
    </section>

    <section anchor="cbr" title="Constant Bit Rate Services">
      <t>As noted above, some PW services may require a fixed rate of
      transmission, and it may be impossible to provide the service while
      throttling the transmission rate. To provide such services, the network
      paths must be engineered so that congestion is impossible; providing
      such services over the Internet is thus not very likely. In fact, as
      congestion control cannot be applied to such services, it may be
      necessary to prohibit these services from being provided in the
      Internet, except in the case where the payload is known to consist of
      TCP connections or other traffic that is congestion-controlled by the
      end-points. It is not clear how such a prohibition could be
      enforced.</t>

      <t>The only feasible mechanism for handling congestion affecting CBR
      services would appear to be to selectively turn off PWs, or channels
      with the PW, when congestion occurs. Clearly it is important to avoid
      "false alarms" in this case. It is also important to avoid bringing PWs
      (or channels within the PW) back up too quickly and re-introducing
      congestion.</t>

      <t>The idea of controlling rate per tunnel rather than PW, discussed
      above, seems particularly attractive when some of the PWs are CBR.
      First, it provides the possibility that non-CBR PWs could be throttled
      before it is necessary to shut down the CBR PWs. Second, with the
      aggregation of multiple PWs on a single rate-controlled tunnel, it
      becomes possible to gradually increase or decrease the total offered
      load on the tunnel by selectively bringing up or shutting down PWs. As
      noted above, local policies at a PE could be used to determine which PWs
      to shut down or bring up first. Similar approaches would apply if the
      CBR PW offers a channelized service, with selected channels being shut
      down and brought up to control the total rate of the PW.</t>
    </section>

    <section title="Managed vs Unmanaged Deployment">
      <t>As discussed in <xref target="Context"></xref>, there is a
      significant set of scenarios in which PW-specific congestion control
      may not be necessary. One might therefore argue that it does not make
      sense to require PW-specific congestion control to be used on all PWs
      at all times. On the other hand, if the option of turning off
      PW-specific congestion control is available, there is nothing to stop
      a provider from turning it off in inappropriate situations. As this
      may contribute to congestive collapse outside the provider's own
      network, it may not be advisable to allow this.</t>

      <t>The circumstances in which it is most vocally argued that
      congestion control is not needed are those where the PW is part of the
      service provider's own network infrastructure. In cases where the PW
      is deployed in a managed, well-engineered network, it is probably
      necessary to permit the operator to take the congestion risk upon
      themselves, if they so desire, by disabling any congestion control
      mechanism. Ironically, work in progress on the design of an MPLS
      Transport Profile indicates that some of these users require that OAM
      be run (end-to-end, or over part of the tunnel, known as tandem
      monitoring), and that when the OAM detects that performance parameters
      (delay, jitter or packet loss) are breached, this be used to trigger a
      failover to a backup path. When the backup path mechanism operates in
      1:1 mode, this moves the congestive traffic to another network path,
      which is the characteristic we require. [** add reference **]</t>

      <t>Further, in such an environment it is likely that the PW will be
      used to carry many TCP-equivalent flows. In these managed
      circumstances the operator will have to make a judgement call on a
      fair estimate of the number of equivalent flows that the PW
      represents, and set the congestion parameters accordingly.</t>

      <t>The counterpoint to a well-managed network is a plug-and-play
      network, of the type typically found in a consumer environment. Whilst
      PWs have been designed as an infrastructure tool for the service
      provider, we cannot be sure that they will not find some consumer or
      similar plug-and-play application. In such an environment a
      conservative approach seems appropriate, and the default PW
      configuration MUST enable the congestion avoidance mechanism, with
      parameters set to the equivalent of a single TCP flow.</t>
    </section>

    <section title="Related Work: Pre-Congestion Notification">
      <t>It has been suggested that Pre-congestion Notification (PCN) <xref
      target="I-D.ietf-pcn-architecture"></xref> might provide a basis for
      addressing the PW congestion control problem. Using PCN, it would
      potentially be possible to determine if the level of congestion
      currently existing between an ingress and an egress PE was sufficiently
      low to safely allow a new PW to be established. PCN's pre-emption
      mechanisms could be used to notify a PE that one or more PWs need to be
      brought down, which again could be coupled with local policies to
      determine exactly which PWs should be shut down first. This approach
      certainly merits further examination, but we note that PCN is
      considerably further away from deployment in the Internet than ECN, and
      thus cannot be considered as a near-term solution to the problem of
      PW-induced congestion in the Internet.</t>
    </section>

    <section title="IANA Considerations">
      <t>This section may be removed by the RFC Editor.</t>

      <t>There are no IANA actions required by this document.</t>
    </section>

    <section title="Security Considerations">
      <t>This document describes a framework and does not itself specify a
      protocol; the security considerations of any congestion control
      mechanism based on this framework will need to be addressed in the
      specification of that mechanism. Note in particular that falsified
      congestion feedback could be used to throttle or shut down a PW, so
      any feedback mechanism needs to be protected against spoofing.</t>
    </section>

    <section title="Conclusion">
      <t>Pseudowires are sometimes used to emulate services that carry
      non-TCP data flows. In these circumstances the service payload will
      not react to network congestion by reducing its offered load.
      Pseudowires SHOULD therefore reduce their network bandwidth demands in
      the face of significant packet loss, if necessary by completely
      ceasing transmission. In some service provider network environments
      the network operator may choose not to deploy congestion avoidance
      mechanisms, but they should make that choice in the full knowledge
      that they are forgoing a network safety valve. In an unmanaged
      environment a conservative approach to congestion is REQUIRED.</t>

      <t>Since it is difficult to determine a priori the number of
      equivalent TCP flows that a pseudowire represents, a suitably "fair"
      rate of back-off cannot easily be pre-determined. In a managed
      environment, setting the congestion parameters will require some level
      of informed judgement. In an unmanaged environment the number of
      equivalent TCP flows should be set to one.</t>

      <t>The selection of the appropriate mechanism(s) to implement congestion
      avoidance is work in progress in the IETF PWE3 working group.</t>
    </section>
  </middle>

  <back>
    <references title="Informative References">
      <?rfc include='reference.I-D.ietf-pcn-architecture'?>

      <?rfc include='reference.I-D.briscoe-tsvarea-fair'?>

      <?rfc include='reference.RFC.5085'?>

      <?rfc include='reference.RFC.5086'?>

      <?rfc include='reference.RFC.5087'?>

      <?rfc include='reference.RFC.4842'?>

      <?rfc include='reference.I-D.ietf-pwe3-fc-flow'?>

      <?rfc include="reference.RFC.2001"?>

      <?rfc include="reference.RFC.2119"?>

      <?rfc include="reference.RFC.2581"?>

      <?rfc include="reference.RFC.2914"?>

      <?rfc include='reference.RFC.3168'?>

      <?rfc include='reference.RFC.3448'?>

      <?rfc include='reference.RFC.3985'?>

      <?rfc include='reference.RFC.4340'?>

      <?rfc include='reference.RFC.4385'?>

      <?rfc include='reference.RFC.4447'?>

      <?rfc include='reference.RFC.4553'?>

      <?rfc include='reference.RFC.4960'?>

      <?rfc include='reference.RFC.3758'?>
    </references>
  </back>
</rfc>
