Internet Engineering Task Force     I. Stoica            UC, Berkeley
 Internet Draft                      H. Zhang             CMU
 Expires  April 2003                 N. Venkitaraman      Motorola Labs
                                     J. Mysore            Motorola Labs

                                     October 2002

             Per Hop Behaviors Based on Dynamic Packet State
                   <draft-stoica-diffserv-dps-02.txt>

 Status of this Memo

      This document is an Internet-Draft and is in full conformance with
      all provisions of Section 10 of RFC2026.

      Internet-Drafts are working documents of the Internet Engineering
      Task Force (IETF), its areas, and its working groups. Note that other
      groups may also distribute working documents as Internet-Drafts.

      Internet-Drafts are draft documents valid for a maximum of six months
      and may be updated, replaced, or obsoleted by other documents at any
      time. It is inappropriate to use Internet-Drafts as reference
      material or to cite them other than as "work in progress."

      The list of current Internet-Drafts can be accessed at
      http://www.ietf.org/ietf/1id-abstracts.txt

      The list of Internet-Draft Shadow Directories can be accessed at
      http://www.ietf.org/shadow.html.


 Copyright Notice

      Copyright (C) The Internet Society (2002).  All Rights Reserved.

 Abstract

    This document proposes a family of Per-Hop Behaviors (PHBs) 
    based on Dynamic Packet State (DPS) in the context of the 
    differentiated service architecture. With these PHBs, distributed
    algorithms can be devised to implement services with flexibility,
    utilization, and assurance levels similar to those that can be 
    provided with per-flow mechanisms.

    With Dynamic Packet State, each packet carries in its header, in
    addition to the PHB codepoint, some PHB-specific state. The state
    is initialized by the ingress node. Interior nodes process each
    incoming packet based on the state carried in the packet's 
    header, updating both its internal state and the state in the 
    packet's header before forwarding it to the next hop. By using 
    DPS to coordinate actions of edge and interior nodes along the 
    path traversed by a flow, distributed algorithms can be designed
    to approximate the behavior of a broad class of "stateful" 
    networks using networks in which interior nodes do not maintain 


  Stoica et al                 Expires April 2003                  [Page 1]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

    per-flow state. We give examples of services that can be implemented by
    PHBs based on DPS. We also discuss several possible solutions for encoding
    Dynamic Packet State that have the minimum incompatibility with IPv4.

 1. Introduction

    While the diffserv architecture [Blake98] is highly scalable, 
    the services it can provide have lower flexibility, utilization,
    or assurance levels than services provided by architectures 
    that employ per-flow management at every node. It is debatable
    whether this should be of significant concern. For example, the
    low utilization by the premium traffic may be acceptable if the
    majority of traffic will be best effort, either because the 
    best effort service is "good enough" for majority applications 
    or the price difference between premium traffic and best effort 
    traffic is too high to justify the performance difference between
    them. Alternatively, if the guaranteed nature of service 
    assurance is not needed, i.e., statistical service assurance is 
    sufficient for premium service, we can then achieve higher 
    network utilization. Providing meaningful statistical service
    is still an open research problem. A discussion of these topics 
    is beyond the scope of this document. Furthermore, there is usually 
    a tradeoff between the complexity of the QoS mechanism and the 
    efficiency of the resource usage. While intserv-style guaranteed 
    service can achieve high resource utilization, premium service 
    needs a much simpler mechanism.

    This document proposes a new set of PHBs based on Dynamic Packet
    State (DPS). These PHBs have relative low complexities (do not 
    require per-flow state), but can be used to implement distributed 
    algorithms to provide services with flexibility, utilization, 
    and assurance levels similar to those that can be achieved with 
    per-flow mechanisms. DS domains implemented with these type of PHBs
    should interoperate with DS domains implemented with other PHBs.

    With Dynamic Packet State, each packet carries in its header, in
    addition to the PHB codepoint, some PHB-specific state. The state
    is initialized by an ingress node, when the packet enters the DS
    domain. Interior nodes process each incoming packet based on the
    state carried in the packet's header, updating both its internal
    state and the state in the packet's header before forwarding it.
    By using DPS to coordinate actions of edge and interior nodes 
    along the path traversed by a flow, distributed algorithms can 
    be designed to approximate the behavior of a broad class of 
    "stateful" networks. Consequently, introducing PHBs based on DPS 
    will significantly increase the flexibility and capabilities of 
    the services that can be built within the diffserv architecture.
    In particular, we will show that a variety of QoS services can 
    be implemented by PHBs based on DPS. These include weighted fair
    share service, and distributed admission control service.

    In this document, we use flow to refer to a subset of packets 
    that traverse the same path inside a DS domain between two edge 

    Stoica et al                 Expires April 2003                  [Page 2]
    Internet Draft        PHBs based on Dynamic Packet State     October 2002

    nodes.  Thus, with the highest level of traffic aggregation, a 
    flow consists of all packets between the same pair of ingress 
    and egress nodes.  

    This document is organized as follows. Section 2 gives a general
    description of PHBs based on DPS. Section 3 presents several 
    services that can be implemented with PHBs based on DPS. 
    Section 4 discusses alternative techniques of storing state in 
    the packet's header. Sections 5 briefly discusses some issues
    related to the specification of DPS PHB's, such as codepoints, 
    tunneling behavior, and interaction with other PHB's. 
    Section 6 discusses security issues.

 2. Description of Per-Hop Behaviors Based on Dynamic Packet State

    Unlike common PHB codepoints [Blake98, Heinanen99, Jacobson98],
    a PHB codepoint based on DPS has extra state associated with it.
    This state is initialized by ingress nodes and carried by packets
    inside the DS domain. The state semantic is PHB dependent. 
    The values taken by the state can be either flow, path, or 
    packet dependent. The state carried by packets can be used by 
    interior nodes for a variety of purposes such as, packet 
    scheduling, updating the local node's state, or extending the 
    codepoint space.

    When a PHB based on DPS is defined, in addition to the guidelines
    given in [Blake98], the following items must be specified:

       o state semantic - the meaning of the information carried by
                          the packets

       o state placement - where is the information stored in the
                           packet's header

       o encoding format - how is the information encoded in packets

    For example, consider a PHB that implements the Stateless 
    Prioritized Fair Queue Sharing algorithm, which is described in 
    Section 3.1. In this case, the state carried by a packet is 
    based on an estimate of the current rate of the flow to which 
    the packet belongs. The state can be placed either (a) between 
    layer two and layer three headers, (b) as an IP option, or 
    (c) in the IP header (see Section 4). Finally, the rate can be 
    encoded by using a floating point like format as described in 
    Section 4.1.1.

    In addition, the following requirement, called the transparency
    requirement, must be satisfied

      o All changes performed at ingress nodes or within the DS 
      domain on packets' headers (possible for the purpose of 
      the state) must be undone by egress nodes


  Stoica et al                 Expires April 2003                  [Page 3]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

    Any document defining a PHB based on DPS must specify a default
    placement of the state in the packet header and a default bit
    encoding format. However, to increase the flexibility, it is
    acceptable for documents to define alternate state placements and
    encoding formats. Any router that claims to be compatible with a
    particular PHB based on DPS must support at least the default
    placement and the default bit encoding format.

 3. Examples of Services that can be Implemented by PHBs Based on DPS

    To illustrate the power and the flexibility of the PHBs based on
    DPS, we give a few examples. In the first, we address the 
    congestion control problem by approximating the functionality 
    of a "reference" network in which each node performs fair queuing.

 3.1. Stateless Prioritized Fair Queue-sharing (SPFQ)

    We first explain SPFQ using an idealized fluid model and then 
    present its packetized version.

 3.1.1 Fluid Model Algorithm: Bit Labeling

    We first restate the key observation in CSFQ [Stoica98] that we 
    will also use. In a router implementing fair queuing, all flows
    that are bottlenecked at a router have the same output rate.
    We call this the rate threshold(r_t(t)).

    Let C be the capacity of an output link at a router, and r_i(t) 
    the arrival rate of flow i in bits per second. Let A(t) denote 
    the total arrival rate of n flows: A(t)= r_1(t) + r_2(t) + ... +
    r_n(t).  If A(t) <= C then no bits are dropped. But if A(t) > C,
    then r_t(t) is the unique solution to C = min(r_1(t), r_t(t)) + 
    min(r_2(t), r_t(t)) + ... + min(r_n(t), r_t(t)).
    
    In general, if max-min bandwidth allocations are achieved, each 
    flow i receives service at a rate given by min(r_i(t), r_t(t)).

    For every flow, in any given second, we consider up to r_t(t) 
    bits of the flow as being conforming, and all bits in excess of 
    that as being non-conforming. If we mark every bit of a flow as 
    being conforming or non-conforming, we can obtain the allocation
    provided by a fair queuing router, by simply having routers 
    accept all conforming bits of a flow and dropping all 
    non-conforming bits.

    What we need now, is a labeling algorithm at the ingress node, that 
    would enable a router to distinguish between conforming and non-
    conforming traffic. Consider the simple sequential labeling algorithm:

    label(bit)
       served += 1
       bit->label = served
    where the value of 'served' is reset to 0, after every second.

    Stoica et al                 Expires April 2003                  [Page 4]
    Internet Draft        PHBs based on Dynamic Packet State     October 2002

    Let us suppose that the rate at which each flow is sending bits is 
    constant. The result of this algorithm is that during any given second,
    the bits from a flow sending at rate r_i bits per second are marked 
    sequentially from 1 to r_i; and the label is reset to 0 at the end of 
    each second. Then, for any given flow, accepting bits with label
    1 to 'r_a', would be equivalent to providing the flow with a
    rate of 'r_a' bits per second. So, if all bits carry such a
    label, a router can simply identify non-conforming bits
    to be those with label > r_t and drop them. Consequently, no
    flow can receive a service in excess of r_t . Furthermore, as
    all bits with label <= r_t are accepted, all flows sending at a rate 
    less than or equal to r_t will not have any of their bits dropped.

    As described in [Venkitar02], a key advantage of such a labeling 
    procedure is that it allows us to convey rate information as well
    as intra-flow priority using the same field in the packet header.

  3.1.2 Packet Labeling

    In order to extend the sequential labeling algorithm given
    for the fluid model to a packetized model, we essentially
    need to take variable packet sizes into account. Hence,
    instead of incrementing the counter 'served' by 1 (which
    is the size of any packet in a network with purely single-bit
    packets), we increment the value of 'served' by the size of
    the packet. Given below is a pseudo code for the packetized 
    version of the sequential marking algorithm.

    label(pkt)
      served += pkt->size
      pkt->label = served
    where the value of served is reset to 0, after a fixed size epoch.

    All ingress routers in a DS domain must use epochs of equal duration.
    The size of the epoch is a design parameter that should be chosen
    to reflect a tradeoff between mimicking the fluid model 
    accurately and not giving an unfair advantage to flows that 
    arrived most recently in the system. To understand this tradeoff,
    suppose that the epoch is chosen to be very long and that a new 
    flow arrives in the middle of an epoch. Then the bits from the 
    new flow would be labeled starting from a value of one and would 
    have a higher priority throughout the rest of the epoch than 
    the bits of flows that have been sending bits from the beginning
    of the epoch.

 3.1.3 Forwarding Decision in a Router

    [Stoica98], [Cao00], [Barnes01] and [Venkitar02] discuss algorithms 
    for updating the rate threshold (r_t) based on link state, i.e, 
    based on parameters such as queue length and the aggregate 
    accepted rate of packets. The forwarding decision in a router 
    is then made based on the following algorithm:


  Stoica et al                 Expires April 2003                  [Page 5]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

    enque(pkt)
      if (pkt->label <= r_t)
       Accept(pkt)
      else
       Drop(pkt)


 3.1.4 Additional Services based on SPFQ

    We now present examples of labeling methods at the edge that can 
    provide different services while retaining the same forwarding 
    behavior.

 3.1.4.1 Weighted SPFQ

    The SPFQ algorithm can be readily extended to support flows with 
    different weights. Let w_i be the weight of flow i. An allocation
    is weighted fair if all bottlenecked flows have the same value
    for r_i/w_i. The only major change required to achieve weighted 
    fair allocation is in the ingress labeling algorithm, where we 
    need to use 'served/w_i' instead of 'served'. This enables 
    per-flow service differentiation without maintaining per-flow 
    state in the core nodes of the network. 

 3.1.4.2 Minimum bandwidth allocation

    From the forwarding algorithm given in section 2, it is clear 
    that the packets marked with lower values of label are dropped 
    only after all packets with larger labels have been dropped. 
    This suggests that packets marked with the smallest label (of 0) 
    will not be dropped as long as the aggregate rate of such packets 
    does not exceed the link capacity. So, for a flow requiring a 
    minimum bandwidth allocation of 'b_min', labeling packets with 
    the smallest label at a rate of 'b_min' would ensure that the 
    flow will receive the rate that it has been guaranteed within a 
    reasonable time window (assuming that there is no packet loss due 
    to channel error). An admission control mechanism should be used 
    to ensure that the aggregate reserved rate does not exceed the 
    capacity of the link. A distributed admission control mechanism, 
    such as the one proposed in section 3.2 can be used for this purpose.

 3.2 Distributed Admission Control

    The previous examples focused on data path mechanisms and services.
    In this section, we will show that PHBs based on DPS can also
    implement control plane services such as distributed admission control.

    Admission control is a central component in providing quantitatively 
    defined QoS services.  The main job of the  admission control test is 
    to ensure that the network resources  are not over-committed. In 
    particular it has to ensure that the sum of the reservation rates of 
    all flows that traverse any link in the network is no larger than the 
    link capacity C.

    Stoica et al                 Expires April 2003                  [Page 6]
    Internet Draft        PHBs based on Dynamic Packet State     October 2002

    A new reservation request is granted if it passes the admission 
    test at each hop along its path. There are two main approaches to
    implementing admission control. Traditional reservation-based
    networks adopt a distributed architecture in which each node is
    responsible for its local resources, and where nodes are assumed to
    maintain per-flow state. To support the dynamic creation and 
    deletion of fine grained flows, a large quantity of dynamic per 
    flow state needs to be maintained in a distributed fashion. 
    Complex signaling protocols (e.g., RSVP and ATM UNI) are used 
    to maintain the consistency of this per-flow state.

    A second approach is to use a centralized bandwidth broker that
    maintains the topology as well as the state of all nodes in the
    network. In this case, the admission control can be implemented by
    the broker, eliminating the need for maintaining distributed state.
    Such a centralized approach is more appropriate for an environment
    where most flows are long lived, and set-up and tear-down events 
    are rare. However, to support fine grained and dynamic flows, there
    may be a need for a distributed broker architecture, in which the 
    broker database is replicated or partitioned. Such an architecture
    eliminates the need for a signaling protocol, but requires another
    protocol to maintain the consistency of the different broker
    databases. Unfortunately, since it is impossible to achieve perfect
    consistency, this may create race conditions and/or resource
    fragmentation.

    A third approach is to use a simplified provisioning model that is
    not aware of the details of the network topology, but instead 
    admits a new flow if there is sufficient bandwidth available for 
    the flow's packets to travel anywhere in the network with adequate
    QoS. This simplified model may be based on static provisioning and 
    service level agreements, or on a simple dynamic bandwidth broker. 
    In any case, the tradeoff made in return for the simplicity is 
    that the admission control decision must be more conservative, and
    a much smaller level of QoS-controlled service can be supported.

    In the following, we show that by using a PHB based on DPS, it is
    possible to implement distributed admission control for guaranteed
    services in a DS domain. In our scheme, for each reservation, the
    ingress node generates a request message that is forwarded along the
    path. In turn, each interior node decides whether or not to accept
    the request. When the message reaches the egress node it is returned
    to the ingress, which makes the final decision. We do not make any
    reliability assumptions about the request messages. In addition, the
    algorithms does not require reservation termination messages. In the
    following we describe the per-hop admission control. [StoZha99]
    describes how this scheme can be integrated with RSVP to provide 
    end-to-end delay bounded services.

 3.2.1. Per-Hop Admission Control

    The solution is based on two main ideas. First, we always maintain a
    conservative upper bound of the aggregate reservation R, denoted

  Stoica et al                 Expires April 2003                  [Page 7]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

    R_bound, which we use for making admission control decisions.
    R_bound is updated with a simple rule:

            R_bound = R_bound + r

    whenever a request for a rate r is received and R_bound + r <= C. It
    should be noted that in order to maintain the invariant that R_bound is
    an upper bound of R, this algorithm does not need to detect duplicate
    request messages, generated either due to retransmission in case of 
    packet loss or retry in case of partial reservation failures. Of course,
    the obvious problem with the algorithm is that R_bound will diverge 
    from R. At the limit, when R_bound reaches the link capacity C, no new
    requests can be accepted even though there might be available capacity.

    To address this problem, we introduce a separate algorithm, based on
    DPS, that periodically estimates the aggregate reserved rate. Based
    on this estimate we compute a second upper bound for R, and then use
    it to re-calibrate the upper bound R_bound. An important aspect of
    the estimation algorithm is that it can be actually shown that the
    discrepancy between the upper bound and the actual reserved rate R 
    is bounded. Then the re-calibration process reduces to choosing the
    minimum of the two upper bounds.

 3.2.2. Estimation Algorithm for the Aggregate Reservation

    To estimate the aggregate reservation, denoted R_est, we again use DPS.
    In this case, the state of each packet consists of the amount of bits a
    flow is entitled to send during the interval between the time when the
    previous packet was transmitted up to the time when the current packet
    is transmitted. Note that unlike the previous examples, in this case 
    the state carried by the packet does not affect the packet's processing
    by interior nodes. This state is solely used to compute each node's
    aggregate reservation.

    The estimation performed by interior nodes is based on the following
    simple observation: the sum of state values of packets of all flows
    received during an interval (a, a + T_W] is a good approximation for
    the total number of bits that all flows are entitled to send during
    this interval. Dividing this sum by T_W, we obtain the aggregate
    reservation rate. This is basically the rate estimation algorithm,
    though we need to account for several estimation errors. In particular,
    we need to account for the fact that not all  reservations continue for
    the entire duration of  interval (a, a + T_W].

    We divide time into intervals of length T_W. Let (u1, u2] be such an
    interval, where u2 = u1 + T_W. Let T_I be the maximum 
    inter-departure time between two consecutive packets in the same 
    flow and T_J be the maximum jitter of a flow, both of which are 
    much smaller than T_W. Further, each interior node is associated 
    a global variable Ra which is initialized at the beginning of 
    each interval (u1, u2] to zero, and is updated to Ra + r every 
    time a request for a reservation r is received and the admission 
    test is passed, i.e., R_bound + r <= C.

  Stoica et al                 Expires April 2003                  [Page 8]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002


    Let Ra(t) denote the value of this variable at time t. Since 
    interior nodes do not differentiate between the original and 
    duplicate requests, Ra(t) is an upper-bound of the sum of all 
    reservations accepted during the interval (u1, t]. (For simplicity,
    here we assume that a flow which is granted a reservation during 
    the interval (u1, u2] becomes active no later than u2.)  Then, it 
    can be shown that the aggregate reservation at time u2, R(u2), 
    is bounded by 

       R(u2) < R_est(u2)/(1-f) + Ra(u2),                     (7)

    where f = (T_I + T_J)/T_W. Finally, this bound is used to 
    re-calibrate the upper bound of the aggregate reservation 
    R_bound(u1) as follows

       R_bound(u2) = min(R_bound(u2), R_est(u2)/(1-f) + Ra(u2)). (8)

    Figure 1 shows the pseudocode of the admission decision and of the
    aggregate reservation estimation algorithm at ingress and interior
    nodes. We make several observations. First, the estimation algorithm
    uses only the information in the current interval. This makes the
    algorithm robust with respect to loss and duplication of signaling
    packets since their effects are "forgotten" after one time interval.
    As an example, if a node processes both the original and a duplicate
    of the same reservation request during the interval (u1, u2], 
    R_bound will be updated twice for the same flow. However, this 
    erroneous update will not be reflected in the computation of 
    R_est(u3), since its computation is based only on the state values 
    received during (u2, u3]. As a consequence, it can be show that the
    admission control algorithm can asymptotically reach a link 
    utilization of C (1 - f)/(1 + f) [StoZha99].

    A possible optimization of the admission control algorithm is to add
    reservation termination messages. This will reduce the discrepancy
    between the upper bound R_bound and the aggregate reservation R.
    However, in order to guarantee that R_bound remains an upper bound
    for R, we need to ensure that a termination message is sent at most
    once, i.e., there are no retransmissions if the message is lost. In
    practice, this property can be enforced by edge nodes, which 
    maintain per-flow state.

    To ensure that the maximum inter-departure time is no larger than
    T_I, the ingress node may need to send a dummy packet in the case
    when no data packet arrives for a flow during an interval T_I. This
    can be achieved by having the ingress node to maintain a timer with
    each flow. An optimization would be to aggregate all "micro-flows"
    between each pair of ingress and egress nodes into one flow, and
    compute the state value based on the aggregated reservation rate, 
    and insert a dummy packet only if there is no data packet for the
    aggregate flow during an interval.


  Stoica et al                 Expires April 2003                  [Page 9]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

      Figure 1 - Admission control and rate estimation algorithm.

   -----------------------------------------------------------------
                               Ingress node
   -----------------------------------------------------------------
      on packet p departure:
        i = get_flow(p);
        state(p) <- r[i] * (crt_time - prev_time[i]);
        prev_time[i] = crt_time;
   -----------------------------------------------------------------
                                Interior node
   -----------------------------------------------------------------
       Reservation Estimation         |     Admission Control
   -----------------------------------------------------------------
      on packet p arrival:            |  on reservation request r:
        b <- state(p);                |    /* admission ctrl. test */
        L = L + b;                    |    if (R_bound + r <= C)
                                      |      Ra = Ra + r;
      on time-out T_W:                |      R_bound = R_bound + r;
        /* estimated reservation */   |      accept(r);
        R_est = L / T_W;              |    else
        R_bound = min(R_bound,        |      deny(r);
                  R_est/(1 - f) + Ra);|
        Ra = 0;                       |
    -----------------------------------------------------------------


 4. Carrying State in Packets

    There are at least three ways to encode state in the packet
    header: (1) introduce a new IP option, and insert the option at the
    ingress router, (2) introduce a new header between layer 2 and layer
    3, and (3) use the existing IP header.

    Option number 23 has been assigned for adding DPS state in packets.
    Inserting an IP option, has the potential to provide a large
    space for encoding state. However it will require all routers within
    a DS domain to process IP options, which could complicate packet 
    processing. 

    Introducing a new header between layer 2 and layer 3 would require
    solutions be devised for different layer 2 technologies. In the
    context of MPLS [Rosen98, Rosen99] networks, the state can be
    encoded in a special label. One way to do this is by using a 
    particular encoding of the experimental use field indicating a 
    nested label on the label stack that carried the PHB-specific state
    information rather than an ordinary label. In this case, the label
    on the top of the stack would indicate the label-switched path, and
    the inner label the PHB-specific state. This would require a small 
    addition to the MPLS architecture to allow two labels to be pushed 
    or popped in unison, and treated as a single entity on the label 
    stack.


  Stoica et al                 Expires April 2003                 [Page 10]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

4.1. Encoding State within an IP header

    In this section, we discuss the third option: storing the additional
    states in the IP header. The biggest problem with using the IP header
    is to find enough space to insert the extra information. The main
    challenge is to remain fully compatible with current standards and
    protocols. In particular, we want the network domain to be transparent
    to end-to-end protocols, i.e., the egress node should restore the
    fields changed by ingress and interior nodes to their original values.
    In this respect, we observe that there is an ip_off field in the IPv4
    header to support packet fragmentation/reassembly which is rarely
    used. For example, by analyzing the traces of over 1.7 million packets
    on an OC-3 link [nlanr], we found that less than 0.22% of all packets
    were fragments. In addition, ther are a relatively small number of
    distinct fragment sizes. Therefore, it is possible to use a fraction
    of ip_off field to encode the fragment sizes, and the remaining bits
    to encode DPS information. The idea can be implemented as follows.
    When a packet arrives at an ingress node, the node checks whether a
    packet is a fragment or needs to be fragmented. If neither of these
    is true, all 13 bits of the ip_off field in the packet header will be
    used to encode DPS values. If the packet is a fragment, the fragment
    size is recoded into a more efficient representation and the rest of
    the bits is used to encode the DPS information. The fragment size field
    will be restored at the egress node.

    In the above, we make the implicit assumption that a packet can be
    fragmented only by ingress nodes, and not by interior nodes.  This is
    consistent with the diffserv view that the forwarding behavior of a
    network's component is engineered to be compatible throughout a domain.

    In summary, this gives us up to 13 bits in the current IPv4 header to
    encode the PHB specific state.

 4.2. Example of State Encoding

    The state encoding is PHB dependent. In this section, we give
    examples of encoding the state for the services described in
    Section 3.

 4.2.1. Encoding Flow's Rate

    Recall that in SPFQ, the PHB state is determined by the
    current rate of the flow to which the packet belongs. One possible
    way to represent the rate estimate is to restrict it to only a
    small number of possible values. For example if we limit it to 128
    values, only seven bits are needed to represent this rate. While 
    this can be a reasonable solution in practice, in the following we 
    propose a more sophisticated representation that allows us to 
    express a larger range of values.

    Let r denote the packet label. In the most general case r could be
    a floating poing number. To represent r we use an m bit mantissa 
    and an n bit exponent. Since r >= 0, it is possible to gain

  Stoica et al                 Expires April 2003                 [Page 11]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

    an extra bit for mantissa. For this we consider two cases: 
    (a) if r >= 2^m we represent r as the closest value of the form 
    u 2^v, where 2^m <= u <= 2^(m+1). Then, since the (m+1)-th most 
    significant bit in the u's representation is always 1, we can 
    ignore it. As an example, assume m = 3, n = 4, and r = 19 = 10011. 
    Then 19 is represented as 18 = u*2^v, where u = 9 = 1001 and v = 1. 
    By ignoring the first bit in the representation of u the mantissa 
    will store 001, while the exponent will be 1. 
    (b) On the other hand, if r < 2^m, the mantissa will contain r, 
    while the exponent will be 2^n - 1. For example, for m = 3, 
    n = 4, and r = 6 = 110, the mantissa is 110, while the exponent
    is 1111. Converting from one format to another can be efficiently
    implemented. Figure 2 shows the conversion code in C. For 
    simplicity, here we assume that integers are truncated rather than 
    rounded when represented in floating point.

    Figure 2. The C code for converting between integer and floating
              point formats. m represents the number of bits used by
              the mantissa; n represents the number of bits in the
              exponent.

    -----------------------------------------------------------------
      intToFP(int val, int *mantissa, int *exponent) {
        int nbits = get_num_bits(val);
        if (nbits <= m) {
          *mantissa = val;
          *exponent = (1 << n) - 1;
        } else {
          *exponent = nbits - m - 1;
          *mantissa = (val >> *exponent) - (1 << m);
        }
      }

      FPToInt(int mantissa, int exponent) {
        int tmp;
        if (exponent == ((1 << n) - 1))
          return mantissa;
        tmp = mantissa | (1 << m);
        return (tmp << exponent)
      }
    ------------------------------------------------------------------

    By using m bits for mantissa and n for exponent, we can represent
    any integer in the range [0..(2^(m+1)-1) * 2^(2^n - 1)] with a
    relative error bounded by (-1/2^(m+1), 1/2^(m+1)). For example, 
    with 7 bits, by allocating 3 for mantissa and 4 for exponent, we can
    represent any integer in the range [1..15*2^15] with a relative error
    of (-6.25%, 6.25%). The worst relative error case occurs when the
    mantissa is 8. For example the number r = 271 = 100001111 is encoded
    as u = 1000, v=5, with a relative error of (8*2^5 - 271)/271 = 
    -0.0554 = -5.54%. Similarly, r = 273 = 100010001 is encoded as 
    u = 1001, v = 5, with a relative error of 5.55%.


  Stoica et al                 Expires April 2003                 [Page 12]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

4.2.2. Encoding Reservation State

    As shown in Figure 1, when estimating the aggregate reservation, the
    PHB state represents the number of bits that a flow is entitled to
    send during the interval between the time when the previous packet 
    of the flow has been transmitted until the current packet is
    transmitted. This number can be simply encoded as an integer b. To
    reduce the range, a possibility is to store b/l instead of b, where
    l is the length of the packet.

 4.3. Encoding Multiple Values

    Since the space in the packet's header is a scarce resource, 
    encoding multiple values is particularly challenging. In this 
    section we discuss two general methods that helps to alleviate this 
    difficulty.

    In the first method, the idea is to leverage additional knowledge 
    about the state semantic to achieve efficient encoding. In 
    particular one value can be stored as a function of other values. 
    For example, if a value is known to be always greater than the 
    other values, the larger value can be represented in floating point
    format, while the other values may be represented as fractions of 
    this value.

    The idea of the second method is to have different packets within a
    flow carry different state formats.  This method is appropriate for
    PHBs that do not require all packets of a flow to carry the same
    state.  For example, in estimating the aggregate reservation (see
    Section 3.2) there is no need for every packet to carry the number
    of bits the flow is entitled to send between the current time and the
    time when the previous packet has been transmitted.  The only
    requirement is that the distance between any two consecutive packets
    that carry such values to be no larger than T_I. Other packets in
    between can carry different information.  Similarly, if we encode the
    IP fragment size in the packet's state, the packet has to carry this
    value only if the IP fragment is not zero.  When the IP fragment is
    zero the packet can carry other state instead. On the other hand, note
    that in SPFQ, it is mandatory that every packet be labelled by the
    ingress edge, as this value is used in making forward/drop decisions
    by ingress routers.

 5. Specification Issues

    This section briefly describes some issues related to drafting
    specifications for PHB's based on DPS.

 5.1. Recommended Codepoints

    At this time it is appropriate to use values drawn from the 16
    codepoints [Nichols98] reserved for local and experimental use
    (xxxx11) to encode PHBs based on DPS.


  Stoica et al                 Expires April 2003                 [Page 13]
  Internet Draft        PHBs based on Dynamic Packet State     October 2002

 5.2. Interaction with other PHBs

    The interaction of DPS PHB's with other PHB's obviously depends on the
    PHB semantic.  It should be noted that the presence of other PHB's in
    a node may affect the computation and update of DPS state as well as
    the actual forwarding behavior experienced by the packet.

 5.3. Tunneling

    When packets with PHBs based on DPS are tunneled, the end-nodes must
    make sure that (1) the tunnel is marked with a PHB that does not
    violate the original PHB semantic, and (2) the PHB specific state is
    correctly updated at the end of the tunnel. This requirement might be
    met by using a tunnel PHB that records and updates packet state, and
    then copying the state from the encapsulating packet to the inner
    packet at the tunnel endpoint. Alternatively, the behavior of the
    tunnel might be measured or precomputed in a way that allows the
    encapsulated packet's DPS state to be updated at the decapsulation
    point without requiring the tunnel to support DPS behavior.

 6. Security Considerations

    The space allocated for the PHB state in the packet header must be
    compatible with IPsec. In this context we note that using the fragment
    offset to carry PHB state does not affect IPsec's end-to-end security,
    since the fragment offset is not used for cryptographic calculations
    [Kent98]. Thus, as it is the case with the DS field [Nichols98], IPSec
    does not provide any defense against malicious modifications of the
    PHB state. This leaves the door open for theft of service, which inturn
    May cause denial of service to other conforming users.
    For example, in SPFQ, a label based on a small rate estimate may cause
    disproportionate bandwidth being allocated to the flow inside the DS
    domain. In the example in Section 3.2.2, the under estimation of the
    aggregate reservation can lead to resource overprovision.
    One way to expose denial of service attacks is by auditing. In this
    context, we note that associating state with PHBs makes it easier to
    perform efficient auditing at interior nodes. For example, in SPFQ, 
    an eventual attack can be detected by simply measuring a flow rate 
    and then comparing it against the label carried by the flow's packets.

    Security considerations covered in [Blake98] that correspond to
    diffserv code points also apply to PHB code points for DPS. 

 7. Conclusions

    In this document we have proposed an extension of the diffserv 
    architecture by defining a new set of PHBs that are based on 
    Dynamic Packet State. By using DPS to coordinate actions of edge 
    and interior nodes along the path traversed by a flow, distributed 
    algorithms can be designed to approximate the behavior of a broad
    class of "stateful" networks within the diffserv architecture. Such 
    an extension will significantly increase the flexibility and 
    capabilities of the services that can be provided by diffserv.

    Stoica et al                 Expires April 2003                 [Page 14]
    Internet Draft        PHBs based on Dynamic Packet State     October 2002

8.  References

    [Barnes01] R. Barnes, R. Srikant, J. Mysore, N. Venkitaraman.
    Analysis of Stateless Fair Queuing Algorithms, Proc. of the 35th 
    Annual Conference on Information Sciences and Systems, March 2001. 

    [Blake98] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and
    W. Weiss. An Architecture for Differentiated Services, RFC 2475
    December 1998.

    [Cao00] Z. Cao, Z Wang and E. Zegura, Rainbow Fair Queueing: Fair 
    Bandwidth Sharing Without Per-Flow State, Proc. of INFOCOM 2000.

    [Heinanen99] J. Heinanen, F. Baker, W.  Weiss, and J. Wroclawski. 
    Assured Forwarding PHB Group, RFC 2597, June 1999.

    [Jacobson98] V. Jacobson, K. Poduri and K. Nichols.  An
    Expedited Forwarding PHB, RFC 2598, June 1999.

    [Kent98] S. Kent and R. Atkinson. IP Authentication Header,
    RFC 2402, November 1998.

    [Nichols98] K. Nichols, S. Blake, F. Baker, and D. L. Black.
    Definition of the Differentiated Services Field (DS Field) in the
    ipv4 and ipv6 Headers, RFC 2474, December 1998.

    [Stoica98] I. Stoica, S. Shenker, and H. Zhang. Core-Stateless
    Fair Queueing: Achieving Approximately Fair Bandwidth Allocations
    in High Speed Networks. In Proceedings ACM SIGCOMM'98,
    pages 118-130, Vancouver, September 1998.

    [StoZha99] I. Stoica and H. Zhang. Providing Guaranteed Services
    Without Per-flow Management. In Proceedings of ACM SIGCOMM'99,
    Boston, September 1999.

    [Venkitar02] N. Venkitaraman, J. Mysore, M. Needham. Core-Stateless
    Utility Function based Rate Allocation. Proceedings of PfHSN'2002, 
    Berlin, April 2002.

9. Author's Addresses

    Ion Stoica                                    Hui Zhang 
    645 Soda Hall                                 Wean Hall 7115
    Computer Science Division                     School of Computer Science
    University of California, Berkeley            Carnegie Mellon University
    Berkeley, CA 94720                            Pittsburgh, PA 15213
    istoica@cs.berkeley.edu                       hzhang@cs.cmu.edu

    Narayanan Venkitaraman                        Jayanth Mysore
    Motorola Labs                                 Motorola Labs,
    1301 E. Algonquin Rd.                         1301 E. Algonquin Rd.
    Schaumburg, IL 60196                          Schaumburg, IL 60196
    venkitar@labs.mot.com                         jayanth@labs.mot.com

  Stoica et al                 Expires April 2003                 [Page 15]

  Internet Draft        PHBs based on Dynamic Packet State     October 2002


10. Full Copyright Statement

    Copyright (C) The Internet Society (1999).  All Rights Reserved.

    This document and translations of it may be copied and furnished to
    others, and derivative works that comment on or otherwise explain it
    or assist in its implementation may be prepared, copied, published
    and distributed, in whole or in part, without restriction of any
    kind, provided that the above copyright notice and this paragraph 
    are included on all such copies and derivative works.  However, this
    document itself may not be modified in any way, such as by removing
    the copyright notice or references to the Internet Society or other
    Internet organizations, except as needed for the purpose of
    developing Internet standards in which case the procedures for
    copyrights defined in the Internet Standards process must be
    followed, or as required to translate it into languages other than
    English.

    The limited permissions granted above are perpetual and will not be
    revoked by the Internet Society or its successors or assigns.

    This document and the information contained herein is provided on an
    "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
    TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
    BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
    HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


  Stoica et al                 Expires April 2003                 [Page 16]