Internet Engineering Task Force                           I. Stoica
                                                        UC Berkeley
Internet Draft                                             H. Zhang
                                                                CMU
Expires April 2003                                  N. Venkitaraman
                                                      Motorola Labs
                                                          J. Mysore
                                                      Motorola Labs
                                                       October 2002

          Per Hop Behaviors Based on Dynamic Packet State

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

Abstract

   This document proposes a family of Per-Hop Behaviors (PHBs) based
   on Dynamic Packet State (DPS) in the context of the differentiated
   services architecture.  With these PHBs, distributed algorithms can
   be devised to implement services with flexibility, utilization, and
   assurance levels similar to those that can be provided with
   per-flow mechanisms.

   With Dynamic Packet State, each packet carries in its header, in
   addition to the PHB codepoint, some PHB-specific state.  The state
   is initialized by the ingress node.  Interior nodes process each
   incoming packet based on the state carried in the packet's header,
   updating both their internal state and the state in the packet's
   header before forwarding it to the next hop.
   By using DPS to coordinate actions of edge and interior nodes along
   the path traversed by a flow, distributed algorithms can be
   designed to approximate the behavior of a broad class of "stateful"
   networks using networks in which interior nodes do not maintain
   per-flow state.  We give examples of services that can be
   implemented by PHBs based on DPS.  We also discuss several possible
   solutions for encoding Dynamic Packet State that minimize
   incompatibility with IPv4.

1. Introduction

   While the diffserv architecture [Blake98] is highly scalable, the
   services it can provide have lower flexibility, utilization, or
   assurance levels than services provided by architectures that
   employ per-flow management at every node.  It is debatable whether
   this should be of significant concern.  For example, the low
   utilization by premium traffic may be acceptable if the majority of
   traffic will be best effort, either because the best-effort service
   is "good enough" for the majority of applications, or because the
   price difference between premium and best-effort traffic is too
   high to justify the performance difference between them.
   Alternatively, if the guaranteed nature of service assurance is not
   needed, i.e., if statistical service assurance is sufficient for
   premium service, we can achieve higher network utilization.
   Providing meaningful statistical service is still an open research
   problem.  A discussion of these topics is beyond the scope of this
   document.  Furthermore, there is usually a tradeoff between the
   complexity of the QoS mechanism and the efficiency of resource
   usage.  While intserv-style guaranteed service can achieve high
   resource utilization, premium service needs a much simpler
   mechanism.

   This document proposes a new set of PHBs based on Dynamic Packet
   State (DPS).
   These PHBs have relatively low complexity (they do not require
   per-flow state), but can be used to implement distributed
   algorithms that provide services with flexibility, utilization, and
   assurance levels similar to those that can be achieved with
   per-flow mechanisms.  DS domains implemented with this type of PHB
   should interoperate with DS domains implemented with other PHBs.

   With Dynamic Packet State, each packet carries in its header, in
   addition to the PHB codepoint, some PHB-specific state.  The state
   is initialized by an ingress node when the packet enters the DS
   domain.  Interior nodes process each incoming packet based on the
   state carried in the packet's header, updating both their internal
   state and the state in the packet's header before forwarding it.
   By using DPS to coordinate actions of edge and interior nodes along
   the path traversed by a flow, distributed algorithms can be
   designed to approximate the behavior of a broad class of "stateful"
   networks.  Consequently, introducing PHBs based on DPS will
   significantly increase the flexibility and capabilities of the
   services that can be built within the diffserv architecture.  In
   particular, we will show that a variety of QoS services can be
   implemented by PHBs based on DPS, including a weighted fair share
   service and a distributed admission control service.

   In this document, we use "flow" to refer to a subset of packets
   that traverse the same path inside a DS domain between two edge
   nodes.  Thus, at the highest level of traffic aggregation, a flow
   consists of all packets between the same pair of ingress and egress
   nodes.

   This document is organized as follows.  Section 2 gives a general
   description of PHBs based on DPS.  Section 3 presents several
   services that can be implemented with PHBs based on DPS.  Section 4
   discusses alternative techniques for storing state in the packet's
   header.
   Section 5 briefly discusses some issues related to the
   specification of DPS PHBs, such as codepoints, tunneling behavior,
   and interaction with other PHBs.  Section 6 discusses security
   issues.

2. Description of Per-Hop Behaviors Based on Dynamic Packet State

   Unlike common PHB codepoints [Blake98, Heinanen99, Jacobson98], a
   PHB codepoint based on DPS has extra state associated with it.
   This state is initialized by ingress nodes and carried by packets
   inside the DS domain.  The state's semantics are PHB dependent.
   The values taken by the state can be flow, path, or packet
   dependent.  The state carried by packets can be used by interior
   nodes for a variety of purposes, such as packet scheduling,
   updating the local node's state, or extending the codepoint space.

   When a PHB based on DPS is defined, in addition to the guidelines
   given in [Blake98], the following items must be specified:

   o  state semantics - the meaning of the information carried by the
      packets

   o  state placement - where the information is stored in the
      packet's header

   o  encoding format - how the information is encoded in packets

   For example, consider a PHB that implements the Stateless
   Prioritized Fair Queue Sharing algorithm, which is described in
   Section 3.1.  In this case, the state carried by a packet is based
   on an estimate of the current rate of the flow to which the packet
   belongs.  The state can be placed either (a) between the layer two
   and layer three headers, (b) in an IP option, or (c) in the IP
   header (see Section 4).  Finally, the rate can be encoded using a
   floating-point-like format, as described in Section 4.2.1.
   In addition, the following requirement, called the transparency
   requirement, must be satisfied:

   o  All changes performed at ingress nodes or within the DS domain
      on packets' headers (possibly for the purpose of carrying the
      state) must be undone by egress nodes.

   Any document defining a PHB based on DPS must specify a default
   placement of the state in the packet header and a default bit
   encoding format.  However, to increase flexibility, it is
   acceptable for documents to define alternate state placements and
   encoding formats.  Any router that claims to be compatible with a
   particular PHB based on DPS must support at least the default
   placement and the default bit encoding format.

3. Examples of Services that can be Implemented by PHBs Based on DPS

   To illustrate the power and the flexibility of PHBs based on DPS,
   we give a few examples.  In the first, we address the congestion
   control problem by approximating the functionality of a "reference"
   network in which each node performs fair queuing.

3.1. Stateless Prioritized Fair Queue-sharing (SPFQ)

   We first explain SPFQ using an idealized fluid model and then
   present its packetized version.

3.1.1 Fluid Model Algorithm: Bit Labeling

   We first restate the key observation in CSFQ [Stoica98] that we
   will also use: in a router implementing fair queuing, all flows
   that are bottlenecked at the router have the same output rate.  We
   call this rate the rate threshold, r_t(t).

   Let C be the capacity of an output link at a router, and r_i(t) the
   arrival rate of flow i in bits per second.  Let A(t) denote the
   total arrival rate of n flows: A(t) = r_1(t) + r_2(t) + ... +
   r_n(t).  If A(t) <= C, then no bits are dropped.  But if A(t) > C,
   then r_t(t) is the unique solution to

      C = min(r_1(t), r_t(t)) + min(r_2(t), r_t(t)) + ... +
          min(r_n(t), r_t(t)).
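As a concrete illustration of this equation (not part of the PHB specification), the threshold r_t can be computed numerically; the following C sketch, with names of our choosing, finds it by binary search, assuming the link is congested (A(t) > C):

```c
#include <assert.h>

/* Illustrative sketch: compute the rate threshold r_t such that
 * sum_i min(r_i, r_t) = C, assuming the link is congested
 * (sum of r_i > C).  Binary search over [0, C]. */
static double rate_threshold(const double *r, int n, double C) {
    double lo = 0.0, hi = C;     /* r_t can never exceed the capacity */
    for (int iter = 0; iter < 60; iter++) {
        double mid = (lo + hi) / 2.0;
        double served = 0.0;
        for (int i = 0; i < n; i++)
            served += (r[i] < mid) ? r[i] : mid;
        if (served < C)
            lo = mid;            /* threshold too low: link underused */
        else
            hi = mid;
    }
    return (lo + hi) / 2.0;
}
```

For example, for three flows with rates 2, 4, and 10 on a link of capacity 10, the search converges to r_t = 4: the two slower flows keep their rates and the bottlenecked flow is limited to the remaining capacity.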
   In general, if max-min bandwidth allocations are achieved, each
   flow i receives service at a rate given by min(r_i(t), r_t(t)).

   For every flow, in any given second, we consider up to r_t(t) bits
   of the flow as being conforming, and all bits in excess of that as
   being non-conforming.  If we mark every bit of a flow as being
   conforming or non-conforming, we can obtain the allocation provided
   by a fair queuing router by simply having routers accept all
   conforming bits of a flow and drop all non-conforming bits.

   What we need now is a labeling algorithm at the ingress node that
   would enable a router to distinguish between conforming and
   non-conforming traffic.  Consider the simple sequential labeling
   algorithm:

      label(bit)
        served += 1
        bit->label = served

   where the value of 'served' is reset to 0 after every second.

   Let us suppose that the rate at which each flow is sending bits is
   constant.  The result of this algorithm is that during any given
   second, the bits from a flow sending at rate r_i bits per second
   are labeled sequentially from 1 to r_i, and the label is reset to 0
   at the end of each second.  Then, for any given flow, accepting
   bits with labels 1 to 'r_a' would be equivalent to providing the
   flow with a rate of 'r_a' bits per second.  So, if all bits carry
   such a label, a router can simply identify non-conforming bits as
   those with label > r_t and drop them.  Consequently, no flow can
   receive service in excess of r_t.  Furthermore, as all bits with
   label <= r_t are accepted, flows sending at a rate less than or
   equal to r_t will not have any of their bits dropped.

   As described in [Venkitar02], a key advantage of such a labeling
   procedure is that it allows us to convey rate information as well
   as intra-flow priority using the same field in the packet header.
3.1.2 Packet Labeling

   In order to extend the sequential labeling algorithm given for the
   fluid model to a packetized model, we essentially need to take
   variable packet sizes into account.  Hence, instead of incrementing
   the counter 'served' by 1 (which is the size of any packet in a
   network with purely single-bit packets), we increment the value of
   'served' by the size of the packet.  Given below is pseudocode for
   the packetized version of the sequential labeling algorithm:

      label(pkt)
        served += pkt->size
        pkt->label = served

   where the value of 'served' is reset to 0 at the end of each
   fixed-size epoch.

   All ingress routers in a DS domain must use epochs of equal
   duration.  The size of the epoch is a design parameter that should
   be chosen to reflect a tradeoff between mimicking the fluid model
   accurately and not giving an unfair advantage to flows that arrived
   most recently in the system.  To understand this tradeoff, suppose
   that the epoch is chosen to be very long and that a new flow
   arrives in the middle of an epoch.  Then the bits from the new flow
   would be labeled starting from a value of one and would have a
   higher priority throughout the rest of the epoch than the bits of
   flows that have been sending from the beginning of the epoch.

3.1.3 Forwarding Decision in a Router

   [Stoica98], [Cao00], [Barnes01] and [Venkitar02] discuss algorithms
   for updating the rate threshold (r_t) based on link state, i.e.,
   based on parameters such as queue length and the aggregate accepted
   rate of packets.  The forwarding decision in a router is then made
   based on the following algorithm:

      enque(pkt)
        if (pkt->label <= r_t)
          Accept(pkt)
        else
          Drop(pkt)

3.1.4 Additional Services based on SPFQ

   We now present examples of labeling methods at the edge that can
   provide different services while retaining the same forwarding
   behavior.
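For concreteness, the ingress labeling of Section 3.1.2 and the forwarding test of Section 3.1.3 can be sketched together in C; the struct names and the simplified epoch and r_t handling are ours, not part of any PHB specification:

```c
#include <assert.h>

/* Illustrative sketch of edge labeling and interior forwarding.
 * Epoch resets and r_t estimation are omitted for brevity. */

struct flow { long served; };             /* per-flow state, edge only */
struct pkt  { long size; long label; };

/* ingress: stamp the packet with the flow's cumulative byte count;
 * 'served' would be reset to 0 at the end of each epoch */
static void label_pkt(struct flow *f, struct pkt *p) {
    f->served += p->size;
    p->label = f->served;
}

/* interior: accept iff the label does not exceed the threshold r_t
 * (r_t itself is updated by a separate algorithm, e.g. [Stoica98]) */
static int enque(const struct pkt *p, long r_t) {
    return p->label <= r_t;               /* 1 = accept, 0 = drop */
}
```

With r_t = 200 bytes per epoch, for instance, the first two 100-byte packets of a flow are accepted and the third is dropped, limiting the flow to its fair share without any per-flow state at the interior node.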
3.1.4.1 Weighted SPFQ

   The SPFQ algorithm can be readily extended to support flows with
   different weights.  Let w_i be the weight of flow i.  An allocation
   is weighted fair if all bottlenecked flows have the same value of
   r_i/w_i.  The only major change required to achieve weighted fair
   allocation is in the ingress labeling algorithm, where we need to
   use 'served/w_i' instead of 'served'.  This enables per-flow
   service differentiation without maintaining per-flow state in the
   core nodes of the network.

3.1.4.2 Minimum bandwidth allocation

   From the forwarding algorithm given in Section 3.1.3, it is clear
   that packets marked with lower label values are dropped only after
   all packets with larger labels have been dropped.  This suggests
   that packets marked with the smallest label (0) will not be dropped
   as long as the aggregate rate of such packets does not exceed the
   link capacity.  So, for a flow requiring a minimum bandwidth
   allocation of 'b_min', labeling packets with the smallest label at
   a rate of 'b_min' would ensure that the flow receives the rate it
   has been guaranteed within a reasonable time window (assuming that
   there is no packet loss due to channel errors).  An admission
   control mechanism should be used to ensure that the aggregate
   reserved rate does not exceed the capacity of the link.  A
   distributed admission control mechanism, such as the one proposed
   in Section 3.2, can be used for this purpose.

3.2 Distributed Admission Control

   The previous examples focused on data path mechanisms and services.
   In this section, we show that PHBs based on DPS can also implement
   control plane services such as distributed admission control.

   Admission control is a central component in providing
   quantitatively defined QoS services.  The main job of the admission
   control test is to ensure that network resources are not
   over-committed.
   In particular, it must ensure that the sum of the reservation rates
   of all flows that traverse any link in the network is no larger
   than the link capacity C.

   A new reservation request is granted if it passes the admission
   test at each hop along its path.  There are two main approaches to
   implementing admission control.  Traditional reservation-based
   networks adopt a distributed architecture in which each node is
   responsible for its local resources, and nodes are assumed to
   maintain per-flow state.  To support the dynamic creation and
   deletion of fine-grained flows, a large quantity of dynamic
   per-flow state needs to be maintained in a distributed fashion.
   Complex signaling protocols (e.g., RSVP and ATM UNI) are used to
   maintain the consistency of this per-flow state.

   A second approach is to use a centralized bandwidth broker that
   maintains the topology as well as the state of all nodes in the
   network.  In this case, admission control can be implemented by the
   broker, eliminating the need to maintain distributed state.  Such a
   centralized approach is more appropriate for an environment where
   most flows are long-lived, and set-up and tear-down events are
   rare.  However, to support fine-grained and dynamic flows, there
   may be a need for a distributed broker architecture, in which the
   broker database is replicated or partitioned.  Such an architecture
   eliminates the need for a signaling protocol, but requires another
   protocol to maintain the consistency of the different broker
   databases.  Unfortunately, since it is impossible to achieve
   perfect consistency, this may create race conditions and/or
   resource fragmentation.
   A third approach is to use a simplified provisioning model that is
   not aware of the details of the network topology, but instead
   admits a new flow if there is sufficient bandwidth available for
   the flow's packets to travel anywhere in the network with adequate
   QoS.  This simplified model may be based on static provisioning and
   service level agreements, or on a simple dynamic bandwidth broker.
   In any case, the tradeoff made in return for the simplicity is that
   the admission control decision must be more conservative, and a
   much smaller level of QoS-controlled service can be supported.

   In the following, we show that by using a PHB based on DPS, it is
   possible to implement distributed admission control for guaranteed
   services in a DS domain.  In our scheme, for each reservation, the
   ingress node generates a request message that is forwarded along
   the path.  In turn, each interior node decides whether or not to
   accept the request.  When the message reaches the egress node, it
   is returned to the ingress, which makes the final decision.  We do
   not make any reliability assumptions about the request messages.
   In addition, the algorithm does not require reservation termination
   messages.  In the following, we describe the per-hop admission
   control.  [StoZha99] describes how this scheme can be integrated
   with RSVP to provide end-to-end delay bounded services.

3.2.1. Per-Hop Admission Control

   The solution is based on two main ideas.  First, we always maintain
   a conservative upper bound on the aggregate reservation R, denoted
   R_bound, which we use for making admission control decisions.
   R_bound is updated with a simple rule:

      R_bound = R_bound + r

   whenever a request for a rate r is received and R_bound + r <= C.
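A minimal C sketch of this update rule, assuming a single link of capacity C in bits per second; 'link' and 'admit' are illustrative names, and the re-calibration of R_bound via the estimation algorithm of Section 3.2.2 is omitted:

```c
#include <assert.h>

/* Sketch of the per-hop admission test.  R_bound conservatively
 * tracks the aggregate reservation; re-calibration is omitted. */
struct link {
    double C;        /* link capacity (bits/s) */
    double R_bound;  /* upper bound on the aggregate reservation */
};

/* returns 1 if a reservation request for rate r is accepted */
static int admit(struct link *l, double r) {
    if (l->R_bound + r <= l->C) {
        l->R_bound += r;   /* R_bound = R_bound + r */
        return 1;
    }
    return 0;
}
```

Note that duplicate requests simply inflate R_bound; as discussed below, this preserves the invariant that R_bound is an upper bound, at the cost of a divergence that the re-calibration step later corrects.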
   It should be noted that, in order to maintain the invariant that
   R_bound is an upper bound on R, this algorithm does not need to
   detect duplicate request messages, generated either by
   retransmission in case of packet loss or by retry in case of
   partial reservation failures.  Of course, the obvious problem with
   the algorithm is that R_bound will diverge from R.  In the limit,
   when R_bound reaches the link capacity C, no new requests can be
   accepted even though there might be available capacity.

   To address this problem, we introduce a separate algorithm, based
   on DPS, that periodically estimates the aggregate reserved rate.
   Based on this estimate we compute a second upper bound for R, and
   then use it to re-calibrate the upper bound R_bound.  An important
   aspect of the estimation algorithm is that it can actually be shown
   that the discrepancy between the upper bound and the actual
   reserved rate R is bounded.  The re-calibration process then
   reduces to choosing the minimum of the two upper bounds.

3.2.2. Estimation Algorithm for the Aggregate Reservation

   To estimate the aggregate reservation, denoted R_est, we again use
   DPS.  In this case, the state of each packet consists of the number
   of bits the flow is entitled to send during the interval between
   the time when the previous packet was transmitted and the time when
   the current packet is transmitted.  Note that, unlike in the
   previous examples, the state carried by the packet does not affect
   the packet's processing by interior nodes.  This state is used
   solely to compute each node's aggregate reservation.

   The estimation performed by interior nodes is based on the
   following simple observation: the sum of the state values of the
   packets of all flows received during an interval (a, a + T_W] is a
   good approximation of the total number of bits that all flows are
   entitled to send during this interval.
   Dividing this sum by T_W, we obtain the aggregate reservation rate.
   This is basically the rate estimation algorithm, though we need to
   account for several estimation errors.  In particular, we need to
   account for the fact that not all reservations continue for the
   entire duration of the interval (a, a + T_W].

   We divide time into intervals of length T_W.  Let (u1, u2] be such
   an interval, where u2 = u1 + T_W.  Let T_I be the maximum
   inter-departure time between two consecutive packets of the same
   flow and T_J be the maximum jitter of a flow, both of which are
   much smaller than T_W.  Further, each interior node maintains a
   global variable Ra, which is initialized to zero at the beginning
   of each interval (u1, u2], and is updated to Ra + r every time a
   request for a reservation r is received and the admission test is
   passed, i.e., R_bound + r <= C.

   Let Ra(t) denote the value of this variable at time t.  Since
   interior nodes do not differentiate between original and duplicate
   requests, Ra(t) is an upper bound on the sum of all reservations
   accepted during the interval (u1, t].  (For simplicity, here we
   assume that a flow which is granted a reservation during the
   interval (u1, u2] becomes active no later than u2.)  Then, it can
   be shown that the aggregate reservation at time u2, R(u2), is
   bounded by

      R(u2) < R_est(u2)/(1 - f) + Ra(u2),                          (7)

   where f = (T_I + T_J)/T_W.  Finally, this bound is used to
   re-calibrate the upper bound of the aggregate reservation, R_bound,
   as follows:

      R_bound(u2) = min(R_bound(u2), R_est(u2)/(1 - f) + Ra(u2)).  (8)

   Figure 1 shows the pseudocode of the admission decision and of the
   aggregate reservation estimation algorithm at ingress and interior
   nodes.  We make several observations.  First, the estimation
   algorithm uses only the information in the current interval.
   This makes the algorithm robust with respect to the loss and
   duplication of signaling packets, since their effects are
   "forgotten" after one time interval.  As an example, if a node
   processes both the original and a duplicate of the same reservation
   request during the interval (u1, u2], R_bound will be updated twice
   for the same flow.  However, this erroneous update will not be
   reflected in the computation of R_est(u3), since that computation
   is based only on the state values received during (u2, u3].  As a
   consequence, it can be shown that the admission control algorithm
   can asymptotically reach a link utilization of C (1 - f)/(1 + f)
   [StoZha99].

   A possible optimization of the admission control algorithm is to
   add reservation termination messages.  This would reduce the
   discrepancy between the upper bound R_bound and the aggregate
   reservation R.  However, in order to guarantee that R_bound remains
   an upper bound for R, we need to ensure that a termination message
   is sent at most once, i.e., that there are no retransmissions if
   the message is lost.  In practice, this property can be enforced by
   edge nodes, which maintain per-flow state.

   To ensure that the maximum inter-departure time is no larger than
   T_I, the ingress node may need to send a dummy packet when no data
   packet arrives for a flow during an interval T_I.  This can be
   achieved by having the ingress node maintain a timer for each flow.
   An optimization would be to aggregate all "micro-flows" between
   each pair of ingress and egress nodes into one flow, compute the
   state value based on the aggregate reservation rate, and insert a
   dummy packet only if there is no data packet for the aggregate flow
   during an interval.

   Figure 1 - Admission control and rate estimation algorithm.
   -----------------------------------------------------------------
   Ingress node
   -----------------------------------------------------------------
   on packet p departure:
     i = get_flow(p);
     state(p) <- r[i] * (crt_time - prev_time[i]);
     prev_time[i] = crt_time;
   -----------------------------------------------------------------
   Interior node
   -----------------------------------------------------------------
   Reservation Estimation          | Admission Control
   -----------------------------------------------------------------
   on packet p arrival:            | on reservation request r:
     b <- state(p);                |   /* admission ctrl. test */
     L = L + b;                    |   if (R_bound + r <= C)
                                   |     Ra = Ra + r;
   on time-out T_W:                |     R_bound = R_bound + r;
     /* estimated reservation */   |     accept(r);
     R_est = L / T_W;              |   else
     R_bound = min(R_bound,        |     deny(r);
       R_est/(1 - f) + Ra);        |
     Ra = 0; L = 0;                |
   -----------------------------------------------------------------

4. Carrying State in Packets

   There are at least three ways to encode state in the packet header:
   (1) introduce a new IP option, and insert the option at the ingress
   router, (2) introduce a new header between layer 2 and layer 3, or
   (3) use the existing IP header.

   Option number 23 has been assigned for adding DPS state in packets.
   Inserting an IP option has the potential to provide a large space
   for encoding state.  However, it would require all routers within a
   DS domain to process IP options, which could complicate packet
   processing.

   Introducing a new header between layer 2 and layer 3 would require
   solutions to be devised for different layer 2 technologies.  In the
   context of MPLS [Rosen98, Rosen99] networks, the state can be
   encoded in a special label.
   One way to do this is to use a particular encoding of the
   experimental use field to indicate that a nested label on the label
   stack carries the PHB-specific state information rather than an
   ordinary label.  In this case, the label on the top of the stack
   would indicate the label-switched path, and the inner label the
   PHB-specific state.  This would require a small addition to the
   MPLS architecture to allow two labels to be pushed or popped in
   unison and treated as a single entity on the label stack.

4.1. Encoding State within an IP header

   In this section, we discuss the third option: storing the
   additional state in the IP header.  The biggest problem with using
   the IP header is finding enough space to insert the extra
   information.  The main challenge is to remain fully compatible with
   current standards and protocols.  In particular, we want the
   network domain to be transparent to end-to-end protocols, i.e., the
   egress node should restore the fields changed by ingress and
   interior nodes to their original values.  In this respect, we
   observe that the ip_off (fragment offset) field in the IPv4 header,
   which supports packet fragmentation and reassembly, is rarely used.
   For example, by analyzing traces of over 1.7 million packets on an
   OC-3 link [nlanr], we found that less than 0.22% of all packets
   were fragments.  In addition, there are a relatively small number
   of distinct fragment sizes.  Therefore, it is possible to use a
   fraction of the ip_off field to encode the fragment sizes, and the
   remaining bits to encode DPS information.  The idea can be
   implemented as follows.  When a packet arrives at an ingress node,
   the node checks whether the packet is a fragment or needs to be
   fragmented.  If neither is true, all 13 bits of the ip_off field in
   the packet header are used to encode DPS values.
   If the packet is a fragment, the fragment size is recoded into a
   more efficient representation and the rest of the bits are used to
   encode the DPS information.  The fragment size field will be
   restored at the egress node.

   In the above, we make the implicit assumption that a packet can be
   fragmented only by ingress nodes, and not by interior nodes.  This
   is consistent with the diffserv view that the forwarding behavior
   of a network's components is engineered to be compatible throughout
   a domain.

   In summary, this gives us up to 13 bits in the current IPv4 header
   to encode the PHB-specific state.

4.2. Example of State Encoding

   The state encoding is PHB dependent.  In this section, we give
   examples of encoding the state for the services described in
   Section 3.

4.2.1. Encoding a Flow's Rate

   Recall that in SPFQ, the PHB state is determined by the current
   rate of the flow to which the packet belongs.  One possible way to
   represent the rate estimate is to restrict it to only a small
   number of possible values.  For example, if we limit it to 128
   values, only seven bits are needed to represent this rate.  While
   this can be a reasonable solution in practice, in the following we
   propose a more sophisticated representation that allows us to
   express a larger range of values.

   Let r denote the packet label.  In the most general case r could be
   a floating point number.  To represent r we use an m bit mantissa
   and an n bit exponent.  Since r >= 0, it is possible to gain an
   extra bit for the mantissa.  For this we consider two cases:
   (a) if r >= 2^m, we represent r as the closest value of the form
   u * 2^v, where 2^m <= u < 2^(m+1).  Then, since the (m+1)-th most
   significant bit in u's representation is always 1, we can ignore
   it.  As an example, assume m = 3, n = 4, and r = 19 = 10011.  Then
   19 is represented as 18 = u * 2^v, where u = 9 = 1001 and v = 1.
   By ignoring the first bit in the representation of u, the mantissa
   will store 001, while the exponent will be 1.
   (b) On the other hand, if r < 2^m, the mantissa will contain r,
   while the exponent will be 2^n - 1.  For example, for m = 3, n = 4,
   and r = 6 = 110, the mantissa is 110, while the exponent is 1111.
   Converting from one format to another can be efficiently
   implemented.  Figure 2 shows the conversion code in C.  For
   simplicity, here we assume that integers are truncated rather than
   rounded when represented in floating point.

   Figure 2. The C code for converting between integer and floating
             point formats.  m represents the number of bits used by
             the mantissa; n represents the number of bits in the
             exponent.

   -----------------------------------------------------------------
   void intToFP(int val, int *mantissa, int *exponent) {
       int nbits = get_num_bits(val);  /* significant bits in val */
       if (nbits <= m) {
           *mantissa = val;
           *exponent = (1 << n) - 1;   /* all-ones escape code */
       } else {
           *exponent = nbits - m - 1;
           *mantissa = (val >> *exponent) - (1 << m);
       }
   }

   int FPToInt(int mantissa, int exponent) {
       int tmp;
       if (exponent == ((1 << n) - 1))
           return mantissa;
       tmp = mantissa | (1 << m);
       return (tmp << exponent);
   }
   -----------------------------------------------------------------

   By using m bits for the mantissa and n for the exponent, we can
   represent any integer in the range [0..(2^(m+1)-1) * 2^(2^n - 2)]
   (the all-ones exponent being reserved as the escape code for small
   values) with a relative error bounded by (-1/2^(m+1), 1/2^(m+1)).
   For example, with 7 bits, by allocating 3 for the mantissa and 4
   for the exponent, we can represent any integer in the range
   [1..15*2^14] with a relative error of (-6.25%, 6.25%).  The worst
   relative error occurs when the mantissa is 8.  For example, the
   number r = 271 = 100001111 is encoded as u = 1000, v = 5, with a
   relative error of (8*2^5 - 271)/271 = -0.0554 = -5.54%.
   Similarly, r = 273 = 100010001 is encoded as u = 1001, v = 5, with
   a relative error of (9*2^5 - 273)/273 = 0.0549 = 5.49%.

4.2.2. Encoding Reservation State

   As shown in Figure 1, when estimating the aggregate reservation,
   the PHB state represents the number of bits that a flow is entitled
   to send during the interval from the time when the previous packet
   of the flow was transmitted until the current packet is
   transmitted.  This number can be simply encoded as an integer b.
   To reduce the range, a possibility is to store b/l instead of b,
   where l is the length of the packet.

4.3. Encoding Multiple Values

   Since the space in the packet's header is a scarce resource,
   encoding multiple values is particularly challenging.  In this
   section we discuss two general methods that help to alleviate this
   difficulty.

   In the first method, the idea is to leverage additional knowledge
   about the state semantics to achieve an efficient encoding.  In
   particular, one value can be stored as a function of other values.
   For example, if a value is known to be always greater than the
   others, the larger value can be represented in floating point
   format, while the other values may be represented as fractions of
   this value.

   The idea of the second method is to have different packets within a
   flow carry different state formats.  This method is appropriate for
   PHBs that do not require all packets of a flow to carry the same
   state.  For example, in estimating the aggregate reservation (see
   Section 3.2) there is no need for every packet to carry the number
   of bits the flow is entitled to send between the current time and
   the time when the previous packet was transmitted.  The only
   requirement is that the distance between any two consecutive
   packets that carry such values be no larger than T_I.  Other
   packets in between can carry different information.
   Similarly, if we encode the IP fragment size in the packet's state,
   the packet has to carry this value only if it is actually a
   fragment; otherwise the packet can carry other state instead.  On
   the other hand, note that in SPFQ it is mandatory that every packet
   be labelled by the ingress edge, as this value is used in making
   forward/drop decisions by interior routers.

5. Specification Issues

   This section briefly describes some issues related to drafting
   specifications for PHBs based on DPS.

5.1. Recommended Codepoints

   At this time it is appropriate to use values drawn from the 16
   codepoints [Nichols98] reserved for local and experimental use
   (xxxx11) to encode PHBs based on DPS.

5.2. Interaction with Other PHBs

   The interaction of DPS PHBs with other PHBs obviously depends on
   the PHB semantics.  It should be noted that the presence of other
   PHBs in a node may affect the computation and update of DPS state,
   as well as the actual forwarding behavior experienced by the
   packet.

5.3. Tunneling

   When packets with PHBs based on DPS are tunneled, the end nodes
   must make sure that (1) the tunnel is marked with a PHB that does
   not violate the original PHB semantics, and (2) the PHB-specific
   state is correctly updated at the end of the tunnel.  This
   requirement might be met by using a tunnel PHB that records and
   updates packet state, and then copying the state from the
   encapsulating packet to the inner packet at the tunnel endpoint.
   Alternatively, the behavior of the tunnel might be measured or
   precomputed in a way that allows the encapsulated packet's DPS
   state to be updated at the decapsulation point without requiring
   the tunnel to support DPS behavior.

6. Security Considerations

   The space allocated for the PHB state in the packet header must be
   compatible with IPsec.
   In this context we note that using the fragment offset to carry PHB
   state does not affect IPsec's end-to-end security, since the
   fragment offset is not used for cryptographic calculations
   [Kent98].  Thus, as is the case with the DS field [Nichols98],
   IPsec does not provide any defense against malicious modifications
   of the PHB state.  This leaves the door open for theft of service,
   which in turn may cause denial of service to other conforming
   users.  For example, in SPFQ, a label based on a small rate
   estimate may cause a disproportionate amount of bandwidth to be
   allocated to the flow inside the DS domain.  In the example in
   Section 3.2.2, underestimation of the aggregate reservation can
   lead to resource overprovisioning.  One way to expose denial of
   service attacks is by auditing.  In this context, we note that
   associating state with PHBs makes it easier to perform efficient
   auditing at interior nodes.  For example, in SPFQ, such an attack
   can be detected by simply measuring a flow's rate and then
   comparing it against the label carried by the flow's packets.

   Security considerations covered in [Blake98] that correspond to
   diffserv code points also apply to PHB code points for DPS.

7. Conclusions

   In this document we have proposed an extension of the diffserv
   architecture by defining a new set of PHBs that are based on
   Dynamic Packet State.  By using DPS to coordinate actions of edge
   and interior nodes along the path traversed by a flow, distributed
   algorithms can be designed to approximate the behavior of a broad
   class of "stateful" networks within the diffserv architecture.
   Such an extension will significantly increase the flexibility and
   capabilities of the services that can be provided by diffserv.

8. References

   [Barnes01]   R. Barnes, R. Srikant, J. Mysore, and N.
                Venkitaraman.  Analysis of Stateless Fair Queuing
                Algorithms.  Proc.
                of the 35th Annual Conference on Information Sciences
                and Systems, March 2001.

   [Blake98]    S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang,
                and W. Weiss.  An Architecture for Differentiated
                Services.  RFC 2475, December 1998.

   [Cao00]      Z. Cao, Z. Wang, and E. Zegura.  Rainbow Fair
                Queueing: Fair Bandwidth Sharing Without Per-Flow
                State.  Proc. of INFOCOM 2000.

   [Heinanen99] J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski.
                Assured Forwarding PHB Group.  RFC 2597, June 1999.

   [Jacobson98] V. Jacobson, K. Poduri, and K. Nichols.  An Expedited
                Forwarding PHB.  RFC 2598, June 1999.

   [Kent98]     S. Kent and R. Atkinson.  IP Authentication Header.
                RFC 2402, November 1998.

   [Nichols98]  K. Nichols, S. Blake, F. Baker, and D. L. Black.
                Definition of the Differentiated Services Field (DS
                Field) in the IPv4 and IPv6 Headers.  RFC 2474,
                December 1998.

   [Stoica98]   I. Stoica, S. Shenker, and H. Zhang.  Core-Stateless
                Fair Queueing: Achieving Approximately Fair Bandwidth
                Allocations in High Speed Networks.  In Proceedings
                of ACM SIGCOMM'98, pages 118-130, Vancouver,
                September 1998.

   [StoZha99]   I. Stoica and H. Zhang.  Providing Guaranteed
                Services Without Per-flow Management.  In Proceedings
                of ACM SIGCOMM'99, Boston, September 1999.

   [Venkitar02] N. Venkitaraman, J. Mysore, and M. Needham.  Core-
                Stateless Utility Function Based Rate Allocation.  In
                Proceedings of PfHSN 2002, Berlin, April 2002.

9. Authors' Addresses

   Ion Stoica                          Hui Zhang
   645 Soda Hall                       Wean Hall 7115
   Computer Science Division           School of Computer Science
   University of California, Berkeley  Carnegie Mellon University
   Berkeley, CA 94720                  Pittsburgh, PA 15213
   istoica@cs.berkeley.edu             hzhang@cs.cmu.edu

   Narayanan Venkitaraman              Jayanth Mysore
   Motorola Labs                       Motorola Labs
   1301 E. Algonquin Rd.               1301 E. Algonquin Rd.
   Schaumburg, IL 60196                Schaumburg, IL 60196
   venkitar@labs.mot.com               jayanth@labs.mot.com

10.
Full Copyright Statement

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

   This document and translations of it may be copied and furnished
   to others, and derivative works that comment on or otherwise
   explain it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards
   process must be followed, or as required to translate it into
   languages other than English.

   The limited permissions granted above are perpetual and will not
   be revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.