Internet Engineering Task Force I. Stoica UC, Berkeley Internet Draft H. Zhang CMU Expires April 2003 N. Venkitaraman Motorola Labs J. Mysore Motorola Labs October 2002 Per Hop Behaviors Based on Dynamic Packet State Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Copyright Notice Copyright (C) The Internet Society (2002). All Rights Reserved. Abstract This document proposes a family of Per-Hop Behaviors (PHBs) based on Dynamic Packet State (DPS) in the context of the differentiated service architecture. With these PHBs, distributed algorithms can be devised to implement services with flexibility, utilization, and assurance levels similar to those that can be provided with per-flow mechanisms. With Dynamic Packet State, each packet carries in its header, in addition to the PHB codepoint, some PHB-specific state. The state is initialized by the ingress node. Interior nodes process each incoming packet based on the state carried in the packet's header, updating both its internal state and the state in the packet's header before forwarding it to the next hop. By using DPS to coordinate actions of edge and interior nodes along the path traversed by a flow, distributed algorithms can be designed to approximate the behavior of a broad class of "stateful" networks using networks in which interior nodes do not maintain Stoica et al Expires April 2003 [Page 1] Internet Draft PHBs based on Dynamic Packet State October 2002 per-flow state. We give examples of services that can be implemented by PHBs based on DPS. We also discuss several possible solutions for encoding Dynamic Packet State that have the minimum incompatibility with IPv4. 1. Introduction While the diffserv architecture [Blake98] is highly scalable, the services it can provide have lower flexibility, utilization, or assurance levels than services provided by architectures that employ per-flow management at every node. It is debatable whether this should be of significant concern. For example, the low utilization by the premium traffic may be acceptable if the majority of traffic will be best effort, either because the best effort service is "good enough" for majority applications or the price difference between premium traffic and best effort traffic is too high to justify the performance difference between them. Alternatively, if the guaranteed nature of service assurance is not needed, i.e., statistical service assurance is sufficient for premium service, we can then achieve higher network utilization. Providing meaningful statistical service is still an open research problem. A discussion of these topics is beyond the scope of this document. Furthermore, there is usually a tradeoff between the complexity of the QoS mechanism and the efficiency of the resource usage. While intserv-style guaranteed service can achieve high resource utilization, premium service needs a much simpler mechanism. This document proposes a new set of PHBs based on Dynamic Packet State (DPS). These PHBs have relative low complexities (do not require per-flow state), but can be used to implement distributed algorithms to provide services with flexibility, utilization, and assurance levels similar to those that can be achieved with per-flow mechanisms. DS domains implemented with these type of PHBs should interoperate with DS domains implemented with other PHBs. With Dynamic Packet State, each packet carries in its header, in addition to the PHB codepoint, some PHB-specific state. The state is initialized by an ingress node, when the packet enters the DS domain. Interior nodes process each incoming packet based on the state carried in the packet's header, updating both its internal state and the state in the packet's header before forwarding it. By using DPS to coordinate actions of edge and interior nodes along the path traversed by a flow, distributed algorithms can be designed to approximate the behavior of a broad class of "stateful" networks. Consequently, introducing PHBs based on DPS will significantly increase the flexibility and capabilities of the services that can be built within the diffserv architecture. In particular, we will show that a variety of QoS services can be implemented by PHBs based on DPS. These include weighted fair share service, and distributed admission control service. In this document, we use flow to refer to a subset of packets that traverse the same path inside a DS domain between two edge Stoica et al Expires April 2003 [Page 2] Internet Draft PHBs based on Dynamic Packet State October 2002 nodes. Thus, with the highest level of traffic aggregation, a flow consists of all packets between the same pair of ingress and egress nodes. This document is organized as follows. Section 2 gives a general description of PHBs based on DPS. Section 3 presents several services that can be implemented with PHBs based on DPS. Section 4 discusses alternative techniques of storing state in the packet's header. Sections 5 briefly discusses some issues related to the specification of DPS PHB's, such as codepoints, tunneling behavior, and interaction with other PHB's. Section 6 discusses security issues. 2. Description of Per-Hop Behaviors Based on Dynamic Packet State Unlike common PHB codepoints [Blake98, Heinanen99, Jacobson98], a PHB codepoint based on DPS has extra state associated with it. This state is initialized by ingress nodes and carried by packets inside the DS domain. The state semantic is PHB dependent. The values taken by the state can be either flow, path, or packet dependent. The state carried by packets can be used by interior nodes for a variety of purposes such as, packet scheduling, updating the local node's state, or extending the codepoint space. When a PHB based on DPS is defined, in addition to the guidelines given in [Blake98], the following items must be specified: o state semantic - the meaning of the information carried by the packets o state placement - where is the information stored in the packet's header o encoding format - how is the information encoded in packets For example, consider a PHB that implements the Stateless Prioritized Fair Queue Sharing algorithm, which is described in Section 3.1. In this case, the state carried by a packet is based on an estimate of the current rate of the flow to which the packet belongs. The state can be placed either (a) between layer two and layer three headers, (b) as an IP option, or (c) in the IP header (see Section 4). Finally, the rate can be encoded by using a floating point like format as described in Section 4.1.1. In addition, the following requirement, called the transparency requirement, must be satisfied o All changes performed at ingress nodes or within the DS domain on packets' headers (possible for the purpose of the state) must be undone by egress nodes Stoica et al Expires April 2003 [Page 3] Internet Draft PHBs based on Dynamic Packet State October 2002 Any document defining a PHB based on DPS must specify a default placement of the state in the packet header and a default bit encoding format. However, to increase the flexibility, it is acceptable for documents to define alternate state placements and encoding formats. Any router that claims to be compatible with a particular PHB based on DPS must support at least the default placement and the default bit encoding format. 3. Examples of Services that can be Implemented by PHBs Based on DPS To illustrate the power and the flexibility of the PHBs based on DPS, we give a few examples. In the first, we address the congestion control problem by approximating the functionality of a "reference" network in which each node performs fair queuing. 3.1. Stateless Prioritized Fair Queue-sharing (SPFQ) We first explain SPFQ using an idealized fluid model and then present its packetized version. 3.1.1 Fluid Model Algorithm: Bit Labeling We first restate the key observation in CSFQ [Stoica98] that we will also use. In a router implementing fair queuing, all flows that are bottlenecked at a router have the same output rate. We call this the rate threshold(r_t(t)). Let C be the capacity of an output link at a router, and r_i(t) the arrival rate of flow i in bits per second. Let A(t) denote the total arrival rate of n flows: A(t)= r_1(t) + r_2(t) + ... + r_n(t). If A(t) <= C then no bits are dropped. But if A(t) > C, then r_t(t) is the unique solution to C = min(r_1(t), r_t(t)) + min(r_2(t), r_t(t)) + ... + min(r_n(t), r_t(t)). In general, if max-min bandwidth allocations are achieved, each flow i receives service at a rate given by min(r_i(t), r_t(t)). For every flow, in any given second, we consider up to r_t(t) bits of the flow as being conforming, and all bits in excess of that as being non-conforming. If we mark every bit of a flow as being conforming or non-conforming, we can obtain the allocation provided by a fair queuing router, by simply having routers accept all conforming bits of a flow and dropping all non-conforming bits. What we need now, is a labeling algorithm at the ingress node, that would enable a router to distinguish between conforming and non- conforming traffic. Consider the simple sequential labeling algorithm: label(bit) served += 1 bit->label = served where the value of 'served' is reset to 0, after every second. Stoica et al Expires April 2003 [Page 4] Internet Draft PHBs based on Dynamic Packet State October 2002 Let us suppose that the rate at which each flow is sending bits is constant. The result of this algorithm is that during any given second, the bits from a flow sending at rate r_i bits per second are marked sequentially from 1 to r_i; and the label is reset to 0 at the end of each second. Then, for any given flow, accepting bits with label 1 to 'r_a', would be equivalent to providing the flow with a rate of 'r_a' bits per second. So, if all bits carry such a label, a router can simply identify non-conforming bits to be those with label > r_t and drop them. Consequently, no flow can receive a service in excess of r_t . Furthermore, as all bits with label <= r_t are accepted, all flows sending at a rate less than or equal to r_t will not have any of their bits dropped. As described in [Venkitar02], a key advantage of such a labeling procedure is that it allows us to convey rate information as well as intra-flow priority using the same field in the packet header. 3.1.2 Packet Labeling In order to extend the sequential labeling algorithm given for the fluid model to a packetized model, we essentially need to take variable packet sizes into account. Hence, instead of incrementing the counter 'served' by 1 (which is the size of any packet in a network with purely single-bit packets), we increment the value of 'served' by the size of the packet. Given below is a pseudo code for the packetized version of the sequential marking algorithm. label(pkt) served += pkt->size pkt->label = served where the value of served is reset to 0, after a fixed size epoch. All ingress routers in a DS domain must use epochs of equal duration. The size of the epoch is a design parameter that should be chosen to reflect a tradeoff between mimicking the fluid model accurately and not giving an unfair advantage to flows that arrived most recently in the system. To understand this tradeoff, suppose that the epoch is chosen to be very long and that a new flow arrives in the middle of an epoch. Then the bits from the new flow would be labeled starting from a value of one and would have a higher priority throughout the rest of the epoch than the bits of flows that have been sending bits from the beginning of the epoch. 3.1.3 Forwarding Decision in a Router [Stoica98], [Cao00], [Barnes01] and [Venkitar02] discuss algorithms for updating the rate threshold (r_t) based on link state, i.e, based on parameters such as queue length and the aggregate accepted rate of packets. The forwarding decision in a router is then made based on the following algorithm: Stoica et al Expires April 2003 [Page 5] Internet Draft PHBs based on Dynamic Packet State October 2002 enque(pkt) if (pkt->label <= r_t) Accept(pkt) else Drop(pkt) 3.1.4 Additional Services based on SPFQ We now present examples of labeling methods at the edge that can provide different services while retaining the same forwarding behavior. 3.1.4.1 Weighted SPFQ The SPFQ algorithm can be readily extended to support flows with different weights. Let w_i be the weight of flow i. An allocation is weighted fair if all bottlenecked flows have the same value for r_i/w_i. The only major change required to achieve weighted fair allocation is in the ingress labeling algorithm, where we need to use 'served/w_i' instead of 'served'. This enables per-flow service differentiation without maintaining per-flow state in the core nodes of the network. 3.1.4.2 Minimum bandwidth allocation From the forwarding algorithm given in section 2, it is clear that the packets marked with lower values of label are dropped only after all packets with larger labels have been dropped. This suggests that packets marked with the smallest label (of 0) will not be dropped as long as the aggregate rate of such packets does not exceed the link capacity. So, for a flow requiring a minimum bandwidth allocation of 'b_min', labeling packets with the smallest label at a rate of 'b_min' would ensure that the flow will receive the rate that it has been guaranteed within a reasonable time window (assuming that there is no packet loss due to channel error). An admission control mechanism should be used to ensure that the aggregate reserved rate does not exceed the capacity of the link. A distributed admission control mechanism, such as the one proposed in section 3.2 can be used for this purpose. 3.2 Distributed Admission Control The previous examples focused on data path mechanisms and services. In this section, we will show that PHBs based on DPS can also implement control plane services such as distributed admission control. Admission control is a central component in providing quantitatively defined QoS services. The main job of the admission control test is to ensure that the network resources are not over-committed. In particular it has to ensure that the sum of the reservation rates of all flows that traverse any link in the network is no larger than the link capacity C. Stoica et al Expires April 2003 [Page 6] Internet Draft PHBs based on Dynamic Packet State October 2002 A new reservation request is granted if it passes the admission test at each hop along its path. There are two main approaches to implementing admission control. Traditional reservation-based networks adopt a distributed architecture in which each node is responsible for its local resources, and where nodes are assumed to maintain per-flow state. To support the dynamic creation and deletion of fine grained flows, a large quantity of dynamic per flow state needs to be maintained in a distributed fashion. Complex signaling protocols (e.g., RSVP and ATM UNI) are used to maintain the consistency of this per-flow state. A second approach is to use a centralized bandwidth broker that maintains the topology as well as the state of all nodes in the network. In this case, the admission control can be implemented by the broker, eliminating the need for maintaining distributed state. Such a centralized approach is more appropriate for an environment where most flows are long lived, and set-up and tear-down events are rare. However, to support fine grained and dynamic flows, there may be a need for a distributed broker architecture, in which the broker database is replicated or partitioned. Such an architecture eliminates the need for a signaling protocol, but requires another protocol to maintain the consistency of the different broker databases. Unfortunately, since it is impossible to achieve perfect consistency, this may create race conditions and/or resource fragmentation. A third approach is to use a simplified provisioning model that is not aware of the details of the network topology, but instead admits a new flow if there is sufficient bandwidth available for the flow's packets to travel anywhere in the network with adequate QoS. This simplified model may be based on static provisioning and service level agreements, or on a simple dynamic bandwidth broker. In any case, the tradeoff made in return for the simplicity is that the admission control decision must be more conservative, and a much smaller level of QoS-controlled service can be supported. In the following, we show that by using a PHB based on DPS, it is possible to implement distributed admission control for guaranteed services in a DS domain. In our scheme, for each reservation, the ingress node generates a request message that is forwarded along the path. In turn, each interior node decides whether or not to accept the request. When the message reaches the egress node it is returned to the ingress, which makes the final decision. We do not make any reliability assumptions about the request messages. In addition, the algorithms does not require reservation termination messages. In the following we describe the per-hop admission control. [StoZha99] describes how this scheme can be integrated with RSVP to provide end-to-end delay bounded services. 3.2.1. Per-Hop Admission Control The solution is based on two main ideas. First, we always maintain a conservative upper bound of the aggregate reservation R, denoted Stoica et al Expires April 2003 [Page 7] Internet Draft PHBs based on Dynamic Packet State October 2002 R_bound, which we use for making admission control decisions. R_bound is updated with a simple rule: R_bound = R_bound + r whenever a request for a rate r is received and R_bound + r <= C. It should be noted that in order to maintain the invariant that R_bound is an upper bound of R, this algorithm does not need to detect duplicate request messages, generated either due to retransmission in case of packet loss or retry in case of partial reservation failures. Of course, the obvious problem with the algorithm is that R_bound will diverge from R. At the limit, when R_bound reaches the link capacity C, no new requests can be accepted even though there might be available capacity. To address this problem, we introduce a separate algorithm, based on DPS, that periodically estimates the aggregate reserved rate. Based on this estimate we compute a second upper bound for R, and then use it to re-calibrate the upper bound R_bound. An important aspect of the estimation algorithm is that it can be actually shown that the discrepancy between the upper bound and the actual reserved rate R is bounded. Then the re-calibration process reduces to choosing the minimum of the two upper bounds. 3.2.2. Estimation Algorithm for the Aggregate Reservation To estimate the aggregate reservation, denoted R_est, we again use DPS. In this case, the state of each packet consists of the amount of bits a flow is entitled to send during the interval between the time when the previous packet was transmitted up to the time when the current packet is transmitted. Note that unlike the previous examples, in this case the state carried by the packet does not affect the packet's processing by interior nodes. This state is solely used to compute each node's aggregate reservation. The estimation performed by interior nodes is based on the following simple observation: the sum of state values of packets of all flows received during an interval (a, a + T_W] is a good approximation for the total number of bits that all flows are entitled to send during this interval. Dividing this sum by T_W, we obtain the aggregate reservation rate. This is basically the rate estimation algorithm, though we need to account for several estimation errors. In particular, we need to account for the fact that not all reservations continue for the entire duration of interval (a, a + T_W]. We divide time into intervals of length T_W. Let (u1, u2] be such an interval, where u2 = u1 + T_W. Let T_I be the maximum inter-departure time between two consecutive packets in the same flow and T_J be the maximum jitter of a flow, both of which are much smaller than T_W. Further, each interior node is associated a global variable Ra which is initialized at the beginning of each interval (u1, u2] to zero, and is updated to Ra + r every time a request for a reservation r is received and the admission test is passed, i.e., R_bound + r <= C. Stoica et al Expires April 2003 [Page 8] Internet Draft PHBs based on Dynamic Packet State October 2002 Let Ra(t) denote the value of this variable at time t. Since interior nodes do not differentiate between the original and duplicate requests, Ra(t) is an upper-bound of the sum of all reservations accepted during the interval (u1, t]. (For simplicity, here we assume that a flow which is granted a reservation during the interval (u1, u2] becomes active no later than u2.) Then, it can be shown that the aggregate reservation at time u2, R(u2), is bounded by R(u2) < R_est(u2)/(1-f) + Ra(u2), (7) where f = (T_I + T_J)/T_W. Finally, this bound is used to re-calibrate the upper bound of the aggregate reservation R_bound(u1) as follows R_bound(u2) = min(R_bound(u2), R_est(u2)/(1-f) + Ra(u2)). (8) Figure 1 shows the pseudocode of the admission decision and of the aggregate reservation estimation algorithm at ingress and interior nodes. We make several observations. First, the estimation algorithm uses only the information in the current interval. This makes the algorithm robust with respect to loss and duplication of signaling packets since their effects are "forgotten" after one time interval. As an example, if a node processes both the original and a duplicate of the same reservation request during the interval (u1, u2], R_bound will be updated twice for the same flow. However, this erroneous update will not be reflected in the computation of R_est(u3), since its computation is based only on the state values received during (u2, u3]. As a consequence, it can be show that the admission control algorithm can asymptotically reach a link utilization of C (1 - f)/(1 + f) [StoZha99]. A possible optimization of the admission control algorithm is to add reservation termination messages. This will reduce the discrepancy between the upper bound R_bound and the aggregate reservation R. However, in order to guarantee that R_bound remains an upper bound for R, we need to ensure that a termination message is sent at most once, i.e., there are no retransmissions if the message is lost. In practice, this property can be enforced by edge nodes, which maintain per-flow state. To ensure that the maximum inter-departure time is no larger than T_I, the ingress node may need to send a dummy packet in the case when no data packet arrives for a flow during an interval T_I. This can be achieved by having the ingress node to maintain a timer with each flow. An optimization would be to aggregate all "micro-flows" between each pair of ingress and egress nodes into one flow, and compute the state value based on the aggregated reservation rate, and insert a dummy packet only if there is no data packet for the aggregate flow during an interval. Stoica et al Expires April 2003 [Page 9] Internet Draft PHBs based on Dynamic Packet State October 2002 Figure 1 - Admission control and rate estimation algorithm. ----------------------------------------------------------------- Ingress node ----------------------------------------------------------------- on packet p departure: i = get_flow(p); state(p) <- r[i] * (crt_time - prev_time[i]); prev_time[i] = crt_time; ----------------------------------------------------------------- Interior node ----------------------------------------------------------------- Reservation Estimation | Admission Control ----------------------------------------------------------------- on packet p arrival: | on reservation request r: b <- state(p); | /* admission ctrl. test */ L = L + b; | if (R_bound + r <= C) | Ra = Ra + r; on time-out T_W: | R_bound = R_bound + r; /* estimated reservation */ | accept(r); R_est = L / T_W; | else R_bound = min(R_bound, | deny(r); R_est/(1 - f) + Ra);| Ra = 0; | ----------------------------------------------------------------- 4. Carrying State in Packets There are at least three ways to encode state in the packet header: (1) introduce a new IP option, and insert the option at the ingress router, (2) introduce a new header between layer 2 and layer 3, and (3) use the existing IP header. Option number 23 has been assigned for adding DPS state in packets. Inserting an IP option, has the potential to provide a large space for encoding state. However it will require all routers within a DS domain to process IP options, which could complicate packet processing. Introducing a new header between layer 2 and layer 3 would require solutions be devised for different layer 2 technologies. In the context of MPLS [Rosen98, Rosen99] networks, the state can be encoded in a special label. One way to do this is by using a particular encoding of the experimental use field indicating a nested label on the label stack that carried the PHB-specific state information rather than an ordinary label. In this case, the label on the top of the stack would indicate the label-switched path, and the inner label the PHB-specific state. This would require a small addition to the MPLS architecture to allow two labels to be pushed or popped in unison, and treated as a single entity on the label stack. Stoica et al Expires April 2003 [Page 10] Internet Draft PHBs based on Dynamic Packet State October 2002 4.1. Encoding State within an IP header In this section, we discuss the third option: storing the additional states in the IP header. The biggest problem with using the IP header is to find enough space to insert the extra information. The main challenge is to remain fully compatible with current standards and protocols. In particular, we want the network domain to be transparent to end-to-end protocols, i.e., the egress node should restore the fields changed by ingress and interior nodes to their original values. In this respect, we observe that there is an ip_off field in the IPv4 header to support packet fragmentation/reassembly which is rarely used. For example, by analyzing the traces of over 1.7 million packets on an OC-3 link [nlanr], we found that less than 0.22% of all packets were fragments. In addition, ther are a relatively small number of distinct fragment sizes. Therefore, it is possible to use a fraction of ip_off field to encode the fragment sizes, and the remaining bits to encode DPS information. The idea can be implemented as follows. When a packet arrives at an ingress node, the node checks whether a packet is a fragment or needs to be fragmented. If neither of these is true, all 13 bits of the ip_off field in the packet header will be used to encode DPS values. If the packet is a fragment, the fragment size is recoded into a more efficient representation and the rest of the bits is used to encode the DPS information. The fragment size field will be restored at the egress node. In the above, we make the implicit assumption that a packet can be fragmented only by ingress nodes, and not by interior nodes. This is consistent with the diffserv view that the forwarding behavior of a network's component is engineered to be compatible throughout a domain. In summary, this gives us up to 13 bits in the current IPv4 header to encode the PHB specific state. 4.2. Example of State Encoding The state encoding is PHB dependent. In this section, we give examples of encoding the state for the services described in Section 3. 4.2.1. Encoding Flow's Rate Recall that in SPFQ, the PHB state is determined by the current rate of the flow to which the packet belongs. One possible way to represent the rate estimate is to restrict it to only a small number of possible values. For example if we limit it to 128 values, only seven bits are needed to represent this rate. While this can be a reasonable solution in practice, in the following we propose a more sophisticated representation that allows us to express a larger range of values. Let r denote the packet label. In the most general case r could be a floating poing number. To represent r we use an m bit mantissa and an n bit exponent. Since r >= 0, it is possible to gain Stoica et al Expires April 2003 [Page 11] Internet Draft PHBs based on Dynamic Packet State October 2002 an extra bit for mantissa. For this we consider two cases: (a) if r >= 2^m we represent r as the closest value of the form u 2^v, where 2^m <= u <= 2^(m+1). Then, since the (m+1)-th most significant bit in the u's representation is always 1, we can ignore it. As an example, assume m = 3, n = 4, and r = 19 = 10011. Then 19 is represented as 18 = u*2^v, where u = 9 = 1001 and v = 1. By ignoring the first bit in the representation of u the mantissa will store 001, while the exponent will be 1. (b) On the other hand, if r < 2^m, the mantissa will contain r, while the exponent will be 2^n - 1. For example, for m = 3, n = 4, and r = 6 = 110, the mantissa is 110, while the exponent is 1111. Converting from one format to another can be efficiently implemented. Figure 2 shows the conversion code in C. For simplicity, here we assume that integers are truncated rather than rounded when represented in floating point. Figure 2. The C code for converting between integer and floating point formats. m represents the number of bits used by the mantissa; n represents the number of bits in the exponent. ----------------------------------------------------------------- intToFP(int val, int *mantissa, int *exponent) { int nbits = get_num_bits(val); if (nbits <= m) { *mantissa = val; *exponent = (1 << n) - 1; } else { *exponent = nbits - m - 1; *mantissa = (val >> *exponent) - (1 << m); } } FPToInt(int mantissa, int exponent) { int tmp; if (exponent == ((1 << n) - 1)) return mantissa; tmp = mantissa | (1 << m); return (tmp << exponent) } ------------------------------------------------------------------ By using m bits for mantissa and n for exponent, we can represent any integer in the range [0..(2^(m+1)-1) * 2^(2^n - 1)] with a relative error bounded by (-1/2^(m+1), 1/2^(m+1)). For example, with 7 bits, by allocating 3 for mantissa and 4 for exponent, we can represent any integer in the range [1..15*2^15] with a relative error of (-6.25%, 6.25%). The worst relative error case occurs when the mantissa is 8. For example the number r = 271 = 100001111 is encoded as u = 1000, v=5, with a relative error of (8*2^5 - 271)/271 = -0.0554 = -5.54%. Similarly, r = 273 = 100010001 is encoded as u = 1001, v = 5, with a relative error of 5.55%. Stoica et al Expires April 2003 [Page 12] Internet Draft PHBs based on Dynamic Packet State October 2002 4.2.2. Encoding Reservation State As shown in Figure 1, when estimating the aggregate reservation, the PHB state represents the number of bits that a flow is entitled to send during the interval between the time when the previous packet of the flow has been transmitted until the current packet is transmitted. This number can be simply encoded as an integer b. To reduce the range, a possibility is to store b/l instead of b, where l is the length of the packet. 4.3. Encoding Multiple Values Since the space in the packet's header is a scarce resource, encoding multiple values is particularly challenging. In this section we discuss two general methods that helps to alleviate this difficulty. In the first method, the idea is to leverage additional knowledge about the state semantic to achieve efficient encoding. In particular one value can be stored as a function of other values. For example, if a value is known to be always greater than the other values, the larger value can be represented in floating point format, while the other values may be represented as fractions of this value. The idea of the second method is to have different packets within a flow carry different state formats. This method is appropriate for PHBs that do not require all packets of a flow to carry the same state. For example, in estimating the aggregate reservation (see Section 3.2) there is no need for every packet to carry the number of bits the flow is entitled to send between the current time and the time when the previous packet has been transmitted. The only requirement is that the distance between any two consecutive packets that carry such values to be no larger than T_I. Other packets in between can carry different information. Similarly, if we encode the IP fragment size in the packet's state, the packet has to carry this value only if the IP fragment is not zero. When the IP fragment is zero the packet can carry other state instead. On the other hand, note that in SPFQ, it is mandatory that every packet be labelled by the ingress edge, as this value is used in making forward/drop decisions by ingress routers. 5. Specification Issues This section briefly describes some issues related to drafting specifications for PHB's based on DPS. 5.1. Recommended Codepoints At this time it is appropriate to use values drawn from the 16 codepoints [Nichols98] reserved for local and experimental use (xxxx11) to encode PHBs based on DPS. Stoica et al Expires April 2003 [Page 13] Internet Draft PHBs based on Dynamic Packet State October 2002 5.2. Interaction with other PHBs The interaction of DPS PHB's with other PHB's obviously depends on the PHB semantic. It should be noted that the presence of other PHB's in a node may affect the computation and update of DPS state as well as the actual forwarding behavior experienced by the packet. 5.3. Tunneling When packets with PHBs based on DPS are tunneled, the end-nodes must make sure that (1) the tunnel is marked with a PHB that does not violate the original PHB semantic, and (2) the PHB specific state is correctly updated at the end of the tunnel. This requirement might be met by using a tunnel PHB that records and updates packet state, and then copying the state from the encapsulating packet to the inner packet at the tunnel endpoint. Alternatively, the behavior of the tunnel might be measured or precomputed in a way that allows the encapsulated packet's DPS state to be updated at the decapsulation point without requiring the tunnel to support DPS behavior. 6. Security Considerations The space allocated for the PHB state in the packet header must be compatible with IPsec. In this context we note that using the fragment offset to carry PHB state does not affect IPsec's end-to-end security, since the fragment offset is not used for cryptographic calculations [Kent98]. Thus, as it is the case with the DS field [Nichols98], IPSec does not provide any defense against malicious modifications of the PHB state. This leaves the door open for theft of service, which inturn May cause denial of service to other conforming users. For example, in SPFQ, a label based on a small rate estimate may cause disproportionate bandwidth being allocated to the flow inside the DS domain. In the example in Section 3.2.2, the under estimation of the aggregate reservation can lead to resource overprovision. One way to expose denial of service attacks is by auditing. In this context, we note that associating state with PHBs makes it easier to perform efficient auditing at interior nodes. For example, in SPFQ, an eventual attack can be detected by simply measuring a flow rate and then comparing it against the label carried by the flow's packets. Security considerations covered in [Blake98] that correspond to diffserv code points also apply to PHB code points for DPS. 7. Conclusions In this document we have proposed an extension of the diffserv architecture by defining a new set of PHBs that are based on Dynamic Packet State. By using DPS to coordinate actions of edge and interior nodes along the path traversed by a flow, distributed algorithms can be designed to approximate the behavior of a broad class of "stateful" networks within the diffserv architecture. Such an extension will significantly increase the flexibility and capabilities of the services that can be provided by diffserv. Stoica et al Expires April 2003 [Page 14] Internet Draft PHBs based on Dynamic Packet State October 2002 8. References [Barnes01] R. Barnes, R. Srikant, J. Mysore, N. Venkitaraman. Analysis of Stateless Fair Queuing Algorithms, Proc. of the 35th Annual Conference on Information Sciences and Systems, March 2001. [Blake98] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. An Architecture for Differentiated Services, RFC 2475 December 1998. [Cao00] Z. Cao, Z Wang and E. Zegura, Rainbow Fair Queueing: Fair Bandwidth Sharing Without Per-Flow State, Proc. of INFOCOM 2000. [Heinanen99] J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski. Assured Forwarding PHB Group, RFC 2597, June 1999. [Jacobson98] V. Jacobson, K. Poduri and K. Nichols. An Expedited Forwarding PHB, RFC 2598, June 1999. [Kent98] S. Kent and R. Atkinson. IP Authentication Header, RFC 2402, November 1998. [Nichols98] K. Nichols, S. Blake, F. Baker, and D. L. Black. Definition of the Differentiated Services Field (DS Field) in the ipv4 and ipv6 Headers, RFC 2474, December 1998. [Stoica98] I. Stoica, S. Shenker, and H. Zhang. Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks. In Proceedings ACM SIGCOMM'98, pages 118-130, Vancouver, September 1998. [StoZha99] I. Stoica and H. Zhang. Providing Guaranteed Services Without Per-flow Management. In Proceedings of ACM SIGCOMM'99, Boston, September 1999. [Venkitar02] N. Venkitaraman, J. Mysore, M. Needham. Core-Stateless Utility Function based Rate Allocation. Proceedings of PfHSN'2002, Berlin, April 2002. 9. Author's Addresses Ion Stoica Hui Zhang 645 Soda Hall Wean Hall 7115 Computer Science Division School of Computer Science University of California, Berkeley Carnegie Mellon University Berkeley, CA 94720 Pittsburgh, PA 15213 istoica@cs.berkeley.edu hzhang@cs.cmu.edu Narayanan Venkitaraman Jayanth Mysore Motorola Labs Motorola Labs, 1301 E. Algonquin Rd. 1301 E. Algonquin Rd. Schaumburg, IL 60196 Schaumburg, IL 60196 venkitar@labs.mot.com jayanth@labs.mot.com Stoica et al Expires April 2003 [Page 15] Internet Draft PHBs based on Dynamic Packet State October 2002 10. Full Copyright Statement Copyright (C) The Internet Society (1999). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Stoica et al Expires April 2003 [Page 16]