Internet Engineering Task Force                           I. Stoica
                                                        UC Berkeley
Internet Draft                                             H. Zhang
                                                                CMU
Expires April 2003                                  N. Venkitaraman
                                                      Motorola Labs
                                                          J. Mysore
                                                      Motorola Labs
                                                       October 2002

          Per Hop Behaviors Based on Dynamic Packet State

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time.  It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

Abstract

   This document proposes a family of Per-Hop Behaviors (PHBs) based
   on Dynamic Packet State (DPS) in the context of the differentiated
   services architecture.  With these PHBs, distributed algorithms can
   be devised to implement services with flexibility, utilization, and
   assurance levels similar to those that can be provided with
   per-flow mechanisms.

   With Dynamic Packet State, each packet carries in its header, in
   addition to the PHB codepoint, some PHB-specific state.  The state
   is initialized by the ingress node.  Interior nodes process each
   incoming packet based on the state carried in the packet's header,
   updating both their internal state and the state in the packet's
   header before forwarding it to the next hop.
   By using DPS to coordinate actions of edge and interior nodes along
   the path traversed by a flow, distributed algorithms can be
   designed to approximate the behavior of a broad class of "stateful"
   networks using networks in which interior nodes do not maintain
   per-flow state.  We give examples of services that can be
   implemented by PHBs based on DPS.  We also discuss several possible
   solutions for encoding Dynamic Packet State that minimize
   incompatibility with IPv4.

1. Introduction

   While the diffserv architecture [Blake98] is highly scalable, the
   services it can provide have lower flexibility, utilization, or
   assurance levels than services provided by architectures that
   employ per-flow management at every node.  It is debatable whether
   this should be of significant concern.  For example, the low
   utilization by premium traffic may be acceptable if the majority of
   traffic will be best effort, either because the best-effort service
   is "good enough" for the majority of applications, or because the
   price difference between premium and best-effort traffic is too
   high to justify the performance difference between them.
   Alternatively, if the guaranteed nature of service assurance is not
   needed, i.e., if statistical service assurance is sufficient for
   premium service, we can achieve higher network utilization.
   Providing meaningful statistical service is still an open research
   problem.  A discussion of these topics is beyond the scope of this
   document.  Furthermore, there is usually a tradeoff between the
   complexity of the QoS mechanism and the efficiency of resource
   usage.  While intserv-style guaranteed service can achieve high
   resource utilization, premium service needs a much simpler
   mechanism.

   This document proposes a new set of PHBs based on Dynamic Packet
   State (DPS).
   These PHBs have relatively low complexity (they do not require
   per-flow state), but can be used to implement distributed
   algorithms that provide services with flexibility, utilization, and
   assurance levels similar to those that can be achieved with
   per-flow mechanisms.  DS domains implemented with this type of PHB
   should interoperate with DS domains implemented with other PHBs.

   With Dynamic Packet State, each packet carries in its header, in
   addition to the PHB codepoint, some PHB-specific state.  The state
   is initialized by an ingress node when the packet enters the DS
   domain.  Interior nodes process each incoming packet based on the
   state carried in the packet's header, updating both their internal
   state and the state in the packet's header before forwarding it.
   By using DPS to coordinate actions of edge and interior nodes along
   the path traversed by a flow, distributed algorithms can be
   designed to approximate the behavior of a broad class of "stateful"
   networks.  Consequently, introducing PHBs based on DPS will
   significantly increase the flexibility and capabilities of the
   services that can be built within the diffserv architecture.  In
   particular, we will show that a variety of QoS services can be
   implemented by PHBs based on DPS, including a weighted fair share
   service and a distributed admission control service.

   In this document, we use "flow" to refer to a subset of packets
   that traverse the same path inside a DS domain between two edge
   nodes.  Thus, at the highest level of traffic aggregation, a flow
   consists of all packets between the same pair of ingress and egress
   nodes.

   This document is organized as follows.  Section 2 gives a general
   description of PHBs based on DPS.  Section 3 presents several
   services that can be implemented with PHBs based on DPS.  Section 4
   discusses alternative techniques for storing state in the packet's
   header.
   Section 5 briefly discusses some issues related to the
   specification of DPS PHBs, such as codepoints, tunneling behavior,
   and interaction with other PHBs.  Section 6 discusses security
   issues.

2. Description of Per-Hop Behaviors Based on Dynamic Packet State

   Unlike common PHB codepoints [Blake98, Heinanen99, Jacobson98], a
   PHB codepoint based on DPS has extra state associated with it.
   This state is initialized by ingress nodes and carried by packets
   inside the DS domain.  The state's semantics are PHB dependent.
   The values taken by the state can be flow, path, or packet
   dependent.  The state carried by packets can be used by interior
   nodes for a variety of purposes, such as packet scheduling,
   updating the local node's state, or extending the codepoint space.

   When a PHB based on DPS is defined, in addition to the guidelines
   given in [Blake98], the following items must be specified:

   o  state semantics - the meaning of the information carried by the
      packets

   o  state placement - where the information is stored in the
      packet's header

   o  encoding format - how the information is encoded in packets

   For example, consider a PHB that implements the Stateless
   Prioritized Fair Queue Sharing algorithm, which is described in
   Section 3.1.  In this case, the state carried by a packet is based
   on an estimate of the current rate of the flow to which the packet
   belongs.  The state can be placed either (a) between the layer two
   and layer three headers, (b) in an IP option, or (c) in the IP
   header (see Section 4).  Finally, the rate can be encoded using a
   floating-point-like format, as described in Section 4.2.1.
   In addition, the following requirement, called the transparency
   requirement, must be satisfied:

   o  All changes performed at ingress nodes or within the DS domain
      on packets' headers (possibly for the purpose of carrying the
      state) must be undone by egress nodes.

   Any document defining a PHB based on DPS must specify a default
   placement of the state in the packet header and a default bit
   encoding format.  However, to increase flexibility, it is
   acceptable for documents to define alternate state placements and
   encoding formats.  Any router that claims to be compatible with a
   particular PHB based on DPS must support at least the default
   placement and the default bit encoding format.

3. Examples of Services that can be Implemented by PHBs Based on DPS

   To illustrate the power and the flexibility of PHBs based on DPS,
   we give a few examples.  In the first, we address the congestion
   control problem by approximating the functionality of a "reference"
   network in which each node performs fair queuing.

3.1. Stateless Prioritized Fair Queue-sharing (SPFQ)

   We first explain SPFQ using an idealized fluid model and then
   present its packetized version.

3.1.1 Fluid Model Algorithm: Bit Labeling

   We first restate the key observation in CSFQ [Stoica98] that we
   will also use: in a router implementing fair queuing, all flows
   that are bottlenecked at the router have the same output rate.  We
   call this rate the rate threshold, r_t(t).

   Let C be the capacity of an output link at a router, and r_i(t) the
   arrival rate of flow i in bits per second.  Let A(t) denote the
   total arrival rate of n flows: A(t) = r_1(t) + r_2(t) + ... +
   r_n(t).  If A(t) <= C, then no bits are dropped.  But if A(t) > C,
   then r_t(t) is the unique solution to

      C = min(r_1(t), r_t(t)) + min(r_2(t), r_t(t)) + ... +
          min(r_n(t), r_t(t)).
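As a concrete illustration of this equation (not part of the PHB specification), the threshold r_t can be computed numerically; the following C sketch, with names of our choosing, finds it by binary search, assuming the link is congested (A(t) > C):

```c
#include <assert.h>

/* Illustrative sketch: compute the rate threshold r_t such that
 * sum_i min(r_i, r_t) = C, assuming the link is congested
 * (sum of r_i > C).  Binary search over [0, C]. */
static double rate_threshold(const double *r, int n, double C) {
    double lo = 0.0, hi = C;     /* r_t can never exceed the capacity */
    for (int iter = 0; iter < 60; iter++) {
        double mid = (lo + hi) / 2.0;
        double served = 0.0;
        for (int i = 0; i < n; i++)
            served += (r[i] < mid) ? r[i] : mid;
        if (served < C)
            lo = mid;            /* threshold too low: link underused */
        else
            hi = mid;
    }
    return (lo + hi) / 2.0;
}
```

For example, for three flows with rates 2, 4, and 10 on a link of capacity 10, the search converges to r_t = 4: the two slower flows keep their rates and the bottlenecked flow is limited to the remaining capacity.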
   In general, if max-min bandwidth allocations are achieved, each
   flow i receives service at a rate given by min(r_i(t), r_t(t)).

   For every flow, in any given second, we consider up to r_t(t) bits
   of the flow as being conforming, and all bits in excess of that as
   being non-conforming.  If we mark every bit of a flow as being
   conforming or non-conforming, we can obtain the allocation provided
   by a fair queuing router by simply having routers accept all
   conforming bits of a flow and drop all non-conforming bits.

   What we need now is a labeling algorithm at the ingress node that
   would enable a router to distinguish between conforming and
   non-conforming traffic.  Consider the simple sequential labeling
   algorithm:

      label(bit)
        served += 1
        bit->label = served

   where the value of 'served' is reset to 0 after every second.

   Let us suppose that the rate at which each flow is sending bits is
   constant.  The result of this algorithm is that during any given
   second, the bits from a flow sending at rate r_i bits per second
   are labeled sequentially from 1 to r_i, and the label is reset to 0
   at the end of each second.  Then, for any given flow, accepting
   bits with labels 1 to 'r_a' would be equivalent to providing the
   flow with a rate of 'r_a' bits per second.  So, if all bits carry
   such a label, a router can simply identify non-conforming bits as
   those with label > r_t and drop them.  Consequently, no flow can
   receive service in excess of r_t.  Furthermore, as all bits with
   label <= r_t are accepted, flows sending at a rate less than or
   equal to r_t will not have any of their bits dropped.

   As described in [Venkitar02], a key advantage of such a labeling
   procedure is that it allows us to convey rate information as well
   as intra-flow priority using the same field in the packet header.
3.1.2 Packet Labeling

   In order to extend the sequential labeling algorithm given for the
   fluid model to a packetized model, we essentially need to take
   variable packet sizes into account.  Hence, instead of incrementing
   the counter 'served' by 1 (which is the size of any packet in a
   network with purely single-bit packets), we increment the value of
   'served' by the size of the packet.  Given below is pseudocode for
   the packetized version of the sequential labeling algorithm:

      label(pkt)
        served += pkt->size
        pkt->label = served

   where the value of 'served' is reset to 0 at the end of each
   fixed-size epoch.

   All ingress routers in a DS domain must use epochs of equal
   duration.  The size of the epoch is a design parameter that should
   be chosen to reflect a tradeoff between mimicking the fluid model
   accurately and not giving an unfair advantage to flows that arrived
   most recently in the system.  To understand this tradeoff, suppose
   that the epoch is chosen to be very long and that a new flow
   arrives in the middle of an epoch.  Then the bits from the new flow
   would be labeled starting from a value of one and would have a
   higher priority throughout the rest of the epoch than the bits of
   flows that have been sending from the beginning of the epoch.

3.1.3 Forwarding Decision in a Router

   [Stoica98], [Cao00], [Barnes01] and [Venkitar02] discuss algorithms
   for updating the rate threshold (r_t) based on link state, i.e.,
   based on parameters such as queue length and the aggregate accepted
   rate of packets.  The forwarding decision in a router is then made
   based on the following algorithm:

      enque(pkt)
        if (pkt->label <= r_t)
          Accept(pkt)
        else
          Drop(pkt)

3.1.4 Additional Services based on SPFQ

   We now present examples of labeling methods at the edge that can
   provide different services while retaining the same forwarding
   behavior.
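For concreteness, the ingress labeling of Section 3.1.2 and the forwarding test of Section 3.1.3 can be sketched together in C; the struct names and the simplified epoch and r_t handling are ours, not part of any PHB specification:

```c
#include <assert.h>

/* Illustrative sketch of edge labeling and interior forwarding.
 * Epoch resets and r_t estimation are omitted for brevity. */

struct flow { long served; };             /* per-flow state, edge only */
struct pkt  { long size; long label; };

/* ingress: stamp the packet with the flow's cumulative byte count;
 * 'served' would be reset to 0 at the end of each epoch */
static void label_pkt(struct flow *f, struct pkt *p) {
    f->served += p->size;
    p->label = f->served;
}

/* interior: accept iff the label does not exceed the threshold r_t
 * (r_t itself is updated by a separate algorithm, e.g. [Stoica98]) */
static int enque(const struct pkt *p, long r_t) {
    return p->label <= r_t;               /* 1 = accept, 0 = drop */
}
```

With r_t = 200 bytes per epoch, for instance, the first two 100-byte packets of a flow are accepted and the third is dropped, limiting the flow to its fair share without any per-flow state at the interior node.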
3.1.4.1 Weighted SPFQ

   The SPFQ algorithm can be readily extended to support flows with
   different weights.  Let w_i be the weight of flow i.  An allocation
   is weighted fair if all bottlenecked flows have the same value of
   r_i/w_i.  The only major change required to achieve weighted fair
   allocation is in the ingress labeling algorithm, where we need to
   use 'served/w_i' instead of 'served'.  This enables per-flow
   service differentiation without maintaining per-flow state in the
   core nodes of the network.

3.1.4.2 Minimum bandwidth allocation

   From the forwarding algorithm given in Section 3.1.3, it is clear
   that packets marked with lower label values are dropped only after
   all packets with larger labels have been dropped.  This suggests
   that packets marked with the smallest label (0) will not be dropped
   as long as the aggregate rate of such packets does not exceed the
   link capacity.  So, for a flow requiring a minimum bandwidth
   allocation of 'b_min', labeling packets with the smallest label at
   a rate of 'b_min' would ensure that the flow receives the rate it
   has been guaranteed within a reasonable time window (assuming that
   there is no packet loss due to channel errors).  An admission
   control mechanism should be used to ensure that the aggregate
   reserved rate does not exceed the capacity of the link.  A
   distributed admission control mechanism, such as the one proposed
   in Section 3.2, can be used for this purpose.

3.2 Distributed Admission Control

   The previous examples focused on data path mechanisms and services.
   In this section, we show that PHBs based on DPS can also implement
   control plane services such as distributed admission control.

   Admission control is a central component in providing
   quantitatively defined QoS services.  The main job of the admission
   control test is to ensure that network resources are not
   over-committed.
   In particular, it must ensure that the sum of the reservation rates
   of all flows that traverse any link in the network is no larger
   than the link capacity C.

   A new reservation request is granted if it passes the admission
   test at each hop along its path.  There are two main approaches to
   implementing admission control.  Traditional reservation-based
   networks adopt a distributed architecture in which each node is
   responsible for its local resources, and nodes are assumed to
   maintain per-flow state.  To support the dynamic creation and
   deletion of fine-grained flows, a large quantity of dynamic
   per-flow state needs to be maintained in a distributed fashion.
   Complex signaling protocols (e.g., RSVP and ATM UNI) are used to
   maintain the consistency of this per-flow state.

   A second approach is to use a centralized bandwidth broker that
   maintains the topology as well as the state of all nodes in the
   network.  In this case, admission control can be implemented by the
   broker, eliminating the need to maintain distributed state.  Such a
   centralized approach is more appropriate for an environment where
   most flows are long-lived, and set-up and tear-down events are
   rare.  However, to support fine-grained and dynamic flows, there
   may be a need for a distributed broker architecture, in which the
   broker database is replicated or partitioned.  Such an architecture
   eliminates the need for a signaling protocol, but requires another
   protocol to maintain the consistency of the different broker
   databases.  Unfortunately, since it is impossible to achieve
   perfect consistency, this may create race conditions and/or
   resource fragmentation.
   A third approach is to use a simplified provisioning model that is
   not aware of the details of the network topology, but instead
   admits a new flow if there is sufficient bandwidth available for
   the flow's packets to travel anywhere in the network with adequate
   QoS.  This simplified model may be based on static provisioning and
   service level agreements, or on a simple dynamic bandwidth broker.
   In any case, the tradeoff made in return for the simplicity is that
   the admission control decision must be more conservative, and a
   much smaller level of QoS-controlled service can be supported.

   In the following, we show that by using a PHB based on DPS, it is
   possible to implement distributed admission control for guaranteed
   services in a DS domain.  In our scheme, for each reservation, the
   ingress node generates a request message that is forwarded along
   the path.  In turn, each interior node decides whether or not to
   accept the request.  When the message reaches the egress node, it
   is returned to the ingress, which makes the final decision.  We do
   not make any reliability assumptions about the request messages.
   In addition, the algorithm does not require reservation termination
   messages.  In the following, we describe the per-hop admission
   control.  [StoZha99] describes how this scheme can be integrated
   with RSVP to provide end-to-end delay bounded services.

3.2.1. Per-Hop Admission Control

   The solution is based on two main ideas.  First, we always maintain
   a conservative upper bound on the aggregate reservation R, denoted
   R_bound, which we use for making admission control decisions.
   R_bound is updated with a simple rule:

      R_bound = R_bound + r

   whenever a request for a rate r is received and R_bound + r <= C.
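A minimal C sketch of this update rule, assuming a single link of capacity C in bits per second; 'link' and 'admit' are illustrative names, and the re-calibration of R_bound via the estimation algorithm of Section 3.2.2 is omitted:

```c
#include <assert.h>

/* Sketch of the per-hop admission test.  R_bound conservatively
 * tracks the aggregate reservation; re-calibration is omitted. */
struct link {
    double C;        /* link capacity (bits/s) */
    double R_bound;  /* upper bound on the aggregate reservation */
};

/* returns 1 if a reservation request for rate r is accepted */
static int admit(struct link *l, double r) {
    if (l->R_bound + r <= l->C) {
        l->R_bound += r;   /* R_bound = R_bound + r */
        return 1;
    }
    return 0;
}
```

Note that duplicate requests simply inflate R_bound; as discussed below, this preserves the invariant that R_bound is an upper bound, at the cost of a divergence that the re-calibration step later corrects.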
   It should be noted that, in order to maintain the invariant that
   R_bound is an upper bound on R, this algorithm does not need to
   detect duplicate request messages, generated either by
   retransmission in case of packet loss or by retry in case of
   partial reservation failures.  Of course, the obvious problem with
   the algorithm is that R_bound will diverge from R.  In the limit,
   when R_bound reaches the link capacity C, no new requests can be
   accepted even though there might be available capacity.

   To address this problem, we introduce a separate algorithm, based
   on DPS, that periodically estimates the aggregate reserved rate.
   Based on this estimate we compute a second upper bound for R, and
   then use it to re-calibrate the upper bound R_bound.  An important
   aspect of the estimation algorithm is that it can actually be shown
   that the discrepancy between the upper bound and the actual
   reserved rate R is bounded.  The re-calibration process then
   reduces to choosing the minimum of the two upper bounds.

3.2.2. Estimation Algorithm for the Aggregate Reservation

   To estimate the aggregate reservation, denoted R_est, we again use
   DPS.  In this case, the state of each packet consists of the number
   of bits the flow is entitled to send during the interval between
   the time when the previous packet was transmitted and the time when
   the current packet is transmitted.  Note that, unlike in the
   previous examples, the state carried by the packet does not affect
   the packet's processing by interior nodes.  This state is used
   solely to compute each node's aggregate reservation.

   The estimation performed by interior nodes is based on the
   following simple observation: the sum of the state values of the
   packets of all flows received during an interval (a, a + T_W] is a
   good approximation of the total number of bits that all flows are
   entitled to send during this interval.
   Dividing this sum by T_W, we obtain the aggregate reservation rate.
   This is basically the rate estimation algorithm, though we need to
   account for several estimation errors.  In particular, we need to
   account for the fact that not all reservations continue for the
   entire duration of the interval (a, a + T_W].

   We divide time into intervals of length T_W.  Let (u1, u2] be such
   an interval, where u2 = u1 + T_W.  Let T_I be the maximum
   inter-departure time between two consecutive packets of the same
   flow and T_J be the maximum jitter of a flow, both of which are
   much smaller than T_W.  Further, each interior node maintains a
   global variable Ra, which is initialized to zero at the beginning
   of each interval (u1, u2], and is updated to Ra + r every time a
   request for a reservation r is received and the admission test is
   passed, i.e., R_bound + r <= C.

   Let Ra(t) denote the value of this variable at time t.  Since
   interior nodes do not differentiate between original and duplicate
   requests, Ra(t) is an upper bound on the sum of all reservations
   accepted during the interval (u1, t].  (For simplicity, here we
   assume that a flow which is granted a reservation during the
   interval (u1, u2] becomes active no later than u2.)  Then, it can
   be shown that the aggregate reservation at time u2, R(u2), is
   bounded by

      R(u2) < R_est(u2)/(1 - f) + Ra(u2),                          (7)

   where f = (T_I + T_J)/T_W.  Finally, this bound is used to
   re-calibrate the upper bound of the aggregate reservation, R_bound,
   as follows:

      R_bound(u2) = min(R_bound(u2), R_est(u2)/(1 - f) + Ra(u2)).  (8)

   Figure 1 shows the pseudocode of the admission decision and of the
   aggregate reservation estimation algorithm at ingress and interior
   nodes.  We make several observations.  First, the estimation
   algorithm uses only the information in the current interval.
   This makes the algorithm robust with respect to the loss and
   duplication of signaling packets, since their effects are
   "forgotten" after one time interval.  As an example, if a node
   processes both the original and a duplicate of the same reservation
   request during the interval (u1, u2], R_bound will be updated twice
   for the same flow.  However, this erroneous update will not be
   reflected in the computation of R_est(u3), since that computation
   is based only on the state values received during (u2, u3].  As a
   consequence, it can be shown that the admission control algorithm
   can asymptotically reach a link utilization of C (1 - f)/(1 + f)
   [StoZha99].

   A possible optimization of the admission control algorithm is to
   add reservation termination messages.  This would reduce the
   discrepancy between the upper bound R_bound and the aggregate
   reservation R.  However, in order to guarantee that R_bound remains
   an upper bound for R, we need to ensure that a termination message
   is sent at most once, i.e., that there are no retransmissions if
   the message is lost.  In practice, this property can be enforced by
   edge nodes, which maintain per-flow state.

   To ensure that the maximum inter-departure time is no larger than
   T_I, the ingress node may need to send a dummy packet when no data
   packet arrives for a flow during an interval T_I.  This can be
   achieved by having the ingress node maintain a timer for each flow.
   An optimization would be to aggregate all "micro-flows" between
   each pair of ingress and egress nodes into one flow, compute the
   state value based on the aggregate reservation rate, and insert a
   dummy packet only if there is no data packet for the aggregate flow
   during an interval.

   Figure 1 - Admission control and rate estimation algorithm.
   -----------------------------------------------------------------
   Ingress node
   -----------------------------------------------------------------
   on packet p departure:
     i = get_flow(p);
     state(p) <- r[i] * (crt_time - prev_time[i]);
     prev_time[i] = crt_time;
   -----------------------------------------------------------------
   Interior node
   -----------------------------------------------------------------
   Reservation Estimation          | Admission Control
   -----------------------------------------------------------------
   on packet p arrival:            | on reservation request r:
     b <- state(p);                |   /* admission ctrl. test */
     L = L + b;                    |   if (R_bound + r <= C)
                                   |     Ra = Ra + r;
   on time-out T_W:                |     R_bound = R_bound + r;
     /* estimated reservation */   |     accept(r);
     R_est = L / T_W;              |   else
     R_bound = min(R_bound,        |     deny(r);
       R_est/(1 - f) + Ra);        |
     Ra = 0; L = 0;                |
   -----------------------------------------------------------------

4. Carrying State in Packets

   There are at least three ways to encode state in the packet header:
   (1) introduce a new IP option, and insert the option at the ingress
   router, (2) introduce a new header between layer 2 and layer 3, or
   (3) use the existing IP header.

   Option number 23 has been assigned for adding DPS state in packets.
   Inserting an IP option has the potential to provide a large space
   for encoding state.  However, it would require all routers within a
   DS domain to process IP options, which could complicate packet
   processing.

   Introducing a new header between layer 2 and layer 3 would require
   solutions to be devised for different layer 2 technologies.  In the
   context of MPLS [Rosen98, Rosen99] networks, the state can be
   encoded in a special label.
   One way to do this is to use a particular encoding of the
   experimental use field to indicate that a nested label on the label
   stack carries the PHB-specific state information rather than an
   ordinary label.  In this case, the label on the top of the stack
   would indicate the label-switched path, and the inner label the
   PHB-specific state.  This would require a small addition to the
   MPLS architecture to allow two labels to be pushed or popped in
   unison and treated as a single entity on the label stack.

4.1. Encoding State within an IP header

   In this section, we discuss the third option: storing the
   additional state in the IP header.  The biggest problem with using
   the IP header is finding enough space to insert the extra
   information.  The main challenge is to remain fully compatible with
   current standards and protocols.  In particular, we want the
   network domain to be transparent to end-to-end protocols, i.e., the
   egress node should restore the fields changed by ingress and
   interior nodes to their original values.  In this respect, we
   observe that the ip_off (fragment offset) field in the IPv4 header,
   which supports packet fragmentation and reassembly, is rarely used.
   For example, by analyzing traces of over 1.7 million packets on an
   OC-3 link [nlanr], we found that less than 0.22% of all packets
   were fragments.  In addition, there are a relatively small number
   of distinct fragment sizes.  Therefore, it is possible to use a
   fraction of the ip_off field to encode the fragment sizes, and the
   remaining bits to encode DPS information.  The idea can be
   implemented as follows.  When a packet arrives at an ingress node,
   the node checks whether the packet is a fragment or needs to be
   fragmented.  If neither is true, all 13 bits of the ip_off field in
   the packet header are used to encode DPS values.
   If the packet is a fragment, the fragment size is recoded into a
   more efficient representation and the rest of the bits are used to
   encode the DPS information.  The fragment size field will be
   restored at the egress node.

   In the above, we make the implicit assumption that a packet can be
   fragmented only by ingress nodes, and not by interior nodes.  This
   is consistent with the diffserv view that the forwarding behavior
   of a network's components is engineered to be compatible throughout
   a domain.

   In summary, this gives us up to 13 bits in the current IPv4 header
   to encode the PHB-specific state.

4.2. Example of State Encoding

   The state encoding is PHB dependent.  In this section, we give
   examples of encoding the state for the services described in
   Section 3.

4.2.1. Encoding a Flow's Rate

   Recall that in SPFQ, the PHB state is determined by the current
   rate of the flow to which the packet belongs.  One possible way to
   represent the rate estimate is to restrict it to only a small
   number of possible values.  For example, if we limit it to 128
   values, only seven bits are needed to represent this rate.  While
   this can be a reasonable solution in practice, in the following we
   propose a more sophisticated representation that allows us to
   express a larger range of values.

   Let r denote the packet label.  In the most general case r could be
   a floating point number.  To represent r we use an m bit mantissa
   and an n bit exponent.  Since r >= 0, it is possible to gain an
   extra bit for the mantissa.  For this we consider two cases:
   (a) if r >= 2^m, we represent r as the closest value of the form
   u * 2^v, where 2^m <= u < 2^(m+1).  Then, since the (m+1)-th most
   significant bit in u's representation is always 1, we can ignore
   it.  As an example, assume m = 3, n = 4, and r = 19 = 10011.  Then
   19 is represented as 18 = u * 2^v, where u = 9 = 1001 and v = 1.
   By ignoring the first bit in the representation of u, the mantissa
   will store 001, while the exponent will be 1.
   (b) On the other hand, if r < 2^m, the mantissa will contain r,
   while the exponent will be 2^n - 1.  For example, for m = 3, n = 4,
   and r = 6 = 110, the mantissa is 110, while the exponent is 1111.
   Converting from one format to another can be efficiently
   implemented.  Figure 2 shows the conversion code in C.  For
   simplicity, here we assume that integers are truncated rather than
   rounded when represented in floating point.

   Figure 2. The C code for converting between integer and floating
             point formats.  m represents the number of bits used by
             the mantissa; n represents the number of bits in the
             exponent.

   -----------------------------------------------------------------
   void intToFP(int val, int *mantissa, int *exponent) {
       int nbits = get_num_bits(val);  /* significant bits in val */
       if (nbits <= m) {
           *mantissa = val;
           *exponent = (1 << n) - 1;   /* all-ones escape code */
       } else {
           *exponent = nbits - m - 1;
           *mantissa = (val >> *exponent) - (1 << m);
       }
   }

   int FPToInt(int mantissa, int exponent) {
       int tmp;
       if (exponent == ((1 << n) - 1))
           return mantissa;
       tmp = mantissa | (1 << m);
       return (tmp << exponent);
   }
   -----------------------------------------------------------------

   By using m bits for the mantissa and n for the exponent, we can
   represent any integer in the range [0..(2^(m+1)-1) * 2^(2^n - 2)]
   (the all-ones exponent being reserved as the escape code for small
   values) with a relative error bounded by (-1/2^(m+1), 1/2^(m+1)).
   For example, with 7 bits, by allocating 3 for the mantissa and 4
   for the exponent, we can represent any integer in the range
   [1..15*2^14] with a relative error of (-6.25%, 6.25%).  The worst
   relative error occurs when the mantissa is 8.  For example, the
   number r = 271 = 100001111 is encoded as u = 1000, v = 5, with a
   relative error of (8*2^5 - 271)/271 = -0.0554 = -5.54%.
   Similarly, r = 273 = 100010001 is encoded as u = 1001, v = 5, with
   a relative error of (9*2^5 - 273)/273 = 0.0549 = 5.49%.

4.2.2. Encoding Reservation State

   As shown in Figure 1, when estimating the aggregate reservation,
   the PHB state represents the number of bits that a flow is entitled
   to send during the interval from the time when the previous packet
   of the flow was transmitted until the current packet is
   transmitted.  This number can be simply encoded as an integer b.
   To reduce the range, a possibility is to store b/l instead of b,
   where l is the length of the packet.

4.3. Encoding Multiple Values

   Since the space in the packet's header is a scarce resource,
   encoding multiple values is particularly challenging.  In this
   section we discuss two general methods that help to alleviate this
   difficulty.

   In the first method, the idea is to leverage additional knowledge
   about the state semantics to achieve an efficient encoding.  In
   particular, one value can be stored as a function of other values.
   For example, if a value is known to be always greater than the
   others, the larger value can be represented in floating point
   format, while the other values may be represented as fractions of
   this value.

   The idea of the second method is to have different packets within a
   flow carry different state formats.  This method is appropriate for
   PHBs that do not require all packets of a flow to carry the same
   state.  For example, in estimating the aggregate reservation (see
   Section 3.2) there is no need for every packet to carry the number
   of bits the flow is entitled to send between the current time and
   the time when the previous packet was transmitted.  The only
   requirement is that the distance between any two consecutive
   packets that carry such values be no larger than T_I.  Other
   packets in between can carry different information.
   Similarly, if we encode the IP fragment size in the packet's state,
   the packet has to carry this value only if it is actually a
   fragment; otherwise the packet can carry other state instead.  On
   the other hand, note that in SPFQ it is mandatory that every packet
   be labelled by the ingress edge, as this value is used in making
   forward/drop decisions by interior routers.

5. Specification Issues

   This section briefly describes some issues related to drafting
   specifications for PHBs based on DPS.

5.1. Recommended Codepoints

   At this time it is appropriate to use values drawn from the 16
   codepoints [Nichols98] reserved for local and experimental use
   (xxxx11) to encode PHBs based on DPS.

5.2. Interaction with Other PHBs

   The interaction of DPS PHBs with other PHBs obviously depends on
   the PHB semantics.  It should be noted that the presence of other
   PHBs in a node may affect the computation and update of DPS state,
   as well as the actual forwarding behavior experienced by the
   packet.

5.3. Tunneling

   When packets with PHBs based on DPS are tunneled, the end nodes
   must make sure that (1) the tunnel is marked with a PHB that does
   not violate the original PHB semantics, and (2) the PHB-specific
   state is correctly updated at the end of the tunnel.  This
   requirement might be met by using a tunnel PHB that records and
   updates packet state, and then copying the state from the
   encapsulating packet to the inner packet at the tunnel endpoint.
   Alternatively, the behavior of the tunnel might be measured or
   precomputed in a way that allows the encapsulated packet's DPS
   state to be updated at the decapsulation point without requiring
   the tunnel to support DPS behavior.

6. Security Considerations

   The space allocated for the PHB state in the packet header must be
   compatible with IPsec.
   In this context we note that using the fragment offset to carry PHB
   state does not affect IPsec's end-to-end security, since the
   fragment offset is not used for cryptographic calculations
   [Kent98].  Thus, as is the case with the DS field [Nichols98],
   IPsec does not provide any defense against malicious modifications
   of the PHB state.  This leaves the door open for theft of service,
   which in turn may cause denial of service to other conforming
   users.  For example, in SPFQ, a label based on a small rate
   estimate may cause a disproportionate amount of bandwidth to be
   allocated to the flow inside the DS domain.  In the example in
   Section 3.2.2, underestimation of the aggregate reservation can
   lead to resource overprovisioning.  One way to expose denial of
   service attacks is by auditing.  In this context, we note that
   associating state with PHBs makes it easier to perform efficient
   auditing at interior nodes.  For example, in SPFQ, such an attack
   can be detected by simply measuring a flow's rate and then
   comparing it against the label carried by the flow's packets.

   Security considerations covered in [Blake98] that correspond to
   diffserv code points also apply to PHB code points for DPS.

7. Conclusions

   In this document we have proposed an extension of the diffserv
   architecture by defining a new set of PHBs that are based on
   Dynamic Packet State.  By using DPS to coordinate actions of edge
   and interior nodes along the path traversed by a flow, distributed
   algorithms can be designed to approximate the behavior of a broad
   class of "stateful" networks within the diffserv architecture.
   Such an extension will significantly increase the flexibility and
   capabilities of the services that can be provided by diffserv.

8. References

   [Barnes01]   R. Barnes, R. Srikant, J. Mysore, and N.
                Venkitaraman.  Analysis of Stateless Fair Queuing
                Algorithms.  Proc.
                of the 35th Annual Conference on Information Sciences
                and Systems, March 2001.

   [Blake98]    S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang,
                and W. Weiss.  An Architecture for Differentiated
                Services.  RFC 2475, December 1998.

   [Cao00]      Z. Cao, Z. Wang, and E. Zegura.  Rainbow Fair
                Queueing: Fair Bandwidth Sharing Without Per-Flow
                State.  Proc. of INFOCOM 2000.

   [Heinanen99] J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski.
                Assured Forwarding PHB Group.  RFC 2597, June 1999.

   [Jacobson98] V. Jacobson, K. Poduri, and K. Nichols.  An Expedited
                Forwarding PHB.  RFC 2598, June 1999.

   [Kent98]     S. Kent and R. Atkinson.  IP Authentication Header.
                RFC 2402, November 1998.

   [Nichols98]  K. Nichols, S. Blake, F. Baker, and D. L. Black.
                Definition of the Differentiated Services Field (DS
                Field) in the IPv4 and IPv6 Headers.  RFC 2474,
                December 1998.

   [Stoica98]   I. Stoica, S. Shenker, and H. Zhang.  Core-Stateless
                Fair Queueing: Achieving Approximately Fair Bandwidth
                Allocations in High Speed Networks.  In Proceedings
                of ACM SIGCOMM'98, pages 118-130, Vancouver,
                September 1998.

   [StoZha99]   I. Stoica and H. Zhang.  Providing Guaranteed
                Services Without Per-flow Management.  In Proceedings
                of ACM SIGCOMM'99, Boston, September 1999.

   [Venkitar02] N. Venkitaraman, J. Mysore, and M. Needham.  Core-
                Stateless Utility Function Based Rate Allocation.  In
                Proceedings of PfHSN 2002, Berlin, April 2002.

9. Authors' Addresses

   Ion Stoica                          Hui Zhang
   645 Soda Hall                       Wean Hall 7115
   Computer Science Division           School of Computer Science
   University of California, Berkeley  Carnegie Mellon University
   Berkeley, CA 94720                  Pittsburgh, PA 15213
   istoica@cs.berkeley.edu             hzhang@cs.cmu.edu

   Narayanan Venkitaraman              Jayanth Mysore
   Motorola Labs                       Motorola Labs
   1301 E. Algonquin Rd.               1301 E. Algonquin Rd.
   Schaumburg, IL 60196                Schaumburg, IL 60196
   venkitar@labs.mot.com               jayanth@labs.mot.com

10.
Full Copyright Statement

   Copyright (C) The Internet Society (2002).  All Rights Reserved.

   This document and translations of it may be copied and furnished
   to others, and derivative works that comment on or otherwise
   explain it or assist in its implementation may be prepared, copied,
   published and distributed, in whole or in part, without restriction
   of any kind, provided that the above copyright notice and this
   paragraph are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards
   process must be followed, or as required to translate it into
   languages other than English.

   The limited permissions granted above are perpetual and will not
   be revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on
   an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
   IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
   THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.