Internet Engineering Task Force                 D. Clark / J. Wroclawski
INTERNET-DRAFT                                                   MIT LCS
draft-clark-diff-svc-alloc-00.txt                             July, 1997
Expires: 12/97

          An Approach to Service Allocation in the Internet

Abstract

This note describes the Service Allocation Profile scheme for differential service allocation within the Internet. The scheme is based on a simple packet drop preference mechanism at interior nodes, and highly flexible service profiles at edges and inter-provider boundary points within the net. The service profiles capture a wide variety of user requirements and expectations, and allow different users to receive different levels of service from the network in a scalable and efficient manner.

The note describes the basic form of the mechanism, discusses the range of services that users and providers can obtain by employing it, and gives a more detailed presentation of particular technical, deployment, standardization, and economic issues related to its use.

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

NOTE: This draft is a snapshot of a document in progress, and was somewhat arbitrarily cast into its current form at the Internet-Draft submission deadline for the Munich IETF. The authors apologize in advance for a certain raggedness of presentation.

1. Introduction

This document describes a framework for providing what has been called differentiated service, or allocated capacity service, in the Internet. The goal of the mechanism is to allocate the bandwidth of the Internet to different users in a controlled way during periods of congestion. The mechanism applies equally to traditional applications based on TCP, such as file transfer, database access or Web servers, and to new sorts of applications such as real-time audio or video.

The mechanism we describe can provide users with a predictable expectation of what service the Internet will provide to them in times of congestion, and can allow different users to obtain different levels of service from the network. This contrasts with today's Internet, in which each user gets some unpredictable share of the capacity.

Our mechanism provides two additional things that are important to this task. First, it allows users and providers with a wide range of business and administrative models to make capacity allocation decisions.
In the public Internet, where commercial providers offer service for payment, the feedback will most often be different prices charged to customers with different requirements. This allows the providers to charge differential prices to users that attach greater value to their Internet access, and thus fund the deployment of additional resources to better serve them. But whether pricing or some other administrative control is used (as might apply in a corporate or military network), the same mechanism for allocating capacity can be used.

The mechanism also provides useful information to providers about provisioning requirements. With our mechanism in place, service providers can more easily allocate specific levels of "assured" capacity to customers, and can easily monitor their networks to detect when the actual service needs of their customers are not being met.

While this document does describe a mechanism, that is a small part of its goal. There are a number of mechanisms that might be proposed, and the issue is not just demonstrating which of them works (most do work in some fashion), but discussing what the problem to be solved actually is, and therefore which of the possible mechanisms best meets the needs of the Internet. This document is thus as much about what the problem actually is as it is about a preferred solution.

1.1 Separating results from mechanism

An essential aspect of this scheme is the range of services the user can obtain using the mechanism. The mechanism is obviously not useful if it does not meet a current need. Some initial requirements we see for services are that they must be useful, easy to understand, possible to measure (so the user can tell whether he is getting the service he contracted for), and easy to implement.

At the same time, we should try very hard not to embed a specific set of services into the core of the Internet.
As we gain experience in the marketplace, we may discover that our first speculations about what service the user actually wants are wrong. It should be possible to change the model, evolve it, and indeed to try different models at the same time to see which better meets the needs of the user and the market. So this scheme has two goals: defining and implementing a first set of services, but permitting these services to be changed without modifying the "insides" of the Internet, specifically the routers.

We will return later to the discussion of different sorts of services.

2. Outline of this document

This document is organized as follows:

Section 3 describes the basic mechanism, to give a general idea of how such a service can be implemented.

Section 4 discusses the services which might be desired. It proposes a first set of services that might be implemented, and discusses the range of services that can be built out of this mechanism.

Section 5 describes the location of service profiles in the network.

Section 6 describes details of the mechanism. These include our specific algorithm for the dropper, issues concerning rate control of TCP, and dealing with non-responsive flows.

Section 7 compares this mechanism with some alternatives.

Section 8 discusses deployment issues, incremental deployment, and what portions of the mechanism require standardization.

Section 9 discusses security issues.

3. The basic scheme

The general approach of this mechanism is to define a service profile for each user, and to design a mechanism in the router that favors traffic that is within those service profiles. The core of the idea is very simple -- monitor the traffic of each user as it enters the network, and tag packets as being "in" or "out" of their service profile.
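As a concrete illustration, the tagging step might be implemented with a simple token bucket. This is a hypothetical sketch with illustrative parameters; the draft does not prescribe a particular metering algorithm at this point.

```python
import time

class ProfileMeter:
    """Tag packets "in" or "out" of a token-bucket service profile.

    rate_bps and depth_bytes are illustrative parameters, not values
    taken from the draft.
    """
    def __init__(self, rate_bps, depth_bytes, now=None):
        self.rate = rate_bps / 8.0        # token fill rate, bytes/sec
        self.depth = depth_bytes          # bucket depth, bytes
        self.tokens = depth_bytes
        self.last = time.monotonic() if now is None else now

    def tag(self, pkt_len, now=None):
        """Return "in" if the packet fits the profile, else "out"."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if pkt_len <= self.tokens:
            self.tokens -= pkt_len
            return "in"
        return "out"
```

Note that a packet tagged "out" is still forwarded; the tag only lowers its standing at congested points downstream.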
Then at each router, if congestion occurs, preferentially drop traffic that is tagged as being "out".

Inside the network, at the routers, there is no separation of traffic from different users into different flows or queues. The packets of all the users are aggregated into one queue, just as they are today. Different users can have very different profiles, which will result in different users having different quantities of "in" packets in the service queue. A router can treat these packets as a single commingled pool. This attribute of the scheme makes it very easy to implement, in contrast to a scheme like current RSVP reservations, in which the packets must be explicitly classified at each node. We have more to say about this issue below.

To implement this scheme, the routers must be augmented to implement our dropping scheme, and a new function must be implemented to tag the traffic according to its service profile. This algorithm can be implemented as part of an existing network component (host, access device or router) or in a new component created for the purpose. Conceptually, we will refer to it as a distinct device called a "profile meter". We use the term "meter" rather than "tagger" because, as we will discuss below, the profile meter can actually take a more general set of actions.

The idea of a service profile can be applied at any point in the network where a customer-provider relationship exists. A profile may describe the needs of a specific user within a campus, the service purchased by a corporate customer from an ISP, or the traffic handling agreement between two international providers. We discuss the location of profiles further in Section 5.

The description above associates the profile with the traffic sender.
That is, the sender has a service profile, the traffic is tagged at the source according to that profile, and then dropped if necessary inside the network. In some circumstances, however, it is the receiving user that wishes to control the level of service. The web provides a simple example; a customer using his browser for business research may be much more interested in a predictable level of performance than the casual surfer. The key observation is that "value" in the network does not always flow in the same direction as the data packets.

Thus, for full generality, a "dual" mechanism is required that can be either "sender driven" or "receiver driven". Most of this document is written, for simplicity, in terms of a sender scheme, but we briefly describe the receiver version as well, and discuss the circumstances in which it is important. In evaluating alternative mechanisms, it is important to see whether both a sender and a receiver version can be built.

In later sections, we discuss the specifics of the profiling or tagging mechanism and the treatment of profiled packets within the network. First we turn to the question of the range of services the mechanism ought to support.

4. Range of services

As discussed above, there are two general issues concerning service models. First, we want to start out by implementing a simple set of services, which are useful and easy to understand. At the same time, we should not embed these services into the mechanism, but should build a general mechanism that allows us to change the services as our experience matures.

Our scheme provides this flexibility. To oversimplify, a service is defined by the profile meter, which implements the user's service profile. To change the service, it is necessary "only" to change the profile meter.
The routers in the interior of the network implement a single common mechanism which is used by the different meters to provide different services.

Three things must be considered when describing a service:

- what exactly is provided to the customer (an example might be "one megabit per second of bandwidth, continuously available")

- to where this service is provided (examples might be a specific destination, a group of destinations, all nodes on the local provider, or "everywhere")

- with what level of assurance the service is provided (or alternately, what level of performance uncertainty the user can tolerate)

These things are coupled; it is much easier to provide "a guaranteed one megabit per second" to a specific destination than to anywhere in the Internet.

4.1 A first service model

As a place to start, a particularly simple service might provide the equivalent of a dedicated link of some specified bandwidth from source to destination. (The virtue of this simple model has been clearly articulated by Van Jacobson.) This model is easy for the user to understand -- he can take his existing application, connect it across a physical link, and see how it performs. If he can make it work in that context, then this service will allow him to run that application over the Internet.

This model has been implemented in a number of network architectures, with different "enhancements". The CBR service of ATM is an example, as is (to some extent) the CIR mechanism of Frame Relay. However, there are some issues and limitations to this very simple model.

One very important limit to a pure virtual link model is that the user may not wish to purchase this virtual link full time. He may need it only some of the time, and in exchange would hope to obtain a lower cost.
A provider could meet this desire by offering a more expressive profile, say a committed bandwidth with some duty cycle, e.g. "3 Mb/s with a 5% duty cycle measured over 5 minutes". Or, the provider could offer the user a rebate based on observed (non)usage, or allow him to reserve the capacity dynamically on demand.

A second issue is whether the user can exceed the capacity of the virtual link when the network is unloaded. Today, the Internet allows its users to go faster under that circumstance. Continuing to capture that benefit may be important in user acceptance. The CIR of Frame Relay works this way, and it is an important aspect of its market appeal.

An equally important issue is that the user may not wish to set up different distinguished committed bandwidth flows to different destinations, but may prefer to have a more aggregated commitment. There are several drawbacks to making distinct bandwidth commitments between each source and destination. First, this may result in a large number of flow specifications. If the user is interested in 1000 network access points, he must specify one million source-destination pairs. Frame Relay has this specification problem. Second, the sum of the distinct commitments for any source (or destination) cannot exceed the physical capacity of the access link at that point, which may force each individual assurance to be rather small. Finally, the source-destination model implies that the user can determine his destinations in advance, and in some cases that he understands the network topology; two situations which are not universally true.
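Returning to the duty-cycle profile mentioned at the start of this subsection, one way such a profile could be metered is with a dual token bucket: one bucket enforcing the peak rate and one enforcing the long-term average. This is a hypothetical sketch with illustrative parameters; the draft does not specify a metering algorithm for this profile.

```python
class DualBucketMeter:
    """Meter a profile like "3 Mb/s with a 5% duty cycle over 5 minutes".

    A packet is "in" only if BOTH buckets have tokens: a shallow peak
    bucket (3 Mb/s) and a deep average bucket (0.05 * 3 Mb/s = 150 kb/s,
    sized to absorb a full-duty burst over the 5-minute window).
    All parameters here are illustrative.
    """
    def __init__(self, peak_bps, duty_cycle, window_s, mtu=1500):
        avg_rate = duty_cycle * peak_bps / 8.0          # bytes/sec
        self.buckets = [
            # [fill rate (bytes/sec), depth (bytes), current tokens]
            [peak_bps / 8.0, mtu, mtu],
            [avg_rate, duty_cycle * window_s * peak_bps / 8.0,
             duty_cycle * window_s * peak_bps / 8.0],
        ]
        self.last = 0.0

    def tag(self, pkt_len, now):
        elapsed = now - self.last
        self.last = now
        for b in self.buckets:
            b[2] = min(b[1], b[2] + elapsed * b[0])     # refill
        if all(pkt_len <= b[2] for b in self.buckets):
            for b in self.buckets:
                b[2] -= pkt_len
            return "in"
        return "out"
```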
In fact, several variations of service commitment might make sense to different users: from one source to a specific destination, from a source to a pool of specified destinations (one might configure a Virtual Private Network in this way), and finally from a source to "anywhere", which could mean all points on the ISP, on a collection of ISPs, or any reachable node.

The latter sorts of commitments are by their nature more difficult to offer with high assurance. There is no way to know for sure what the service will be to any specific destination, because that depends on what other traffic is leaving the source, and what other traffic is arriving at the destination. Offering commitments to "anywhere within the ISP" implies that the ISP has provisioned its resources adequately to support all in-profile users simultaneously to the same destination. Offering commitments to "anywhere at all" implies that all ISPs in any reachable path from the user have provisioned sufficiently, which is most unlikely.

4.2 Managing bursty traffic

Not all Internet traffic is continuous in its requirement for bandwidth. Indeed, based on measurements on the Internet, much of the traffic is very bursty. It may thus be that a service model based on a fixed capacity "virtual link" does not actually meet users' needs very well. Some other, more complex service profile that permits bursty traffic may be more suitable.

It is possible to support bursty traffic by changing the profile meter to implement this new sort of service. The key issue is to ensure, in the center of the network, that there is enough capacity to carry this bursty traffic, and thus actually meet the commitments implied by the outstanding profiles. This requires a more sophisticated provisioning strategy than the simple "add 'em up" needed for constant bit-rate virtual links.
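To illustrate the difference between the two provisioning strategies, the sketch below contrasts peak-rate provisioning with a simple Gaussian statistical-multiplexing estimate. The Gaussian model is an assumption for illustration only; the analyses the draft points to, such as the effective-bandwidth work in [Kelly97], are more refined.

```python
import math

def peak_provisioning(flows):
    """'Add 'em up': capacity to carry every flow at its peak rate."""
    return sum(peak for mean, peak in flows)

def statistical_provisioning(flows, z=3.0):
    """Gaussian approximation for many independent on/off bursty flows.

    Each flow is modeled as sending at its peak rate with probability
    mean/peak, else silent, so its rate variance is mean*(peak - mean).
    Capacity covers the aggregate mean plus z standard deviations
    (z = 3 leaves roughly a 0.1% chance of instantaneous overload).
    """
    total_mean = sum(mean for mean, peak in flows)
    variance = sum(mean * (peak - mean) for mean, peak in flows)
    return total_mean + z * math.sqrt(variance)

# 1000 bursty flows, each 150 kb/s average, 3 Mb/s peak (5% duty cycle):
flows = [(150e3, 3e6)] * 1000
```

For these illustrative numbers, peak provisioning requires 3 Gb/s while the statistical estimate is near 212 Mb/s, showing why aggregation in the center of the network makes bursty profiles cheap to support.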
A body of mathematics that is now maturing provides a way to relate the bursty behavior of a single flow to the resulting implications for the required overall bandwidth when a number of such flows are combined (see, for example, [Kelly97]). This sort of analysis can be employed as a way to predict the capacity that must be provided to support profiles describing bursty traffic. As a practical matter, in the center of the existing Internet, at the backbone routers of the major ISPs, there is such a high degree of traffic aggregation that the bursty nature of individual traffic flows is essentially invisible. So providing bursty service profiles to individual users will not create a substantial provisioning issue in the center of the network, while possibly adding significant value to the service as perceived by the user.

4.3 Degrees of assurance

The next aspect of sorting out services is to consider the degree of assurance that the user will receive that the contracted capacity will actually be there when he attempts to use it.

Statistical bandwidth allocation allows the Internet to support an increased number of users, use bandwidth vastly more efficiently, and deal flexibly with new applications and services. However, it does lead to some uncertainty as to the bandwidth that will be available at any instant. Our approach to allocating traffic is to follow this philosophy to the degree that the user can tolerate the uncertainty. In other words, we believe that a capacity allocation scheme should provide a range of service assurance. At one extreme, the user may demand an absolute service assurance, even in the face of some number of network failures.
(Wall Street traders often have two phones on their desk, connected by different building wiring to different phone exchanges, so that they can continue to make money even if a central office goes down or half the building burns.) Less demanding users may wish to purchase a service profile that is "usually" available, but may still fail with low probability. The presumption is that a higher assurance service will cost substantially more to implement.

We have called these statistically provisioned service profiles "expected capacity" profiles. This term was picked to suggest that the profiles do not describe a strict guarantee, but rather an expectation that the user can have about the service he will receive during times of congestion. This sort of service will somewhat resemble the Internet of today, in that users today have some expectation of what performance they will receive; the key change is that our mechanism provides a way by which different users can have very different expectations.

Statistical assurance is a matter of provisioning. In our scenario, an ISP can track the amount of traffic tagged as "in" crossing various links over time, and provide enough capacity to carry this subset of the traffic, even at times of congestion. This is how the Internet is managed today, but the addition of tags gives the ISP a better handle on how much of the traffic at any instant is "valued" traffic, and how much is discretionary or opportunistic traffic for which a more relaxed attitude can be tolerated.

For traffic that requires a higher level of commitment, more explicit actions must be taken. The specific sources and destinations must be determined, and then the paths between these points must be inspected to determine if there is sufficient capacity. There are two approaches.
The static approach involves making a long-term commitment to the user, and reserving the network resources to match this commitment. This involves some computation based on the topology map of the network to allocate the needed bandwidth along the primary (and perhaps secondary) routes. The dynamic approach involves using a setup or reservation protocol such as RSVP to request the service. This would explore the network path at the time of the request, and reserve the bandwidth from a pool available for dynamic services. Information concerning this pool would have to be stored in the routers themselves, to support the operation of RSVP. We have proposed a lightweight version of RSVP, called RSVP with Trusted TOS Tags, or T3, as a way to implement this dynamic service efficiently. Within one ISP, the reservation could be submitted to a central location for acceptance, depending on the design adopted for bandwidth management.

It is important to note that traffic requiring this higher level of assurance can still be aggregated with other similar traffic. It is not necessary to separate out each individual flow to ensure that it receives its promised service. It is necessary only to ensure that sufficient capacity is available between the specific sources and destinations desiring the service, and that the high-assurance packets can draw on that capacity. This implies that there would be two queues in the router, one for traffic that has received a statistical assurance, and one for this higher or "guaranteed" assurance. Within each queue, "in" and "out" tags would be used to distinguish the subset of the traffic that is to receive the preferred treatment. However, some other discriminator must be used to separate the two classes, and thus sort packets into the two queues.
Our specific proposal, which we detail later, is that two values of the TOS field be used, one to mean statistical assurance and one to mean guaranteed assurance. Statistical assurance would correspond to the service delivered across the Internet today, augmented with "in" and "out" tags.

An ISP could avoid the complexity of multiple queues and still provide the high-assurance service by over-provisioning the network to the point where all "in" traffic is essentially never dropped, no matter what usage patterns the users generate. It is an engineering decision of the ISP whether this approach is feasible.

4.4 A service profile for the access path

In some cases, what the user is concerned with is not the end-to-end behavior he achieves, but the profile for utilizing his access path to the network. For example, users today buy a high-speed access path for two different reasons. One is to transfer a continuous flow of traffic, the other to transfer bursts at high speed. The user who has bursty traffic might want on the one hand an assurance that the bursts can go through at some known speed, but on the other hand a lower price than the user who delivers a continuous flow into the Internet. Giving these two sorts of users different service profiles that describe the aggregated traffic across the access link will help discriminate between them, and provide a basis for differential charging.

A service profile of the sort discussed here is a reasonable way to capture this sort of requirement. By tagging the traffic that crosses the access path according to some service profile, the ISP commits to forward that subset of the traffic within its region, and only delivers the rest if the network is underloaded. It is instructive to compare this approach to pricing an access path with the more traditional "usage-based" scheme.
In the traditional scheme, the actual usage is metered, and the user is charged a fee that depends on the usage. If the user sends more, he pays more. However, since TCP goes faster if the net is underloaded, it is hard for the user (or the ISP aggregating his traffic) to actually regulate his usage. In contrast, a service profile allows two users with different needs to be distinguished (and presumably charged differently), but each user could be charged a known price based on the profile. If the traffic exceeds the profile, the consequence is not increased fees, but congestion pushback if the network is congested.

4.5 An example of a more sophisticated profile

Our initial service profile modeled a dedicated link of some set capacity. This service profile is easy to understand at one level, but once one runs TCP over this link, it becomes much harder to predict what behavior can actually be achieved. TCP hunts for the correct rate by increasing its window size until a packet is discarded at the bottleneck point, and then cutting its window size in half (in many current implementations). How this behavior interacts with a link of fixed size is a function of buffer size and implementation details in TCP.

A more sophisticated service profile would be one that attempted to provide a specified and predictable throughput to a TCP stream, so long as the TCP was "well-behaved". This would actually make it easier for the user to test the profile, because he could just run a TCP-based application and observe the throughput. This is an example of a "higher-level" profile, because it provides a service which is less closely related to some existing network component and more closely related to the user's actual needs. These profiles are more difficult to define, because they depend on the behavior of both the network and the end-nodes.
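One way to see this coupling is through the widely used steady-state TCP throughput approximation, in which throughput scales as MSS / (RTT * sqrt(p)) for drop probability p. The sketch below is an illustration under that model, not the profile defined in Appendix B; the constant varies across derivations in the literature.

```python
import math

def tcp_throughput_bps(mss_bytes, rtt_s, loss_prob):
    """Steady-state throughput of a "well-behaved" TCP:

        rate ~= (MSS / RTT) * sqrt(3/2) / sqrt(p)
    """
    return (mss_bytes * 8.0 / rtt_s) * math.sqrt(1.5 / loss_prob)

def drop_prob_for_rate(target_bps, mss_bytes, rtt_s):
    """Invert the relation: the drop rate the network must induce so
    a conforming TCP settles near target_bps.  This is why a profile
    that fixes a TCP rate depends on end-node behavior, not just on
    the capacity of a virtual link."""
    return 1.5 * (mss_bytes * 8.0 / (rtt_s * target_bps)) ** 2
```

For example, under this model a 1 Mb/s target with 1500-byte segments and a 100 ms round trip corresponds to a drop probability of about 2%.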
However, we have experimented with the design of such a profile, and believe that it is possible to implement this sort of service as well. A more detailed description of the profile needed to fix a TCP transfer rate is given in Appendix B.

5. Location of Service Profiles in the Network

In the simple sender-based scheme described so far, the function that checks whether traffic fits within a profile is implemented by tagging packets as in or out of profile at the edge of the network. The complete story is more complex. A profile describes an expectation of service obtained by a customer from a provider. These relationships exist at many points in the network, ranging from individual users and their campus LANs to the peering relationships between global ISPs. Any such boundary may be an appropriate place for a profile meter.

Further, the packet tagging associated with this service profile will, in the general case, be performed by devices at either side of the boundary. One sort, located on the traffic-sourcing side of a network boundary, is a "policy meter". This sort implements some policy by choosing the packets that leave the network (or user's machine) with their in-profile bit set, and thus receive the assured service. Another sort, a "checking meter", sits on the arriving-traffic side of a network boundary, checks the incoming traffic, and marks packets as out of profile (or turns off excess in-profile bits) if the arriving traffic exceeds the assigned profile.

A general model is that the first meter that the traffic encounters should provide the highest degree of discrimination among the flows. A profile meter could be integrated into a host implementation of TCP and IP, where it could serve to regulate the relative use of the network by individual flows.
The subsequent meters, looking only at larger aggregates, serve to verify that there is a large enough overall service contract in place at that point to carry all of the traffic tagged as "in" (the valuable traffic) at the interior points. When a traffic meter is placed at the point where a campus or corporate network connects to an ISP, or one ISP connects to another, the traffic being passed across the link is highly aggregated. The ISP, on the arriving-traffic side of the link, will check only to verify the total behavior. On the traffic-sourcing side of the link, an additional profile meter can be installed to verify that tags have been applied according to the policy of the user.

Additional profile meters installed at intermediate points can provide good feedback on network demand. Consider a specific situation, where traffic is tagged at individual hosts according to policies specific to those hosts, and then passes through a second meter at the point of attachment from the private network to the public Internet. If the number of "in" packets arriving at that point exceeds the aggregate service profile purchased at that point, this means that the user has not purchased enough aggregate capacity to match the needs of his individual policy settings. In the short run, there is no choice but to turn some of these "in" packets to "out" (or to charge an extra fee for carrying unexpected overloads), but in the long run, this provides a basis to negotiate a higher service level with the ISP. So traffic meters actually provide a basis for monitoring user needs, and moving users to a higher service profile as needed.
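The checking-meter behavior described above can be sketched as a simple token bucket that demotes excess "in" traffic to "out". This is an illustrative sketch only, not part of the proposal; the class name, the rate and depth parameters, and the use of a token bucket to represent the profile are assumptions for the example.

```python
import time

class CheckingMeter:
    """Illustrative checking meter: passes "in" packets that fit a
    token-bucket profile, and demotes the excess to "out"."""

    def __init__(self, rate_bps, depth_bits):
        self.rate = rate_bps        # contracted profile rate, bits/sec
        self.depth = depth_bits     # burst tolerance, bits
        self.tokens = depth_bits
        self.last = time.monotonic()

    def check(self, size_bits, in_profile):
        """Return the tag the packet leaves with (True = "in")."""
        # Refill tokens at the contracted profile rate.
        now = time.monotonic()
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if not in_profile:
            return False            # "out" packets pass through unchanged
        if self.tokens >= size_bits:
            self.tokens -= size_bits  # debit the profile; tag stays "in"
            return True
        return False                # excess "in" traffic is demoted to "out"
```

A policy meter on the sending side could use the same bucket logic, but would choose which packets to mark "in" according to local policy rather than simply demoting the excess.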
5.1 Controlling the scope of profiles

Even in the case where the user wants to obtain a service profile that is not specific to one destination, but rather applies to "all" possible destinations, it is clear that the "all" cannot be literally true. Any service profile that involves an unspecified set of destinations will have to bound the scope of the agreement. For example, a single ISP or a set of co-operating ISPs may agree to provide an assured service profile among all of their end points, but if the traffic passes beyond that point, the profile will cease to apply.

The user might be given further options in the design of his profile. For example, if there are regions of restricted bandwidth within the Internet, some users may wish to pay more in order to have their "in" tags define their service across this part of the net, while others may be willing to have their "in" tags reset if the traffic reaches this point.

This could be implemented by installing a profile meter at that point in the network, with explicit lists of source-destination pairs that are and are not allowed to send "in" traffic beyond this point. The alternative would be some sort of "zone system" for service profiles that is specified in the packets themselves. See [Clark97] for a discussion of a zone system.

6. Details of the Mechanism

This section describes several aspects of our proposed mechanism in more detail.

6.1 Design of the dropper

One of the key parts of this scheme is the algorithm in the router that drops "out" packets preferentially during times of congestion. The behavior of this algorithm must be well understood and agreed on, because the taggers at the edge of the network must take this behavior into account in their design. There can be many taggers, with different goals as to the service profile, the degree of aggregation, etc.
There is only one dropper, and all the routers have to perform an agreed behavior.

The essence of our dropper is an algorithm which processes all packets in order as received, in a single queue, but preferentially drops "out" packets. There are other designs that could be proposed for queue management, for example to put the "in" packets in a higher-priority queue. There are specific reasons why we prefer drop preference to priority queuing for the allocation of best-effort traffic, but we delay that discussion until Section 7.

The primary design goals of the dropper are the following:

- It must allow the taggers to implement a range of service profiles in a useful and understandable way.

- If the router is flooded with "out" packets, it must be able to discard them all without harming the "in" packets. In other words, it must deal well with non-conforming flows that do not adjust their sending rate when they observe packet loss.

- If the router is receiving a number of "well-behaved" TCP flows, which will (as TCP always does) speed up until they encounter a lost packet, it must have enough real buffering available that once it starts to get overloaded with packets, it can discard "out" packets and still receive traffic bursts for a round trip until the affected TCPs slow down.

6.2 RIO -- RED with In and Out

Our specific dropping scheme is an extension of the Random Early Detection scheme, or RED, that is now being deployed in the Internet. The general behavior of RED is that, as the queue begins to build up, it drops packets with a low but increasing probability, instead of waiting until the queue is full and then dropping all arriving packets. This results in better overall behavior, shorter queues, and lower drop rates.

Our approach is to run two RED algorithms at the same time, one for "in" packets and one for "out" packets.
The "out" RED algorithm starts dropping at a much shorter average queue length, and drops much more aggressively than the "in" algorithm. With proper setting of the parameters, the "out" traffic can be controlled before the queue grows to the point that any "in" traffic is dropped. We call this scheme RIO.

There are some subtle aspects to this scheme. The "in" dropper must look at the number of "in" packets in the queue. The "out" dropper must look at the total queue length, taking into account both "in" and "out" packets. This is because the link can be subjected to a range of overloads, from a mix of "in" and "out" traffic to just "out". In both cases, the router must start dropping "outs" before the "in" traffic is affected, and must continue to achieve the basic function of RED: preserving enough free buffer space to absorb transient loads with a duration too short to be affected by feedback congestion control.

6.3. Rate control of TCP

A useful, but challenging, problem is to build a traffic meter that causes a TCP to send at a specified maximum rate in periods of congestion. Such a meter works by causing the TCP's bandwidth usage (actually congestion avoidance) algorithm to "hunt" between somewhat over and somewhat under the target rate, by tagging packets such that the RIO algorithm will drop them appropriately when the network is loaded. An important aspect of this is that the meter and RIO work together to avoid *too many* closely spaced packet discards, which would force the TCP into slow-start and cause it to obtain less than the desired bandwidth.

A detailed description of a traffic meter which meets these objectives is given in Appendix B of this note.

6.4. Dealing with non-responsive flows

A well-behaved TCP, or other traffic source that responds similarly to congestion signaled by packet loss, will respond well to the RIO dropper.
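The RIO drop decision of Section 6.2 can be sketched as two RED computations run side by side: the "in" decision keyed to the average number of queued "in" packets, the "out" decision keyed to the total average queue. This sketch is illustrative only; the threshold and probability values are invented for the example and are not recommended settings.

```python
import random

def red_drop_prob(avg_q, min_th, max_th, max_p):
    """Standard RED drop probability as a function of average queue size."""
    if avg_q < min_th:
        return 0.0
    if avg_q >= max_th:
        return 1.0
    return max_p * (avg_q - min_th) / (max_th - min_th)

def rio_should_drop(avg_in_q, avg_total_q, is_in, rng=random.random):
    """RIO: "in" packets are judged against the average number of "in"
    packets queued; "out" packets against the total average queue,
    with shorter thresholds and a much higher drop probability.
    All parameter values below are illustrative."""
    if is_in:
        p = red_drop_prob(avg_in_q, min_th=40, max_th=70, max_p=0.02)
    else:
        p = red_drop_prob(avg_total_q, min_th=10, max_th=30, max_p=0.50)
    return rng() < p
```

With these (illustrative) settings, "out" traffic is controlled well before the queue reaches the point where the "in" thresholds come into play, which is the behavior the text describes.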
As more of its packets are marked as "out", one will eventually be dropped. At this point, the source will back off. As a result, most of the time a network of well-behaved TCPs will contain just enough "out" packets to use up any excess capacity not claimed by the "in" packets being sent.

But what if there is a source of packets that does not adjust? This could happen because of a poorly implemented TCP, or from a source of packets, such as a video data application, that does not or cannot adjust.

In this circumstance, if the unresponsive flow's packets are marked as out of profile, the flood of "out" packets will cause a RIO router to operate in a different way, but well-behaved TCPs and similar flows must continue to receive good service. (If the unresponsive flow's packets are in profile, the network should be able to carry them, and there is no issue.)

6.4.1. Robustness against non-responsive flows

In the RIO scheme, once the level of "out" packets exceeds a certain average level, all the incoming "out" packets will be discarded (this is similar to the non-RIO RED behavior). This behavior has the consequence of increasing the router's queue length. The average queue length will increase by the number of "out" packets that are allowed to sit in the queue before RIO switches over to the phase where it drops every "out". There must be enough physical room in the buffer so that even when there are this many "out" packets present, there is enough room for the normal instantaneous bursts of "in" packets which would be seen in any event. Thus, a RIO router may require slightly larger queues than a non-RIO router.

In the simulations summarized in Appendix B, the maximum number of "out" packets is approximately 15. (This particular number is not magic -- the point is that it is not 1, nor 100.)
So to operate RIO, it will be necessary to increase the minimum physical buffer size by perhaps this amount, or a little more, to allow for swings in the instantaneous numbers of "out" packets as well. But in most circumstances, this is a modest increase in the buffer size.

6.4.2. Filtering out non-responsive flows

Although RIO is reasonably robust against overload from non-responsive flows, it may be useful to consider the alternative strategy of singling out non-conforming flows and selectively dropping them in the congested router. There has been work [FF97] towards enhancing the traditional RED scheme with a mechanism to detect and discriminate against non-conforming flows. Discriminating against these flows requires the installation of a packet classifier or filter that can select these packet flows, so that they can be discarded. This adds complexity and introduces scaling concerns to the scheme. These concerns are somewhat mitigated because only the misbehaving flows, not the majority of flows that behave, need be recognized. Whatever classification scheme basic RED might use can be used by RIO as well.

The difference between our framework and RED is that the designers of RED are working to design an algorithm that runs locally in each router to detect non-conforming flows, without any concept of a service profile. In that case, the only sort of traffic allocation that can be done is some form of local fairness. However, with the addition of profile tags, the router can look only at the "out" packets, which by definition represent that portion of a flow that is in excess. This might make it easier to detect locally flows that were non-conforming. The alternative approach would be an indication from the traffic meter that the flow is persistently exceeding the service profile in a time of congestion.
This indication, a control packet, could either install a classifier in each potential point of congestion, or flow all the way back to the traffic meter nearest the sender, where the traffic can be distinguished and discarded (or otherwise discriminated against). The latter approach has the benefit that the control packet need not follow the detailed hop-by-hop route of the data packet in reverse, which is hard to do in today's Internet with asymmetric routes.

We consider the question of whether such a mechanism provides sufficient benefit over the approach of employing local detection of non-responsive flows at each node to be unresolved at present.

7. Alternatives to the mechanism

Schemes for differential service or capacity allocation differ in a number of respects. Some standardize on the service profiles, and embed them directly in the routers. As discussed above, our scheme has the advantage that the actual service profile is not a part of what is standardized, but is instead realized locally in the traffic meter, which gives this scheme much greater flexibility in changing the profile.

7.1. Drop preference vs. priority

One possible difference is what the router does when it is presented with an overload. Our scheme is based on a specific algorithm for drop preference for packets marked as "out". An alternative would be to put packets marked as "out" in a lower-priority queue. Under overload, that lower-priority queue would be subjected to service starvation, queue overflow, and eventually packet drops. Thus a priority scheme might be seen as similar to a drop preference scheme.

They are similar, but not the same. The priority scheme has the consequence that packets in the two queues are reordered by the scheduling discipline that implements the priority behavior.
If packets from a single TCP flow are metered such that some are marked as "in" and some as "out", they will in general arrive at the receiver out of order, which will cause performance problems for the TCP. In contrast, the RIO scheme always keeps the packets in order, and just explicitly drops some of the "out" packets if necessary. That makes TCP work much better under slight overload.

The priority scheme is often proposed for the case of a restricted class of service profiles in which all the packets of a single flow are either "in" or "out". These schemes include the concept of a "premium" customer (all its packets are "in"), or a rate-limited flow (packets that exceed the service profile are dropped at the meter, rather than being passed on). These proposals are valid experiments in what a service profile should be, but they are not the only possibilities. The drop preference scheme has the advantage that it seems to support a wider range of potential service profiles (including the above two), and thus offers an important level of flexibility.

7.2. More bits?

A variation on this scheme is to have more than two levels of control -- more than simple "in" and "out". One reason to have more than two levels is to allow the user to express his range of values more completely. With three or four levels of tagging, the user could express what service profile he would like at different levels of congestion -- none, low, medium and severe. The question is whether this actually brings much real incremental value. In commercial networks, which are usually provisioned in a conservative fashion, it is not clear that there will be enough congestion to discriminate between more than two states.
In other circumstances, for example military networks where severe service degradation might occur under adverse circumstances, having several levels of usage preference might be helpful. Asking the user to define these several tiers of service profiles raises one issue, however; it presumes that the user is actually able to determine what his needs are to this degree of precision. It is not actually clear that the user has this level of understanding of how he would trade off usage against cost.

There is an alternative way to deal with variation in the degree of congestion. Instead of coding the user's desires into each packet, one could imagine a management protocol running in the background that reports to the edges of the network what the current level of congestion is, or whether a special or crisis circumstance exists. Based on information from that protocol, the service profile of the user could be changed. Both approaches may have advantages. An advantage of the first is the lack of need for a management protocol. An advantage of the second is that the management protocol can express a much wider range of policies and reallocation actions.

Another reason to have multiple levels of control is to achieve a smoother transition between the two states of a flow. As discussed above, when controlling TCP, because of the specific congestion schemes used in TCP, it is helpful not to drop a number of packets from one flow at once, because this is likely to trigger a full TCP slow-start, rather than the preferable fast recovery action. Having more bits might enhance this discrimination. However, based on our simulations, if we are going to use more bits from the packet header for control, it might be a better option to move to an Explicit Congestion Notification design for the Internet, which seems to provide a better degree of control overall.

8.
Deployment Issues

8.1. Incremental deployment plan

No scheme like this can be deployed at once in all parts of the Internet. It must be possible to install it incrementally, if it is to succeed at all.

The obvious path is to provide these capabilities first within a single ISP. This implies installing RIO routers within the ISP, and tagging the traffic at the access points to that ISP. This requires a profile meter at each access link into that ISP. The meter could maintain a large amount of user-specific information about desired usage patterns between specific sources and destinations (and this might represent a business opportunity), but more likely would provide only rough categories of traffic classification.

A user of this ISP could then install a profile meter on his end of the access link, which he controls and configures, to provide a finer-grained set of controls over which traffic is to be marked as "in" and "out". Eventually, meters might appear as part of host implementations, which would permit the construction of profiles that took into account the behavior of specific applications, and which would also control the use of network resources within the campus or corporate area.

At the boundary to the region of routers implementing RIO, all traffic must be checked, to make sure that no un-metered traffic sneaks into the network tagged as "in". So the implementation of this scheme requires a consistent engineering of the network configuration within an administrative region (such as an ISP) to make sure that all sources of traffic have been identified, and either metered or "turned out".

If some routers implement RIO, and some do not, but just implement simple RED, the user may fail to receive the committed service profile. But no other major failures will occur. That is, the worst that the user will see is what he sees today.
One can achieve substantial incremental improvements by identifying points of actual congestion, and putting RIO routers there first.

8.2. What has to be standardized

In fact, very little of this scheme needs to be standardized in the normal pattern of IETF undertakings. What is required is to agree on the general approach, and set a few specific standards.

8.2.1. Semantics of router behavior

It is necessary to agree on the common semantics that all routers will display for "in" and "out" bits. Our proposal is that routers implement the RIO scheme, as described above. The parameters should be left for operational adjustment.

For the receiver-based scheme, the router has to tag packets rather than drop them. We omit the description of the tagging algorithm, only noting that it, too, must be agreed to if a receiver-based scheme is to be deployed.

8.2.2. Use of IP precedence field

Currently, the three-bit precedence field in the IP header is not widely used. Bit x of this field will be used as the "in/out" bit. This bit will be known as the In Profile Indicator, or IPI. The meaning of the IPI is that a value of 1 implies "in". This has the effect that the normal default value of the field, 0, will map to the baseline behavior, which is out-of-profile service.

8.2.3. Use of IP TOS field

This document proposes to view Type of Service in a slightly different way than has been usual in the past. While previous RFCs have not been explicit (e.g. RFC 1349), the role of the TOS field has been thought of more as controlling routing than scheduling and dropping within the router. This document explicitly proposes to specify these features.
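Before turning to the TOS field, the IPI encoding of Section 8.2.2 can be illustrated with simple bit operations. Because the document deliberately leaves the bit position unspecified ("bit x"), the position chosen below is hypothetical, used only to make the example concrete.

```python
# Hypothetical IPI position within the IPv4 TOS octet; the document
# assigns no specific bit ("bit x" of the precedence field).
IPI_BIT = 0x80

def set_ipi(tos_octet):
    """Mark a packet "in" by setting the In Profile Indicator."""
    return tos_octet | IPI_BIT

def clear_ipi(tos_octet):
    """Demote a packet to "out"; the default value 0 means out of profile."""
    return tos_octet & ~IPI_BIT & 0xFF

def is_in_profile(tos_octet):
    return bool(tos_octet & IPI_BIT)
```

Note how the encoding satisfies the requirement above: an all-zero field, the normal default, maps to the baseline out-of-profile service.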
The TOS field can be used for this purpose, but doing so will preclude its use in the same packet to select the services defined in RFC 1349 and RFC 1700: low delay, high throughput, low cost, high reliability and high security.

According to RFC 1349, the TOS field should be viewed as a 4-bit integer value, with certain values reserved for backwards compatibility. We propose that the six defined values of TOS be associated with the statistical service profiles ("expected capacity profiles") defined in this document. That is, the use of the IPI is legal with any of these values of TOS, and the difference among them is routing options.

A new value of TOS (yyyy) shall be used to specify the assured service profile, which has a level of assurance for the service profile that is not statistical in nature. As part of the design of this type of service, the routing will have to be controlled to achieve this goal, so the value yyyy for the TOS will also imply some routing constraints for the ISPs. It is an engineering decision of the service provider how this sort of traffic is routed, so that it follows the routes along which the resources have been reserved.

8.2.4. Additional issues for the sender/receiver based scheme

The combined sender-receiver scheme is capable of expressing a much more complex set of value relationships than the sender-based scheme. However, it implies more complexity and more bits in the header. It does not appear possible to encode all the necessary information for the combined scheme in an IPv4 header. This option is thus proposed as a consideration for IPv6, if there seems to be sufficient demand.

9. Security considerations

This scheme is concerned with resource allocation, and thus the major security concern is theft of resources.
Resources can be stolen by injecting traffic that is marked as "in", but which has not passed through a legitimate profile meter, into a RIO-controlled region of the network.

To protect against this, it is necessary to define "regions of shared trust", and engineer and audit all the links that bring traffic into each of these regions to ensure that a profile meter has been installed on each such link. Such a region might correspond to a single ISP, the backbone component of a single ISP, a collection of co-operating ISPs, and so on. In general, the presence of a profile meter is an indication of a possible boundary where trust is not shared, and the traffic has to be verified.

It is a matter for further research whether algorithms can be designed to detect (locally, at each router) a flow of packets that is not legitimate.

10. Acknowledgments

The simulations reported in this paper were performed by Wenjia Fang. Earlier simulations that proved the concepts of the profile meter and the receiver-based scheme were performed by Pedro Zayas. We acknowledge the valuable discussions with the members of the End-to-End research group.

Appendix A: Description of a receiver-based scheme

The tagging scheme described above implements a model in which the sender, by selecting one or another service profile, determines what service will govern each transfer. However, the sender-controlled model is not the only appropriate model for determining how Internet transmissions should be regulated. For much of the traditional Internet, where information has been made available, often for free, to those users who care enough to retrieve it, it is the value that the receiver places on the transfer, not the sender, that would properly dictate the service allocated to the transfer.
In this document, we do not debate the philosophical tradeoff between sender- and receiver-controlled schemes. Instead, we describe a mechanism that implements receiver control of service, which is similar in approach to, and meshes with, the sender-controlled tagging scheme.

One technique that does not work is to have the receiver send some credentials to the sender, on the basis of which a flag is set in the packet. This runs the risk of great complexity, but more fundamentally does not deal with multicast, where one packet may go to several receivers, each of which attaches a different value to the transfer.

A critical design decision is whether the scheme must react to congestion instantly, or with one round-trip delay. If it must react instantly, then each point of congestion must have stored state, installed by the receiver, that will allow that point to determine if the packet is "in" or "out" of profile. This runs the risk of formidable complexity. If, however, we are willing to have the reaction to congestion occur one round trip later, several quite tractable schemes can be proposed, which are similar to the sender-controlled scheme in spirit.

A receiver-controlled scheme can be built using a traffic meter at the receiver, similar to the traffic meter at the sender in the sender tagging scheme. The meter knows what the current usage profile of the receiver is, and thus can check to see whether a stream of received packets is inside of the profile. A (different) new flag in the packet, called Forward Congestion Notification, or FCN, is used to carry information about congestion to the receiver's traffic meter. A packet under this receiver-controlled scheme starts out from the sender with the FCN bit off, and when the packet encounters congestion the bit is set on.
As the packet reaches the destination, the receiver's traffic meter notes that the bit is on, and checks to see if the packet fits within the profile of the receiver. If it does, the service profile of the receiver is debited, and the bit is turned off in the packet. If the packet cannot fit within the profile of the user, the bit remains on.

When the receiver receives a packet with the FCN on, which means that the receiver's profile does not have sufficient capacity to cover all the packets that encountered congestion, the sender must be instructed to slow down. This can occur in a number of ways. First, for TCP, the receiver could reduce the window size. That is, the receiver as well as the sender could compute a dynamic congestion window. This is complex to design. Second, again for TCP, the ACK packet or a separate control message (such as an ICMP Source Quench) could carry back to the sender some explicit indication to slow down. Third, for TCP, if the traffic meter noted that the receiver seemed to have taken no action in response to the FCN bit, the meter could delete some returning ACKs or an incoming data packet, which will trigger a congestion slowdown in the sender.

The paper by Floyd [Floyd95] contains a detailed discussion of enhancing TCP to include explicit congestion notification, using either bits in the header or the ICMP Source Quench message with redefined semantics. The range of algorithms explored there for implementing explicit notification is directly applicable to this scheme. In fact, the end-node behavior (the source and destination TCP) for her scheme is exactly the same as for the scheme here. What is different is the method of notifying the end node of the congestion. In her scheme, random packets are selected to trigger congestion notification.
In this scheme, during periods of congestion all packets are marked, but these marks are then removed by the receiver's traffic meter, unless the rate exceeds the installed service profile.

We have simulated the receiver-based scheme, using the ECN mechanism proposed by Floyd to notify the sending TCP to slow down. Because of the very well-behaved characteristics of the ECN scheme, we can regulate TCPs to different sending rates essentially flawlessly.

A key question in the successful implementation of the receiver scheme is defining what constitutes congestion in the router -- under what conditions the router should start setting the FCN bit. Hypothetically, the router should start setting the bit as soon as it detects the onset of queuing. It is important to detect congestion and limit traffic as soon as possible, because it is very undesirable for the queue to build up to the point where packets must be discarded.

Key differences between sender and receiver control

There are a number of interesting asymmetries between the sender and receiver versions of this tag-and-profile scheme, asymmetries that arise from the fact that the data packets flow from the sender to the receiver. In the sender scheme, the packet first passes through the meter, where it is tagged, and then through any points of congestion, while in the receiver scheme the packet first passes through any points of congestion, where it is tagged, and then through the receiver's meter. The receiver scheme, since it only sets the FCN bit if congestion is actually detected, can convey to the end point dynamic information about current congestion levels. The sender scheme, in contrast, must set the IPI and tag the packet as "in" or "out" without knowing if congestion is actually present.
Thus, it would be harder, in the sender scheme, to construct a service that billed the user for actual usage during periods of congestion.

While the receiver scheme seems preferable in that it can naturally implement both static and dynamic payment schemes, the sender scheme has the advantage that, since the packet carries in it the explicit assertion of whether it is in or out of profile, when it reaches a point of congestion the treatment of the packet is evident. In the receiver scheme, the data packet itself carries no indication of whether it is in or out of profile, so all the point of congestion can do is set the FCN bit, attempt to forward the packet, and trust that the sender will correctly adjust its transmission rate. The receiver scheme is thus much more indirect in its ability to respond to congestion. Of course, the controller at the point of congestion may employ a scheme to discard a packet from the queue, as it does now. However, the receiver scheme gives no guidance as to which packet to delete.

Another difference between the two schemes is that in the sender scheme, the sending application can set the In Profile Indicator in different packets to control which packets are favored during congestion. In the receiver scheme, all packets sent to the receiver pass through and debit the traffic meter before the receiving host gets to see them. Thus, in order for the receiving host to distinguish those packets that should receive preferred service, it would be necessary for it to install some sort of packet filter in the traffic meter. This seems feasible but potentially complex. However, it is again a local matter between the traffic meter and the attached host.
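The FCN processing at the receiver's traffic meter, as described in this appendix, might be sketched as follows. The sketch is illustrative only: the class name is invented, and representing the receiver's profile as a simple packet-count budget is an assumption for the example, not part of the proposal.

```python
class FCNReceiverMeter:
    """Illustrative receiver-side meter: absorbs FCN congestion marks
    that fit within the receiver's profile, and lets the rest through
    to trigger a sender slowdown. The profile is modeled here as a
    bare packet-count budget; a real meter would refill it over time."""

    def __init__(self, profile_budget):
        self.budget = profile_budget

    def process(self, fcn_set):
        """Return the FCN bit as seen by the receiving host."""
        if not fcn_set:
            return False       # no congestion encountered; nothing to do
        if self.budget > 0:
            self.budget -= 1   # debit the receiver's profile...
            return False       # ...and clear the bit: mark is covered
        return True            # profile exhausted; the mark survives
```

When process() returns True, the receiver (or the meter itself, for example by deleting a returning ACK) must cause the sender to slow down, as described above.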
While this scheme works well to control TCP, what about a source that
does not adjust when faced with lost packets, or otherwise just
floods the congested router? In the receiver-based scheme, there is
an increased need for some sort of notification message that can flow
backwards through the network from the receiver's traffic meter
towards the source of the traffic (or towards the congested routers
along the path) so that offending traffic can be distinguished and
discriminated against. This sort of mechanism was discussed above in
the section on Filtering out Non-Responsive Flows.

Appendix B: Designing traffic meters to control TCP throughput

We have suggested that a useful goal for a traffic meter is to let a
well-behaved TCP operate at a specific speed. This is more complex
than a service that mimics a link of a specific speed, since a TCP
may not be able to fully utilize a physical link because of its
behavior dealing with congestion. In order to design a traffic meter
that allows a TCP to go at a set speed, the designer must take into
account the behavior of TCP. This appendix presents a quick review of
the relevant TCP behavior, describes the preliminary design of a
traffic meter that directly controls TCP bandwidth, and summarizes
some simulation results. Further details of this work can be found in
[CF97].

TCP attempts to adjust its performance by varying its window size.
Within the limit imposed by the receive window, the sender increases
its window size until a packet is discarded, then reduces its window
size and begins again. This process controls the TCP's effective
throughput rate.

There are several different operating regions for TCP. When a number
of packets are lost, a TCP reduces its window size to 1, and then
(roughly) doubles it each round trip until a threshold is reached.
(This threshold is often referred to by the variable name used to
implement it in the Berkeley Unix code: ss-thresh.) Once the send
window has exceeded ss-thresh, it increases more slowly -- one packet
per round trip. When only one (or a small number) of packets are
lost, the window size is reduced less drastically; it is cut in half,
and ss-thresh is set to the new current window size. It is this
latter behavior that is the desired one in order to achieve
reasonable control over the sending rate of the TCP.

When TCP is in this part of its operating range, its window size
resembles a saw-tooth, swinging between two values differing by a
factor of two. The effect of this saw-tooth window size is to slowly
fill up the buffer at the point of congestion until a packet is
discarded, then cut the window size by two, which allows the buffer
to drain, and may actually cause a period of underutilizing the link.
Some thought will suggest that the actual average throughput achieved
by the TCP is a function of the buffer size in the router, as well as
other parameters. It is difficult to predict.

To design a traffic meter that allows a TCP to achieve a given
average rate, it is necessary for the meter to recognize the swings,
and fit them into the profile. One approach would be to build a meter
that looks at the very long-term average rate, and allows the TCP to
send so long as that average is less than the target rate. However,
this has the severe drawback that if the TCP undersends for some time
because it has no data to send, it builds up a credit in the meter
that allows it to exceed the average rate for an excessive time.
This sort of overrun can interfere with other TCPs.

The alternative is to build a meter that measures the rate of the
sending TCP, and looks for a peak rate (the high point of the saw-
tooth).
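The dependence of the saw-tooth average on router buffering, noted
above as difficult to predict, can be illustrated with a toy model
before turning to the meter itself. The parameters (packet size,
round trip time, buffer sizes) are arbitrary illustrative choices.

```python
def avg_throughput(bdp_pkts, buf_pkts, rtt=0.1, pkt_bits=8000, rtts=10000):
    """Average rate of an idealized congestion-avoidance saw-tooth.

    The window climbs by one packet per RTT until it overflows the
    bottleneck (bdp + buffer packets), then is cut in half.  The
    delivered rate in any RTT is capped by the bandwidth-delay
    product: extra packets only sit in the buffer.
    """
    w_max = bdp_pkts + buf_pkts
    cwnd = w_max // 2
    delivered = 0
    for _ in range(rtts):
        delivered += min(cwnd, bdp_pkts)
        cwnd += 1
        if cwnd > w_max:
            cwnd = w_max // 2   # single loss: window cut in half
    return delivered * pkt_bits / (rtts * rtt)  # bits per second

# A larger router buffer yields a higher average rate on the same link.
small = avg_throughput(bdp_pkts=20, buf_pkts=5)
large = avg_throughput(bdp_pkts=20, buf_pkts=40)
```

In this idealized model the larger buffer keeps the window above the
bandwidth-delay product through the whole cycle, so the link stays
full, while the smaller buffer leaves a period of underutilization
after each window cut.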
A simple approach is to build a meter that looks for short-term
sending rates above 1.33 times the target rate R. Once that rate is
detected, the meter starts tagging a few packets as "out". When one
of these is discarded, the TCP cuts its window size by a factor of
two, which will cause some sort of rate reduction, perhaps also by a
factor of two. The TCP will thus swing between 1.33 R and .66 R,
which averages out to R. One can build a meter that does this, but
it is necessary to consider several factors.

The relationship between the window size of a TCP and its sending
rate is complex. Once the buffer at the point of congestion starts to
fill up, increasing the window size does not increase the sending
rate. Each packet added to the window adds one packet in
circulation, but adds to the round trip delay by the transmission
time of one packet because of the increased waiting time in the
buffer. The combination of these two effects is to leave the
achieved throughput unchanged. If, on the other hand, the buffer is
largely empty, then if the window is cut by two, the rate will be cut
by two.

It is important that the RIO dropper operate in this region, both so
that it has enough empty buffer to handle transient congestion, and
to improve its ability to control the TCP throughput. With RIO, the
average buffer utilization by "out" packets is small, although the
instantaneous buffer size can fluctuate due to traffic bursts. As
soon as the TCP exceeds its short-term target rate of 1.33 R, some
number of "out" packets begin to appear, and if they generate a queue
in the router, a packet is dropped probabilistically, which causes
the TCP in question to cut its rate by two.
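The RIO behavior just described -- "out" packets dropped
probabilistically as a queue builds, "in" packets protected unless
the network is underprovisioned -- can be sketched as two RED-style
curves. The threshold and probability values below are illustrative
assumptions; [CF97] discusses how the two curves are actually
configured.

```python
def red_drop_probability(avg_qlen, min_th, max_th, max_p):
    """RED-style curve: no drops below min_th, probability rising
    linearly to max_p at max_th, and certain drop beyond that."""
    if avg_qlen < min_th:
        return 0.0
    if avg_qlen >= max_th:
        return 1.0
    return max_p * (avg_qlen - min_th) / (max_th - min_th)

def rio_drop_probability(tag, avg_qlen_total, avg_qlen_in):
    """Sketch of RIO: "out" packets are judged against the average
    total queue with an earlier, more aggressive curve, while "in"
    packets are judged against the average queue of "in" packets
    alone, so they are dropped only when the network is
    underprovisioned.  All thresholds are assumed values."""
    if tag == "out":
        return red_drop_probability(avg_qlen_total, min_th=5, max_th=15, max_p=0.1)
    return red_drop_probability(avg_qlen_in, min_th=20, max_th=40, max_p=0.02)
```

With these assumed settings, a modest queue of "out" packets already
draws occasional drops, while "in" packets pass untouched until the
"in" load alone builds a substantial queue.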
(Note that in a properly provisioned network, there is enough
capacity to carry all the offered "in" packets, and thus "in" packets
do not contribute to the RIO buffer load. In a sufficiently
underprovisioned network, "in" packet dropping will be triggered, and
the TCP congestion control mechanism will limit the packet load as
always. Loss of "in" packets indicates to the customer that his
provider's provisioning is inadequate to support the customer's
profile.)

An important issue in the design of this meter is finding the time
period over which to average in order to detect the 1.33 R peak.
Average over too long a time, and the average takes into account too
much of the saw-tooth, and underestimates the peak rate. Average over
too short a period, and the meter begins to detect the short-term
bursty nature of the traffic, and detects the 1.33 R peak too soon.
Since the round trips of different TCPs can differ by at least one
order of magnitude and perhaps two, designing a meter (unless it is
integrated into the host implementation and knows the round trip) is
difficult. However, reasonable parameters can be set which work over
a range of round trip delays, say 10 to 100 ms.

One objection to this approach, in which the meter looks for a short-
term peak at 1.33 R, is that a creative user could abuse the design
by carefully adjusting the window manually so that it achieved a
steady-state rate somewhat above R (the long-term target average) but
below 1.33 R. To detect this, the meter has two rate measurements,
one of which looks with a short averaging time for a peak of 1.33 R,
and a second one, with a substantially longer period (longer than a
saw-tooth), for a flow that exceeds R. If the flow falls short of R,
no action is taken, because this might simply be a lack of data to
send.
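This two-measurement meter can be sketched as a pair of
exponentially weighted moving averages. The gains, the sampling
model, and the specific adjustment rule (gradually tightening the
peak threshold) are illustrative assumptions, not the design in
[CF97].

```python
class DualRateMeter:
    """Two rate estimates over different timescales, per the text: a
    short average to catch the 1.33 R saw-tooth peak, and a long
    average (longer than a saw-tooth) to catch a flow holding a
    steady rate between R and 1.33 R."""

    def __init__(self, target_rate, short_gain=0.3, long_gain=0.01):
        self.target = target_rate
        self.short_gain = short_gain   # fast EWMA: sees the saw-tooth peak
        self.long_gain = long_gain     # slow EWMA: longer than a saw-tooth
        self.short_rate = 0.0
        self.long_rate = 0.0
        self.peak_factor = 1.33        # initially tag "out" above 1.33 R

    def observe(self, sample_rate):
        self.short_rate += self.short_gain * (sample_rate - self.short_rate)
        self.long_rate += self.long_gain * (sample_rate - self.long_rate)
        # A sustained excess over R on the slow average adjusts the
        # parameters of the short-term meter: tighten its threshold.
        if self.long_rate > self.target:
            self.peak_factor = max(1.0, self.peak_factor - 0.01)

    def tag(self):
        return "out" if self.short_rate > self.peak_factor * self.target else "in"

meter = DualRateMeter(target_rate=1.0)
for _ in range(400):
    meter.observe(1.2)   # steady 20% over R, but under the 1.33 R peak
```

A flow held steadily at 1.2 R never trips the initial 1.33 R peak
test, but the long average eventually exposes the sustained excess
and the short-term threshold is tightened until the flow is tagged
"out".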
If, however, the TCP exceeds the rate R over a long time, the
parameters of the short-term averaging meter are adjusted.

Designing this meter is a sophisticated objective, because it
represents a difficult control problem. First, it attempts to set a
rate for a sending TCP, rather than just emulating a physical link.
Second, it is operating at a low level of traffic aggregation (we
have simulated situations with as few as two flows). Third, the meter
operates without knowledge of the round trips of the individual
flows. Integrating the meter into the host, so that it can know the
measured RTT (which TCP computes anyway), greatly simplifies the
design. However, this non-integrated design is more appropriate for
an incremental deployment strategy using unmodified hosts.

Avoiding slow-start

As noted above, it is desirable to keep TCP operating in the region
where, in response to a lost packet, it cuts its window size in half
and sets ss-thresh equal to this new window size. However, if several
packets are lost at once, the TCP will execute a different algorithm,
called "slow-start", in which it goes idle for some period of time
and then sets the window size to 1. It is preferable to avoid this
behavior.

One way to avoid this is to avoid dropping several packets in close
proximity. There are two halves to achieving this goal.

The first is that the dropper should avoid dropping a block of
packets if it has not recently dropped any. That is, it should
undergo a gradual transition between the state where it is not
dropping any packets and the state where it starts to drop. RED, and
by extension RIO, has this behavior. Up to some average queue length,
RED drops no packets. As the average queue length starts to exceed
this threshold, the probability of loss starts to build, but it is a
linear function of how much longer the average is than this minimum.
So at first, the rate of drops is very low.

However, if the dropper is overloaded with "out" packets, it will be
forced to drop every one that arrives. To deal with this situation,
the meter, when it starts tagging packets as "out", also should at
first tag the packets infrequently. It should not suddenly enter a
mode where it tags a block of packets as "out". However, if the TCP
continues to speed up, as it will if the path is uncongested and it
can sustain the speed, more and more of the packets will be marked as
"out", so a gradual transition to tagging in the meter is not
sufficient to avoid all cases of clumped dropping. Both halves of the
scheme, the meter and the dropper, must enter the control phase
gradually.

In essence, this strategy introduces low-pass filters into both the
traffic metering and the congestion detection data. These filters are
needed to address two separate cases: the system dropping "out"
packets because the TCP is exceeding its profile in an otherwise
loaded network, and the system dropping "out" packets because of new
congestion in a network with TCPs previously operating above profile.

Brief simulation results

We have performed some simulations of this traffic meter and the RIO
dropper. In this note we show one test case from our simulations. The
first column is the target rate, the second column is the actual
rate, and the third column is the round trip delay.

      Target rate   Actual rate   TCP RTT
      .1 mb/s       .158 mb/s      20 ms
      1             1.032          20
      .1            .193           40
      1             1.02           40
      .1            .165           60
      1             1.01           60
      .1            .15            80
      1             .95            80
      .1            .15           100
      1             .93           100

In this simulation, the actual link capacity was exactly the sum of
the target rates, so there was no "headroom" for overshoot.
As the numbers indicate, we can control the rates of the large flows
to within 10% over a range of round trips from 20 to 100 ms, with the
longer delay flows having the greater difficulty achieving full
speed. The smaller flows, for a number of reasons, are more
opportunistic in using any unclaimed capacity, and exceed their
target ranges. By adjusting the RIO parameters and the parameters in
the meter, different detailed behavior can be produced. We are using
this research to fine-tune our best understanding of the RIO
parameters, as well as the design of advanced meters.

New TCP designs help greatly

Improvements to the dynamic performance of TCP have been proposed for
reasons unrelated to this scheme, in pursuit of more general goals
for improved operation. These include SACK TCP, which supports
selective acknowledgment when specific packets are lost, and other
TCP tuning changes that deal better with multiple losses. We have
simulated our taggers and droppers with these newer TCPs, and the
effect is to make the approach work much better. The reason for this
is that much of the care in the detailed design is required to avoid
triggering slow-start rather than fast recovery, which would reduce
our ability to control the TCP's throughput. The newer TCP designs,
which achieve that same goal generally, make our design much more
robust.

Another way to improve the operation of this scheme is to use an
Explicit Congestion Notification scheme, as has been proposed by
Sally Floyd. In this variation of RIO, RIO-ECN, the algorithm does
not drop "out" packets at first, but just sends an ECN indication to
the destination, where it is returned to the source. The design of
Floyd's ECN takes into account the round-trip time, and avoids
inadvertent triggering of a slow-start.
RIO-ECN, together with a suitable profile meter at the destination,
allows us to control TCP sending rates almost without flaw in our
simulations.

Appendix C: Economic issues

This is a technical note. However, any discussion of providing
different levels of service to different users of a commercial
network cannot be complete without acknowledging the presence of
economic issues.

The scheme presented here has been conceived in the context of the
public commercial Internet, where services are offered for money. It
also works in the context of private, corporate, or military
networks, where other, more administrative allocations of high-
quality service may be used. But it must work in the context of
commercial service. It is therefore crucial that it take into
consideration the varying business models of Internet service
customers and providers, and that it be consistent with some relevant
economic principles.

We discuss these matters briefly below. Note that we are not
suggesting that any specific business model, pricing strategy, or
service offering be universally adopted. In fact, we believe that a
strength of this framework is that it cleanly separates technical
mechanism from economic decisions at different points within the
network.

Congestion pricing

The first economic principle is that there is a marginal cost to
carrying a packet only when the network is congested. When the
network is congested, the cost of carrying a packet from user A is
the increased delay seen by user B. The traffic of user B, of course,
caused delay for A. But if A somehow were given higher priority, so
that B saw most of the delay, A would be receiving better service,
and B paying a higher price, in terms of increased delay and
(presumably) dissatisfaction.
According to economic principles, A should receive better service
only if he is willing to pay enough to exceed the "cost" to B of his
increased delay. This can be achieved in the marketplace by suitable
setting of prices. In principle, one can determine the pricing for
access dynamically by allowing A and B to bid for service, although
this has many practical problems. For an example of such a proposal,
see [MMV95].

When the network is underloaded, however, the packets from A and from
B do not interfere with each other. The marginal or incremental cost
to the service provider of carrying the packets is zero. In a
circumstance where prices follow intrinsic costs, the usage-based
component of the charge to the user should be zero. This approach is
called "congestion pricing".

The scheme described here is consistent with the framework of
congestion pricing. What the user subscribes to, in this scheme, is
an expectation of what service he will receive during times of
congestion, when the congestion price is non-zero. When the net is
underloaded, this scheme permits the user to go faster, since both
"in" and "out" packets are forwarded without discrimination in that
case.

Pricing need not (and often does not) follow abstract economic
principles. An ISP might choose to prevent users from going faster in
times of light load, to assess some price for doing so, or whatever.
But the scheme is capable of implementing a price/service structure
that matches an economically rational model, and we would argue that
any scheme should have that characteristic.

This line of reasoning has some practical implications for the design
of service profiles.
If a provider sells a profile that meters usage over some very long
period (so many "in" packets per month, for example), then there will
be a powerful incentive for the user not to expend these packets
unless congestion is actually encountered. This consequence imposes
an extra burden on the user (it is not trivial to detect congestion)
and will yield no benefit to either the user or the provider. If
there is no cost to sending traffic when the network is underloaded,
then there is no cost to having some of those packets carry "in"
tags. In fact, there is a secondary benefit, in that it allows
providers to track demand for such traffic during all periods, not
just during overload. But profiles could be defined that would
motivate the user to conserve "in" tags for times of congestion, and
these seem misguided.

Getting incentives right

The second economic principle is that pricing can be used as an
incentive to shape user behavior toward goals that benefit the
overall system, as well as the user. The "incentive compatibility"
problem is to structure the service and pricing in such a way that
beneficial aggregate behavior results.

Service profiles represent an obvious example of these issues. If a
profile can be shaped that closely matches the user's intrinsic need,
then he will purchase that profile and use it for those needs. But if
the only profile he can get provides him with unused capacity, he
will be tempted to consume that capacity in some constructive way,
since he has been required to purchase it to get what he wants. He
may be tempted to resell this capacity, or use it to carry lower-
value traffic, and so on. These uses represent distortions of the
system.

In general, resale of capacity, or arbitrage, results when pricing is
distorted and does not follow cost.
It is difficult to design a technical mechanism that can prevent
arbitrage, because the mechanism does not control pricing, but the
mechanism should not of necessity create situations where arbitrage
is a consequence. Generally speaking, this means that price should
follow cost, and that profiles should be flexible enough to match the
intrinsic needs of a range of users. This scheme attempts to capture
this objective by allowing the traffic meters to implement a range of
service profiles, rather than standardizing on a fixed set.

Inter-provider payments

One of the places where a traffic meter can be installed is at the
boundary between two ISPs. In this circumstance, the purpose is to
meter how much traffic of value, i.e., "in" packets, is flowing in
each direction. This sort of information can provide the basis for
differential compensation between the two providers.

In a pure sender-based scheme, where the revenues are being collected
from the sender, the sender of a packet marked as "in" should
presumably pay the first ISP, who should in turn pay the second ISP,
and so on until the packet reaches its final destination. In the
middle of the network, the ISPs would presumably negotiate some long-
term contract to carry each other's "in" packets, but if asymmetric
flows result, or there is a higher cost to carry the packets onward
in one or the other direction, this could constitute a valid basis
for differential payment.

As is discussed in [Clark97], the most general model requires both
sender- and receiver-based payments, so that payments can be
extracted from all participants in a transfer in proportion to the
value that each brings to the transfer. In this case, the direction
of packet flow does not determine the direction of value, and thus
the direction of compensating payment.
See the referenced paper for a full development of the details of a
mixed sender-receiver scheme. It is interesting to note that the
sender- and receiver-based schemes are to some extent associated with
different business models.

The basic sender-based scheme considered in much of this note makes
sense in many business contexts. For example, a user with multiple
sites, who wants to connect those sites with known service, can
equally well express all of these requirements in terms of behavior
at the sender, since the senders are all known in advance.

In contrast to this "closed" system, consider the "open" system of a
node attached to the public Internet, who wants to purchase some
known service profile for interaction with other sites on the
Internet. If the primary traffic to that site is incoming (for
example, browsing the Web), then it is the receiver of the traffic,
not the sender, who associates the value with the transfer. In this
case the receiver-based scheme, or a zone scheme, may best meet the
needs of the concerned parties.

References

[Clark97] D. Clark, "Combining Sender and Receiver Payment Schemes
in the Internet", Proceedings of the Telecommunications Policy
Research Conference, Solomon, MD, 1996.

[CF97] D. Clark and W. Fang, "Explicit Allocation of Best Effort
Packet Delivery Service", (soon) to be available as
http://ana.lcs.mit.edu/papers/exp-alloc.ps

[Floyd93] S. Floyd and V. Jacobson, "Random Early Detection Gateways
for Congestion Avoidance", IEEE/ACM Trans. on Networking, August
1993.

[Floyd95] S. Floyd, "TCP and Explicit Congestion Notification",
Computer Communication Review, v 24:5, October 1995.

[FF97] S. Floyd and K. Fall, "Router Mechanisms to Support End-to-End
Congestion Control", available at http://www-nrg.ee.lbl.gov/nrg-
papers.html

[Kalevi97] K. Kilkki, "Simple Integrated Media Access", Internet
Draft, June 1997.

[Kelly97] F. Kelly, "Charging and Accounting for Bursty Connections",
in "Internet Economics", L. McKnight and J. Bailey, eds., MIT Press,
1997.

[MMV95] J. MacKie-Mason and H. Varian, "Pricing the Internet", in
"Public Access to the Internet", B. Kahin and J. Keller, eds.,
Prentice Hall, 1995. Available as
http://www.spp.umich.edu/ipps/papers/info-
nets/Pricing_Internet/Pricing_the_Internet.ps.Z

Authors' Addresses:

David D. Clark
MIT Laboratory for Computer Science
545 Technology Sq.
Cambridge, MA 02139
ddc@lcs.mit.edu
617-253-6003
617-253-2673 (FAX)

John Wroclawski
MIT Laboratory for Computer Science
545 Technology Sq.
Cambridge, MA 02139
jtw@lcs.mit.edu
617-253-7885
617-253-2673 (FAX)