Internet Engineering Task Force                 D. Clark / J. Wroclawski
INTERNET-DRAFT                                                   MIT LCS
draft-clark-diff-svc-alloc-00.txt                             July, 1997
                                                          Expires: 12/97


          An Approach to Service Allocation in the Internet


Abstract

This note describes the Service Allocation Profile scheme for differential service allocation within the Internet. The scheme is based on a simple packet drop preference mechanism at interior nodes, and highly flexible service profiles at edges and inter-provider boundary points within the net. The service profiles capture a wide variety of user requirements and expectations, and allow different users to receive different levels of service from the network in a scalable and efficient manner. The note describes the basic form of the mechanism, discusses the range of services that users and providers can obtain by employing it, and gives a more detailed presentation of particular technical, deployment, standardization, and economic issues related to its use.

Status of this Memo

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress".

To learn the current status of any Internet-Draft, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).
NOTE: This draft is a snapshot of a document in progress, and was somewhat arbitrarily cast into its current form at the Internet-Draft submission deadline for the Munich IETF. The authors apologize in advance for a certain raggedness of presentation.

1. Introduction

This document describes a framework for providing what has been called differentiated service, or allocated capacity service, in the Internet. The goal of the mechanism is to allocate the bandwidth of the Internet to different users in a controlled way during periods of congestion. The mechanism applies equally to traditional applications based on TCP, such as file transfer, database access, or Web servers, and to new sorts of applications such as real-time audio or video.

The mechanism we describe can provide users with a predictable expectation of what service the Internet will provide to them in times of congestion, and can allow different users to obtain different levels of service from the network. This contrasts with today's Internet, in which each user gets some unpredictable share of the capacity.

Our mechanism provides two additional things that are important to this task. First, it allows users and providers with a wide range of business and administrative models to make capacity allocation decisions. In the public Internet, where commercial providers offer service for payment, the feedback will most often be different prices charged to customers with different requirements. This allows the providers to charge differential prices to users that attach greater value to their Internet access, and thus fund the deployment of additional resources to better serve them. But whether pricing or some other administrative control is used (as might apply in a corporate or military network), the same mechanism for allocating capacity can be used.
The mechanism also provides useful information to providers about provisioning requirements. With our mechanism in place, service providers can more easily allocate specific levels of "assured" capacity to customers, and can easily monitor their networks to detect when the actual service needs of their customers are not being met.

While this document does describe a mechanism, this is a small part of its goal. There are a number of mechanisms that might be proposed, and the issue is not just demonstrating which of them works (most do work in some fashion), but discussing what the problem to be solved actually is, and therefore which of the possible mechanisms best meets the needs of the Internet. This document is thus as much about what the problem actually is as it is about a preferred solution.

1.1 Separating results from mechanism

An essential aspect of this scheme is the range of services the user can obtain using the mechanism. The mechanism is obviously not useful if it does not meet a current need. Some initial requirements we see for services are that they must be useful, easy to understand, possible to measure (so the user can tell whether he is getting the service he contracted for), and easy to implement.

At the same time, we should try very hard not to embed a specific set of services into the core of the Internet. As we gain experience in the marketplace, we may discover that our first speculations are wrong about what service the user actually wants. It should be possible to change the model, evolve it, and indeed to try different models at the same time to see which better meets the needs of the user and the market. So this scheme has two goals: defining and implementing a first set of services, but permitting these services to be changed without modifying the "insides" of the Internet, specifically the routers.
We will return later to the discussion of different sorts of services.

2. Outline of this document

This document is organized as follows: Section 3 describes the basic mechanism, to give a general idea of how such a service can be implemented. Section 4 discusses the services which might be desired. It proposes a first set of services that might be implemented, and discusses the range of services that can be built out of this mechanism. Section 5 describes the location of service profiles in the network. Section 6 describes details of the mechanism. These include our specific algorithm for the dropper, issues concerning rate control of TCP, and dealing with non-responsive flows. Section 7 compares this mechanism with some alternatives. Section 8 discusses deployment issues, incremental deployment, and what portions of the mechanism require standardization. Section 9 discusses security issues.

3. The basic scheme

The general approach of this mechanism is to define a service profile for each user, and to design a mechanism in the router that favors traffic that is within those service profiles. The core of the idea is very simple -- monitor the traffic of each user as it enters the network, and tag packets as being "in" or "out" of their service profile. Then at each router, if congestion occurs, preferentially drop traffic that is tagged as being "out".

Inside the network, at the routers, there is no separation of traffic from different users into different flows or queues. The packets of all the users are aggregated into one queue, just as they are today. Different users can have very different profiles, which will result in different users having different quantities of "in" packets in the service queue. A router can treat these packets as a single commingled pool.
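To make the tagging operation concrete, the simplest profile meter for a committed-rate profile can be sketched as a token bucket. This is an illustrative sketch only, not part of the proposal; the class name and the rate and bucket-depth parameters are hypothetical, and (as discussed in Section 4) real profiles may be considerably more expressive:

```python
class ProfileMeter:
    """Token-bucket tagger sketch: marks packets "in" while they conform
    to a committed rate with some burst tolerance, "out" otherwise.
    (Hypothetical parameters; the draft leaves the profile open.)"""

    def __init__(self, rate_bps, bucket_bytes):
        self.rate = rate_bps / 8.0    # token fill rate, in bytes per second
        self.depth = bucket_bytes     # bucket depth = maximum burst, bytes
        self.tokens = bucket_bytes    # start with a full bucket
        self.last = 0.0               # time of the previous packet

    def tag(self, arrival_time, pkt_bytes):
        # Refill tokens for the elapsed interval, capped at the bucket depth.
        self.tokens = min(self.depth,
                          self.tokens + (arrival_time - self.last) * self.rate)
        self.last = arrival_time
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return "in"    # within profile: favored at congested routers
        return "out"       # excess traffic: dropped first under congestion
```

For example, a meter configured for 1 Mb/s with a 6000-byte burst allowance would tag the first four back-to-back 1500-byte packets "in" and the fifth "out"; after a quiet second the bucket refills and tagging resumes at "in".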
This attribute of the scheme makes it very easy to implement, in contrast to a scheme like current RSVP reservations, in which the packets must be explicitly classified at each node. We have more to say about this issue below.

To implement this scheme, the routers must be augmented to implement our dropping scheme, and a new function must be implemented to tag the traffic according to its service profile. This algorithm can be implemented as part of an existing network component (host, access device or router) or in a new component created for the purpose. Conceptually, we will refer to it as a distinct device called a "profile meter". We use the term "meter" rather than "tagger" because, as we will discuss below, the profile meter can actually take a more general set of actions.

The idea of a service profile can be applied at any point in the network where a customer-provider relationship exists. A profile may describe the needs of a specific user within a campus, the service purchased by a corporate customer from an ISP, or the traffic handling agreement between two international providers. We discuss the location of profiles further in Section 5.

The description above associates the profile with the traffic sender. That is, the sender has a service profile, the traffic is tagged at the source according to that profile, and then dropped if necessary inside the network. In some circumstances, however, it is the receiving user that wishes to control the level of service. The web provides a simple example; a customer using his browser for business research may be much more interested in a predictable level of performance than the casual surfer. The key observation is that "value" in the network does not always flow in the same direction as the data packets.
Thus, for full generality, a "dual" mechanism is required, one that can be either "sender driven" or "receiver driven". Most of this document is written, for simplicity, in terms of a sender scheme, but we briefly describe the receiver version as well, and discuss the circumstances in which it is important. In evaluating alternative mechanisms, it is important to see if both a sender and a receiver version can be built.

In later sections, we discuss the specifics of the profiling or tagging mechanism and the treatment of profiled packets within the network. First we turn to the question of the range of services the mechanism ought to support.

4. Range of services

As discussed above, there are two general issues concerning service models. First, we want to start out by implementing a simple set of services, which are useful and easy to understand. At the same time, we should not embed these services into the mechanism, but should build a general mechanism that allows us to change the services as our experience matures.

Our scheme provides this flexibility. To oversimplify, a service is defined by the profile meter, which implements the user's service profile. To change the service, it is necessary "only" to change the profile meter. The routers in the interior of the network implement a single common mechanism which is used by the different meters to provide different services.
Three things must be considered when describing a service:

- what exactly is provided to the customer (an example might be "one megabit per second of bandwidth, continuously available")

- to where this service is provided (examples might be a specific destination, a group of destinations, all nodes on the local provider, or "everywhere")

- with what level of assurance the service is provided (or, alternately, what level of performance uncertainty the user can tolerate)

These things are coupled; it is much easier to provide "a guaranteed one megabit per second" to a specific destination than to anywhere in the Internet.

4.1 A first service model

As a place to start, a particularly simple service might provide the equivalent of a dedicated link of some specified bandwidth from source to destination. (The virtue of this simple model has been clearly articulated by Van Jacobson.) This model is easy for the user to understand -- he can take his existing application, connect it across a physical link, and see how it performs. If he can make it work in that context, then this service will allow him to run that application over the Internet. This model has been implemented in a number of network architectures, with different "enhancements". The CBR service of ATM is an example, as is (to some extent) the CIR mechanism of Frame Relay.

However, there are some issues and limitations to this very simple model. One very important limit to a pure virtual link model is that the user may not wish to purchase this virtual link full time. He may need it only some of the time, and in exchange would hope to obtain a lower cost. A provider could meet this desire by offering a more expressive profile; say, a committed bandwidth with some duty cycle, e.g. "3 Mb/s with a 5% duty cycle measured over 5 minutes".
Or, the provider could offer the user a rebate based on observed (non)usage, or allow him to reserve the capacity dynamically on demand.

A second issue is whether the user can exceed the capacity of the virtual link when the network is unloaded. Today, the Internet allows its users to go faster under that circumstance. Continuing to capture that benefit may be important to user acceptance. The CIR of Frame Relay works this way, and it is an important aspect of its market appeal.

An equally important issue is that the user may not wish to set up distinct committed bandwidth flows to different destinations, but may prefer a more aggregated commitment. There are several drawbacks to making distinct bandwidth commitments between each source and destination. First, this may result in a large number of flow specifications. If the user is interested in 1000 network access points, he must specify one million source-destination pairs. Frame Relay has this specification problem. Second, the sum of the distinct commitments for any source (or destination) cannot exceed the physical capacity of the access link at that point, which may force each individual assurance to be rather small. Finally, the source-destination model implies that the user can determine his destinations in advance, and in some cases that he understands the network topology; two assumptions which are not universally true.

In fact, several variations of service commitment might make sense to different users: from one source to a specific destination, from a source to a pool of specified destinations (one might configure a Virtual Private Network in this way), and finally from a source to "anywhere", which could mean all points on the ISP, on a collection of ISPs, or any reachable node. The latter sorts of commitments are by their nature more difficult to offer with high assurance.
There is no way to know for sure what the service will be to any specific destination, because that depends on what other traffic is leaving the source, and what other traffic is arriving at the destination. Offering commitments to "anywhere within the ISP" implies that the ISP has provisioned its resources adequately to support all in-profile users simultaneously to the same destination. Offering commitments to "anywhere at all" implies that all ISPs in any reachable path from the user have provisioned sufficiently, which is most unlikely.

4.2 Managing bursty traffic

Not all Internet traffic is continuous in its requirement for bandwidth. Indeed, based on measurements of the Internet, much of the traffic is very bursty. It may thus be that a service model based on a fixed capacity "virtual link" does not actually meet users' needs very well. Some other, more complex service profile that permits bursty traffic may be more suitable.

It is possible to support bursty traffic by changing the profile meter to implement this new sort of service. The key issue is to ensure, in the center of the network, that there is enough capacity to carry this bursty traffic, and thus actually meet the commitments implied by the outstanding profiles. This requires a more sophisticated provisioning strategy than the simple "add 'em up" needed for constant bit-rate virtual links. A body of mathematics that is now maturing provides a way to relate the bursty behavior of a single flow to the resulting implications for the required overall bandwidth when a number of such flows are combined (see, for example, [Kelly97]). This sort of analysis can be employed as a way to predict the capacity that must be provided to support profiles describing bursty traffic.

As a practical matter, in the center of the existing Internet, at the backbone routers of the major ISPs, there is such a high degree of traffic aggregation that the bursty nature of individual traffic flows is essentially invisible.
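The benefit of aggregation can be illustrated with a back-of-the-envelope calculation. The sketch below uses a crude Gaussian approximation (not the more careful effective-bandwidth analysis of [Kelly97]): for independent flows, the capacity needed grows as the sum of the flows' mean rates plus a term proportional only to the square root of the number of flows, so the per-flow cost of provisioning for burstiness shrinks as aggregation increases. All traffic figures are hypothetical:

```python
import math

def capacity_estimate(n_flows, mean_bps, std_bps, z=3.0):
    """Capacity such that aggregate demand exceeds it only rarely,
    assuming independent flows and a Gaussian approximation
    (z = 3 corresponds to roughly a 0.1% overflow probability)."""
    return n_flows * mean_bps + z * math.sqrt(n_flows) * std_bps

# Hypothetical bursty flows: 100 kb/s average, 300 kb/s std. deviation.
mean, std = 100e3, 300e3
for n in (1, 100, 10000):
    per_flow = capacity_estimate(n, mean, std) / n
    print(f"{n:6d} flows: {per_flow / 1e3:7.1f} kb/s provisioned per flow")
```

With these (hypothetical) numbers, an isolated flow must be provisioned at ten times its mean rate, while at backbone-level aggregation the per-flow provisioning approaches the 100 kb/s mean, which is the sense in which individual burstiness becomes "essentially invisible" in the core.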
So providing bursty service profiles to individual users will not create a substantial provisioning issue in the center of the network, while possibly adding significant value to the service as perceived by the user.

4.3 Degrees of assurance

The next aspect of sorting out services is to consider the degree of assurance the user will receive that the contracted capacity will actually be there when he attempts to use it.

Statistical bandwidth allocation allows the Internet to support an increased number of users, use bandwidth vastly more efficiently, and deal flexibly with new applications and services. However, it does lead to some uncertainty as to the bandwidth that will be available at any instant. Our approach to allocating traffic is to follow this philosophy to the degree that the user can tolerate the uncertainty. In other words, we believe that a capacity allocation scheme should provide a range of service assurance.

At one extreme, the user may demand an absolute service assurance, even in the face of some number of network failures. (Wall Street traders often have two phones on their desk, connected by different building wiring to different phone exchanges, so that they can continue to make money even if a central office goes down or half the building burns.) Less demanding users may wish to purchase a service profile that is "usually" available, but may still fail with low probability. The presumption is that a higher assurance service will cost substantially more to implement.

We have called these statistically provisioned service profiles "expected capacity" profiles. This term was picked to suggest that the profiles do not describe a strict guarantee, but rather an expectation that the user can have about the service he will receive during times of congestion.
This sort of service will somewhat resemble the Internet of today, in that users today have some expectation of what performance they will receive; the key change is that our mechanism allows different users to have very different expectations.

Statistical assurance is a matter of provisioning. In our scenario, an ISP can track the amount of traffic tagged as "in" crossing various links over time, and provide enough capacity to carry this subset of the traffic, even at times of congestion. This is how the Internet is managed today, but the addition of tags gives the ISP a better handle on how much of the traffic at any instant is "valued" traffic, and how much is discretionary or opportunistic traffic toward which a more relaxed attitude can be tolerated.

For traffic that requires a higher level of commitment, more explicit actions must be taken. The specific sources and destinations must be determined, and then the paths between these points must be inspected to determine if there is sufficient capacity. There are two approaches. The static approach involves making a long term commitment to the user, and reserving the network resources to match this commitment. This involves some computation based on the topology map of the network to allocate the needed bandwidth along the primary (and perhaps secondary) routes. The dynamic approach involves using a setup or reservation protocol such as RSVP to request the service. This would explore the network path at the time of the request, and reserve the bandwidth from a pool available for dynamic services. Information concerning this pool would have to be stored in the routers themselves, to support the operation of RSVP. We have proposed a lightweight version of RSVP, called RSVP with Trusted TOS Tags, or T3, as a way to implement this dynamic service efficiently.
Within one ISP, the reservation could be submitted to a central location for acceptance, depending on the design adopted for bandwidth management.

It is important to note that traffic requiring this higher level of assurance can still be aggregated with other similar traffic. It is not necessary to separate out each individual flow to ensure that it receives its promised service. It is necessary only to ensure that sufficient capacity is available between the specific sources and destinations desiring the service, and that the high-assurance packets can draw on that capacity.

This implies that there would be two queues in the router: one for traffic that has received a statistical assurance, and one for this higher or "guaranteed" assurance. Within each queue, "in" and "out" tags would be used to distinguish the subset of the traffic that is to receive the preferred treatment. However, some other discriminator must be used to separate the two classes, and thus sort packets into the two queues. Our specific proposal, which we detail later, is that two values of the TOS field be used: one to mean statistical assurance, and one to mean guaranteed assurance. Statistical assurance would correspond to the service delivered across the Internet today, augmented with "in" and "out" tags.

An ISP could avoid the complexity of multiple queues and still provide the high-assurance service by over-provisioning the network to the point where all "in" traffic is essentially never dropped, no matter what usage patterns the users generate. It is an engineering decision of the ISP whether this approach is feasible.

4.4 A service profile for the access path

In some cases, what the user is concerned with is not the end-to-end behavior he achieves, but the profile for utilizing his access path to the network. For example, users today buy a high-speed access path for two different reasons. One is to transfer a continuous flow of traffic, the other to transfer bursts at high speed.
The user who has bursty traffic might want, on the one hand, an assurance that the bursts can go through at some known speed, but on the other hand a lower price than the user who delivers a continuous flow into the Internet. Giving these two sorts of users different service profiles that describe the aggregated traffic across the access link will help discriminate between them, and provide a basis for differential charging.

A service profile of the sort discussed here is a reasonable way to capture this sort of requirement. By tagging the traffic that crosses the access path according to some service profile, the ISP commits to forward that subset of the traffic within its region, and delivers the rest only if the network is underloaded.

It is instructive to compare this approach to pricing an access path with the more traditional "usage-based" scheme. In the traditional scheme, the actual usage is metered, and the user is charged a fee that depends on the usage. If the user sends more, he pays more. However, since TCP goes faster if the net is underloaded, it is hard for the user (or the ISP aggregating his traffic) to actually regulate his usage. In contrast, a service profile allows two users with different needs to be distinguished (and presumably charged differently), but each user can be charged a known price based on the profile. If the traffic exceeds the profile, the consequence is not increased fees, but congestion pushback if the network is congested.

4.5 An example of a more sophisticated profile

Our initial service profile modeled a dedicated link of some set capacity. This service profile is easy to understand at one level, but once one runs TCP over this link, it becomes much harder to predict what behavior can actually be achieved.
TCP hunts for the correct rate by increasing its window size until a packet is discarded at the bottleneck point, and then cutting its window size in half (in many current implementations). How this behavior interacts with a link of fixed size is a function of buffer size and implementation details in TCP.

A more sophisticated service profile would be one that attempted to provide a specified and predictable throughput to a TCP stream, so long as the TCP was "well-behaved". This would actually make it easier for the user to test the profile, because he could just run a TCP-based application and observe the throughput. This is an example of a "higher-level" profile, because it provides a service which is less closely related to some existing network component and more closely related to the user's actual needs. These profiles are more difficult to define, because they depend on the behavior of both the network and the end-nodes. However, we have experimented with the design of such a profile, and believe that it is possible to implement this sort of service as well. A more detailed description of the profile needed to fix a TCP transfer rate is given in Appendix B.

5. Location of Service Profiles in the Network

In the simple sender-based scheme described so far, the function that checks whether traffic fits within a profile is implemented by tagging packets as in or out of profile at the edge of the network. The complete story is more complex. A profile describes an expectation of service obtained by a customer from a provider. These relationships exist at many points in the network, ranging from individual users and their campus LANs to the peering relationships between global ISPs. Any such boundary may be an appropriate place for a profile meter.
Further, the packet tagging associated with this service profile will, in the general case, be performed by devices at either side of the boundary. One sort, located on the traffic-sourcing side of a network boundary, is a "policy meter". This sort implements some policy by choosing the packets that leave the network (or user's machine) with their in-profile bit set, and thus receive the assured service. Another sort, a "checking meter", sits on the arriving-traffic side of a network boundary, checks the incoming traffic, and marks packets as out of profile (or turns off excess in-profile bits) if the arriving traffic exceeds the assigned profile.

A general model is that the first meter that the traffic encounters should provide the highest degree of discrimination among the flows. A profile meter could be integrated into a host implementation of TCP and IP, where it could serve to regulate the relative use of the network by individual flows. The subsequent meters, looking only at larger aggregates, serve to verify that there is a large enough overall service contract in place at that point to carry all of the traffic tagged as "in" (the valuable traffic) at the interior points. When a traffic meter is placed at the point where a campus or corporate network connects to an ISP, or one ISP connects to another, the traffic being passed across the link is highly aggregated. The ISP, on the arriving-traffic side of the link, will check only to verify the total behavior. On the traffic-sourcing side of the link, an additional profile meter can be installed to verify that tags have been applied according to the policy of the user.

Additional profile meters installed at intermediate points can provide good feedback on network demand. Consider a specific situation, where traffic is tagged at individual hosts according to policies specific to these hosts, and then passes through a second meter at the point of attachment from the private network to the public Internet.
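A checking meter of this sort can be sketched as follows. This is an illustrative sketch under the assumption of a simple token-bucket aggregate profile; the class name, the parameters, and the demotion-counting hook are hypothetical rather than prescribed:

```python
class CheckingMeter:
    """Boundary meter sketch: verifies that arriving "in" traffic fits
    the aggregate contracted profile, demoting any excess to "out".
    Packets already tagged "out" pass through unchanged.
    (Hypothetical parameters; illustrative only.)"""

    def __init__(self, contracted_bps, bucket_bytes):
        self.rate = contracted_bps / 8.0   # fill rate, bytes per second
        self.depth = bucket_bytes          # burst tolerance, bytes
        self.tokens = bucket_bytes
        self.last = 0.0
        self.demoted = 0                   # "in" packets turned to "out"

    def check(self, arrival_time, pkt_bytes, tag):
        # Refill tokens for the elapsed interval, capped at the depth.
        self.tokens = min(self.depth,
                          self.tokens + (arrival_time - self.last) * self.rate)
        self.last = arrival_time
        if tag == "in":
            if self.tokens >= pkt_bytes:
                self.tokens -= pkt_bytes
                return "in"
            self.demoted += 1   # excess "in" traffic is demoted, not dropped
            return "out"
        return "out"
```

The `demoted` counter is the monitoring hook suggested by the discussion below: a steadily growing count is the operational signal that the aggregate profile purchased at this boundary no longer matches the policies applied at the individual hosts.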
If the number of "in" packets arriving at that point exceeds the aggregate service profile purchased at that point, this means that the user has not purchased enough aggregate capacity to match the needs of his individual policy settings. In the short run, there is no choice but to turn some of these "in" packets to "out" (or to charge an extra fee for carrying unexpected overloads), but in the long run, this provides a basis to negotiate a higher service level with the ISP. So traffic meters actually provide a basis for monitoring user needs, and moving users to a higher service profile as needed.

5.1 Controlling the scope of profiles

Even in the case where the user wants to obtain a service profile that is not specific to one destination, but rather applies to "all" possible destinations, it is clear that the "all" cannot be literally true. Any service profile that involves an unspecified set of destinations will have to bound the scope of the agreement. For example, a single ISP or a set of co-operating ISPs may agree to provide an assured service profile among all of their end points, but if the traffic passes beyond that point, the profile will cease to apply.

The user might be given further options in the design of his profile. For example, if there are regions of restricted bandwidth within the Internet, some users may wish to pay more in order to have their "in" tags define their service across this part of the net, while others may be willing to have their "in" tags reset if the traffic reaches this point. This could be implemented by installing a profile meter at that point in the network, with explicit lists of source-destination pairs that are and are not allowed to send "in" traffic beyond this point. The alternative would be some sort of "zone system" for service profiles that is specified in the packets themselves.
See [Clark97] for a discussion of a zone system.

6. Details of the Mechanism

This section describes several aspects of our proposed mechanism in more detail.

6.1 Design of the dropper

One of the key parts of this scheme is the algorithm in the router that drops "out" packets preferentially during times of congestion. The behavior of this algorithm must be well understood and agreed on, because the taggers at the edge of the network must take this behavior into account in their design. There can be many taggers, with different goals as to the service profile, the degree of aggregation, etc. There is only one dropper, and all the routers have to perform an agreed behavior.

The essence of our dropper is an algorithm which processes all packets in order as received, in a single queue, but preferentially drops "out" packets. There are other designs that could be proposed for queue management, for example to put the "in" packets in a higher priority queue. There are specific reasons why we prefer drop preference to priority queuing for the allocation of best effort traffic, but we delay that discussion until Section 7.

The primary design goals of the dropper are the following:

- It must allow the taggers to implement a range of service profiles in a useful and understandable way.

- If the router is flooded with "out" packets, it must be able to discard them all without harming the "in" packets. In other words, it must deal well with non-conforming flows that do not adjust their sending rate when they observe packet loss.

- If the router is receiving a number of "well-behaved" TCP flows, which will (as TCP always does) speed up until they encounter a lost packet, it must have enough real buffering available that once it starts to get overloaded with packets, it can discard "out" packets and still receive traffic bursts for a round trip until the affected TCP slows down.
6.2 RIO -- RED with In and Out

Our specific dropping scheme is an extension of the Random Early Detection scheme, or RED, that is now being deployed in the Internet. The general behavior of RED is that, as the queue begins to build up, it drops packets with a low but increasing probability, instead of waiting until the queue is full and then dropping all arriving packets. This results in better overall behavior, shorter queues, and lower drop rates.

Our approach is to run two RED algorithms at the same time, one for "in" packets, and one for "out" packets. The "out" RED algorithm starts dropping at a much shorter average queue length, and drops much more aggressively than the "in" algorithm. With proper setting of the parameters, the "out" traffic can be controlled before the queue grows to the point that any "in" traffic is dropped. We call this scheme RIO.

There are some subtle aspects to this scheme. The "in" dropper must look at the number of "in" packets in the queue. The "out" dropper must look at the total queue length, taking into account both "in" and "out". This is because the link can be subjected to a range of overloads, from a mix of "in" and "out" traffic to just "out". In both cases, the router must start dropping "outs" before the "in" traffic is affected, and must continue to achieve the basic function of RED: preserving enough free buffer space to absorb transient loads with a duration too short to be affected by feedback congestion control.

6.3. Rate control of TCP

A useful, but challenging, problem is to build a traffic meter that causes a TCP to send at a specified maximum rate in periods of congestion.
Such a meter works by causing the TCP's bandwidth usage (actually congestion avoidance) algorithm to "hunt" between somewhat over and somewhat under the target rate, by tagging packets such that the RIO algorithm will drop them appropriately when the network is loaded. An important aspect of this is that the meter and RIO work together to avoid *too many* closely spaced packet discards, which would force the TCP into slow-start and cause it to obtain less than the desired bandwidth. A detailed description of a traffic meter which meets these objectives is given in Appendix B of this note.

6.4. Dealing with non-responsive flows

A well-behaved TCP, or other traffic source that responds similarly to congestion signaled by packet loss, will respond well to the RIO dropper. As more of its packets are marked as "out", one will eventually be dropped. At this point, the source will back off. As a result, most of the time a network of well-behaved TCPs will contain just enough "out" packets to use up any excess capacity not claimed by the "in" packets being sent.

But what if there is a source of packets that does not adjust? This could happen because of a poorly implemented TCP, or from a source of packets, such as a video data application, that does not or cannot adjust. In this circumstance, if the unresponsive flow's packets are marked as out of profile, the flood of "out" packets will cause a RIO router to operate in a different way, but well-behaved TCPs and similar flows must continue to receive good service. (If the unresponsive flow's packets are in profile, the network should be able to carry them, and there is no issue.)

6.4.1. Robustness against non-responsive flows

In the RIO scheme, once the level of "out" packets exceeds a certain average level, all the incoming "out" packets will be discarded (this is similar to the non-RIO RED behavior). This behavior has the consequence of increasing the router's queue length.
The average queue length will increase by the number of "out" packets that are allowed to sit in the queue before RIO switches over to the phase where it drops every "out". There must be enough physical room in the buffer so that even when there are this many "out" packets present, there is enough room for the normal instantaneous bursts of "in" packets which would be seen in any event. Thus, a RIO router may require slightly larger queues than a non-RIO router.

In the simulations summarized in Appendix B, the maximum number of "out" packets is approximately 15. (This particular number is not magic -- the point is that it is not 1, nor 100.) So to operate RIO, it will be necessary to increase the minimum physical buffer size by perhaps this amount, or a little more, to allow for swings in the instantaneous numbers of "out" packets as well. But in most circumstances, this is a modest increase in the buffer size.

6.4.2. Filtering out non-responsive flows

Although RIO is reasonably robust against overload from non-responsive flows, it may be useful to consider the alternative strategy of singling out non-conforming flows and selectively dropping them in the congested router. There has been work [FF97] towards enhancing the traditional RED scheme with a mechanism to detect and discriminate against non-conforming flows. Discriminating against these flows requires the installation of a packet classifier or filter that can select these packet flows, so that they can be discarded. This adds complexity and introduces scaling concerns to the scheme. These concerns are somewhat mitigated because only the misbehaving flows, not the majority of flows that behave, need be recognized. Whatever classification scheme basic RED might use can be used by RIO as well.
The difference between our framework and RED is that the designers of RED are working to design an algorithm that runs locally in each router to detect non-conforming flows, without any concept of a service profile. In that case, the only sort of traffic allocation that can be done is some form of local fairness. However, with the addition of profile tags, the router can look only at the "out" packets, which by definition represent that portion of a flow that is in excess. This might make it easier to detect locally flows that were non-conforming.

The alternative approach would be an indication from the traffic meter that the flow is persistently exceeding the service profile in a time of congestion. This indication, a control packet, could either install a classifier in each potential point of congestion, or flow all the way back to the traffic meter nearest the sender, where the traffic can be distinguished and discarded (or otherwise discriminated against). The latter approach has the benefit that the control packet need not follow the detailed hop-by-hop route of the data packet in reverse, which is hard to do in today's Internet with asymmetric routes.

We consider the question of whether such a mechanism provides sufficient benefit over the approach of employing local detection of non-responsive flows at each node to be unresolved at present.

7. Alternatives to the mechanism

Schemes for differential service or capacity allocation differ in a number of respects. Some schemes standardize the service profiles themselves, and embed them directly in the routers. As discussed above, our scheme has the advantage that the actual service profile is not a part of what is standardized, but is instead realized locally in the traffic meter, which gives this scheme much greater flexibility in changing the profile.

7.1. Drop preference vs. priority

One possible difference is what the router does when it is presented with an overload. Our scheme is based on a specific algorithm for drop preference for packets marked as "out". An alternative would be to put packets marked as "out" in a lower priority queue. Under overload, that lower priority queue would be subjected to service starvation, queue overflow and eventually packet drops. Thus a priority scheme might be seen as similar to a drop preference scheme.

They are similar, but not the same. The priority scheme has the consequence that packets in the two queues are reordered by the scheduling discipline that implements the priority behavior. If packets from a single TCP flow are metered such that some are marked as "in" and some as "out", they will in general arrive at the receiver out of order, which will cause performance problems with the TCP. In contrast, the RIO scheme always keeps the packets in order, and just explicitly drops some of the "out" packets if necessary. That makes TCP work much better under slight overload.

The priority scheme is often proposed for the case of a restricted class of service profiles in which all the packets of a single flow are either "in" or "out". These schemes include the concept of a "premium" customer (all its packets are "in"), or a rate-limited flow (packets that exceed the service profile are dropped at the meter, rather than being passed on). These proposals are valid experiments in what a service profile should be, but they are not the only possibilities. The drop preference scheme has the advantage that it seems to support a wider range of potential service profiles (including the above two), and thus offers an important level of flexibility.

7.2. More bits?

A variation on this scheme is to have more than two levels of control -- more than simple "in" and "out".
One reason to have more than two levels is to allow the user to express his range of values more completely. With three or four levels of tagging, the user could express what service profile he would like at different levels of congestion -- none, low, medium and severe. The question is whether this actually brings much real incremental value. In commercial networks, which are usually provisioned in a conservative fashion, it is not clear that there will be enough congestion to discriminate between more than two states. In other circumstances, for example military networks where severe service degradation might occur under adverse circumstances, having several levels of usage preference might be helpful.

Asking the user to define these several tiers of service profiles raises one issue, however: it presumes that the user is actually able to determine what his needs are to this degree of precision. It is not actually clear that the user has this level of understanding of how he would trade off usage against cost.

There is an alternative way to deal with variation in the degree of congestion. Instead of coding the user's desires into each packet, one could imagine a management protocol running in the background that reports to the edges of the network what the current level of congestion is, or whether a special or crisis circumstance exists. Based on information from that protocol, the service profile of the user could be changed. Both approaches may have advantages. An advantage of the first is the lack of need for a management protocol. An advantage of the second is that the management protocol can express a much wider range of policies and reallocation actions.

Another reason to have multiple levels of control is to achieve a smoother transition between the two states of a flow.
As discussed above, when controlling TCP, because of the specific congestion schemes used in TCP, it is helpful not to drop a number of packets from one flow at once, because that is likely to trigger a full TCP slow-start, rather than the preferable fast recovery action. Having more bits might enhance this discrimination. However, based on our simulations, if we are going to use more bits from the packet header for control, it might be a better option to move to an Explicit Congestion Notification design for the Internet, which seems to provide a better degree of control overall.

8. Deployment Issues

8.1. Incremental deployment plan

No scheme like this can be deployed at once in all parts of the Internet. It must be possible to install it incrementally, if it is to succeed at all.

The obvious path is to provide these capabilities first within a single ISP. This implies installing RIO routers within the ISP, and tagging the traffic at the access points to that ISP. This requires a profile meter at each access link into that ISP. The meter could maintain a large amount of user-specific information about desired usage patterns between specific sources and destinations (and this might represent a business opportunity), but more likely would provide only rough categories of traffic classification.

A user of this ISP could then install a profile meter on his end of the access link, which he controls and configures, to provide a finer-grained set of controls over which traffic is to be marked as "in" and "out". Eventually, meters might appear as part of host implementations, which would permit the construction of profiles that took into account the behavior of specific applications, and which would also control the use of network resources within the campus or corporate area.
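The simplest form of profile meter can be sketched as a token bucket that marks packets "in" while the flow remains within its contracted rate. This is an illustrative sketch only; the draft does not mandate a token-bucket profile, and the rate and depth parameters here are assumptions.

```python
import time

class ProfileMeter:
    """Token-bucket tagger: marks packets "in" while the flow is within
    its contracted profile, "out" otherwise. A token bucket is one
    plausible profile; the rate and depth values are illustrative."""
    def __init__(self, rate_bps, depth_bytes):
        self.rate = rate_bps / 8.0   # token fill rate, in bytes per second
        self.depth = depth_bytes     # bucket depth, in bytes
        self.tokens = depth_bytes    # start with a full bucket
        self.last = time.monotonic()

    def tag(self, pkt_len):
        """Return the tag ("in" or "out") for a packet of pkt_len bytes."""
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_len:
            self.tokens -= pkt_len
            return "in"   # conforms to the profile
        return "out"      # excess traffic, eligible for RIO dropping
```

Note that, unlike a policer, the meter forwards "out" packets rather than discarding them; the network carries them whenever capacity permits.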
At the boundary to the region of routers implementing RIO, all traffic must be checked, to make sure that no un-metered traffic sneaks into the network tagged as "in". So the implementation of this scheme requires a consistent engineering of the network configuration within an administrative region (such as an ISP) to make sure that all sources of traffic have been identified, and either metered or "turned out".

If some routers implement RIO, and some do not, but just implement simple RED, the user may fail to receive the committed service profile. But no other major failures will occur. That is, the worst that the user will see is what he sees today. One can achieve substantial incremental improvements by identifying points of actual congestion, and putting RIO routers there first.

8.2. What has to be standardized

In fact, very little of this scheme needs to be standardized in the normal pattern of IETF undertakings. What is required is to agree on the general approach, and set a few specific standards.

8.2.1. Semantics of router behavior

It is necessary to agree on the common semantics that all routers will display for "in" and "out" bits. Our proposal is that routers implement the RIO scheme, as described above. The parameters should be left for operational adjustment.

For the receiver-based scheme, the router has to tag packets rather than drop them. We omit the description of the tagging algorithm, only noting that it, too, must be agreed to if a receiver-based scheme is to be deployed.

8.2.2. Use of IP precedence field

Currently, the three bit precedence field in the IP header is not widely used. Bit x of this field will be used as the "in/out" bit. This bit will be known as the In Profile Indicator, or IPI. The meaning of the IPI is that a 1 value implies "in".
This has the effect that the normal default value of the field, 0, will map to the baseline behavior, which is out of profile service.

8.2.3. Use of IP TOS field

This document proposes to view Type of Service in a slightly different way than has been usual in the past. While previous RFCs have not been explicit (e.g. RFC 1349), the role of the TOS field has been thought of more as controlling routing than scheduling and dropping within the router. This document explicitly proposes to specify these features. The TOS field can be used for this purpose, but doing so will preclude its use in the same packet to select the services defined in RFC 1349 and RFC 1700: low delay, high throughput, low cost, high reliability and high security.

According to RFC 1349, the TOS field should be viewed as a 4 bit integer value, with certain values reserved for backwards compatibility. We propose that the six defined values of TOS be associated with the statistical service profiles ("expected capacity profiles") defined in this document. That is, the use of the IPI is legal with any of these values of TOS, and the difference among them is in routing options.

A new value of TOS (yyyy) shall be used to specify the assured service profile, which has a level of assurance for the service profile that is not statistical in nature. As part of the design of this type of service, the routing will have to be controlled to achieve this goal, so the value yyyy for the TOS will also imply some routing constraints for the ISPs. It is an engineering decision of the service provider how this sort of traffic is routed, so that it follows the routes along which the resources have been reserved.

8.2.4. Additional issues for the sender/receiver based scheme

The combined sender-receiver scheme is capable of expressing a much more complex set of value relationships than the sender-based scheme. However, it implies more complexity and more bits in the header.
It does not appear possible to encode all the necessary information for the combined scheme in an IPv4 header. This option is thus proposed as a consideration for IPv6, if there seems to be sufficient demand.

9. Security considerations

This scheme is concerned with resource allocation, and thus the major security concern is theft of resources. Resources can be stolen by injecting traffic that is marked as "in" but which has not passed through a legitimate profile meter into a RIO-controlled region of the network. To protect against this, it is necessary to define "regions of shared trust", and engineer and audit all the links that bring traffic into each of these regions to ensure that a profile meter has been installed on each such link. Such a region might correspond to a single ISP, the backbone component of a single ISP, a collection of co-operating ISPs and so on. In general, the presence of a profile meter is an indication of a possible boundary where trust is not shared, and the traffic has to be verified.

It is a matter for further research whether algorithms can be designed to detect (locally, at each router) a flow of packets that is not legitimate.

10. Acknowledgments

The simulations reported in this paper were performed by Wenjia Fang. Earlier simulations that proved the concepts of the profile meter and the receiver-based scheme were performed by Pedro Zayas. We acknowledge the valuable discussions with the members of the End-to-End research group.

Appendix A: Description of a receiver-based scheme

The tagging scheme described above implements a model in which the sender, by selecting one or another service profile, determines what service will govern each transfer. However, the sender-controlled model is not the only appropriate model for determining how Internet transmissions should be regulated.
For much of the traditional Internet, where information has been made available, often for free, to those users who care enough to retrieve it, it is the value that the receiver places on the transfer, not the sender, that would properly dictate the service allocated to the transfer. In this document, we do not debate the philosophical tradeoff between sender and receiver controlled schemes. Instead, we describe a mechanism that implements receiver control of service, which is similar in approach to, and meshes with, the sender controlled tagging scheme.

One technique that does not work is to have the receiver send some credentials to the sender, on the basis of which a flag is set in the packet. This runs the risk of great complexity, but more fundamentally does not deal with multicast, where one packet may go to several receivers, each of which attaches a different value to the transfer.

A critical design decision is whether the scheme must react to congestion instantly, or with one round trip delay. If it must react instantly, then each point of congestion must have stored state, installed by the receiver, that will allow that point to determine if the packet is "in" or "out" of profile. This runs the risk of formidable complexity. If, however, we are willing to have the reaction to congestion occur one round trip later, several quite tractable schemes can be proposed, which are similar to the sender controlled scheme in spirit.

A receiver controlled scheme can be built using a traffic meter at the receiver, similar to the traffic meter at the sender in the sender tagging scheme. The meter knows what the current usage profile of the receiver is, and thus can check to see whether a stream of received packets is inside the profile.
A (different) new flag in the packet, called Forward Congestion Notification, or FCN, is used to carry information about congestion to the receiver's traffic meter. A packet under this receiver controlled scheme starts out from the sender with the FCN bit off, and when the packet encounters congestion the bit is set on. As the packet reaches the destination, the receiver's traffic meter notes that the bit is on, and checks to see if the packet fits within the profile of the receiver. If it does, the service profile of the receiver is debited, and the bit is turned off in the packet. If the packet cannot fit within the profile of the user, the bit remains on.

When the receiver receives a packet with the FCN on, which means that the receiver's profile does not have sufficient capacity to cover all the packets that encountered congestion, the sender must be instructed to slow down. This can occur in a number of ways. First, for TCP, the receiver could reduce the window size. That is, the receiver as well as the sender could compute a dynamic congestion window. This is complex to design. Second, again for TCP, the ACK packet or a separate control message (such as an ICMP Source Quench) could carry back to the sender some explicit indication to slow down. Third, for TCP, if the traffic meter noted that the receiver seemed to have taken no action in response to the FCN bit, the meter can delete some returning ACKs or an incoming data packet, which will trigger a congestion slowdown in the sender.

The paper by Floyd [Floyd95] contains a detailed discussion of enhancing TCP to include explicit congestion notification, using either bits in the header or the ICMP Source Quench message with redefined semantics. The range of algorithms explored there for implementing explicit notification is directly applicable to this scheme.
In fact, the end node behavior (the source and destination TCP) for her scheme is exactly the same as in the scheme here. What is different is the method of notifying the end node of the congestion. In her scheme, random packets are selected to trigger congestion notification. In this scheme, during periods of congestion all packets are marked, but these marks are then removed by the receiver's traffic meter, unless the rate exceeds the installed service profile.

We have simulated the receiver-based scheme, using the ECN mechanism proposed by Floyd to notify the sending TCP to slow down. Because of the very well-behaved characteristics of the ECN scheme, we can regulate TCPs to different sending rates essentially flawlessly.

A key question in the successful implementation of the receiver scheme is defining what constitutes congestion in the router -- under what conditions the router should start setting the FCN bit. Ideally, the router should start setting the bit as soon as it detects the onset of queuing in the router. It is important to detect congestion and limit traffic as soon as possible, because it is very undesirable for the queue to build up to the point where packets must be discarded.

Key differences between sender and receiver control

There are a number of interesting asymmetries between the sender and the receiver versions of this tag and profile scheme, asymmetries that arise from the fact that the data packets flow from the sender to the receiver. In the sender scheme, the packet first passes through the meter, where it is tagged, and then through any points of congestion, while in the receiver payment scheme the packet first passes through any points of congestion, where it is tagged, and then through the receiver's meter.

The receiver scheme, since it only sets the FCN bit if congestion is actually detected, can convey to the end point dynamic information about current congestion levels.
The sender scheme, in contrast, must set the IPI and tag the packet as "in" or "out" without knowing if congestion is actually present. Thus, it would be harder, in the sender scheme, to construct a service that billed the user for actual usage during periods of congestion.

While the receiver scheme seems preferable in that it can naturally implement both static and dynamic payment schemes, the sender scheme has the advantage that since the packet carries in it the explicit assertion of whether it is in or out of profile, when it reaches a point of congestion, the treatment of the packet is evident. In the receiver scheme, the data packet itself carries no indication of whether it is in or out of profile, so all the point of congestion can do is set the FCN bit, attempt to forward the packet, and trust that the sender will correctly adjust its transmission rate. The receiver scheme is thus much more indirect in its ability to respond to congestion. Of course, the controller at the point of congestion may employ a scheme to discard a packet from the queue, as it does now. However, the receiver scheme gives no guidance as to which packet to delete.

Another difference between the two schemes is that in the sender scheme, the sending application can set the In Profile Indicator in different packets to control which packets are favored during congestion. In the receiver scheme, all packets sent to the receiver pass through and debit the traffic meter before the receiving host gets to see them. Thus, in order for the receiving host to distinguish those packets that should receive preferred service, it would be necessary for it to install some sort of packet filter in the traffic meter. This seems feasible but potentially complex. However, it is again a local matter between the traffic meter and the attached host.
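The receiver-side handling of the FCN bit described above can be sketched as follows. The packet representation and the byte-budget form of the profile are illustrative assumptions; the draft leaves the exact profile accounting to the meter.

```python
class ReceiverMeter:
    """Receiver-side traffic meter for the FCN scheme of Appendix A.
    The profile is modeled as a simple budget of bytes that may absorb
    congestion marks -- an illustrative simplification."""
    def __init__(self, profile_bytes):
        self.budget = profile_bytes

    def process(self, pkt):
        """pkt is a dict with 'fcn' (bool) and 'length' (bytes)."""
        if pkt["fcn"]:
            if self.budget >= pkt["length"]:
                # The marked packet fits the receiver's profile: debit
                # the profile and clear the mark, so the receiving host
                # sees no congestion indication.
                self.budget -= pkt["length"]
                pkt["fcn"] = False
            # Otherwise the mark stays set, and the receiver (or the
            # meter itself) must cause the sender to slow down.
        return pkt
```

Packets that arrive with the FCN bit off, meaning they encountered no congestion, pass through without debiting the profile at all.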
While this scheme works well to control TCP, what about a source that does not adjust when faced with lost packets, or otherwise just floods the congested router? In the receiver-based scheme, there is an increased need for some sort of notification message that can flow backwards through the network from the receiver's traffic meter towards the source of the traffic (or towards the congested routers along the path) so that offending traffic can be distinguished and discriminated against. This sort of mechanism was discussed above in the section on Filtering out Non-Responsive Flows.

Appendix B: Designing traffic meters to control TCP throughput

We have suggested that a useful goal for a traffic meter is to let a well-behaved TCP operate at a specific speed. This is more complex than a service that mimics a link of a specific speed, since a TCP may not be able to fully utilize a physical link because of its behavior dealing with congestion. In order to design a traffic meter that allows a TCP to go at a set speed, the designer must take into account the behavior of TCP. This appendix presents a quick review of the relevant TCP behavior, describes the preliminary design of a traffic meter that directly controls TCP bandwidth, and summarizes some simulation results. Further details of this work can be found in [CF97].

TCP attempts to adjust its performance by varying its window size. Within the limit imposed by the receive window, the sender increases its window size until a packet is discarded, then reduces its window size and begins again. This process controls the TCP's effective throughput rate.

There are several different operating regions for TCP. When a number of packets are lost, a TCP reduces its window size to 1, and then (roughly) doubles it each round trip until a threshold is reached.
(This threshold is often referred to by the variable name used to implement it in the Berkeley Unix code: ss-thresh.) Once the send window has exceeded ss-thresh, it increases more slowly -- one packet per round trip. When only one (or a small number) of packets are lost, the window size is reduced less drastically; it is cut in half, and ss-thresh is set to the new current window size.

It is this latter behavior that is the desired one in order to achieve a reasonable control over the sending rate of the TCP. When TCP is in this part of its operating range, its window size resembles a saw-tooth, swinging between two values differing by a factor of two. The effect of this saw-tooth window size is to slowly fill up the buffer at the point of congestion until a packet is discarded, then cut the window size by two, which allows the buffer to drain, and may actually cause a period of underutilizing the link. Some thought will suggest that the actual average throughput achieved by the TCP is a function of the buffer size in the router, as well as other parameters. It is difficult to predict.

To design a traffic meter that allows a TCP to achieve a given average rate, it is necessary for the meter to recognize the swings, and fit them into the profile. One approach would be to build a meter that looks at the very long-term average rate, and allows the TCP to send so long as that average is less than the target rate. However, this has the severe drawback that if the TCP undersends for some time because it has no data to send, it builds up a credit in the meter that allows it to exceed the average rate for an excessive time. This sort of overrun can interfere with other TCPs.

The alternative is to build a meter that measures the rate of the sending TCP, and looks for a peak rate (the high point of the saw-tooth). A simple approach is to build a meter that looks for short term sending rates above 1.33 times the target rate R.
Once that rate is detected, the meter starts tagging a few packets as "out". When one of these is discarded, the TCP cuts its window size by a factor of two, which will cause some sort of rate reduction, perhaps also by a factor of two. The TCP will thus swing between 1.33 R and .66 R, which averages out to R. One can build a meter that does this, but it is necessary to consider several factors.

The relationship between the window size of a TCP and its sending rate is complex. Once the buffer at the point of congestion starts to fill up, increasing the window size does not increase the sending rate. Each packet added to the window adds one packet in circulation, but adds to the round trip delay by the transmission time of one packet because of the increased waiting time in the buffer. The combination of these two effects is to leave the achieved throughput unchanged. If, on the other hand, the buffer is largely empty, then if the window is cut by two, the rate will be cut by two. It is important that the RIO dropper operate in this region, both so that it has enough empty buffer to handle transient congestion, and to improve its ability to control the TCP throughput. With RIO, the average buffer utilization by "out" packets is small, although the instantaneous buffer size can fluctuate due to traffic bursts.

As soon as the TCP exceeds its short-term target rate of 1.33 R, some number of "out" packets begin to appear, and if they generate a queue in the router, a packet is dropped probabilistically, which causes the TCP in question to cut its rate by two. (Note that in a properly provisioned network, there is enough capacity to carry all the offered "in" packets, and thus "in" packets do not contribute to the RIO buffer load.
In a sufficiently underprovisioned network, "in" packet dropping will
be triggered, and the TCP congestion control mechanism will limit the
packet load as always. Loss of "in" packets indicates to the customer
that his provider's provisioning is inadequate to support the
customer's profile.)

An important issue in the design of this meter is finding the time
period over which to average in order to detect the 1.33 R peak.
Average over too long a time, and the average takes into account too
much of the saw-tooth, and underestimates the peak rate. Average over
too short a period, and the meter begins to detect the short-term
bursty nature of the traffic, and detects the 1.33 R peak too soon.
Since the round-trip times of different TCPs can differ by at least
one order of magnitude and perhaps two, designing a meter (unless it
is integrated into the host implementation and knows the round trip)
is difficult. However, reasonable parameters can be set which work
over a range of round-trip delays, say 10 to 100 ms.

One objection to this approach, in which the meter looks for a short-
term peak at 1.33 R, is that a creative user could abuse the design
by carefully adjusting the window manually so that it achieved a
steady-state rate somewhat above R (the long-term target average) but
below 1.33 R. To detect this, the meter has two rate measurements,
one of which looks with a short averaging time for a peak of 1.33 R,
and a second one, with a substantially longer period (longer than a
saw-tooth), for a flow that exceeds R. If the flow falls short of R,
no action is taken, because this might simply be a lack of data to
send. But if the TCP exceeds the rate R over a long time, the
parameters of the short-term averaging meter are adjusted.

Designing this meter is an ambitious objective, because it represents
a difficult control problem.
First, it attempts to set a rate for a sending TCP, rather than just
emulating a physical link. Second, it is operating at a low level of
traffic aggregation (we have simulated situations with as few as two
flows). Third, the meter operates without knowledge of the round-
trips of the individual flows. Integrating the meter into the host,
so that it can know the measured RTT (which TCP computes anyway),
greatly simplifies the design. However, the non-integrated design is
more appropriate for an incremental deployment strategy using
unmodified hosts.

Avoiding slow-start

As noted above, it is desirable to keep TCP operating in the region
where, in response to a lost packet, it cuts its window size in half
and sets ss-thresh equal to this new window size. However, if several
packets are lost at once, the TCP will execute a different algorithm,
called "slow-start", in which it goes idle for some period of time
and then sets the window size to 1. It is preferable to avoid this
behavior.

One way to avoid this is to avoid dropping several packets in close
proximity. There are two halves to achieving this goal. The first is
that the dropper should avoid dropping a block of packets if it has
not recently dropped any. That is, it should undergo a gradual
transition between the state where it is not dropping any packets and
the state where it starts to drop. RED, and by extension RIO, has
this behavior. Up to some average queue length, RED drops no packets.
As the average queue length starts to exceed this threshold, the
probability of loss starts to build, but it is a linear function of
how far the average exceeds this minimum. So at first, the rate of
drops is very low.

However, if the dropper is overloaded with "out" packets, it will be
forced to drop every one that arrives. To deal with this situation,
the meter, when it starts tagging packets as "out", also should at
first tag the packets infrequently.
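One way to realize this gradual onset of tagging in the meter is to
make the tagging decision probabilistic, with a probability that
ramps up linearly as the measured rate climbs past the 1.33 R
threshold. The sketch below is illustrative only; in particular the
0.5 R ramp width is an assumed parameter, not a value from this
draft.

```python
import random

# Sketch of a meter that eases into tagging: rather than switching
# abruptly from tagging nothing to tagging every packet, the tagging
# probability ramps linearly with how far the measured rate sits
# above the 1.33 R threshold.  The ramp width is an assumed choice.

def tag_probability(measured_rate, target_rate, ramp=0.5):
    """0 below 1.33 R, rising linearly to 1 at (1.33 + ramp) * R."""
    threshold = 1.33 * target_rate
    if measured_rate <= threshold:
        return 0.0
    excess = (measured_rate - threshold) / (ramp * target_rate)
    return min(1.0, excess)

def tag(measured_rate, target_rate):
    """Tag a packet 'out' with the ramped probability, else 'in'."""
    p = tag_probability(measured_rate, target_rate)
    return "out" if random.random() < p else "in"
```

With these assumed numbers, a flow just over 1.33 R sees only an
occasional "out" tag, while a flow far above profile has nearly all
of its excess packets tagged.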
It should not suddenly enter a mode where it tags a block of packets
as "out". However, if the TCP continues to speed up, as it will if
the path is uncongested and it can sustain the speed, more and more
of the packets will be marked as "out", so a gradual transition to
tagging in the meter is not sufficient to avoid all cases of clumped
dropping. Both halves of the scheme, the meter and the dropper, must
enter the control phase gradually.

In essence, this strategy introduces low-pass filters into both the
traffic metering and congestion detection data. These filters are
needed to address the two separate cases of the system dropping "out"
packets because a TCP is exceeding its profile in an otherwise loaded
network, and the system dropping "out" packets because of new
congestion in a network with TCPs previously operating above profile.

Brief simulation results

We have performed some simulations of this traffic meter and the RIO
dropper. In this note we show one test case from our simulations. The
first column is the target rate, the second column is the actual
rate, and the third column is the round-trip delay.

      Target rate    Actual rate    TCP RTT
       .1 mb/s        .158 mb/s      20 ms
      1              1.032           20
       .1             .193           40
      1              1.02            40
       .1             .165           60
      1              1.01            60
       .1             .15            80
      1               .95            80
       .1             .15           100
      1               .93           100

In this simulation, the actual link capacity was exactly the sum of
the target rates, so there was no "headroom" for overshoot. As the
numbers indicate, we can control the rates of the large flows to
within 10% over a range of round trips from 20 to 100 ms, with the
longer delay flows having the greater difficulty achieving full
speed. The smaller flows, for a number of reasons, are more
opportunistic in using any unclaimed capacity, and exceed their
target ranges. By adjusting the RIO parameters and the parameters in
the meter, different detailed behavior can be produced.
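For concreteness, the RIO dropper referred to throughout can be
sketched as two RED-style drop curves: a lenient one for "in"
packets, driven by the average queue of "in" packets alone, and an
aggressive one for "out" packets, driven by the average total queue.
The thresholds, probabilities, and averaging weight below are
illustrative assumptions, not the parameters used in our simulations.

```python
import random

def red_prob(avg_q, min_th, max_th, max_p):
    """Classic RED curve: no drops below min_th, then a linear ramp
    up to max_p at max_th, and certain drop beyond that."""
    if avg_q < min_th:
        return 0.0
    if avg_q >= max_th:
        return 1.0
    return max_p * (avg_q - min_th) / (max_th - min_th)

class RIODropper:
    def __init__(self):
        self.avg_in = 0.0      # EWMA of queued "in" packets
        self.avg_total = 0.0   # EWMA of all queued packets
        self.wq = 0.02         # EWMA weight (assumed)

    def arrive(self, tag, in_q, total_q):
        """Return True if the arriving packet should be dropped."""
        self.avg_in += self.wq * (in_q - self.avg_in)
        self.avg_total += self.wq * (total_q - self.avg_total)
        if tag == "in":   # lenient curve, sees only "in" traffic
            p = red_prob(self.avg_in, min_th=40, max_th=70, max_p=0.02)
        else:             # aggressive curve, sees the whole queue
            p = red_prob(self.avg_total, min_th=10, max_th=30, max_p=0.5)
        return random.random() < p
```

Because the "out" curve watches the total average queue with much
lower thresholds, "out" packets are shed first as the queue builds,
while "in" packets survive unless the average "in" load itself grows
large, matching the provisioning discussion above.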
We are using this research to fine-tune our understanding of the RIO
parameters, as well as the design of advanced meters.

New TCP designs help greatly

Improvements to the dynamic performance of TCP have been proposed for
reasons unrelated to this scheme, but rather to address more general
goals for improved operation. These include SACK TCP, which supports
selective acknowledgment when specific packets are lost, and other
TCP tuning changes that deal better with multiple losses. We have
simulated our taggers and droppers with these newer TCPs, and the
effect is to make the approach work much better. The reason for this
is that much of the care in the detailed design is required to avoid
triggering slow-start rather than fast recovery, which would reduce
our ability to control the TCP's throughput. The newer TCP designs,
which achieve that same goal in general, make our design much more
robust.

Another way to improve the operation of this scheme is to use an
Explicit Congestion Notification scheme, as has been proposed by
Sally Floyd. In this variation of RIO, RIO-ECN, the algorithm does
not drop "out" packets at first, but just sends an ECN indication to
the destination, where it is returned to the source. The design of
Floyd's ECN takes into account the round-trip time, and avoids
inadvertent triggering of a slow-start. RIO-ECN, together with a
suitable profile meter at the destination, allows us to control TCP
sending rates almost without flaw in our simulations.

Appendix C: Economic issues

This is a technical note. However, any discussion of providing
different levels of service to different users of a commercial
network cannot be complete without acknowledging the presence of
economic issues. The scheme presented here has been conceived in the
context of the public commercial Internet, where services are offered
for money.
It also works in the context of private, corporate or military
networks, where other, more administrative allocations of high-
quality service may be used. But it must work in the context of
commercial service. It is therefore crucial that it take into
consideration the varying business models of Internet service
customers and providers, and that it be consistent with some relevant
economic principles. We discuss these matters briefly below. Note
that we are not suggesting that any specific business model, pricing
strategy, or service offering be universally adopted. In fact, we
believe that a strength of this framework is that it cleanly
separates technical mechanism from economic decisions at different
points within the network.

Congestion pricing

The first economic principle is that there is only a marginal cost to
carrying a packet when the network is congested. When the network is
congested, the cost of carrying a packet from user A is the increased
delay seen by user B. The traffic of user B, of course, caused delay
for A. But if A somehow were given higher priority, so that B saw
most of the delay, A would be receiving better service, and B would
be paying a higher price, in terms of increased delay and
(presumably) dissatisfaction. According to economic principles, A
should receive better service only if he is willing to pay enough to
exceed the "cost" to B of his increased delay. This can be achieved
in the marketplace by suitable setting of prices. In principle, one
can determine the pricing for access dynamically by allowing A and B
to bid for service, although this has many practical problems. For an
example of such a proposal, see [MMV95].

When the network is underloaded, however, the packets from A and from
B do not interfere with each other. The marginal or incremental cost
to the service provider of carrying the packets is zero.
In a circumstance where prices follow intrinsic costs, the usage-based component of the charge to the user should be zero. This approach is called "congestion pricing". The scheme described here is consistent with the framework of congestion pricing. What the user subscribes to, in this scheme, is an expectation of what service he will receive during times of congestion, when the congestion price is non-zero. When the net is underloaded, this scheme permits the user to go faster, since both "in" and "out" packets are forwarded without discrimination in that case. Pricing need not (and often does not) follow abstract economic principles. An ISP might choose to prevent users from going faster in times of light load, to assess some price for doing so, or whatever. But the scheme is capable of implementing a price/service structure that matches an economically rational model, and we would argue that any scheme should have that characteristic. This line of reasoning has some practical implications for the design of service profiles. If a provider sells a profile that meters usage over some very long period (so many "in" packets per month, for example) then there will be a powerful incentive for the user not to expend these packets unless congestion is actually encountered. This consequence imposes an extra burden on the user (it is not trivial to detect congestion) and will yield no benefit to either the user or the provider. If there is no cost to sending traffic when the network is underloaded, then there is no cost to having some of those packets carry "in" tags. In fact, there is a secondary benefit, in that it allows providers to track demand for such traffic during all periods, not just during overload. But profiles could be defined that would motivate the user to conserve "in" tags for times of congestion, and these seem misguided. 
Getting incentives right

The second economic principle is that pricing can be used as an
incentive to shape user behavior toward goals that benefit the
overall system, as well as the user. The "incentive compatibility"
problem is to structure the service and pricing in such a way that
beneficial aggregate behavior results.

Service profiles represent an obvious example of these issues. If a
profile can be shaped that closely matches the user's intrinsic need,
then he will purchase that profile and use it for those needs. But if
the only profile he can get provides him with unused capacity, he
will be tempted to consume that capacity in some constructive way,
since he has been required to purchase it to get what he wants. He
may be tempted to resell this capacity, or use it to carry lower-
value traffic, and so on. These uses represent distortions of the
system.

In general, resale of capacity, or arbitrage, results when pricing is
distorted and does not follow cost. It is difficult to design a
technical mechanism that can prevent arbitrage, because the mechanism
does not control pricing, but the mechanism should not of necessity
create situations where arbitrage is a consequence. Generally
speaking, this means that price should follow cost, and that profiles
should be flexible enough to match the intrinsic needs of a range of
users. This scheme attempts to capture this objective by allowing the
traffic meters to implement a range of service profiles, rather than
standardizing on a fixed set.

Inter-provider payments

One of the places where a traffic meter can be installed is at the
boundary between two ISPs. In this circumstance, the purpose is to
meter how much traffic of value, i.e., "in" packets, is flowing in
each direction. This sort of information can provide the basis for
differential compensation between the two providers.
In a pure sender-based scheme, where the revenues are being collected
from the sender, the sender of a packet marked as "in" should
presumably pay the first ISP, who should in turn pay the second ISP,
and so on until the packet reaches its final destination. In the
middle of the network, the ISPs would presumably negotiate some long-
term contract to carry each other's "in" packets, but if asymmetric
flows result, or there is a higher cost to carry the packets onward
in one or the other direction, this could constitute a valid basis
for differential payment.

As is discussed in [Clark97], the most general model requires both
sender- and receiver-based payments, so that payments can be
extracted from all participants in a transfer in proportion to the
value that each brings to the transfer. In this case, the direction
of packet flow does not determine the direction of value, and thus
the direction of compensating payment. See the referenced paper for a
full development of the details of a mixed sender-receiver scheme.

It is interesting to note that the sender- and receiver-based schemes
are to some extent associated with different business models. The
basic sender-based scheme considered in much of this note makes sense
in many business contexts. For example, a user with multiple sites,
who wants to connect those sites with known service, can equally well
express all of these requirements in terms of behavior at the sender,
since the senders are all known in advance. In contrast to this
"closed" system, consider the "open" system of a node attached to the
public Internet, whose operator wants to purchase some known service
profile for interaction with other sites on the Internet. If the
primary traffic to that site is incoming (for example, browsing the
Web), then it is the receiver of the traffic, not the sender, who
associates the value with the transfer.
In this case the receiver-based scheme, or a zone scheme, may best
meet the needs of the concerned parties.

References

[Clark97] D. Clark, "Combining Sender and Receiver Payment Schemes in
the Internet", Proceedings of the Telecommunications Policy Research
Conference, Solomon, MD, 1996.

[CF97] D. Clark and W. Fang, "Explicit Allocation of Best Effort
Packet Delivery Service", (soon) to be available as
http://ana.lcs.mit.edu/papers/exp-alloc.ps

[Floyd93] S. Floyd and V. Jacobson, "Random Early Detection Gateways
for Congestion Avoidance", IEEE/ACM Transactions on Networking,
August 1993.

[Floyd95] S. Floyd, "TCP and Explicit Congestion Notification",
Computer Communication Review, v. 24, n. 5, October 1995.

[FF97] S. Floyd and K. Fall, "Router Mechanisms to Support End-to-End
Congestion Control", available at
http://www-nrg.ee.lbl.gov/nrg-papers.html

[Kalevi97] K. Kilkki, "Simple Integrated Media Access", Internet
Draft, June 1997.

[Kelly97] F. Kelly, "Charging and Accounting for Bursty Connections",
in "Internet Economics", L. McKnight and J. Bailey, eds., MIT Press,
1997.

[MMV95] J. MacKie-Mason and H. Varian, "Pricing the Internet", in
"Public Access to the Internet", B. Kahin and J. Keller, eds.,
Prentice Hall, 1995. Available as
http://www.spp.umich.edu/ipps/papers/info-nets/Pricing_Internet/Pricing_the_Internet.ps.Z

Authors' Addresses:

David D. Clark
MIT Laboratory for Computer Science
545 Technology Sq.
Cambridge, MA 02139
ddc@lcs.mit.edu
617-253-6003
617-253-2673 (FAX)

John Wroclawski
MIT Laboratory for Computer Science
545 Technology Sq.
Cambridge, MA 02139
jtw@lcs.mit.edu
617-253-7885
617-253-2673 (FAX)