Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Intended status: Informational                           August 07, 2008
Expires: February 8, 2009

                Byte and Packet Congestion Notification
                 draft-ietf-tsvwg-byte-pkt-congest-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.
   Note that other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on February 8, 2009.

Abstract

   This memo concerns dropping or marking packets using active queue
   management (AQM) such as random early detection (RED) or pre-
   congestion notification (PCN).  The primary conclusion is that
   packet size should be taken into account when transports read
   congestion indications, not when network equipment writes them.
   Reducing drop of small packets has some tempting advantages: i) it
   drops fewer control packets, which tend to be small and ii) it makes
   TCP's bit-rate less dependent on packet size.  However, there are
   ways of addressing these issues at the transport layer, rather than
   reverse engineering network forwarding to fix specific transport
   problems.  Network layer algorithms like the byte-mode packet drop
   variant of RED should not be used to drop fewer small packets,
   because that creates a perverse incentive for transports to use
   tiny segments, consequently also opening up a DoS vulnerability.

Table of Contents

   1.  Introduction
   2.  Motivating Arguments
     2.1.  Scaling Congestion Control with Packet Size
     2.2.  Avoiding Perverse Incentives to (ab)use Smaller Packets
     2.3.  Small != Control
   3.  Working Definition of Congestion Notification
   4.  Congestion Measurement
     4.1.  Congestion Measurement by Queue Length
       4.1.1.  Fixed Size Packet Buffers
     4.2.  Congestion Measurement without a Queue
   5.  Idealised Wire Protocol Coding
   6.  The State of the Art
     6.1.  Congestion Measurement: Status
     6.2.  Congestion Coding: Status
       6.2.1.  Network Bias when Encoding
       6.2.2.  Transport Bias when Decoding
       6.2.3.  Making Transports Robust against Control Packet Losses
       6.2.4.  Congestion Coding: Summary of Status
   7.  Outstanding Issues and Next Steps
     7.1.  Bit-congestible World
     7.2.  Bit- & Packet-congestible World
   8.  Security Considerations
   9.  Conclusions
   10. Acknowledgements
   11. Comments Solicited
   12. References
     12.1.  Normative References
     12.2.  Informative References
   Editorial Comments
   Appendix A.  Example Scenarios
     A.1.  Notation
     A.2.  Bit-congestible resource, equal bit rates (Ai)
     A.3.  Bit-congestible resource, equal packet rates (Bi)
     A.4.  Pkt-congestible resource, equal bit rates (Aii)
     A.5.  Pkt-congestible resource, equal packet rates (Bii)
   Appendix B.  Congestion Notification Definition: Further
                Justification
   Appendix C.  Byte-mode Drop Complicates Policing Congestion
                Response
   Author's Address
   Intellectual Property and Copyright Statements

Relationship to existing RFCs

   To be removed by the RFC Editor on publication (with appropriate
   changes to the 'Updates:' header and the RFC Index as appropriate).

   This memo intends to update RFC 2309, which stated an interim view
   and called for further research on this topic.

Changes from Previous Versions

   To be removed by the RFC Editor on publication.

   Full incremental diffs between each version are available at <> or
   <> (courtesy of the rfcdiff tool):

   From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00 (this
   version):

      Added note on relationship to existing RFCs.

      Posed the question of whether packet-congestion could become
      common and deferred it to the IRTF ICCRG.  Added ref to the
      dual-resource queue (DRQ) proposal.

      Changed PCN references from the PCN charter & architecture to
      the PCN marking behaviour draft most likely to imminently become
      the standards track WG item.

   From -01 to -02:

      Abstract reorganised to align with the clearer separation of
      issues in the memo.

      Introduction reorganised, with motivating arguments moved to a
      new Section 2.

      Clarified that avoiding lock-out of large packets is not the
      main or only motivation for RED.

      Mentioned the choice of drop or marking explicitly throughout,
      rather than trying to coin a word to mean either.
      Generalised the discussion throughout to any packet forwarding
      function on any network equipment, not just routers.

      Clarified the last point about why this is a good time to sort
      out this issue: because it will be hard, if not impossible, to
      design new transports unless we decide whether the network or
      the transport is allowing for packet size.

      Added a statement explaining that the horizon of the memo is
      long term, but with short term expediency in mind.

      Added material on scaling congestion control with packet size
      (Section 2.1).

      Separated out the issue of normalising TCP's bit rate from the
      issue of preference to control packets (Section 2.3).

      Divided up the Congestion Measurement section for clarity,
      including new material on fixed size packet buffers and buffer
      carving (Section 4.1.1 & Section 6.2.1) and on congestion
      measurement in wireless link technologies without queues
      (Section 4.2).

      Added a section on 'Making Transports Robust against Control
      Packet Losses' (Section 6.2.3) with existing & new material
      included.

      Added tabulated results of the vendor survey on the byte-mode
      drop variant of RED (Table 2).

   From -00 to -01:

      Clarified applicability to drop as well as ECN.

      Highlighted the DoS vulnerability.

      Emphasised that drop-tail suffers from similar problems to
      byte-mode drop, so only byte-mode drop should be turned off, not
      RED itself.

      Clarified that the original apparent motivations for
      recommending byte-mode drop included protecting SYNs and pure
      ACKs more than equalising the bit rates of TCPs with different
      segment sizes.  Removed some conjectured motivations.

      Added support for updates to TCP in progress (ackcc & ecn-syn-
      ack).

      Updated survey results with newly arrived data.

      Pulled all recommendations together into the conclusions.

      Moved some detailed points into two additional appendices and a
      note.

      Considerable clarifications throughout.
      Updated references.

Requirements notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in [RFC2119].

1.  Introduction

   When notifying congestion, the problem of how (and whether) to take
   packet sizes into account has exercised the minds of researchers
   and practitioners for as long as active queue management (AQM) has
   been discussed.  Indeed, one reason AQM was originally introduced
   was to reduce the lock-out effects that small packets can have on
   large packets in drop-tail queues.  This memo aims to state the
   principles we should be using and to come to conclusions on what
   these principles will mean for future protocol design, taking into
   account the deployments we have already.

   Note that the byte vs. packet dilemma concerns congestion
   notification irrespective of whether it is signalled implicitly by
   drop or using explicit congestion notification (ECN [RFC3168] or
   PCN [I-D.eardley-pcn-marking-behaviour]).  Throughout this
   document, unless clear from the context, the term marking will be
   used to mean notifying congestion explicitly, while congestion
   notification will be used to mean notifying congestion either
   implicitly by drop or explicitly by marking.

   If the load on a resource depends on the rate at which packets
   arrive, it is called packet-congestible.  If the load depends on
   the rate at which bits arrive, it is called bit-congestible.

   Examples of packet-congestible resources are route look-up engines
   and firewalls, because their load depends on how many packet
   headers they have to process.  Examples of bit-congestible
   resources are transmission links and most buffer memory, because
   the load depends on how many bits they have to transmit or store.
   Some machine architectures use fixed size packet buffers, so buffer
   memory in these cases is packet-congestible (see Section 4.1.1).

   Note that information is generally processed or transmitted with a
   minimum granularity greater than a bit (e.g. octets).  The
   appropriate granularity for the resource in question SHOULD be
   used, but for the sake of brevity we will talk in terms of bytes in
   this memo.

   Resources may be congestible at higher levels of granularity than
   packets; for instance, stateful firewalls are flow-congestible and
   call-servers are session-congestible.  This memo focuses on
   congestion of connectionless resources, but the same principles may
   be applied for congestion notification protocols controlling per-
   flow and per-session processing or state.

   The byte vs. packet dilemma arises at three stages in the
   congestion notification process:

   Measuring congestion:  When the congested resource decides locally
      how to measure how congested it is.  (Should the queue be
      measured in bytes or packets?);

   Coding congestion notification into the wire protocol:  When the
      congested resource decides how to notify the level of
      congestion.  (Should the level of notification depend on the
      byte-size of each particular packet carrying the notification?);

   Decoding congestion notification from the wire protocol:  When the
      transport interprets the notification.  (Should the byte-size of
      a missing or marked packet be taken into account?).

   In RED, whether to use packets or bytes when measuring queues is
   called packet-mode or byte-mode queue measurement.  This choice is
   now fairly well understood but is included in Section 4 to document
   it in the RFC series.

   The controversy is mainly around the other two stages: whether to
   allow for packet size when the network codes or when the transport
   decodes congestion notification.
   In RED, the variant that reduces drop probability for packets based
   on their size in bytes is called byte-mode drop, while the variant
   that doesn't is called packet-mode drop.  Whether queues are
   measured in bytes or packets is an orthogonal choice, termed byte-
   mode queue measurement or packet-mode queue measurement.

   Currently, the RFC series is silent on this matter other than a
   paper trail of advice referenced from [RFC2309], which
   conditionally recommends byte-mode (packet-size dependent) drop
   [pktByteEmail].  However, none of the implementers who responded to
   our survey have followed this advice.  The primary purpose of this
   memo is to build a definitive consensus against deliberate
   preferential treatment for small packets in AQM algorithms and to
   record this advice within the RFC series.

   Now is a good time to discuss whether fairness between different
   sized packets would best be implemented in the network layer or at
   the transport, for a number of reasons:

   1.  The packet vs. byte issue requires speedy resolution because
       the IETF pre-congestion notification (PCN) working group is
       about to standardise the external behaviour of a PCN congestion
       notification (AQM) algorithm
       [I-D.eardley-pcn-marking-behaviour];

   2.  [RFC2309] says RED may either take account of packet size or
       not when dropping, but gives no recommendation between the two,
       referring instead to advice on the performance implications in
       an email [pktByteEmail], which recommends byte-mode drop.
       Further, just before RFC 2309 was issued, an addendum was added
       to the archived email that revisited the issue of packet vs.
       byte-mode drop in its last paragraph, making the recommendation
       less clear-cut;

   3.  Without the present memo, the only advice in the RFC series on
       packet size bias in AQM algorithms would be a reference to an
       archived email in [RFC2309] (including an addendum at the end
       of the email to correct the original);

   4.  The IRTF Internet Congestion Control Research Group (ICCRG)
       recently took on the challenge of building consensus on what
       common congestion control support should be required from
       network forwarding functions in future
       [I-D.irtf-iccrg-welzl-congestion-control-open-research].  The
       wider Internet community needs to discuss whether the
       complexity of adjusting for packet size should be in the
       network or in transports;

   5.  Given there are many good reasons why larger path max
       transmission units (PMTUs) would help solve a number of scaling
       issues, we don't want to create any bias against large packets
       that is greater than their true cost;

   6.  The IETF has started to consider the question of fairness
       between flows that use different packet sizes (e.g. in the
       small-packet variant of TCP-friendly rate control, TFRC-SP
       [RFC4828]).  Given transports with different packet sizes, if
       we don't decide whether the network or the transport should
       allow for packet size, it will be hard if not impossible to
       design any transport protocol so that its bit-rate relative to
       other transports meets design guidelines [RFC5033].  (Note,
       however, that if the concern were fairness between users rather
       than between flows [Rate_fair_Dis], relative rates between
       flows would have to come under run-time control rather than
       being embedded in protocol designs.)

   This memo is initially concerned with how we should correctly scale
   congestion control functions with packet size for the long term.
   But it also recognises that expediency may be necessary to deal
   with existing widely deployed protocols that don't live up to the
   long term goal.
   It turns out that the 'correct' variant of RED to deploy seems to
   be the one everyone has deployed, and no-one who responded to our
   survey has implemented the other variant.  However, at the
   transport layer, TCP congestion control is a widely deployed
   protocol that we argue doesn't scale correctly with packet size.
   To date this hasn't been a significant problem because most TCPs
   have been used with similar packet sizes.  But, as we design new
   congestion controls, we should build in scaling with packet size
   rather than assuming we should follow TCP's example.

   Motivating arguments for our advice are given next in Section 2.
   Then the body of the memo starts from first principles, defining
   congestion notification in Section 3, then determining the correct
   way to measure congestion (Section 4) and to design an idealised
   congestion notification protocol (Section 5).  It then surveys the
   advice given previously in the RFC series, the research literature
   and the deployed legacy (Section 6) before listing outstanding
   issues (Section 7) that will need resolution both to achieve the
   ideal protocol and to handle legacy.  After discussing security
   considerations (Section 8), strong recommendations for the way
   forward are given in the conclusions (Section 9).

2.  Motivating Arguments

2.1.  Scaling Congestion Control with Packet Size

   There are two ways of interpreting a dropped or marked packet.  It
   can either be considered as a single loss event or as loss/marking
   of the bytes in the packet.  Here we try to design a test to see
   which approach scales with packet size.

   Imagine a bit-congestible link shared by many flows, so that each
   busy period tends to cause packets to be lost from different flows.

   The test compares two identical scenarios with the same
   applications, the same numbers of sources and the same load.
   But the sources break the load into large packets in one scenario
   and small packets in the other.  Of course, because the load is the
   same, there will be proportionately more packets in the small
   packet case.

   The test of whether a congestion control scales with packet size is
   that it should respond in the same way to the same congestion
   excursion, irrespective of the size of the packets that the bytes
   causing congestion happen to be broken down into.

   A bit-congestible queue suffering a congestion excursion has to
   drop or mark the same excess bytes whether they are in a few large
   packets or many small packets.  So for the same congestion
   excursion, the same number of bytes has to be shed to get the load
   back to its operating point.  But, of course, for smaller packets
   more packets will have to be discarded to shed the same bytes.

   If all the transports interpret each drop/mark as a single loss
   event irrespective of the size of the packet dropped, they will
   respond more to the same congestion excursion, failing our test.
   On the other hand, if they respond proportionately less when
   smaller packets are dropped/marked, overall they will be able to
   respond the same to the same congestion excursion.

   Therefore, for a congestion control to scale with packet size it
   should respond to dropped or marked bytes (as TFRC-SP [RFC4828]
   effectively does), not just to dropped or marked packets
   irrespective of packet size (as TCP does).

   The email [pktByteEmail] referred to by RFC 2309 says the question
   of whether a packet's own size should affect its drop probability
   "depends on the dominant end-to-end congestion control mechanisms".
   But we argue the network layer should not be optimised for whatever
   transport is predominant.
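   The scaling test above can be illustrated with a short numerical
   sketch.  All numbers here are invented for illustration, and
   'response units' are an arbitrary measure of how strongly a
   transport backs off:

```python
# Hypothetical illustration of the memo's scaling test: a bit-congestible
# queue must shed the same number of excess bytes in a congestion
# excursion, so smaller packets mean proportionately more discards.

EXCESS_BYTES = 15000          # bytes the queue must shed, either way

def drops_needed(pkt_size):
    """Packets discarded to shed the same excess bytes."""
    return EXCESS_BYTES // pkt_size

def response_per_event(pkt_size):
    """Transport decodes each drop as one loss event (TCP-like)."""
    return drops_needed(pkt_size) * 1.0

def response_per_byte(pkt_size):
    """Transport decodes dropped bytes (TFRC-SP-like), normalised to
    a 1500 B reference packet."""
    return drops_needed(pkt_size) * (pkt_size / 1500.0)

for size in (1500, 60):
    print(size, response_per_event(size), response_per_byte(size))
# Per-event decoding responds 25x more strongly in the small-packet
# scenario; per-byte decoding responds identically, passing the test.
```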
   TCP congestion control ensures that flows competing for the same
   resource each maintain the same number of segments in flight,
   irrespective of segment size.  So under similar conditions, flows
   with different segment sizes will get different bit rates.  But
   even though reducing the drop probability of small packets helps
   ensure TCPs with different packet sizes will achieve similar bit
   rates, we argue this should be achieved in TCP itself, not in the
   network.

   Effectively, favouring small packets is reverse engineering of the
   network layer around TCP, contrary to the excellent advice in
   [RFC3426], which asks designers to question "Why are you proposing
   a solution at this layer of the protocol stack, rather than at
   another layer?"

2.2.  Avoiding Perverse Incentives to (ab)use Smaller Packets

   Increasingly, it is being recognised that a protocol design must
   take care not to cause unintended consequences by giving the
   parties in the protocol exchange perverse incentives
   [Evol_cc][RFC3426].  Again, imagine a scenario where the same bit
   rate of packets will contribute the same to congestion of a link
   irrespective of whether it is sent as fewer larger packets or more
   smaller packets.  A protocol design that caused larger packets to
   be more likely to be dropped than smaller ones would be dangerous
   in this case:

   Malicious transports:  A queue that gives an advantage to small
      packets can be used to amplify the force of a flooding attack.
      By sending a flood of small packets, the attacker can get the
      queue to discard more traffic in large packets, allowing more
      attack traffic to get through to cause further damage.  Such a
      queue allows attack traffic to have a disproportionately large
      effect on regular traffic without the attacker having to do much
      work.  The byte-mode drop variant of RED amplifies small packet
      attacks.
      Drop-tail queues amplify small packet attacks even more than RED
      byte-mode drop (see the Security Considerations in Section 8).
      Wherever possible, neither should be used.

   Normal transports:  Even if a transport is not malicious, if it
      finds small packets go faster, it will tend to act in its own
      interest and use them.  Queues that give an advantage to small
      packets create an evolutionary pressure for transports to send
      at the same bit-rate but break their data stream down into tiny
      segments to reduce their drop rate.  Encouraging a high volume
      of tiny packets might in turn unnecessarily overload a
      completely unrelated part of the system, perhaps one more
      limited by header processing than bandwidth.

   Imagine two flows arrive at a bit-congestible transmission link,
   each with the same bit rate, say 1 Mbps, but one consists of 1500 B
   packets and the other of 60 B packets, which are 25x smaller.
   Consider a scenario where gentle RED [gentle_RED] is used, along
   with the variant of RED we advise against, i.e. where the RED
   algorithm is configured to adjust the drop probability of packets
   in proportion to each packet's size (byte-mode packet drop).  In
   this case, if RED drops 25% of the larger packets, it will aim to
   drop 1% of the smaller packets (but in practice it may drop more as
   congestion increases [RFC4828](S.B.4)[Note_Variation]).  Even
   though both flows arrive with the same bit rate, the bit rate the
   RED queue aims to pass to the line will be 750 kbps for the flow of
   larger packets but 990 kbps for the smaller packets (but because of
   rate variation it will be less than this target).  It can be seen
   that this behaviour reopens the same denial of service
   vulnerability that drop-tail queues offer to floods of small
   packets, though not necessarily as strongly (see Section 8).

2.3.  Small != Control

   It is tempting to drop small packets with lower probability to
   improve performance, because many control packets are small (TCP
   SYNs & ACKs, DNS queries & responses, SIP messages, HTTP GETs,
   etc.) and dropping fewer control packets considerably improves
   performance.  However, we must not give control packets preference
   purely by virtue of their smallness, otherwise it is too easy for
   any data source to get the same preferential treatment simply by
   sending data in smaller packets.  Again, we should not create
   perverse incentives to favour small packets rather than to favour
   control packets, which is what we intend.

   Just because many control packets are small does not mean all small
   packets are control packets.

   So again, rather than fix these problems in the network layer, we
   argue that the transport should be made more robust against losses
   of control packets (see 'Making Transports Robust against Control
   Packet Losses' in Section 6.2.3).

3.  Working Definition of Congestion Notification

   Rather than aim to achieve what many have tried and failed to
   achieve, this memo will not try to define congestion.  It will give
   a working definition of what congestion notification should be
   taken to mean for this document.  Congestion notification is a
   changing signal that aims to communicate the ratio E/L, where E is
   the instantaneous excess load offered to a resource that it cannot
   (or would not) serve and L is the instantaneous offered load.

   The phrase 'would not serve' is added because AQM systems (e.g.
   RED, PCN [I-D.eardley-pcn-marking-behaviour]) use a virtual
   capacity smaller than actual capacity, then notify congestion of
   this virtual capacity in order to avoid congestion of the actual
   capacity.

   Note that the denominator is offered load, not capacity.  Therefore
   congestion notification is a real number bounded by the range
   [0,1].
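   In arithmetic terms, the working definition can be sketched as
   follows (the traffic figures are invented for illustration):

```python
# Sketch of the Section 3 working definition: congestion notification
# approximates E/L, where E is the instantaneous excess load the
# resource cannot (or would not) serve and L is the offered load.

def congestion_notification(offered_load, served_load):
    """Fraction of offered load not served (or marked at risk of it)."""
    excess = max(offered_load - served_load, 0)
    return excess / offered_load if offered_load else 0.0

# e.g. 12 Mb/s offered to a resource that (virtually) serves 10 Mb/s:
p = congestion_notification(12e6, 10e6)   # E/L = 2/12
assert 0.0 <= p <= 1.0   # bounded by [0,1], interpretable as a probability
```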
   This ties in with the best-understood form of congestion
   notification: drop rate.  It also means that congestion has a
   natural interpretation as a probability: the probability of offered
   traffic not being served (or being marked as at risk of not being
   served).  Appendix B describes a further incidental benefit that
   arises from using load as the denominator of congestion
   notification.

4.  Congestion Measurement

4.1.  Congestion Measurement by Queue Length

   Queue length is usually the most correct and simplest way to
   measure congestion of a resource.  To avoid the pathological
   effects of drop-tail, an AQM function can then be used to transform
   queue length into the probability of dropping or marking a packet
   (e.g. RED's piecewise linear function between thresholds).  If the
   resource is bit-congestible, the length of the queue SHOULD be
   measured in bytes.  If the resource is packet-congestible, the
   length of the queue SHOULD be measured in packets.  No other choice
   makes sense, because the number of packets waiting in the queue
   isn't relevant if the resource gets congested by bytes, and vice
   versa.  We discuss the implications for RED's byte mode and packet
   mode for measuring queue length in Section 6.

4.1.1.  Fixed Size Packet Buffers

   Some, mostly older, queuing hardware sets aside fixed sized buffers
   in which to store each packet in the queue.  Also, with some
   hardware, any fixed sized buffers not completely filled by a packet
   are padded when transmitted to the wire.  If we imagine a
   theoretical forwarding system with both queuing and transmission in
   fixed, MTU-sized units, it should clearly be treated as packet-
   congestible, because the queue length in packets would be a good
   model of congestion of the lower layer link.
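   A minimal sketch of the measurement rule in Section 4.1, paired
   with a RED-like piecewise-linear AQM function; the thresholds and
   packet sizes are invented for illustration:

```python
# Queue length measured in bytes for a bit-congestible resource, or in
# packets for a packet-congestible one (e.g. fixed MTU-sized buffers),
# then transformed into a drop/mark probability RED-style.

def red_probability(qlen, min_th, max_th, max_p=0.1):
    """RED's piecewise-linear function between thresholds."""
    if qlen < min_th:
        return 0.0
    if qlen >= max_th:
        return 1.0
    return max_p * (qlen - min_th) / (max_th - min_th)

def queue_length(pkt_sizes, bit_congestible):
    """pkt_sizes: sizes in bytes of the packets waiting in the queue."""
    return sum(pkt_sizes) if bit_congestible else len(pkt_sizes)

q = [1500, 60, 60, 1500]
p_link   = red_probability(queue_length(q, True),  min_th=3000, max_th=9000)
p_lookup = red_probability(queue_length(q, False), min_th=2,    max_th=10)
```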
   If we now imagine a hybrid forwarding system with transmission
   delay largely dependent on the byte-size of packets but buffers of
   one MTU per packet, it should strictly require a more complex
   algorithm to determine the probability of congestion.  It should be
   treated as two resources in sequence, where the sum of the byte-
   sizes of the packets within each packet buffer models congestion of
   the line, while the length of the queue in packets models
   congestion of the queue.  Then the probability of congesting the
   forwarding buffer would be a conditional probability--conditional
   on the previously calculated probability of congesting the line.

   However, in systems that use fixed size buffers, it is unusual for
   all the buffers used by an interface to be the same size.
   Typically pools of different sized buffers are provided (Cisco uses
   the term 'buffer carving' for the process of dividing up memory
   into these pools [IOSArch]).  Usually, if the pool of small buffers
   is exhausted, arriving small packets can borrow space in the pool
   of large buffers, but not vice versa.  However, it is easier to
   work out what should be done if we temporarily set aside the
   possibility of such borrowing.  Then, with fixed pools of buffers
   for different sized packets and no borrowing, the size of each pool
   and the current queue length in each pool would both be measured in
   packets.  So an AQM algorithm would have to maintain the queue
   length for each pool, and judge whether to drop/mark a packet of a
   particular size by looking at the pool for packets of that size and
   using the length (in packets) of its queue.

   We now return to the issue we temporarily set aside: small packets
   borrowing space in larger buffers.  In this case, the only
   difference is that the pools for smaller packets have a maximum
   queue size that includes all the pools for larger packets.
And every time a packet 582 takes a larger buffer, the current queue size has to be incremented 583 for all queues in the pools of buffers less than or equal to the 584 buffer size used. 586 We will return to borrowing of fixed sized buffers when we discuss 587 biasing the drop/marking probability of a specific packet because of 588 its size in Section 6.2.1. But here we can give a simple summary of 589 the present discussion on how to measure the length of queues of 590 fixed buffers: no matter how complicated the scheme is, ultimately 591 any fixed buffer system will need to measure its queue length in 592 packets not bytes. 594 4.2. Congestion Measurement without a Queue 596 AQM algorithms are nearly always described assuming there is a queue 597 for a congested resource and the algorithm can use the queue length 598 to determine the probability that it will drop or mark each packet. 599 But not all congested resources lead to queues. For instance, 600 wireless spectrum is bit-congestible (for a given coding scheme), 601 because interference increases with the rate at which bits are 602 transmitted. But wireless link protocols do not always maintain a 603 queue that depends on spectrum interference. Similarly, power 604 limited resources are also usually bit-congestible if energy is 605 primarily required for transmission rather than header processing, 606 but it is rare for a link protocol to build a queue as it approaches 607 maximum power. 609 However, AQM algorithms don't require a queue in order to work. For 610 instance spectrum congestion can be modelled by signal quality using 611 target bit-energy-to-noise-density ratio. And, to model radio power 612 exhaustion, transmission power levels can be measured and compared to 613 the maximum power available. 
[ECNFixedWireless] proposes a practical 614 and theoretically sound way to combine congestion notification for 615 different bit-congestible resources at different layers along an end 616 to end path, whether wireless or wired, and whether with or without 617 queues. 619 5. Idealised Wire Protocol Coding 621 We will start by inventing an idealised congestion notification 622 protocol before discussing how to make it practical. The idealised 623 protocol is shown to be correct using examples in Appendix A. 625 Congestion notification involves the congested resource coding a 626 congestion notification signal into the packet stream and the 627 transports decoding it. The idealised protocol uses two different 628 (imaginary) fields in each datagram to signal congestion: one for 629 byte congestion and one for packet congestion. 631 We are not saying two ECN fields will be needed (and we are not 632 saying that somehow a resource should be able to drop a packet in one 633 of two different ways so that the transport can distinguish which 634 sort of drop it was!). These two congestion notification channels 635 are just a conceptual device. They allow us to defer having to 636 decide whether to distinguish between byte and packet congestion when 637 the network resource codes the signal or when the transport decodes 638 it. 640 However, although this idealised mechanism isn't intended for 641 implementation, we do want to emphasise that we may need to find a 642 way to implement it, because it could become necessary to somehow 643 distinguish between bit and packet congestion [RFC3714]. Currently a 644 design goal of network processing equipment such as routers and 645 firewalls is to keep packet processing uncongested even under worst 646 case bit rates with minimum packet sizes. Therefore, packet- 647 congestion is currently rare, but there is no guarantee that it will 648 not become common with future technology trends. 650 The idealised wire protocol is given below. 
It accounts for packet 651 sizes at the transport layer, not in the network, and then only in 652 the case of bit-congestible resources. This avoids the perverse 653 incentive to send smaller packets and the DoS vulnerability that 654 would otherwise result if the network were to bias towards them (see 655 the motivating argument about avoiding perverse incentives in 656 Section 2.2). Incidentally, it also ensures neither the network nor 657 the transport needs to do a multiply operation--multiplication by 658 packet size is effectively achieved as a repeated add when the 659 transport adds to its count of marked bytes as each congestion event 660 is fed to it: 662 o A packet-congestible resource trying to code congestion level p_p 663 into a packet stream should mark the idealised `packet congestion' 664 field in each packet with probability p_p irrespective of the 665 packet's size. The transport should then take a packet with the 666 packet congestion field marked to mean just one mark, irrespective 667 of the packet size. 669 o A bit-congestible resource trying to code time-varying byte- 670 congestion level p_b into a packet stream should mark the `byte 671 congestion' field in each packet with probability p_b, again 672 irrespective of the packet's size. Unlike before, the transport 673 should take a packet with the byte congestion field marked to 674 count as a mark on each byte in the packet. 676 The worked examples in Appendix A show that transports can extract 677 sufficient and correct congestion notification from these protocols 678 for cases when two flows with different packet sizes have matching 679 bit rates or matching packet rates. Examples are also given that mix 680 these two flows into one to show that a flow with mixed packet sizes 681 would still be able to extract sufficient and correct information. 
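As a purely conceptual illustration of these two counting rules (the field and function names below are invented for the sketch; nothing like this exists on the wire):

```python
import random

def idealised_mark(p_p, p_b):
    """Network side: mark each imaginary field with its own probability,
    irrespective of packet size (idealised protocol, not implementable)."""
    return {'pkt_congestion': random.random() < p_p,
            'byte_congestion': random.random() < p_b}

def idealised_decode(packets):
    """Transport side: a packet-congestion mark counts as one mark; a
    byte-congestion mark counts as one mark per byte in the packet, so
    the multiplication by packet size is just a repeated add.  Each
    element of packets is a (size_in_bytes, fields) pair."""
    pkt_marks = byte_marks = 0
    for size, fields in packets:
        if fields['pkt_congestion']:
            pkt_marks += 1          # one mark, irrespective of size
        if fields['byte_congestion']:
            byte_marks += size      # one mark per byte of the packet
    return pkt_marks, byte_marks
```

Note that idealised_decode never multiplies: adding the packet size to a running count of marked bytes is exactly the "repeated add" described above.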
683 Sufficient and correct congestion information means that there is 684 sufficient information for the two different types of transport 685 requirements: 687 Ratio-based: Established transport congestion controls like TCP's 688 [RFC2581] aim to achieve equal segment rates per RTT through the 689 same bottleneck--TCP friendliness [RFC3448]. They work with the 690 ratio of dropped to delivered segments (or marked to unmarked 691 segments in the case of ECN). The example scenarios show that 692 these ratio-based transports are effectively the same whether 693 counting in bytes or packets, because the units cancel out. 694 (Incidentally, this is why TCP's bit rate is still proportional to 695 packet size even when byte-counting is used, as recommended for 696 TCP in [I-D.ietf-tcpm-rfc2581bis], mainly for orthogonal security 697 reasons.) 699 Absolute-target-based: Other congestion controls proposed in the 700 research community aim to limit the volume of congestion caused to 701 a constant weight parameter. [MulTCP][WindowPropFair] are 702 examples of weighted proportionally fair transports designed for 703 cost-fair environments [Rate_fair_Dis]. In this case, the 704 transport requires a count (not a ratio) of dropped/marked bytes 705 in the bit-congestible case and of dropped/marked packets in the 706 packet congestible case. 708 6. The State of the Art 710 The original 1993 paper on RED [RED93] proposed two options for the 711 RED active queue management algorithm: packet mode and byte mode. 712 Packet mode measured the queue length in packets and dropped (or 713 marked) individual packets with a probability independent of their 714 size. Byte mode measured the queue length in bytes and marked an 715 individual packet with probability in proportion to its size 716 (relative to the maximum packet size). 
In the paper's outline of 717 further work, it was stated that no recommendation had been made on 718 whether the queue size should be measured in bytes or packets, but 719 it was noted that the difference could be significant.

721 When RED was recommended for general deployment in 1998 [RFC2309], 722 the two modes were mentioned, implying the choice between them was a 723 question of performance, and referring to a 1997 email [pktByteEmail] for 724 advice on tuning. This email clarified that there were in fact two 725 orthogonal choices: whether to measure queue length in bytes or 726 packets (Section 6.1 below) and whether the drop probability of an 727 individual packet should depend on its own size (Section 6.2 below).

729 6.1. Congestion Measurement: Status

731 The choice of which metric to use to measure queue length was left 732 open in RFC2309. It is now well understood that queues for bit- 733 congestible resources should be measured in bytes, and queues for 734 packet-congestible resources should be measured in packets (see 735 Section 4).

737 Where buffers are not configured or legacy buffers cannot be 738 configured to the above guideline, we don't have to make allowances 739 for such legacy in future protocol design. If a bit-congestible 740 buffer is measured in packets, the operator will have set the 741 thresholds mindful of a typical mix of packet sizes. Any AQM 742 algorithm on such a buffer will be oversensitive to high proportions 743 of small packets, e.g. a DoS attack, and undersensitive to high 744 proportions of large packets. But an operator can safely keep such a 745 legacy buffer, because any undersensitivity during unusual traffic 746 mixes cannot lead to congestion collapse given the buffer will 747 eventually revert to tail drop, discarding proportionately more large 748 packets.

750 Some modern queue implementations give a choice for setting RED's 751 thresholds in byte-mode or packet-mode.
This may merely be an 752 administrator-interface preference, not altering how the queue itself 753 is measured; but on some hardware it does actually change the way the 754 queue is measured. Whether a resource is bit-congestible or packet- 755 congestible is a property of the resource, so an admin SHOULD NOT 756 ever need to, or be able to, configure the way a queue measures 757 itself.

759 We believe the question of whether to measure queues in bytes or 760 packets is fairly well understood these days. The only outstanding 761 issues concern how to measure congestion when the queue is bit- 762 congestible but the resource is packet-congestible or vice versa (see 763 Section 4). But there is no controversy over what should be done; 764 it's just that you have to be an expert in probability to work out what 765 should be done and, even if you are, it's not always easy to find a 766 practical algorithm to implement it.

768 6.2. Congestion Coding: Status

770 6.2.1. Network Bias when Encoding

772 The previously mentioned email [pktByteEmail] referred to by 773 [RFC2309] said that the choice over whether a packet's own size 774 should affect its drop probability "depends on the dominant end-to- 775 end congestion control mechanisms". [Section 2 argues against this 776 approach, citing the excellent advice in RFC3246.] The referenced 777 email went on to argue that drop probability should depend on the 778 size of the packet being considered for drop if the resource is bit- 779 congestible, but not if it is packet-congestible; it advised, though, that 780 most scarce resources in the Internet were currently bit-congestible. 781 The argument continued that if packet drops were inflated by packet 782 size (byte-mode dropping), "a flow's fraction of the packet drops is 783 then a good indication of that flow's fraction of the link bandwidth 784 in bits per second".
This was consistent with a referenced policing 785 mechanism being worked on at the time for detecting unusually high 786 bandwidth flows, eventually published in 1999 [pBox]. [The problem 787 could have been solved by making the policing mechanism count the 788 volume of bytes randomly dropped, not the number of packets.] 790 A few months before RFC2309 was published, an addendum was added to 791 the above archived email referenced from the RFC, in which the final 792 paragraph seemed to partially retract what had previously been said. 793 It clarified that the question of whether the probability of 794 dropping/marking a packet should depend on its size was not related 795 to whether the resource itself was bit congestible, but a completely 796 orthogonal question. However the only example given had the queue 797 measured in packets but packet drop depended on the byte-size of the 798 packet in question. No example was given the other way round. 800 In 2000, Cnodder et al [REDbyte] pointed out that there was an error 801 in the part of the original 1993 RED algorithm that aimed to 802 distribute drops uniformly, because it didn't correctly take into 803 account the adjustment for packet size. They recommended an 804 algorithm called RED_4 to fix this. But they also recommended a 805 further change, RED_5, to adjust drop rate dependent on the square of 806 relative packet size. This was indeed consistent with one stated 807 motivation behind RED's byte mode drop--that we should reverse 808 engineer the network to improve the performance of dominant end-to- 809 end congestion control mechanisms. 811 By 2003, a further change had been made to the adjustment for packet 812 size, this time in the RED algorithm of the ns2 simulator. Instead 813 of taking each packet's size relative to a `maximum packet size' it 814 was taken relative to a `mean packet size', intended to be a static 815 value representative of the `typical' packet size on the link. 
We 816 have not been able to find a justification for this change in the 817 literature; however, Eddy and Allman conducted experiments [REDbias] 818 that assessed how sensitive RED was to this parameter, amongst other 819 things. No-one seems to have pointed out that this changed algorithm 820 can often lead to drop probabilities greater than 1 [which should 821 ring alarm bells hinting that there's a mistake in the theory 822 somewhere]. On 10-Nov-2004, this variant of byte-mode packet drop 823 was made the default in the ns2 simulator.

825 The byte-mode drop variant of RED is, of course, not the only 826 possible bias towards small packets in queueing algorithms. We have 827 already mentioned that tail-drop queues naturally tend to lock out 828 large packets once they are full. But queues with fixed-size 829 buffers also reduce the probability that small packets will be dropped if 830 (and only if) they allow small packets to borrow buffers from the 831 pools for larger packets. As was explained in Section 4.1.1 on fixed 832 size buffer carving, borrowing effectively makes the maximum queue 833 size for small packets greater than that for large packets, because 834 more buffers can be used by small packets while fewer will fit 835 large packets.

837 However, in itself, the bias towards small packets caused by buffer 838 borrowing is perfectly correct. Lower drop probability for small 839 packets is legitimate in buffer borrowing schemes, because small 840 packets genuinely congest the machine's buffer memory less than large 841 packets, given they can fit in more spaces. The bias towards small 842 packets is not artificially added (as it is in RED's byte-mode drop 843 algorithm); it merely reflects the reality of the way fixed buffer 844 memory gets congested. Incidentally, the bias towards small packets 845 from buffer borrowing is nothing like as large as that of RED's byte- 846 mode drop.
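The flaw in the ns2 `mean packet size' variant noted above can be demonstrated with one line of arithmetic (a hedged sketch; the function and parameter names are ours, not from any implementation):

```python
def byte_mode_drop_prob(p, pkt_size, ref_size):
    # Inflate RED's packet-mode drop probability p in proportion to
    # relative packet size, as byte-mode drop does (illustrative only).
    return p * pkt_size / ref_size

# Relative to the `maximum packet size' (the original 1993 byte mode),
# the result can never exceed p:
assert byte_mode_drop_prob(0.8, 1500, 1500) == 0.8
# Relative to a smaller `mean packet size' (the ns2 variant), a
# full-sized packet can be assigned a "probability" greater than 1:
assert byte_mode_drop_prob(0.8, 1500, 500) > 1
```

Any packet larger than the configured `mean' size breaks the bound, which is why dividing by the maximum packet size kept the original algorithm safe.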
848 Nonetheless, fixed-buffer memory with tail drop is still prone to 849 lock out large packets, purely because of the tail-drop aspect. So a 850 good AQM algorithm like RED with packet-mode drop should be used with 851 fixed buffer memories where possible. If RED is too complicated to 852 implement with multiple fixed buffer pools, the minimum necessary to 853 prevent large packet lock-out is to ensure smaller packets never use 854 the last available buffer in any of the pools for larger packets.

856 6.2.2. Transport Bias when Decoding

858 The above proposals to alter the network layer to give a bias towards 859 smaller packets have largely carried on outside the IETF process 860 (unless one counts a reference in an informational RFC to an archived 861 email!). Whereas, within the IETF, there are many different 862 proposals to alter transport protocols to achieve the same goals, 863 i.e. either to make the flow bit-rate take account of packet size, or 864 to protect control packets from loss. This memo argues that altering 865 transport protocols is the more principled approach.

867 A recently approved experimental RFC proposes a new small-packet 868 variant of TCP-friendly rate control [RFC3448], called TFRC-SP 869 [RFC4828], which adapts the transport protocol to take account of 870 packet sizes relative to typical TCP packet sizes. 871 Essentially, it proposes a rate equation that inflates the flow rate 872 by the ratio of a typical TCP segment size (1500B including TCP 873 header) over the actual segment size [PktSizeEquCC]. (There are also 874 other important differences of detail relative to TFRC, such as using 875 virtual packets [CCvarPktSize] to avoid responding to multiple losses 876 per round trip and using a minimum inter-packet interval.)
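The rate inflation at the heart of TFRC-SP can be sketched with the first term of the TCP throughput equation (a simplified illustration of the principle only; it omits the retransmission-timeout term of [RFC3448] and the minimum inter-packet interval of [RFC4828], and the function names are ours):

```python
from math import sqrt

def tfrc_bit_rate(s, rtt, p, b=1):
    """Simplified TFRC allowed rate in bytes/s: the first term of the
    TCP throughput equation, X = s / (R * sqrt(2*b*p/3))."""
    return s / (rtt * sqrt(2 * b * p / 3))

def tfrc_sp_bit_rate(s, rtt, p):
    """TFRC-SP principle: allow the bit rate of a flow of typical
    1500B segments, whatever the actual segment size s, i.e. the
    plain-TFRC rate inflated by the ratio 1500/s."""
    return tfrc_bit_rate(1500, rtt, p)

# A flow of 100B segments is allowed 15x the bit rate that plain
# TFRC would give it, restoring independence from segment size:
r = tfrc_sp_bit_rate(100, rtt=0.1, p=0.01)
assert abs(r - 15 * tfrc_bit_rate(100, 0.1, 0.01)) < 1e-6 * r
```

The sketch makes the trade-off plain: the allowed bit rate no longer depends on s, so the allowed packet rate grows as 1500/s, which is exactly why RFC 4828 caps it with a minimum inter-packet interval.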
878 Section 4.5.1 of this TFRC-SP spec discusses the implications of 879 operating in an environment where queues have been configured to drop 880 smaller packets with proportionately lower probability than larger 881 ones. But it only discusses TCP operating in such an environment, 882 only mentioning TFRC-SP briefly when discussing how to define 883 fairness with TCP. And it only discusses the byte-mode dropping 884 version of RED as it was before Cnodder et al pointed out it didn't 885 sufficiently bias towards small packets to make TCP independent of 886 packet size. 888 So the TFRC-SP spec doesn't address the issue of which of the network 889 or the transport _should_ handle fairness between different packet 890 sizes. In its Appendix B.4 it discusses the possibility of both 891 TFRC-SP and some network buffers duplicating each other's attempts to 892 deliberately bias towards small packets. But the discussion is not 893 conclusive, instead reporting simulations of many of the 894 possibilities in order to assess performance but not recommending any 895 particular course of action. 897 The paper originally proposing TFRC with virtual packets (VP-TFRC) 898 [CCvarPktSize] proposed that there should perhaps be two variants to 899 cater for the different variants of RED. However, as the TFRC-SP 900 authors point out, there is no way for a transport to know whether 901 some queues on its path have deployed RED with byte-mode packet drop 902 (except if an exhaustive survey found that no-one has deployed it!-- 903 see Section 6.2.4). Incidentally, VP-TFRC also proposed that byte- 904 mode RED dropping should really square the packet size compensation 905 factor (like that of RED_5, but apparently unaware of it). 907 Pre-congestion notification [I-D.eardley-pcn-marking-behaviour] is a 908 proposal to use a virtual queue for AQM marking for packets within 909 one Diffserv class in order to give early warning prior to any real 910 queuing. 
The proposed PCN marking algorithms have been designed not 911 to take account of packet size when forwarding through queues. 912 Instead the general principle has been to take account of the sizes 913 of marked packets when monitoring the fraction of marking at the edge 914 of the network. 916 6.2.3. Making Transports Robust against Control Packet Losses 918 Recently, two drafts have proposed changes to TCP that make it more 919 robust against losing small control packets [I-D.ietf-tcpm-ecnsyn] 920 [I-D.floyd-tcpm-ackcc]. In both cases they note that the case for 921 these TCP changes would be weaker if RED were biased against dropping 922 small packets. We argue here that these two proposals are a safer 923 and more principled way to achieve TCP performance improvements than 924 reverse engineering RED to benefit TCP. 926 Although no proposals exist as far as we know, it would also be 927 possible and perfectly valid to make control packets robust against 928 drop by explicitly requesting a lower drop probability using their 929 Diffserv code point [RFC2474] to request a scheduling class with 930 lower drop. 932 The re-ECN protocol proposal [Re-TCP] is designed so that transports 933 can be made more robust against losing control packets. It gives 934 queues an incentive to optionally give preference against drop to 935 packets with the 'feedback not established' codepoint in the proposed 936 'extended ECN' field. Senders have incentives to use this codepoint 937 sparingly, but they can use it on control packets to reduce their 938 chance of being dropped. For instance, the proposed modification to 939 TCP for re-ECN uses this codepoint on the SYN and SYN-ACK. 941 Although not brought to the IETF, a simple proposal from Wischik 942 [DupTCP] suggests that the first three packets of every TCP flow 943 should be routinely duplicated after a short delay. 
It shows that 944 this would greatly improve the chances of short flows completing 945 quickly, but it would hardly increase traffic levels on the Internet, 946 because Internet bytes have always been concentrated in the large 947 flows. It further shows that the performance of many typical 948 applications depends on completion of long serial chains of short 949 messages. It argues that, given most of the value people get from 950 the Internet is concentrated within short flows, this simple 951 expedient would greatly increase the value of the best efforts 952 Internet at minimal cost.

954 6.2.4. Congestion Coding: Summary of Status

956   +-----------+----------------+-----------------+--------------------+
957   | transport | RED_1 (packet  | RED_4 (linear   | RED_5 (square byte |
958   | cc        | mode drop)     | byte mode drop) | mode drop)         |
959   +-----------+----------------+-----------------+--------------------+
960   | TCP or    | s/sqrt(p)      | sqrt(s/p)       | 1/sqrt(p)          |
961   | TFRC      |                |                 |                    |
962   | TFRC-SP   | 1/sqrt(p)      | 1/sqrt(sp)      | 1/(s.sqrt(p))      |
963   +-----------+----------------+-----------------+--------------------+

965   Table 1: Dependence of flow bit-rate per RTT on packet size s and
966   drop rate p when network and/or transport bias towards small packets
967   to varying degrees

969 Table 1 aims to summarise the positions we may now be in. Each 970 column shows a different possible AQM behaviour in different queues 971 in the network, using the terminology of Cnodder et al outlined 972 earlier (RED_1 is basic RED with packet-mode drop). Each row shows a 973 different transport behaviour: TCP [RFC2581] and TFRC [RFC3448] on 974 the top row with TFRC-SP [RFC4828] below. Suppressing all 975 inessential details, the table shows that independence from packet 976 size should be achievable either by not altering the TCP transport in 977 a RED_5 network, or by using the small-packet TFRC-SP transport in a 978 network without any byte-mode dropping RED (top right and bottom 979 left).
Top left is the `do nothing' scenario, while bottom right is 980 the `do-both' scenario in which bit-rate would become far too biased 981 towards small packets. Of course, if any form of byte-mode dropping 982 RED has been deployed on a selection of congested queues, each path 983 will present a different hybrid scenario to its transport.

985 In any case, we can see that the linear byte-mode drop column in the 986 middle considerably complicates the Internet. It's a half-way house 987 that doesn't bias enough towards small packets even if one believes 988 the network should be doing the biasing. We argue below that _all_ 989 network layer bias towards small packets should be turned off--if 990 indeed any equipment vendors have implemented it--leaving packet size 991 bias solely as the preserve of the transport layer (solely the 992 leftmost, packet-mode drop column).

994 A survey has been conducted of 84 vendors to assess how widely drop 995 probability based on packet size has been implemented in RED. Prior 996 to the survey, an individual approach to Cisco received confirmation 997 that, having checked the code-base for each of the product ranges, 998 Cisco has not implemented any discrimination based on packet size in 999 any AQM algorithm in any of its products. Similarly, an individual 1000 approach to Alcatel-Lucent drew confirmation that it was very 1001 likely that none of their products contained RED code that 1002 implemented any packet-size bias.

1004 Turning to our more formal survey (Table 2), about 19% of those 1005 surveyed have replied so far, giving a sample size of 16. Although 1006 we do not have permission to identify the respondents, we can say 1007 that those that have responded include most of the larger vendors, 1008 covering a large fraction of the market.
They range across the large 1009 network equipment vendors at L3 & L2, firewall vendors, wireless 1010 equipment vendors, as well as large software businesses with a small 1011 selection of networking products. So far, all those who have 1012 responded have confirmed that they have not implemented the variant 1013 of RED with drop dependent on packet size (2 are fairly sure they 1014 haven't but need to check more thoroughly).

1016  +-------------------------------+----------------+-----------------+
1017  | Response                      | No. of vendors | %age of vendors |
1018  +-------------------------------+----------------+-----------------+
1019  | Not implemented               | 14             | 17%             |
1020  | Not implemented (probably)    | 2              | 2%              |
1021  | Implemented                   | 0              | 0%              |
1022  | No response                   | 68             | 81%             |
1023  | Total companies/orgs surveyed | 84             | 100%            |
1024  +-------------------------------+----------------+-----------------+

1026  Table 2: Vendor Survey on byte-mode drop variant of RED (lower drop
1027  probability for small packets)

1029 Where reasons have been given, the most prevalent was the extra 1030 complexity of packet-bias code, though one vendor had a more principled 1031 reason for avoiding it--similar to the argument of this document. We 1032 have established that Linux does not implement RED with packet-size 1033 drop bias, although we have not investigated a wider range of open 1034 source code.

1036 Finally, we repeat that RED's byte-mode drop is not the only way to 1037 bias towards small packets--tail-drop tends to lock out large packets 1038 very effectively. Our survey was of vendor implementations, so we 1039 cannot be certain about operator deployment. But we believe many 1040 queues in the Internet are still tail-drop. My own company (BT) has 1041 widely deployed RED, but there are bound to be many tail-drop queues, 1042 particularly in access network equipment and on middleboxes like 1043 firewalls, where RED is not always available.
Routers using a memory 1044 architecture based on fixed size buffers with borrowing may also 1045 still be prevalent in the Internet. As explained in Section 6.2.1, 1046 these also provide a marginal (but legitimate) bias towards small 1047 packets. So even though RED byte-mode drop is not prevalent, it is 1048 likely there is still some bias towards small packets in the Internet 1049 due to tail drop and fixed buffer borrowing.

1051 7. Outstanding Issues and Next Steps

1053 7.1. Bit-congestible World

1055 For a connectionless network with nearly all resources being bit- 1056 congestible, we believe the recommended position is now unarguably 1057 clear--that the network should not make allowance for packet sizes 1058 and the transport should. This leaves two outstanding issues:

1060 o How to handle any legacy of AQM with byte-mode drop already 1061 deployed;

1063 o The need to start a programme to update transport congestion 1064 control protocol standards to take account of packet size.

1066 The sample of returns from our vendor survey (Section 6.2.4) suggests 1067 that byte-mode packet drop seems not to be implemented at all, let 1068 alone deployed, or if it is, it is likely to be very sparse. 1069 Therefore, we do not really need a migration strategy from all but 1070 nothing to nothing.

1072 A programme of standards updates to take account of packet size in 1073 transport congestion control protocols has started with TFRC-SP 1074 [RFC4828], while weighted TCPs implemented in the research community 1075 [WindowPropFair] could form the basis of a future change to TCP 1076 congestion control [RFC2581] itself.

1078 7.2. Bit- & Packet-congestible World

1080 Nonetheless, a connectionless network with both bit-congestible and 1081 packet-congestible resources is a different matter. If we believe we 1082 should allow for this possibility in the future, this space contains 1083 a truly open research issue.
1085 The idealised wire protocol coding described in Section 5 requires at 1086 least two flags for congestion of bit-congestible and packet- 1087 congestible resources. This hides a fundamental problem--much more 1088 fundamental than whether we can magically create header space for yet 1089 another ECN flag in IPv4, or whether it would work while being 1090 deployed incrementally. A congestion notification protocol must 1091 survive a transition from low levels of congestion to high. Marking 1092 two states is feasible with explicit marking, but much harder if 1093 packets are dropped. Also, it will not always be cost-effective to 1094 implement AQM at every low level resource, so drop will often have to 1095 suffice. Distinguishing drop from delivery naturally provides just 1096 one congestion flag--it is hard to drop a packet in two ways that are 1097 distinguishable remotely. This is a similar problem to that of 1098 distinguishing wireless transmission losses from congestive losses. 1100 We should also note that, strictly, packet-congestible resources are 1101 actually cycle-congestible because load also depends on the 1102 complexity of each look-up and whether the pattern of arrivals is 1103 amenable to caching or not. Further, this reminds us that any 1104 solution must not require a forwarding engine to use excessive 1105 processor cycles in order to decide how to say it has no spare 1106 processor cycles. 1108 Recently, the dual resource queue (DRQ) proposal [DRQ] has been made 1109 on the premise that, as network processors become more cost 1110 effective, per packet operations will become more complex 1111 (irrespective of whether more function in the network layer is 1112 desirable). Consequently the premise is that CPU congestion will 1113 become more common. DRQ is a proposed modification to the RED 1114 algorithm that folds both bit congestion and packet congestion into 1115 one signal (either loss or ECN). 
1117 The problem of signalling packet processing congestion is not 1118 pressing, as most Internet resources are designed to be bit- 1119 congestible before packet processing starts to congest. However, the 1120 IRTF Internet congestion control research group (ICCRG) has set 1121 itself the task of reaching consensus on generic forwarding 1122 mechanisms that are necessary and sufficient to support the 1123 Internet's future congestion control requirements (the first 1124 challenge in 1125 [I-D.irtf-iccrg-welzl-congestion-control-open-research]). Therefore, 1126 rather than not giving this problem any thought at all, just because 1127 it is hard and currently hypothetical, we defer the question of 1128 whether packet congestion might become common and what to do if it 1129 does to the IRTF (the 'Small Packets' challenge in 1130 [I-D.irtf-iccrg-welzl-congestion-control-open-research]). 1132 8. Security Considerations 1134 This draft recommends that queues do not bias drop probability 1135 towards small packets as this creates a perverse incentive for 1136 transports to break down their flows into tiny segments. One of the 1137 benefits of implementing AQM was meant to be to remove this perverse 1138 incentive that drop-tail queues gave to small packets. Of course, if 1139 transports really want to make the greatest gains, they don't have to 1140 respond to congestion anyway. But we don't want applications that 1141 are trying to behave to discover that they can go faster by using 1142 smaller packets. 1144 In practice, transports cannot all be trusted to respond to 1145 congestion. So another reason for recommending that queues do not 1146 bias drop probability towards small packets is to avoid the 1147 vulnerability to small packet DDoS attacks that would otherwise 1148 result. One of the benefits of implementing AQM was meant to be to 1149 remove drop-tail's DoS vulnerability to small packets, so we 1150 shouldn't add it back again. 
If most queues implemented AQM with byte-mode drop, the resulting network would amplify the potency of a small-packet DDoS attack.  At the first queue the stream of packets would push aside a greater proportion of large packets, so more of the small packets would survive to attack the next queue.  Thus a flood of small packets would continue on towards the destination, pushing regular traffic with large packets out of the way in one queue after the next, but suffering much less drop itself.

Appendix C explains why the ability of networks to police the response of _any_ transport to congestion depends on bit-congestible network resources only doing packet-mode, not byte-mode, drop.  In summary, it says that making drop probability depend on the size of the packets that bits happen to be divided into simply encourages the bits to be divided into smaller packets.  Byte-mode drop would therefore irreversibly complicate any attempt to fix the Internet's incentive structures.

9.  Conclusions

The strong conclusion is that AQM algorithms such as RED SHOULD NOT use byte-mode drop.  More generally, the Internet's congestion notification protocols (drop, ECN & PCN) SHOULD take account of packet size when the notification is read by the transport layer, NOT when it is written by the network layer.  This approach offers sufficient and correct congestion information for all known and future transport protocols and also ensures no perverse incentives are created that would encourage transports to use inappropriately small packet sizes.

The alternative of deflating RED's drop probability for smaller packet sizes (byte-mode drop) has no enduring advantages.
It is more complex, it creates the perverse incentive to fragment segments into tiny pieces, and it reopens the vulnerability to floods of small packets that drop-tail queues suffered from and AQM was designed to remove.  Byte-mode drop is a change to the network layer that makes allowance for an omission from the design of TCP, effectively reverse engineering the network layer to contrive to make two TCPs with different packet sizes run at equal bit rates (rather than packet rates) under the same path conditions.  It also improves TCP performance by reducing the chance that a SYN or a pure ACK will be dropped, because they are small.  But we SHOULD NOT hack the network layer to improve or fix certain transport protocols.  No matter how predominant a transport protocol is (even if it's TCP), trying to correct for its failings by biasing towards small packets in the network layer creates a perverse incentive to break down all flows from all transports into tiny segments.

So far, our survey of 84 vendors across the industry has drawn responses from about 19%, none of whom have implemented the byte-mode packet drop variant of RED.  Given that there appears to be little, if any, installed base, recommending removal of byte-mode drop from RED is possibly only a paper exercise with few, if any, incremental deployment issues.

If a vendor has implemented byte-mode drop, and an operator has turned it on, it is strongly RECOMMENDED that it be turned off.  Note that RED as a whole SHOULD NOT be turned off, as without it, a drop-tail queue also biases against large packets.  But note also that turning off byte-mode drop may alter the relative performance of applications using different packet sizes, so it would be advisable to establish the implications before turning it off.
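The recommended division of labour can be sketched in a few lines: the network layer notifies congestion irrespective of packet size, and the transport accounts for packet size when it reads the notification.  This is a minimal illustration under that assumption, not a specification:

```python
import random

def packet_mode_mark(p: float) -> bool:
    """Network layer: mark (or drop) with probability p, taking no
    account of packet size -- packet-mode drop."""
    return random.random() < p

def congestion_volume(packets) -> int:
    """Transport layer: read the notification in terms of bytes, i.e.
    count marked bytes rather than marked packets."""
    return sum(size for size, marked in packets if marked)

random.seed(1)
p = 0.25
flow = [(1500, packet_mode_mark(p)) for _ in range(10000)]
decoded = congestion_volume(flow) / sum(size for size, _ in flow)
# The marked-byte fraction decodes approximately the same p the network wrote:
assert abs(decoded - p) < 0.02
```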
Instead, the IETF transport area should continue its programme of updating congestion control protocols to take account of packet size and to make transports less sensitive to losing control packets like SYNs and pure ACKs.

NOTE WELL that RED's byte-mode queue measurement is fine, being completely orthogonal to byte-mode drop.  If a RED implementation has a byte-mode but does not specify what sort of byte-mode, it is most probably byte-mode queue measurement, which is fine.  However, if in doubt, the vendor should be consulted.

The above conclusions cater for the Internet as it is today, with most, if not all, resources being primarily bit-congestible.  A secondary conclusion of this memo is that we may see more packet-congestible resources in the future, so research may be needed to extend the Internet's congestion notification (drop or ECN) so that it can handle a mix of bit-congestible and packet-congestible resources.

10.  Acknowledgements

Thank you to Sally Floyd, who gave extensive and useful review comments.  Also thanks for the reviews from Toby Moncaster and Arnaud Jacquet.  I am grateful to Bruce Davie and his colleagues for providing a timely and efficient survey of RED implementation in Cisco's product range.  Also grateful thanks to Toby Moncaster, Will Dormann, John Regnault, Simon Carter and Stefaan De Cnodder, who further helped survey the current status of RED implementation and deployment and, finally, thanks to the anonymous individuals who responded.

11.  Comments Solicited

Comments and questions are encouraged and very welcome.  They can be addressed to the IETF Transport Area working group mailing list, and/or to the authors.

12.  References

12.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC3426]  Floyd, S., "General Architectural and Policy Considerations", RFC 3426, November 2002.

[RFC5033]  Floyd, S. and M. Allman, "Specifying New Congestion Control Algorithms", BCP 133, RFC 5033, August 2007.

12.2.  Informative References

[CCvarPktSize]  Widmer, J., Boutremans, C., and J-Y. Le Boudec, "Congestion Control for Flows with Variable Packet Size", ACM CCR 34(2) 137--151, 2004.

[DRQ]  Shin, M., Chong, S., and I. Rhee, "Dual-Resource TCP/AQM for Processing-Constrained Networks", IEEE/ACM Transactions on Networking 16(2), April 2008.

[DupTCP]  Wischik, D., "Short messages", Royal Society workshop on networks: modelling and control, September 2007.

[ECNFixedWireless]  Siris, V., "Resource Control for Elastic Traffic in CDMA Networks", Proc. ACM MOBICOM'02, September 2002.

[Evol_cc]  Gibbens, R. and F. Kelly, "Resource pricing and the evolution of congestion control", Automatica 35(12) 1969--1985, December 1999.

[I-D.eardley-pcn-marking-behaviour]  Eardley, P., "Marking behaviour of PCN-nodes", draft-eardley-pcn-marking-behaviour-01 (work in progress), June 2008.

[I-D.falk-xcp-spec]  Falk, A., "Specification for the Explicit Control Protocol (XCP)", draft-falk-xcp-spec-03 (work in progress), July 2007.

[I-D.floyd-tcpm-ackcc]  Floyd, S. and I. Property, "Adding Acknowledgement Congestion Control to TCP", draft-floyd-tcpm-ackcc-02 (work in progress), November 2007.

[I-D.ietf-tcpm-ecnsyn]  Floyd, S., "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", draft-ietf-tcpm-ecnsyn-05 (work in progress), February 2008.

[I-D.ietf-tcpm-rfc2581bis]  Allman, M., "TCP Congestion Control", draft-ietf-tcpm-rfc2581bis-03 (work in progress), September 2007.

[I-D.irtf-iccrg-welzl-congestion-control-open-research]  Papadimitriou, D., "Open Research Issues in Internet Congestion Control", draft-irtf-iccrg-welzl-congestion-control-open-research-00 (work in progress), July 2007.

[IOSArch]  Bollapragada, V., White, R., and C. Murphy, "Inside Cisco IOS Software Architecture", Cisco Press: CCIE Professional Development, ISBN13: 978-1-57870-181-0, July 2000.

[MulTCP]  Crowcroft, J. and Ph. Oechslin, "Differentiated End to End Internet Services using a Weighted Proportional Fair Sharing TCP", CCR 28(3) 53--69, July 1998.

[PktSizeEquCC]  Vasallo, P., "Variable Packet Size Equation-Based Congestion Control", ICSI Technical Report tr-00-008, 2000.

[RED93]  Floyd, S. and V. Jacobson, "Random Early Detection (RED) gateways for Congestion Avoidance", IEEE/ACM Transactions on Networking 1(4) 397--413, August 1993.

[REDbias]  Eddy, W. and M. Allman, "A Comparison of RED's Byte and Packet Modes", Computer Networks 42(3) 261--280, June 2003.

[REDbyte]  De Cnodder, S., Elloumi, O., and K. Pauwels, "RED behavior with different packet sizes", Proc. 5th IEEE Symposium on Computers and Communications (ISCC) 793--799, July 2000.

[RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black, "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers", RFC 2474, December 1998.
[RFC2581]  Allman, M., Paxson, V., and W. Stevens, "TCP Congestion Control", RFC 2581, April 1999.

[RFC3448]  Handley, M., Floyd, S., Padhye, J., and J. Widmer, "TCP Friendly Rate Control (TFRC): Protocol Specification", RFC 3448, January 2003.

[RFC3714]  Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion Control for Voice Traffic in the Internet", RFC 3714, March 2004.

[RFC4782]  Floyd, S., Allman, M., Jain, A., and P. Sarolahti, "Quick-Start for TCP and IP", RFC 4782, January 2007.

[RFC4828]  Floyd, S. and E. Kohler, "TCP Friendly Rate Control (TFRC): The Small-Packet (SP) Variant", RFC 4828, April 2007.

[Rate_fair_Dis]  Briscoe, B., "Flow Rate Fairness: Dismantling a Religion", ACM CCR 37(2) 63--74, April 2007.

[Re-TCP]  Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith, "Re-ECN: Adding Accountability for Causing Congestion to TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-05 (work in progress), January 2008.

[WindowPropFair]  Siris, V., "Service Differentiation and Performance of Weighted Window-Based Congestion Control and Packet Marking Algorithms in ECN Networks", Computer Communications 26(4) 314--326, 2002.

[gentle_RED]  Floyd, S., "Recommendation on using the "gentle_" variant of RED", Web page, March 2000.

[pBox]  Floyd, S. and K. Fall, "Promoting the Use of End-to-End Congestion Control in the Internet", IEEE/ACM Transactions on Networking 7(4) 458--472, August 1999.

[pktByteEmail]  Floyd, S., "RED: Discussions of Byte and Packet Modes", email, March 1997.

Editorial Comments

[Note_Variation]  The algorithm of the byte-mode drop variant of RED switches off any bias towards small packets whenever the smoothed queue length dictates that the drop probability of large packets should be 100%.  In the example in the Introduction, as the large packet drop probability varies around 25%, the small packet drop probability will vary around 1%, but with occasional jumps to 100% whenever the instantaneous queue (after drop) manages to sustain a length above the 100% drop point for longer than the queue averaging period.

Appendix A.  Example Scenarios

A.1.  Notation

To prove the two sets of assertions in the idealised wire protocol (Section 5) are true, we will compare two flows with different packet sizes, s_1 and s_2 [bit/pkt], to make sure their transports each see the correct congestion notification.  Initially, within each flow we will take all packets as having equal sizes, but later we will generalise to flows within which packet sizes vary.  A flow's bit rate, x [bit/s], is related to its packet rate, u [pkt/s], by

   x(t) = s.u(t).

We will consider a 2x2 matrix of four scenarios:

   +-----------------------------+------------------+------------------+
   | resource type and           | A) Equal bit     | B) Equal pkt     |
   | congestion level            | rates            | rates            |
   +-----------------------------+------------------+------------------+
   | i) bit-congestible, p_b     | (Ai)             | (Bi)             |
   | ii) pkt-congestible, p_p    | (Aii)            | (Bii)            |
   +-----------------------------+------------------+------------------+

                                Table 3

A.2.  Bit-congestible resource, equal bit rates (Ai)

Starting with the bit-congestible scenario, for two flows to maintain equal bit rates (Ai) the ratio of the packet rates must be the inverse of the ratio of packet sizes: u_2/u_1 = s_1/s_2.  So, for instance, a flow of 60B packets would have to send 25x more packets to achieve the same bit rate as a flow of 1500B packets.
If a congested resource marks proportion p_b of packets irrespective of size, the ratio of marked packets received by each transport will still be the same as the ratio of their packet rates, p_b.u_2/p_b.u_1 = s_1/s_2.  So of the 25x more 60B packets sent, 25x more will be marked than in the 1500B packet flow, but 25x more will also remain unmarked.

In this scenario, the resource is bit-congestible, so it always uses our idealised bit-congestion field when it marks packets.  Therefore the transport should count marked bytes, not packets.  But it doesn't actually matter for ratio-based transports like TCP (Section 5).  The ratio of marked to unmarked bytes seen by each flow will be p_b, as will the ratio of marked to unmarked packets.  Because they are ratios, the units cancel out.

If a flow sent an inconsistent mixture of packet sizes, we have said it should count the ratio of marked to unmarked bytes, not packets, in order to correctly decode the level of congestion.  But actually, if all it is trying to do is decode p_b, it still doesn't matter.  For instance, imagine the two equal bit rate flows were actually one flow at twice the bit rate, sending a mixture of one 1500B packet for every twenty-five 60B packets.  25x more small packets will be marked and 25x more will be unmarked.  The transport can still calculate p_b whether it uses bytes or packets for the ratio.  In general, for any algorithm which works on a ratio of marks to non-marks, either bytes or packets can be counted interchangeably, because the choice cancels out in the ratio calculation.

However, where an absolute target rather than relative volume of congestion caused is important (Section 5), as it is for congestion accountability [Rate_fair_Dis], the transport must count marked bytes, not packets, in this bit-congestible case.
Aside from the goal of congestion accountability, this is how the bit rate of a transport can be made independent of packet size: by ensuring the rate of congestion caused is kept to a constant weight [WindowPropFair], rather than merely responding to the ratio of marked and unmarked bytes.

Note the unit of byte-congestion volume is the byte.

A.3.  Bit-congestible resource, equal packet rates (Bi)

If two flows send different packet sizes but at the same packet rate, their bit rates will be in the same ratio as their packet sizes, x_2/x_1 = s_2/s_1.  For instance, a flow sending 1500B packets at the same packet rate as another sending 60B packets will be sending at a 25x greater bit rate.  In this case, if a congested resource marks proportion p_b of packets irrespective of size, the ratio of packets received with the byte-congestion field marked by each transport will be the same, p_b.u_2/p_b.u_1 = 1.

Because the byte-congestion field is marked, the transport should count marked bytes, not packets.  But because each flow sends consistently sized packets, it still doesn't matter for ratio-based transports.  The ratio of marked to unmarked bytes seen by each flow will be p_b, as will the ratio of marked to unmarked packets.  Therefore, if the congestion control algorithm is only concerned with the ratio of marked to unmarked packets (as is TCP), both flows will be able to decode p_b correctly, whether they count packets or bytes.

But if the absolute volume of congestion is important, e.g. for congestion accountability, the transport must count marked bytes, not packets.  Then the lower bit rate flow using smaller packets will rightly be perceived as causing less byte-congestion even though its packet rate is the same.
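The arithmetic of scenarios Ai and Bi above can be checked directly with expected values, using the running example's packet sizes (deterministic expectations stand in for random marking):

```python
p_b = 0.25          # marking probability at the bit-congestible resource
s1, s2 = 1500, 60   # packet sizes of the two flows
u1 = 100            # packet rate of the large-packet flow [pkt/s]

# Scenario Ai: equal bit rates, so the 60B flow sends 25x more packets.
u2 = u1 * s1 // s2
assert u1 * s1 == u2 * s2                               # equal bit rates

# Ratio-based decoding yields p_b whichever units are counted:
assert (p_b * u2) / u2 == (p_b * u2 * s2) / (u2 * s2) == p_b

# Absolute congestion volume (marked bytes/s) is equal at equal bit rates:
assert p_b * u1 * s1 == p_b * u2 * s2

# Scenario Bi: at equal packet rates the small-packet flow rightly
# registers 25x less byte-congestion, though its packet rate is the same:
assert p_b * u1 * s1 == 25 * (p_b * u1 * s2)
```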
If the two flows are mixed into one, of bit rate x_1 + x_2, with equal packet rates of each size packet, the ratio p_b will still be measurable by counting the ratio of marked to unmarked bytes (or packets, because the ratio cancels out the units).  However, if the absolute volume of congestion is required, the transport must count the sum of congestion-marked bytes, which indeed gives a correct measure of the rate of byte-congestion, p_b(x_1 + x_2), caused by the combined bit rate.

A.4.  Pkt-congestible resource, equal bit rates (Aii)

Moving to the case of packet-congestible resources, we now take two flows that send different packet sizes at the same bit rate, but this time the pkt-congestion field is marked by the resource with probability p_p.  As in scenario Ai with the same bit rates but a bit-congestible resource, the flow with smaller packets will have a higher packet rate, so more packets will be both marked and unmarked, but in the same proportion.

This time, the transport should only count marks, without taking into account packet sizes.  Transports will get the same result, p_p, by decoding the ratio of marked to unmarked packets in either flow.

If one flow imitates the two flows merged together, the bit rate will double, with more small packets than large.  The ratio of marked to unmarked packets will still be p_p.  But if the absolute number of pkt-congestion-marked packets is counted, it will accumulate at the combined packet rate times the marking probability, p_p(u_1 + u_2), 26x faster than packet congestion accumulates in the single 1500B packet flow of our example, as required.

But if the transport is interested in the absolute amount of packet congestion, it should just count how many marked packets arrive.
For instance, a flow sending 60B packets will see 25x more marked packets than one sending 1500B packets at the same bit rate, because it is sending more packets through a packet-congestible resource.

Note the unit of packet congestion is the packet.

A.5.  Pkt-congestible resource, equal packet rates (Bii)

Finally, if two flows with the same packet rate pass through a packet-congestible resource, they will both suffer the same proportion of marking, p_p, irrespective of their packet sizes.  On detecting that the pkt-congestion field is marked, the transport should count packets, and it will be able to extract the ratio p_p of marked to unmarked packets from both flows, irrespective of packet sizes.

Even if the transport is monitoring the absolute amount of packet congestion over a period, it will still see the same amount of packet congestion from either flow.

And if the two equal packet rates of different size packets are mixed together in one flow, the packet rate will double, so the absolute volume of packet congestion will accumulate at twice the rate of either flow, 2p_p.u_1 = p_p(u_1 + u_2).

Appendix B.  Congestion Notification Definition: Further Justification

In Section 3, on the definition of congestion notification, load, not capacity, was used as the denominator.  This also has a subtle significance in the related debate over the design of new transport protocols--typical new protocol designs (e.g. in XCP [I-D.falk-xcp-spec] & Quick-Start [RFC4782]) expect the sending transport to communicate its desired flow rate to the network, and network elements to progressively subtract from this, so that the achievable flow rate emerges at the receiving transport.

Congestion notification with total load in the denominator can serve a similar purpose (though in retrospect, not in advance like XCP & Quick-Start).  Congestion notification is a dimensionless fraction, but each source can extract the necessary rate information from it because it already knows what its own rate is.  Even though congestion notification doesn't communicate a rate explicitly, from each source's point of view congestion notification represents the fraction of the rate it was sending a round trip ago that couldn't (or wouldn't) be served by available resources.  After they were sent, all these fractions of each source's offered load add up to the aggregate fraction of offered load seen by the congested resource.  So the source can also know the total excess rate by multiplying total load by the congestion level.  Therefore congestion notification, as one scale-free dimensionless fraction, implicitly communicates the instantaneous excess flow rate, albeit an RTT ago.

Appendix C.  Byte-mode Drop Complicates Policing Congestion Response

This appendix explains why the ability of networks to police the response of _any_ transport to congestion depends on bit-congestible network resources only doing packet-mode, not byte-mode, drop.

To be able to police a transport's response to congestion when fairness can only be judged over time and over all an individual's flows, the policer has to have an integrated view of all the congestion an individual (not just one flow) has caused due to all traffic entering the Internet from that individual.  This is termed congestion accountability.

But with byte-mode drop, one dropped or marked packet is not necessarily equivalent to another unless you know the MTU that caused it to be dropped/marked.  To have an integrated view of a user, we believe congestion policing has to be located at an individual's attachment point to the Internet [Re-TCP].  But from there it cannot know the MTU of each remote queue that caused each drop/mark.
Therefore it cannot take an integrated approach to policing all the responses to congestion of all the transports of one individual.  Therefore it cannot police anything.

The security/incentive argument _for_ packet-mode drop is similar.  Firstly, confining RED to packet-mode drop would not preclude bottleneck policing approaches such as [pBox], as it seems likely they could work just as well by monitoring the volume of dropped bytes rather than packets.  Secondly, packet-mode dropping/marking naturally allows the congestion notification of packets to be globally meaningful without relying on MTU information held elsewhere.

Because we recommend that a dropped/marked packet should be taken to mean that all the bytes in the packet are dropped/marked, a policer can remain robust against bits being re-divided into different size packets or across different size flows [Rate_fair_Dis].  Therefore policing would work naturally with just simple packet-mode drop in RED.

In summary, making drop probability depend on the size of the packets that bits happen to be divided into simply encourages the bits to be divided into smaller packets.  Byte-mode drop would therefore irreversibly complicate any attempt to fix the Internet's incentive structures.

Author's Address

   Bob Briscoe
   BT & UCL
   B54/77, Adastral Park
   Martlesham Heath
   Ipswich  IP5 3RE
   UK

   Phone: +44 1473 645196
   Email: bob.briscoe@bt.com
   URI:   http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Full Copyright Statement

Copyright (C) The IETF Trust (2008).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.
This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights.  Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard.  Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgment

This document was produced using xml2rfc v1.33 (of http://xml.resource.org/) from a source in RFC-2629 XML format.