2 Transport Area Working Group B. Briscoe 3 Internet-Draft BT 4 Updates: 2309 (if approved) J. Manner 5 Intended status: BCP Aalto University 6 Expires: August 23, 2012 February 20, 2012 8 Byte and Packet Congestion Notification 9 draft-ietf-tsvwg-byte-pkt-congest-07 11 Abstract 13 This memo concerns dropping or marking packets using active queue 14 management (AQM) such as random early detection (RED) or pre-congestion notification (PCN). We give three strong recommendations: 16 (1) packet size should be taken into account when transports read and 17 respond to congestion indications, (2) packet size should not be 18 taken into account when network equipment creates congestion signals 19 (marking, dropping), and therefore (3) the byte-mode packet drop 20 variant of the RED AQM algorithm that drops fewer small packets 21 should not be used. This memo updates RFC 2309 to deprecate deliberate preferential treatment of small packets in AQM algorithms. 23 Status of This Memo 25 This Internet-Draft is submitted in full conformance with the 26 provisions of BCP 78 and BCP 79.
28 Internet-Drafts are working documents of the Internet Engineering 29 Task Force (IETF). Note that other groups may also distribute 30 working documents as Internet-Drafts. The list of current Internet- 31 Drafts is at http://datatracker.ietf.org/drafts/current/. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on August 23, 2012. 40 Copyright Notice 42 Copyright (c) 2012 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 58 1.1. Terminology and Scoping . . . . . . . . . . . . . . . . . 6 59 1.2. Example Comparing Packet-Mode Drop and Byte-Mode Drop . . 7 60 2. Recommendations . . . . . . . . . . . . . . . . . . . . . . . 8 61 2.1. Recommendation on Queue Measurement . . . . . . . . . . . 9 62 2.2. Recommendation on Encoding Congestion Notification . . . . 9 63 2.3. Recommendation on Responding to Congestion . . . . . . . . 10 64 2.4. Recommendation on Handling Congestion Indications when 65 Splitting or Merging Packets . . . . . . . . . . . . . . . 11 66 3. Motivating Arguments . . . . . . . . . . . . . . . . . . . . . 11 67 3.1. 
Avoiding Perverse Incentives to (Ab)use Smaller Packets . 12 68 3.2. Small != Control . . . . . . . . . . . . . . . . . . . . . 13 69 3.3. Transport-Independent Network . . . . . . . . . . . . . . 13 70 3.4. Scaling Congestion Control with Packet Size . . . . . . . 14 71 3.5. Implementation Efficiency . . . . . . . . . . . . . . . . 16 72 4. A Survey and Critique of Past Advice . . . . . . . . . . . . . 16 73 4.1. Congestion Measurement Advice . . . . . . . . . . . . . . 16 74 4.1.1. Fixed Size Packet Buffers . . . . . . . . . . . . . . 17 75 4.1.2. Congestion Measurement without a Queue . . . . . . . . 18 76 4.2. Congestion Notification Advice . . . . . . . . . . . . . . 19 77 4.2.1. Network Bias when Encoding . . . . . . . . . . . . . . 19 78 4.2.2. Transport Bias when Decoding . . . . . . . . . . . . . 21 79 4.2.3. Making Transports Robust against Control Packet 80 Losses . . . . . . . . . . . . . . . . . . . . . . . . 22 81 4.2.4. Congestion Notification: Summary of Conflicting 82 Advice . . . . . . . . . . . . . . . . . . . . . . . . 22 83 5. Outstanding Issues and Next Steps . . . . . . . . . . . . . . 24 84 5.1. Bit-congestible Network . . . . . . . . . . . . . . . . . 24 85 5.2. Bit- & Packet-congestible Network . . . . . . . . . . . . 24 86 6. Security Considerations . . . . . . . . . . . . . . . . . . . 24 87 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 25 88 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 26 89 9. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 27 90 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 27 91 10.1. Normative References . . . . . . . . . . . . . . . . . . . 27 92 10.2. Informative References . . . . . . . . . . . . . . . . . . 27 93 Appendix A. Survey of RED Implementation Status . . . . . . . . . 31 94 Appendix B. Sufficiency of Packet-Mode Drop . . . . . . . . . . . 32 95 B.1. Packet-Size (In)Dependence in Transports . . . . . . . . . 33 96 B.2. 
Bit-Congestible and Packet-Congestible Indications . . . . 36 98 Appendix C. Byte-mode Drop Complicates Policing Congestion 99 Response . . . . . . . . . . . . . . . . . . . . . . 37 100 Appendix D. Changes from Previous Versions . . . . . . . . . . . 38 102 1. Introduction 104 This memo concerns how we should correctly scale congestion control 105 functions with packet size for the long term. It also recognises 106 that expediency may be necessary to deal with existing widely 107 deployed protocols that don't live up to the long-term goal. 109 When notifying congestion, the problem of how (and whether) to take 110 packet sizes into account has exercised the minds of researchers and 111 practitioners for as long as active queue management (AQM) has been 112 discussed. Indeed, one reason AQM was originally introduced was to 113 reduce the lock-out effects that small packets can have on large 114 packets in drop-tail queues. This memo aims to state the principles 115 we should be using and to outline how these principles will affect 116 future protocol design, taking into account the existing deployments 117 we already have. 119 The question of whether to take into account packet size arises at 120 three stages in the congestion notification process: 122 Measuring congestion: When a congested resource measures locally how 123 congested it is, should it measure its queue length in bytes or 124 packets? 126 Encoding congestion notification into the wire protocol: When a 127 congested network resource notifies its level of congestion, 128 should it drop / mark each packet dependent on the byte-size of 129 the particular packet in question? 131 Decoding congestion notification from the wire protocol: When a 132 transport interprets the notification in order to decide how much 133 to respond to congestion, should it take into account the byte-size of each missing or marked packet?
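As a non-normative illustration, the three stages can be sketched as three separate decisions. All function and field names below are invented for illustration, and the byte-mode branch in the second stage models the behaviour this memo goes on to deprecate:

```python
import random

def measure_queue(queue, bit_congestible):
    """Stage 1: measure local congestion.  Queue length is counted
    in bytes if the resource is congested by bytes, otherwise in
    packets."""
    if bit_congestible:
        return sum(pkt["size_bytes"] for pkt in queue)
    return len(queue)

def encode_notification(drop_prob, pkt, byte_mode=False, mtu_bytes=1500):
    """Stage 2: decide whether this packet carries a congestion
    signal (drop or ECN mark).  byte_mode=True deflates the
    probability in proportion to packet size, as RED's byte-mode
    drop variant does."""
    if byte_mode:
        drop_prob *= pkt["size_bytes"] / mtu_bytes
    return random.random() < drop_prob

def decode_notification(lost_or_marked):
    """Stage 3: the transport weighs each indication by the
    byte-size of the missing or marked packet."""
    return sum(pkt["size_bytes"] for pkt in lost_or_marked)
```

For example, with 60B and 1,500B packets, stage 3 counts a marked large packet as a 25 times stronger congestion signal than a marked small one.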
136 Consensus has emerged over the years concerning the first stage: 137 whether queues are measured in bytes or packets, termed byte-mode 138 queue measurement or packet-mode queue measurement. Section 2.1 of 139 this memo records this consensus in the RFC Series. In summary, the 140 choice solely depends on whether the resource is congested by bytes 141 or packets. 143 The controversy is mainly around the last two stages: whether to 144 allow for the size of the specific packet notifying congestion i) 145 when the network encodes or ii) when the transport decodes the 146 congestion notification. 148 Currently, the RFC series is silent on this matter other than a paper 149 trail of advice referenced from [RFC2309], which conditionally 150 recommends byte-mode (packet-size dependent) drop [pktByteEmail]. 151 Reducing drop of small packets certainly has some tempting 152 advantages: i) it drops fewer control packets, which tend to be small 153 and ii) it makes TCP's bit-rate less dependent on packet size. 154 However, there are ways of addressing these issues at the transport 155 layer, rather than reverse engineering network forwarding to fix the 156 problems. 158 This memo updates [RFC2309] to deprecate deliberate preferential 159 treatment of small packets in AQM algorithms. It recommends that (1) 160 packet size should be taken into account when transports read 161 congestion indications, (2) not when network equipment writes them. 163 In particular this means that the byte-mode packet drop variant of 164 Random Early Detection (RED) should not be used to drop fewer small 165 packets, because that creates a perverse incentive for transports to 166 use tiny segments, consequently also opening up a DoS vulnerability.
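The effect of recommendation (2) can be checked with quick arithmetic; the helper below is purely illustrative (its name is not from this memo), and Section 1.2 tabulates the same figures. Under size-independent (packet-mode) drop, flows of 60B and 1,500B packets at the same 48Mb/s bit-rate lose different numbers of packets but the same number of bits per second:

```python
def loss_rates(bit_rate_bps, pkt_size_bits, drop_prob):
    """Expected per-second losses under size-independent drop."""
    pkt_rate = bit_rate_bps / pkt_size_bits   # packets per second
    pkt_loss = drop_prob * pkt_rate           # packets dropped per second
    bit_loss = pkt_loss * pkt_size_bits       # bits dropped per second
    return pkt_loss, bit_loss

small = loss_rates(48e6, 480, 0.001)     # 60B packets:    ~100 pkt/s, ~48 kb/s
large = loss_rates(48e6, 12000, 0.001)   # 1,500B packets: ~4 pkt/s,   ~48 kb/s
```

Either way, about 48kb/s of the 48Mb/s is dropped, so every bit faces the same loss probability whatever size of packet it travels in; deflating the drop probability for small packets (byte-mode drop) would instead shield the flow of small packets.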
167 Fortunately, none of the RED implementers who responded to our 168 admittedly limited survey (Section 4.2.4) followed the earlier advice 169 to use byte-mode drop, so the position this memo argues for seems to 170 already exist in implementations. 172 However, at the transport layer, TCP congestion control is a widely 173 deployed protocol that doesn't scale with packet size. To date this 174 hasn't been a significant problem because most TCP implementations 175 have been used with similar packet sizes. But, as we design new 176 congestion control mechanisms, the current recommendation is that we 177 should build in scaling with packet size rather than assuming we 178 should follow TCP's example. 180 This memo continues as follows. First it discusses terminology and 181 scoping. Section 2 gives the concrete formal recommendations, 182 followed by motivating arguments in Section 3. We then critically 183 survey the advice given previously in the RFC series and the research 184 literature (Section 4), referring to an assessment of whether or not 185 this advice has been followed in production networks (Appendix A). 186 To wrap up, outstanding issues are discussed that will need 187 resolution both to inform future protocol designs and to handle 188 legacy (Section 5). Then security issues are collected together in 189 Section 6 before conclusions are drawn in Section 7. The interested 190 reader can find discussion of more detailed issues on the theme of 191 byte vs. packet in the appendices. 193 This memo intentionally includes a non-negligible amount of material 194 on the subject. For the busy reader, Section 2 summarises the 195 recommendations for the Internet community. 197 1.1. Terminology and Scoping 199 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 200 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 201 document are to be interpreted as described in [RFC2119].
203 Congestion Notification: Congestion notification is a changing 204 signal that aims to communicate the probability that the network 205 resource(s) will not be able to forward the level of traffic load 206 offered (or that there is an impending risk that they will not be 207 able to). 209 The `impending risk' qualifier is added, because AQM systems (e.g. 210 RED, PCN [RFC5670]) set a virtual limit smaller than the actual 211 limit to the resource, then notify when this virtual limit is 212 exceeded in order to avoid uncontrolled congestion of the actual 213 capacity. 215 Congestion notification communicates a real number bounded by the 216 range [0,1]. This ties in with the most well-understood measure 217 of congestion notification: drop probability. 219 Explicit and Implicit Notification: The byte vs. packet dilemma 220 concerns congestion notification irrespective of whether it is 221 signalled implicitly by drop or using explicit congestion 222 notification (ECN [RFC3168] or PCN [RFC5670]). Throughout this 223 document, unless clear from the context, the term marking will be 224 used to mean notifying congestion explicitly, while congestion 225 notification will be used to mean notifying congestion either 226 implicitly by drop or explicitly by marking. 228 Bit-congestible vs. Packet-congestible: If the load on a resource 229 depends on the rate at which packets arrive, it is called packet- 230 congestible. If the load depends on the rate at which bits arrive 231 it is called bit-congestible. 233 Examples of packet-congestible resources are route look-up engines 234 and firewalls, because load depends on how many packet headers 235 they have to process. Examples of bit-congestible resources are 236 transmission links, radio power and most buffer memory, because 237 the load depends on how many bits they have to transmit or store. 
238 Some machine architectures use fixed size packet buffers, so 239 buffer memory in these cases is packet-congestible (see 240 Section 4.1.1). 242 Currently a design goal of network processing equipment such as 243 routers and firewalls is to keep packet processing uncongested 244 even under worst case packet rates with runs of minimum size 245 packets. Therefore, packet-congestion is currently rare [RFC6077; 246 S.3.3], but there is no guarantee that it will not become more 247 common in future. 249 Note that information is generally processed or transmitted with a 250 minimum granularity greater than a bit (e.g. octets). The 251 appropriate granularity for the resource in question should be 252 used, but for the sake of brevity we will talk in terms of bytes 253 in this memo. 255 Coarser Granularity: Resources may be congestible at higher levels 256 of granularity than bits or packets, for instance stateful 257 firewalls are flow-congestible and call-servers are session- 258 congestible. This memo focuses on congestion of connectionless 259 resources, but the same principles may be applicable for 260 congestion notification protocols controlling per-flow and per- 261 session processing or state. 263 RED Terminology: In RED whether to use packets or bytes when 264 measuring queues is called respectively "packet-mode queue 265 measurement" or "byte-mode queue measurement". And whether the 266 probability of dropping a particular packet is independent or 267 dependent on its byte-size is called respectively "packet-mode 268 drop" or "byte-mode drop". The terms byte-mode and packet-mode 269 should not be used without specifying whether they apply to queue 270 measurement or to drop. 272 1.2. Example Comparing Packet-Mode Drop and Byte-Mode Drop 274 A central question addressed by this document is whether to recommend 275 that AQM uses RED's packet-mode drop and to deprecate byte-mode drop. 
276 Table 1 compares how packet-mode and byte-mode drop affect two flows 277 of different size packets. For each it gives the expected number of 278 packets and of bits dropped in one second. Each example flow runs at 279 the same bit-rate of 48Mb/s, but one is broken up into small 60 byte 280 packets and the other into large 1500 byte packets.

282 To keep up the same bit-rate, in one second there are about 25 times 283 more small packets because they are 25 times smaller. As can be seen 284 from the table, the packet rate is 100,000 small packets versus 4,000 285 large packets per second (pps).

287 Parameter              Formula         Small packets   Large packets
288 --------------------   --------------  -------------   -------------
289 Packet size            s/8             60B             1,500B
290 Packet size            s               480b            12,000b
291 Bit-rate               x               48Mbps          48Mbps
292 Packet-rate            u = x/s         100kpps         4kpps

294 Packet-mode Drop
295 Pkt loss probability   p               0.1%            0.1%
296 Pkt loss-rate          p*u             100pps          4pps
297 Bit loss-rate          p*u*s           48kbps          48kbps

299 Byte-mode Drop         MTU, M=12,000b
300 Pkt loss probability   b = p*s/M       0.004%          0.1%
301 Pkt loss-rate          b*u             4pps            4pps
302 Bit loss-rate          b*u*s           1.92kbps        48kbps

304 Table 1: Example Comparing Packet-mode and Byte-mode Drop

306 For packet-mode drop, we illustrate the effect of a drop probability 307 of 0.1%, which the algorithm applies to all packets irrespective of 308 size. Because there are 25 times more small packets in one second, 309 it naturally drops 25 times more small packets, that is 100 small 310 packets but only 4 large packets. But if we count how many bits it 311 drops, there are 48,000 bits in 100 small packets and 48,000 bits in 312 4 large packets--the same number of bits of small packets as large.

314 The packet-mode drop algorithm drops any bit with the same 315 probability whether the bit is in a small or a large packet.

317 For byte-mode drop, again we use an example drop probability of 0.1%, 318 but only for maximum size packets (assuming the link MTU is 1,500B or 319 12,000b).
The byte-mode algorithm reduces the drop probability of 320 smaller packets proportional to their size, making the probability 321 that it drops a small packet 25 times smaller at 0.004%. But there 322 are 25 times more small packets, so dropping them with 25 times lower 323 probability results in dropping the same number of packets: 4 drops 324 in both cases. The 4 small dropped packets contain 25 times fewer 325 bits than the 4 large dropped packets: 1,920 compared to 48,000. 327 The byte-mode drop algorithm drops any bit with a probability 328 proportionate to the size of the packet it is in. 330 2. Recommendations 332 This section gives recommendations related to network equipment in 333 Sections 2.1 and 2.2, and then Sections 2.3 and 2.4 discuss the 334 implications for transport protocols. 336 2.1. Recommendation on Queue Measurement 338 Queue length is usually the most correct and simplest way to measure 339 congestion of a resource. To avoid the pathological effects of drop 340 tail, an AQM function can then be used to transform queue length into 341 the probability of dropping or marking a packet (e.g. RED's 342 piecewise linear function between thresholds). 344 If the resource is bit-congestible, the implementation SHOULD measure 345 the length of the queue in bytes. If the resource is packet-congestible, the implementation SHOULD measure the length of the 347 queue in packets. No other choice makes sense, because the number of 348 packets waiting in the queue isn't relevant if the resource gets 349 congested by bytes and vice versa. 351 What this advice means for the case of RED: 353 1. A RED implementation SHOULD use byte-mode queue measurement for 354 measuring the congestion of bit-congestible resources and packet-mode 355 queue measurement for packet-congestible resources. 357 2.
An implementation SHOULD NOT make it possible to configure the 358 way a queue measures itself, because whether a queue is bit- 359 congestible or packet-congestible is an inherent property of the 360 queue. 362 The recommended approach in less straightforward scenarios, such as 363 fixed size buffers, and resources without a queue, is discussed in 364 Section 4.1. 366 2.2. Recommendation on Encoding Congestion Notification 368 When encoding congestion notification (e.g. by drop, ECN & PCN), a 369 network device SHOULD treat all packets equally, regardless of their 370 size. In other words, the probability that network equipment drops 371 or marks a particular packet to notify congestion SHOULD NOT depend 372 on the size of the packet in question. As the example in Section 1.2 373 illustrates, to drop any bit with probability 0.1% it is only 374 necessary to drop every packet with probability 0.1% without regard 375 to the size of each packet. 377 This approach ensures the network layer offers sufficient congestion 378 information for all known and future transport protocols and also 379 ensures no perverse incentives are created that would encourage 380 transports to use inappropriately small packet sizes. 382 What this advice means for the case of RED: 384 1. AQM algorithms such as RED SHOULD NOT use byte-mode drop, which 385 deflates RED's drop probability for smaller packet sizes. RED's 386 byte-mode drop has no enduring advantages. It is more complex, 387 it creates the perverse incentive to fragment segments into tiny 388 pieces and it reopens the vulnerability to floods of small- 389 packets that drop-tail queues suffered from and AQM was designed 390 to remove. 392 2. If a vendor has implemented byte-mode drop, and an operator has 393 turned it on, it is RECOMMENDED to turn it off. Note that RED as 394 a whole SHOULD NOT be turned off, as without it, a drop tail 395 queue also biases against large packets. 
But note also that 396 turning off byte-mode drop may alter the relative performance of 397 applications using different packet sizes, so it would be 398 advisable to establish the implications before turning it off. 400 Note well that RED's byte-mode drop is completely 401 orthogonal to byte-mode queue measurement and should not be 402 confused with it. If a RED implementation has a byte-mode but 403 does not specify what sort of byte-mode, it is most probably 404 byte-mode queue measurement, which is fine. However, if in 405 doubt, the vendor should be consulted. 407 A survey (Appendix A) showed that there appears to be little, if any, 408 installed base of the byte-mode drop variant of RED. This suggests 409 that deprecating byte-mode drop will have little, if any, incremental 410 deployment impact. 412 2.3. Recommendation on Responding to Congestion 414 When a transport detects that a packet has been lost or congestion 415 marked, it SHOULD consider the strength of the congestion indication 416 as proportionate to the size in octets (bytes) of the missing or 417 marked packet. 419 In other words, when a packet indicates congestion (by being lost or 420 marked) it can be considered conceptually as if there is a congestion 421 indication on every octet of the packet, not just one indication per 422 packet. 424 Therefore, the IETF transport area should continue its programme of: 426 o updating host-based congestion control protocols to take account 427 of packet size 429 o making transports less sensitive to losing control packets like 430 SYNs and pure ACKs. 432 What this advice means for the case of TCP: 434 1. If two TCP flows with different packet sizes are required to run 435 at equal bit rates under the same path conditions, this should be 436 done by altering TCP (Section 4.2.2), not network equipment (the 437 latter affects other transports besides TCP). 439 2.
If it is desired to improve TCP performance by reducing the 440 chance that a SYN or a pure ACK will be dropped, this should be 441 done by modifying TCP (Section 4.2.3), not network equipment. 443 2.4. Recommendation on Handling Congestion Indications when Splitting 444 or Merging Packets 446 Packets carrying congestion indications may be split or merged in 447 some circumstances (e.g. at an RTCP transcoder or during IP fragment 448 reassembly). Splitting and merging only make sense in the context of 449 ECN, not loss. 451 The general rule to follow is that the number of octets in packets 452 with congestion indications SHOULD be equivalent before and after 453 merging or splitting. This is based on the principle used above: 454 that an indication of congestion on a packet can be considered as an 455 indication of congestion on each octet of the packet. 457 The above rule is not phrased with the word "MUST" to allow the 458 following exception. There are cases where pre-existing protocols 459 were not designed to conserve congestion marked octets (e.g. IP 460 fragment reassembly [RFC3168] or loss statistics in RTCP receiver 461 reports [RFC3550] before ECN was added 462 [I-D.ietf-avtcore-ecn-for-rtp]). When any such protocol is updated, 463 it SHOULD comply with the above rule to conserve marked octets. 464 However, the rule may be relaxed if it would otherwise become too 465 complex to interoperate with pre-existing implementations of the 466 protocol. 468 One can think of a splitting or merging process as if all the 469 incoming congestion-marked octets increment a counter and all the 470 outgoing marked octets decrement the same counter. In order to 471 ensure that congestion indications remain timely, even the smallest 472 positive remainder in the conceptual counter should trigger the next 473 outgoing packet to be marked (causing the counter to go negative). 475 3. Motivating Arguments 477 This section is informative.
It justifies the recommendations given 478 in the previous section. 480 3.1. Avoiding Perverse Incentives to (Ab)use Smaller Packets 482 Increasingly, it is being recognised that a protocol design must take 483 care not to cause unintended consequences by giving the parties in 484 the protocol exchange perverse incentives [Evol_cc][RFC3426]. Given 485 there are many good reasons why larger path maximum transmission 486 units (PMTUs) would help solve a number of scaling issues, we do not 487 want to create any bias against large packets that is greater than 488 their true cost. 490 On a bit-congestible link, the same bit rate of packets contributes 491 the same to congestion irrespective of whether it is 492 sent as fewer larger packets or more smaller packets. A protocol 493 design that caused larger packets to be more likely to be dropped 494 than smaller ones would be dangerous in both the following cases: 496 Malicious transports: A queue that gives an advantage to small 497 packets can be used to amplify the force of a flooding attack. By 498 sending a flood of small packets, the attacker can get the queue 499 to discard more traffic in large packets, allowing more attack 500 traffic to get through to cause further damage. Such a queue 501 allows attack traffic to have a disproportionately large effect on 502 regular traffic without the attacker having to do much work. 504 Non-malicious transports: Even if a transport designer is not 505 actually malicious, if over time it is noticed that small packets 506 tend to go faster, designers will act in their own interest and 507 use smaller packets. Queues that give an advantage to small packets 508 create an evolutionary pressure for transports to send at the same 509 bit-rate but break their data stream down into tiny segments to 510 reduce their drop rate.
Encouraging a high volume of tiny packets 511 might in turn unnecessarily overload a completely unrelated part 512 of the system, perhaps more limited by header-processing than 513 bandwidth. 515 Imagine two unresponsive flows arrive at a bit-congestible 516 transmission link each with the same bit rate, say 1Mbps, but one 517 consists of 1500B and the other 60B packets, which are 25x smaller. 518 Consider a scenario where gentle RED [gentle_RED] is used, along with 519 the variant of RED we advise against, i.e. where the RED algorithm is 520 configured to adjust the drop probability of packets in proportion to 521 each packet's size (byte mode packet drop). In this case, RED aims 522 to drop 25x more of the larger packets than the smaller ones. Thus, 523 for example if RED drops 25% of the larger packets, it will aim to 524 drop 1% of the smaller packets (but in practice it may drop more as 525 congestion increases [RFC4828; Appx B.4]). Even though both flows 526 arrive with the same bit rate, the bit rate the RED queue aims to 527 pass to the line will be 750kbps for the flow of larger packets but 528 990kbps for the smaller packets (because of rate variations it will 529 actually be a little less than this target). 531 Note that, although the byte-mode drop variant of RED amplifies small 532 packet attacks, drop-tail queues amplify small packet attacks even 533 more (see Security Considerations in Section 6). Wherever possible 534 neither should be used. 536 3.2. Small != Control 538 Dropping fewer control packets considerably improves performance. It 539 is tempting to drop small packets with lower probability in order to 540 improve performance, because many control packets are small (TCP SYNs 541 & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc). 
542 However, we must not give control packets preference purely by virtue 543 of their smallness, otherwise it is too easy for any data source to 544 get the same preferential treatment simply by sending data in smaller 545 packets. Again, we should not create perverse incentives that favour 546 small packets when what we intend is to favour control packets. 549 Just because many control packets are small does not mean all small 550 packets are control packets. 552 So, rather than fix these problems in the network, we argue that the 553 transport should be made more robust against losses of control 554 packets (see 'Making Transports Robust against Control Packet Losses' 555 in Section 4.2.3). 557 3.3. Transport-Independent Network 559 TCP congestion control ensures that flows competing for the same 560 resource each maintain the same number of segments in flight, 561 irrespective of segment size. So under similar conditions, flows 562 with different segment sizes will get different bit-rates. 564 One motivation for the network biasing congestion notification by 565 packet size is to counter this effect and try to equalise the 566 bit-rates of flows with different packet sizes. However, in order to do 567 this, the queuing algorithm has to make assumptions about the 568 transport, which become embedded in the network. Specifically: 570 o The queuing algorithm has to assume how aggressively the transport 571 will respond to congestion (see Section 4.2.4). If the network 572 assumes the transport responds as aggressively as TCP NewReno, it 573 will be wrong for Compound TCP and differently wrong for Cubic 574 TCP, etc. To achieve equal bit-rates, each transport then has to 575 guess what assumption the network made, and work out how to 576 replace this assumed aggressiveness with its own aggressiveness.
578 o Also, if the network biases congestion notification by packet size 579 it has to assume a baseline packet size--all proposed algorithms 580 use the local MTU. Then transports have to guess which link was 581 congested and what its local MTU was, in order to know how to 582 tailor their congestion response to that link. 584 Even though reducing the drop probability of small packets (e.g. 585 RED's byte-mode drop) helps ensure TCP flows with different packet 586 sizes will achieve similar bit rates, we argue this correction should 587 be made to any future transport protocols based on TCP, not to the 588 network in order to fix one transport, no matter how predominant it 589 is. Effectively, favouring small packets is reverse engineering of 590 network equipment around one particular transport protocol (TCP), 591 contrary to the excellent advice in [RFC3426], which asks designers 592 to question "Why are you proposing a solution at this layer of the 593 protocol stack, rather than at another layer?" 595 In contrast, if the network never takes account of packet size, the 596 transport can be certain it will never need to guess any assumptions 597 the network has made. And the network passes two pieces of 598 information to the transport that are sufficient in all cases: i) 599 congestion notification on the packet and ii) the size of the packet. 600 Both are available for the transport to combine (by taking account of 601 packet size when responding to congestion) or not. Appendix B checks 602 that these two pieces of information are sufficient for all relevant 603 scenarios. 605 When the network does not take account of packet size, it allows 606 transport protocols to choose whether to take account of packet size 607 or not. 
However, if the network were to bias congestion notification 608 by packet size, transport protocols would have no choice; those that 609 did not take account of packet size themselves would unwittingly 610 become dependent on packet size, and those that already took account 611 of packet size would end up taking account of it twice. 613 3.4. Scaling Congestion Control with Packet Size 615 Having so far justified only our recommendations for the network, 616 this section focuses on the host. We construct a scaling argument to 617 justify the recommendation that a host should respond to a dropped or 618 marked packet in proportion to its size, not just as a single 619 congestion event. 621 The argument assumes that we have already sufficiently justified our 622 recommendation that the network should not take account of packet 623 size. 625 Also, we assume bit-congestible links are the predominant source of 626 congestion. As the Internet stands, it is hard if not impossible to 627 know whether congestion notification is from a bit-congestible or a 628 packet-congestible resource (see Appendix B.2) so we have to assume 629 the most prevalent case (see Section 1.1). If this assumption is 630 wrong, and particular congestion indications are actually due to 631 overload of packet-processing, there is no issue of safety at stake. 632 Any congestion control that triggers a multiplicative decrease in 633 response to a congestion indication will bring packet processing back 634 to its operating point just as quickly. The only issue at stake is 635 that the resource could be utilised more efficiently if packet- 636 congestion could be separately identified. 638 Imagine a bit-congestible link shared by many flows, so that each 639 busy period tends to cause packets to be lost from different flows. 640 Consider further two sources that have the same data rate but break 641 the load into large packets in one application (A) and small packets 642 in the other (B). 
Of course, because the load is the same, there will be proportionately more packets in the small packet flow (B).

If a congestion control scales with packet size it should respond in the same way to the same congestion notification, irrespective of the size of the packets containing the bytes that contribute to congestion.

A bit-congestible queue suffering congestion has to drop or mark the same excess bytes whether they are in a few large packets (A) or many small packets (B). So for the same amount of congestion overload, the same number of bytes has to be shed to get the load back to its operating point. For smaller packets (B) more packets will have to be discarded to shed the same bytes.

If both transports interpret each drop/mark as a single loss event irrespective of the size of the packet dropped, the flow of smaller packets (B) will respond more times to the same congestion. On the other hand, if a transport responds proportionately less when smaller packets are dropped/marked, overall it will be able to respond the same to the same amount of congestion.

Therefore, for a congestion control to scale with packet size it should respond to dropped or marked bytes (as TFRC-SP [RFC4828] effectively does), instead of dropped or marked packets (as TCP does).

For the avoidance of doubt, this is not a recommendation that TCP should be changed so that it scales with packet size. It is a recommendation that any future transport protocol proposal should respond to dropped or marked bytes if it wishes to claim that it is scalable.

3.5.
Implementation Efficiency

Allowing for packet size at the transport rather than in the network ensures that neither the network nor the transport needs to do a multiply operation--multiplication by packet size is effectively achieved as a repeated add when the transport adds to its count of marked bytes as each congestion event is fed to it. This isn't a principled reason in itself, but it is a happy consequence of the other principled reasons.

4. A Survey and Critique of Past Advice

This section is informative, not normative.

The original 1993 paper on RED [RED93] proposed two options for the RED active queue management algorithm: packet mode and byte mode. Packet mode measured the queue length in packets and dropped (or marked) individual packets with a probability independent of their size. Byte mode measured the queue length in bytes and marked an individual packet with probability in proportion to its size (relative to the maximum packet size). In the paper's outline of further work, it was stated that no recommendation had been made on whether the queue size should be measured in bytes or packets, but it was noted that the difference could be significant.

When RED was recommended for general deployment in 1998 [RFC2309], the two modes were mentioned, implying that the choice between them was a question of performance, and referring to a 1997 email [pktByteEmail] for advice on tuning. A later addendum to this email introduced the insight that there are in fact two orthogonal choices:

o whether to measure queue length in bytes or packets (Section 4.1)

o whether the drop probability of an individual packet should depend on its own size (Section 4.2).

The rest of this section is structured accordingly.

4.1. Congestion Measurement Advice

The choice of which metric to use to measure queue length was left open in RFC2309.
It is now well understood that queues for bit-congestible resources should be measured in bytes, and queues for packet-congestible resources should be measured in packets [pktByteEmail].

Congestion in some legacy bit-congestible buffers is only measured in packets, not bytes. In such cases, the operator has to set the thresholds mindful of a typical mix of packet sizes. Any AQM algorithm on such a buffer will be oversensitive to high proportions of small packets, e.g. a DoS attack, and undersensitive to high proportions of large packets. However, there is no need to make allowances for the possibility of such legacy in future protocol design. This is safe because any undersensitivity during unusual traffic mixes cannot lead to congestion collapse given the buffer will eventually revert to tail drop, discarding proportionately more large packets.

4.1.1. Fixed Size Packet Buffers

The question of whether to measure queues in bytes or packets seems to be well understood. However, measuring congestion is not straightforward when the resource is bit congestible but the queue is packet congestible or vice versa. This section outlines the approach to take. There is no controversy over what should be done; you just need to be an expert in probability to work it out. And, even if you know what should be done, it's not always easy to find a practical algorithm to implement it.

Some, mostly older, queuing hardware sets aside fixed sized buffers in which to store each packet in the queue. Also, with some hardware, any fixed sized buffers not completely filled by a packet are padded when transmitted to the wire.
If we imagine a theoretical 748 forwarding system with both queuing and transmission in fixed, MTU- 749 sized units, it should clearly be treated as packet-congestible, 750 because the queue length in packets would be a good model of 751 congestion of the lower layer link. 753 If we now imagine a hybrid forwarding system with transmission delay 754 largely dependent on the byte-size of packets but buffers of one MTU 755 per packet, it should strictly require a more complex algorithm to 756 determine the probability of congestion. It should be treated as two 757 resources in sequence, where the sum of the byte-sizes of the packets 758 within each packet buffer models congestion of the line while the 759 length of the queue in packets models congestion of the queue. Then 760 the probability of congesting the forwarding buffer would be a 761 conditional probability--conditional on the previously calculated 762 probability of congesting the line. 764 In systems that use fixed size buffers, it is unusual for all the 765 buffers used by an interface to be the same size. Typically pools of 766 different sized buffers are provided (Cisco uses the term 'buffer 767 carving' for the process of dividing up memory into these pools 768 [IOSArch]). Usually, if the pool of small buffers is exhausted, 769 arriving small packets can borrow space in the pool of large buffers, 770 but not vice versa. However, it is easier to work out what should be 771 done if we temporarily set aside the possibility of such borrowing. 772 Then, with fixed pools of buffers for different sized packets and no 773 borrowing, the size of each pool and the current queue length in each 774 pool would both be measured in packets. So an AQM algorithm would 775 have to maintain the queue length for each pool, and judge whether to 776 drop/mark a packet of a particular size by looking at the pool for 777 packets of that size and using the length (in packets) of its queue. 
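The per-pool arrangement just described (fixed pools of buffers for different packet sizes, no borrowing, each pool's queue measured in packets) can be sketched as follows. This is a sketch only, not from the draft: the pool sizes are hypothetical, and a linear drop curve stands in for a real RED curve with min/max thresholds and queue averaging.

```python
import random

# Hypothetical buffer pools: (largest packet the pool's buffers hold,
# number of buffers in the pool).  Real hardware carves memory into
# several such pools ('buffer carving' [IOSArch]).
POOLS = [
    (128, 200),    # pool of small buffers
    (1500, 100),   # pool of large (MTU-sized) buffers
]

queue_len = [0, 0]  # per-pool queue length, measured in *packets*

def pool_for(pkt_size):
    """Index of the smallest pool whose buffers can hold the packet."""
    for i, (buf_size, _) in enumerate(POOLS):
        if pkt_size <= buf_size:
            return i
    raise ValueError("packet larger than any buffer")

def drop_probability(qlen, qmax):
    """Stand-in AQM curve: drop probability rises linearly with the
    pool's queue length in packets (real RED would use min/max
    thresholds and an averaged queue length)."""
    return min(1.0, qlen / qmax)

def on_arrival(pkt_size):
    """Judge the packet against the queue of the pool for its size;
    return True if it should be dropped/marked, else enqueue it."""
    i = pool_for(pkt_size)
    qmax = POOLS[i][1]
    if random.random() < drop_probability(queue_len[i], qmax):
        return True
    queue_len[i] += 1  # one buffer = one packet, whatever its byte-size
    return False
```

Note that every queue here is measured in packets, not bytes, in line with the rule given at the end of this subsection.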
We now return to the issue we temporarily set aside: small packets borrowing space in larger buffers. In this case, the only difference is that the pools for smaller packets have a maximum queue size that includes all the pools for larger packets. And every time a packet takes a larger buffer, the current queue size has to be incremented for all queues in the pools of buffers less than or equal to the buffer size used.

We will return to borrowing of fixed sized buffers when we discuss biasing the drop/marking probability of a specific packet because of its size in Section 4.2.1. But here we can give at least one simple rule for how to measure the length of queues of fixed buffers: no matter how complicated the scheme is, ultimately any fixed buffer system will need to measure its queue length in packets, not bytes.

4.1.2. Congestion Measurement without a Queue

AQM algorithms are nearly always described assuming there is a queue for a congested resource and the algorithm can use the queue length to determine the probability that it will drop or mark each packet. But not all congested resources lead to queues. For instance, wireless spectrum is usually regarded as bit-congestible (for a given coding scheme). But wireless link protocols do not always maintain a queue that depends on spectrum interference. Similarly, power limited resources are also usually bit-congestible if energy is primarily required for transmission rather than header processing, but it is rare for a link protocol to build a queue as it approaches maximum power.

Nonetheless, AQM algorithms do not require a queue in order to work. For instance spectrum congestion can be modelled by signal quality using target bit-energy-to-noise-density ratio. And, to model radio power exhaustion, transmission power levels can be measured and compared to the maximum power available.
[ECNFixedWireless] proposes 813 a practical and theoretically sound way to combine congestion 814 notification for different bit-congestible resources at different 815 layers along an end to end path, whether wireless or wired, and 816 whether with or without queues. 818 4.2. Congestion Notification Advice 820 4.2.1. Network Bias when Encoding 822 4.2.1.1. Advice on Packet Size Bias in RED 824 The previously mentioned email [pktByteEmail] referred to by 825 [RFC2309] advised that most scarce resources in the Internet were 826 bit-congestible, which is still believed to be true (Section 1.1). 827 But it went on to offer advice that is updated by this memo. It said 828 that drop probability should depend on the size of the packet being 829 considered for drop if the resource is bit-congestible, but not if it 830 is packet-congestible. The argument continued that if packet drops 831 were inflated by packet size (byte-mode dropping), "a flow's fraction 832 of the packet drops is then a good indication of that flow's fraction 833 of the link bandwidth in bits per second". This was consistent with 834 a referenced policing mechanism being worked on at the time for 835 detecting unusually high bandwidth flows, eventually published in 836 1999 [pBox]. However, the problem could and should have been solved 837 by making the policing mechanism count the volume of bytes randomly 838 dropped, not the number of packets. 840 A few months before RFC2309 was published, an addendum was added to 841 the above archived email referenced from the RFC, in which the final 842 paragraph seemed to partially retract what had previously been said. 843 It clarified that the question of whether the probability of 844 dropping/marking a packet should depend on its size was not related 845 to whether the resource itself was bit congestible, but a completely 846 orthogonal question. 
However, the only example given had the queue measured in packets but packet drop depended on the byte-size of the packet in question. No example was given the other way round.

In 2000, Cnodder et al [REDbyte] pointed out that there was an error in the part of the original 1993 RED algorithm that aimed to distribute drops uniformly, because it didn't correctly take into account the adjustment for packet size. They recommended an algorithm called RED_4 to fix this. But they also recommended a further change, RED_5, to adjust drop rate dependent on the square of relative packet size. This was indeed consistent with one implied motivation behind RED's byte mode drop--that we should reverse engineer the network to improve the performance of dominant end-to-end congestion control mechanisms. This memo makes a different recommendation in Section 2.

By 2003, a further change had been made to the adjustment for packet size, this time in the RED algorithm of the ns2 simulator. Instead of taking each packet's size relative to a `maximum packet size' it was taken relative to a `mean packet size', intended to be a static value representative of the `typical' packet size on the link. We have not been able to find a justification in the literature for this change; however, Eddy and Allman conducted experiments [REDbias] that assessed how sensitive RED was to this parameter, amongst other things. In any case, this changed algorithm can often lead to drop probabilities greater than 1 (which gives a hint that there is probably a mistake in the theory somewhere).

On 10-Nov-2004, this variant of byte-mode packet drop was made the default in the ns2 simulator.
It seems unlikely that byte-mode drop has ever been implemented in production networks (Appendix A); therefore ns2 simulations that use RED without disabling byte-mode drop are likely to behave very differently from RED in production networks, and conclusions drawn from such simulations should be treated with caution.

4.2.1.2. Packet Size Bias Regardless of RED

The byte-mode drop variant of RED is, of course, not the only possible bias towards small packets in queueing systems. We have already mentioned that tail-drop queues naturally tend to lock out large packets once they are full. But also queues with fixed sized buffers reduce the probability that small packets will be dropped if (and only if) they allow small packets to borrow buffers from the pools for larger packets. As was explained in Section 4.1.1 on fixed size buffer carving, borrowing effectively makes the maximum queue size for small packets greater than that for large packets, because more buffers can be used by small packets while fewer will fit large packets.

In itself, the bias towards small packets caused by buffer borrowing is perfectly correct. Lower drop probability for small packets is legitimate in buffer borrowing schemes, because small packets genuinely congest the machine's buffer memory less than large packets, given they can fit in more spaces. The bias towards small packets is not artificially added (as it is in RED's byte-mode drop algorithm); it merely reflects the reality of the way fixed buffer memory gets congested. Incidentally, the bias towards small packets from buffer borrowing is nothing like as large as that of RED's byte-mode drop.

Nonetheless, fixed-buffer memory with tail drop is still prone to lock out large packets, purely because of the tail-drop aspect. So a good AQM algorithm like RED with packet-mode drop should be used with fixed buffer memories where possible.
If RED is too complicated to 910 implement with multiple fixed buffer pools, the minimum necessary to 911 prevent large packet lock-out is to ensure smaller packets never use 912 the last available buffer in any of the pools for larger packets. 914 4.2.2. Transport Bias when Decoding 916 The above proposals to alter the network equipment to bias towards 917 smaller packets have largely carried on outside the IETF process. 918 Whereas, within the IETF, there are many different proposals to alter 919 transport protocols to achieve the same goals, i.e. either to make 920 the flow bit-rate take account of packet size, or to protect control 921 packets from loss. This memo argues that altering transport 922 protocols is the more principled approach. 924 A recently approved experimental RFC adapts its transport layer 925 protocol to take account of packet sizes relative to typical TCP 926 packet sizes. This proposes a new small-packet variant of TCP- 927 friendly rate control [RFC5348] called TFRC-SP [RFC4828]. 928 Essentially, it proposes a rate equation that inflates the flow rate 929 by the ratio of a typical TCP segment size (1500B including TCP 930 header) over the actual segment size [PktSizeEquCC]. (There are also 931 other important differences of detail relative to TFRC, such as using 932 virtual packets [CCvarPktSize] to avoid responding to multiple losses 933 per round trip and using a minimum inter-packet interval.) 935 Section 4.5.1 of this TFRC-SP spec discusses the implications of 936 operating in an environment where queues have been configured to drop 937 smaller packets with proportionately lower probability than larger 938 ones. But it only discusses TCP operating in such an environment, 939 only mentioning TFRC-SP briefly when discussing how to define 940 fairness with TCP. 
And it only discusses the byte-mode dropping 941 version of RED as it was before Cnodder et al pointed out it didn't 942 sufficiently bias towards small packets to make TCP independent of 943 packet size. 945 So the TFRC-SP spec doesn't address the issue of which of the network 946 or the transport _should_ handle fairness between different packet 947 sizes. In its Appendix B.4 it discusses the possibility of both 948 TFRC-SP and some network buffers duplicating each other's attempts to 949 deliberately bias towards small packets. But the discussion is not 950 conclusive, instead reporting simulations of many of the 951 possibilities in order to assess performance but not recommending any 952 particular course of action. 954 The paper originally proposing TFRC with virtual packets (VP-TFRC) 955 [CCvarPktSize] proposed that there should perhaps be two variants to 956 cater for the different variants of RED. However, as the TFRC-SP 957 authors point out, there is no way for a transport to know whether 958 some queues on its path have deployed RED with byte-mode packet drop 959 (except if an exhaustive survey found that no-one has deployed it!-- 960 see Appendix A). Incidentally, VP-TFRC also proposed that byte-mode 961 RED dropping should really square the packet-size compensation-factor 962 (like that of Cnodder's RED_5, but apparently unaware of it). 964 Pre-congestion notification [RFC5670] is an IETF technology to use a 965 virtual queue for AQM marking for packets within one Diffserv class 966 in order to give early warning prior to any real queuing. The PCN 967 marking algorithms have been designed not to take account of packet 968 size when forwarding through queues. Instead the general principle 969 has been to take account of the sizes of marked packets when 970 monitoring the fraction of marking at the edge of the network, as 971 recommended here. 973 4.2.3. 
Making Transports Robust against Control Packet Losses

Recently, two RFCs have defined changes to TCP that make it more robust against losing small control packets [RFC5562] [RFC5690]. In both cases they note that the case for these two TCP changes would be weaker if RED were biased against dropping small packets. We argue here that these two proposals are a safer and more principled way to achieve TCP performance improvements than reverse engineering RED to benefit TCP.

Although there are no known proposals, it would also be possible and perfectly valid to make control packets robust against drop by using their Diffserv code point [RFC2474] to explicitly request a scheduling class with a lower drop probability.

Although not brought to the IETF, a simple proposal from Wischik [DupTCP] suggests that the first three packets of every TCP flow should be routinely duplicated after a short delay. It shows that this would greatly improve the chances of short flows completing quickly, but it would hardly increase traffic levels on the Internet, because Internet bytes have always been concentrated in the large flows. It further shows that the performance of many typical applications depends on completion of long serial chains of short messages. It argues that, given most of the value people get from the Internet is concentrated within short flows, this simple expedient would greatly increase the value of the best-efforts Internet at minimal cost.

4.2.4.
Congestion Notification: Summary of Conflicting Advice

   +-----------+----------------+-----------------+--------------------+
   | transport | RED_1 (packet  | RED_4 (linear   | RED_5 (square byte |
   | cc        | mode drop)     | byte mode drop) | mode drop)         |
   +-----------+----------------+-----------------+--------------------+
   | TCP or    | s/sqrt(p)      | sqrt(s/p)       | 1/sqrt(p)          |
   | TFRC      |                |                 |                    |
   | TFRC-SP   | 1/sqrt(p)      | 1/sqrt(sp)      | 1/(s.sqrt(p))      |
   +-----------+----------------+-----------------+--------------------+

   Table 2: Dependence of flow bit-rate per RTT on packet size, s, and
   drop probability, p, when network and/or transport bias towards
   small packets to varying degrees

Table 2 aims to summarise the potential effects of all the advice from different sources. Each column shows a different possible AQM behaviour in different queues in the network, using the terminology of Cnodder et al outlined earlier (RED_1 is basic RED with packet-mode drop). Each row shows a different transport behaviour: TCP [RFC5681] and TFRC [RFC5348] on the top row with TFRC-SP [RFC4828] below. Each cell shows how the bits per round trip of a flow depends on packet size, s, and drop probability, p. In order to declutter the formulae to focus on packet-size dependence they are all given per round trip, which removes any RTT term.

Let us assume that the goal is for the bit-rate of a flow to be independent of packet size. Suppressing all inessential details, the table shows that this should either be achievable by not altering the TCP transport in a RED_5 network, or by using the small packet TFRC-SP transport (or similar) in a network without any byte-mode dropping RED (top right and bottom left). Top left is the `do nothing' scenario, while bottom right is the `do-both' scenario in which bit-rate would become far too biased towards small packets.
Of course, if any form of byte-mode dropping RED has been deployed on a subset of queues that congest, each path through the network will present a different hybrid scenario to its transport.

In any case, we can see that the linear byte-mode drop column in the middle would considerably complicate the Internet. It's a half-way house that doesn't bias enough towards small packets even if one believes the network should be doing the biasing. Section 2 recommends that _all_ bias in network equipment towards small packets should be turned off--if indeed any equipment vendors have implemented it--leaving packet-size bias solely as the preserve of the transport layer (solely the leftmost, packet-mode drop column).

In practice it seems that no deliberate bias towards small packets has been implemented for production networks. Of the 19% of vendors who responded to a survey of 84 equipment vendors, none had implemented byte-mode drop in RED (see Appendix A for details).

5. Outstanding Issues and Next Steps

5.1. Bit-congestible Network

For a connectionless network with nearly all resources being bit-congestible the recommended position is clear--that the network should not make allowance for packet sizes and the transport should. This leaves two outstanding issues:

o How to handle any legacy of AQM with byte-mode drop already deployed;

o The need to start a programme to update transport congestion control protocol standards to take account of packet size.

A survey of equipment vendors (Section 4.2.4) found no evidence that byte-mode packet drop had been implemented, so deployment will be sparse at best. A migration strategy is not really needed to remove an algorithm that may not even be deployed.
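The second outstanding issue is exemplified by TFRC-SP [RFC4828], which, as described in Section 4.2.2, inflates the rate given by its rate equation by the ratio of a typical TCP segment size (1500B) to the actual segment size. A rough sketch, substituting the simplified square-root TCP throughput model for the full TFRC equation, so the absolute numbers are illustrative only:

```python
from math import sqrt

def tcp_rate(s, rtt, p):
    """Simplified square-root TCP throughput model: bytes/second for
    segment size s (bytes), round trip time rtt (seconds) and loss
    event probability p.  TFRC itself uses a fuller equation."""
    return (s / rtt) * sqrt(3.0 / (2.0 * p))

def tfrc_sp_rate(s, rtt, p, s_typical=1500):
    """TFRC-SP-style adjustment: inflate the rate by the ratio of a
    typical TCP segment size to the actual segment size, making the
    resulting rate independent of the flow's own packet size."""
    return (s_typical / s) * tcp_rate(s, rtt, p)
```

With the plain model, a flow of 60B segments gets 1/25 of the rate of a flow of 1500B segments under the same loss probability; with the TFRC-SP-style adjustment the two rates are equal.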
A programme of experimental updates to take account of packet size in transport congestion control protocols has already started with TFRC-SP [RFC4828].

5.2. Bit- & Packet-congestible Network

The position is much less clear-cut if the Internet becomes populated by a more even mix of both packet-congestible and bit-congestible resources (see Appendix B.2). This problem is not pressing, because most Internet resources are designed to be bit-congestible before packet processing starts to congest (see Section 1.1).

The IRTF Internet congestion control research group (ICCRG) has set itself the task of reaching consensus on generic forwarding mechanisms that are necessary and sufficient to support the Internet's future congestion control requirements (the first challenge in [RFC6077]). The research question of whether packet congestion might become common, and what to do if it does, may in the future be explored in the IRTF ("Challenge 3: Packet Size" in [RFC6077]).

6. Security Considerations

This memo recommends that queues do not bias drop probability towards small packets as this creates a perverse incentive for transports to break down their flows into tiny segments. One of the benefits of implementing AQM was meant to be to remove this perverse incentive that drop-tail queues gave to small packets.

In practice, transports cannot all be trusted to respond to congestion. So another reason for recommending that queues do not bias drop probability towards small packets is to avoid the vulnerability to small packet DDoS attacks that would otherwise result. One of the benefits of implementing AQM was meant to be to remove drop-tail's DoS vulnerability to small packets, so we shouldn't add it back again.
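The potency of a small packet attack under byte-mode drop can be illustrated with a toy model (not from this memo): each queue in a chain drops with probability scaled by packet size, using the example figures from earlier in the memo (25% drop for 1500B packets implies 1% drop for 60B packets under byte-mode drop).

```python
def byte_mode_drop_prob(p_max, pkt_size, max_pkt_size=1500):
    """Byte-mode packet drop, as advised against in this memo: the
    drop probability is scaled by packet size relative to the
    maximum packet size."""
    return p_max * pkt_size / max_pkt_size

def survival(pkt_size, p_max, n_queues):
    """Fraction of packets of a given size that survive a chain of
    identically loaded queues all doing byte-mode drop."""
    return (1.0 - byte_mode_drop_prob(p_max, pkt_size)) ** n_queues
```

Across five such queues, 1500B packets survive with probability 0.75^5 (about 0.24) while 60B packets survive with probability 0.99^5 (about 0.95): the flood of small packets pushes through while large packets are pushed aside.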
If most queues implemented AQM with byte-mode drop, the resulting network would amplify the potency of a small packet DDoS attack. At the first queue the stream of packets would push aside a greater proportion of large packets, so more of the small packets would survive to attack the next queue. Thus a flood of small packets would continue on towards the destination, pushing regular traffic with large packets out of the way in one queue after the next, but suffering much less drop itself.

Appendix C explains why the ability of networks to police the response of _any_ transport to congestion depends on bit-congestible network resources only doing packet-mode not byte-mode drop. In summary, it says that making drop probability depend on the size of the packets that bits happen to be divided into simply encourages the bits to be divided into smaller packets. Byte-mode drop would therefore irreversibly complicate any attempt to fix the Internet's incentive structures.

7. Conclusions

This memo identifies the three distinct stages of the congestion notification process where implementations need to decide whether to take packet size into account. The recommendations provided in Section 2 of this memo are different in each case:

o When network equipment measures the length of a queue, whether it counts in bytes or packets depends on whether the network resource is congested respectively by bytes or by packets.

o When network equipment decides whether to drop (or mark) a packet, it is recommended that the size of the particular packet should not be taken into account.

o However, when a transport algorithm responds to a dropped or marked packet, the size of the rate reduction should be proportionate to the size of the packet.
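The third recommendation can be sketched as transport-side accounting: accumulate the bytes of packets that indicate congestion (a repeated add, echoing Section 3.5) rather than counting congestion events. The class and method names below are illustrative, not from this memo.

```python
class CongestedByteCounter:
    """Transport-side sketch: respond to dropped/marked *bytes*, so
    the response is proportionate to the size of each packet that
    indicates congestion (illustrative names, not from the memo)."""

    def __init__(self):
        self.marked_bytes = 0
        self.sent_bytes = 0

    def on_packet_sent(self, pkt_size):
        self.sent_bytes += pkt_size

    def on_congestion_indication(self, pkt_size):
        # Repeated add -- no multiply operation needed (Section 3.5).
        self.marked_bytes += pkt_size

    def congestion_fraction(self):
        """Fraction of bytes congested; this could drive a byte-based
        rate equation such as TFRC-SP's, instead of a per-packet loss
        event rate."""
        if self.sent_bytes == 0:
            return 0.0
        return self.marked_bytes / self.sent_bytes
```

With this accounting, the loss of one 60B packet reduces the rate 25x less than the loss of one 1500B packet, as the scaling argument in Section 3.4 requires.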
In summary, the answers are 'it depends', 'no' and 'yes' respectively.

For the specific case of RED, this means that byte-mode queue measurement will often be appropriate although byte-mode drop is strongly deprecated.

At the transport layer the IETF should continue updating congestion control protocols to take account of the size of each packet that indicates congestion. Also the IETF should continue to make protocols less sensitive to losing control packets like SYNs, pure ACKs and DNS exchanges. Although many control packets happen to be small, the alternative of network equipment favouring all small packets would be dangerous. That would create perverse incentives to split data transfers into smaller packets.

The memo develops these recommendations from principled arguments concerning scaling, layering, incentives, inherent efficiency, security and policeability. But it also addresses practical issues such as specific buffer architectures and incremental deployment. Indeed, a limited survey of RED implementations is discussed, which shows there appears to be little, if any, installed base of RED's byte-mode drop. Therefore it can be deprecated with little, if any, incremental deployment complications.

The recommendations have been developed on the well-founded basis that most Internet resources are bit-congestible not packet-congestible. We need to know the likelihood that this assumption will prevail longer term and, if it might not, what protocol changes will be needed to cater for a mix of the two. The IRTF Internet Congestion Control Research Group (ICCRG) is currently working on these problems [RFC6077].

8. Acknowledgements

Thank you to Sally Floyd, who gave extensive and useful review comments.
Also thanks for the reviews from Philip Eardley, David 1182 Black, Fred Baker, Toby Moncaster, Arnaud Jacquet and Mirja 1183 Kuehlewind as well as helpful explanations of different hardware 1184 approaches from Larry Dunn and Fred Baker. We are grateful to Bruce 1185 Davie and his colleagues for providing a timely and efficient survey 1186 of RED implementation in Cisco's product range. Also grateful thanks 1187 to Toby Moncaster, Will Dormann, John Regnault, Simon Carter and 1188 Stefaan De Cnodder who further helped survey the current status of 1189 RED implementation and deployment and, finally, thanks to the 1190 anonymous individuals who responded. 1192 Bob Briscoe and Jukka Manner were partly funded by Trilogy, a 1193 research project (ICT- 216372) supported by the European Community 1194 under its Seventh Framework Programme. The views expressed here are 1195 those of the authors only. 1197 9. Comments Solicited 1199 Comments and questions are encouraged and very welcome. They can be 1200 addressed to the IETF Transport Area working group mailing list 1201 , and/or to the authors. 1203 10. References 1205 10.1. Normative References 1207 [RFC2119] Bradner, S., "Key words for use in 1208 RFCs to Indicate Requirement Levels", 1209 BCP 14, RFC 2119, March 1997. 1211 [RFC2309] Braden, B., Clark, D., Crowcroft, J., 1212 Davie, B., Deering, S., Estrin, D., 1213 Floyd, S., Jacobson, V., Minshall, 1214 G., Partridge, C., Peterson, L., 1215 Ramakrishnan, K., Shenker, S., 1216 Wroclawski, J., and L. Zhang, 1217 "Recommendations on Queue Management 1218 and Congestion Avoidance in the 1219 Internet", RFC 2309, April 1998. 1221 [RFC3168] Ramakrishnan, K., Floyd, S., and D. 1222 Black, "The Addition of Explicit 1223 Congestion Notification (ECN) to IP", 1224 RFC 3168, September 2001. 1226 [RFC3426] Floyd, S., "General Architectural and 1227 Policy Considerations", RFC 3426, 1228 November 2002. 1230 10.2. 
Informative References 1232 [CCvarPktSize] Widmer, J., Boutremans, C., and J-Y. 1233 Le Boudec, "Congestion Control for 1234 Flows with Variable Packet Size", ACM 1235 CCR 34(2) 137--151, 2004, . 1238 [CHOKe_Var_Pkt] Psounis, K., Pan, R., and B. 1239 Prabhaker, "Approximate Fair Dropping 1240 for Variable Length Packets", IEEE 1241 Micro 21(1):48--56, January- 1242 February 2001, . 1246 [DRQ] Shin, M., Chong, S., and I. Rhee, 1247 "Dual-Resource TCP/AQM for 1248 Processing-Constrained Networks", 1249 IEEE/ACM Transactions on 1250 Networking Vol 16, issue 2, 1251 April 2008, . 1254 [DupTCP] Wischik, D., "Short messages", Royal 1255 Society workshop on networks: 1256 modelling and control , 1257 September 2007, . 1261 [ECNFixedWireless] Siris, V., "Resource Control for 1262 Elastic Traffic in CDMA Networks", 1263 Proc. ACM MOBICOM'02 , 1264 September 2002, . 1268 [Evol_cc] Gibbens, R. and F. Kelly, "Resource 1269 pricing and the evolution of 1270 congestion control", 1271 Automatica 35(12)1969--1985, 1272 December 1999, . 1276 [I-D.ietf-avtcore-ecn-for-rtp] Westerlund, M., Johansson, I., 1277 Perkins, C., O'Hanlon, P., and K. 1278 Carlberg, "Explicit Congestion 1279 Notification (ECN) for RTP over UDP", 1280 draft-ietf-avtcore-ecn-for-rtp-06 1281 (work in progress), February 2012. 1283 [I-D.ietf-conex-concepts-uses] Briscoe, B., Woundy, R., and A. 1284 Cooper, "ConEx Concepts and Use 1285 Cases", 1286 draft-ietf-conex-concepts-uses-03 1287 (work in progress), October 2011. 1289 [IOSArch] Bollapragada, V., White, R., and C. 1291 Murphy, "Inside Cisco IOS Software 1292 Architecture", Cisco Press: CCIE 1293 Professional Development ISBN13: 978- 1294 1-57870-181-0, July 2000. 1296 [PktSizeEquCC] Vasallo, P., "Variable Packet Size 1297 Equation-Based Congestion Control", 1298 ICSI Technical Report tr-00-008, 1299 2000, . 1303 [RED93] Floyd, S. and V. 
Jacobson, "Random 1304 Early Detection (RED) gateways for 1305 Congestion Avoidance", IEEE/ACM 1306 Transactions on Networking 1(4) 397-- 1307 413, August 1993, . 1311 [REDbias] Eddy, W. and M. Allman, "A Comparison 1312 of RED's Byte and Packet Modes", 1313 Computer Networks 42(3) 261--280, 1314 June 2003, . 1317 [REDbyte] De Cnodder, S., Elloumi, O., and K. 1318 Pauwels, "RED behavior with different 1319 packet sizes", Proc. 5th IEEE 1320 Symposium on Computers and 1321 Communications (ISCC) 793--799, 1322 July 2000, . 1325 [RFC2474] Nichols, K., Blake, S., Baker, F., 1326 and D. Black, "Definition of the 1327 Differentiated Services Field (DS 1328 Field) in the IPv4 and IPv6 Headers", 1329 RFC 2474, December 1998. 1331 [RFC3550] Schulzrinne, H., Casner, S., 1332 Frederick, R., and V. Jacobson, "RTP: 1333 A Transport Protocol for Real-Time 1334 Applications", STD 64, RFC 3550, 1335 July 2003. 1337 [RFC3714] Floyd, S. and J. Kempf, "IAB Concerns 1338 Regarding Congestion Control for 1339 Voice Traffic in the Internet", 1340 RFC 3714, March 2004. 1342 [RFC4828] Floyd, S. and E. Kohler, "TCP 1343 Friendly Rate Control (TFRC): The 1344 Small-Packet (SP) Variant", RFC 4828, 1345 April 2007. 1347 [RFC5348] Floyd, S., Handley, M., Padhye, J., 1348 and J. Widmer, "TCP Friendly Rate 1349 Control (TFRC): Protocol 1350 Specification", RFC 5348, 1351 September 2008. 1353 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, 1354 S., and K. Ramakrishnan, "Adding 1355 Explicit Congestion Notification 1356 (ECN) Capability to TCP's SYN/ACK 1357 Packets", RFC 5562, June 2009. 1359 [RFC5670] Eardley, P., "Metering and Marking 1360 Behaviour of PCN-Nodes", RFC 5670, 1361 November 2009. 1363 [RFC5681] Allman, M., Paxson, V., and E. 1364 Blanton, "TCP Congestion Control", 1365 RFC 5681, September 2009. 1367 [RFC5690] Floyd, S., Arcia, A., Ros, D., and J. 1368 Iyengar, "Adding Acknowledgement 1369 Congestion Control to TCP", RFC 5690, 1370 February 2010. 
1372 [RFC6077] Papadimitriou, D., Welzl, M., Scharf, 1373 M., and B. Briscoe, "Open Research 1374 Issues in Internet Congestion 1375 Control", RFC 6077, February 2011. 1377 [Rate_fair_Dis] Briscoe, B., "Flow Rate Fairness: 1378 Dismantling a Religion", ACM 1379 CCR 37(2)63--74, April 2007, . 1383 [gentle_RED] Floyd, S., "Recommendation on using 1384 the "gentle_" variant of RED", Web 1385 page , March 2000, . 1388 [pBox] Floyd, S. and K. Fall, "Promoting the 1389 Use of End-to-End Congestion Control 1390 in the Internet", IEEE/ACM 1391 Transactions on Networking 7(4) 458-- 1392 472, August 1999, . 1396 [pktByteEmail] Floyd, S., "RED: Discussions of Byte 1397 and Packet Modes", Web page Red Queue 1398 Management, March 1997, . 1402 Appendix A. Survey of RED Implementation Status 1404 This Appendix is informative, not normative. 1406 In May 2007 a survey was conducted of 84 vendors to assess how widely 1407 drop probability based on packet size has been implemented in RED 1408 (Table 3). About 19% of those surveyed replied, giving a sample size 1409 of 16. Although in most cases we do not have permission to identify 1410 the respondents, we can say that those that have responded include 1411 most of the larger equipment vendors, covering a large fraction of 1412 the market. The two who gave permission to be identified were Cisco 1413 and Alcatel-Lucent. The others range across the large network 1414 equipment vendors at L3 & L2, firewall vendors, wireless equipment 1415 vendors, as well as large software businesses with a small selection 1416 of networking products. All those who responded confirmed that they 1417 have not implemented the variant of RED with drop dependent on packet 1418 size (2 were fairly sure they had not but needed to check more 1419 thoroughly). At the time the survey was conducted, Linux did not 1420 implement RED with packet-size bias of drop, although we have not 1421 investigated a wider range of open source code.
1423 +-------------------------------+----------------+-----------------+
1424 | Response                      | No. of vendors | %age of vendors |
1425 +-------------------------------+----------------+-----------------+
1426 | Not implemented               | 14             | 17%             |
1427 | Not implemented (probably)    | 2              | 2%              |
1428 | Implemented                   | 0              | 0%              |
1429 | No response                   | 68             | 81%             |
1430 | Total companies/orgs surveyed | 84             | 100%            |
1431 +-------------------------------+----------------+-----------------+

1433 Table 3: Vendor Survey on byte-mode drop variant of RED (lower drop 1434 probability for small packets)

1436 Where reasons have been given, the extra complexity of packet bias 1437 code has been most prevalent, though one vendor had a more principled 1438 reason for avoiding it--similar to the argument of this document. 1440 Our survey was of vendor implementations, so we cannot be certain 1441 about operator deployment. But we believe many queues in the 1442 Internet are still tail-drop. The company of one of the co-authors 1443 (BT) has widely deployed RED, but many tail-drop queues are bound to 1444 still exist, particularly in access network equipment and on 1445 middleboxes like firewalls, where RED is not always available. 1447 Routers using a memory architecture based on fixed size buffers with 1448 borrowing may also still be prevalent in the Internet. As explained 1449 in Section 4.2.1, these also provide a marginal (but legitimate) bias 1450 towards small packets. So even though RED byte-mode drop is not 1451 prevalent, it is likely there is still some bias towards small 1452 packets in the Internet due to tail drop and fixed buffer borrowing. 1454 Appendix B. Sufficiency of Packet-Mode Drop 1456 This Appendix is informative, not normative. 1458 Here we check that packet-mode drop (or marking) in the network gives 1459 sufficiently generic information for the transport layer to use. We 1460 check against a 2x2 matrix of four scenarios that may occur now or in 1461 the future (Table 4).
The horizontal and vertical dimensions have 1462 been chosen because each tests extremes of sensitivity to packet size 1463 in the transport and in the network respectively. 1465 Note that this section does not consider byte-mode drop at all. 1466 Having deprecated byte-mode drop, the goal here is to check that 1467 packet-mode drop will be sufficient in all cases.

1469 +-------------------------------+-----------------+-----------------+
1470 | Transport                     | a) Independent  | b) Dependent on |
1471 |                               | of packet size  | packet size of  |
1472 | Network                       | of congestion   | congestion      |
1473 |                               | notifications   | notifications   |
1474 +-------------------------------+-----------------+-----------------+
1475 | 1) Predominantly              | Scenario a1)    | Scenario b1)    |
1476 |    bit-congestible network    |                 |                 |
1477 | 2) Mix of bit-congestible and | Scenario a2)    | Scenario b2)    |
1478 |    pkt-congestible network    |                 |                 |
1479 +-------------------------------+-----------------+-----------------+

1481 Table 4: Four Possible Congestion Scenarios

1483 Appendix B.1 focuses on the horizontal dimension of Table 4, checking 1484 that packet-mode drop (or marking) gives sufficient information, 1485 whether or not the transport uses it--scenarios b) and a) 1486 respectively. 1488 Appendix B.2 focuses on the vertical dimension of Table 4, checking 1489 that packet-mode drop gives sufficient information to the transport 1490 whether resources in the network are bit-congestible or packet- 1491 congestible (these terms are defined in Section 1.1). 1493 Notation: To be concrete, we will compare two flows with different 1494 packet sizes, s_1 and s_2. As an example, we will take s_1 = 60B 1495 = 480b and s_2 = 1500B = 12,000b. 1497 A flow's bit rate, x [bps], is related to its packet rate, u 1498 [pps], by 1500 x(t) = s.u(t). 1502 In the bit-congestible case, path congestion will be denoted by 1503 p_b, and in the packet-congestible case by p_p.
When either case 1504 is implied, the letter p alone will denote path congestion. 1506 B.1. Packet-Size (In)Dependence in Transports 1508 In all cases we consider a packet-mode drop queue that indicates 1509 congestion by dropping (or marking) packets with probability p 1510 irrespective of packet size. We use an example value of loss 1511 (marking) probability, p=0.1%. 1513 A transport like RFC5681 TCP treats a congestion notification on any 1514 packet, whatever its size, as one event. However, a network with just 1515 the packet-mode drop algorithm does give more information if the 1516 transport chooses to use it. We will use Table 5 to illustrate this. 1518 We will set aside the last column until later. The columns labelled 1519 "Flow 1" and "Flow 2" compare two flows consisting of 60B and 1500B 1520 packets respectively. The body of the table considers two separate 1521 cases, one where the flows have equal bit-rate and the other with 1522 equal packet-rates. In both cases, the two flows fill a 96Mbps link. 1523 Therefore, in the equal bit-rate case they each have half the bit- 1524 rate (48Mbps). Whereas, with equal packet-rates, flow 1 uses 25 1525 times smaller packets so it gets 25 times less bit-rate--it only gets 1526 1/(1+25) of the link capacity (96Mbps/26 = 4Mbps after rounding). In 1527 contrast flow 2 gets 25 times more bit-rate (92Mbps) in the equal 1528 packet rate case because its packets are 25 times larger. The packet 1529 rate shown for each flow could easily be derived once the bit-rate 1530 was known by dividing bit-rate by packet size, as shown in the column 1531 labelled "Formula".
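The figures in Table 5 below follow from this setup. As a cross-check, the following short sketch (illustrative only; not part of the memo's recommendations) reproduces the headline numbers for both cases:

```python
# Cross-check of the Table 5 arithmetic (illustrative only).
# Packets are dropped with p = 0.1% irrespective of size (packet-mode drop).

p = 0.001                # drop (or marking) probability
s1, s2 = 480, 12_000     # packet sizes in bits (60 B and 1,500 B)
link = 96e6              # shared link capacity in bits per second

# Equal bit-rate case: each flow gets 48 Mbps.
x1 = x2 = link / 2
u1, u2 = x1 / s1, x2 / s2               # packet rates: 100,000 and 4,000 pps
pkt_loss = (p * u1, p * u2)             # absolute pkt-loss-rates: 100 and 4 pps
bit_loss = (p * u1 * s1, p * u2 * s2)   # absolute bit-loss-rates: 48 kbps each

# Equal packet-rate case: both flows send u pps and together fill the link.
u = link / (s1 + s2)                    # ~7,692 pps each (~8 kpps in the table)
rates = (u * s1, u * s2)                # bit-rates: ~4 Mbps and ~92 Mbps
combined_bit_loss = p * (u * s1 + u * s2)  # 96,000 bps = 0.1% of the link

# Loss *ratios* are p in every case, whether counted in packets or bits.
assert abs(p * u1 / u1 - p) < 1e-12
assert abs(p * u1 * s1 / (u1 * s1) - p) < 1e-12
```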
1533 Parameter               Formula       Flow 1    Flow 2  Combined
1534 ----------------------- -----------  -------  --------  --------
1535 Packet size             s/8              60B    1,500B     (Mix)
1536 Packet size             s               480b   12,000b     (Mix)
1537 Pkt loss probability    p               0.1%      0.1%      0.1%

1539 EQUAL BIT-RATE CASE
1540 Bit-rate                x             48Mbps    48Mbps    96Mbps
1541 Packet-rate             u = x/s      100kpps     4kpps   104kpps
1542 Absolute pkt-loss-rate  p*u           100pps      4pps    104pps
1543 Absolute bit-loss-rate  p*u*s         48kbps    48kbps    96kbps
1544 Ratio of lost/sent pkts p*u/u           0.1%      0.1%      0.1%
1545 Ratio of lost/sent bits p*u*s/(u*s)     0.1%      0.1%      0.1%

1547 EQUAL PACKET-RATE CASE
1548 Bit-rate                x              4Mbps    92Mbps    96Mbps
1549 Packet-rate             u = x/s        8kpps     8kpps    15kpps
1550 Absolute pkt-loss-rate  p*u             8pps      8pps     15pps
1551 Absolute bit-loss-rate  p*u*s          4kbps    92kbps    96kbps
1552 Ratio of lost/sent pkts p*u/u           0.1%      0.1%      0.1%
1553 Ratio of lost/sent bits p*u*s/(u*s)     0.1%      0.1%      0.1%

1555 Table 5: Absolute Loss Rates and Loss Ratios for Flows of Small and 1556 Large Packets and Both Combined

1558 So far we have merely set up the scenarios. We now consider 1559 congestion notification in these scenarios. Two TCP flows with the same 1560 round trip time aim to equalise their packet-loss-rates over time. 1561 That is, the number of packets lost in a second is the packets 1562 per second (u) multiplied by the probability that each one is dropped 1563 (p). Thus TCP converges on the "Equal packet-rate" case, where both 1564 flows aim for the same "Absolute packet-loss-rate" (both 8pps in the 1565 table). 1567 Packet-mode drop actually gives flows sufficient information to 1568 measure their loss-rate in bits per second, if they choose, not just 1569 packets per second. Each flow can count the size of a lost or marked 1570 packet and scale its rate-response in proportion (as TFRC-SP does).
1571 The result is shown in the row entitled "Absolute bit-loss-rate", 1572 where the bits lost in a second is the packets per second (u) 1573 multiplied by the probability of losing a packet (p) multiplied by 1574 the packet size (s). Such an algorithm would try to remove any 1575 imbalance in bit-loss-rate such as the wide disparity in the "Equal 1576 packet-rate" case (4kbps vs. 92kbps). Instead, a packet-size- 1577 dependent algorithm would drive both flows towards the "Equal 1578 bit-rate" case, by aiming for equal bit-loss-rates (both 48kbps 1579 in this example). 1581 The explanation so far has assumed that each flow consists of packets 1582 of only one constant size. Nonetheless, it extends naturally to 1583 flows with mixed packet sizes. In the right-most column of Table 5 a 1584 flow of mixed size packets is created simply by considering flow 1 1585 and flow 2 as a single aggregated flow. There is no need for a flow 1586 to maintain an average packet size. It is only necessary for the 1587 transport to scale its response to each congestion indication by the 1588 size of each individual lost (or marked) packet. Taking for example 1589 the "Equal packet-rate" case, in one second about 8 small packets and 1590 8 large packets are lost (more precisely, about 7.7 of each, making 1591 closer to 15 than 16 losses per second). If the transport multiplies 1592 each loss by its size, in one second it responds to about 7.7*480b and 1593 7.7*12,000b lost bits, adding up to 96,000 lost bits in a second. This 1594 double checks correctly, being the same as 0.1% of the total bit-rate 1595 of 96Mbps. For completeness, the formula for absolute bit-loss-rate is 1596 p(u1*s1+ u2*s2). 1598 Incidentally, a transport will always measure the loss probability 1599 the same irrespective of whether it measures in packets or in bytes. 1600 In other words, the ratio of lost to sent packets will be the same as 1601 the ratio of lost to sent bytes.
(This is why TCP's bit rate is 1602 still proportional to packet size even when byte-counting is used, as 1603 recommended for TCP in [RFC5681], mainly for orthogonal security 1604 reasons.) This is intuitively obvious by comparing two example 1605 flows; one with 60B packets, the other with 1500B packets. If both 1606 flows pass through a queue with drop probability 0.1%, each flow will 1607 lose 1 in 1,000 packets. In the stream of 60B packets the ratio of 1608 bytes lost to sent will be 60B in every 60,000B; and in the stream of 1609 1500B packets, the loss ratio will be 1,500B out of 1,500,000B. When 1610 the transport responds to the ratio of lost to sent packets, it will 1611 measure the same ratio whether it measures in packets or bytes: 0.1% 1612 in both cases. The fact that this ratio is the same whether measured 1613 in packets or bytes can be seen in Table 5, where the ratio of lost 1614 to sent packets and the ratio of lost to sent bytes is always 0.1% in 1615 all cases (recall that the scenario was set up with p=0.1%). 1617 This discussion of how the ratio can be measured in packets or bytes 1618 is only raised here to highlight that it is irrelevant to this memo! 1619 Whether a transport depends on packet size or not depends on how this 1620 ratio is used within the congestion control algorithm. 1622 So far we have shown that packet-mode drop passes sufficient 1623 information to the transport layer so that the transport can take 1624 account of bit-congestion, by using the sizes of the packets that 1625 indicate congestion. We have also shown that the transport can 1626 choose not to take packet size into account if it wishes. We will 1627 now consider whether the transport can know which to do. 1629 B.2. Bit-Congestible and Packet-Congestible Indications 1631 As a thought-experiment, imagine an idealised congestion notification 1632 protocol that supports both bit-congestible and packet-congestible 1633 resources. 
It would require at least two ECN flags, one for each of 1634 bit-congestible and packet-congestible resources. 1636 1. A packet-congestible resource trying to code congestion level p_p 1637 into a packet stream should mark the idealised `packet 1638 congestion' field in each packet with probability p_p 1639 irrespective of the packet's size. The transport should then 1640 take a packet with the packet congestion field marked to mean 1641 just one mark, irrespective of the packet size. 1643 2. A bit-congestible resource trying to code time-varying byte- 1644 congestion level p_b into a packet stream should mark the `byte 1645 congestion' field in each packet with probability p_b, again 1646 irrespective of the packet's size. Unlike before, the transport 1647 should take a packet with the byte congestion field marked to 1648 count as a mark on each byte in the packet. 1650 This hides a fundamental problem--much more fundamental than whether 1651 we can magically create header space for yet another ECN flag, or 1652 whether it would work while being deployed incrementally. 1653 Distinguishing drop from delivery naturally provides just one 1654 implicit bit of congestion indication information--the packet is 1655 either dropped or not. It is hard to drop a packet in two ways that 1656 are distinguishable remotely. This is a similar problem to that of 1657 distinguishing wireless transmission losses from congestive losses. 1659 This problem would not be solved even if ECN were universally 1660 deployed. A congestion notification protocol must survive a 1661 transition from low levels of congestion to high. Marking two states 1662 is feasible with explicit marking, but much harder if packets are 1663 dropped. Also, it will not always be cost-effective to implement AQM 1664 at every low level resource, so drop will often have to suffice. 
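As a purely conceptual illustration of the two accounting rules above (the idealised two-flag protocol is a thought experiment, not a proposal, and these variable names are invented here), the expected mark counts for a mixed stream would be:

```python
# Conceptual illustration only: expected marks under the idealised
# two-flag protocol for a stream of n small and n large packets.

p_p, p_b = 0.001, 0.001          # packet- and bit-congestion levels
n = 100_000                      # packets of each size
s_small, s_large = 480, 12_000   # packet sizes in bits

# Packet-congestion field: one mark per marked packet, size-independent,
# so the expected count depends only on the number of packets sent.
expected_pkt_marks = p_p * (n + n)                       # 200 marks

# Byte-congestion field: a mark counts on every bit of the marked
# packet, so the expected count scales with the volume of bits sent.
expected_bit_marks = p_b * (n * s_small + n * s_large)   # 1,248,000 bits
```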
1666 We are not saying two ECN fields will be needed (and we are not 1667 saying that somehow a resource should be able to drop a packet in one 1668 of two different ways so that the transport can distinguish which 1669 sort of drop it was!). These two congestion notification channels 1670 are a conceptual device to illustrate a dilemma we could face in the 1671 future. Section 3 gives four good reasons why it would be a bad idea 1672 to allow for packet size by biasing drop probability in favour of 1673 small packets within the network. The impracticality of our thought 1674 experiment shows that it will be hard to give transports a practical 1675 way to know whether to take account of the size of congestion 1676 indication packets or not. 1678 Fortunately, this dilemma is not pressing because by design most 1679 equipment becomes bit-congested before its packet-processing becomes 1680 congested (as already outlined in Section 1.1). Therefore transports 1681 can be designed on the relatively sound assumption that a congestion 1682 indication will usually imply bit-congestion. 1684 Nonetheless, although the above idealised protocol isn't intended for 1685 implementation, we do want to emphasise that research is needed to 1686 predict whether there are good reasons to believe that packet 1687 congestion might become more common, and if so, to find a way to 1688 somehow distinguish between bit and packet congestion [RFC3714]. 1690 Recently, the dual resource queue (DRQ) proposal [DRQ] has been made 1691 on the premise that, as network processors become more cost 1692 effective, per packet operations will become more complex 1693 (irrespective of whether more function in the network is desirable). 1694 Consequently the premise is that CPU congestion will become more 1695 common. DRQ is a proposed modification to the RED algorithm that 1696 folds both bit congestion and packet congestion into one signal 1697 (either loss or ECN). 
1699 Finally, we note one further complication. Strictly, packet- 1700 congestible resources are often cycle-congestible. For instance, for 1701 routing look-ups, load depends on the complexity of each look-up and 1702 whether the pattern of arrivals is amenable to caching or not. This 1703 also reminds us that any solution must not require a forwarding 1704 engine to use excessive processor cycles in order to decide how to 1705 say it has no spare processor cycles. 1707 Appendix C. Byte-mode Drop Complicates Policing Congestion Response 1709 This section is informative, not normative. 1711 There are two main classes of approach to policing congestion 1712 response: i) policing at each bottleneck link or ii) policing at the 1713 edges of networks. Packet-mode drop in RED is compatible with 1714 either, while byte-mode drop precludes edge policing. 1716 The simplicity of an edge policer relies on one dropped or marked 1717 packet being equivalent to another of the same size without having to 1718 know which link the drop or mark occurred at. However, the byte-mode 1719 drop algorithm has to depend on the local MTU of the line--it needs 1720 to use some concept of a 'normal' packet size. Therefore, one 1721 dropped or marked packet from a byte-mode drop algorithm is not 1722 necessarily equivalent to another from a different link. A policing 1723 function local to the link can know the local MTU where the 1724 congestion occurred. However, a policer at the edge of the network 1725 cannot, at least not without a lot of complexity. 1727 The early research proposals for type (i) policing at a bottleneck 1728 link [pBox] used byte-mode drop, then detected flows that contributed 1729 disproportionately to the number of packets dropped. However, with 1730 no extra complexity, later proposals used packet-mode drop and looked 1731 for flows that contributed a disproportionate amount of dropped bytes 1732 [CHOKe_Var_Pkt].
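The counting side of these two detector styles can be sketched as follows (illustrative only; the drop discipline is abstracted away, the flow names are invented, and both published schemes are far more sophisticated):

```python
# Illustrative only: a bottleneck policer can attribute drops to flows
# by packet count or by byte count.  Counting dropped bytes keeps the
# policer robust if a flow re-divides its bits into smaller packets.

drops = [("flowA", 60), ("flowB", 1500), ("flowA", 60), ("flowA", 60)]

by_packets, by_bytes = {}, {}
for flow, size_bytes in drops:
    by_packets[flow] = by_packets.get(flow, 0) + 1
    by_bytes[flow] = by_bytes.get(flow, 0) + size_bytes

print(by_packets)  # flowA dominates by packet count: {'flowA': 3, 'flowB': 1}
print(by_bytes)    # flowB dominates by byte count: {'flowA': 180, 'flowB': 1500}
```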
1734 Work is progressing on the congestion exposure protocol (ConEx 1735 [I-D.ietf-conex-concepts-uses]), which enables a type (ii) edge 1736 policer located at a user's attachment point. The idea is to be able 1737 to take an integrated view of the effect of all a user's traffic on 1738 any link in the internetwork. However, byte-mode drop would 1739 effectively preclude such edge policing because of the MTU issue 1740 above. 1742 Indeed, making drop probability depend on the size of the packets 1743 that bits happen to be divided into would simply encourage the bits 1744 to be divided into smaller packets in order to confuse policing. In 1745 contrast, as long as a dropped/marked packet is taken to mean that 1746 all the bytes in the packet are dropped/marked, a policer can remain 1747 robust against bits being re-divided into different size packets or 1748 across different size flows [Rate_fair_Dis]. 1750 Appendix D. Changes from Previous Versions 1752 To be removed by the RFC Editor on publication. 1754 Full incremental diffs between each version are available at 1755 1756 (courtesy of the rfcdiff tool): 1758 From -06 to -07: 1760 * A mix-up with the corollaries and their naming in 2.1 to 2.3 1761 fixed. 1763 From -05 to -06: 1765 * Primarily editorial fixes. 1767 From -04 to -05: 1769 * Changed from Informational to BCP and highlighted non-normative 1770 sections and appendices 1772 * Removed language about consensus 1774 * Added "Example Comparing Packet-Mode Drop and Byte-Mode Drop" 1776 * Arranged "Motivating Arguments" into a more logical order and 1777 completely rewrote "Transport-Independent Network" & "Scaling 1778 Congestion Control with Packet Size" arguments. Removed "Why 1779 Now?" 
1781 * Clarified applicability of certain recommendations 1783 * Shifted vendor survey to an Appendix 1785 * Cut down "Outstanding Issues and Next Steps" 1787 * Re-drafted the start of the conclusions to highlight the three 1788 distinct areas of concern 1790 * Completely re-wrote appendices 1792 * Editorial corrections throughout. 1794 From -03 to -04: 1796 * Reordered Sections 2 and 3, and some clarifications here and 1797 there based on feedback from Colin Perkins and Mirja 1798 Kuehlewind. 1800 From -02 to -03 (this version) 1802 * Structural changes: 1804 + Split off text at end of "Scaling Congestion Control with 1805 Packet Size" into new section "Transport-Independent 1806 Network" 1808 + Shifted "Recommendations" straight after "Motivating 1809 Arguments" and added "Conclusions" at end to reinforce 1810 Recommendations 1812 + Added more internal structure to Recommendations, so that 1813 recommendations specific to RED or to TCP are just 1814 corollaries of a more general recommendation, rather than 1815 being listed as a separate recommendation. 1817 + Renamed "State of the Art" as "Critical Survey of Existing 1818 Advice" and retitled a number of subsections with more 1819 descriptive titles. 1821 + Split end of "Congestion Coding: Summary of Status" into a 1822 new subsection called "RED Implementation Status". 1824 + Removed text that had been in the Appendix "Congestion 1825 Notification Definition: Further Justification". 1827 * Reordered the intro text a little. 1829 * Made it clearer when advice being reported is deprecated and 1830 when it is not. 1832 * Described AQM as in network equipment, rather than saying "at 1833 the network layer" (to side-step controversy over whether 1834 functions like AQM are in the transport layer but in network 1835 equipment). 1837 * Minor improvements to clarity throughout 1839 From -01 to -02: 1841 * Restructured the whole document for (hopefully) easier reading 1842 and clarity. 
The concrete recommendation, in RFC2119 language, 1843 is now in Section 7. 1845 From -00 to -01: 1847 * Minor clarifications throughout and updated references 1849 From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00: 1851 * Added note on relationship to existing RFCs 1852 * Posed the question of whether packet-congestion could become 1853 common and deferred it to the IRTF ICCRG. Added ref to the 1854 dual-resource queue (DRQ) proposal. 1856 * Changed PCN references from the PCN charter & architecture to 1857 the PCN marking behaviour draft most likely to imminently 1858 become the standards track WG item. 1860 From -01 to -02: 1862 * Abstract reorganised to align with clearer separation of issue 1863 in the memo. 1865 * Introduction reorganised with motivating arguments removed to 1866 new Section 3. 1868 * Clarified avoiding lock-out of large packets is not the main or 1869 only motivation for RED. 1871 * Mentioned choice of drop or marking explicitly throughout, 1872 rather than trying to coin a word to mean either. 1874 * Generalised the discussion throughout to any packet forwarding 1875 function on any network equipment, not just routers. 1877 * Clarified the last point about why this is a good time to sort 1878 out this issue: because it will be hard / impossible to design 1879 new transports unless we decide whether the network or the 1880 transport is allowing for packet size. 1882 * Added statement explaining the horizon of the memo is long 1883 term, but with short term expediency in mind. 1885 * Added material on scaling congestion control with packet size 1886 (Section 3.4). 1888 * Separated out issue of normalising TCP's bit rate from issue of 1889 preference to control packets (Section 3.2). 
1891 * Divided up Congestion Measurement section for clarity, 1892 including new material on fixed size packet buffers and buffer 1893 carving (Section 4.1.1 & Section 4.2.1) and on congestion 1894 measurement in wireless link technologies without queues 1895 (Section 4.1.2). 1897 * Added section on 'Making Transports Robust against Control 1898 Packet Losses' (Section 4.2.3) with existing & new material 1899 included. 1901 * Added tabulated results of vendor survey on byte-mode drop 1902 variant of RED (Table 3). 1904 From -00 to -01: 1906 * Clarified applicability to drop as well as ECN. 1908 * Highlighted DoS vulnerability. 1910 * Emphasised that drop-tail suffers from similar problems to 1911 byte-mode drop, so only byte-mode drop should be turned off, 1912 not RED itself. 1914 * Clarified the original apparent motivations for recommending 1915 byte-mode drop included protecting SYNs and pure ACKs more than 1916 equalising the bit rates of TCPs with different segment sizes. 1917 Removed some conjectured motivations. 1919 * Added support for updates to TCP in progress (ackcc & ecn-syn- 1920 ack). 1922 * Updated survey results with newly arrived data. 1924 * Pulled all recommendations together into the conclusions. 1926 * Moved some detailed points into two additional appendices and a 1927 note. 1929 * Considerable clarifications throughout. 1931 * Updated references 1933 Authors' Addresses 1935 Bob Briscoe 1936 BT 1937 B54/77, Adastral Park 1938 Martlesham Heath 1939 Ipswich IP5 3RE 1940 UK 1942 Phone: +44 1473 645196 1943 EMail: bob.briscoe@bt.com 1944 URI: http://bobbriscoe.net/ 1945 Jukka Manner 1946 Aalto University 1947 Department of Communications and Networking (Comnet) 1948 P.O. Box 13000 1949 FIN-00076 Aalto 1950 Finland 1952 Phone: +358 9 470 22481 1953 EMail: jukka.manner@aalto.fi 1954 URI: http://www.netlab.tkk.fi/~jmanner/