idnits 2.17.1 

draft-ietf-tsvwg-byte-pkt-congest-08.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 1542 has weird spacing: '...ability    p  ...'

  == Line 1547 has weird spacing: '...ss-rate  p*u  ...'

  == Line 1548 has weird spacing: '...ss-rate  p*u*s...'

  == Line 1555 has weird spacing: '...ss-rate  p*u  ...'

  == Line 1556 has weird spacing: '...ss-rate  p*u*s...'

     (Using the creation date from RFC2309, updated by this document, for
     RFC5378 checks: 1997-03-25)

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (August 13, 2012) is 4273 days in the past.  Is this
     intentional?


  Checking references for intended status: Best Current Practice
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-05) exists of
     draft-ietf-conex-concepts-uses-04

  -- Obsolete informational reference (is this intentional?): RFC 2309
     (Obsoleted by RFC 7567)


     Summary: 0 errors (**), 0 flaws (~~), 7 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Transport Area Working Group                                  B. Briscoe
3	Internet-Draft                                                        BT
4	Updates: 2309 (if approved)                                    J. Manner
5	Intended status: BCP                                    Aalto University
6	Expires: February 14, 2013                               August 13, 2012

8	                Byte and Packet Congestion Notification
9	                  draft-ietf-tsvwg-byte-pkt-congest-08

11	Abstract

13	   This document provides recommendations of best current practice for
14	   dropping or marking packets using active queue management (AQM) such
15	   as random early detection (RED) or pre-congestion notification (PCN).
16	   We give three strong recommendations: (1) packet size should be taken
17	   into account when transports read and respond to congestion
18	   indications, (2) packet size should not be taken into account when
19	   network equipment creates congestion signals (marking, dropping), and
20	   therefore (3) the byte-mode packet drop variant of the RED AQM
21	   algorithm that drops fewer small packets should not be used.  This
22	   memo updates RFC 2309 to deprecate deliberate preferential treatment
23	   of small packets in AQM algorithms.

25	Status of This Memo

27	   This Internet-Draft is submitted in full conformance with the
28	   provisions of BCP 78 and BCP 79.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF).  Note that other groups may also distribute
32	   working documents as Internet-Drafts.  The list of current Internet-
33	   Drafts is at http://datatracker.ietf.org/drafts/current/.

35	   Internet-Drafts are draft documents valid for a maximum of six months
36	   and may be updated, replaced, or obsoleted by other documents at any
37	   time.  It is inappropriate to use Internet-Drafts as reference
38	   material or to cite them other than as "work in progress."

40	   This Internet-Draft will expire on February 14, 2013.

42	Copyright Notice

44	   Copyright (c) 2012 IETF Trust and the persons identified as the
45	   document authors.  All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (http://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document.  Please review these documents
51	   carefully, as they describe your rights and restrictions with respect
52	   to this document.  Code Components extracted from this document must
53	   include Simplified BSD License text as described in Section 4.e of
54	   the Trust Legal Provisions and are provided without warranty as
55	   described in the Simplified BSD License.

57	Table of Contents

59	   1.  Introduction
60	     1.1.  Terminology and Scoping
61	     1.2.  Example Comparing Packet-Mode Drop and Byte-Mode Drop
62	   2.  Recommendations
63	     2.1.  Recommendation on Queue Measurement
64	     2.2.  Recommendation on Encoding Congestion Notification
65	     2.3.  Recommendation on Responding to Congestion
66	     2.4.  Recommendation on Handling Congestion Indications when
67	           Splitting or Merging Packets
68	   3.  Motivating Arguments
69	     3.1.  Avoiding Perverse Incentives to (Ab)use Smaller Packets
70	     3.2.  Small != Control
71	     3.3.  Transport-Independent Network
72	     3.4.  Scaling Congestion Control with Packet Size
73	     3.5.  Implementation Efficiency
74	   4.  A Survey and Critique of Past Advice
75	     4.1.  Congestion Measurement Advice
76	       4.1.1.  Fixed Size Packet Buffers
77	       4.1.2.  Congestion Measurement without a Queue
78	     4.2.  Congestion Notification Advice
79	       4.2.1.  Network Bias when Encoding
80	       4.2.2.  Transport Bias when Decoding
81	       4.2.3.  Making Transports Robust against Control Packet
82	               Losses
83	       4.2.4.  Congestion Notification: Summary of Conflicting
84	               Advice
85	   5.  Outstanding Issues and Next Steps
86	     5.1.  Bit-congestible Network
87	     5.2.  Bit- & Packet-congestible Network
88	   6.  Security Considerations
89	   7.  IANA Considerations
90	   8.  Conclusions
91	   9.  Acknowledgements
92	   10. Comments Solicited
93	   11. References
94	     11.1. Normative References
95	     11.2. Informative References
96	   Appendix A.  Survey of RED Implementation Status
97	   Appendix B.  Sufficiency of Packet-Mode Drop
98	     B.1.  Packet-Size (In)Dependence in Transports
99	     B.2.  Bit-Congestible and Packet-Congestible Indications
100	   Appendix C.  Byte-mode Drop Complicates Policing Congestion
101	                Response
102	   Appendix D.  Changes from Previous Versions

104	1.  Introduction

106	   This memo concerns how we should correctly scale congestion control
107	   functions with packet size for the long term.  It also recognises
108	   that expediency may be necessary to deal with existing widely
109	   deployed protocols that don't live up to the long-term goal.

111	   When notifying congestion, the problem of how (and whether) to take
112	   packet sizes into account has exercised the minds of researchers and
113	   practitioners for as long as active queue management (AQM) has been
114	   discussed.  Indeed, one reason AQM was originally introduced was to
115	   reduce the lock-out effects that small packets can have on large
116	   packets in drop-tail queues.  This memo aims to state the principles
117	   we should be using and to outline how these principles will affect
118	   future protocol design, taking into account the existing deployments
119	   we have already.

121	   The question of whether to take into account packet size arises at
122	   three stages in the congestion notification process:

124	   Measuring congestion:  When a congested resource measures locally how
125	      congested it is, should it measure its queue length in bytes or
126	      packets?

128	   Encoding congestion notification into the wire protocol:  When a
129	      congested network resource notifies its level of congestion,
130	      should it drop / mark each packet dependent on the byte-size of
131	      the particular packet in question?

133	   Decoding congestion notification from the wire protocol:  When a
134	      transport interprets the notification in order to decide how much
135	      to respond to congestion, should it take into account the byte-
136	      size of each missing or marked packet?

138	   Consensus has emerged over the years concerning the first stage:
139	   whether queues are measured in bytes or packets, termed byte-mode
140	   queue measurement or packet-mode queue measurement.  Section 2.1 of
141	   this memo records this consensus in the RFC Series.  In summary, the
142	   choice depends solely on whether the resource is congested by bytes
143	   or packets.

145	   The controversy is mainly around the last two stages: whether to
146	   allow for the size of the specific packet notifying congestion i)
147	   when the network encodes or ii) when the transport decodes the
148	   congestion notification.

150	   Currently, the RFC series is silent on this matter other than a paper
151	   trail of advice referenced from [RFC2309], which conditionally
152	   recommends byte-mode (packet-size dependent) drop [pktByteEmail].
153	   Reducing drop of small packets certainly has some tempting
154	   advantages: i) it drops fewer control packets, which tend to be small
155	   and ii) it makes TCP's bit-rate less dependent on packet size.
156	   However, there are ways of addressing these issues at the transport
157	   layer, rather than reverse engineering network forwarding to fix the
158	   problems.

160	   This memo updates [RFC2309] to deprecate deliberate preferential
161	   treatment of small packets in AQM algorithms.  It recommends taking
162	   packet size into account (1) when transports read congestion
163	   indications, but (2) not when network equipment writes them.

165	   In particular this means that the byte-mode packet drop variant of
166	   Random Early Detection (RED) should not be used to drop fewer small
167	   packets, because that creates a perverse incentive for transports to
168	   use tiny segments, consequently also opening up a DoS vulnerability.
169	   Fortunately, none of the RED implementers who responded to our
170	   admittedly limited survey (Section 4.2.4) followed the earlier advice
171	   to use byte-mode drop, so the position this memo argues for seems to
172	   already exist in implementations.

174	   However, at the transport layer, TCP congestion control is a widely
175	   deployed protocol that doesn't scale with packet size.  To date this
176	   hasn't been a significant problem because most TCP implementations
177	   have been used with similar packet sizes.  But, as we design new
178	   congestion control mechanisms, the current recommendation is that we
179	   should build in scaling with packet size rather than assuming we
180	   should follow TCP's example.

182	   This memo continues as follows.  First it discusses terminology and
183	   scoping.  Section 2 gives the concrete formal recommendations,
184	   followed by motivating arguments in Section 3.  We then critically
185	   survey the advice given previously in the RFC series and the research
186	   literature (Section 4), referring to an assessment of whether or not
187	   this advice has been followed in production networks (Appendix A).
188	   To wrap up, outstanding issues are discussed that will need
189	   resolution both to inform future protocol designs and to handle
190	   legacy (Section 5).  Then security issues are collected together in
191	   Section 6 before conclusions are drawn in Section 8.  The interested
192	   reader can find discussion of more detailed issues on the theme of
193	   byte vs. packet in the appendices.

195	   This memo intentionally includes a non-negligible amount of material
196	   on the subject.  For the busy reader, Section 2 summarises the
197	   recommendations for the Internet community.

199	1.1.  Terminology and Scoping

201	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
202	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
203	   document are to be interpreted as described in [RFC2119].

205	   Congestion Notification:  Congestion notification is a changing
206	      signal that aims to communicate the probability that the network
207	      resource(s) will not be able to forward the level of traffic load
208	      offered (or that there is an impending risk that they will not be
209	      able to).

211	      The 'impending risk' qualifier is added, because AQM systems (e.g.
212	      RED, PCN [RFC5670]) set a virtual limit smaller than the actual
213	      limit to the resource, then notify when this virtual limit is
214	      exceeded in order to avoid uncontrolled congestion of the actual
215	      capacity.

217	      Congestion notification communicates a real number bounded by the
218	      range [ 0 , 1 ].  This ties in with the most well-understood
219	      measure of congestion notification: drop probability.

221	   Explicit and Implicit Notification:  The byte vs. packet dilemma
222	      concerns congestion notification irrespective of whether it is
223	      signalled implicitly by drop or using explicit congestion
224	      notification (ECN [RFC3168] or PCN [RFC5670]).  Throughout this
225	      document, unless clear from the context, the term marking will be
226	      used to mean notifying congestion explicitly, while congestion
227	      notification will be used to mean notifying congestion either
228	      implicitly by drop or explicitly by marking.

230	   Bit-congestible vs. Packet-congestible:  If the load on a resource
231	      depends on the rate at which packets arrive, it is called packet-
232	      congestible.  If the load depends on the rate at which bits arrive
233	      it is called bit-congestible.

235	      Examples of packet-congestible resources are route look-up engines
236	      and firewalls, because load depends on how many packet headers
237	      they have to process.  Examples of bit-congestible resources are
238	      transmission links, radio power and most buffer memory, because
239	      the load depends on how many bits they have to transmit or store.
240	      Some machine architectures use fixed size packet buffers, so
241	      buffer memory in these cases is packet-congestible (see
242	      Section 4.1.1).

244	      Currently a design goal of network processing equipment such as
245	      routers and firewalls is to keep packet processing uncongested
246	      even under worst case packet rates with runs of minimum size
247	      packets.  Therefore, packet-congestion is currently rare [RFC6077;
248	      S.3.3], but there is no guarantee that it will not become more
249	      common in future.

251	      Note that information is generally processed or transmitted with a
252	      minimum granularity greater than a bit (e.g. octets).  The
253	      appropriate granularity for the resource in question should be
254	      used, but for the sake of brevity we will talk in terms of bytes
255	      in this memo.

257	   Coarser Granularity:  Resources may be congestible at higher levels
258	      of granularity than bits or packets, for instance stateful
259	      firewalls are flow-congestible and call-servers are session-
260	      congestible.  This memo focuses on congestion of connectionless
261	      resources, but the same principles may be applicable for
262	      congestion notification protocols controlling per-flow and per-
263	      session processing or state.

265	   RED Terminology:  In RED whether to use packets or bytes when
266	      measuring queues is called respectively "packet-mode queue
267	      measurement" or "byte-mode queue measurement".  And whether the
268	      probability of dropping a particular packet is independent of or
269	      dependent on its byte-size is called respectively "packet-mode
270	      drop" or "byte-mode drop".  The terms byte-mode and packet-mode
271	      should not be used without specifying whether they apply to queue
272	      measurement or to drop.

274	1.2.  Example Comparing Packet-Mode Drop and Byte-Mode Drop

276	   A central question addressed by this document is whether to recommend
277	   that AQM uses RED's packet-mode drop and to deprecate byte-mode drop.
278	   Table 1 compares how packet-mode and byte-mode drop affect two flows
279	   of different size packets.  For each it gives the expected number of
280	   packets and of bits dropped in one second.  Each example flow runs at
281	   the same bit-rate of 48Mbps, but one is broken up into small 60-byte
282	   packets and the other into large 1,500-byte packets.

284	   To keep up the same bit-rate, in one second there are about 25 times
285	   more small packets because they are 25 times smaller.  As can be seen
286	   from the table, the packet rate is 100,000 small packets versus 4,000
287	   large packets per second (pps).

289	      Parameter            Formula        Small packets Large packets
290	      -------------------- -------------- ------------- -------------
291	      Packet size          s/8                      60B        1,500B
292	      Packet size          s                       480b       12,000b
293	      Bit-rate             x                     48Mbps        48Mbps
294	      Packet-rate          u = x/s              100kpps         4kpps

296	      Packet-mode Drop
297	      Pkt loss probability p                       0.1%          0.1%
298	      Pkt loss-rate        p*u                   100pps          4pps
299	      Bit loss-rate        p*u*s                 48kbps        48kbps

301	      Byte-mode Drop       MTU, M=12,000b
302	      Pkt loss probability b = p*s/M             0.004%          0.1%
303	      Pkt loss-rate        b*u                     4pps          4pps
304	      Bit loss-rate        b*u*s               1.92kbps        48kbps

306	         Table 1: Example Comparing Packet-mode and Byte-mode Drop

308	   For packet-mode drop, we illustrate the effect of a drop probability
309	   of 0.1%, which the algorithm applies to all packets irrespective of
310	   size.  Because there are 25 times more small packets in one second,
311	   it naturally drops 25 times more small packets, that is 100 small
312	   packets but only 4 large packets.  But if we count how many bits it
313	   drops, there are 48,000 bits in 100 small packets and 48,000 bits in
314	   4 large packets--the same number of bits of small packets as large.

316	      The packet-mode drop algorithm drops any bit with the same
317	      probability whether the bit is in a small or a large packet.

319	   For byte-mode drop, again we use an example drop probability of 0.1%,
320	   but only for maximum size packets (assuming the link MTU is 1,500B or
321	   12,000b).  The byte-mode algorithm reduces the drop probability of
322	   smaller packets proportional to their size, making the probability
323	   that it drops a small packet 25 times smaller at 0.004%.  But there
324	   are 25 times more small packets, so dropping them with 25 times lower
325	   probability results in dropping the same number of packets: 4 drops
326	   in both cases.  The 4 small dropped packets contain 25 times fewer
327	   bits than the 4 large dropped packets: 1,920 compared to 48,000.

329	      The byte-mode drop algorithm drops any bit with a probability
330	      proportionate to the size of the packet it is in.
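
   As an illustrative sketch (not part of the draft's text), the figures
   in Table 1 can be recomputed from the formulas in its second column.
   The function name loss_rates and the constant names are hypothetical;
   the values of x, p, s and M are those given in the table.

```python
from fractions import Fraction

MTU_BITS = 12_000        # M in Table 1 (1,500 B)
BIT_RATE = 48_000_000    # x, in bits per second
P = Fraction(1, 1000)    # p = 0.1%, the drop probability for a maximum size packet


def loss_rates(size_bits, byte_mode):
    """Return (packet loss probability, packets lost/s, bits lost/s) for one flow."""
    u = Fraction(BIT_RATE, size_bits)                     # packet-rate, u = x/s
    prob = P * size_bits / MTU_BITS if byte_mode else P   # b = p*s/M, or simply p
    return prob, prob * u, prob * u * size_bits


for label, size in (("small", 480), ("large", 12_000)):
    for byte_mode in (False, True):
        prob, pkts, bits = loss_rates(size, byte_mode)
        mode = "byte-mode" if byte_mode else "packet-mode"
        print(f"{label:5} {mode:11} {float(prob):.4%} {float(pkts):g} pps {float(bits):g} b/s")
```

   Exact fractions are used so that the results can be compared with the
   table without floating-point rounding.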

332	2.  Recommendations

334	   This section gives recommendations related to network equipment in
335	   Sections 2.1 and 2.2, and in Sections 2.3 and 2.4 we discuss the
336	   implications for transport protocols.

338	2.1.  Recommendation on Queue Measurement

340	   Queue length is usually the most correct and simplest way to measure
341	   congestion of a resource.  To avoid the pathological effects of drop
342	   tail, an AQM function can then be used to transform queue length into
343	   the probability of dropping or marking a packet (e.g.  RED's
344	   piecewise linear function between thresholds).

346	   If the resource is bit-congestible, the implementation SHOULD measure
347	   the length of the queue in bytes.  If the resource is packet-
348	   congestible, the implementation SHOULD measure the length of the
349	   queue in packets.  No other choice makes sense, because the number of
350	   packets waiting in the queue isn't relevant if the resource gets
351	   congested by bytes and vice versa.

353	   What this advice means for the case of RED:

355	   1.  A RED implementation SHOULD use byte mode queue measurement for
356	       measuring the congestion of bit-congestible resources and packet
357	       mode queue measurement for packet-congestible resources.

359	   2.  An implementation SHOULD NOT make it possible to configure the
360	       way a queue measures itself, because whether a queue is bit-
361	       congestible or packet-congestible is an inherent property of the
362	       queue.

364	   The recommended approach in less straightforward scenarios, such as
365	   fixed size buffers, and resources without a queue, is discussed in
366	   Section 4.1.
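
   The two steps described in this section, measuring the queue in the
   unit that matches the resource and then mapping queue length to a
   drop or mark probability, might be sketched as follows.  This is a
   simplified illustration under stated assumptions: the function names
   and threshold parameters are hypothetical, the ramp is the classic
   RED shape (zero below min_th, linear up to max_p at max_th, one
   above), and RED's averaging of the queue length is omitted.

```python
def queue_length(packet_sizes_bytes, bit_congestible):
    """Measure a queue in bytes if the resource is bit-congestible, else in packets."""
    if bit_congestible:
        return sum(packet_sizes_bytes)
    return len(packet_sizes_bytes)


def red_probability(queue, min_th, max_th, max_p):
    """Piecewise linear map from queue length to drop/mark probability.

    queue, min_th and max_th must all be in the same unit (bytes or packets).
    """
    if queue < min_th:
        return 0.0
    if queue >= max_th:
        return 1.0
    return max_p * (queue - min_th) / (max_th - min_th)


queued = [1500, 60, 60, 1500]                 # sizes of queued packets, in bytes
in_bytes = queue_length(queued, bit_congestible=True)
in_packets = queue_length(queued, bit_congestible=False)
print(in_bytes, in_packets)
print(red_probability(in_bytes, min_th=2000, max_th=6000, max_p=0.1))
```

   Note that the unit of measurement is chosen by the nature of the
   resource, not by a configuration option, in line with item 2 above.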

368	2.2.  Recommendation on Encoding Congestion Notification

370	   When encoding congestion notification (e.g. by drop, ECN or PCN), a
371	   network device SHOULD treat all packets equally, regardless of their
372	   size.  In other words, the probability that network equipment drops
373	   or marks a particular packet to notify congestion SHOULD NOT depend
374	   on the size of the packet in question.  As the example in Section 1.2
375	   illustrates, to drop any bit with probability 0.1% it is only
376	   necessary to drop every packet with probability 0.1% without regard
377	   to the size of each packet.

379	   This approach ensures the network layer offers sufficient congestion
380	   information for all known and future transport protocols and also
381	   ensures no perverse incentives are created that would encourage
382	   transports to use inappropriately small packet sizes.

384	   What this advice means for the case of RED:

386	   1.  AQM algorithms such as RED SHOULD NOT use byte-mode drop, which
387	       deflates RED's drop probability for smaller packet sizes.  RED's
388	       byte-mode drop has no enduring advantages.  It is more complex,
389	       it creates the perverse incentive to fragment segments into tiny
390	       pieces and it reopens the vulnerability to floods of small-
391	       packets that drop-tail queues suffered from and AQM was designed
392	       to remove.

394	   2.  If a vendor has implemented byte-mode drop, and an operator has
395	       turned it on, it is RECOMMENDED to turn it off.  Note that RED as
396	       a whole SHOULD NOT be turned off, as without it, a drop tail
397	       queue also biases against large packets.  But note also that
398	       turning off byte-mode drop may alter the relative performance of
399	       applications using different packet sizes, so it would be
400	       advisable to establish the implications before turning it off.

402	       Note well that RED's byte-mode drop is completely
403	       orthogonal to byte-mode queue measurement and should not be
404	       confused with it.  If a RED implementation has a byte-mode but
405	       does not specify what sort of byte-mode, it is most probably
406	       byte-mode queue measurement, which is fine.  However, if in
407	       doubt, the vendor should be consulted.

409	   A survey (Appendix A) showed that there appears to be little, if any,
410	   installed base of the byte-mode drop variant of RED.  This suggests
411	   that deprecating byte-mode drop will have little, if any, incremental
412	   deployment impact.

414	2.3.  Recommendation on Responding to Congestion

416	   When a transport detects that a packet has been lost or congestion
417	   marked, it SHOULD consider the strength of the congestion indication
418	   as proportionate to the size in octets (bytes) of the missing or
419	   marked packet.

421	   In other words, when a packet indicates congestion (by being lost or
422	   marked) it can be considered conceptually as if there is a congestion
423	   indication on every octet of the packet, not just one indication per
424	   packet.

426	   Therefore, the IETF transport area should continue its programme of:

428	   o  updating host-based congestion control protocols to take account
429	      of packet size

431	   o  making transports less sensitive to losing control packets like
432	      SYNs and pure ACKs.

434	   What this advice means for the case of TCP:

436	   1.  If two TCP flows with different packet sizes are required to run
437	       at equal bit rates under the same path conditions, this should be
438	       done by altering TCP (Section 4.2.2), not network equipment (the
439	       latter affects other transports besides TCP).

441	   2.  If it is desired to improve TCP performance by reducing the
442	       chance that a SYN or a pure ACK will be dropped, this should be
443	       done by modifying TCP (Section 4.2.3), not network equipment.
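
   One possible reading of "an indication on every octet" is that a
   transport estimates the congestion level as the fraction of octets,
   rather than the fraction of packets, that were lost or marked.  The
   sketch below contrasts the two.  The function name congestion_level
   and the packet representation are assumptions made for illustration;
   the draft does not prescribe a particular estimator.

```python
def congestion_level(packets, by_bytes=True):
    """Estimate congestion from (size_in_octets, indicated) pairs.

    'indicated' is True if the packet was lost or congestion marked.
    With by_bytes=True each packet counts in proportion to its size;
    with by_bytes=False each packet counts once.
    """
    total = 0
    indicated_total = 0
    for size, indicated in packets:
        weight = size if by_bytes else 1
        total += weight
        if indicated:
            indicated_total += weight
    return indicated_total / total if total else 0.0


observed = [(1500, True), (500, False)]
print(congestion_level(observed, by_bytes=True))    # weighted by octets
print(congestion_level(observed, by_bytes=False))   # one count per packet
```

   When all packets in a flow are the same size the two estimates
   coincide; they diverge only when packet sizes differ.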

445	2.4.  Recommendation on Handling Congestion Indications when Splitting
446	      or Merging Packets

448	   Packets carrying congestion indications may be split or merged in
449	   some circumstances (e.g. at an RTCP transcoder or during IP fragment
450	   reassembly).  Splitting and merging only make sense in the context of
451	   ECN, not loss.

453	   The general rule to follow is that the number of octets in packets
454	   with congestion indications SHOULD be equivalent before and after
455	   merging or splitting.  This is based on the principle used above:
456	   that an indication of congestion on a packet can be considered as an
457	   indication of congestion on each octet of the packet.

459	   The above rule is not phrased with the word "MUST" to allow the
460	   following exception.  There are cases where pre-existing protocols
461	   were not designed to conserve congestion marked octets (e.g.  IP
462	   fragment reassembly [RFC3168] or loss statistics in RTCP receiver
463	   reports [RFC3550] before ECN was added
464	   [I-D.ietf-avtcore-ecn-for-rtp]).  When any such protocol is updated,
465	   it SHOULD comply with the above rule to conserve marked octets.
466	   However, the rule may be relaxed if it would otherwise become too
467	   complex to interoperate with pre-existing implementations of the
468	   protocol.

470	   One can think of a splitting or merging process as if all the
471	   incoming congestion-marked octets increment a counter and all the
472	   outgoing marked octets decrement the same counter.  In order to
473	   ensure that congestion indications remain timely, even the smallest
474	   positive remainder in the conceptual counter should trigger the next
475	   outgoing packet to be marked (causing the counter to go negative).
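
   The conceptual counter described above can be sketched as a small
   class.  This is a minimal illustration, not a specification: the
   class and method names are hypothetical, and a real splitter or
   merger would apply the rule within whatever protocol it implements.

```python
class MarkedOctetCounter:
    """Conserve congestion-marked octets across a splitting or merging process."""

    def __init__(self):
        # Marked octets received minus marked octets sent so far.
        self.balance = 0

    def on_input(self, size, marked):
        """Account for an incoming packet of 'size' octets."""
        if marked:
            self.balance += size

    def on_output(self, size):
        """Return True if the outgoing packet of 'size' octets should be marked."""
        if self.balance > 0:
            # Even the smallest positive remainder marks the next packet,
            # which may drive the balance negative.
            self.balance -= size
            return True
        return False


# Splitting: one marked 1500-octet packet leaves as 500-octet packets.
counter = MarkedOctetCounter()
counter.on_input(1500, marked=True)
print([counter.on_output(500) for _ in range(4)])

# Merging: one marked and two unmarked 500-octet fragments leave as one packet.
counter = MarkedOctetCounter()
for marked in (True, False, False):
    counter.on_input(500, marked)
print(counter.on_output(1500), counter.balance)
```

   Allowing the balance to go negative means a later marked input is
   first used to repay the octets that were marked early, so the number
   of marked octets is conserved over time while the indication itself
   is never delayed.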

477	3.  Motivating Arguments

479	   This section is informative.  It justifies the recommendations given
480	   in the previous section.

482	3.1.  Avoiding Perverse Incentives to (Ab)use Smaller Packets

484	   Increasingly, it is being recognised that a protocol design must take
485	   care not to cause unintended consequences by giving the parties in
486	   the protocol exchange perverse incentives [Evol_cc][RFC3426].  Given
487	   there are many good reasons why larger path maximum transmission
488	   units (PMTUs) would help solve a number of scaling issues, we do not
489	   want to create any bias against large packets that is greater than
490	   their true cost.

492	   Imagine a scenario where a given bit rate contributes the same
493	   amount to bit-congestion of a link irrespective of whether it is
494	   sent as fewer larger packets or more smaller packets.  A protocol
495	   design that caused larger packets to be more likely to be dropped
496	   than smaller ones would be dangerous in both the following cases:

498	   Malicious transports:  A queue that gives an advantage to small
499	      packets can be used to amplify the force of a flooding attack.  By
500	      sending a flood of small packets, the attacker can get the queue
501	      to discard more traffic in large packets, allowing more attack
502	      traffic to get through to cause further damage.  Such a queue
503	      allows attack traffic to have a disproportionately large effect on
504	      regular traffic without the attacker having to do much work.

506	   Non-malicious transports:  Even if a transport designer is not
507	      actually malicious, if over time it is noticed that small packets
508	      tend to go faster, designers will act in their own interest and
509	      use smaller packets.  Queues that give advantage to small packets
510	      create an evolutionary pressure for transports to send at the same
511	      bit-rate but break their data stream down into tiny segments to
512	      reduce their drop rate.  Encouraging a high volume of tiny packets
513	      might in turn unnecessarily overload a completely unrelated part
514	      of the system, perhaps more limited by header-processing than
515	      bandwidth.

517	   Imagine that two unresponsive flows arrive at a bit-congestible
518	   transmission link each with the same bit rate, say 1Mbps, but one
519	   consists of 1500B and the other 60B packets, which are 25x smaller.
520	   Consider a scenario where gentle RED [gentle_RED] is used, along with
521	   the variant of RED we advise against, i.e. where the RED algorithm is
522	   configured to adjust the drop probability of packets in proportion to
523	   each packet's size (byte mode packet drop).  In this case, RED aims
524	   to drop 25x more of the larger packets than the smaller ones.  Thus,
525	   for example if RED drops 25% of the larger packets, it will aim to
526	   drop 1% of the smaller packets (but in practice it may drop more as
527	   congestion increases [RFC4828; Appx B.4]).  Even though both flows
528	   arrive with the same bit rate, the bit rate the RED queue aims to
529	   pass to the line will be 750kbps for the flow of larger packets but
530	   990kbps for the smaller packets (because of rate variations it will
531	   actually be a little less than this target).
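
	   The arithmetic above can be checked with a minimal Python sketch.
	   It assumes, purely for illustration, that byte-mode drop scales the
	   drop probability linearly with packet size relative to the 1500B
	   packet.  The function and variable names are hypothetical and are
	   not taken from any RED implementation.

```python
def passed_rate_kbps(arrival_kbps, drop_prob):
    """Bit rate a queue aims to pass for a flow of equal-sized packets
    that it drops with probability drop_prob."""
    return arrival_kbps * (1.0 - drop_prob)

LARGE_B, SMALL_B = 1500, 60   # packet sizes from the example in the text
ARRIVAL_KBPS = 1000           # both flows arrive at 1Mbps

p_large = 0.25                         # drop probability for 1500B packets
p_small = p_large * SMALL_B / LARGE_B  # byte-mode drop: scaled by relative size

large_out = passed_rate_kbps(ARRIVAL_KBPS, p_large)
small_out = passed_rate_kbps(ARRIVAL_KBPS, p_small)
print(f"large-packet flow: {large_out:.0f}kbps, small-packet flow: {small_out:.0f}kbps")
```

	   These are the target rates only; as the text notes, the rates
	   actually passed will be a little lower.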

533	   Note that, although the byte-mode drop variant of RED amplifies small
534	   packet attacks, drop-tail queues amplify small packet attacks even
535	   more (see Security Considerations in Section 6).  Wherever possible
536	   neither should be used.

538	3.2.  Small != Control

540	   Dropping fewer control packets considerably improves performance.  It
541	   is tempting to drop small packets with lower probability in order to
542	   improve performance, because many control packets are small (TCP SYNs
543	   & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc.).
544	   However, we must not give control packets preference purely by virtue
545	   of their smallness, otherwise it is too easy for any data source to
546	   get the same preferential treatment simply by sending data in smaller
547	   packets.  Again, we should not create a perverse incentive by
548	   favouring small packets in general when what we intend is to favour
549	   control packets.

551	   Just because many control packets are small does not mean all small
552	   packets are control packets.

554	   So, rather than fix these problems in the network, we argue that the
555	   transport should be made more robust against losses of control
556	   packets (see 'Making Transports Robust against Control Packet Losses'
557	   in Section 4.2.3).

559	3.3.  Transport-Independent Network

561	   TCP congestion control ensures that flows competing for the same
562	   resource each maintain the same number of segments in flight,
563	   irrespective of segment size.  So under similar conditions, flows
564	   with different segment sizes will get different bit-rates.

566	   One motivation for the network biasing congestion notification by
567	   packet size is to counter this effect and try to equalise the bit-
568	   rates of flows with different packet sizes.  However, in order to do
569	   this, the queuing algorithm has to make assumptions about the
570	   transport, which become embedded in the network.  Specifically:

572	   o  The queuing algorithm has to assume how aggressively the transport
573	      will respond to congestion (see Section 4.2.4).  If the network
574	      assumes the transport responds as aggressively as TCP NewReno, it
575	      will be wrong for Compound TCP and differently wrong for Cubic
576	      TCP, etc.  To achieve equal bit-rates, each transport then has to
577	      guess what assumption the network made, and work out how to
578	      replace this assumed aggressiveness with its own aggressiveness.

580	   o  Also, if the network biases congestion notification by packet size
581	      it has to assume a baseline packet size--all proposed algorithms
582	      use the local MTU.  Then transports have to guess which link was
583	      congested and what its local MTU was, in order to know how to
584	      tailor their congestion response to that link.

586	   Even though reducing the drop probability of small packets (e.g.
587	   RED's byte-mode drop) helps ensure TCP flows with different packet
588	   sizes will achieve similar bit rates, we argue this correction should
589	   be made to any future transport protocols based on TCP, not to the
590	   network in order to fix one transport, no matter how predominant it
591	   is.  Effectively, favouring small packets is reverse engineering of
592	   network equipment around one particular transport protocol (TCP),
593	   contrary to the excellent advice in [RFC3426], which asks designers
594	   to question "Why are you proposing a solution at this layer of the
595	   protocol stack, rather than at another layer?"

597	   In contrast, if the network never takes account of packet size, the
598	   transport can be certain it will never need to guess any assumptions
599	   the network has made.  And the network passes two pieces of
600	   information to the transport that are sufficient in all cases: i)
601	   congestion notification on the packet and ii) the size of the packet.
602	   Both are available for the transport to combine (by taking account of
603	   packet size when responding to congestion) or not.  Appendix B checks
604	   that these two pieces of information are sufficient for all relevant
605	   scenarios.

607	   When the network does not take account of packet size, it allows
608	   transport protocols to choose whether to take account of packet size
609	   or not.  However, if the network were to bias congestion notification
610	   by packet size, transport protocols would have no choice; those that
611	   did not take account of packet size themselves would unwittingly
612	   become dependent on packet size, and those that already took account
613	   of packet size would end up taking account of it twice.

615	3.4.  Scaling Congestion Control with Packet Size

617	   Having so far justified only our recommendations for the network,
618	   this section focuses on the host.  We construct a scaling argument to
619	   justify the recommendation that a host should respond to a dropped or
620	   marked packet in proportion to its size, not just as a single
621	   congestion event.

623	   The argument assumes that we have already sufficiently justified our
624	   recommendation that the network should not take account of packet
625	   size.

627	   Also, we assume bit-congestible links are the predominant source of
628	   congestion.  As the Internet stands, it is hard if not impossible to
629	   know whether congestion notification is from a bit-congestible or a
630	   packet-congestible resource (see Appendix B.2) so we have to assume
631	   the most prevalent case (see Section 1.1).  If this assumption is
632	   wrong, and particular congestion indications are actually due to
633	   overload of packet-processing, there is no issue of safety at stake.
634	   Any congestion control that triggers a multiplicative decrease in
635	   response to a congestion indication will bring packet processing back
636	   to its operating point just as quickly.  The only issue at stake is
637	   that the resource could be utilised more efficiently if packet-
638	   congestion could be separately identified.

640	   Imagine a bit-congestible link shared by many flows, so that each
641	   busy period tends to cause packets to be lost from different flows.
642	   Consider further two sources that have the same data rate but break
643	   the load into large packets in one application (A) and small packets
644	   in the other (B).  Of course, because the load is the same, there
645	   will be proportionately more packets in the small packet flow (B).

647	   If a congestion control scales with packet size it should respond in
648	   the same way to the same congestion notification, irrespective of the
649	   size of the packets containing the bytes that contribute to
650	   congestion.

652	   A bit-congestible queue suffering congestion has to drop or mark the
653	   same excess bytes whether they are in a few large packets (A) or many
654	   small packets (B).  So for the same amount of congestion overload,
655	   the same number of bytes has to be shed to get the load back to its
656	   operating point.  For smaller packets (B) more packets will have to
657	   be discarded to shed the same bytes.

659	   If both the transports interpret each drop/mark as a single loss
660	   event irrespective of the size of the packet dropped, the flow of
661	   smaller packets (B) will respond more times to the same congestion.
662	   On the other hand, if a transport responds proportionately less when
663	   smaller packets are dropped/marked, overall it will be able to
664	   respond the same to the same amount of congestion.

666	   Therefore, for a congestion control to scale with packet size it
667	   should respond to dropped or marked bytes (as TFRC-SP [RFC4828]
668	   effectively does), instead of dropped or marked packets (as TCP
669	   does).
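
	   The scaling argument above can be illustrated with a small Python
	   sketch.  The packet sizes (1500B for A, 60B for B) and the 30,000
	   excess bytes are hypothetical values chosen for the example, and
	   the "response" is reduced to a simple count; this is not a model of
	   any particular congestion control.

```python
def packets_to_shed(excess_bytes, packet_size):
    """Packets of one size that must be dropped or marked to shed
    excess_bytes (rounded up to a whole packet)."""
    return -(-excess_bytes // packet_size)

EXCESS_BYTES = 30_000        # same overload for both flows (assumed value)
SIZE_A, SIZE_B = 1500, 60    # large-packet flow A, small-packet flow B

events_a = packets_to_shed(EXCESS_BYTES, SIZE_A)
events_b = packets_to_shed(EXCESS_BYTES, SIZE_B)

# Responding per packet: B reacts once per drop, so it reacts far more
# often than A to the same amount of congestion.
per_packet_a, per_packet_b = events_a, events_b

# Responding per byte: each drop is weighted by its size, so both flows
# see the same total congestion signal.
per_byte_a = events_a * SIZE_A
per_byte_b = events_b * SIZE_B

print(per_packet_a, per_packet_b, per_byte_a, per_byte_b)
```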

671	   For the avoidance of doubt, this is not a recommendation that TCP
672	   should be changed so that it scales with packet size.  It is a
673	   recommendation that any future transport protocol proposal should
674	   respond to dropped or marked bytes if it wishes to claim that it is
675	   scalable.

677	3.5.  Implementation Efficiency

679	   Allowing for packet size at the transport rather than in the network
680	   ensures that neither the network nor the transport needs to do a
681	   multiply operation--multiplication by packet size is effectively
682	   achieved as a repeated add when the transport adds to its count of
683	   marked bytes as each congestion event is fed to it.  This isn't a
684	   principled reason in itself, but it is a happy consequence of the
685	   other principled reasons.
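
	   As a hedged illustration of the "repeated add", the following
	   Python sketch shows a transport-side counter that accumulates
	   marked bytes using additions only.  The class and method names are
	   hypothetical; no existing transport is claimed to be structured
	   this way.

```python
class MarkedByteCounter:
    """Accumulates congestion feedback in bytes using only additions."""

    def __init__(self):
        self.total_bytes = 0
        self.marked_bytes = 0

    def on_packet(self, size, marked):
        # Adding the packet's size for each congestion event weights the
        # event by packet size without any explicit multiplication.
        self.total_bytes += size
        if marked:
            self.marked_bytes += size

counter = MarkedByteCounter()
for size, marked in [(1500, False), (60, True), (60, True), (1500, True)]:
    counter.on_packet(size, marked)
print(counter.marked_bytes, counter.total_bytes)
```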

687	4.  A Survey and Critique of Past Advice

689	   This section is informative, not normative.

691	   The original 1993 paper on RED [RED93] proposed two options for the
692	   RED active queue management algorithm: packet mode and byte mode.
693	   Packet mode measured the queue length in packets and dropped (or
694	   marked) individual packets with a probability independent of their
695	   size.  Byte mode measured the queue length in bytes and marked an
696	   individual packet with probability in proportion to its size
697	   (relative to the maximum packet size).  In the paper's outline of
698	   further work, it was stated that no recommendation had been made on
699	   whether the queue size should be measured in bytes or packets, but
700	   noted that the difference could be significant.

702	   When RED was recommended for general deployment in 1998 [RFC2309],
703	   the two modes were mentioned implying the choice between them was a
704	   question of performance, referring to a 1997 email [pktByteEmail] for
705	   advice on tuning.  A later addendum to this email introduced the
706	   insight that there are in fact two orthogonal choices:

708	   o  whether to measure queue length in bytes or packets (Section 4.1)

710	   o  whether the drop probability of an individual packet should depend
711	      on its own size (Section 4.2).

713	   The rest of this section is structured accordingly.

715	4.1.  Congestion Measurement Advice

717	   The choice of which metric to use to measure queue length was left
718	   open in RFC2309.  It is now well understood that queues for bit-
719	   congestible resources should be measured in bytes, and queues for
720	   packet-congestible resources should be measured in packets
721	   [pktByteEmail].

723	   Congestion in some legacy bit-congestible buffers is only measured in
724	   packets not bytes.  In such cases, the operator has to set the
725	   thresholds mindful of a typical mix of packet sizes.  Any AQM
726	   algorithm on such a buffer will be oversensitive to high proportions
727	   of small packets, e.g. a DoS attack, and undersensitive to high
728	   proportions of large packets.  However, there is no need to make
729	   allowances for the possibility of such legacy in future protocol
730	   design.  This is safe because any undersensitivity during unusual
731	   traffic mixes cannot lead to congestion collapse given the buffer
732	   will eventually revert to tail drop, discarding proportionately more
733	   large packets.

735	4.1.1.  Fixed Size Packet Buffers

737	   The question of whether to measure queues in bytes or packets seems
738	   to be well understood.  However, measuring congestion is not
739	   straightforward when the resource is bit congestible but the queue is
740	   packet congestible or vice versa.  This section outlines the approach
741	   to take.  There is no controversy over what should be done; you just
742	   need to be expert in probability to work it out.  And, even if you
743	   know what should be done, it's not always easy to find a practical
744	   algorithm to implement it.

746	   Some, mostly older, queuing hardware sets aside fixed sized buffers
747	   in which to store each packet in the queue.  Also, with some
748	   hardware, any fixed sized buffers not completely filled by a packet
749	   are padded when transmitted to the wire.  If we imagine a theoretical
750	   forwarding system with both queuing and transmission in fixed, MTU-
751	   sized units, it should clearly be treated as packet-congestible,
752	   because the queue length in packets would be a good model of
753	   congestion of the lower layer link.

755	   If we now imagine a hybrid forwarding system with transmission delay
756	   largely dependent on the byte-size of packets but buffers of one MTU
757	   per packet, it should strictly require a more complex algorithm to
758	   determine the probability of congestion.  It should be treated as two
759	   resources in sequence, where the sum of the byte-sizes of the packets
760	   within each packet buffer models congestion of the line while the
761	   length of the queue in packets models congestion of the queue.  Then
762	   the probability of congesting the forwarding buffer would be a
763	   conditional probability--conditional on the previously calculated
764	   probability of congesting the line.

766	   In systems that use fixed size buffers, it is unusual for all the
767	   buffers used by an interface to be the same size.  Typically pools of
768	   different sized buffers are provided (Cisco uses the term 'buffer
769	   carving' for the process of dividing up memory into these pools
770	   [IOSArch]).  Usually, if the pool of small buffers is exhausted,
771	   arriving small packets can borrow space in the pool of large buffers,
772	   but not vice versa.  However, it is easier to work out what should be
773	   done if we temporarily set aside the possibility of such borrowing.
774	   Then, with fixed pools of buffers for different sized packets and no
775	   borrowing, the size of each pool and the current queue length in each
776	   pool would both be measured in packets.  So an AQM algorithm would
777	   have to maintain the queue length for each pool, and judge whether to
778	   drop/mark a packet of a particular size by looking at the pool for
779	   packets of that size and using the length (in packets) of its queue.

781	   We now return to the issue we temporarily set aside: small packets
782	   borrowing space in larger buffers.  In this case, the only difference
783	   is that the pools for smaller packets have a maximum queue size that
784	   includes all the pools for larger packets.  And every time a packet
785	   takes a larger buffer, the current queue size has to be incremented
786	   for all queues in the pools of buffers less than or equal to the
787	   buffer size used.
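
	   The bookkeeping described above can be sketched in Python as
	   follows.  This is a minimal illustration under assumed pool sizes
	   (two 128B buffers and three 1500B buffers); the class, its methods
	   and the first-fit borrowing policy are hypothetical and do not
	   describe any vendor's implementation.

```python
class BufferPools:
    """Per-pool queue lengths, in packets, with small packets allowed to
    borrow buffers from pools of larger buffers."""

    def __init__(self, pools):
        self.capacity = dict(pools)                   # buffer size -> buffer count
        self.sizes = sorted(self.capacity)
        self.in_use = {s: 0 for s in self.sizes}      # buffers occupied per pool
        self.queue_len = {s: 0 for s in self.sizes}   # length an AQM would read

    def max_queue(self, size):
        # A pool's maximum queue size includes all pools of larger buffers.
        return sum(self.capacity[s] for s in self.sizes if s >= size)

    def enqueue(self, pkt_size):
        # Take the smallest buffer that fits and is free, borrowing upwards.
        for s in self.sizes:
            if s >= pkt_size and self.in_use[s] < self.capacity[s]:
                self.in_use[s] += 1
                # Increment every queue whose buffers are <= the one used.
                for t in self.sizes:
                    if t <= s:
                        self.queue_len[t] += 1
                return s
        return None   # no buffer available: the packet would be dropped

    def release(self, buf_size):
        self.in_use[buf_size] -= 1
        for t in self.sizes:
            if t <= buf_size:
                self.queue_len[t] -= 1

pools = BufferPools({128: 2, 1500: 3})
for _ in range(3):
    pools.enqueue(64)     # the third small packet borrows a 1500B buffer
print(pools.queue_len)
```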

789	   We will return to borrowing of fixed sized buffers when we discuss
790	   biasing the drop/marking probability of a specific packet because of
791	   its size in Section 4.2.1.  But here we can give at least one
792	   simple rule for how to measure the length of queues of fixed buffers:
793	   no matter how complicated the scheme is, ultimately any fixed buffer
794	   system will need to measure its queue length in packets not bytes.

796	4.1.2.  Congestion Measurement without a Queue

798	   AQM algorithms are nearly always described assuming there is a queue
799	   for a congested resource and the algorithm can use the queue length
800	   to determine the probability that it will drop or mark each packet.
801	   But not all congested resources lead to queues.  For instance,
802	   wireless spectrum is usually regarded as bit-congestible (for a given
803	   coding scheme).  But wireless link protocols do not always maintain a
804	   queue that depends on spectrum interference.  Similarly, power
805	   limited resources are also usually bit-congestible if energy is
806	   primarily required for transmission rather than header processing,
807	   but it is rare for a link protocol to build a queue as it approaches
808	   maximum power.

810	   Nonetheless, AQM algorithms do not require a queue in order to work.
811	   For instance spectrum congestion can be modelled by signal quality
812	   using target bit-energy-to-noise-density ratio.  And, to model radio
813	   power exhaustion, transmission power levels can be measured and
814	   compared to the maximum power available.  [ECNFixedWireless] proposes
815	   a practical and theoretically sound way to combine congestion
816	   notification for different bit-congestible resources at different
817	   layers along an end to end path, whether wireless or wired, and
818	   whether with or without queues.

820	4.2.  Congestion Notification Advice

822	4.2.1.  Network Bias when Encoding

824	4.2.1.1.  Advice on Packet Size Bias in RED

826	   The previously mentioned email [pktByteEmail] referred to by
827	   [RFC2309] advised that most scarce resources in the Internet were
828	   bit-congestible, which is still believed to be true (Section 1.1).
829	   But it went on to offer advice that is updated by this memo.  It said
830	   that drop probability should depend on the size of the packet being
831	   considered for drop if the resource is bit-congestible, but not if it
832	   is packet-congestible.  The argument continued that if packet drops
833	   were inflated by packet size (byte-mode dropping), "a flow's fraction
834	   of the packet drops is then a good indication of that flow's fraction
835	   of the link bandwidth in bits per second".  This was consistent with
836	   a referenced policing mechanism being worked on at the time for
837	   detecting unusually high bandwidth flows, eventually published in
838	   1999 [pBox].  However, the problem could and should have been solved
839	   by making the policing mechanism count the volume of bytes randomly
840	   dropped, not the number of packets.
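
	   To see why counting dropped bytes would have sufficed, consider the
	   following Python sketch.  It assumes two constant-rate flows of
	   equal bit-rate with 1500B and 60B packets and a size-independent
	   (packet-mode) drop probability of 1%; all of these values, and the
	   function name, are illustrative assumptions.  It works with expected
	   values rather than simulating random drops.

```python
def expected_drops(bit_rate_bps, pkt_size_bytes, drop_prob, seconds=1.0):
    """Expected (dropped packets, dropped bytes) for a constant-rate flow
    of fixed-size packets under size-independent drop."""
    packets = bit_rate_bps * seconds / (8 * pkt_size_bytes)
    dropped_packets = drop_prob * packets
    return dropped_packets, dropped_packets * pkt_size_bytes

P = 0.01                                             # assumed drop probability
pkts_a, bytes_a = expected_drops(1_000_000, 1500, P)  # large-packet flow
pkts_b, bytes_b = expected_drops(1_000_000, 60, P)    # small-packet flow

# Each flow uses half the bandwidth.  Counting dropped packets greatly
# overstates the small-packet flow's share; counting dropped bytes does not.
share_by_packets_b = pkts_b / (pkts_a + pkts_b)
share_by_bytes_b = bytes_b / (bytes_a + bytes_b)
print(round(share_by_packets_b, 3), round(share_by_bytes_b, 3))
```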

842	   A few months before RFC2309 was published, an addendum was added to
843	   the above archived email referenced from the RFC, in which the final
844	   paragraph seemed to partially retract what had previously been said.
845	   It clarified that the question of whether the probability of
846	   dropping/marking a packet should depend on its size was not related
847	   to whether the resource itself was bit congestible, but a completely
848	   orthogonal question.  However the only example given had the queue
849	   measured in packets but packet drop depended on the byte-size of the
850	   packet in question.  No example was given the other way round.

852	   In 2000, Cnodder et al [REDbyte] pointed out that there was an error
853	   in the part of the original 1993 RED algorithm that aimed to
854	   distribute drops uniformly, because it didn't correctly take into
855	   account the adjustment for packet size.  They recommended an
856	   algorithm called RED_4 to fix this.  But they also recommended a
857	   further change, RED_5, to adjust drop rate dependent on the square of
858	   relative packet size.  This was indeed consistent with one implied
859	   motivation behind RED's byte mode drop--that we should reverse
860	   engineer the network to improve the performance of dominant end-to-
861	   end congestion control mechanisms.  This memo makes different
862	   recommendations in Section 2.

864	   By 2003, a further change had been made to the adjustment for packet
865	   size, this time in the RED algorithm of the ns2 simulator.  Instead
866	   of taking each packet's size relative to a `maximum packet size' it
867	   was taken relative to a `mean packet size', intended to be a static
868	   value representative of the `typical' packet size on the link.  We
869	   have not been able to find a justification in the literature for this
870	   change, although Eddy and Allman conducted experiments [REDbias] that
871	   assessed how sensitive RED was to this parameter, amongst other
872	   things.  Moreover, this changed algorithm can often lead to drop
873	   probabilities greater than 1 (which hints that there is probably a
874	   mistake in the theory somewhere).
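
	   A simplified Python sketch shows how scaling relative to a mean
	   rather than a maximum packet size can push the result above 1.  It
	   assumes the adjustment is a plain multiplication of a base drop
	   probability by relative packet size, with no clamping; the numbers
	   and names are hypothetical and this is not the ns2 code.

```python
def scaled_drop_prob(base_prob, pkt_size, ref_size):
    """Base drop probability scaled by packet size relative to ref_size.
    Deliberately not clamped, to expose out-of-range results."""
    return base_prob * pkt_size / ref_size

BASE_PROB = 0.5        # assumed base probability under heavy congestion
PKT_SIZE = 1500        # bytes
MAX_PKT_SIZE = 1500    # reference used by the original byte mode
MEAN_PKT_SIZE = 500    # assumed 'typical' size used as the reference instead

p_relative_to_max = scaled_drop_prob(BASE_PROB, PKT_SIZE, MAX_PKT_SIZE)
p_relative_to_mean = scaled_drop_prob(BASE_PROB, PKT_SIZE, MEAN_PKT_SIZE)
print(p_relative_to_max, p_relative_to_mean)
```

	   Relative to the maximum packet size the ratio never exceeds 1, so
	   the scaled value stays within the base probability.  Relative to a
	   mean, any packet larger than the mean inflates it.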

876	   On 10-Nov-2004, this variant of byte-mode packet drop was made the
877	   default in the ns2 simulator.  It seems unlikely that byte-mode drop
878	   has ever been implemented in production networks (Appendix A), so
879	   any conclusions based on ns2 simulations that use RED without
880	   disabling byte-mode drop are likely to differ markedly from the
881	   behaviour of RED in production networks.

883	4.2.1.2.  Packet Size Bias Regardless of RED

885	   The byte-mode drop variant of RED is, of course, not the only
886	   possible bias towards small packets in queueing systems.  We have
887	   already mentioned that tail-drop queues naturally tend to lock-out
888	   large packets once they are full.  But also queues with fixed sized
889	   buffers reduce the probability that small packets will be dropped if
890	   (and only if) they allow small packets to borrow buffers from the
891	   pools for larger packets.  As was explained in Section 4.1.1 on fixed
892	   size buffer carving, borrowing effectively makes the maximum queue
893	   size for small packets greater than that for large packets, because
894	   more buffers can be used by small packets while fewer will fit large
895	   packets.

897	   In itself, the bias towards small packets caused by buffer borrowing
898	   is perfectly correct.  Lower drop probability for small packets is
899	   legitimate in buffer borrowing schemes, because small packets
900	   genuinely congest the machine's buffer memory less than large
901	   packets, given they can fit in more spaces.  The bias towards small
902	   packets is not artificially added (as it is in RED's byte-mode drop
903	   algorithm), it merely reflects the reality of the way fixed buffer
904	   memory gets congested.  Incidentally, the bias towards small packets
905	   from buffer borrowing is nothing like as large as that of RED's byte-
906	   mode drop.

908	   Nonetheless, fixed-buffer memory with tail drop is still prone to
909	   lock-out large packets, purely because of the tail-drop aspect.  So a
910	   good AQM algorithm like RED with packet-mode drop should be used with
911	   fixed buffer memories where possible.  If RED is too complicated to
912	   implement with multiple fixed buffer pools, the minimum necessary to
913	   prevent large packet lock-out is to ensure smaller packets never use
914	   the last available buffer in any of the pools for larger packets.

916	4.2.2.  Transport Bias when Decoding

918	   The above proposals to alter the network equipment to bias towards
919	   smaller packets have largely carried on outside the IETF process.
920	   Whereas, within the IETF, there are many different proposals to alter
921	   transport protocols to achieve the same goals, i.e. either to make
922	   the flow bit-rate take account of packet size, or to protect control
923	   packets from loss.  This memo argues that altering transport
924	   protocols is the more principled approach.

926	   A recently approved experimental RFC adapts its transport layer
927	   protocol to take account of packet sizes relative to typical TCP
928	   packet sizes.  This proposes a new small-packet variant of TCP-
929	   friendly rate control [RFC5348] called TFRC-SP [RFC4828].
930	   Essentially, it proposes a rate equation that inflates the flow rate
931	   by the ratio of a typical TCP segment size (1500B including TCP
932	   header) over the actual segment size [PktSizeEquCC].  (There are also
933	   other important differences of detail relative to TFRC, such as using
934	   virtual packets [CCvarPktSize] to avoid responding to multiple losses
935	   per round trip and using a minimum inter-packet interval.)

937	   Section 4.5.1 of this TFRC-SP spec discusses the implications of
938	   operating in an environment where queues have been configured to drop
939	   smaller packets with proportionately lower probability than larger
940	   ones.  But it only discusses TCP operating in such an environment,
941	   only mentioning TFRC-SP briefly when discussing how to define
942	   fairness with TCP.  And it only discusses the byte-mode dropping
943	   version of RED as it was before Cnodder et al pointed out it didn't
944	   sufficiently bias towards small packets to make TCP independent of
945	   packet size.

947	   So the TFRC-SP spec doesn't address the issue of which of the network
948	   or the transport _should_ handle fairness between different packet
949	   sizes.  In its Appendix B.4 it discusses the possibility of both
950	   TFRC-SP and some network buffers duplicating each other's attempts to
951	   deliberately bias towards small packets.  But the discussion is not
952	   conclusive, instead reporting simulations of many of the
953	   possibilities in order to assess performance but not recommending any
954	   particular course of action.

956	   The paper originally proposing TFRC with virtual packets (VP-TFRC)
957	   [CCvarPktSize] proposed that there should perhaps be two variants to
958	   cater for the different variants of RED.  However, as the TFRC-SP
959	   authors point out, there is no way for a transport to know whether
960	   some queues on its path have deployed RED with byte-mode packet drop
961	   (except if an exhaustive survey found that no-one has deployed it!--
962	   see Appendix A).  Incidentally, VP-TFRC also proposed that byte-mode
963	   RED dropping should really square the packet-size compensation-factor
964	   (like that of Cnodder's RED_5, but apparently unaware of it).

966	   Pre-congestion notification [RFC5670] is an IETF technology to use a
967	   virtual queue for AQM marking for packets within one Diffserv class
968	   in order to give early warning prior to any real queuing.  The PCN
969	   marking algorithms have been designed not to take account of packet
970	   size when forwarding through queues.  Instead the general principle
971	   has been to take account of the sizes of marked packets when
972	   monitoring the fraction of marking at the edge of the network, as
973	   recommended here.

975	4.2.3.  Making Transports Robust against Control Packet Losses

977	   Recently, two RFCs have defined changes to TCP that make it more
978	   robust against losing small control packets [RFC5562] [RFC5690].  In
979	   both cases they note that the case for these two TCP changes would be
980	   weaker if RED were biased against dropping small packets.  We argue
981	   here that these two proposals are a safer and more principled way to
982	   achieve TCP performance improvements than reverse engineering RED to
983	   benefit TCP.

985	   Although there are no known proposals, it would also be possible and
986	   perfectly valid to make control packets robust against drop by
987	   using their Diffserv code point [RFC2474] to explicitly request a
988	   scheduling class with lower drop probability.

990	   Although not brought to the IETF, a simple proposal from Wischik
991	   [DupTCP] suggests that the first three packets of every TCP flow
992	   should be routinely duplicated after a short delay.  It shows that
993	   this would greatly improve the chances of short flows completing
994	   quickly, but it would hardly increase traffic levels on the Internet,
995	   because Internet bytes have always been concentrated in the large
996	   flows.  It further shows that the performance of many typical
997	   applications depends on completion of long serial chains of short
998	   messages.  It argues that, given most of the value people get from
999	   the Internet is concentrated within short flows, this simple
1000	   expedient would greatly increase the value of the best efforts
1001	   Internet at minimal cost.

1003	4.2.4.  Congestion Notification: Summary of Conflicting Advice

1004	   +-----------+----------------+-----------------+--------------------+
1005	   | transport |  RED_1 (packet |  RED_4 (linear  | RED_5 (square byte |
1006	   |        cc |   mode drop)   | byte mode drop) |     mode drop)     |
1007	   +-----------+----------------+-----------------+--------------------+
1008	   |    TCP or |    s/sqrt(p)   |    sqrt(s/p)    |      1/sqrt(p)     |
1009	   |      TFRC |                |                 |                    |
1010	   |   TFRC-SP |    1/sqrt(p)   |    1/sqrt(sp)   |    1/(s.sqrt(p))   |
1011	   +-----------+----------------+-----------------+--------------------+

1013	    Table 2: Dependence of flow bit-rate per RTT on packet size, s, and
1014	   drop probability, p, when network and/or transport bias towards small
1015	                        packets to varying degrees

1017	   Table 2 aims to summarise the potential effects of all the advice
1018	   from different sources.  Each column shows a different possible AQM
1019	   behaviour in different queues in the network, using the terminology
1020	   of Cnodder et al outlined earlier (RED_1 is basic RED with packet-
1021	   mode drop).  Each row shows a different transport behaviour: TCP
1022	   [RFC5681] and TFRC [RFC5348] on the top row with TFRC-SP [RFC4828]
1023	   below.  Each cell shows how the bits per round trip of a flow depends
1024	   on packet size, s, and drop probability, p.  In order to declutter
1025	   the formulae to focus on packet-size dependence they are all given
1026	   per round trip, which removes any RTT term.

1028	   Let us assume that the goal is for the bit-rate of a flow to be
1029	   independent of packet size.  Suppressing all inessential details, the
1030	   table shows that this should either be achievable by not altering the
1031	   TCP transport in a RED_5 network, or using the small packet TFRC-SP
1032	   transport (or similar) in a network without any byte-mode dropping
1033	   RED (top right and bottom left).  Top left is the `do nothing'
1034	   scenario, while bottom right is the `do-both' scenario in which bit-
1035	   rate would become far too biased towards small packets.  Of course,
1036	   if any form of byte-mode dropping RED has been deployed on a subset
1037	   of queues that congest, each path through the network will present a
1038	   different hybrid scenario to its transport.
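
	   The packet-size dependence in Table 2 can be checked numerically
	   with the Python sketch below.  It takes the proportionalities in
	   the table at face value, treats s as packet size relative to a
	   full-sized packet, and compares a small-packet flow (s = 0.04)
	   with a full-sized one (s = 1) at an assumed base drop probability
	   of 0.01.  The dictionaries and function are illustrative only.

```python
from math import sqrt

# Effective per-packet drop probability for relative packet size s.
AQM = {
    "RED_1": lambda p, s: p,           # packet-mode drop
    "RED_4": lambda p, s: p * s,       # linear byte-mode drop
    "RED_5": lambda p, s: p * s * s,   # square byte-mode drop
}

# Bits per round trip, up to a constant factor.
TRANSPORT = {
    "TCP/TFRC": lambda s, p_eff: s / sqrt(p_eff),
    "TFRC-SP": lambda s, p_eff: 1 / sqrt(p_eff),
}

def size_sensitivity(transport, aqm, p=0.01, small=0.04):
    """Bit-rate of a small-packet flow relative to a full-size flow.
    A value of 1.0 means bit-rate is independent of packet size."""
    def rate(s):
        return TRANSPORT[transport](s, AQM[aqm](p, s))
    return rate(small) / rate(1.0)

table = {(t, a): size_sensitivity(t, a) for t in TRANSPORT for a in AQM}
for (t, a), ratio in table.items():
    print(f"{t:9s} {a}: {ratio:.2f}")
```

	   Under these assumptions only the top-right and bottom-left cells
	   give a ratio of 1, matching the discussion above.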

1040	   In any case, we can see that the linear byte-mode drop column in the
1041	   middle would considerably complicate the Internet.  It's a half-way
1042	   house that doesn't bias enough towards small packets even if one
1043	   believes the network should be doing the biasing.  Section 2
1044	   recommends that _all_ bias in network equipment towards small packets
1045	   should be turned off--if indeed any equipment vendors have
1046	   implemented it--leaving packet-size bias solely as the preserve of
1047	   the transport layer (solely the leftmost, packet-mode drop column).

1049	   In practice it seems that no deliberate bias towards small packets
1050	   has been implemented for production networks.  Of the 19% of vendors
1051	   who responded to a survey of 84 equipment vendors, none had
1052	   implemented byte-mode drop in RED (see Appendix A for details).

1054	5.  Outstanding Issues and Next Steps

1056	5.1.  Bit-congestible Network

1058	   For a connectionless network with nearly all resources being bit-
1059	   congestible the recommended position is clear--that the network
1060	   should not make allowance for packet sizes and the transport should.
1061	   This leaves two outstanding issues:

1063	   o  How to handle any legacy of AQM with byte-mode drop already
1064	      deployed;

1066	   o  The need to start a programme to update transport congestion
1067	      control protocol standards to take account of packet size.

1069	   A survey of equipment vendors (Section 4.2.4) found no evidence that
1070	   byte-mode packet drop had been implemented, so deployment will be
1071	   sparse at best.  A migration strategy is not really needed to remove
1072	   an algorithm that may not even be deployed.

1074	   A programme of experimental updates to take account of packet size in
1075	   transport congestion control protocols has already started with
1076	   TFRC-SP [RFC4828].

1078	5.2.  Bit- & Packet-congestible Network

1080	   The position is much less clear-cut if the Internet becomes populated
1081	   by a more even mix of both packet-congestible and bit-congestible
1082	   resources (see Appendix B.2).  This problem is not pressing, because
1083	   most Internet resources are designed to be bit-congestible before
1084	   packet processing starts to congest (see Section 1.1).

1086	   The IRTF Internet congestion control research group (ICCRG) has set
1087	   itself the task of reaching consensus on generic forwarding
1088	   mechanisms that are necessary and sufficient to support the
1089	   Internet's future congestion control requirements (the first
1090	   challenge in [RFC6077]).  The research question of whether packet
1091	   congestion might become common and what to do if it does may in the
1092	   future be explored in the IRTF (see "Challenge 3: Packet Size" in
1093	   [RFC6077]).

1095	6.  Security Considerations

1097	   This memo recommends that queues do not bias drop probability towards
1098	   small packets as this creates a perverse incentive for transports to
1099	   break down their flows into tiny segments.  One of the benefits of
1100	   implementing AQM was meant to be to remove this perverse incentive
1101	   that drop-tail queues gave to small packets.

1103	   In practice, transports cannot all be trusted to respond to
1104	   congestion.  So another reason for recommending that queues do not
1105	   bias drop probability towards small packets is to avoid the
1106	   vulnerability to small packet DDoS attacks that would otherwise
1107	   result.  One of the benefits of implementing AQM was meant to be to
1108	   remove drop-tail's DoS vulnerability to small packets, so we
1109	   shouldn't add it back again.

1111	   If most queues implemented AQM with byte-mode drop, the resulting
1112	   network would amplify the potency of a small packet DDoS attack.  At
1113	   the first queue the stream of packets would push aside a greater
1114	   proportion of large packets, so more of the small packets would
1115	   survive to attack the next queue.  Thus a flood of small packets
1116	   would continue on towards the destination, pushing regular traffic
1117	   with large packets out of the way in one queue after the next, but
1118	   suffering much less drop itself.
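
   As a rough illustration of the amplification described above, the
   following Python sketch compares the fraction of packets expected to
   survive a chain of queues.  It assumes, purely for illustration, that
   byte-mode drop scales a queue's base drop probability linearly by
   packet size relative to a 1500B maximum, that drops at successive
   queues are independent, and that every queue has the same base
   probability.  The names survival() and MTU_BITS and the example
   values are hypothetical, and the sketch does not model how surviving
   small packets add to congestion at later queues.

```python
MTU_BITS = 12_000   # assumed maximum packet size (1500B), in bits


def survival(size_bits, base_p, n_queues, byte_mode):
    """Expected fraction of packets of one size that survive n_queues.

    byte_mode=True  : drop probability scaled by size_bits / MTU_BITS
                      (an assumed linear byte-mode drop model).
    byte_mode=False : packet-mode drop, independent of packet size.
    """
    drop_p = base_p * size_bits / MTU_BITS if byte_mode else base_p
    return (1.0 - drop_p) ** n_queues


if __name__ == "__main__":
    for mode in (False, True):
        small = survival(480, 0.1, 5, mode)      # 60B attack packets
        large = survival(12_000, 0.1, 5, mode)   # 1500B regular packets
        print(f"byte_mode={mode}: small={small:.3f} large={large:.3f}")
```

   With these example values (base drop probability 10%, five queues),
   byte-mode drop lets roughly 98% of the 60B packets through but only
   about 59% of the 1500B packets, whereas packet-mode drop treats both
   sizes alike.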

1120	   Appendix C explains why the ability of networks to police the
1121	   response of _any_ transport to congestion depends on bit-congestible
1122	   network resources only doing packet-mode not byte-mode drop.  In
1123	   summary, it says that making drop probability depend on the size of
1124	   the packets that bits happen to be divided into simply encourages the
1125	   bits to be divided into smaller packets.  Byte-mode drop would
1126	   therefore irreversibly complicate any attempt to fix the Internet's
1127	   incentive structures.

1129	7.  IANA Considerations

1131	   This document has no actions for IANA.

1133	8.  Conclusions

1135	   This memo identifies the three distinct stages of the congestion
1136	   notification process where implementations need to decide whether to
1137	   take packet size into account.  The recommendations provided in
1138	   Section 2 of this memo are different in each case:

1140	   o  When network equipment measures the length of a queue, whether it
1141	      counts in bytes or packets depends on whether the network resource
1142	      is congested respectively by bytes or by packets.

1144	   o  When network equipment decides whether to drop (or mark) a packet,
1145	      it is recommended that the size of the particular packet should
1146	      not be taken into account.

1148	   o  However, when a transport algorithm responds to a dropped or
1149	      marked packet, the size of the rate reduction should be
1150	      proportionate to the size of the packet.

1152	   In summary, the answers are 'it depends', 'no' and 'yes' respectively.

1154	   For the specific case of RED, this means that byte-mode queue
1155	   measurement will often be appropriate although byte-mode drop is
1156	   strongly deprecated.
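
   For RED, the combination recommended above might be sketched as
   follows.  This is a simplified illustration, not a complete RED
   implementation: it uses only the basic linear ramp between the two
   thresholds described in [RED93], omits queue averaging and the
   spacing of drops, and the function names, thresholds and max_p value
   are assumptions chosen for the example.

```python
import random


def red_drop_probability(avg_queue_bytes, min_th_bytes, max_th_bytes, max_p):
    """Simplified RED ramp applied to a byte-measured average queue."""
    if avg_queue_bytes < min_th_bytes:
        return 0.0
    if avg_queue_bytes >= max_th_bytes:
        return 1.0
    span = max_th_bytes - min_th_bytes
    return max_p * (avg_queue_bytes - min_th_bytes) / span


def should_drop(avg_queue_bytes, pkt_size_bytes, rand=None,
                min_th_bytes=30_000, max_th_bytes=90_000, max_p=0.1):
    """Packet-mode drop decision on a byte-measured queue.

    pkt_size_bytes is accepted but deliberately ignored, so the
    decision does not depend on the size of the arriving packet.
    """
    if rand is None:
        rand = random.random()
    p = red_drop_probability(avg_queue_bytes, min_th_bytes,
                             max_th_bytes, max_p)
    return rand < p
```

   The queue is measured in bytes, which suits a bit-congestible
   resource, while a 60B packet and a 1500B packet arriving at the same
   average queue length face the same drop probability.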

1158	   At the transport layer the IETF should continue updating congestion
1159	   control protocols to take account of the size of each packet that
1160	   indicates congestion.  Also the IETF should continue to make
1161	   protocols less sensitive to losing control packets like SYNs, pure
1162	   ACKs and DNS exchanges.  Although many control packets happen to be
1163	   small, the alternative of network equipment favouring all small
1164	   packets would be dangerous.  That would create perverse incentives to
1165	   split data transfers into smaller packets.

1167	   The memo develops these recommendations from principled arguments
1168	   concerning scaling, layering, incentives, inherent efficiency,
1169	   security and policeability.  But it also addresses practical issues
1170	   such as specific buffer architectures and incremental deployment.
1171	   Indeed a limited survey of RED implementations is discussed, which
1172	   shows there appears to be little, if any, installed base of RED's
1173	   byte-mode drop.  Therefore it can be deprecated with little, if any,
1174	   incremental deployment complications.

1176	   The recommendations have been developed on the well-founded basis
1177	   that most Internet resources are bit-congestible not packet-
1178	   congestible.  We need to know the likelihood that this assumption
1179	   will prevail longer term and, if it might not, what protocol changes
1180	   will be needed to cater for a mix of the two.  The IRTF Internet
1181	   Congestion Control Research Group (ICCRG) is currently working on
1182	   these problems [RFC6077].

1184	9.  Acknowledgements

1186	   Thank you to Sally Floyd, who gave extensive and useful review
1187	   comments.  Also thanks for the reviews from Philip Eardley, David
1188	   Black, Fred Baker, Toby Moncaster, Arnaud Jacquet and Mirja
1189	   Kuehlewind as well as helpful explanations of different hardware
1190	   approaches from Larry Dunn and Fred Baker.  We are grateful to Bruce
1191	   Davie and his colleagues for providing a timely and efficient survey
1192	   of RED implementation in Cisco's product range.  Also grateful thanks
1193	   to Toby Moncaster, Will Dormann, John Regnault, Simon Carter and
1194	   Stefaan De Cnodder who further helped survey the current status of
1195	   RED implementation and deployment and, finally, thanks to the
1196	   anonymous individuals who responded.

1198	   Bob Briscoe and Jukka Manner were partly funded by Trilogy, a
1199	   research project (ICT-216372) supported by the European Community
1200	   under its Seventh Framework Programme.  The views expressed here are
1201	   those of the authors only.

1203	10.  Comments Solicited

1205	   Comments and questions are encouraged and very welcome.  They can be
1206	   addressed to the IETF Transport Area working group mailing list
1207	   <tsvwg@ietf.org>, and/or to the authors.

1209	11.  References

1211	11.1.  Normative References

1213	   [RFC2119]                       Bradner, S., "Key words for use in
1214	                                   RFCs to Indicate Requirement Levels",
1215	                                   BCP 14, RFC 2119, March 1997.

1217	   [RFC3168]                       Ramakrishnan, K., Floyd, S., and D.
1218	                                   Black, "The Addition of Explicit
1219	                                   Congestion Notification (ECN) to IP",
1220	                                   RFC 3168, September 2001.

1222	11.2.  Informative References

1224	   [CCvarPktSize]                  Widmer, J., Boutremans, C., and J-Y.
1225	                                   Le Boudec, "Congestion Control for
1226	                                   Flows with Variable Packet Size", ACM
1227	                                   CCR 34(2) 137--151, 2004, <http://
1228	                                   doi.acm.org/10.1145/997150.997162>.

1230	   [CHOKe_Var_Pkt]                 Psounis, K., Pan, R., and B.
1231	                                   Prabhakar, "Approximate Fair Dropping
1232	                                   for Variable Length Packets", IEEE
1233	                                   Micro 21(1):48--56, January-
1234	                                   February 2001, <http://
1235	                                   www.stanford.edu/~balaji/papers/
1236	                                   01approximatefair.pdf>.

1238	   [DRQ]                           Shin, M., Chong, S., and I. Rhee,
1239	                                   "Dual-Resource TCP/AQM for
1240	                                   Processing-Constrained Networks",
1241	                                   IEEE/ACM Transactions on
1242	                                   Networking Vol 16, issue 2,
1243	                                   April 2008, <http://dx.doi.org/
1244	                                   10.1109/TNET.2007.900415>.

1246	   [DupTCP]                        Wischik, D., "Short messages", Royal
1247	                                   Society workshop on networks:
1248	                                   modelling and control,
1249	                                   September 2007, <http://
1250	                                   www.cs.ucl.ac.uk/staff/ucacdjw/
1251	                                   Research/shortmsg.html>.

1253	   [ECNFixedWireless]              Siris, V., "Resource Control for
1254	                                   Elastic Traffic in CDMA Networks",
1255	                                   Proc. ACM MOBICOM'02,
1256	                                   September 2002, <http://
1257	                                   www.ics.forth.gr/netlab/publications/
1258	                                   resource_control_elastic_cdma.html>.

1260	   [Evol_cc]                       Gibbens, R. and F. Kelly, "Resource
1261	                                   pricing and the evolution of
1262	                                   congestion control",
1263	                                   Automatica 35(12) 1969--1985,
1264	                                   December 1999, <http://
1265	                                   www.statslab.cam.ac.uk/~frank/
1266	                                   evol.html>.

1268	   [I-D.ietf-avtcore-ecn-for-rtp]  Westerlund, M., Johansson, I.,
1269	                                   Perkins, C., O'Hanlon, P., and K.
1270	                                   Carlberg, "Explicit Congestion
1271	                                   Notification (ECN) for RTP over UDP",
1272	                                   draft-ietf-avtcore-ecn-for-rtp-08
1273	                                   (work in progress), May 2012.

1275	   [I-D.ietf-conex-concepts-uses]  Briscoe, B., Woundy, R., and A.
1276	                                   Cooper, "ConEx Concepts and Use
1277	                                   Cases",
1278	                                   draft-ietf-conex-concepts-uses-04
1279	                                   (work in progress), March 2012.

1281	   [IOSArch]                       Bollapragada, V., White, R., and C.
1282	                                   Murphy, "Inside Cisco IOS Software
1283	                                   Architecture", Cisco Press: CCIE
1284	                                   Professional Development ISBN13: 978-
1285	                                   1-57870-181-0, July 2000.

1287	   [PktSizeEquCC]                  Vasallo, P., "Variable Packet Size
1288	                                   Equation-Based Congestion Control",
1289	                                   ICSI Technical Report tr-00-008,
1290	                                   2000, <http://http.icsi.berkeley.edu/
1291	                                   ftp/global/pub/techreports/2000/
1292	                                   tr-00-008.pdf>.

1294	   [RED93]                         Floyd, S. and V. Jacobson, "Random
1295	                                   Early Detection (RED) gateways for
1296	                                   Congestion Avoidance", IEEE/ACM
1297	                                   Transactions on Networking 1(4) 397--
1298	                                   413, August 1993, <http://
1299	                                   www.icir.org/floyd/papers/red/
1300	                                   red.html>.

1302	   [REDbias]                       Eddy, W. and M. Allman, "A Comparison
1303	                                   of RED's Byte and Packet Modes",
1304	                                   Computer Networks 42(3) 261--280,
1305	                                   June 2003, <http://www.ir.bbn.com/
1306	                                   documents/articles/redbias.ps>.

1308	   [REDbyte]                       De Cnodder, S., Elloumi, O., and K.
1309	                                   Pauwels, "RED behavior with different
1310	                                   packet sizes", Proc. 5th IEEE
1311	                                   Symposium on Computers and
1312	                                   Communications (ISCC) 793--799,
1313	                                   July 2000, <http://www.icir.org/
1314	                                   floyd/red/Elloumi99.pdf>.

1316	   [RFC2309]                       Braden, B., Clark, D., Crowcroft, J.,
1317	                                   Davie, B., Deering, S., Estrin, D.,
1318	                                   Floyd, S., Jacobson, V., Minshall,
1319	                                   G., Partridge, C., Peterson, L.,
1320	                                   Ramakrishnan, K., Shenker, S.,
1321	                                   Wroclawski, J., and L. Zhang,
1322	                                   "Recommendations on Queue Management
1323	                                   and Congestion Avoidance in the
1324	                                   Internet", RFC 2309, April 1998.

1326	   [RFC2474]                       Nichols, K., Blake, S., Baker, F.,
1327	                                   and D. Black, "Definition of the
1328	                                   Differentiated Services Field (DS
1329	                                   Field) in the IPv4 and IPv6 Headers",
1330	                                   RFC 2474, December 1998.

1332	   [RFC3426]                       Floyd, S., "General Architectural and
1333	                                   Policy Considerations", RFC 3426,
1334	                                   November 2002.

1336	   [RFC3550]                       Schulzrinne, H., Casner, S.,
1337	                                   Frederick, R., and V. Jacobson, "RTP:
1338	                                   A Transport Protocol for Real-Time
1339	                                   Applications", STD 64, RFC 3550,
1340	                                   July 2003.

1342	   [RFC3714]                       Floyd, S. and J. Kempf, "IAB Concerns
1343	                                   Regarding Congestion Control for
1344	                                   Voice Traffic in the Internet",
1345	                                   RFC 3714, March 2004.

1347	   [RFC4828]                       Floyd, S. and E. Kohler, "TCP
1348	                                   Friendly Rate Control (TFRC): The
1349	                                   Small-Packet (SP) Variant", RFC 4828,
1350	                                   April 2007.

1352	   [RFC5348]                       Floyd, S., Handley, M., Padhye, J.,
1353	                                   and J. Widmer, "TCP Friendly Rate
1354	                                   Control (TFRC): Protocol
1355	                                   Specification", RFC 5348,
1356	                                   September 2008.

1358	   [RFC5562]                       Kuzmanovic, A., Mondal, A., Floyd,
1359	                                   S., and K. Ramakrishnan, "Adding
1360	                                   Explicit Congestion Notification
1361	                                   (ECN) Capability to TCP's SYN/ACK
1362	                                   Packets", RFC 5562, June 2009.

1364	   [RFC5670]                       Eardley, P., "Metering and Marking
1365	                                   Behaviour of PCN-Nodes", RFC 5670,
1366	                                   November 2009.

1368	   [RFC5681]                       Allman, M., Paxson, V., and E.
1369	                                   Blanton, "TCP Congestion Control",
1370	                                   RFC 5681, September 2009.

1372	   [RFC5690]                       Floyd, S., Arcia, A., Ros, D., and J.
1373	                                   Iyengar, "Adding Acknowledgement
1374	                                   Congestion Control to TCP", RFC 5690,
1375	                                   February 2010.

1377	   [RFC6077]                       Papadimitriou, D., Welzl, M., Scharf,
1378	                                   M., and B. Briscoe, "Open Research
1379	                                   Issues in Internet Congestion
1380	                                   Control", RFC 6077, February 2011.

1382	   [Rate_fair_Dis]                 Briscoe, B., "Flow Rate Fairness:
1383	                                   Dismantling a Religion", ACM
1384	                                   CCR 37(2) 63--74, April 2007, <http://
1385	                                   portal.acm.org/
1386	                                   citation.cfm?id=1232926>.

1388	   [gentle_RED]                    Floyd, S., "Recommendation on using
1389	                                   the "gentle_" variant of RED", Web
1390	                                   page, March 2000, <http://
1391	                                   www.icir.org/floyd/red/gentle.html>.

1393	   [pBox]                          Floyd, S. and K. Fall, "Promoting the
1394	                                   Use of End-to-End Congestion Control
1395	                                   in the Internet", IEEE/ACM
1396	                                   Transactions on Networking 7(4) 458--
1397	                                   472, August 1999, <http://
1398	                                   www.aciri.org/floyd/
1399	                                   end2end-paper.html>.

1401	   [pktByteEmail]                  Floyd, S., "RED: Discussions of Byte
1402	                                   and Packet Modes", Web page Red Queue
1403	                                   Management, March 1997, <http://
1404	                                   ee.lbl.gov/floyd/
1405	                                   REDaveraging.txt>.

1407	Appendix A.  Survey of RED Implementation Status

1409	   This Appendix is informative, not normative.

1411	   In May 2007 a survey was conducted of 84 vendors to assess how widely
1412	   drop probability based on packet size has been implemented in RED
1413	   (Table 3).  About 19% of those surveyed replied, giving a sample size
1414	   of 16.  Although in most cases we do not have permission to identify
1415	   the respondents, we can say that those that have responded include
1416	   most of the larger equipment vendors, covering a large fraction of
1417	   the market.  The two who gave permission to be identified were Cisco
1418	   and Alcatel-Lucent.  The others range across the large network
1419	   equipment vendors at L3 & L2, firewall vendors, wireless equipment
1420	   vendors, as well as large software businesses with a small selection
1421	   of networking products.  All those who responded confirmed that they
1422	   have not implemented the variant of RED with drop dependent on packet
1423	   size (2 were fairly sure they had not but needed to check more
1424	   thoroughly).  At the time the survey was conducted, Linux did not
1425	   implement RED with packet-size bias of drop, although we have not
1426	   investigated a wider range of open source code.

1428	   +-------------------------------+----------------+-----------------+
1429	   |                      Response | No. of vendors | %age of vendors |
1430	   +-------------------------------+----------------+-----------------+
1431	   |               Not implemented |             14 |             17% |
1432	   |    Not implemented (probably) |              2 |              2% |
1433	   |                   Implemented |              0 |              0% |
1434	   |                   No response |             68 |             81% |
1435	   | Total companies/orgs surveyed |             84 |            100% |
1436	   +-------------------------------+----------------+-----------------+

1438	    Table 3: Vendor Survey on byte-mode drop variant of RED (lower drop
1439	                      probability for small packets)

1441	   Where reasons have been given, the extra complexity of packet bias
1442	   code has been most prevalent, though one vendor had a more principled
1443	   reason for avoiding it--similar to the argument of this document.

1445	   Our survey was of vendor implementations, so we cannot be certain
1446	   about operator deployment.  But we believe many queues in the
1447	   Internet are still tail-drop.  The company of one of the co-authors
1448	   (BT) has widely deployed RED, but many tail-drop queues are bound to
1449	   still exist, particularly in access network equipment and on
1450	   middleboxes like firewalls, where RED is not always available.

1452	   Routers using a memory architecture based on fixed size buffers with
1453	   borrowing may also still be prevalent in the Internet.  As explained
1454	   in Section 4.2.1, these also provide a marginal (but legitimate) bias
1455	   towards small packets.  So even though RED byte-mode drop is not
1456	   prevalent, it is likely there is still some bias towards small
1457	   packets in the Internet due to tail drop and fixed buffer borrowing.

1459	Appendix B.  Sufficiency of Packet-Mode Drop

1461	   This Appendix is informative, not normative.

1463	   Here we check that packet-mode drop (or marking) in the network gives
1464	   sufficiently generic information for the transport layer to use.  We
1465	   check against a 2x2 matrix of four scenarios that may occur now or in
1466	   the future (Table 4).  The horizontal and vertical dimensions have
1467	   been chosen because each tests extremes of sensitivity to packet size
1468	   in the transport and in the network respectively.

1470	   Note that this section does not consider byte-mode drop at all.
1471	   Having deprecated byte-mode drop, the goal here is to check that
1472	   packet-mode drop will be sufficient in all cases.

1474	   +-------------------------------+-----------------+-----------------+
1475	   |                     Transport |  a) Independent | b) Dependent on |
1476	   |                               |  of packet size |  packet size of |
1477	   | Network                       |  of congestion  |    congestion   |
1478	   |                               |  notifications  |  notifications  |
1479	   +-------------------------------+-----------------+-----------------+
1480	   | 1) Predominantly              |   Scenario a1)  |   Scenario b1)  |
1481	   | bit-congestible network       |                 |                 |
1482	   | 2) Mix of bit-congestible and |   Scenario a2)  |   Scenario b2)  |
1483	   | pkt-congestible network       |                 |                 |
1484	   +-------------------------------+-----------------+-----------------+

1486	                Table 4: Four Possible Congestion Scenarios

1488	   Appendix B.1 focuses on the horizontal dimension of Table 4, checking
1489	   that packet-mode drop (or marking) gives sufficient information,
1490	   whether or not the transport uses it--scenarios b) and a)
1491	   respectively.

1493	   Appendix B.2 focuses on the vertical dimension of Table 4, checking
1494	   that packet-mode drop gives sufficient information to the transport
1495	   whether resources in the network are bit-congestible or packet-
1496	   congestible (these terms are defined in Section 1.1).

1498	   Notation:  To be concrete, we will compare two flows with different
1499	      packet sizes, s_1 and s_2.  As an example, we will take s_1 = 60B
1500	      = 480b and s_2 = 1500B = 12,000b.

1502	      A flow's bit rate, x [bps], is related to its packet rate, u
1503	      [pps], by

1505	         x(t) = s*u(t).

1507	      In the bit-congestible case, path congestion will be denoted by
1508	      p_b, and in the packet-congestible case by p_p.  When either case
1509	      is implied, the letter p alone will denote path congestion.

1511	B.1.  Packet-Size (In)Dependence in Transports

1513	   In all cases we consider a packet-mode drop queue that indicates
1514	   congestion by dropping (or marking) packets with probability p
1515	   irrespective of packet size.  We use an example value of loss
1516	   (marking) probability, p=0.1%.

1518	   A transport like RFC5681 TCP treats a congestion notification on any
1519	   packet whatever its size as one event.  However, a network with just
1520	   the packet-mode drop algorithm does give more information if the
1521	   transport chooses to use it.  We will use Table 5 to illustrate this.

1523	   We will set aside the last column until later.  The columns labelled
1524	   "Flow 1" and "Flow 2" compare two flows consisting of 60B and 1500B
1525	   packets respectively.  The body of the table considers two separate
1526	   cases, one where the flows have equal bit-rate and the other with
1527	   equal packet-rates.  In both cases, the two flows fill a 96Mbps link.
1528	   Therefore, in the equal bit-rate case they each have half the bit-
1529	   rate (48Mbps).  Whereas, with equal packet-rates, flow 1 uses 25
1530	   times smaller packets so it gets 25 times less bit-rate--it only gets
1531	   1/(1+25) of the link capacity (96Mbps/26 = 4Mbps after rounding).  In
1532	   contrast flow 2 gets 25 times more bit-rate (92Mbps) in the equal
1533	   packet rate case because its packets are 25 times larger.  The packet
1534	   rate shown for each flow could easily be derived once the bit-rate
1535	   was known by dividing bit-rate by packet size, as shown in the column
1536	   labelled "Formula".

1538	       Parameter               Formula      Flow 1  Flow 2 Combined
1539	       ----------------------- ----------- ------- ------- --------
1540	       Packet size             s/8             60B  1,500B    (Mix)
1541	       Packet size             s              480b 12,000b    (Mix)
1542	       Pkt loss probability    p              0.1%    0.1%     0.1%

1544	       EQUAL BIT-RATE CASE
1545	       Bit-rate                x            48Mbps  48Mbps   96Mbps
1546	       Packet-rate             u = x/s     100kpps   4kpps  104kpps
1547	       Absolute pkt-loss-rate  p*u          100pps    4pps   104pps
1548	       Absolute bit-loss-rate  p*u*s        48kbps  48kbps   96kbps
1549	       Ratio of lost/sent pkts p*u/u          0.1%    0.1%     0.1%
1550	       Ratio of lost/sent bits p*u*s/(u*s)    0.1%    0.1%     0.1%

1552	       EQUAL PACKET-RATE CASE
1553	       Bit-rate                x             4Mbps  92Mbps   96Mbps
1554	       Packet-rate             u = x/s       8kpps   8kpps   15kpps
1555	       Absolute pkt-loss-rate  p*u            8pps    8pps    15pps
1556	       Absolute bit-loss-rate  p*u*s         4kbps  92kbps   96kbps
1557	       Ratio of lost/sent pkts p*u/u          0.1%    0.1%     0.1%
1558	       Ratio of lost/sent bits p*u*s/(u*s)    0.1%    0.1%     0.1%

1560	    Table 5: Absolute Loss Rates and Loss Ratios for Flows of Small and
1561	                      Large Packets and Both Combined
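
   The figures in Table 5 follow from x = s*u and the fixed loss
   probability p.  The Python sketch below recomputes them as a check;
   the function and variable names are illustrative only, and it
   assumes, as in the text, that the two flows exactly fill the 96Mbps
   link.

```python
LINK_BPS = 96e6          # link capacity shared by the two flows
P = 0.001                # packet-mode drop (or mark) probability, 0.1%
S1, S2 = 480, 12_000     # packet sizes in bits (60B and 1500B)


def flow_stats(x_bps, s_bits, p=P):
    """Return the Table 5 rows for one flow of bit-rate x and size s."""
    u = x_bps / s_bits                       # packet-rate, u = x/s [pps]
    return {
        "bit_rate": x_bps,                   # [bps]
        "pkt_rate": u,                       # [pps]
        "pkt_loss_rate": p * u,              # [pps]
        "bit_loss_rate": p * u * s_bits,     # [bps]
    }


def equal_bit_rate():
    """Both flows get half the link bit-rate."""
    x = LINK_BPS / 2
    return flow_stats(x, S1), flow_stats(x, S2)


def equal_packet_rate():
    """Both flows send the same number of packets per second."""
    u = LINK_BPS / (S1 + S2)
    return flow_stats(u * S1, S1), flow_stats(u * S2, S2)


if __name__ == "__main__":
    for name, case in (("equal bit-rate", equal_bit_rate),
                       ("equal packet-rate", equal_packet_rate)):
        f1, f2 = case()
        print(name)
        for key in f1:
            print(f"  {key:14s} {f1[key]:14.1f} {f2[key]:14.1f} "
                  f"{f1[key] + f2[key]:14.1f}")
```

   The sketch prints unrounded values, so the equal packet-rate case
   shows about 7.7kpps per flow where the table rounds to 8kpps.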

1563	   So far we have merely set up the scenarios.  We now consider
1564	   congestion notification in the scenario.  Two TCP flows with the same
1565	   round trip time aim to equalise their packet-loss-rates over time.
1566	   That is the number of packets lost in a second, which is the packets
1567	   per second (u) multiplied by the probability that each one is dropped
1568	   (p).  Thus TCP converges on the "Equal packet-rate" case, where both
1569	   flows aim for the same "Absolute packet-loss-rate" (both 8pps in the
1570	   table).

1572	   Packet-mode drop actually gives flows sufficient information to
1573	   measure their loss-rate in bits per second, if they choose, not just
1574	   packets per second.  Each flow can count the size of a lost or marked
1575	   packet and scale its rate-response in proportion (as TFRC-SP does).
1576	   The result is shown in the row entitled "Absolute bit-loss-rate",
1577	   where the bits lost in a second is the packets per second (u)
1578	   multiplied by the probability of losing a packet (p) multiplied by
1579	   the packet size (s).  Such an algorithm would try to remove any
1580	   imbalance in bit-loss-rate such as the wide disparity in the "Equal
1581	   packet-rate" case (4kbps vs. 92kbps).  Instead, a packet-size-
1582	   dependent algorithm would aim for equal bit-loss-rates, which would
1583	   drive both flows towards the "Equal bit-rate" case, by driving them
1584	   to equal bit-loss-rates (both 48kbps in this example).

1586	   The explanation so far has assumed that each flow consists of packets
1587	   of only one constant size.  Nonetheless, it extends naturally to
1588	   flows with mixed packet sizes.  In the right-most column of Table 5 a
1589	   flow of mixed size packets is created simply by considering flow 1
1590	   and flow 2 as a single aggregated flow.  There is no need for a flow
1591	   to maintain an average packet size.  It is only necessary for the
1592	   transport to scale its response to each congestion indication by the
1593	   size of each individual lost (or marked) packet.  Taking for example
1594	   the "Equal packet-rate" case, in one second about 8 small packets and
1595	   8 large packets are lost (making closer to 15 than 16 losses per
1596	   second due to rounding).  If the transport multiplies each loss by
1597	   its size, in one second it responds to about 8*480b and 8*12,000b
1598	   lost bits, or about 96,000 lost bits when the more precise figure of
1599	   7.69 losses per second of each size is used.  This matches 0.1% of
1600	   the total bit-rate of 96Mbps.  For completeness, the formula for
1601	   absolute bit-loss-rate is p*(u_1*s_1 + u_2*s_2).
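
   The point that no average packet size is needed can be illustrated
   with a short Python sketch.  It is only a sketch under stated
   assumptions: the names lost_bits() and WINDOW_S are hypothetical, and
   a 13 second window is chosen because the equal packet-rate case then
   sends exactly 100,000 packets of each size, so p = 0.1% corresponds
   to 100 losses of each size.

```python
def lost_bits(loss_sizes_bits):
    """Sum the sizes of the individual lost (or marked) packets.

    The transport only needs the size of each packet that signals
    congestion; it keeps no running average packet size.
    """
    total = 0
    for size in loss_sizes_bits:   # one congestion indication per packet
        total += size
    return total


WINDOW_S = 13                                  # assumed observation window
losses = [480] * 100 + [12_000] * 100          # sizes of lost packets, bits
bit_loss_rate = lost_bits(losses) / WINDOW_S   # bits lost per second

if __name__ == "__main__":
    print(f"absolute bit-loss-rate: {bit_loss_rate:.0f} bps")
```

   Over this window the per-packet sum gives the same 96kbps absolute
   bit-loss-rate as 0.1% of the 96Mbps aggregate.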

1603	   Incidentally, a transport will always measure the loss probability
1604	   the same irrespective of whether it measures in packets or in bytes.
1605	   In other words, the ratio of lost to sent packets will be the same as
1606	   the ratio of lost to sent bytes.  (This is why TCP's bit rate is
1607	   still proportional to packet size even when byte-counting is used, as
1608	   recommended for TCP in [RFC5681], mainly for orthogonal security
1609	   reasons.)  This is intuitively obvious by comparing two example
1610	   flows; one with 60B packets, the other with 1500B packets.  If both
1611	   flows pass through a queue with drop probability 0.1%, each flow will
1612	   lose 1 in 1,000 packets.  In the stream of 60B packets the ratio of
1613	   bytes lost to sent will be 60B in every 60,000B; and in the stream of
1614	   1500B packets, the loss ratio will be 1,500B out of 1,500,000B.  When
1615	   the transport responds to the ratio of lost to sent packets, it will
1616	   measure the same ratio whether it measures in packets or bytes: 0.1%
1617	   in both cases.  The fact that this ratio is the same whether measured
1618	   in packets or bytes can be seen in Table 5, where the ratio of lost
1619	   to sent packets and the ratio of lost to sent bytes are both 0.1% in
1620	   all cases (recall that the scenario was set up with p=0.1%).
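
As a rough check of the claim that the ratio is the same in packets
and in bytes, the sketch below computes expected loss ratios for the
60B and 1500B example flows and for their aggregate.  It assumes
every packet is dropped with the same probability whatever its size.
The packet counts and all names are illustrative.

```python
from fractions import Fraction

P = Fraction(1, 1000)  # drop probability 0.1%, applied per packet


def expected_loss_ratios(flows, p=P):
    """Return (packet ratio, byte ratio) of expected losses to sent.

    flows is a list of (packet_size_bytes, packets_sent) pairs.
    Assumes each packet is dropped with probability p, whatever its size.
    """
    pkts_sent = sum(n for _, n in flows)
    bytes_sent = sum(s * n for s, n in flows)
    pkts_lost = sum(p * n for _, n in flows)
    bytes_lost = sum(p * s * n for s, n in flows)
    return pkts_lost / pkts_sent, bytes_lost / bytes_sent


small = expected_loss_ratios([(60, 1000)])                # 60B packets only
large = expected_loss_ratios([(1500, 1000)])              # 1500B packets only
mixed = expected_loss_ratios([(60, 1000), (1500, 1000)])  # aggregated flow
```

In each case both ratios come out as p, because the packet size
cancels between the lost and the sent totals.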

1622	   This discussion of how the ratio can be measured in packets or bytes
1623	   is only raised here to highlight that it is irrelevant to this memo!
1624	   Whether a transport depends on packet size or not depends on how this
1625	   ratio is used within the congestion control algorithm.

1627	   So far we have shown that packet-mode drop passes sufficient
1628	   information to the transport layer so that the transport can take
1629	   account of bit-congestion, by using the sizes of the packets that
1630	   indicate congestion.  We have also shown that the transport can
1631	   choose not to take packet size into account if it wishes.  We will
1632	   now consider whether the transport can know which to do.

1634	B.2.  Bit-Congestible and Packet-Congestible Indications

1636	   As a thought-experiment, imagine an idealised congestion notification
1637	   protocol that supports both bit-congestible and packet-congestible
1638	   resources.  It would require at least two ECN flags, one for each of
1639	   bit-congestible and packet-congestible resources.

1641	   1.  A packet-congestible resource trying to code congestion level p_p
1642	       into a packet stream should mark the idealised `packet
1643	       congestion' field in each packet with probability p_p
1644	       irrespective of the packet's size.  The transport should then
1645	       take a packet with the packet congestion field marked to mean
1646	       just one mark, irrespective of the packet size.

1648	   2.  A bit-congestible resource trying to code time-varying byte-
1649	       congestion level p_b into a packet stream should mark the `byte
1650	       congestion' field in each packet with probability p_b, again
1651	       irrespective of the packet's size.  Unlike before, the transport
1652	       should take a packet with the byte congestion field marked to
1653	       count as a mark on each byte in the packet.
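
How a transport might interpret the two idealised fields in this
thought-experiment can be sketched as follows.  The Packet type, its
field names, and the sample stream are hypothetical; no such flags
exist in any deployed protocol.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    size_bytes: int
    packet_congestion: bool = False  # hypothetical idealised flag
    byte_congestion: bool = False    # hypothetical idealised flag


def congestion_counts(packets):
    """Return (packet marks, marked bytes) as the transport counts them.

    A packet-congestion mark counts once, whatever the packet size.
    A byte-congestion mark counts once for every byte in the packet.
    """
    packet_marks = sum(1 for pkt in packets if pkt.packet_congestion)
    marked_bytes = sum(pkt.size_bytes for pkt in packets if pkt.byte_congestion)
    return packet_marks, marked_bytes


stream = [
    Packet(60, packet_congestion=True),
    Packet(1500, packet_congestion=True),
    Packet(60, byte_congestion=True),
    Packet(1500, byte_congestion=True),
    Packet(1500),
]
counts = congestion_counts(stream)  # 2 packet marks, 60 + 1500 marked bytes
```

The thought-experiment itself, with its two separate fields, is
considered next.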

1655	   This hides a fundamental problem--much more fundamental than whether
1656	   we can magically create header space for yet another ECN flag, or
1657	   whether it would work while being deployed incrementally.
1658	   Distinguishing drop from delivery naturally provides just one
1659	   implicit bit of congestion indication information--the packet is
1660	   either dropped or not.  It is hard to drop a packet in two ways that
1661	   are distinguishable remotely.  This is a similar problem to that of
1662	   distinguishing wireless transmission losses from congestive losses.

1664	   This problem would not be solved even if ECN were universally
1665	   deployed.  A congestion notification protocol must survive a
1666	   transition from low levels of congestion to high.  Marking two states
1667	   is feasible with explicit marking, but much harder if packets are
1668	   dropped.  Also, it will not always be cost-effective to implement AQM
1669	   at every low-level resource, so drop will often have to suffice.

1671	   We are not saying two ECN fields will be needed (and we are not
1672	   saying that somehow a resource should be able to drop a packet in one
1673	   of two different ways so that the transport can distinguish which
1674	   sort of drop it was!).  These two congestion notification channels
1675	   are a conceptual device to illustrate a dilemma we could face in the
1676	   future.  Section 3 gives four good reasons why it would be a bad idea
1677	   to allow for packet size by biasing drop probability in favour of
1678	   small packets within the network.  The impracticality of our thought
1679	   experiment shows that it will be hard to give transports a practical
1680	   way to know whether to take account of the size of congestion
1681	   indication packets or not.

1683	   Fortunately, this dilemma is not pressing because by design most
1684	   equipment becomes bit-congested before its packet-processing becomes
1685	   congested (as already outlined in Section 1.1).  Therefore transports
1686	   can be designed on the relatively sound assumption that a congestion
1687	   indication will usually imply bit-congestion.

1689	   Nonetheless, although the above idealised protocol isn't intended for
1690	   implementation, we do want to emphasise that research is needed to
1691	   predict whether there are good reasons to believe that packet
1692	   congestion might become more common, and if so, to find a way to
1693	   somehow distinguish between bit and packet congestion [RFC3714].

1695	   Recently, the dual resource queue (DRQ) proposal [DRQ] has been made
1696	   on the premise that, as network processors become more
1697	   cost-effective, per-packet operations will become more complex
1698	   (irrespective of whether more function in the network is desirable).
1699	   Consequently the premise is that CPU congestion will become more
1700	   common.  DRQ is a proposed modification to the RED algorithm that
1701	   folds both bit congestion and packet congestion into one signal
1702	   (either loss or ECN).

1704	   Finally, we note one further complication.  Strictly, packet-
1705	   congestible resources are often cycle-congestible.  For instance, for
1706	   routing look-ups, load depends on the complexity of each look-up and
1707	   whether the pattern of arrivals is amenable to caching or not.  This
1708	   also reminds us that any solution must not require a forwarding
1709	   engine to use excessive processor cycles in order to decide how to
1710	   say it has no spare processor cycles.

1712	Appendix C.  Byte-mode Drop Complicates Policing Congestion Response

1714	   This section is informative, not normative.

1716	   There are two main classes of approach to policing congestion
1717	   response: i) policing at each bottleneck link or ii) policing at the
1718	   edges of networks.  Packet-mode drop in RED is compatible with
1719	   either, while byte-mode drop precludes edge policing.

1721	   The simplicity of an edge policer relies on one dropped or marked
1722	   packet being equivalent to another of the same size without having to
1723	   know which link the drop or mark occurred at.  However, the byte-mode
1724	   drop algorithm has to depend on the local MTU of the line--it needs
1725	   to use some concept of a 'normal' packet size.  Therefore, one
1726	   dropped or marked packet from a byte-mode drop algorithm is not
1727	   necessarily equivalent to another from a different link.  A policing
1728	   function local to the link can know the local MTU where the
1729	   congestion occurred.  However, a policer at the edge of the network
1730	   cannot, at least not without a lot of complexity.
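
The dependence on the local MTU can be illustrated with a minimal
sketch.  It assumes byte-mode drop scales the drop probability by
packet size relative to a 'normal' size, taken here to be the link
MTU.  The MTU values of 1500B and 9000B and all names are
illustrative assumptions.

```python
def byte_mode_drop_prob(p, pkt_size_bytes, normal_size_bytes):
    """Scale drop probability p by packet size relative to a 'normal' size.

    Assumption: the 'normal' size is the local MTU of the link.
    """
    return min(1.0, p * pkt_size_bytes / normal_size_bytes)


def packet_mode_drop_prob(p, pkt_size_bytes):
    """Packet-mode drop ignores the packet size."""
    return p


p = 0.01
# The same 600B packet, at the same congestion level p, on two links.
on_1500_mtu = byte_mode_drop_prob(p, 600, 1500)
on_9000_mtu = byte_mode_drop_prob(p, 600, 9000)
either_link = packet_mode_drop_prob(p, 600)
```

With byte-mode drop the two links give the same packet different
drop probabilities, so an edge policer cannot interpret a drop
without knowing which link produced it.  With packet-mode drop the
probability does not depend on the link's MTU.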

1732	   The early research proposals for type (i) policing at a bottleneck
1733	   link [pBox] used byte-mode drop, then detected flows that contributed
1734	   disproportionately to the number of packets dropped.  However, with
1735	   no extra complexity, later proposals used packet-mode drop and looked
1736	   for flows that contributed a disproportionate amount of dropped bytes
1737	   [CHOKe_Var_Pkt].

1739	   Work is progressing on the congestion exposure protocol (ConEx
1740	   [I-D.ietf-conex-concepts-uses]), which enables a type (ii) edge
1741	   policer located at a user's attachment point.  The idea is to be able
1742	   to take an integrated view of the effect of all a user's traffic on
1743	   any link in the internetwork.  However, byte-mode drop would
1744	   effectively preclude such edge policing because of the MTU issue
1745	   above.

1747	   Indeed, making drop probability depend on the size of the packets
1748	   that bits happen to be divided into would simply encourage the bits
1749	   to be divided into smaller packets in order to confuse policing.  In
1750	   contrast, as long as a dropped/marked packet is taken to mean that
1751	   all the bytes in the packet are dropped/marked, a policer can remain
1752	   robust against bits being re-divided into different size packets or
1753	   across different size flows [Rate_fair_Dis].
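
The robustness argument can be illustrated with expected values.
The sketch assumes packet-mode drop with a size-independent
probability p, and a fixed volume of traffic divided into either
1500B or 60B packets.  It is a simplified illustration, not the
policer of [Rate_fair_Dis], and all names and figures are
illustrative.

```python
from fractions import Fraction


def expected_drops(total_bytes, pkt_size_bytes, p):
    """Expected (dropped packets, dropped bytes) under packet-mode drop.

    Every packet is dropped with probability p, whatever its size.
    """
    pkts_sent = Fraction(total_bytes, pkt_size_bytes)
    dropped_pkts = p * pkts_sent
    dropped_bytes = dropped_pkts * pkt_size_bytes
    return dropped_pkts, dropped_bytes


P = Fraction(1, 1000)

# The same 1,500,000 bytes, divided into large or small packets.
large = expected_drops(1_500_000, 1500, P)
small = expected_drops(1_500_000, 60, P)
```

The expected number of dropped bytes is the same however the bits
are divided, whereas the count of dropped packets changes by the
ratio of the packet sizes.  A policer that counts bytes is therefore
unaffected by re-packetisation.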

1755	Appendix D.  Changes from Previous Versions

1757	   To be removed by the RFC Editor on publication.

1759	   Full incremental diffs between each version are available at
1760	   <http://tools.ietf.org/wg/tsvwg/draft-ietf-tsvwg-byte-pkt-congest/>
1761	   (courtesy of the rfcdiff tool):

1763	   From -06 to -07:

1765	      *  A mix-up with the corollaries and their naming in 2.1 to 2.3
1766	         fixed.

1768	   From -05 to -06:

1770	      *  Primarily editorial fixes.

1772	   From -04 to -05:

1774	      *  Changed from Informational to BCP and highlighted non-normative
1775	         sections and appendices

1777	      *  Removed language about consensus

1779	      *  Added "Example Comparing Packet-Mode Drop and Byte-Mode Drop"

1781	      *  Arranged "Motivating Arguments" into a more logical order and
1782	         completely rewrote "Transport-Independent Network" & "Scaling
1783	         Congestion Control with Packet Size" arguments.  Removed "Why
1784	         Now?"

1786	      *  Clarified applicability of certain recommendations

1788	      *  Shifted vendor survey to an Appendix

1790	      *  Cut down "Outstanding Issues and Next Steps"

1792	      *  Re-drafted the start of the conclusions to highlight the three
1793	         distinct areas of concern

1795	      *  Completely re-wrote appendices

1797	      *  Editorial corrections throughout.

1799	   From -03 to -04:

1801	      *  Reordered Sections 2 and 3, and some clarifications here and
1802	         there based on feedback from Colin Perkins and Mirja
1803	         Kuehlewind.

1805	   From -02 to -03:

1807	      *  Structural changes:

1809	         +  Split off text at end of "Scaling Congestion Control with
1810	            Packet Size" into new section "Transport-Independent
1811	            Network"

1813	         +  Shifted "Recommendations" straight after "Motivating
1814	            Arguments" and added "Conclusions" at end to reinforce
1815	            Recommendations

1817	         +  Added more internal structure to Recommendations, so that
1818	            recommendations specific to RED or to TCP are just
1819	            corollaries of a more general recommendation, rather than
1820	            being listed as a separate recommendation.

1822	         +  Renamed "State of the Art" as "Critical Survey of Existing
1823	            Advice" and retitled a number of subsections with more
1824	            descriptive titles.

1826	         +  Split end of "Congestion Coding: Summary of Status" into a
1827	            new subsection called "RED Implementation Status".

1829	         +  Removed text that had been in the Appendix "Congestion
1830	            Notification Definition: Further Justification".

1832	      *  Reordered the intro text a little.

1834	      *  Made it clearer when advice being reported is deprecated and
1835	         when it is not.

1837	      *  Described AQM as in network equipment, rather than saying "at
1838	         the network layer" (to side-step controversy over whether
1839	         functions like AQM are in the transport layer but in network
1840	         equipment).

1842	      *  Minor improvements to clarity throughout

1844	   From -01 to -02:

1846	      *  Restructured the whole document for (hopefully) easier reading
1847	         and clarity.  The concrete recommendation, in RFC2119 language,
1848	         is now in Section 8.

1850	   From -00 to -01:

1852	      *  Minor clarifications throughout and updated references

1854	   From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00:

1856	      *  Added note on relationship to existing RFCs
1857	      *  Posed the question of whether packet-congestion could become
1858	         common and deferred it to the IRTF ICCRG.  Added ref to the
1859	         dual-resource queue (DRQ) proposal.

1861	      *  Changed PCN references from the PCN charter & architecture to
1862	         the PCN marking behaviour draft most likely to imminently
1863	         become the standards track WG item.

1865	   From briscoe-byte-pkt-mark-01 to -02:

1867	      *  Abstract reorganised to align with clearer separation of issues
1868	         in the memo.

1870	      *  Introduction reorganised with motivating arguments removed to
1871	         new Section 3.

1873	      *  Clarified that avoiding lock-out of large packets is not the
1874	         main or only motivation for RED.

1876	      *  Mentioned choice of drop or marking explicitly throughout,
1877	         rather than trying to coin a word to mean either.

1879	      *  Generalised the discussion throughout to any packet forwarding
1880	         function on any network equipment, not just routers.

1882	      *  Clarified the last point about why this is a good time to sort
1883	         out this issue: because it will be hard / impossible to design
1884	         new transports unless we decide whether the network or the
1885	         transport is allowing for packet size.

1887	      *  Added statement explaining the horizon of the memo is long
1888	         term, but with short term expediency in mind.

1890	      *  Added material on scaling congestion control with packet size
1891	         (Section 3.4).

1893	      *  Separated out issue of normalising TCP's bit rate from issue of
1894	         preference to control packets (Section 3.2).

1896	      *  Divided up Congestion Measurement section for clarity,
1897	         including new material on fixed size packet buffers and buffer
1898	         carving (Section 4.1.1 & Section 4.2.1) and on congestion
1899	         measurement in wireless link technologies without queues
1900	         (Section 4.1.2).

1902	      *  Added section on 'Making Transports Robust against Control
1903	         Packet Losses' (Section 4.2.3) with existing & new material
1904	         included.

1906	      *  Added tabulated results of vendor survey on byte-mode drop
1907	         variant of RED (Table 3).

1909	   From briscoe-byte-pkt-mark-00 to -01:

1911	      *  Clarified applicability to drop as well as ECN.

1913	      *  Highlighted DoS vulnerability.

1915	      *  Emphasised that drop-tail suffers from similar problems to
1916	         byte-mode drop, so only byte-mode drop should be turned off,
1917	         not RED itself.

1919	      *  Clarified that the original apparent motivations for
1920	         recommending byte-mode drop included protecting SYNs and pure
1921	         ACKs more than equalising the bit rates of TCPs with different
1922	         segment sizes.  Removed some conjectured motivations.

1924	      *  Added support for updates to TCP in progress (ackcc & ecn-syn-
1925	         ack).

1927	      *  Updated survey results with newly arrived data.

1929	      *  Pulled all recommendations together into the conclusions.

1931	      *  Moved some detailed points into two additional appendices and a
1932	         note.

1934	      *  Considerable clarifications throughout.

1936	      *  Updated references

1938	Authors' Addresses

1940	   Bob Briscoe
1941	   BT
1942	   B54/77, Adastral Park
1943	   Martlesham Heath
1944	   Ipswich  IP5 3RE
1945	   UK

1947	   Phone: +44 1473 645196
1948	   EMail: bob.briscoe@bt.com
1949	   URI:   http://bobbriscoe.net/
1950	   Jukka Manner
1951	   Aalto University
1952	   Department of Communications and Networking (Comnet)
1953	   P.O. Box 13000
1954	   FIN-00076 Aalto
1955	   Finland

1957	   Phone: +358 9 470 22481
1958	   EMail: jukka.manner@aalto.fi
1959	   URI:   http://www.netlab.tkk.fi/~jmanner/