Transport Area Working Group                                 B. Briscoe
Internet-Draft                                                        BT
Updates: 2309 (if approved)                                    J. Manner
Intended status: BCP                                    Aalto University
Expires: May 11, 2013                                   November 7, 2012

                Byte and Packet Congestion Notification
                  draft-ietf-tsvwg-byte-pkt-congest-09

Abstract

   This document provides recommendations of best current practice for dropping or marking packets using active queue management (AQM) such as random early detection (RED) or pre-congestion notification (PCN). We give three strong recommendations: (1) packet size should be taken into account when transports read and respond to congestion indications, (2) packet size should not be taken into account when network equipment creates congestion signals (marking, dropping), and therefore (3) the byte-mode packet drop variant of the RED AQM algorithm that drops fewer small packets should not be used. This memo updates RFC 2309 to deprecate deliberate preferential treatment of small packets in AQM algorithms.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF).
   Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 11, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology and Scoping
     1.2.  Example Comparing Packet-Mode Drop and Byte-Mode Drop
   2.  Recommendations
     2.1.  Recommendation on Queue Measurement
     2.2.  Recommendation on Encoding Congestion Notification
     2.3.  Recommendation on Responding to Congestion
     2.4.  Recommendation on Handling Congestion Indications when Splitting or Merging Packets
   3.  Motivating Arguments
     3.1.  Avoiding Perverse Incentives to (Ab)use Smaller Packets
     3.2.  Small != Control
     3.3.  Transport-Independent Network
     3.4.  Partial Deployment of AQM
     3.5.  Implementation Efficiency
   4.  A Survey and Critique of Past Advice
     4.1.  Congestion Measurement Advice
       4.1.1.  Fixed Size Packet Buffers
       4.1.2.  Congestion Measurement without a Queue
     4.2.  Congestion Notification Advice
       4.2.1.  Network Bias when Encoding
       4.2.2.  Transport Bias when Decoding
       4.2.3.  Making Transports Robust against Control Packet Losses
       4.2.4.  Congestion Notification: Summary of Conflicting Advice
   5.  Outstanding Issues and Next Steps
     5.1.  Bit-congestible Network
     5.2.  Bit- & Packet-congestible Network
   6.  Security Considerations
   7.  IANA Considerations
   8.  Conclusions
   9.  Acknowledgements
   10. Comments Solicited
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Appendix A.  Survey of RED Implementation Status
   Appendix B.  Sufficiency of Packet-Mode Drop
     B.1.  Packet-Size (In)Dependence in Transports
     B.2.  Bit-Congestible and Packet-Congestible Indications
   Appendix C.  Byte-mode Drop Complicates Policing Congestion Response
   Appendix D.  Changes from Previous Versions

1. Introduction

   This memo concerns how we should correctly scale congestion control functions with packet size for the long term. It also recognises that expediency may be necessary to deal with existing widely deployed protocols that don't live up to the long-term goal.

   When notifying congestion, the problem of how (and whether) to take packet sizes into account has exercised the minds of researchers and practitioners for as long as active queue management (AQM) has been discussed. Indeed, one reason AQM was originally introduced was to reduce the lock-out effects that small packets can have on large packets in drop-tail queues. This memo aims to state the principles we should be using and to outline how these principles will affect future protocol design, taking into account the deployments that already exist.

   The question of whether to take into account packet size arises at three stages in the congestion notification process:

   Measuring congestion:  When a congested resource measures locally how congested it is, should it measure its queue length in bytes or packets?

   Encoding congestion notification into the wire protocol:  When a congested network resource notifies its level of congestion, should it drop / mark each packet dependent on the byte-size of the particular packet in question?

   Decoding congestion notification from the wire protocol:  When a transport interprets the notification in order to decide how much to respond to congestion, should it take into account the byte-size of each missing or marked packet?

   Consensus has emerged over the years concerning the first stage: whether queues are measured in bytes or packets, termed byte-mode queue measurement or packet-mode queue measurement. Section 2.1 of this memo records this consensus in the RFC Series. In summary, the choice solely depends on whether the resource is congested by bytes or by packets.

   The controversy is mainly around the last two stages: whether to allow for the size of the specific packet notifying congestion i) when the network encodes or ii) when the transport decodes the congestion notification.

   Currently, the RFC series is silent on this matter other than a paper trail of advice referenced from [RFC2309], which conditionally recommends byte-mode (packet-size dependent) drop [pktByteEmail]. Reducing drop of small packets certainly has some tempting advantages: i) it drops fewer control packets, which tend to be small, and ii) it makes TCP's bit-rate less dependent on packet size. However, there are ways of addressing these issues at the transport layer, rather than reverse engineering network forwarding to fix the problems.

   This memo updates [RFC2309] to deprecate deliberate preferential treatment of small packets in AQM algorithms. It recommends that (1) packet size should be taken into account when transports read congestion indications, and (2) not when network equipment writes them.

   In particular, this means that the byte-mode packet drop variant of Random Early Detection (RED) should not be used to drop fewer small packets, because that creates a perverse incentive for transports to use tiny segments, consequently also opening up a DoS vulnerability. Fortunately, none of the RED implementers who responded to our admittedly limited survey (Section 4.2.4) had followed the earlier advice to use byte-mode drop, so the position this memo argues for seems to exist in implementations already.

   However, at the transport layer, TCP congestion control is a widely deployed protocol that doesn't scale with packet size. To date this hasn't been a significant problem because most TCP implementations have been used with similar packet sizes. But, as we design new congestion control mechanisms, the current recommendation is that we should build in scaling with packet size rather than assuming we should follow TCP's example.

   This memo continues as follows. First it discusses terminology and scoping. Section 2 gives the concrete formal recommendations, followed by motivating arguments in Section 3. We then critically survey the advice given previously in the RFC series and the research literature (Section 4), referring to an assessment of whether or not this advice has been followed in production networks (Appendix A). To wrap up, outstanding issues are discussed that will need resolution both to inform future protocol designs and to handle legacy (Section 5). Then security issues are collected together in Section 6 before conclusions are drawn in Section 8. The interested reader can find discussion of more detailed issues on the theme of byte vs. packet in the appendices.

   This memo intentionally includes a non-negligible amount of material on the subject. For the busy reader, Section 2 summarises the recommendations for the Internet community.

1.1. Terminology and Scoping

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

   Congestion Notification:  Congestion notification is a changing signal that aims to communicate the probability that the network resource(s) will not be able to forward the level of traffic load offered (or that there is an impending risk that they will not be able to).

      The `impending risk' qualifier is added because AQM systems (e.g. RED, PCN [RFC5670]) set a virtual limit smaller than the actual limit to the resource, then notify when this virtual limit is exceeded in order to avoid uncontrolled congestion of the actual capacity.

      Congestion notification communicates a real number bounded by the range [0, 1]. This ties in with the most well-understood measure of congestion notification: drop probability.

   Explicit and Implicit Notification:  The byte vs. packet dilemma concerns congestion notification irrespective of whether it is signalled implicitly by drop or using explicit congestion notification (ECN [RFC3168] or PCN [RFC5670]).
      Throughout this document, unless clear from the context, the term marking will be used to mean notifying congestion explicitly, while congestion notification will be used to mean notifying congestion either implicitly by drop or explicitly by marking.

   Bit-congestible vs. Packet-congestible:  If the load on a resource depends on the rate at which packets arrive, it is called packet-congestible. If the load depends on the rate at which bits arrive, it is called bit-congestible.

      Examples of packet-congestible resources are route look-up engines and firewalls, because load depends on how many packet headers they have to process. Examples of bit-congestible resources are transmission links, radio power and most buffer memory, because the load depends on how many bits they have to transmit or store. Some machine architectures use fixed size packet buffers, so buffer memory in these cases is packet-congestible (see Section 4.1.1).

      Currently, a design goal of network processing equipment such as routers and firewalls is to keep packet processing uncongested even under worst-case packet rates with runs of minimum size packets. Therefore, packet congestion is currently rare [RFC6077; S.3.3], but there is no guarantee that it will not become more common in future.

      Note that information is generally processed or transmitted with a minimum granularity greater than a bit (e.g. octets). The appropriate granularity for the resource in question should be used, but for the sake of brevity we will talk in terms of bytes in this memo.

   Coarser Granularity:  Resources may be congestible at higher levels of granularity than bits or packets; for instance, stateful firewalls are flow-congestible and call-servers are session-congestible. This memo focuses on congestion of connectionless resources, but the same principles may be applicable for congestion notification protocols controlling per-flow and per-session processing or state.

   RED Terminology:  In RED, whether to use packets or bytes when measuring queues is called respectively "packet-mode queue measurement" or "byte-mode queue measurement". And whether the probability of dropping a particular packet is independent of or dependent on its byte-size is called respectively "packet-mode drop" or "byte-mode drop". The terms byte-mode and packet-mode should not be used without specifying whether they apply to queue measurement or to drop.

1.2. Example Comparing Packet-Mode Drop and Byte-Mode Drop

   A central question addressed by this document is whether to recommend that AQM uses RED's packet-mode drop and to deprecate byte-mode drop. Table 1 compares how packet-mode and byte-mode drop affect two flows of different size packets. For each it gives the expected number of packets and of bits dropped in one second. Each example flow runs at the same bit-rate of 48Mb/s, but one is broken up into small 60 byte packets and the other into large 1500 byte packets.

   To keep up the same bit-rate, in one second there are about 25 times more small packets because they are 25 times smaller. As can be seen from the table, the packet rate is 100,000 small packets versus 4,000 large packets per second (pps).

   Parameter              Formula         Small packets   Large packets
   --------------------   -------------   -------------   -------------
   Packet size            s/8             60B             1,500B
   Packet size            s               480b            12,000b
   Bit-rate               x               48Mbps          48Mbps
   Packet-rate            u = x/s         100kpps         4kpps

   Packet-mode Drop
   Pkt loss probability   p               0.1%            0.1%
   Pkt loss-rate          p*u             100pps          4pps
   Bit loss-rate          p*u*s           48kbps          48kbps

   Byte-mode Drop         MTU, M=12,000b
   Pkt loss probability   b = p*s/M       0.004%          0.1%
   Pkt loss-rate          b*u             4pps            4pps
   Bit loss-rate          b*u*s           1.92kbps        48kbps

       Table 1: Example Comparing Packet-mode and Byte-mode Drop

   For packet-mode drop, we illustrate the effect of a drop probability of 0.1%, which the algorithm applies to all packets irrespective of size. Because there are 25 times more small packets in one second, it naturally drops 25 times more small packets, that is 100 small packets but only 4 large packets. But if we count how many bits it drops, there are 48,000 bits in 100 small packets and 48,000 bits in 4 large packets--the same number of bits of small packets as large.

   The packet-mode drop algorithm drops any bit with the same probability whether the bit is in a small or a large packet.

   For byte-mode drop, again we use an example drop probability of 0.1%, but only for maximum size packets (assuming the link MTU is 1,500B or 12,000b). The byte-mode algorithm reduces the drop probability of smaller packets in proportion to their size, making the probability that it drops a small packet 25 times smaller, at 0.004%. But there are 25 times more small packets, so dropping them with 25 times lower probability results in dropping the same number of packets: 4 drops in both cases. The 4 small dropped packets contain 25 times fewer bits than the 4 large dropped packets: 1,920 compared to 48,000.

   The byte-mode drop algorithm drops any bit with a probability proportionate to the size of the packet it is in.
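
   The arithmetic behind Table 1 can be checked with the short Python sketch below. This is our illustration, not part of the original example; it simply evaluates the formulae in the table's second column (using the symbols x, s, u, p and M defined there) for the two packet sizes.

      # Reproduce the rows of Table 1 from its formulae.
      BIT_RATE_X = 48_000_000   # flow bit-rate, x (bits/s)
      P = 0.001                 # packet-mode drop probability, p
      MTU_M = 12_000            # maximum packet size, M (bits)

      for name, s in (("small", 480), ("large", 12_000)):  # size s (bits)
          u = BIT_RATE_X / s            # packet rate, u = x/s (packets/s)
          # Packet-mode drop: the same probability p regardless of size.
          pm_pps, pm_bps = P * u, P * u * s
          # Byte-mode drop: probability scaled down for smaller packets.
          b = P * s / MTU_M
          bm_pps, bm_bps = b * u, b * u * s
          print(f"{name}: packet-mode {pm_pps:.0f}pps {pm_bps:.0f}bps; "
                f"byte-mode {bm_pps:.0f}pps {bm_bps:.0f}bps")

   Running it prints 100pps/48000bps (packet-mode) and 4pps/1920bps (byte-mode) for the small packets, and 4pps/48000bps under both modes for the large packets, matching the table.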
2. Recommendations

   This section gives recommendations related to network equipment in Sections 2.1 and 2.2; Sections 2.3 and 2.4 discuss the implications for transport protocols.

2.1. Recommendation on Queue Measurement

   Queue length is usually the most correct and simplest way to measure congestion of a resource. To avoid the pathological effects of drop tail, an AQM function can then be used to transform queue length into the probability of dropping or marking a packet (e.g. RED's piecewise linear function between thresholds).

   If the resource is bit-congestible, the implementation SHOULD measure the length of the queue in bytes. If the resource is packet-congestible, the implementation SHOULD measure the length of the queue in packets. No other choice makes sense, because the number of packets waiting in the queue isn't relevant if the resource gets congested by bytes, and vice versa.

   What this advice means for the case of RED:

   1.  A RED implementation SHOULD use byte-mode queue measurement for measuring the congestion of bit-congestible resources and packet-mode queue measurement for packet-congestible resources.

   2.  An implementation SHOULD NOT make it possible to configure the way a queue measures itself, because whether a queue is bit-congestible or packet-congestible is an inherent property of the queue.

   The recommended approach in less straightforward scenarios, such as fixed size buffers and resources without a queue, is discussed in Section 4.1.

2.2. Recommendation on Encoding Congestion Notification

   When encoding congestion notification (e.g. by drop, ECN or PCN), a network device SHOULD treat all packets equally, regardless of their size. In other words, the probability that network equipment drops or marks a particular packet to notify congestion SHOULD NOT depend on the size of the packet in question. As the example in Section 1.2 illustrates, to drop any bit with probability 0.1% it is only necessary to drop every packet with probability 0.1%, without regard to the size of each packet.

   This approach ensures the network layer offers sufficient congestion information for all known and future transport protocols and also ensures no perverse incentives are created that would encourage transports to use inappropriately small packet sizes.

   What this advice means for the case of RED:

   1.  AQM algorithms such as RED SHOULD use packet-mode drop, i.e. they SHOULD NOT use byte-mode drop. The latter is more complex, creates a perverse incentive to fragment segments into tiny pieces, and is vulnerable to floods of small packets.

   2.  If a vendor has implemented byte-mode drop and an operator has turned it on, it is RECOMMENDED to turn it off, after establishing whether there are any implications for the relative performance of applications using different packet sizes. RED as a whole SHOULD NOT be turned off. Without RED, a drop-tail queue biases against large packets and is vulnerable to floods of small packets.

   Note well that RED's byte-mode drop is completely orthogonal to byte-mode queue measurement and should not be confused with it. If a RED implementation has a byte-mode but does not specify what sort of byte-mode, it is most probably byte-mode queue measurement, which is fine. However, if in doubt, the vendor should be consulted.

   A survey (Appendix A) showed that there appears to be little, if any, installed base of the byte-mode drop variant of RED. This suggests that deprecating byte-mode drop will have little, if any, incremental deployment impact.

2.3. Recommendation on Responding to Congestion

   When a transport detects that a packet has been lost or congestion marked, it SHOULD consider the strength of the congestion indication as proportionate to the size in octets (bytes) of the missing or marked packet.

   In other words, when a packet indicates congestion (by being lost or marked) it can be considered conceptually as if there is a congestion indication on every octet of the packet, not just one indication per packet (a minimal sketch of this accounting appears after the list below).

   To be clear, the above recommendation solely describes how a transport should interpret the meaning of a congestion indication. It makes no recommendation on whether a transport should act differently based on this interpretation. It merely aids interoperability between transports, if they choose to make their actions depend on the strength of congestion indications.

   This definition will be useful as the IETF transport area continues its programme of:

   o  updating host-based congestion control protocols to take account of packet size

   o  making transports less sensitive to losing control packets like SYNs and pure ACKs.
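
   The sketch below (ours, with assumed names; it illustrates one way a transport could interpret indications, not a specified algorithm) shows the octet-based accounting implied by the recommendation:

      class CongestionAccounting:
          """Count congestion indications in octets, not packets."""
          def __init__(self):
              self.marked_bytes = 0  # octets covered by losses/marks
              self.sent_bytes = 0    # octets sent in the same interval

          def on_packet_sent(self, size_bytes):
              self.sent_bytes += size_bytes

          def on_congestion_indication(self, size_bytes):
              # A lost or marked packet counts once per octet it carried.
              self.marked_bytes += size_bytes

          def congestion_level(self):
              # Fraction of octets indicating congestion, in [0, 1].
              return (self.marked_bytes / self.sent_bytes
                      if self.sent_bytes else 0.0)

   Whether and how a transport acts on this level is deliberately left outside the scope of the recommendation.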
   What this advice means for the case of TCP:

   1.  If two TCP flows with different packet sizes are required to run at equal bit rates under the same path conditions, this should be done by altering TCP (Section 4.2.2), not network equipment (the latter affects other transports besides TCP).

   2.  If it is desired to improve TCP performance by reducing the chance that a SYN or a pure ACK will be dropped, this should be done by modifying TCP (Section 4.2.3), not network equipment.

   To be clear, we are not recommending at all that TCPs under equivalent conditions should aim for equal bit-rates. We are merely saying that anyone trying to do such a thing should modify their TCP algorithm, not the network.

2.4. Recommendation on Handling Congestion Indications when Splitting or Merging Packets

   Packets carrying congestion indications may be split or merged in some circumstances (e.g. at an RTCP transcoder or during IP fragment reassembly). Splitting and merging only make sense in the context of ECN, not loss.

   The general rule to follow is that the number of octets in packets with congestion indications SHOULD be equivalent before and after merging or splitting. This is based on the principle used above: an indication of congestion on a packet can be considered as an indication of congestion on each octet of the packet.

   The above rule is not phrased with the word "MUST" to allow the following exception. There are cases where pre-existing protocols were not designed to conserve congestion-marked octets (e.g. IP fragment reassembly [RFC3168] or loss statistics in RTCP receiver reports [RFC3550] before ECN was added [I-D.ietf-avtcore-ecn-for-rtp]). When any such protocol is updated, it SHOULD comply with the above rule to conserve marked octets. However, the rule may be relaxed if it would otherwise become too complex to interoperate with pre-existing implementations of the protocol.

   One can think of a splitting or merging process as if all the incoming congestion-marked octets increment a counter and all the outgoing marked octets decrement the same counter. In order to ensure that congestion indications remain timely, even the smallest positive remainder in the conceptual counter should trigger the next outgoing packet to be marked (causing the counter to go negative).
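
   The conceptual counter can be sketched as follows (our illustration of the rule just described, with assumed names):

      class MarkedOctetConserver:
          """Conserve congestion-marked octets across split/merge."""
          def __init__(self):
              self.balance = 0  # congestion-marked octets owed

          def on_incoming(self, size_bytes, ce_marked):
              if ce_marked:
                  self.balance += size_bytes

          def should_mark_outgoing(self, size_bytes):
              # Even the smallest positive remainder triggers a mark,
              # which may take the counter negative, keeping
              # indications timely.
              if self.balance > 0:
                  self.balance -= size_bytes
                  return True
              return False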
A protocol 504 design that caused larger packets to be more likely to be dropped 505 than smaller ones would be dangerous in both the following cases: 507 Malicious transports: A queue that gives an advantage to small 508 packets can be used to amplify the force of a flooding attack. By 509 sending a flood of small packets, the attacker can get the queue 510 to discard more traffic in large packets, allowing more attack 511 traffic to get through to cause further damage. Such a queue 512 allows attack traffic to have a disproportionately large effect on 513 regular traffic without the attacker having to do much work. 515 Non-malicious transports: Even if a transport designer is not 516 actually malicious, if over time it is noticed that small packets 517 tend to go faster, designers will act in their own interest and 518 use smaller packets. Queues that give advantage to small packets 519 create an evolutionary pressure for transports to send at the same 520 bit-rate but break their data stream down into tiny segments to 521 reduce their drop rate. Encouraging a high volume of tiny packets 522 might in turn unnecessarily overload a completely unrelated part 523 of the system, perhaps more limited by header-processing than 524 bandwidth. 526 Imagine two unresponsive flows arrive at a bit-congestible 527 transmission link each with the same bit rate, say 1Mbps, but one 528 consists of 1500B and the other 60B packets, which are 25x smaller. 529 Consider a scenario where gentle RED [gentle_RED] is used, along with 530 the variant of RED we advise against, i.e. where the RED algorithm is 531 configured to adjust the drop probability of packets in proportion to 532 each packet's size (byte mode packet drop). In this case, RED aims 533 to drop 25x more of the larger packets than the smaller ones. Thus, 534 for example if RED drops 25% of the larger packets, it will aim to 535 drop 1% of the smaller packets (but in practice it may drop more as 536 congestion increases [RFC4828; Appx B.4]). Even though both flows 537 arrive with the same bit rate, the bit rate the RED queue aims to 538 pass to the line will be 750kbps for the flow of larger packets but 539 990kbps for the smaller packets (because of rate variations it will 540 actually be a little less than this target). 542 Note that, although the byte-mode drop variant of RED amplifies small 543 packet attacks, drop-tail queues amplify small packet attacks even 544 more (see Security Considerations in Section 6). Wherever possible 545 neither should be used. 547 3.2. Small != Control 549 Dropping fewer control packets considerably improves performance. It 550 is tempting to drop small packets with lower probability in order to 551 improve performance, because many control packets are small (TCP SYNs 552 & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc). 553 However, we must not give control packets preference purely by virtue 554 of their smallness, otherwise it is too easy for any data source to 555 get the same preferential treatment simply by sending data in smaller 556 packets. Again we should not create perverse incentives to favour 557 small packets rather than to favour control packets, which is what we 558 intend. 560 Just because many control packets are small does not mean all small 561 packets are control packets. 
   Note that, although the byte-mode drop variant of RED amplifies small packet attacks, drop-tail queues amplify small packet attacks even more (see Security Considerations in Section 6). Wherever possible, neither should be used.

3.2. Small != Control

   Dropping fewer control packets considerably improves performance, so it is tempting to drop small packets with lower probability, because many control packets are small (TCP SYNs & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc). However, we must not give control packets preference purely by virtue of their smallness, otherwise it is too easy for any data source to get the same preferential treatment simply by sending data in smaller packets. Again, we should not create perverse incentives to favour small packets rather than to favour control packets, which is what we actually intend.

   Just because many control packets are small does not mean all small packets are control packets.

   So, rather than fix these problems in the network, we argue that the transport should be made more robust against losses of control packets (see 'Making Transports Robust against Control Packet Losses' in Section 4.2.3).

3.3. Transport-Independent Network

   TCP congestion control ensures that flows competing for the same resource each maintain the same number of segments in flight, irrespective of segment size. So, under similar conditions, flows with different segment sizes will get different bit-rates.

   To counter this effect, it seems tempting not to follow our recommendation, and instead for the network to bias congestion notification by packet size in order to equalise the bit-rates of flows with different packet sizes. However, in order to do this, the queuing algorithm has to make assumptions about the transport, which become embedded in the network. Specifically:

   o  The queuing algorithm has to assume how aggressively the transport will respond to congestion (see Section 4.2.4). If the network assumes the transport responds as aggressively as TCP NewReno, it will be wrong for Compound TCP and differently wrong for Cubic TCP, etc. To achieve equal bit-rates, each transport then has to guess what assumption the network made, and work out how to replace this assumed aggressiveness with its own aggressiveness.

   o  Also, if the network biases congestion notification by packet size, it has to assume a baseline packet size--all proposed algorithms use the local MTU (for example, see the byte-mode loss probability formula in Table 1). Then, if the non-Reno transports mentioned above are trying to reverse engineer what the network assumed, they also have to guess the MTU of the congested link.

   Even though reducing the drop probability of small packets (e.g. RED's byte-mode drop) helps ensure TCP flows with different packet sizes will achieve similar bit rates, we argue this correction should be made to any future transport protocols based on TCP, not to the network in order to fix one transport, no matter how predominant it is. Effectively, favouring small packets is reverse engineering of network equipment around one particular transport protocol (TCP), contrary to the excellent advice in [RFC3426], which asks designers to question "Why are you proposing a solution at this layer of the protocol stack, rather than at another layer?"

   In contrast, if the network never takes account of packet size, the transport can be certain it will never need to guess any assumptions the network has made. And the network passes two pieces of information to the transport that are sufficient in all cases: i) congestion notification on the packet and ii) the size of the packet. Both are available for the transport to combine (by taking account of packet size when responding to congestion) or not. Appendix B checks that these two pieces of information are sufficient for all relevant scenarios.

   When the network does not take account of packet size, it allows transport protocols to choose whether to take account of packet size or not.
   However, if the network were to bias congestion notification by packet size, transport protocols would have no choice; those that did not take account of packet size themselves would unwittingly become dependent on packet size, and those that already took account of packet size would end up taking account of it twice.

3.4. Partial Deployment of AQM

   In overview, the argument in this section runs as follows:

   o  Because the network does not and cannot always drop packets in proportion to their size, it shouldn't be given the task of making drop signals depend on packet size at all.

   o  Transports, on the other hand, don't always want to make their rate response proportional to the size of dropped packets, but if they want to, they always can.

   The argument is similar to the end-to-end argument that says "Don't do X in the network if end-systems can do X by themselves, and they want to be able to choose whether to do X anyway." Actually, the following argument is stronger; in addition, it says "Don't give the network task X that could be done by the end-systems, if X is not deployed on all network nodes, and end-systems won't be able to tell whether their network is doing X, or whether they need to do X themselves." In this case, the X in question is "making the response to congestion depend on packet size".

   We will now re-run this argument, taking each step in more depth. The argument applies solely to drop, not to ECN marking.

   A queue drops packets for either of two reasons: a) to signal to host congestion controls that they should reduce the load and b) because there is no buffer left to store the packets. Active queue management tries to use drops as a signal for hosts to slow down (case a) so that drop due to buffer exhaustion (case b) should not be necessary.

   AQM is not universally deployed in every queue in the Internet; many cheap Ethernet bridges, software firewalls, NATs on consumer devices, etc. implement simple tail-drop buffers. Even if AQM were universal, it has to be able to cope with buffer exhaustion (by switching to a behaviour like tail-drop), in order to cope with unresponsive or excessive transports. For these reasons, networks will sometimes be dropping packets as a last resort (case b) rather than under AQM control (case a).

   When buffers are exhausted (case b), they don't naturally drop packets in proportion to their size. The network can only reduce the probability of dropping smaller packets if it has enough space to store them somewhere while it waits for a larger packet that it can drop. If the buffer is exhausted, it does not have this choice. Admittedly, tail-drop does naturally drop somewhat fewer small packets, but exactly how few depends more on the mix of sizes than on the size of the packet in question. Nonetheless, in general, if we wanted networks to do size-dependent drop, we would need universal deployment of (packet-size dependent) AQM code, which is currently unrealistic.

   A host transport cannot know whether any particular drop was a deliberate signal from an AQM or a sign of a queue shedding packets due to buffer exhaustion. Therefore, because the network cannot universally do size-dependent drop, it should not do it at all.

   Whereas universality is desirable in the network, diversity is desirable between different transport layer protocols - some, like NewReno TCP [RFC5681], may not choose to make their rate response proportionate to the size of each dropped packet, while others will (e.g. TFRC-SP [RFC4828]).

3.5. Implementation Efficiency

   Biasing against large packets typically requires an extra multiply and divide in the network (see the example byte-mode drop formula in Table 1). Allowing for packet size at the transport rather than in the network ensures that neither the network nor the transport needs to do a multiply operation--multiplication by packet size is effectively achieved as a repeated add when the transport adds to its count of marked bytes as each congestion event is fed to it. Also, the work to do the biasing is spread over many hosts, rather than concentrated in just the congested network element. These aren't principled reasons in themselves, but they are a happy consequence of the other principled reasons.

4. A Survey and Critique of Past Advice

   This section is informative, not normative.

   The original 1993 paper on RED [RED93] proposed two options for the RED active queue management algorithm: packet mode and byte mode. Packet mode measured the queue length in packets and dropped (or marked) individual packets with a probability independent of their size. Byte mode measured the queue length in bytes and marked an individual packet with a probability in proportion to its size (relative to the maximum packet size). In the paper's outline of further work, it was stated that no recommendation had been made on whether the queue size should be measured in bytes or packets, but it was noted that the difference could be significant.

   When RED was recommended for general deployment in 1998 [RFC2309], the two modes were mentioned, implying the choice between them was a question of performance, referring to a 1997 email [pktByteEmail] for advice on tuning. A later addendum to this email introduced the insight that there are in fact two orthogonal choices:

   o  whether to measure queue length in bytes or packets (Section 4.1)

   o  whether the drop probability of an individual packet should depend on its own size (Section 4.2).

   The rest of this section is structured accordingly.

4.1. Congestion Measurement Advice

   The choice of which metric to use to measure queue length was left open in RFC 2309. It is now well understood that queues for bit-congestible resources should be measured in bytes, and queues for packet-congestible resources should be measured in packets [pktByteEmail].

   Congestion in some legacy bit-congestible buffers is only measured in packets, not bytes. In such cases, the operator has to set the thresholds mindful of a typical mix of packet sizes. Any AQM algorithm on such a buffer will be oversensitive to high proportions of small packets, e.g. a DoS attack, and undersensitive to high proportions of large packets. However, there is no need to make allowances for the possibility of such legacy in future protocol design. This is safe because any undersensitivity during unusual traffic mixes cannot lead to congestion collapse, given the buffer will eventually revert to tail drop, discarding proportionately more large packets.

4.1.1. Fixed Size Packet Buffers

   The question of whether to measure queues in bytes or packets seems to be well understood. However, measuring congestion is not straightforward when the resource is bit-congestible but the queue is packet-congestible, or vice versa. This section outlines the approach to take. There is no controversy over what should be done; working it out merely requires expertise in probability. And, even if you know what should be done, it's not always easy to find a practical algorithm to implement it.

   Some, mostly older, queuing hardware sets aside fixed sized buffers in which to store each packet in the queue. Also, with some hardware, any fixed sized buffers not completely filled by a packet are padded when transmitted to the wire. If we imagine a theoretical forwarding system with both queuing and transmission in fixed, MTU-sized units, it should clearly be treated as packet-congestible, because the queue length in packets would be a good model of congestion of the lower layer link.

   If we now imagine a hybrid forwarding system with transmission delay largely dependent on the byte-size of packets but buffers of one MTU per packet, it should strictly require a more complex algorithm to determine the probability of congestion. It should be treated as two resources in sequence, where the sum of the byte-sizes of the packets within each packet buffer models congestion of the line while the length of the queue in packets models congestion of the queue. Then the probability of congesting the forwarding buffer would be a conditional probability--conditional on the previously calculated probability of congesting the line.

   In systems that use fixed size buffers, it is unusual for all the buffers used by an interface to be the same size. Typically, pools of different sized buffers are provided (Cisco uses the term 'buffer carving' for the process of dividing up memory into these pools [IOSArch]). Usually, if the pool of small buffers is exhausted, arriving small packets can borrow space in the pool of large buffers, but not vice versa. However, it is easier to work out what should be done if we temporarily set aside the possibility of such borrowing. Then, with fixed pools of buffers for different sized packets and no borrowing, the size of each pool and the current queue length in each pool would both be measured in packets. So an AQM algorithm would have to maintain the queue length for each pool, and judge whether to drop/mark a packet of a particular size by looking at the pool for packets of that size and using the length (in packets) of its queue.

   We now return to the issue we temporarily set aside: small packets borrowing space in larger buffers. In this case, the only difference is that the pools for smaller packets have a maximum queue size that includes all the pools for larger packets. And every time a packet takes a larger buffer, the current queue size has to be incremented for all queues in the pools of buffers less than or equal to the buffer size used.
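
   As an illustration (ours, with assumed pool sizes; not a vendor algorithm), the accounting just described might look like this:

      POOL_SIZES = [128, 512, 1600]   # bytes per buffer in each pool
      queue_len = {size: 0 for size in POOL_SIZES}  # lengths in packets

      def buffer_taken(buf_size):
          # Taking a buffer of a given size counts against every pool
          # of buffers that size or smaller, since packets from those
          # pools were entitled to use it.
          for size in POOL_SIZES:
              if size <= buf_size:
                  queue_len[size] += 1

      def buffer_freed(buf_size):
          for size in POOL_SIZES:
              if size <= buf_size:
                  queue_len[size] -= 1

      def pool_for(pkt_size):
          # The AQM judges a packet against the pool for its size,
          # using that pool's queue length in packets, not bytes.
          # (Assumes pkt_size fits the largest pool.)
          return min(s for s in POOL_SIZES if s >= pkt_size)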
   We will return to borrowing of fixed sized buffers when we discuss biasing the drop/marking probability of a specific packet because of its size in Section 4.2.1. But here we can give at least one simple rule for how to measure the length of queues of fixed buffers: no matter how complicated the scheme is, ultimately any fixed buffer system will need to measure its queue length in packets, not bytes.

4.1.2. Congestion Measurement without a Queue

   AQM algorithms are nearly always described assuming there is a queue for a congested resource and that the algorithm can use the queue length to determine the probability that it will drop or mark each packet.

   But not all congested resources lead to queues. For instance, wireless spectrum is usually regarded as bit-congestible (for a given coding scheme). But wireless link protocols do not always maintain a queue that depends on spectrum interference. Similarly, power-limited resources are also usually bit-congestible if energy is primarily required for transmission rather than header processing, but it is rare for a link protocol to build a queue as it approaches maximum power.

   Nonetheless, AQM algorithms do not require a queue in order to work. For instance, spectrum congestion can be modelled by signal quality using the target bit-energy-to-noise-density ratio. And, to model radio power exhaustion, transmission power levels can be measured and compared to the maximum power available. [ECNFixedWireless] proposes a practical and theoretically sound way to combine congestion notification for different bit-congestible resources at different layers along an end-to-end path, whether wireless or wired, and whether with or without queues.

4.2. Congestion Notification Advice

4.2.1. Network Bias when Encoding

4.2.1.1. Advice on Packet Size Bias in RED

   The previously mentioned email [pktByteEmail] referred to by [RFC2309] advised that most scarce resources in the Internet were bit-congestible, which is still believed to be true (Section 1.1). But it went on to offer advice that is updated by this memo. It said that drop probability should depend on the size of the packet being considered for drop if the resource is bit-congestible, but not if it is packet-congestible. The argument continued that if packet drops were inflated by packet size (byte-mode dropping), "a flow's fraction of the packet drops is then a good indication of that flow's fraction of the link bandwidth in bits per second". This was consistent with a referenced policing mechanism being worked on at the time for detecting unusually high bandwidth flows, eventually published in 1999 [pBox]. However, the problem could and should have been solved by making the policing mechanism count the volume of bytes randomly dropped, not the number of packets.

   A few months before RFC 2309 was published, an addendum was added to the above archived email referenced from the RFC, in which the final paragraph seemed to partially retract what had previously been said. It clarified that the question of whether the probability of dropping/marking a packet should depend on its size was not related to whether the resource itself was bit-congestible, but was a completely orthogonal question. However, the only example given had the queue measured in packets while packet drop depended on the byte-size of the packet in question. No example was given the other way round.

   In 2000, Cnodder et al [REDbyte] pointed out that there was an error in the part of the original 1993 RED algorithm that aimed to distribute drops uniformly, because it didn't correctly take into account the adjustment for packet size. They recommended an algorithm called RED_4 to fix this. But they also recommended a further change, RED_5, to adjust the drop rate dependent on the square of relative packet size. This was indeed consistent with one implied motivation behind RED's byte-mode drop--that we should reverse engineer the network to improve the performance of dominant end-to-end congestion control mechanisms. This memo makes a different recommendation in Section 2.
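
   To make the comparison concrete, the two adjustments can be written as follows (our sketch, using p for the drop probability of a maximum-size packet, s for packet size and M for the maximum packet size, as in Table 1):

      from math import sqrt

      def red4_drop_prob(p, s, M):
          return p * (s / M)        # linear byte-mode drop

      def red5_drop_prob(p, s, M):
          return p * (s / M) ** 2   # squared byte-mode drop

      # With a TCP-like rate per RTT proportional to s/sqrt(p_eff),
      # only the squared adjustment makes the rate independent of s:
      p, M = 0.001, 1500
      for s in (60, 1500):
          for name, drop in (("RED_4", red4_drop_prob),
                             ("RED_5", red5_drop_prob)):
              print(name, s, round(s / sqrt(drop(p, s, M))))
      # RED_4 gives 9487 vs 47434; RED_5 gives 47434 for both sizes.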
   By 2003, a further change had been made to the adjustment for packet size, this time in the RED algorithm of the ns2 simulator. Instead of taking each packet's size relative to a `maximum packet size', it was taken relative to a `mean packet size', intended to be a static value representative of the `typical' packet size on the link. We have not been able to find a justification in the literature for this change; however, Eddy and Allman conducted experiments [REDbias] that assessed how sensitive RED was to this parameter, amongst other things. This changed algorithm can often lead to drop probabilities greater than 1 (which gives a hint that there is probably a mistake in the theory somewhere).

   On 10-Nov-2004, this variant of byte-mode packet drop was made the default in the ns2 simulator. It seems unlikely that byte-mode drop has ever been implemented in production networks (Appendix A); therefore, any ns2 simulations that use RED without disabling byte-mode drop are likely to behave very differently from RED in production networks.

4.2.1.2. Packet Size Bias Regardless of RED

   The byte-mode drop variant of RED is, of course, not the only possible bias towards small packets in queueing systems. We have already mentioned that tail-drop queues naturally tend to lock out large packets once they are full. But queues with fixed sized buffers also reduce the probability that small packets will be dropped if (and only if) they allow small packets to borrow buffers from the pools for larger packets. As was explained in Section 4.1.1 on fixed size buffer carving, borrowing effectively makes the maximum queue size for small packets greater than that for large packets, because more buffers can be used by small packets while fewer will fit large packets.

   In itself, the bias towards small packets caused by buffer borrowing is perfectly correct. Lower drop probability for small packets is legitimate in buffer borrowing schemes, because small packets genuinely congest the machine's buffer memory less than large packets, given they can fit in more spaces. The bias towards small packets is not artificially added (as it is in RED's byte-mode drop algorithm); it merely reflects the reality of the way fixed buffer memory gets congested. Incidentally, the bias towards small packets from buffer borrowing is nothing like as large as that of RED's byte-mode drop.

   Nonetheless, fixed-buffer memory with tail drop is still prone to locking out large packets, purely because of the tail-drop aspect. So a good AQM algorithm like RED with packet-mode drop should be used with fixed buffer memories where possible. If RED is too complicated to implement with multiple fixed buffer pools, the minimum necessary to prevent large packet lock-out is to ensure smaller packets never use the last available buffer in any of the pools for larger packets.

4.2.2. Transport Bias when Decoding

   The above proposals to alter the network equipment to bias towards smaller packets have largely carried on outside the IETF process. In contrast, within the IETF there are many different proposals to alter transport protocols to achieve the same goals, i.e. either to make the flow bit-rate take account of packet size, or to protect control packets from loss. This memo argues that altering transport protocols is the more principled approach.

   A recently approved experimental RFC adapts its transport layer protocol to take account of packet sizes relative to typical TCP packet sizes. This proposes a new small-packet variant of TCP-friendly rate control [RFC5348] called TFRC-SP [RFC4828]. Essentially, it proposes a rate equation that inflates the flow rate by the ratio of a typical TCP segment size (1500B including TCP header) over the actual segment size [PktSizeEquCC]. (There are also other important differences of detail relative to TFRC, such as using virtual packets [CCvarPktSize] to avoid responding to multiple losses per round trip and using a minimum inter-packet interval.)

   Section 4.5.1 of this TFRC-SP spec discusses the implications of operating in an environment where queues have been configured to drop smaller packets with proportionately lower probability than larger ones. But it only discusses TCP operating in such an environment, only mentioning TFRC-SP briefly when discussing how to define fairness with TCP. And it only discusses the byte-mode dropping version of RED as it was before Cnodder et al pointed out that it didn't sufficiently bias towards small packets to make TCP independent of packet size.

   So the TFRC-SP spec doesn't address the issue of which of the network or the transport _should_ handle fairness between different packet sizes. In its Appendix B.4, it discusses the possibility of both TFRC-SP and some network buffers duplicating each other's attempts to deliberately bias towards small packets. But the discussion is not conclusive, instead reporting simulations of many of the possibilities in order to assess performance but not recommending any particular course of action.

   The paper originally proposing TFRC with virtual packets (VP-TFRC) [CCvarPktSize] proposed that there should perhaps be two variants to cater for the different variants of RED. However, as the TFRC-SP authors point out, there is no way for a transport to know whether some queues on its path have deployed RED with byte-mode packet drop (except if an exhaustive survey found that no-one has deployed it!--see Appendix A). Incidentally, VP-TFRC also proposed that byte-mode RED dropping should really square the packet-size compensation-factor (like that of Cnodder's RED_5, but apparently unaware of it).

   Pre-congestion notification [RFC5670] is an IETF technology that uses a virtual queue for AQM marking for packets within one Diffserv class in order to give early warning prior to any real queuing. The PCN marking algorithms have been designed not to take account of packet size when forwarding through queues. Instead, the general principle has been to take account of the sizes of marked packets when monitoring the fraction of marking at the edge of the network, as recommended here.
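
   A sketch of such byte-weighted measurement at the edge (ours; the PCN specifications define the principle, not this code):

      def marking_fraction(packets):
          """packets: iterable of (size_bytes, marked) pairs."""
          total = sum(size for size, _ in packets)
          marked = sum(size for size, m in packets if m)
          return marked / total if total else 0.0

      # One marked 1500B packet outweighs a marked 60B packet 25:1.
      print(marking_fraction([(1500, True), (60, False)]))  # ~0.96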
4.2.3. Making Transports Robust against Control Packet Losses

   Recently, two RFCs have defined changes to TCP that make it more robust against losing small control packets [RFC5562] [RFC5690]. In both cases they note that the case for these two TCP changes would be weaker if RED were biased against dropping small packets. We argue here that these two proposals are a safer and more principled way to achieve TCP performance improvements than reverse engineering RED to benefit TCP.

   Although there are no known proposals, it would also be possible and perfectly valid to make control packets robust against drop by explicitly requesting a lower drop probability, using their Diffserv code point [RFC2474] to request a scheduling class with lower drop.

   Although not brought to the IETF, a simple proposal from Wischik [DupTCP] suggests that the first three packets of every TCP flow should be routinely duplicated after a short delay. It shows that this would greatly improve the chances of short flows completing quickly, but that it would hardly increase traffic levels on the Internet, because Internet bytes have always been concentrated in the large flows. It further shows that the performance of many typical applications depends on the completion of long serial chains of short messages. It argues that, given most of the value people get from the Internet is concentrated within short flows, this simple expedient would greatly increase the value of the best-efforts Internet at minimal cost.

4.2.4. Congestion Notification: Summary of Conflicting Advice

   +-----------+----------------+-----------------+--------------------+
   | transport | RED_1 (packet  | RED_4 (linear   | RED_5 (square byte |
   | cc        | mode drop)     | byte mode drop) | mode drop)         |
   +-----------+----------------+-----------------+--------------------+
   | TCP or    | s/sqrt(p)      | sqrt(s/p)       | 1/sqrt(p)          |
   | TFRC      |                |                 |                    |
   | TFRC-SP   | 1/sqrt(p)      | 1/sqrt(sp)      | 1/(s.sqrt(p))      |
   +-----------+----------------+-----------------+--------------------+

   Table 2: Dependence of flow bit-rate per RTT on packet size, s, and drop probability, p, when the network and/or the transport bias towards small packets to varying degrees

   Table 2 aims to summarise the potential effects of all the advice from different sources. Each column shows a different possible AQM behaviour in different queues in the network, using the terminology of Cnodder et al outlined earlier (RED_1 is basic RED with packet-mode drop). Each row shows a different transport behaviour: TCP [RFC5681] and TFRC [RFC5348] on the top row, with TFRC-SP [RFC4828] below. Each cell shows how the bits per round trip of a flow depend on packet size, s, and drop probability, p. In order to declutter the formulae and to focus on packet-size dependence, they are all given per round trip, which removes any RTT term.
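
   The cells of Table 2 can be evaluated mechanically to see which combinations make the flow rate independent of s (our sketch; the formulae are proportionalities, so only the dependence on s is meaningful, not the absolute values):

      from math import sqrt

      formulas = {
          ("TCP/TFRC", "RED_1"): lambda s, p: s / sqrt(p),
          ("TCP/TFRC", "RED_4"): lambda s, p: sqrt(s / p),
          ("TCP/TFRC", "RED_5"): lambda s, p: 1 / sqrt(p),
          ("TFRC-SP",  "RED_1"): lambda s, p: 1 / sqrt(p),
          ("TFRC-SP",  "RED_4"): lambda s, p: 1 / sqrt(s * p),
          ("TFRC-SP",  "RED_5"): lambda s, p: 1 / (s * sqrt(p)),
      }
      p = 0.01
      for (cc, aqm), rate in formulas.items():
          independent = abs(rate(60, p) - rate(1500, p)) < 1e-9
          print(cc, aqm, "size-independent:", independent)
      # Only TCP/TFRC + RED_5 and TFRC-SP + RED_1 are size-independent.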
Let us assume that the goal is for the bit-rate of a flow to be
independent of packet size. Suppressing all inessential details,
the table shows that this could be achieved either by leaving the
TCP transport unaltered in a RED_5 network, or by using the small-
packet TFRC-SP transport (or similar) in a network without any
byte-mode dropping RED (top right and bottom left respectively).
Top left is the 'do nothing' scenario, while bottom right is the 'do
both' scenario, in which the bit-rate would become far too biased
towards small packets. Of course, if any form of byte-mode dropping
RED has been deployed on a subset of queues that congest, each path
through the network will present a different hybrid scenario to its
transport.

Either way, we can see that the linear byte-mode drop column in the
middle would considerably complicate the Internet. It is a half-way
house that does not bias far enough towards small packets, even if
one believes the network should be doing the biasing. Section 2
recommends that _all_ bias in network equipment towards small
packets should be turned off--if indeed any equipment vendors have
implemented it--leaving packet-size bias solely as the preserve of
the transport layer (i.e. solely the leftmost, packet-mode drop
column).

In practice it seems that no deliberate bias towards small packets
has been implemented for production networks. In a survey of 84
equipment vendors, none of the 19% who responded had implemented
byte-mode drop in RED (see Appendix A for details).

5. Outstanding Issues and Next Steps

5.1. Bit-congestible Network

For a connectionless network with nearly all resources being bit-
congestible, the recommended position is clear: the network should
not make allowance for packet sizes, and the transport should. This
leaves two outstanding issues:

o How to handle any legacy of AQM with byte-mode drop already
  deployed;

o The need to start a programme to update transport congestion
  control protocol standards to take account of packet size.

A survey of equipment vendors (Section 4.2.4) found no evidence that
byte-mode packet drop had been implemented, so deployment will be
sparse at best. A migration strategy is not really needed to remove
an algorithm that may not even be deployed.

A programme of experimental updates to take account of packet size
in transport congestion control protocols has already started with
TFRC-SP [RFC4828].

5.2. Bit- & Packet-congestible Network

The position is much less clear-cut if the Internet becomes
populated by a more even mix of both packet-congestible and bit-
congestible resources (see Appendix B.2). This problem is not
pressing, because most Internet resources are designed to be bit-
congestible before packet processing starts to congest (see
Section 1.1).

The IRTF Internet Congestion Control Research Group (ICCRG) has set
itself the task of reaching consensus on generic forwarding
mechanisms that are necessary and sufficient to support the
Internet's future congestion control requirements (the first
challenge in [RFC6077]). The research question of whether packet
congestion might become common, and what to do if it does, may in
the future be explored in the IRTF ("Challenge 3: Packet Size" in
[RFC6077]).
6. Security Considerations

This memo recommends that queues do not bias drop probability
towards small packets, as this creates a perverse incentive for
transports to break their flows down into tiny segments. One of the
benefits of implementing AQM was meant to be to remove the perverse
incentive that drop-tail queues gave to small packets.

In practice, transports cannot all be trusted to respond to
congestion. So another reason for recommending that queues do not
bias drop probability towards small packets is to avoid the
vulnerability to small-packet DDoS attacks that would otherwise
result. One of the benefits of implementing AQM was meant to be to
remove drop-tail's DoS vulnerability to small packets, so we should
not add it back again.

If most queues implemented AQM with byte-mode drop, the resulting
network would amplify the potency of a small-packet DDoS attack. At
the first queue the stream of packets would push aside a greater
proportion of large packets, so more of the small packets would
survive to attack the next queue. Thus a flood of small packets
would continue on towards the destination, pushing regular traffic
with large packets out of the way in one queue after the next, while
suffering much less drop itself.

Appendix C explains why the ability of networks to police the
response of _any_ transport to congestion depends on bit-congestible
network resources doing only packet-mode drop, not byte-mode drop.
In summary, it says that making drop probability depend on the size
of the packets that bits happen to be divided into simply encourages
the bits to be divided into smaller packets. Byte-mode drop would
therefore irreversibly complicate any attempt to fix the Internet's
incentive structures.

7. IANA Considerations

This document has no actions for IANA.

8. Conclusions

This memo identifies the three distinct stages of the congestion
notification process where implementations need to decide whether to
take packet size into account. The recommendations provided in
Section 2 of this memo are different in each case:

o When network equipment measures the length of a queue, whether it
  counts in bytes or packets depends on whether the network resource
  is congested by bytes or by packets respectively.

o When network equipment decides whether to drop (or mark) a packet,
  it is recommended that the size of the particular packet should
  not be taken into account.

o However, when a transport algorithm responds to a dropped or
  marked packet, the size of the rate reduction should be
  proportionate to the size of the packet.

In summary, the answers are 'it depends', 'no' and 'yes'
respectively.

For the specific case of RED, this means that byte-mode queue
measurement will often be appropriate, although byte-mode drop is
strongly deprecated.

At the transport layer the IETF should continue updating congestion
control protocols to take account of the size of each packet that
indicates congestion. The IETF should also continue to make
protocols less sensitive to losing control packets like SYNs, pure
ACKs and DNS exchanges. Although many control packets happen to be
small, the alternative of network equipment favouring all small
packets would be dangerous. It would create perverse incentives to
split data transfers into smaller packets.
The memo develops these recommendations from principled arguments
concerning scaling, layering, incentives, inherent efficiency,
security and policeability, but it also addresses practical issues
such as specific buffer architectures and incremental deployment.
Indeed, a limited survey of RED implementations is discussed, which
shows that there appears to be little, if any, installed base of
RED's byte-mode drop. Therefore it can be deprecated with few, if
any, incremental deployment complications.

The recommendations have been developed on the well-founded basis
that most Internet resources are bit-congestible, not packet-
congestible. We need to know the likelihood that this assumption
will prevail in the longer term and, if it might not, what protocol
changes will be needed to cater for a mix of the two. The IRTF
Internet Congestion Control Research Group (ICCRG) is currently
working on these problems [RFC6077].

9. Acknowledgements

Thank you to Sally Floyd, who gave extensive and useful review
comments. Also thanks for the reviews from Philip Eardley, David
Black, Fred Baker, Toby Moncaster, Arnaud Jacquet and Mirja
Kuehlewind, as well as helpful explanations of different hardware
approaches from Larry Dunn and Fred Baker. We are grateful to Bruce
Davie and his colleagues for providing a timely and efficient survey
of RED implementation in Cisco's product range. Also grateful
thanks to Toby Moncaster, Will Dormann, John Regnault, Simon Carter
and Stefaan De Cnodder, who further helped survey the current status
of RED implementation and deployment, and, finally, thanks to the
anonymous individuals who responded.

Bob Briscoe and Jukka Manner were partly funded by Trilogy, a
research project (ICT-216372) supported by the European Community
under its Seventh Framework Programme. The views expressed here are
those of the authors only.

10. Comments Solicited

Comments and questions are encouraged and very welcome. They can be
addressed to the IETF Transport Area working group mailing list,
and/or to the authors.

11. References

11.1. Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
           of Explicit Congestion Notification (ECN) to IP",
           RFC 3168, September 2001.

11.2. Informative References

[CCvarPktSize]  Widmer, J., Boutremans, C., and J-Y. Le Boudec,
           "Congestion Control for Flows with Variable Packet Size",
           ACM CCR 34(2):137--151, 2004.

[CHOKe_Var_Pkt]  Psounis, K., Pan, R., and B. Prabhaker,
           "Approximate Fair Dropping for Variable Length Packets",
           IEEE Micro 21(1):48--56, January-February 2001.

[DRQ]      Shin, M., Chong, S., and I. Rhee, "Dual-Resource TCP/AQM
           for Processing-Constrained Networks", IEEE/ACM
           Transactions on Networking 16(2), April 2008.

[DupTCP]   Wischik, D., "Short messages", Royal Society workshop on
           networks: modelling and control, September 2007.

[ECNFixedWireless]  Siris, V., "Resource Control for Elastic Traffic
           in CDMA Networks", Proc. ACM MOBICOM'02, September 2002.
Kelly, "Resource 1280 pricing and the evolution of 1281 congestion control", 1282 Automatica 35(12)1969--1985, 1283 December 1999, . 1287 [I-D.ietf-avtcore-ecn-for-rtp] Westerlund, M., Johansson, I., 1288 Perkins, C., O'Hanlon, P., and K. 1289 Carlberg, "Explicit Congestion 1290 Notification (ECN) for RTP over UDP", 1291 draft-ietf-avtcore-ecn-for-rtp-08 1292 (work in progress), May 2012. 1294 [I-D.ietf-conex-concepts-uses] Briscoe, B., Woundy, R., and A. 1295 Cooper, "ConEx Concepts and Use 1296 Cases", 1297 (work in progress), March 2012. 1299 [IOSArch] Bollapragada, V., White, R., and C. 1300 Murphy, "Inside Cisco IOS Software 1301 Architecture", Cisco Press: CCIE 1302 Professional Development ISBN13: 978- 1303 1-57870-181-0, July 2000. 1305 [PktSizeEquCC] Vasallo, P., "Variable Packet Size 1306 Equation-Based Congestion Control", 1307 ICSI Technical Report tr-00-008, 1308 2000, . 1312 [RED93] Floyd, S. and V. Jacobson, "Random 1313 Early Detection (RED) gateways for 1314 Congestion Avoidance", IEEE/ACM 1315 Transactions on Networking 1(4) 397-- 1316 413, August 1993, . 1320 [REDbias] Eddy, W. and M. Allman, "A Comparison 1321 of RED's Byte and Packet Modes", 1322 Computer Networks 42(3) 261--280, 1323 June 2003, . 1326 [REDbyte] De Cnodder, S., Elloumi, O., and K. 1327 Pauwels, "RED behavior with different 1328 packet sizes", Proc. 5th IEEE 1329 Symposium on Computers and 1330 Communications (ISCC) 793--799, 1331 July 2000, . 1334 [RFC2309] Braden, B., Clark, D., Crowcroft, J., 1335 Davie, B., Deering, S., Estrin, D., 1336 Floyd, S., Jacobson, V., Minshall, 1337 G., Partridge, C., Peterson, L., 1338 Ramakrishnan, K., Shenker, S., 1339 Wroclawski, J., and L. Zhang, 1340 "Recommendations on Queue Management 1341 and Congestion Avoidance in the 1342 Internet", RFC 2309, April 1998. 1344 [RFC2474] Nichols, K., Blake, S., Baker, F., 1345 and D. Black, "Definition of the 1346 Differentiated Services Field (DS 1347 Field) in the IPv4 and IPv6 Headers", 1348 RFC 2474, December 1998. 1350 [RFC3426] Floyd, S., "General Architectural and 1351 Policy Considerations", RFC 3426, 1352 November 2002. 1354 [RFC3550] Schulzrinne, H., Casner, S., 1355 Frederick, R., and V. Jacobson, "RTP: 1356 A Transport Protocol for Real-Time 1357 Applications", STD 64, RFC 3550, 1358 July 2003. 1360 [RFC3714] Floyd, S. and J. Kempf, "IAB Concerns 1361 Regarding Congestion Control for 1362 Voice Traffic in the Internet", 1363 RFC 3714, March 2004. 1365 [RFC4828] Floyd, S. and E. Kohler, "TCP 1366 Friendly Rate Control (TFRC): The 1367 Small-Packet (SP) Variant", RFC 4828, 1368 April 2007. 1370 [RFC5348] Floyd, S., Handley, M., Padhye, J., 1371 and J. Widmer, "TCP Friendly Rate 1372 Control (TFRC): Protocol 1373 Specification", RFC 5348, 1374 September 2008. 1376 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, 1377 S., and K. Ramakrishnan, "Adding 1378 Explicit Congestion Notification 1379 (ECN) Capability to TCP's SYN/ACK 1380 Packets", RFC 5562, June 2009. 1382 [RFC5670] Eardley, P., "Metering and Marking 1383 Behaviour of PCN-Nodes", RFC 5670, 1384 November 2009. 1386 [RFC5681] Allman, M., Paxson, V., and E. 1387 Blanton, "TCP Congestion Control", 1388 RFC 5681, September 2009. 1390 [RFC5690] Floyd, S., Arcia, A., Ros, D., and J. 1391 Iyengar, "Adding Acknowledgement 1392 Congestion Control to TCP", RFC 5690, 1393 February 2010. 1395 [RFC6077] Papadimitriou, D., Welzl, M., Scharf, 1396 M., and B. Briscoe, "Open Research 1397 Issues in Internet Congestion 1398 Control", RFC 6077, February 2011. 
[Rate_fair_Dis]  Briscoe, B., "Flow Rate Fairness: Dismantling a
           Religion", ACM CCR 37(2):63--74, April 2007.

[gentle_RED]  Floyd, S., "Recommendation on using the "gentle_"
           variant of RED", Web page, March 2000.

[pBox]     Floyd, S. and K. Fall, "Promoting the Use of End-to-End
           Congestion Control in the Internet", IEEE/ACM
           Transactions on Networking 7(4):458--472, August 1999.

[pktByteEmail]  Floyd, S., "RED: Discussions of Byte and Packet
           Modes", Web page, March 1997.

Appendix A. Survey of RED Implementation Status

This Appendix is informative, not normative.

In May 2007 a survey of 84 vendors was conducted to assess how
widely drop probability based on packet size has been implemented in
RED (Table 3). About 19% of those surveyed replied, giving a sample
size of 16. Although in most cases we do not have permission to
identify the respondents, we can say that those who responded
include most of the larger equipment vendors, covering a large
fraction of the market. The two who gave permission to be
identified were Cisco and Alcatel-Lucent. The others range across
the large network equipment vendors at L3 & L2, firewall vendors and
wireless equipment vendors, as well as large software businesses
with a small selection of networking products. All those who
responded confirmed that they had not implemented the variant of RED
with drop dependent on packet size (2 were fairly sure they had not,
but needed to check more thoroughly). At the time the survey was
conducted, Linux did not implement RED with packet-size bias of
drop, although we have not investigated a wider range of open source
code.

+-------------------------------+----------------+-----------------+
| Response                      | No. of vendors | %age of vendors |
+-------------------------------+----------------+-----------------+
| Not implemented               |       14       |       17%       |
| Not implemented (probably)    |        2       |        2%       |
| Implemented                   |        0       |        0%       |
| No response                   |       68       |       81%       |
| Total companies/orgs surveyed |       84       |      100%       |
+-------------------------------+----------------+-----------------+

Table 3: Vendor Survey on byte-mode drop variant of RED (lower drop
probability for small packets)

Where reasons were given, the extra complexity of packet-bias code
was cited most often, though one vendor had a more principled reason
for avoiding it--similar to the argument of this document.

Our survey was of vendor implementations, so we cannot be certain
about operator deployment. But we believe many queues in the
Internet are still tail-drop. The company of one of the co-authors
(BT) has widely deployed RED, but many tail-drop queues are bound to
still exist, particularly in access network equipment and on
middleboxes like firewalls, where RED is not always available.

Routers using a memory architecture based on fixed-size buffers with
borrowing may also still be prevalent in the Internet. As explained
in Section 4.2.1, these also provide a marginal (but legitimate)
bias towards small packets. So even though RED byte-mode drop is
not prevalent, it is likely there is still some bias towards small
packets in the Internet, due to tail drop and fixed-buffer
borrowing.

Appendix B. Sufficiency of Packet-Mode Drop

This Appendix is informative, not normative.
Here we check that packet-mode drop (or marking) in the network
gives sufficiently generic information for the transport layer to
use. We check against a 2x2 matrix of four scenarios that may occur
now or in the future (Table 4). The horizontal and vertical
dimensions have been chosen because each tests extremes of
sensitivity to packet size in the transport and in the network
respectively.

Note that this section does not consider byte-mode drop at all.
Having deprecated byte-mode drop, the goal here is to check that
packet-mode drop will be sufficient in all cases.

+-------------------------------+-----------------+-----------------+
| Transport                     | a) Independent  | b) Dependent on |
|                               | of packet size  | packet size of  |
| Network                       | of congestion   | congestion      |
|                               | notifications   | notifications   |
+-------------------------------+-----------------+-----------------+
| 1) Predominantly              | Scenario a1)    | Scenario b1)    |
|    bit-congestible network    |                 |                 |
| 2) Mix of bit-congestible and | Scenario a2)    | Scenario b2)    |
|    pkt-congestible network    |                 |                 |
+-------------------------------+-----------------+-----------------+

Table 4: Four Possible Congestion Scenarios

Appendix B.1 focuses on the horizontal dimension of Table 4,
checking that packet-mode drop (or marking) gives sufficient
information, whether or not the transport uses it--scenarios b) and
a) respectively.

Appendix B.2 focuses on the vertical dimension of Table 4, checking
that packet-mode drop gives sufficient information to the transport
whether resources in the network are bit-congestible or packet-
congestible (these terms are defined in Section 1.1).

Notation: To be concrete, we will compare two flows with different
packet sizes, s_1 and s_2. As an example, we will take s_1 = 60B =
480b and s_2 = 1500B = 12,000b.

A flow's bit rate, x [bps], is related to its packet rate, u [pps],
by

   x(t) = s.u(t).

In the bit-congestible case, path congestion will be denoted by p_b,
and in the packet-congestible case by p_p. When either case is
implied, the letter p alone will denote path congestion.

B.1. Packet-Size (In)Dependence in Transports

In all cases we consider a packet-mode drop queue that indicates
congestion by dropping (or marking) packets with probability p,
irrespective of packet size. We use an example value of loss
(marking) probability, p = 0.1%.

A transport like TCP [RFC5681] treats a congestion notification on
any packet, whatever its size, as one event. However, a network
with just the packet-mode drop algorithm does give more information,
if the transport chooses to use it. We will use Table 5 to
illustrate this.

We will set aside the last column until later. The columns labelled
"Flow 1" and "Flow 2" compare two flows consisting of 60B and 1500B
packets respectively. The body of the table considers two separate
cases, one where the flows have equal bit-rates and the other where
they have equal packet-rates. In both cases, the two flows fill a
96Mbps link. Therefore, in the equal bit-rate case they each have
half the bit-rate (48Mbps). With equal packet-rates, on the other
hand, flow 1 uses packets 25 times smaller, so it gets 25 times less
bit-rate--only 1/(1+25) of the link capacity (96Mbps/26 = 4Mbps
after rounding).
In contrast, flow 2 gets 25 times more bit-rate (92Mbps) in the
equal packet-rate case, because its packets are 25 times larger.
The packet rate shown for each flow can be derived from the bit-rate
by dividing by the packet size, as shown in the column labelled
"Formula".

Parameter                Formula      Flow 1   Flow 2   Combined
-----------------------  -----------  -------  -------  --------
Packet size              s/8             60B   1,500B     (Mix)
Packet size              s              480b  12,000b     (Mix)
Pkt loss probability     p              0.1%     0.1%      0.1%

EQUAL BIT-RATE CASE
Bit-rate                 x            48Mbps   48Mbps    96Mbps
Packet-rate              u = x/s     100kpps    4kpps   104kpps
Absolute pkt-loss-rate   p*u          100pps     4pps    104pps
Absolute bit-loss-rate   p*u*s        48kbps   48kbps    96kbps
Ratio of lost/sent pkts  p*u/u          0.1%     0.1%      0.1%
Ratio of lost/sent bits  p*u*s/(u*s)    0.1%     0.1%      0.1%

EQUAL PACKET-RATE CASE
Bit-rate                 x             4Mbps   92Mbps    96Mbps
Packet-rate              u = x/s       8kpps    8kpps    15kpps
Absolute pkt-loss-rate   p*u            8pps     8pps     15pps
Absolute bit-loss-rate   p*u*s         4kbps   92kbps    96kbps
Ratio of lost/sent pkts  p*u/u          0.1%     0.1%      0.1%
Ratio of lost/sent bits  p*u*s/(u*s)    0.1%     0.1%      0.1%

Table 5: Absolute Loss Rates and Loss Ratios for Flows of Small and
Large Packets and Both Combined

So far we have merely set up the scenarios. We now consider
congestion notification in these scenarios. Two TCP flows with the
same round trip time aim to equalise their packet-loss-rates over
time, that is, the number of packets lost in a second, which is the
packets per second (u) multiplied by the probability that each one
is dropped (p). Thus TCP converges on the "Equal packet-rate" case,
where both flows aim for the same "Absolute packet-loss-rate" (both
8pps in the table).

Packet-mode drop actually gives flows sufficient information to
measure their loss-rate in bits per second, if they choose, not just
in packets per second. Each flow can count the size of a lost or
marked packet and scale its rate-response in proportion (as TFRC-SP
does). The result is shown in the row entitled "Absolute bit-loss-
rate", where the bits lost in a second is the packets per second (u)
multiplied by the probability of losing a packet (p) multiplied by
the packet size (s). Such a packet-size-dependent algorithm would
try to remove any imbalance in bit-loss-rate, such as the wide
disparity in the "Equal packet-rate" case (4kbps vs. 92kbps); by
aiming for equal bit-loss-rates, it would drive both flows towards
the "Equal bit-rate" case (both 48kbps in this example).

The explanation so far has assumed that each flow consists of
packets of only one constant size. Nonetheless, it extends
naturally to flows with mixed packet sizes. In the right-most
column of Table 5 a flow of mixed-size packets is created simply by
considering flow 1 and flow 2 as a single aggregated flow. There is
no need for a flow to maintain an average packet size; it is only
necessary for the transport to scale its response to each congestion
indication by the size of each individual lost (or marked) packet.
Taking the "Equal packet-rate" case as an example, in one second
about 8 small packets and 8 large packets are lost (closer to 15
than 16 losses per second in total, due to rounding). If the
transport multiplies each loss by its size, in one second it
responds to about 8*480b and 8*12,000b of lost bits, which (using
the unrounded packet rates) adds up to 96,000 lost bits in a second.
This double-checks correctly, being the same as 0.1% of the total
bit-rate of 96Mbps. For completeness, the formula for the absolute
bit-loss-rate of the combined flow is p*(u1*s1 + u2*s2).
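The arithmetic of Table 5 is easy to reproduce. The following
minimal Python sketch (the variable names are ours, not from any
specification) computes the loss metrics for the equal packet-rate
case, including the combined flow:

   # Sketch: reproduce the "equal packet-rate" case of Table 5.
   LINK = 96e6          # link capacity [bps]
   p = 0.001            # loss (marking) probability at the queue
   s1, s2 = 480, 12000  # packet sizes [bits] (60B and 1500B)

   u = LINK / (s1 + s2)   # equal packet rate for both flows [pps]
   for s in (s1, s2):
       x = u * s          # bit-rate of this flow [bps]
       print(f"s={s:>5}b  x={x/1e6:5.1f}Mbps  "
             f"pkt-loss={p*u:4.1f}pps  bit-loss={p*u*s/1e3:5.1f}kbps")

   # Combined flow: scale the response to each loss by the lost
   # packet's size; the expected bit-loss-rate is p*(u1*s1 + u2*s2).
   print(f"combined bit-loss = {p*(u*s1 + u*s2):.0f} bps")  # 96000

Using the unrounded packet rate (about 7.7kpps), the bit-loss-rates
come out at about 3.7kbps and 92.3kbps, summing to exactly 96kbps,
i.e. 0.1% of the link.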
Incidentally, a transport will always measure the same loss
probability irrespective of whether it measures in packets or in
bytes. In other words, the ratio of lost to sent packets will be
the same as the ratio of lost to sent bytes. (This is why TCP's
bit-rate is still proportional to packet size even when byte-
counting is used, as recommended for TCP in [RFC5681], mainly for
orthogonal security reasons.) This is intuitively obvious by
comparing two example flows, one with 60B packets and the other with
1500B packets. If both flows pass through a queue with drop
probability 0.1%, each flow will lose 1 in 1,000 packets. In the
stream of 60B packets the ratio of bytes lost to sent will be 60B in
every 60,000B; in the stream of 1500B packets, the loss ratio will
be 1,500B out of 1,500,000B. So the transport will measure the same
ratio, 0.1%, whether it measures in packets or in bytes. This can
also be seen in Table 5, where the ratio of lost to sent packets and
the ratio of lost to sent bytes is 0.1% in all cases (recall that
the scenario was set up with p = 0.1%).

This discussion of how the ratio can be measured in packets or bytes
is only raised here to highlight that it is irrelevant to this memo!
Whether a transport depends on packet size or not depends on how
this ratio is used within the congestion control algorithm.

So far we have shown that packet-mode drop passes sufficient
information to the transport layer so that the transport can take
account of bit-congestion, by using the sizes of the packets that
indicate congestion. We have also shown that the transport can
choose not to take packet size into account if it wishes. We will
now consider whether the transport can know which to do.

B.2. Bit-Congestible and Packet-Congestible Indications

As a thought-experiment, imagine an idealised congestion
notification protocol that supports both bit-congestible and packet-
congestible resources. It would require at least two ECN flags, one
for each of bit-congestible and packet-congestible resources (see
the sketch after this list):

1. A packet-congestible resource trying to code congestion level p_p
   into a packet stream should mark the idealised 'packet
   congestion' field in each packet with probability p_p,
   irrespective of the packet's size. The transport should then
   take a packet with the packet congestion field marked to mean
   just one mark, irrespective of the packet size.

2. A bit-congestible resource trying to code time-varying byte-
   congestion level p_b into a packet stream should mark the 'byte
   congestion' field in each packet with probability p_b, again
   irrespective of the packet's size. Unlike before, the transport
   should take a packet with the byte congestion field marked to
   count as a mark on each byte in the packet.
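A minimal Python sketch of this thought-experiment follows. All
names are hypothetical, since the protocol is idealised and not
proposed for implementation:

   import random

   def mark(pkt_size_bytes, p_b, p_p):
       # Both idealised flags are set with a probability that is
       # independent of the packet's size.
       return {"byte_congestion": random.random() < p_b,
               "packet_congestion": random.random() < p_p}

   def transport_response(pkt_size_bytes, flags):
       # The transport weights the two flags differently: a packet
       # mark counts once; a byte mark counts once per byte.
       pkt_marks = 1 if flags["packet_congestion"] else 0
       byte_marks = pkt_size_bytes if flags["byte_congestion"] else 0
       return pkt_marks, byte_marks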
This hides a fundamental problem--much more fundamental than whether
we can magically create header space for yet another ECN flag, or
whether it would work while being deployed incrementally.
Distinguishing drop from delivery naturally provides just one
implicit bit of congestion indication information--the packet is
either dropped or it is not. It is hard to drop a packet in two
ways that are distinguishable remotely. This is a similar problem
to that of distinguishing wireless transmission losses from
congestive losses.

This problem would not be solved even if ECN were universally
deployed. A congestion notification protocol must survive a
transition from low levels of congestion to high. Signalling two
congestion states is feasible with explicit marking, but much harder
if packets are dropped. Also, it will not always be cost-effective
to implement AQM at every low-level resource, so drop will often
have to suffice.

We are not saying that two ECN fields will be needed (and we are not
saying that somehow a resource should be able to drop a packet in
one of two different ways so that the transport can distinguish
which sort of drop it was!). These two congestion notification
channels are a conceptual device to illustrate a dilemma we could
face in the future. Section 3 gives four good reasons why it would
be a bad idea to allow for packet size by biasing drop probability
in favour of small packets within the network. The impracticality
of our thought experiment shows that it will be hard to give
transports a practical way to know whether to take account of the
size of congestion indication packets or not.

Fortunately, this dilemma is not pressing, because by design most
equipment becomes bit-congested before its packet-processing becomes
congested (as already outlined in Section 1.1). Therefore
transports can be designed on the relatively sound assumption that a
congestion indication will usually imply bit-congestion.

Nonetheless, although the above idealised protocol is not intended
for implementation, we do want to emphasise that research is needed
to predict whether there are good reasons to believe that packet
congestion might become more common and, if so, to find a way to
somehow distinguish between bit and packet congestion [RFC3714].

Recently, the dual-resource queue (DRQ) proposal [DRQ] has been made
on the premise that, as network processors become more cost-
effective, per-packet operations will become more complex
(irrespective of whether more function in the network is desirable),
so that CPU congestion will become more common. DRQ is a proposed
modification to the RED algorithm that folds both bit congestion and
packet congestion into one signal (either loss or ECN).

Finally, we note one further complication. Strictly, packet-
congestible resources are often cycle-congestible: for instance, for
routing look-ups, load depends on the complexity of each look-up and
on whether the pattern of arrivals is amenable to caching or not.
This also reminds us that any solution must not require a forwarding
engine to use excessive processor cycles in order to decide how to
say it has no spare processor cycles.

Appendix C. Byte-mode Drop Complicates Policing Congestion Response

This section is informative, not normative.
There are two main classes of approach to policing congestion
response: i) policing at each bottleneck link, or ii) policing at
the edges of networks. Packet-mode drop in RED is compatible with
either, while byte-mode drop precludes edge policing.

The simplicity of an edge policer relies on one dropped or marked
packet being equivalent to another of the same size, without having
to know at which link the drop or mark occurred. However, the byte-
mode drop algorithm has to depend on the local MTU of the line--it
needs to use some concept of a 'normal' packet size. Therefore, one
dropped or marked packet from a byte-mode drop algorithm is not
necessarily equivalent to another from a different link. A policing
function local to the link can know the local MTU where the
congestion occurred, but a policer at the edge of the network
cannot, at least not without a lot of complexity.

The early research proposals for type (i) policing at a bottleneck
link [pBox] used byte-mode drop, then detected flows that
contributed disproportionately to the number of packets dropped.
However, with no extra complexity, later proposals used packet-mode
drop and looked for flows that contributed a disproportionate amount
of dropped bytes [CHOKe_Var_Pkt].

Work is progressing on the congestion exposure protocol (ConEx
[I-D.ietf-conex-concepts-uses]), which enables a type (ii) edge
policer located at a user's attachment point. The idea is to be
able to take an integrated view of the effect of all a user's
traffic on any link in the internetwork. However, byte-mode drop
would effectively preclude such edge policing, because of the MTU
issue above.

Indeed, making drop probability depend on the size of the packets
that bits happen to be divided into would simply encourage the bits
to be divided into smaller packets in order to confuse policing. In
contrast, as long as a dropped/marked packet is taken to mean that
all the bytes in the packet are dropped/marked, a policer can remain
robust against bits being re-divided into different size packets or
across different size flows [Rate_fair_Dis], as the sketch below
illustrates.
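A minimal Python sketch of this invariance (the names and packet
sizes are illustrative only): under size-independent marking, the
expected tally of marked bytes depends only on how many bytes the
user sends, not on how those bytes are packetised.

   import random

   def marked_bytes(packet_sizes, p):
       # Edge-policer accounting: every byte of a marked packet
       # counts against the user's congestion allowance.
       return sum(s for s in packet_sizes if random.random() < p)

   random.seed(1)
   N, p = 200000, 0.001
   big   = [1500] * N        # 300MB sent as 1500B packets
   small = [60] * (25 * N)   # the same 300MB as 60B packets
   # Both tallies converge on p * total_bytes = 300,000B, so
   # splitting the bytes into smaller packets gains nothing.
   print(marked_bytes(big, p), marked_bytes(small, p))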
Appendix D. Changes from Previous Versions

To be removed by the RFC Editor on publication.

Full incremental diffs between each version are available (courtesy
of the rfcdiff tool):

From -06 to -07:

* A mix-up with the corollaries and their naming in Sections 2.1 to
  2.3 fixed.

From -05 to -06:

* Primarily editorial fixes.

From -04 to -05:

* Changed from Informational to BCP and highlighted non-normative
  sections and appendices

* Removed language about consensus

* Added "Example Comparing Packet-Mode Drop and Byte-Mode Drop"

* Arranged "Motivating Arguments" into a more logical order and
  completely rewrote "Transport-Independent Network" & "Scaling
  Congestion Control with Packet Size" arguments. Removed "Why
  Now?"

* Clarified applicability of certain recommendations

* Shifted vendor survey to an Appendix

* Cut down "Outstanding Issues and Next Steps"

* Re-drafted the start of the conclusions to highlight the three
  distinct areas of concern

* Completely re-wrote appendices

* Editorial corrections throughout.

From -03 to -04:

* Reordered Sections 2 and 3, and some clarifications here and
  there based on feedback from Colin Perkins and Mirja Kuehlewind.

From -02 to -03:

* Structural changes:

  + Split off text at end of "Scaling Congestion Control with
    Packet Size" into new section "Transport-Independent Network"

  + Shifted "Recommendations" straight after "Motivating Arguments"
    and added "Conclusions" at end to reinforce Recommendations

  + Added more internal structure to Recommendations, so that
    recommendations specific to RED or to TCP are just corollaries
    of a more general recommendation, rather than being listed as a
    separate recommendation.

  + Renamed "State of the Art" as "Critical Survey of Existing
    Advice" and retitled a number of subsections with more
    descriptive titles.

  + Split end of "Congestion Coding: Summary of Status" into a new
    subsection called "RED Implementation Status".

  + Removed text that had been in the Appendix "Congestion
    Notification Definition: Further Justification".

* Reordered the intro text a little.

* Made it clearer when advice being reported is deprecated and when
  it is not.

* Described AQM as in network equipment, rather than saying "at the
  network layer" (to side-step controversy over whether functions
  like AQM are in the transport layer but in network equipment).

* Minor improvements to clarity throughout

From -01 to -02:

* Restructured the whole document for (hopefully) easier reading
  and clarity. The concrete recommendation, in RFC2119 language,
  is now in Section 8.

From -00 to -01:

* Minor clarifications throughout and updated references

From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00:

* Added note on relationship to existing RFCs

* Posed the question of whether packet-congestion could become
  common and deferred it to the IRTF ICCRG. Added ref to the
  dual-resource queue (DRQ) proposal.

* Changed PCN references from the PCN charter & architecture to the
  PCN marking behaviour draft most likely to imminently become the
  standards track WG item.

From -01 to -02:

* Abstract reorganised to align with clearer separation of issue in
  the memo.

* Introduction reorganised with motivating arguments removed to new
  Section 3.

* Clarified avoiding lock-out of large packets is not the main or
  only motivation for RED.

* Mentioned choice of drop or marking explicitly throughout, rather
  than trying to coin a word to mean either.

* Generalised the discussion throughout to any packet forwarding
  function on any network equipment, not just routers.

* Clarified the last point about why this is a good time to sort
  out this issue: because it will be hard / impossible to design
  new transports unless we decide whether the network or the
  transport is allowing for packet size.

* Added statement explaining the horizon of the memo is long term,
  but with short term expediency in mind.

* Added material on scaling congestion control with packet size
  (Section 3.4).

* Separated out issue of normalising TCP's bit rate from issue of
  preference to control packets (Section 3.2).
* Divided up Congestion Measurement section for clarity, including
  new material on fixed size packet buffers and buffer carving
  (Section 4.1.1 & Section 4.2.1) and on congestion measurement in
  wireless link technologies without queues (Section 4.1.2).

* Added section on 'Making Transports Robust against Control Packet
  Losses' (Section 4.2.3) with existing & new material included.

* Added tabulated results of vendor survey on byte-mode drop
  variant of RED (Table 3).

From -00 to -01:

* Clarified applicability to drop as well as ECN.

* Highlighted DoS vulnerability.

* Emphasised that drop-tail suffers from similar problems to
  byte-mode drop, so only byte-mode drop should be turned off, not
  RED itself.

* Clarified the original apparent motivations for recommending
  byte-mode drop included protecting SYNs and pure ACKs more than
  equalising the bit rates of TCPs with different segment sizes.
  Removed some conjectured motivations.

* Added support for updates to TCP in progress (ackcc &
  ecn-syn-ack).

* Updated survey results with newly arrived data.

* Pulled all recommendations together into the conclusions.

* Moved some detailed points into two additional appendices and a
  note.

* Considerable clarifications throughout.

* Updated references

Authors' Addresses

Bob Briscoe
BT
B54/77, Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK

Phone: +44 1473 645196
EMail: bob.briscoe@bt.com
URI: http://bobbriscoe.net/

Jukka Manner
Aalto University
Department of Communications and Networking (Comnet)
P.O. Box 13000
FIN-00076 Aalto
Finland

Phone: +358 9 470 22481
EMail: jukka.manner@aalto.fi
URI: http://www.netlab.tkk.fi/~jmanner/