Transport Area Working Group                                 B. Briscoe
Internet-Draft                                                       BT
Updates: 2309 (if approved)                                   J. Manner
Intended status: BCP                                   Aalto University
Expires: May 3, 2012                                   October 31, 2011

                Byte and Packet Congestion Notification
                  draft-ietf-tsvwg-byte-pkt-congest-05

Abstract

This memo concerns dropping or marking packets using active queue management (AQM) such as random early detection (RED) or pre-congestion notification (PCN).
We give three strong recommendations: (1) packet size should be taken into account when transports read and respond to congestion indications, (2) packet size should not be taken into account when network equipment creates congestion signals (marking, dropping), and therefore (3) the byte-mode packet drop variant of the RED AQM algorithm that drops fewer small packets should not be used. This memo updates RFC 2309 to deprecate deliberate preferential treatment of small packets in AQM algorithms.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on May 3, 2012.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

Table of Contents

   1. Introduction
      1.1. Terminology and Scoping
      1.2. Example Comparing Packet-Mode Drop and Byte-Mode Drop
   2. Recommendations
      2.1. Recommendation on Queue Measurement
      2.2. Recommendation on Encoding Congestion Notification
      2.3. Recommendation on Responding to Congestion
      2.4. Recommendation on Handling Congestion Indications when
           Splitting or Merging Packets
   3. Motivating Arguments
      3.1. Avoiding Perverse Incentives to (Ab)use Smaller Packets
      3.2. Small != Control
      3.3. Transport-Independent Network
      3.4. Scaling Congestion Control with Packet Size
      3.5. Implementation Efficiency
   4. A Survey and Critique of Past Advice
      4.1. Congestion Measurement Advice
           4.1.1. Fixed Size Packet Buffers
           4.1.2. Congestion Measurement without a Queue
      4.2. Congestion Notification Advice
           4.2.1. Network Bias when Encoding
           4.2.2. Transport Bias when Decoding
           4.2.3. Making Transports Robust against Control Packet
                  Losses
           4.2.4. Congestion Notification: Summary of Conflicting
                  Advice
   5. Outstanding Issues and Next Steps
      5.1. Bit-congestible Network
      5.2. Bit- & Packet-congestible Network
   6. Security Considerations
   7. Conclusions
   8. Acknowledgements
   9. Comments Solicited
   10. References
       10.1. Normative References
       10.2. Informative References
   Appendix A. Survey of RED Implementation Status
   Appendix B. Sufficiency of Packet-Mode Drop
      B.1. Packet-Size (In)Dependence in Transports
      B.2. Bit-Congestible and Packet-Congestible Indications
   Appendix C. Byte-mode Drop Complicates Policing Congestion Response
   Appendix D. Changes from Previous Versions

1. Introduction

This memo concerns how we should correctly scale congestion control functions with packet size for the long term. It also recognises that expediency may be necessary to deal with existing widely deployed protocols that don't live up to the long term goal.

When notifying congestion, the problem of how (and whether) to take packet sizes into account has exercised the minds of researchers and practitioners for as long as active queue management (AQM) has been discussed. Indeed, one reason AQM was originally introduced was to reduce the lock-out effects that small packets can have on large packets in drop-tail queues. This memo aims to state the principles we should be using and to outline how these principles will affect future protocol design, taking into account the existing deployments we have already.

The question of whether to take into account packet size arises at three stages in the congestion notification process:

Measuring congestion: When a congested resource measures locally how congested it is, should it measure its queue length in bytes or packets?

Encoding congestion notification into the wire protocol: When a congested network resource notifies its level of congestion, should it drop / mark each packet dependent on the byte-size of the particular packet in question?

Decoding congestion notification from the wire protocol: When a transport interprets the notification in order to decide how much to respond to congestion, should it take into account the byte-size of each missing or marked packet?

Consensus has emerged over the years concerning the first stage: whether queues are measured in bytes or packets, termed byte-mode queue measurement or packet-mode queue measurement. Section 2.1 of this memo records this consensus in the RFC Series. In summary, the choice depends solely on whether the resource is congested by bytes or by packets.

The controversy is mainly around the last two stages: whether to allow for the size of the specific packet notifying congestion i) when the network encodes or ii) when the transport decodes the congestion notification.
Currently, the RFC series is silent on this matter other than a paper trail of advice referenced from [RFC2309], which conditionally recommends byte-mode (packet-size dependent) drop [pktByteEmail]. Reducing drop of small packets certainly has some tempting advantages: i) it drops fewer control packets, which tend to be small, and ii) it makes TCP's bit-rate less dependent on packet size. However, there are ways of addressing these issues at the transport layer, rather than reverse engineering network forwarding to fix the problems.

This memo updates [RFC2309] to deprecate deliberate preferential treatment of small packets in AQM algorithms. It recommends that (1) packet size should be taken into account when transports read congestion indications, but (2) not when network equipment writes them.

In particular this means that the byte-mode packet drop variant of Random Early Detection (RED) should not be used to drop fewer small packets, because that creates a perverse incentive for transports to use tiny segments, consequently also opening up a DoS vulnerability. Fortunately, none of the RED implementers who responded to our admittedly limited survey (Section 4.2.4) had followed the earlier advice to use byte-mode drop, so the position this memo argues for seems already to exist in implementations.

However, at the transport layer, TCP congestion control is a widely deployed protocol that doesn't scale with packet size. To date this hasn't been a significant problem because most TCP implementations have been used with similar packet sizes. But, as we design new congestion control mechanisms, the current recommendation is that we should build in scaling with packet size rather than assuming we should follow TCP's example.

This memo continues as follows. First it discusses terminology and scoping. Section 2 gives the concrete formal recommendations, followed by motivating arguments in Section 3. We then critically survey the advice given previously in the RFC series and the research literature (Section 4), referring to an assessment of whether or not this advice has been followed in production networks (Appendix A). To wrap up, outstanding issues are discussed that will need resolution both to inform future protocol designs and to handle legacy (Section 5). Then security issues are collected together in Section 6 before conclusions are drawn in Section 7. The interested reader can find discussion of more detailed issues on the theme of byte vs. packet in the appendices.

This memo intentionally includes a non-negligible amount of material on the subject. For the busy reader, Section 2 summarises the recommendations for the Internet community.

1.1. Terminology and Scoping

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

Congestion Notification: Congestion notification is a changing signal that aims to communicate the probability that the network resource(s) will not be able to forward the level of traffic load offered (or that there is an impending risk that they will not be able to).
The `impending risk' qualifier is added because AQM systems (e.g. RED, PCN [RFC5670]) set a virtual limit smaller than the actual limit to the resource, then notify when this virtual limit is exceeded in order to avoid uncontrolled congestion of the actual capacity.

Congestion notification communicates a real number bounded by the range [0,1]. This ties in with the most well-understood measure of congestion notification: drop probability.

Explicit and Implicit Notification: The byte vs. packet dilemma concerns congestion notification irrespective of whether it is signalled implicitly by drop or using explicit congestion notification (ECN [RFC3168] or PCN [RFC5670]). Throughout this document, unless clear from the context, the term marking will be used to mean notifying congestion explicitly, while congestion notification will be used to mean notifying congestion either implicitly by drop or explicitly by marking.

Bit-congestible vs. Packet-congestible: If the load on a resource depends on the rate at which packets arrive, it is called packet-congestible. If the load depends on the rate at which bits arrive, it is called bit-congestible.

Examples of packet-congestible resources are route look-up engines and firewalls, because load depends on how many packet headers they have to process. Examples of bit-congestible resources are transmission links, radio power and most buffer memory, because the load depends on how many bits they have to transmit or store. Some machine architectures use fixed size packet buffers, so buffer memory in these cases is packet-congestible (see Section 4.1.1).

Currently a design goal of network processing equipment such as routers and firewalls is to keep packet processing uncongested even under worst case packet rates with runs of minimum size packets. Therefore, packet-congestion is currently rare [RFC6077; S.3.3], but there is no guarantee that it will not become more common in future.

Note that information is generally processed or transmitted with a minimum granularity greater than a bit (e.g. octets). The appropriate granularity for the resource in question should be used, but for the sake of brevity we will talk in terms of bytes in this memo.

Coarser Granularity: Resources may be congestible at higher levels of granularity than bits or packets; for instance, stateful firewalls are flow-congestible and call-servers are session-congestible. This memo focuses on congestion of connectionless resources, but the same principles may be applicable for congestion notification protocols controlling per-flow and per-session processing or state.

RED Terminology: In RED, whether to use packets or bytes when measuring queues is called respectively "packet-mode queue measurement" or "byte-mode queue measurement". And whether the probability of dropping a particular packet is independent of or dependent on its byte-size is called respectively "packet-mode drop" or "byte-mode drop". The terms byte-mode and packet-mode should not be used without specifying whether they apply to queue measurement or to drop.

1.2. Example Comparing Packet-Mode Drop and Byte-Mode Drop

A central question addressed by this document is whether to recommend RED's packet-mode drop and to deprecate byte-mode drop. Table 1 compares how packet-mode and byte-mode drop affect two flows of different size packets. For each it gives the expected number of packets and of bits dropped in one second. Each example flow runs at the same bit-rate of 48Mbps, but one is broken up into small 60 byte packets and the other into large 1500 byte packets.

To keep up the same bit-rate, in one second there are 25 times more small packets because they are 25 times smaller. As can be seen from the table, the packet rate is 100,000 small packets versus 4,000 large packets per second (pps).

   Parameter             Formula        Small packets  Large packets
   --------------------  -------------  -------------  -------------
   Packet size           s/8            60B            1,500B
   Packet size           s              480b           12,000b
   Bit-rate              x              48Mbps         48Mbps
   Packet-rate           u = x/s        100kpps        4kpps

   Packet-mode Drop
   Pkt loss probability  p              0.1%           0.1%
   Pkt loss-rate         p*u            100pps         4pps
   Bit loss-rate         p*u*s          48kbps         48kbps

   Byte-mode Drop        MTU, M=12,000b
   Pkt loss probability  b = p*s/M      0.004%         0.1%
   Pkt loss-rate         b*u            4pps           4pps
   Bit loss-rate         b*u*s          1.92kbps       48kbps

       Table 1: Example Comparing Packet-mode and Byte-mode Drop

For packet-mode drop, we illustrate the effect of a drop probability of 0.1%, which the algorithm applies to all packets irrespective of size. Because there are 25 times more small packets in one second, it naturally drops 25 times more small packets, that is 100 small packets but only 4 large packets. But if we count how many bits it drops, there are 48,000 bits in 100 small packets and 48,000 bits in 4 large packets--the same number of bits of small packets as large.

The packet-mode drop algorithm drops any bit with the same probability whether the bit is in a small or a large packet.

For byte-mode drop, again we use an example drop probability of 0.1%, but only for maximum size packets (assuming the link MTU is 1,500B or 12,000b). The byte-mode algorithm reduces the drop probability of smaller packets in proportion to their size, making the probability that it drops a small packet 25 times smaller, at 0.004%. But there are 25 times more small packets, so dropping them with 25 times lower probability results in dropping the same number of packets: 4 drops in both cases. The 4 small dropped packets contain 25 times fewer bits than the 4 large dropped packets: 1,920 compared to 48,000.

The byte-mode drop algorithm drops any bit with a probability proportionate to the size of the packet it is in.
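The figures in Table 1 can be reproduced with a few lines of arithmetic. The following Python sketch is purely illustrative (it is not part of any AQM specification); the variable names mirror the formulae in the table.

   # Illustrative sketch reproducing Table 1 (not from any spec).
   # Assumed scenario from Section 1.2: two 48Mbps flows, 60B and
   # 1500B packets, drop probability p = 0.1% for a full-MTU packet.

   x = 48e6          # bit-rate of each flow (bps)
   p = 0.001         # drop probability applied to an MTU-sized packet
   M = 12000         # MTU in bits (1,500B)

   for size_bytes in (60, 1500):
       s = size_bytes * 8          # packet size in bits
       u = x / s                   # packet rate (pps)

       # Packet-mode drop: probability independent of packet size.
       pkt_loss_rate = p * u       # packets dropped per second
       bit_loss_rate = p * u * s   # bits dropped per second

       # Byte-mode drop: probability scaled down with packet size.
       b = p * s / M
       print(size_bytes, u, pkt_loss_rate, bit_loss_rate,
             b * u, b * u * s)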
2. Recommendations

2.1. Recommendation on Queue Measurement

Queue length is usually the most correct and simplest way to measure congestion of a resource. To avoid the pathological effects of drop tail, an AQM function can then be used to transform queue length into the probability of dropping or marking a packet (e.g. RED's piecewise linear function between thresholds).

If the resource is bit-congestible, the implementation SHOULD measure the length of the queue in bytes. If the resource is packet-congestible, the implementation SHOULD measure the length of the queue in packets. No other choice makes sense, because the number of packets waiting in the queue isn't relevant if the resource gets congested by bytes, and vice versa.

Corollaries:

1. A RED implementation SHOULD use byte-mode queue measurement for measuring the congestion of bit-congestible resources and packet-mode queue measurement for packet-congestible resources.

2. An implementation SHOULD NOT make it possible to configure the way a queue measures itself, because whether a queue is bit-congestible or packet-congestible is an inherent property of the queue.

The recommended approach in less straightforward scenarios, such as fixed size buffers, and resources without a queue, is discussed in Section 4.1.
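To make the shape of such an AQM function concrete, here is an illustrative Python sketch of RED's piecewise-linear mapping from average queue length to drop/mark probability. It is a sketch only, not a complete RED implementation; the thresholds would be configured in bytes for a bit-congestible resource, or in packets for a packet-congestible one.

   # Illustrative sketch of RED's piecewise-linear AQM function.
   # avg_q, min_th and max_th are all in the same unit: bytes for a
   # bit-congestible resource, packets for a packet-congestible one.

   def red_drop_probability(avg_q, min_th, max_th, max_p):
       if avg_q < min_th:
           return 0.0           # no congestion signalled
       if avg_q >= max_th:
           return 1.0           # queue too long: drop/mark everything
       # Linear ramp from 0 to max_p between the two thresholds.
       return max_p * (avg_q - min_th) / (max_th - min_th)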
2.2. Recommendation on Encoding Congestion Notification

When encoding congestion notification (e.g. by drop, ECN or PCN), a network device SHOULD treat all packets equally, regardless of their size. In other words, the probability that network equipment drops or marks a particular packet to notify congestion SHOULD NOT depend on the size of the packet in question. As the example in Section 1.2 illustrates, to drop any bit with probability 0.1% it is only necessary to drop every packet with probability 0.1%, without regard to the size of each packet.

This approach ensures the network layer offers sufficient congestion information for all known and future transport protocols and also ensures no perverse incentives are created that would encourage transports to use inappropriately small packet sizes.

Corollaries:

1. AQM algorithms such as RED SHOULD NOT use byte-mode drop, which deflates RED's drop probability for smaller packet sizes. RED's byte-mode drop has no enduring advantages. It is more complex, it creates the perverse incentive to fragment segments into tiny pieces and it reopens the vulnerability to floods of small packets that drop-tail queues suffered from and that AQM was designed to remove.

2. If a vendor has implemented byte-mode drop, and an operator has turned it on, it is RECOMMENDED to turn it off. Note that RED as a whole SHOULD NOT be turned off, as without it, a drop tail queue also biases against large packets. But note also that turning off byte-mode drop may alter the relative performance of applications using different packet sizes, so it would be advisable to establish the implications before turning it off.

NOTE WELL that RED's byte-mode queue drop is completely orthogonal to byte-mode queue measurement and should not be confused with it. If a RED implementation has a byte-mode but does not specify what sort of byte-mode, it is most probably byte-mode queue measurement, which is fine. However, if in doubt, the vendor should be consulted.

A survey (Appendix A) showed that there appears to be little, if any, installed base of the byte-mode drop variant of RED. This suggests that deprecating byte-mode drop will have little, if any, incremental deployment impact.

2.3. Recommendation on Responding to Congestion

When a transport detects that a packet has been lost or congestion marked, it SHOULD consider the strength of the congestion indication as proportionate to the size in octets (bytes) of the missing or marked packet.

In other words, when a packet indicates congestion (by being lost or marked) it can be considered conceptually as if there is a congestion indication on every octet of the packet, not just one indication per packet.

Therefore, the IETF transport area should continue its programme of:

o updating host-based congestion control protocols to take account of packet size;

o making transports less sensitive to losing control packets like SYNs and pure ACKs.

Corollaries:

1. If two TCP flows with different packet sizes are required to run at equal bit rates under the same path conditions, this should be done by altering TCP (Section 4.2.2), not network equipment (the latter affects other transports besides TCP).

2. If it is desired to improve TCP performance by reducing the chance that a SYN or a pure ACK will be dropped, this should be done by modifying TCP (Section 4.2.3), not network equipment.

2.4. Recommendation on Handling Congestion Indications when Splitting or Merging Packets

Packets carrying congestion indications may be split or merged in some circumstances (e.g. at an RTCP transcoder or during IP fragment reassembly). Splitting and merging only make sense in the context of ECN, not loss.

The general rule to follow is that the number of octets in packets with congestion indications SHOULD be equivalent before and after merging or splitting. This is based on the principle used above: that an indication of congestion on a packet can be considered as an indication of congestion on each octet of the packet.

The above rule is not phrased with the word "MUST" to allow the following exception. There are cases where pre-existing protocols were not designed to conserve congestion marked octets (e.g. IP fragment reassembly [RFC3168] or loss statistics in RTCP receiver reports [RFC3550] before ECN was added [I-D.ietf-avtcore-ecn-for-rtp]). When any such protocol is updated, it SHOULD comply with the above rule to conserve marked octets. However, the rule may be relaxed if it would otherwise become too complex to interoperate with pre-existing implementations of the protocol.

One can think of a splitting or merging process as if all the incoming congestion-marked octets increment a counter and all the outgoing marked octets decrement the same counter. In order to ensure that congestion indications remain timely, even the smallest positive remainder in the conceptual counter should trigger the next outgoing packet to be marked (causing the counter to go negative).
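The conceptual counter in the last paragraph is simple enough to sketch directly. The following illustrative Python fragment (a sketch, not part of any protocol specification) conserves marked octets across an ECN-capable splitter/merger; the packet interface is assumed to expose a size in octets and a congestion-experienced (CE) flag.

   # Illustrative sketch of the marked-octet conservation rule in
   # Section 2.4 (not from any specification).

   class MarkConservingRepacketizer:
       def __init__(self):
           self.marked_octets = 0   # the conceptual counter

       def packet_in(self, size, ce_marked):
           # Count every octet of every arriving CE-marked packet.
           if ce_marked:
               self.marked_octets += size

       def packet_out(self, size):
           # Mark the outgoing packet if any marked octets remain;
           # even the smallest positive remainder triggers a mark
           # (the counter may go negative), keeping indications
           # timely.
           if self.marked_octets > 0:
               self.marked_octets -= size
               return True          # send with CE mark
           return False             # send unmarked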
3. Motivating Arguments

In this section, we justify the recommendations given in the previous section.

3.1. Avoiding Perverse Incentives to (Ab)use Smaller Packets

Increasingly, it is being recognised that a protocol design must take care not to cause unintended consequences by giving the parties in the protocol exchange perverse incentives [Evol_cc][RFC3426]. Given there are many good reasons why larger path max transmission units (PMTUs) would help solve a number of scaling issues, we do not want to create any bias against large packets that is greater than their true cost.

Imagine a scenario where the same bit rate contributes the same to the bit-congestion of a link irrespective of whether it is sent as a few larger packets or many smaller packets. A protocol design that caused larger packets to be more likely to be dropped than smaller ones would be dangerous in this case:

Malicious transports: A queue that gives an advantage to small packets can be used to amplify the force of a flooding attack. By sending a flood of small packets, the attacker can get the queue to discard more traffic in large packets, allowing more attack traffic to get through to cause further damage. Such a queue allows attack traffic to have a disproportionately large effect on regular traffic without the attacker having to do much work.

Non-malicious transports: Even if a transport designer is not actually malicious, if it is noticed over time that small packets tend to go faster, designers will act in their own interest and use smaller packets. Queues that give an advantage to small packets create an evolutionary pressure for transports to send at the same bit-rate but break their data stream down into tiny segments to reduce their drop rate. Encouraging a high volume of tiny packets might in turn unnecessarily overload a completely unrelated part of the system, perhaps one more limited by header-processing than bandwidth.

Imagine two unresponsive flows arriving at a bit-congestible transmission link, each with the same bit rate, say 1Mbps, but one consisting of 1500B packets and the other of 60B packets, which are 25x smaller. Consider a scenario where gentle RED [gentle_RED] is used, along with the variant of RED we advise against, i.e. where the RED algorithm is configured to adjust the drop probability of packets in proportion to each packet's size (byte-mode packet drop). In this case, RED aims to drop 25x more of the larger packets than the smaller ones. Thus, for example, if RED drops 25% of the larger packets, it will aim to drop 1% of the smaller packets (but in practice it may drop more as congestion increases [RFC4828; Appx B.4]). Even though both flows arrive with the same bit rate, the bit rate the RED queue aims to pass to the line will be 750kbps for the flow of larger packets but 990kbps for the smaller packets (because of rate variations it will actually be a little less than this target).

Note that, although the byte-mode drop variant of RED amplifies small packet attacks, drop-tail queues amplify small packet attacks even more (see Security Considerations in Section 6). Wherever possible neither should be used.
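The arithmetic behind the 750kbps and 990kbps figures above can be checked with a tiny illustrative script (a sketch under the stated assumptions, not a simulation of a real RED queue):

   # Sketch checking the byte-mode drop example above: two 1Mbps
   # unresponsive flows; byte-mode drop of 25% for 1500B packets.

   x = 1e6                        # arrival bit-rate of each flow (bps)
   p_large = 0.25                 # byte-mode drop probability at 1500B

   for size in (1500, 60):
       p = p_large * size / 1500  # drop probability scaled by size
       print(size, "B:", x * (1 - p) / 1e3, "kbps passed to the line")

   # Prints 750.0 kbps for the 1500B flow and 990.0 kbps for the
   # 60B flow.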
3.2. Small != Control

It is tempting to drop small packets with lower probability in order to improve performance, because many control packets are small (TCP SYNs & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc), and dropping fewer control packets considerably improves performance. However, we must not give control packets preference purely by virtue of their smallness, otherwise it is too easy for any data source to get the same preferential treatment simply by sending data in smaller packets. Again, we should not create perverse incentives to favour small packets when what we intend is to favour control packets.

Just because many control packets are small does not mean all small packets are control packets.

So, rather than fix these problems in the network, we argue that the transport should be made more robust against losses of control packets (see 'Making Transports Robust against Control Packet Losses' in Section 4.2.3).

3.3. Transport-Independent Network

TCP congestion control ensures that flows competing for the same resource each maintain the same number of segments in flight, irrespective of segment size. So under similar conditions, flows with different segment sizes will get different bit-rates.
One motivation for the network biasing congestion notification by packet size is to counter this effect and try to equalise the bit-rates of flows with different packet sizes. However, in order to do this, the queuing algorithm has to make assumptions about the transport, which become embedded in the network. Specifically:

o The queuing algorithm has to assume how aggressively the transport will respond to congestion (see Section 4.2.4). If the network assumes the transport responds as aggressively as TCP NewReno, it will be wrong for Compound TCP and differently wrong for Cubic TCP, etc. To achieve equal bit-rates, each transport then has to guess what assumption the network made, and work out how to replace this assumed aggressiveness with its own aggressiveness.

o Also, if the network biases congestion notification by packet size it has to assume a baseline packet size--all proposed algorithms use the local MTU. Then transports have to guess which link was congested and what its local MTU was, in order to know how to tailor their congestion response to that link.

Even though reducing the drop probability of small packets (e.g. RED's byte-mode drop) helps ensure TCP flows with different packet sizes will achieve similar bit rates, we argue this correction should be made to any future transport protocols based on TCP, not to the network in order to fix one transport, no matter how predominant it is. Effectively, favouring small packets is reverse engineering of network equipment around one particular transport protocol (TCP), contrary to the excellent advice in [RFC3426], which asks designers to question "Why are you proposing a solution at this layer of the protocol stack, rather than at another layer?"

In contrast, if the network never takes account of packet size, the transport can be certain it will never need to guess any assumptions the network has made. And the network passes two pieces of information to the transport that are sufficient in all cases: i) congestion notification on the packet and ii) the size of the packet. Both are available for the transport to combine (by taking account of packet size when responding to congestion) or not. Appendix B checks that these two pieces of information are sufficient for all relevant scenarios.

When the network does not take account of packet size, it allows transport protocols to choose whether to take account of packet size or not. However, if the network were to bias congestion notification by packet size, transport protocols would have no choice; those that did not take account of packet size themselves would unwittingly become dependent on packet size, and those that already took account of packet size would end up taking account of it twice.

3.4. Scaling Congestion Control with Packet Size

Having so far justified only our recommendations for the network, this section focuses on the host. We construct a scaling argument to justify the recommendation that a host should respond to a dropped or marked packet in proportion to its size, not just as a single congestion event.

The argument assumes that we have already sufficiently justified our recommendation that the network should not take account of packet size.

Also, we assume bit-congestible links are the predominant source of congestion.
As the Internet stands, it is hard if not impossible to know whether congestion notification is from a bit-congestible or a packet-congestible resource (see Appendix B.2), so we have to assume the most prevalent case (see Section 1.1). If this assumption is wrong, and particular congestion indications are actually due to overload of packet-processing, there is no issue of safety at stake. Any congestion control that triggers a multiplicative decrease in response to a congestion indication will bring packet processing back to its operating point just as quickly. The only issue at stake is that the resource could be utilised more efficiently if packet-congestion could be separately identified.

Imagine a bit-congestible link shared by many flows, so that each busy period tends to cause packets to be lost from different flows. Consider further two sources that have the same data rate but break the load into large packets in one application (A) and small packets in the other (B). Of course, because the load is the same, there will be proportionately more packets in the small packet flow (B).

If a congestion control scales with packet size, it should respond in the same way to the same congestion notification, irrespective of the size of the packets that the bytes causing congestion happen to be broken down into.

A bit-congestible queue suffering congestion has to drop or mark the same excess bytes whether they are in a few large packets (A) or many small packets (B). So for the same amount of congestion overload, the same number of bytes has to be shed to get the load back to its operating point. But, of course, for smaller packets (B) more packets will have to be discarded to shed the same bytes.

If both transports interpret each drop/mark as a single loss event irrespective of the size of the packet dropped, the flow of smaller packets (B) will respond more times to the same congestion. On the other hand, if a transport responds proportionately less when smaller packets are dropped/marked, overall it will be able to respond the same to the same amount of congestion.

Therefore, for a congestion control to scale with packet size it should respond to dropped or marked bytes (as TFRC-SP [RFC4828] effectively does), instead of dropped or marked packets (as TCP does).

For the avoidance of doubt, this is not a recommendation that TCP should be changed so that it scales with packet size. It is a recommendation that any future transport protocol proposal should respond to dropped or marked bytes if it wishes to claim that it is scalable.

3.5. Implementation Efficiency

Allowing for packet size at the transport rather than in the network ensures that neither the network nor the transport needs to do a multiply operation--multiplication by packet size is effectively achieved as a repeated add when the transport adds to its count of marked bytes as each congestion event is fed to it. This isn't a principled reason in itself, but it is a happy consequence of the other principled reasons.
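As an illustration of the last two subsections, the fragment below sketches a transport that accumulates a count of marked bytes (a repeated add, with no multiply needed in the network or in the transport's fast path). It is a hypothetical sketch of the scaling principle only, not a proposal for any specific congestion control; the class and its 0.5 decrease factor are our own illustrative choices.

   # Hypothetical sketch of a byte-based congestion response
   # (illustrating Sections 3.4 and 3.5; not a real protocol).

   class ByteCountingTransport:
       def __init__(self):
           self.marked_bytes = 0    # congestion seen this round trip
           self.sent_bytes = 0      # traffic sent this round trip

       def on_packet_sent(self, size):
           self.sent_bytes += size

       def on_congestion_indication(self, size):
           # Repeated add: each lost/marked packet contributes its
           # whole byte-size, so a small packet counts for less than
           # a large one.
           self.marked_bytes += size

       def end_of_round_trip(self, cwnd_bytes):
           # Decrease scaled by the fraction of bytes (not packets)
           # that indicated congestion.
           fraction = self.marked_bytes / max(self.sent_bytes, 1)
           cwnd_bytes *= (1 - 0.5 * fraction)
           self.marked_bytes = self.sent_bytes = 0
           return cwnd_bytes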
4. A Survey and Critique of Past Advice

This section is informative, not normative.

The original 1993 paper on RED [RED93] proposed two options for the RED active queue management algorithm: packet mode and byte mode. Packet mode measured the queue length in packets and dropped (or marked) individual packets with a probability independent of their size. Byte mode measured the queue length in bytes and marked an individual packet with probability in proportion to its size (relative to the maximum packet size). In the paper's outline of further work, it was stated that no recommendation had been made on whether the queue size should be measured in bytes or packets, but noted that the difference could be significant.

When RED was recommended for general deployment in 1998 [RFC2309], the two modes were mentioned, implying the choice between them was a question of performance, and referring to a 1997 email [pktByteEmail] for advice on tuning. A later addendum to this email introduced the insight that there are in fact two orthogonal choices:

o whether to measure queue length in bytes or packets (Section 4.1);

o whether the drop probability of an individual packet should depend on its own size (Section 4.2).

The rest of this section is structured accordingly.

4.1. Congestion Measurement Advice

The choice of which metric to use to measure queue length was left open in RFC2309. It is now well understood that queues for bit-congestible resources should be measured in bytes, and queues for packet-congestible resources should be measured in packets [pktByteEmail].

Some modern queue implementations give a choice for setting RED's thresholds in byte-mode or packet-mode. This may merely be an administrator-interface preference, not altering how the queue itself is measured, but on some hardware it does actually change the way the queue is measured. Whether a resource is bit-congestible or packet-congestible is a property of the resource, so an administrator should never need to, or be able to, configure the way a queue measures itself.

NOTE: Congestion in some legacy bit-congestible buffers is only measured in packets, not bytes. In such cases, the operator has to set the thresholds mindful of a typical mix of packet sizes. Any AQM algorithm on such a buffer will be oversensitive to high proportions of small packets, e.g. a DoS attack, and undersensitive to high proportions of large packets. However, there is no need to make allowances for the possibility of such legacy in future protocol design. This is safe because any undersensitivity during unusual traffic mixes cannot lead to congestion collapse, given the buffer will eventually revert to tail drop, discarding proportionately more large packets.

4.1.1. Fixed Size Packet Buffers

The question of whether to measure queues in bytes or packets seems to be well understood. However, measuring congestion is not straightforward when the resource is bit-congestible but the queue is packet-congestible, or vice versa. This section outlines the approach to take. There is no controversy over what should be done; one just needs to be expert in probability to work it out. And, even knowing what should be done, it is not always easy to find a practical algorithm to implement it.

Some, mostly older, queuing hardware sets aside fixed sized buffers in which to store each packet in the queue. Also, with some hardware, any fixed sized buffers not completely filled by a packet are padded when transmitted to the wire.
If we imagine a theoretical forwarding system with both queuing and transmission in fixed, MTU-sized units, it should clearly be treated as packet-congestible, because the queue length in packets would be a good model of congestion of the lower layer link.

If we now imagine a hybrid forwarding system with transmission delay largely dependent on the byte-size of packets but buffers of one MTU per packet, it would strictly require a more complex algorithm to determine the probability of congestion. It should be treated as two resources in sequence, where the sum of the byte-sizes of the packets within each packet buffer models congestion of the line while the length of the queue in packets models congestion of the queue. Then the probability of congesting the forwarding buffer would be a conditional probability--conditional on the previously calculated probability of congesting the line.

In systems that use fixed size buffers, it is unusual for all the buffers used by an interface to be the same size. Typically pools of different sized buffers are provided (Cisco uses the term 'buffer carving' for the process of dividing up memory into these pools [IOSArch]). Usually, if the pool of small buffers is exhausted, arriving small packets can borrow space in the pool of large buffers, but not vice versa. However, it is easier to work out what should be done if we temporarily set aside the possibility of such borrowing. Then, with fixed pools of buffers for different sized packets and no borrowing, the size of each pool and the current queue length in each pool would both be measured in packets. So an AQM algorithm would have to maintain the queue length for each pool, and judge whether to drop/mark a packet of a particular size by looking at the pool for packets of that size and using the length (in packets) of its queue.

We now return to the issue we temporarily set aside: small packets borrowing space in larger buffers. In this case, the only difference is that the pools for smaller packets have a maximum queue size that includes all the pools for larger packets. And every time a packet takes a larger buffer, the current queue size has to be incremented for all queues in the pools of buffers less than or equal to the buffer size used (a sketch of this accounting follows below).

We will return to borrowing of fixed sized buffers when we discuss biasing the drop/marking probability of a specific packet because of its size in Section 4.2.1. But here we can give at least one simple rule for how to measure the length of queues of fixed buffers: no matter how complicated the scheme is, ultimately any fixed buffer system will need to measure its queue length in packets, not bytes.
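The borrowing-aware accounting described above can be sketched as follows. This illustrative Python fragment (with hypothetical pool sizes and counts) simply increments the packet count of every pool whose buffer size is less than or equal to the buffer actually used:

   # Illustrative sketch of queue measurement with buffer borrowing
   # (Section 4.1.1).  Pool sizes and counts are hypothetical.

   POOL_SIZES = [128, 512, 1500]            # buffer sizes in bytes
   free = {128: 1000, 512: 500, 1500: 250}  # free buffers per pool
   queue_len = {size: 0 for size in POOL_SIZES}  # lengths in packets

   def enqueue(pkt_size):
       # A packet takes the smallest free buffer it fits in; if that
       # pool is exhausted it borrows from the next larger pool.
       for buf_size in POOL_SIZES:
           if pkt_size <= buf_size and free[buf_size] > 0:
               free[buf_size] -= 1
               # The borrowing rule: increment the queue length of
               # every pool of buffers less than or equal to the
               # buffer size actually used.
               for size in POOL_SIZES:
                   if size <= buf_size:
                       queue_len[size] += 1
               return True
       return False                         # no buffer free: tail drop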
4.1.2. Congestion Measurement without a Queue

AQM algorithms are nearly always described assuming there is a queue for a congested resource and that the algorithm can use the queue length to determine the probability that it will drop or mark each packet. But not all congested resources lead to queues. For instance, wireless spectrum is usually regarded as bit-congestible (for a given coding scheme), but wireless link protocols do not always maintain a queue that depends on spectrum interference. Similarly, power-limited resources are also usually bit-congestible if energy is primarily required for transmission rather than header processing, but it is rare for a link protocol to build a queue as it approaches maximum power.

Nonetheless, AQM algorithms do not require a queue in order to work. For instance, spectrum congestion can be modelled by signal quality using the target bit-energy-to-noise-density ratio. And, to model radio power exhaustion, transmission power levels can be measured and compared to the maximum power available. [ECNFixedWireless] proposes a practical and theoretically sound way to combine congestion notification for different bit-congestible resources at different layers along an end to end path, whether wireless or wired, and whether with or without queues.

4.2. Congestion Notification Advice

4.2.1. Network Bias when Encoding

4.2.1.1. Advice on Packet Size Bias in RED

The previously mentioned email [pktByteEmail] referred to by [RFC2309] advised that most scarce resources in the Internet were bit-congestible, which is still believed to be true (Section 1.1). But it went on to offer advice that is updated by this memo. It said that drop probability should depend on the size of the packet being considered for drop if the resource is bit-congestible, but not if it is packet-congestible. The argument continued that if packet drops were inflated by packet size (byte-mode dropping), "a flow's fraction of the packet drops is then a good indication of that flow's fraction of the link bandwidth in bits per second". This was consistent with a referenced policing mechanism being worked on at the time for detecting unusually high bandwidth flows, eventually published in 1999 [pBox]. However, the problem could and should have been solved by making the policing mechanism count the volume of bytes randomly dropped, not the number of packets.

A few months before RFC2309 was published, an addendum was added to the above archived email referenced from the RFC, in which the final paragraph seemed to partially retract what had previously been said. It clarified that the question of whether the probability of dropping/marking a packet should depend on its size was not related to whether the resource itself was bit-congestible, but was a completely orthogonal question. However, the only example given had the queue measured in packets while packet drop depended on the byte-size of the packet in question. No example was given the other way round.

In 2000, Cnodder et al [REDbyte] pointed out that there was an error in the part of the original 1993 RED algorithm that aimed to distribute drops uniformly, because it didn't correctly take into account the adjustment for packet size. They recommended an algorithm called RED_4 to fix this. But they also recommended a further change, RED_5, to adjust the drop rate dependent on the square of relative packet size. This was indeed consistent with one implied motivation behind RED's byte-mode drop--that we should reverse engineer the network to improve the performance of dominant end-to-end congestion control mechanisms. This memo makes a different recommendation in Section 2.

By 2003, a further change had been made to the adjustment for packet size, this time in the RED algorithm of the ns2 simulator. Instead of taking each packet's size relative to a `maximum packet size', it was taken relative to a `mean packet size', intended to be a static value representative of the `typical' packet size on the link. We have not been able to find a justification in the literature for this change. However, Eddy and Allman conducted experiments [REDbias] that assessed how sensitive RED was to this parameter, amongst other things. This changed algorithm can often lead to drop probabilities greater than 1 (which gives a hint that there is probably a mistake in the theory somewhere); the sketch below illustrates the three variants.

On 10-Nov-2004, this variant of byte-mode packet drop was made the default in the ns2 simulator. It seems unlikely that byte-mode drop has ever been implemented in production networks (Appendix A); therefore RED simulated in ns2 without byte-mode drop disabled is likely to behave very differently from RED in production networks, and conclusions drawn from such simulations should be treated with caution.
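For concreteness, here is an illustrative Python sketch of the three byte-mode drop adjustments discussed above (written in our own notation, not code taken from RED or ns2). Note how the ns2 variant can exceed a probability of 1 when a packet is larger than the configured `mean' size:

   # Illustrative sketch of byte-mode drop adjustments (Section
   # 4.2.1.1); p is the drop probability for a reference-size packet.

   def byte_mode_linear(p, s, max_s):
       return p * s / max_s             # original 1993 byte mode

   def byte_mode_squared(p, s, max_s):
       return p * (s / max_s) ** 2      # Cnodder et al's RED_5

   def byte_mode_ns2(p, s, mean_s):
       return p * s / mean_s            # ns2 variant: can exceed 1!

   print(byte_mode_ns2(0.4, 1500, 500)) # -> 1.2, not a probability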
4.2.1.2. Packet Size Bias Regardless of RED

The byte-mode drop variant of RED is, of course, not the only possible bias towards small packets in queueing systems. We have already mentioned that tail-drop queues naturally tend to lock out large packets once they are full. But queues with fixed sized buffers also reduce the probability that small packets will be dropped if (and only if) they allow small packets to borrow buffers from the pools for larger packets. As was explained in Section 4.1.1 on fixed size buffer carving, borrowing effectively makes the maximum queue size for small packets greater than that for large packets, because more buffers can be used by small packets while fewer will fit large packets.

In itself, the bias towards small packets caused by buffer borrowing is perfectly correct. Lower drop probability for small packets is legitimate in buffer borrowing schemes, because small packets genuinely congest the machine's buffer memory less than large packets, given they can fit in more spaces. The bias towards small packets is not artificially added (as it is in RED's byte-mode drop algorithm); it merely reflects the reality of the way fixed buffer memory gets congested. Incidentally, the bias towards small packets from buffer borrowing is nothing like as large as that of RED's byte-mode drop.

Nonetheless, fixed-buffer memory with tail drop is still prone to lock out large packets, purely because of the tail-drop aspect. So a good AQM algorithm like RED with packet-mode drop should be used with fixed buffer memories where possible. If RED is too complicated to implement with multiple fixed buffer pools, the minimum necessary to prevent large packet lock-out is to ensure smaller packets never use the last available buffer in any of the pools for larger packets.

4.2.2. Transport Bias when Decoding

The above proposals to alter network equipment to bias towards smaller packets have largely been carried on outside the IETF process, whereas, within the IETF, there are many different proposals to alter transport protocols to achieve the same goals, i.e. either to make the flow bit-rate take account of packet size, or to protect control packets from loss. This memo argues that altering transport protocols is the more principled approach.
A recently approved experimental RFC adapts its transport layer protocol to take account of packet sizes relative to typical TCP packet sizes. It proposes a new small-packet variant of TCP-friendly rate control [RFC5348] called TFRC-SP [RFC4828]. Essentially, it proposes a rate equation that inflates the flow rate by the ratio of a typical TCP segment size (1500B including TCP header) over the actual segment size [PktSizeEquCC]. (There are also other important differences of detail relative to TFRC, such as using virtual packets [CCvarPktSize] to avoid responding to multiple losses per round trip and using a minimum inter-packet interval.)

Section 4.5.1 of this TFRC-SP spec discusses the implications of operating in an environment where queues have been configured to drop smaller packets with proportionately lower probability than larger ones. But it only discusses TCP operating in such an environment, mentioning TFRC-SP only briefly when discussing how to define fairness with TCP. And it only discusses the byte-mode dropping version of RED as it was before Cnodder et al pointed out that it didn't bias sufficiently towards small packets to make TCP independent of packet size.

So the TFRC-SP spec doesn't address the issue of which of the network or the transport _should_ handle fairness between different packet sizes. In its Appendix B.4 it discusses the possibility of both TFRC-SP and some network buffers duplicating each other's attempts to deliberately bias towards small packets. But the discussion is not conclusive, instead reporting simulations of many of the possibilities in order to assess performance but not recommending any particular course of action.

The paper originally proposing TFRC with virtual packets (VP-TFRC) [CCvarPktSize] proposed that there should perhaps be two variants to cater for the different variants of RED. However, as the TFRC-SP authors point out, there is no way for a transport to know whether some queues on its path have deployed RED with byte-mode packet drop (except if an exhaustive survey found that no-one had deployed it--see Appendix A). Incidentally, VP-TFRC also proposed that byte-mode RED dropping should really square the packet-size compensation-factor (like that of Cnodder's RED_5, but apparently unaware of it).

Pre-congestion notification [RFC5670] is an IETF technology that uses a virtual queue for AQM marking of packets within one Diffserv class in order to give early warning prior to any real queuing. The PCN marking algorithms have been designed not to take account of packet size when forwarding through queues. Instead, the general principle has been to take account of the sizes of marked packets when monitoring the fraction of marking at the edge of the network, as recommended here.
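To illustrate the kind of rate inflation TFRC-SP performs, the sketch below applies the description above to the simplified TCP throughput equation (rate ~ s / (RTT * sqrt(2p/3))). This is a loose illustration of the principle only; the precise equation and its safeguards (such as the minimum inter-packet interval) are specified in [RFC4828]:

   # Loose illustration of TFRC-SP's packet-size compensation,
   # based on the description above (see RFC 4828 for the real
   # equation and its safeguards).
   from math import sqrt

   def tcp_rate_bps(s_bits, rtt, p):
       # Simplified TCP throughput equation (bits per second).
       return s_bits / (rtt * sqrt(2 * p / 3))

   def tfrc_sp_rate_bps(s_bits, rtt, p):
       # Inflate by (typical 1500B segment / actual segment size),
       # so the allowed bit-rate matches a 1500B-packet TCP flow.
       return (12000 / s_bits) * tcp_rate_bps(s_bits, rtt, p)

   # A 60B-segment flow is allowed the same bit-rate as a 1500B one:
   print(tfrc_sp_rate_bps(480, 0.1, 0.01),
         tcp_rate_bps(12000, 0.1, 0.01))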
4.2.3. Making Transports Robust against Control Packet Losses

Recently, two RFCs have defined changes to TCP that make it more robust against losing small control packets [RFC5562] [RFC5690]. In both cases they note that the case for these two TCP changes would be weaker if RED were biased against dropping small packets. We argue here that these two proposals are a safer and more principled way to achieve TCP performance improvements than reverse engineering RED to benefit TCP.

Although there are no known proposals, it would also be possible and perfectly valid to make control packets robust against drop by using their Diffserv code point [RFC2474] to request a scheduling class with a lower drop probability.

Although not brought to the IETF, a simple proposal from Wischik [DupTCP] suggests that the first three packets of every TCP flow should be routinely duplicated after a short delay. It shows that this would greatly improve the chances of short flows completing quickly, but would hardly increase traffic levels on the Internet, because Internet bytes have always been concentrated in the large flows. It further shows that the performance of many typical applications depends on completion of long serial chains of short messages. It argues that, given most of the value people get from the Internet is concentrated within short flows, this simple expedient would greatly increase the value of the best efforts Internet at minimal cost.

4.2.4. Congestion Notification: Summary of Conflicting Advice

   +-----------+----------------+-----------------+--------------------+
   | transport | RED_1 (packet  | RED_4 (linear   | RED_5 (square byte |
   | cc        | mode drop)     | byte mode drop) | mode drop)         |
   +-----------+----------------+-----------------+--------------------+
   | TCP or    | s/sqrt(p)      | sqrt(s/p)       | 1/sqrt(p)          |
   | TFRC      |                |                 |                    |
   | TFRC-SP   | 1/sqrt(p)      | 1/sqrt(sp)      | 1/(s.sqrt(p))      |
   +-----------+----------------+-----------------+--------------------+

   Table 2: Dependence of flow bit-rate per RTT on packet size, s, and
   drop probability, p, when the network and/or transport bias towards
   small packets to varying degrees

Table 2 aims to summarise the potential effects of all the advice from different sources. Each column shows a different possible AQM behaviour in different queues in the network, using the terminology of Cnodder et al outlined earlier (RED_1 is basic RED with packet-mode drop). Each row shows a different transport behaviour: TCP [RFC5681] and TFRC [RFC5348] on the top row, with TFRC-SP [RFC4828] below. Each cell shows how the bits per round trip of a flow depend on packet size, s, and drop probability, p. In order to declutter the formulae and focus on packet-size dependence, they are all given per round trip, which removes any RTT term. (An illustrative sketch of these formulae follows at the end of this section.)

Let us assume that the goal is for the bit-rate of a flow to be independent of packet size. Suppressing all inessential details, the table shows that this should be achievable either by not altering the TCP transport in a RED_5 network, or by using the small-packet TFRC-SP transport (or similar) in a network without any byte-mode dropping RED (top right and bottom left). Top left is the `do nothing' scenario, while bottom right is the `do-both' scenario in which bit-rate would become far too biased towards small packets. Of course, if any form of byte-mode dropping RED has been deployed on a subset of queues that congest, each path through the network will present a different hybrid scenario to its transport.

In any case, we can see that the linear byte-mode drop column in the middle would considerably complicate the Internet. It is a half-way house that doesn't bias enough towards small packets, even if one believes the network should be doing the biasing. Section 2 recommends that _all_ bias in network equipment towards small packets should be turned off--if indeed any equipment vendors have implemented it--leaving packet-size bias solely as the preserve of the transport layer (solely the leftmost, packet-mode drop column).

In practice it seems that no deliberate bias towards small packets has been implemented for production networks. Of the 19% of vendors who responded to a survey of 84 equipment vendors, none had implemented byte-mode drop in RED (see Appendix A for details).
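The cells of Table 2 can be written down directly as functions. The following illustrative Python sketch (our own notation, with all constant factors ignored) shows the packet-size dependence of bits per round trip for each combination:

   # Illustrative sketch of Table 2 (constants ignored): bits per
   # round trip as a function of packet size s and drop prob. p.
   from math import sqrt

   def bits_per_rtt(transport, aqm, s, p):
       table = {
           ("TCP",     "RED_1"): s / sqrt(p),
           ("TCP",     "RED_4"): sqrt(s / p),
           ("TCP",     "RED_5"): 1 / sqrt(p),
           ("TFRC-SP", "RED_1"): 1 / sqrt(p),
           ("TFRC-SP", "RED_4"): 1 / sqrt(s * p),
           ("TFRC-SP", "RED_5"): 1 / (s * sqrt(p)),
       }
       return table[(transport, aqm)]

   # Only the top-right and bottom-left cells are independent of s:
   for s in (480, 12000):
       print(bits_per_rtt("TCP", "RED_5", s, 0.01),
             bits_per_rtt("TFRC-SP", "RED_1", s, 0.01))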
Section 2 1047 recommends that _all_ bias in network equipment towards small packets 1048 should be turned off--if indeed any equipment vendors have 1049 implemented it--leaving packet-size bias solely as the preserve of 1050 the transport layer (the leftmost, packet-mode drop column).

1052 In practice it seems that no deliberate bias towards small packets 1053 has been implemented for production networks. Of the 19% of vendors 1054 who responded to a survey of 84 equipment vendors, none had 1055 implemented byte-mode drop in RED (see Appendix A for details).

1057 5. Outstanding Issues and Next Steps

1059 5.1. Bit-congestible Network

1061 For a connectionless network with nearly all resources being bit- 1062 congestible, the recommended position is clear: the network 1063 should not make allowance for packet sizes and the transport should. 1064 This leaves two outstanding issues:

1066 o How to handle any legacy of AQM with byte-mode drop already 1067 deployed;

1069 o The need to start a programme to update transport congestion 1070 control protocol standards to take account of packet size.

1072 A survey of equipment vendors (Section 4.2.4) found no evidence that 1073 byte-mode packet drop had been implemented, so deployment will be 1074 sparse at best. A migration strategy is not really needed to remove 1075 an algorithm that may not even be deployed.

1077 A programme of experimental updates to take account of packet size in 1078 transport congestion control protocols has already started with 1079 TFRC-SP [RFC4828].

1081 5.2. Bit- & Packet-congestible Network

1083 The position is much less clear-cut if the Internet becomes populated 1084 by a more even mix of both packet-congestible and bit-congestible 1085 resources (see Appendix B.2). This problem is not pressing, because 1086 most Internet resources are designed to be bit-congestible before 1087 packet processing starts to congest (see Section 1.1).

1089 The IRTF Internet congestion control research group (ICCRG) has set 1090 itself the task of reaching consensus on generic forwarding 1091 mechanisms that are necessary and sufficient to support the 1092 Internet's future congestion control requirements (the first 1093 challenge in [RFC6077]). Therefore, we defer to the IRTF the question 1094 of whether packet congestion might become common and what to do if it 1095 does (the 'Small Packets' challenge in [RFC6077]).

1097 6. Security Considerations

1099 This memo recommends that queues do not bias drop probability towards 1100 small packets, as this creates a perverse incentive for transports to 1101 break down their flows into tiny segments. One of the benefits of 1102 implementing AQM was meant to be to remove this perverse incentive 1103 that drop-tail queues gave to small packets.

1105 In practice, transports cannot all be trusted to respond to 1106 congestion. So another reason for recommending that queues do not 1107 bias drop probability towards small packets is to avoid the 1108 vulnerability to small packet DDoS attacks that would otherwise 1109 result. Removing drop-tail's DoS vulnerability to small packets was 1110 another intended benefit of implementing AQM, so we 1111 shouldn't add that vulnerability back again.

1113 If most queues implemented AQM with byte-mode drop, the resulting 1114 network would amplify the potency of a small packet DDoS attack.
At 1115 the first queue the stream of packets would push aside a greater 1116 proportion of large packets, so more of the small packets would 1117 survive to attack the next queue. Thus a flood of small packets 1118 would continue on towards the destination, pushing regular traffic 1119 with large packets out of the way in one queue after the next, but 1120 suffering much less drop itself.

1122 Appendix C explains why the ability of networks to police the 1123 response of _any_ transport to congestion depends on bit-congestible 1124 network resources only doing packet-mode, not byte-mode, drop. In 1125 summary, it says that making drop probability depend on the size of 1126 the packets that bits happen to be divided into simply encourages the 1127 bits to be divided into smaller packets. Byte-mode drop would 1128 therefore irreversibly complicate any attempt to fix the Internet's 1129 incentive structures.

1131 7. Conclusions

1133 This memo identifies the three distinct stages of the congestion 1134 notification process where implementations need to decide whether to 1135 take packet size into account. The recommendation of this memo is 1136 different in each case:

1138 o When network equipment measures the length of a queue, whether it 1139 counts in bytes or packets depends on whether the network resource 1140 is congested respectively by bytes or by packets.

1142 o When network equipment decides whether to drop (or mark) a packet, 1143 it is recommended that the size of the particular packet should 1144 not be taken into account.

1146 o However, when a transport algorithm responds to a dropped or 1147 marked packet, the size of the rate reduction should be 1148 proportionate to the size of the packet.

1150 In summary, the answers are 'it depends', 'no' and 'yes' respectively.

1152 This means that RED's byte-mode queue measurement will often be 1153 appropriate, although byte-mode drop is strongly deprecated.

1155 At the transport layer the IETF should continue updating congestion 1156 control protocols to take account of the size of each packet that 1157 indicates congestion. The IETF should also continue to make 1158 protocols less sensitive to losing control packets like SYNs, pure 1159 ACKs and DNS exchanges. Although many control packets happen to be 1160 small, the alternative of network equipment favouring all small 1161 packets would be dangerous. That would create perverse incentives to 1162 split data transfers into smaller packets.

1164 The memo develops these recommendations from principled arguments 1165 concerning scaling, layering, incentives, inherent efficiency, 1166 security and policeability. But it also addresses practical issues 1167 such as specific buffer architectures and incremental deployment. 1168 Indeed a limited survey of RED implementations is discussed, which 1169 shows there appears to be little, if any, installed base of RED's 1170 byte-mode drop. Therefore it can be deprecated with few, if any, 1171 incremental deployment complications.

1173 The recommendations have been developed on the well-founded basis 1174 that most Internet resources are bit-congestible, not packet- 1175 congestible. We need to know the likelihood that this assumption 1176 will prevail longer term and, if it might not, what protocol changes 1177 will be needed to cater for a mix of the two. This problem is 1178 deferred to the IRTF Internet Congestion Control Research Group 1179 (ICCRG).

1181 8.
Acknowledgements

1183 Thank you to Sally Floyd, who gave extensive and useful review 1184 comments. Also thanks for the reviews from Philip Eardley, David 1185 Black, Fred Baker, Toby Moncaster, Arnaud Jacquet and Mirja 1186 Kuehlewind as well as helpful explanations of different hardware 1187 approaches from Larry Dunn and Fred Baker. We are grateful to Bruce 1188 Davie and his colleagues for providing a timely and efficient survey 1189 of RED implementation in Cisco's product range. Also grateful thanks 1190 to Toby Moncaster, Will Dormann, John Regnault, Simon Carter and 1191 Stefaan De Cnodder who further helped survey the current status of 1192 RED implementation and deployment and, finally, thanks to the 1193 anonymous individuals who responded.

1195 Bob Briscoe and Jukka Manner are partly funded by Trilogy, a research 1196 project (ICT-216372) supported by the European Community under its 1197 Seventh Framework Programme. The views expressed here are those of 1198 the authors only.

1200 9. Comments Solicited

1202 Comments and questions are encouraged and very welcome. They can be 1203 addressed to the IETF Transport Area working group mailing list 1204 <tsvwg@ietf.org>, and/or to the authors.

1206 10. References

1208 10.1. Normative References

1210 [RFC2119] Bradner, S., "Key words for use in 1211 RFCs to Indicate Requirement Levels", 1212 BCP 14, RFC 2119, March 1997.

1214 [RFC2309] Braden, B., Clark, D., Crowcroft, J., 1215 Davie, B., Deering, S., Estrin, D., 1216 Floyd, S., Jacobson, V., Minshall, 1217 G., Partridge, C., Peterson, L., 1218 Ramakrishnan, K., Shenker, S., 1219 Wroclawski, J., and L. Zhang, 1220 "Recommendations on Queue Management 1221 and Congestion Avoidance in the 1222 Internet", RFC 2309, April 1998.

1224 [RFC3168] Ramakrishnan, K., Floyd, S., and D. 1225 Black, "The Addition of Explicit 1226 Congestion Notification (ECN) to IP", 1227 RFC 3168, September 2001.

1229 [RFC3426] Floyd, S., "General Architectural and 1230 Policy Considerations", RFC 3426, 1231 November 2002.

1233 10.2. Informative References

1235 [CCvarPktSize] Widmer, J., Boutremans, C., and J-Y. 1236 Le Boudec, "Congestion Control for 1237 Flows with Variable Packet Size", ACM CCR 34(2) 137--151, 2004.

1241 [CHOKe_Var_Pkt] Psounis, K., Pan, R., and B. 1242 Prabhaker, "Approximate Fair Dropping 1243 for Variable Length Packets", IEEE 1244 Micro 21(1) 48--56, 1245 January-February 2001.

1249 [DRQ] Shin, M., Chong, S., and I. Rhee, 1250 "Dual-Resource TCP/AQM for 1251 Processing-Constrained Networks", 1252 IEEE/ACM Transactions on 1253 Networking Vol 16, issue 2, 1254 April 2008.

1257 [DupTCP] Wischik, D., "Short messages", Royal 1258 Society workshop on networks: 1259 modelling and control, 1260 September 2007.

1264 [ECNFixedWireless] Siris, V., "Resource Control for 1265 Elastic Traffic in CDMA Networks", 1266 Proc. ACM MOBICOM'02, 1267 September 2002.

1271 [Evol_cc] Gibbens, R. and F. Kelly, "Resource 1272 pricing and the evolution of 1273 congestion control", 1274 Automatica 35(12) 1969--1985, 1275 December 1999.

1279 [I-D.ietf-avtcore-ecn-for-rtp] Westerlund, M., Johansson, I., 1280 Perkins, C., O'Hanlon, P., and K. 1281 Carlberg, "Explicit Congestion 1282 Notification (ECN) for RTP over UDP", 1283 draft-ietf-avtcore-ecn-for-rtp-04 1284 (work in progress), July 2011.

1286 [I-D.ietf-conex-concepts-uses] Briscoe, B., Woundy, R., and A. 1287 Cooper, "ConEx Concepts and Use 1288 Cases", 1289 draft-ietf-conex-concepts-uses-03 1290 (work in progress), October 2011.
1292 [IOSArch] Bollapragada, V., White, R., and C. 1293 Murphy, "Inside Cisco IOS Software 1294 Architecture", Cisco Press: CCIE 1295 Professional Development, ISBN13: 1296 978-1-57870-181-0, July 2000.

1298 [PktSizeEquCC] Vasallo, P., "Variable Packet Size 1299 Equation-Based Congestion Control", 1300 ICSI Technical Report tr-00-008, 1301 2000.

1305 [RED93] Floyd, S. and V. Jacobson, "Random 1306 Early Detection (RED) gateways for 1307 Congestion Avoidance", IEEE/ACM 1308 Transactions on Networking 1(4) 1309 397--413, August 1993.

1313 [REDbias] Eddy, W. and M. Allman, "A Comparison 1314 of RED's Byte and Packet Modes", 1315 Computer Networks 42(3) 261--280, 1316 June 2003.

1319 [REDbyte] De Cnodder, S., Elloumi, O., and K. 1320 Pauwels, "RED behavior with different 1321 packet sizes", Proc. 5th IEEE 1322 Symposium on Computers and 1323 Communications (ISCC) 793--799, 1324 July 2000.

1327 [RFC2474] Nichols, K., Blake, S., Baker, F., 1328 and D. Black, "Definition of the 1329 Differentiated Services Field (DS 1330 Field) in the IPv4 and IPv6 Headers", 1331 RFC 2474, December 1998.

1333 [RFC3550] Schulzrinne, H., Casner, S., 1334 Frederick, R., and V. Jacobson, "RTP: 1335 A Transport Protocol for Real-Time 1336 Applications", STD 64, RFC 3550, 1337 July 2003.

1339 [RFC3714] Floyd, S. and J. Kempf, "IAB Concerns 1340 Regarding Congestion Control for 1341 Voice Traffic in the Internet", 1342 RFC 3714, March 2004.

1344 [RFC4828] Floyd, S. and E. Kohler, "TCP 1345 Friendly Rate Control (TFRC): The 1346 Small-Packet (SP) Variant", RFC 4828, 1347 April 2007.

1349 [RFC5348] Floyd, S., Handley, M., Padhye, J., 1350 and J. Widmer, "TCP Friendly Rate 1351 Control (TFRC): Protocol 1352 Specification", RFC 5348, 1353 September 2008.

1355 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, 1356 S., and K. Ramakrishnan, "Adding 1357 Explicit Congestion Notification 1358 (ECN) Capability to TCP's SYN/ACK 1359 Packets", RFC 5562, June 2009.

1361 [RFC5670] Eardley, P., "Metering and Marking 1362 Behaviour of PCN-Nodes", RFC 5670, 1363 November 2009.

1365 [RFC5681] Allman, M., Paxson, V., and E. 1366 Blanton, "TCP Congestion Control", 1367 RFC 5681, September 2009.

1369 [RFC5690] Floyd, S., Arcia, A., Ros, D., and J. 1370 Iyengar, "Adding Acknowledgement 1371 Congestion Control to TCP", RFC 5690, 1372 February 2010.

1374 [RFC6077] Papadimitriou, D., Welzl, M., Scharf, 1375 M., and B. Briscoe, "Open Research 1376 Issues in Internet Congestion 1377 Control", RFC 6077, February 2011.

1379 [Rate_fair_Dis] Briscoe, B., "Flow Rate Fairness: 1380 Dismantling a Religion", ACM 1381 CCR 37(2) 63--74, April 2007.

1385 [gentle_RED] Floyd, S., "Recommendation on using 1386 the "gentle_" variant of RED", Web 1387 page, March 2000.

1390 [pBox] Floyd, S. and K. Fall, "Promoting the 1391 Use of End-to-End Congestion Control 1392 in the Internet", IEEE/ACM 1393 Transactions on Networking 7(4) 1394 458--472, August 1999.

1398 [pktByteEmail] Floyd, S., "RED: Discussions of Byte 1399 and Packet Modes", email, 1400 March 1997.

1404 Appendix A. Survey of RED Implementation Status

1406 This Appendix is informative, not normative.

1408 In May 2007 a survey was conducted of 84 vendors to assess how widely 1409 drop probability based on packet size has been implemented in RED 1410 (Table 3). About 19% of those surveyed replied, giving a sample size 1411 of 16.
Although in most cases we do not have permission to identify 1412 the respondents, we can say that those that have responded include 1413 most of the larger equipment vendors, covering a large fraction of 1414 the market. The two who gave permission to be identified were Cisco 1415 and Alcatel-Lucent. The others range across the large network 1416 equipment vendors at L3 & L2, firewall vendors, wireless equipment 1417 vendors, as well as large software businesses with a small selection 1418 of networking products. All those who responded confirmed that they 1419 have not implemented the variant of RED with drop dependent on packet 1420 size (two were fairly sure they had not, but needed to check more 1421 thoroughly). At the time the survey was conducted, Linux did not 1422 implement RED with packet-size bias of drop, although we have not 1423 investigated a wider range of open source code.

1425 +-------------------------------+----------------+-----------------+
1426 | Response                      | No. of vendors | %age of vendors |
1427 +-------------------------------+----------------+-----------------+
1428 | Not implemented               | 14             | 17%             |
1429 | Not implemented (probably)    | 2              | 2%              |
1430 | Implemented                   | 0              | 0%              |
1431 | No response                   | 68             | 81%             |
1432 | Total companies/orgs surveyed | 84             | 100%            |
1433 +-------------------------------+----------------+-----------------+

1435 Table 3: Vendor Survey on byte-mode drop variant of RED (lower drop 1436 probability for small packets)

1438 Where reasons were given, the most prevalent was the extra complexity 1439 of packet-bias code, though one vendor had a more principled 1440 reason for avoiding it--similar to the argument of this document.

1442 Our survey was of vendor implementations, so we cannot be certain 1443 about operator deployment. But we believe many queues in the 1444 Internet are still tail-drop. The company of one of the co-authors 1445 (BT) has widely deployed RED, but many tail-drop queues are bound to 1446 still exist, particularly in access network equipment and on 1447 middleboxes like firewalls, where RED is not always available.

1449 Routers using a memory architecture based on fixed size buffers with 1450 borrowing may also still be prevalent in the Internet. As explained 1451 in Section 4.2.1, these also provide a marginal (but legitimate) bias 1452 towards small packets. So even though RED byte-mode drop is not 1453 prevalent, it is likely there is still some bias towards small 1454 packets in the Internet due to tail drop and fixed buffer borrowing.

1456 Appendix B. Sufficiency of Packet-Mode Drop

1458 This Appendix is informative, not normative.

1460 Here we check that packet-mode drop (or marking) in the network gives 1461 sufficiently generic information for the transport layer to use. We 1462 check against a 2x2 matrix of four scenarios that may occur now or in 1463 the future (Table 4). The horizontal and vertical dimensions have 1464 been chosen because each tests extremes of sensitivity to packet size 1465 in the transport and in the network respectively.

1467 Note that this section does not consider byte-mode drop at all. 1468 Having deprecated byte-mode drop, the goal here is to check that 1469 packet-mode drop will be sufficient in all cases.
1471 +-------------------------------+-----------------+-----------------+
1472 | Transport                     | a) Independent  | b) Dependent on |
1473 |                               | of packet size  | packet size of  |
1474 | Network                       | of congestion   | congestion      |
1475 |                               | notifications   | notifications   |
1476 +-------------------------------+-----------------+-----------------+
1477 | 1) Predominantly              | Scenario a1)    | Scenario b1)    |
1478 |    bit-congestible network    |                 |                 |
1479 | 2) Mix of bit-congestible and | Scenario a2)    | Scenario b2)    |
1480 |    pkt-congestible network    |                 |                 |
1481 +-------------------------------+-----------------+-----------------+

1483 Table 4: Four Possible Congestion Scenarios

1485 Appendix B.1 focuses on the horizontal dimension of Table 4, checking 1486 that packet-mode drop (or marking) gives sufficient information, 1487 whether or not the transport uses it--scenarios b) and a) 1488 respectively.

1490 Appendix B.2 focuses on the vertical dimension of Table 4, checking 1491 that packet-mode drop gives sufficient information to the transport 1492 whether resources in the network are bit-congestible or packet- 1493 congestible (these terms are defined in Section 1.1).

1495 Notation: To be concrete, we will compare two flows with different 1496 packet sizes, s_1 and s_2. As an example, we will take s_1 = 60B 1497 = 480b and s_2 = 1500B = 12,000b.

1499 A flow's bit rate, x [bps], is related to its packet rate, u 1500 [pps], by

1502 x(t) = s.u(t).

1504 In the bit-congestible case, path congestion will be denoted by 1505 p_b, and in the packet-congestible case by p_p. When either case 1506 is implied, the letter p alone will denote path congestion.

1508 B.1. Packet-Size (In)Dependence in Transports

1510 In all cases we consider a packet-mode drop queue that indicates 1511 congestion by dropping (or marking) packets with probability p 1512 irrespective of packet size. We use an example value of loss 1513 (marking) probability, p=0.1%.

1515 A transport like RFC5681 TCP treats a congestion notification on any 1516 packet, whatever its size, as one event. However, a network with just 1517 the packet-mode drop algorithm does give more information if the 1518 transport chooses to use it. We will use Table 5 to illustrate this.

1520 We will set aside the last column until later. The columns labelled 1521 "Flow 1" and "Flow 2" compare two flows consisting of 60B and 1500B 1522 packets respectively. The body of the table considers two separate 1523 cases, one where the flows have equal bit-rate and the other with 1524 equal packet-rates. In both cases, the two flows fill a 96Mbps link. 1525 Therefore, in the equal bit-rate case they each have half the bit- 1526 rate (48Mbps). With equal packet-rates, in contrast, flow 1 uses 1527 packets 25 times smaller, so it gets 25 times less bit-rate--only 1528 1/(1+25) of the link capacity (96Mbps/26 = 4Mbps after rounding)-- 1529 while flow 2 gets 25 times more bit-rate (92Mbps) because its packets 1530 are 25 times larger. The packet 1531 rate shown for each flow is derived from the bit-rate 1532 by dividing bit-rate by packet size, as shown in the column 1533 labelled "Formula".
1535 Parameter               Formula     Flow 1  Flow 2  Combined
1536 ----------------------- ----------- ------- ------- --------
1537 Packet size             s/8         60B     1,500B  (Mix)
1538 Packet size             s           480b    12,000b (Mix)
1539 Pkt loss probability    p           0.1%    0.1%    0.1%

1541 EQUAL BIT-RATE CASE
1542 Bit-rate                x           48Mbps  48Mbps  96Mbps
1543 Packet-rate             u = x/s     100kpps 4kpps   104kpps
1544 Absolute pkt-loss-rate  p*u         100pps  4pps    104pps
1545 Absolute bit-loss-rate  p*u*s       48kbps  48kbps  96kbps
1546 Ratio of lost/sent pkts p*u/u       0.1%    0.1%    0.1%
1547 Ratio of lost/sent bits p*u*s/(u*s) 0.1%    0.1%    0.1%

1549 EQUAL PACKET-RATE CASE
1550 Bit-rate                x           4Mbps   92Mbps  96Mbps
1551 Packet-rate             u = x/s     8kpps   8kpps   15kpps
1552 Absolute pkt-loss-rate  p*u         8pps    8pps    15pps
1553 Absolute bit-loss-rate  p*u*s       4kbps   92kbps  96kbps
1554 Ratio of lost/sent pkts p*u/u       0.1%    0.1%    0.1%
1555 Ratio of lost/sent bits p*u*s/(u*s) 0.1%    0.1%    0.1%

1557 Table 5: Absolute Loss Rates and Loss Ratios for Flows of Small and 1558 Large Packets and Both Combined

1560 So far we have merely set up the scenarios. We now consider 1561 congestion notification in these scenarios. Two TCP flows with the same 1562 round trip time aim to equalise their packet-loss-rates over time, 1563 that is, the number of packets lost in a second, which is the packets 1564 per second (u) multiplied by the probability that each one is dropped 1565 (p). Thus TCP converges on the "Equal packet-rate" case, where both 1566 flows aim for the same "Absolute packet-loss-rate" (both 8pps in the 1567 table).

1569 Packet-mode drop actually gives flows sufficient information to 1570 measure their loss-rate in bits per second, if they choose, not just 1571 packets per second. Each flow can count the size of a lost or marked 1572 packet and scale its rate-response in proportion (as TFRC-SP does). 1573 The result is shown in the row entitled "Absolute bit-loss-rate", 1574 where the number of bits lost in a second is the packets per second (u) 1575 multiplied by the probability of losing a packet (p) multiplied by 1576 the packet size (s). Such an algorithm would try to remove any 1577 imbalance in bit-loss-rate, such as the wide disparity in the "Equal 1578 packet-rate" case (4kbps vs. 92kbps). Instead of equalising packet- 1579 loss-rates, a packet-size-dependent algorithm would aim for equal 1580 bit-loss-rates (both 48kbps in this example), which would drive both 1581 flows towards the "Equal bit-rate" case.

1583 The explanation so far has assumed that each flow consists of packets 1584 of only one constant size. Nonetheless, it extends naturally to 1585 flows with mixed packet sizes. In the right-most column of Table 5 a 1586 flow of mixed size packets is created simply by considering flow 1 1587 and flow 2 as a single aggregated flow. There is no need for a flow 1588 to maintain an average packet size. It is only necessary for the 1589 transport to scale its response to each congestion indication by the 1590 size of each individual lost (or marked) packet. Taking for example 1591 the "Equal packet-rate" case, in one second about 8 small packets and 1592 8 large packets are lost (making closer to 15 than 16 losses per 1593 second due to rounding). If the transport multiplies each loss by 1594 its size, in one second it responds to about 8*480b and 8*12,000b of 1595 lost bits, which (before rounding) add up to 96,000 lost bits per 1596 second. This cross-checks correctly, being the same as 0.1% of the 1597 total bit-rate of 96Mbps. For completeness, the formula for absolute 1598 bit-loss-rate is p*(u1*s1 + u2*s2).
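The rows of Table 5 can be cross-checked with a few lines of code.
The following Python sketch is purely illustrative--it is not part of
any specification, and the function name is ours. It re-derives each
flow's packet-rate, packet-loss-rate and bit-loss-rate from the
bit-rates and the loss probability used above:

   # Illustrative cross-check of Table 5 (not from any specification).
   p = 0.001                   # drop probability, 0.1%
   s1, s2 = 480, 12000         # packet sizes in bits (60B and 1500B)

   def loss_rates(x1, x2):
       """Given the two flows' bit-rates [bps], return each flow's
       packet-rate [pps], packet-loss-rate [pps] and bit-loss-rate
       [bps], plus the combined totals."""
       u1, u2 = x1 / s1, x2 / s2
       return {
           "packet-rate":   (u1, u2, u1 + u2),
           "pkt-loss-rate": (p * u1, p * u2, p * (u1 + u2)),
           "bit-loss-rate": (p * u1 * s1, p * u2 * s2,
                             p * (u1 * s1 + u2 * s2)),
       }

   print(loss_rates(48e6, 48e6))                  # equal bit-rate case
   print(loss_rates(96e6 / 26, 96e6 * 25 / 26))   # equal packet-rate case

In the equal packet-rate case both flows come out at about 7.7kpps
(8kpps after rounding, as in the table), and the bit-loss-rates come
out at roughly 4kbps and 92kbps, matching the wide disparity discussed
above.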
1600 Incidentally, a transport will always measure the same loss probability 1601 irrespective of whether it counts in packets or in bytes. 1602 In other words, the ratio of lost to sent packets will be the same as 1603 the ratio of lost to sent bytes. (This is why TCP's bit rate is 1604 still proportional to packet size even when byte-counting is used, as 1605 recommended for TCP in [RFC5681], mainly for orthogonal security 1606 reasons.) This is intuitively obvious by comparing two example 1607 flows; one with 60B packets, the other with 1500B packets. If both 1608 flows pass through a queue with drop probability 0.1%, each flow will 1609 lose 1 in 1,000 packets. In the stream of 60B packets the ratio of 1610 bytes lost to sent will be 60B in every 60,000B; and in the stream of 1611 1500B packets, the loss ratio will be 1,500B out of 1,500,000B. So 1612 whichever way the transport measures the ratio of lost to sent, it 1613 will see the same value: 0.1% 1614 in both cases. The fact that this ratio is the same whether measured 1615 in packets or bytes can be seen in Table 5, where the ratio of lost 1616 to sent packets and the ratio of lost to sent bytes is always 0.1% in 1617 all cases (recall that the scenario was set up with p=0.1%).

1619 This discussion of how the ratio can be measured in packets or bytes 1620 is only raised here to highlight that it is irrelevant to this memo! 1621 Whether a transport depends on packet size or not depends on how this 1622 ratio is used within the congestion control algorithm.

1624 So far we have shown that packet-mode drop passes sufficient 1625 information to the transport layer so that the transport can take 1626 account of bit-congestion, by using the sizes of the packets that 1627 indicate congestion. We have also shown that the transport can 1628 choose not to take packet size into account if it wishes. We will 1629 now consider whether the transport can know which to do.

1631 B.2. Bit-Congestible and Packet-Congestible Indications

1633 As a thought-experiment, imagine an idealised congestion notification 1634 protocol that supports both bit-congestible and packet-congestible 1635 resources. It would require at least two ECN flags, one for each of 1636 bit-congestible and packet-congestible resources.

1638 1. A packet-congestible resource trying to code congestion level p_p 1639 into a packet stream should mark the idealised `packet 1640 congestion' field in each packet with probability p_p 1641 irrespective of the packet's size. The transport should then 1642 take a packet with the packet congestion field marked to mean 1643 just one mark, irrespective of the packet size.

1645 2. A bit-congestible resource trying to code time-varying byte- 1646 congestion level p_b into a packet stream should mark the `byte 1647 congestion' field in each packet with probability p_b, again 1648 irrespective of the packet's size. Unlike before, the transport 1649 should take a packet with the byte congestion field marked to 1650 count as a mark on each byte in the packet.

1652 This hides a fundamental problem--much more fundamental than whether 1653 we can magically create header space for yet another ECN flag, or 1654 whether it would work while being deployed incrementally. 1655 Distinguishing drop from delivery naturally provides just one 1656 implicit bit of congestion indication information--the packet is 1657 either dropped or not.
It is hard to drop a packet in two ways that 1658 are distinguishable remotely. This is a similar problem to that of 1659 distinguishing wireless transmission losses from congestive losses.

1661 This problem would not be solved even if ECN were universally 1662 deployed. A congestion notification protocol must survive a 1663 transition from low levels of congestion to high. Marking two states 1664 is feasible with explicit marking, but much harder if packets are 1665 dropped. Also, it will not always be cost-effective to implement AQM 1666 at every low-level resource, so drop will often have to suffice.

1668 We are not saying two ECN fields will be needed (and we are not 1669 saying that somehow a resource should be able to drop a packet in one 1670 of two different ways so that the transport can distinguish which 1671 sort of drop it was!). These two congestion notification channels 1672 are a conceptual device to illustrate a dilemma we could face in the 1673 future. Section 3 gives four good reasons why it would be a bad idea 1674 to allow for packet size by biasing drop probability in favour of 1675 small packets within the network. The impracticality of our thought 1676 experiment shows that it will be hard to give transports a practical 1677 way to know whether to take account of the size of congestion 1678 indication packets or not.

1680 Fortunately, this dilemma is not pressing because by design most 1681 equipment becomes bit-congested before its packet-processing becomes 1682 congested (as already outlined in Section 1.1). Therefore transports 1683 can be designed on the relatively sound assumption that a congestion 1684 indication will usually imply bit-congestion.

1686 Nonetheless, although the above idealised protocol isn't intended for 1687 implementation, we do want to emphasise that research is needed to 1688 predict whether packet 1689 congestion might become more common and, if so, to find a way to 1690 distinguish between bit and packet congestion [RFC3714].

1692 Recently, the dual resource queue (DRQ) proposal [DRQ] has been made 1693 on the premise that, as network processors become more cost- 1694 effective, per-packet operations will become more complex 1695 (irrespective of whether more function in the network is desirable), 1696 so CPU congestion will become more common. DRQ is a proposed 1697 modification to the RED algorithm that 1698 folds both bit congestion and packet congestion into one signal 1699 (either loss or ECN).

1701 Finally, we note one further complication. Strictly, packet- 1702 congestible resources are often cycle-congestible. For instance, for 1703 routing look-ups, the load depends on the complexity of each look-up 1704 and on whether the pattern of arrivals is amenable to caching. This 1705 also reminds us that any solution must not require a forwarding 1706 engine to use excessive processor cycles in order to decide how to 1707 say it has no spare processor cycles.

1709 Appendix C. Byte-mode Drop Complicates Policing Congestion Response

1711 There are two main classes of approach to policing congestion 1712 response: i) policing at each bottleneck link or ii) policing at the 1713 edges of networks. Packet-mode drop in RED is compatible with 1714 either, while byte-mode drop precludes edge policing.
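Before examining the two approaches, a minimal sketch may make the
contrast concrete. The following Python fragment shows the byte-based
accounting that an edge policer can rely on; it is purely illustrative
and the names are ours, not taken from [I-D.ietf-conex-concepts-uses]:

   # Illustrative sketch: per-user accounting for an edge policer.
   # It is sound only if one marked/dropped packet is equivalent to
   # another of the same size, i.e. only under packet-mode drop.
   from collections import defaultdict

   congested_bytes = defaultdict(int)   # bytes of congestion-marked
                                        # traffic seen per user

   def account(user_id, pkt_len_bytes, congestion_marked):
       """Count every byte of a marked packet against its user,
       without needing to know which link did the marking."""
       if congestion_marked:
           congested_bytes[user_id] += pkt_len_bytes

   def over_allowance(user_id, allowance_bytes):
       return congested_bytes[user_id] > allowance_bytes

Under byte-mode drop, the meaning of one mark depends on the MTU of
the link where it occurred, so a simple byte count like this is no
longer well-defined, as the following paragraphs explain.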
1716 The simplicity of an edge policer relies on one dropped or marked 1717 packet being equivalent to another of the same size without having to 1718 know which link the drop or mark occurred at. However, the byte-mode 1719 drop algorithm has to depend on the local MTU of the line--it needs 1720 to use some concept of a 'normal' packet size. Therefore, one 1721 dropped or marked packet from a byte-mode drop algorithm is not 1722 necessarily equivalent to another from a different link. A policing 1723 function local to the link can know the local MTU where the 1724 congestion occurred. However, a policer at the edge of the network 1725 cannot, at least not without a lot of complexity.

1727 The early research proposals for type (i) policing at a bottleneck 1728 link [pBox] used byte-mode drop, then detected flows that contributed 1729 disproportionately to the number of packets dropped. However, with 1730 no extra complexity, later proposals used packet-mode drop and looked 1731 for flows that contributed a disproportionate amount of dropped bytes 1732 [CHOKe_Var_Pkt].

1734 Work is progressing on the congestion exposure protocol (ConEx 1735 [I-D.ietf-conex-concepts-uses]), which enables a type (ii) edge 1736 policer located at a user's attachment point. The idea is to be able 1737 to take an integrated view of the effect of all a user's traffic on 1738 any link in the internetwork. However, byte-mode drop would 1739 effectively preclude such edge policing because of the MTU issue 1740 above.

1742 Indeed, making drop probability depend on the size of the packets 1743 that bits happen to be divided into would simply encourage the bits 1744 to be divided into smaller packets in order to confuse policing. In 1745 contrast, as long as a dropped/marked packet is taken to mean that 1746 all the bytes in the packet are dropped/marked, a policer can remain 1747 robust against bits being re-divided into different size packets or 1748 across different size flows [Rate_fair_Dis].

1750 Appendix D. Changes from Previous Versions

1752 To be removed by the RFC Editor on publication.

1754 Full incremental diffs between each version are available at 1755 1756 (courtesy of the rfcdiff tool):

1758 From -04 to -05:

1760 * Changed from Informational to BCP and highlighted non-normative 1761 sections and appendices

1763 * Removed language about consensus

1765 * Added "Example Comparing Packet-Mode Drop and Byte-Mode Drop"

1767 * Arranged "Motivating Arguments" into a more logical order and 1768 completely rewrote "Transport-Independent Network" & "Scaling 1769 Congestion Control with Packet Size" arguments. Removed "Why 1770 Now?"

1772 * Clarified applicability of certain recommendations

1774 * Shifted vendor survey to an Appendix

1776 * Cut down "Outstanding Issues and Next Steps"

1778 * Re-drafted the start of the conclusions to highlight the three 1779 distinct areas of concern

1781 * Completely re-wrote appendices

1783 * Editorial corrections throughout.

1785 From -03 to -04:

1787 * Reordered Sections 2 and 3, and some clarifications here and 1788 there based on feedback from Colin Perkins and Mirja 1789 Kuehlewind.
1791 From -02 to -03:

1793 * Structural changes:

1795 + Split off text at end of "Scaling Congestion Control with 1796 Packet Size" into new section "Transport-Independent 1797 Network"

1799 + Shifted "Recommendations" straight after "Motivating 1800 Arguments" and added "Conclusions" at end to reinforce 1801 Recommendations

1803 + Added more internal structure to Recommendations, so that 1804 recommendations specific to RED or to TCP are just 1805 corollaries of a more general recommendation, rather than 1806 being listed as a separate recommendation.

1808 + Renamed "State of the Art" as "Critical Survey of Existing 1809 Advice" and retitled a number of subsections with more 1810 descriptive titles.

1812 + Split end of "Congestion Coding: Summary of Status" into a 1813 new subsection called "RED Implementation Status".

1815 + Removed text that had been in the Appendix "Congestion 1816 Notification Definition: Further Justification".

1818 * Reordered the intro text a little.

1820 * Made it clearer when advice being reported is deprecated and 1821 when it is not.

1823 * Described AQM as in network equipment, rather than saying "at 1824 the network layer" (to side-step controversy over whether 1825 functions like AQM are in the transport layer but in network 1826 equipment).

1828 * Minor improvements to clarity throughout

1830 From -01 to -02:

1832 * Restructured the whole document for (hopefully) easier reading 1833 and clarity. The concrete recommendation, in RFC2119 language, 1834 is now in Section 7.

1836 From -00 to -01:

1838 * Minor clarifications throughout and updated references

1840 From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00:

1842 * Added note on relationship to existing RFCs

1844 * Posed the question of whether packet-congestion could become 1845 common and deferred it to the IRTF ICCRG. Added ref to the 1846 dual-resource queue (DRQ) proposal.

1848 * Changed PCN references from the PCN charter & architecture to 1849 the PCN marking behaviour draft most likely to imminently 1850 become the standards track WG item.

1852 From -01 to -02 (of briscoe-byte-pkt-mark):

1854 * Abstract reorganised to align with clearer separation of issue 1855 in the memo.

1857 * Introduction reorganised with motivating arguments removed to 1858 new Section 3.

1860 * Clarified avoiding lock-out of large packets is not the main or 1861 only motivation for RED.

1863 * Mentioned choice of drop or marking explicitly throughout, 1864 rather than trying to coin a word to mean either.

1866 * Generalised the discussion throughout to any packet forwarding 1867 function on any network equipment, not just routers.

1869 * Clarified the last point about why this is a good time to sort 1870 out this issue: because it will be hard / impossible to design 1871 new transports unless we decide whether the network or the 1872 transport is allowing for packet size.

1874 * Added statement explaining the horizon of the memo is long 1875 term, but with short term expediency in mind.

1877 * Added material on scaling congestion control with packet size 1878 (Section 3.4).

1880 * Separated out issue of normalising TCP's bit rate from issue of 1881 preference to control packets (Section 3.2).

1883 * Divided up Congestion Measurement section for clarity, 1884 including new material on fixed size packet buffers and buffer 1885 carving (Section 4.1.1 & Section 4.2.1) and on congestion 1886 measurement in wireless link technologies without queues 1887 (Section 4.1.2).
1889 * Added section on 'Making Transports Robust against Control 1890 Packet Losses' (Section 4.2.3) with existing & new material 1891 included.

1893 * Added tabulated results of vendor survey on byte-mode drop 1894 variant of RED (Table 3).

1896 From -00 to -01 (of briscoe-byte-pkt-mark):

1898 * Clarified applicability to drop as well as ECN.

1900 * Highlighted DoS vulnerability.

1902 * Emphasised that drop-tail suffers from similar problems to 1903 byte-mode drop, so only byte-mode drop should be turned off, 1904 not RED itself.

1906 * Clarified that the original apparent motivations for recommending 1907 byte-mode drop included protecting SYNs and pure ACKs more than 1908 equalising the bit rates of TCPs with different segment sizes. 1909 Removed some conjectured motivations.

1911 * Added support for updates to TCP in progress (ackcc & ecn-syn- 1912 ack).

1914 * Updated survey results with newly arrived data.

1916 * Pulled all recommendations together into the conclusions.

1918 * Moved some detailed points into two additional appendices and a 1919 note.

1921 * Considerable clarifications throughout.

1923 * Updated references

1925 Authors' Addresses

1927 Bob Briscoe 1928 BT 1929 B54/77, Adastral Park 1930 Martlesham Heath 1931 Ipswich IP5 3RE 1932 UK

1934 Phone: +44 1473 645196 1935 EMail: bob.briscoe@bt.com 1936 URI: http://bobbriscoe.net/

1937 Jukka Manner 1938 Aalto University 1939 Department of Communications and Networking (Comnet) 1940 P.O. Box 13000 1941 FIN-00076 Aalto 1942 Finland

1944 Phone: +358 9 470 22481 1945 EMail: jukka.manner@tkk.fi 1946 URI: http://www.netlab.tkk.fi/~jmanner/