2 Transport Area Working Group B. Briscoe 3 Internet-Draft BT 4 Updates: 2309 (if approved) J. Manner 5 Intended status: BCP Aalto University 6 Expires: August 23, 2012 February 20, 2012 8 Byte and Packet Congestion Notification 9 draft-ietf-tsvwg-byte-pkt-congest-07 11 Abstract 13 This memo concerns dropping or marking packets using active queue 14 management (AQM) such as random early detection (RED) or pre-congestion notification (PCN). We give three strong recommendations: 16 (1) packet size should be taken into account when transports read and 17 respond to congestion indications, (2) packet size should not be 18 taken into account when network equipment creates congestion signals 19 (marking, dropping), and therefore (3) the byte-mode packet drop 20 variant of the RED AQM algorithm that drops fewer small packets 21 should not be used. This memo updates RFC 2309 to deprecate deliberate preferential treatment of small packets in AQM algorithms. 23 Status of This Memo 25 This Internet-Draft is submitted in full conformance with the 26 provisions of BCP 78 and BCP 79.
28 Internet-Drafts are working documents of the Internet Engineering 29 Task Force (IETF). Note that other groups may also distribute 30 working documents as Internet-Drafts. The list of current Internet- 31 Drafts is at http://datatracker.ietf.org/drafts/current/. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on August 23, 2012. 40 Copyright Notice 42 Copyright (c) 2012 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (http://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 58 1.1. Terminology and Scoping . . . . . . . . . . . . . . . . . 6 59 1.2. Example Comparing Packet-Mode Drop and Byte-Mode Drop . . 7 60 2. Recommendations . . . . . . . . . . . . . . . . . . . . . . . 8 61 2.1. Recommendation on Queue Measurement . . . . . . . . . . . 9 62 2.2. Recommendation on Encoding Congestion Notification . . . . 9 63 2.3. Recommendation on Responding to Congestion . . . . . . . . 10 64 2.4. Recommendation on Handling Congestion Indications when 65 Splitting or Merging Packets . . . . . . . . . . . . . . . 11 66 3. Motivating Arguments . . . . . . . . . . . . . . . . . . . . . 11 67 3.1. 
Avoiding Perverse Incentives to (Ab)use Smaller Packets . 12 68 3.2. Small != Control . . . . . . . . . . . . . . . . . . . . . 13 69 3.3. Transport-Independent Network . . . . . . . . . . . . . . 13 70 3.4. Scaling Congestion Control with Packet Size . . . . . . . 14 71 3.5. Implementation Efficiency . . . . . . . . . . . . . . . . 16 72 4. A Survey and Critique of Past Advice . . . . . . . . . . . . . 16 73 4.1. Congestion Measurement Advice . . . . . . . . . . . . . . 16 74 4.1.1. Fixed Size Packet Buffers . . . . . . . . . . . . . . 17 75 4.1.2. Congestion Measurement without a Queue . . . . . . . . 18 76 4.2. Congestion Notification Advice . . . . . . . . . . . . . . 19 77 4.2.1. Network Bias when Encoding . . . . . . . . . . . . . . 19 78 4.2.2. Transport Bias when Decoding . . . . . . . . . . . . . 21 79 4.2.3. Making Transports Robust against Control Packet 80 Losses . . . . . . . . . . . . . . . . . . . . . . . . 22 81 4.2.4. Congestion Notification: Summary of Conflicting 82 Advice . . . . . . . . . . . . . . . . . . . . . . . . 22 83 5. Outstanding Issues and Next Steps . . . . . . . . . . . . . . 24 84 5.1. Bit-congestible Network . . . . . . . . . . . . . . . . . 24 85 5.2. Bit- & Packet-congestible Network . . . . . . . . . . . . 24 86 6. Security Considerations . . . . . . . . . . . . . . . . . . . 24 87 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 25 88 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 26 89 9. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 27 90 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 27 91 10.1. Normative References . . . . . . . . . . . . . . . . . . . 27 92 10.2. Informative References . . . . . . . . . . . . . . . . . . 27 93 Appendix A. Survey of RED Implementation Status . . . . . . . . . 31 94 Appendix B. Sufficiency of Packet-Mode Drop . . . . . . . . . . . 32 95 B.1. Packet-Size (In)Dependence in Transports . . . . . . . . . 33 96 B.2. 
Bit-Congestible and Packet-Congestible Indications . . . . 36 98 Appendix C. Byte-mode Drop Complicates Policing Congestion 99 Response . . . . . . . . . . . . . . . . . . . . . . 37 100 Appendix D. Changes from Previous Versions . . . . . . . . . . . 38 102 1. Introduction 104 This memo concerns how we should correctly scale congestion control 105 functions with packet size for the long term. It also recognises 106 that expediency may be necessary to deal with existing widely 107 deployed protocols that don't live up to the long-term goal. 109 When notifying congestion, the problem of how (and whether) to take 110 packet sizes into account has exercised the minds of researchers and 111 practitioners for as long as active queue management (AQM) has been 112 discussed. Indeed, one reason AQM was originally introduced was to 113 reduce the lock-out effects that small packets can have on large 114 packets in drop-tail queues. This memo aims to state the principles 115 we should be using and to outline how these principles will affect 116 future protocol design, taking into account the existing deployments 117 we already have. 119 The question of whether to take into account packet size arises at 120 three stages in the congestion notification process: 122 Measuring congestion: When a congested resource measures locally how 123 congested it is, should it measure its queue length in bytes or 124 packets? 126 Encoding congestion notification into the wire protocol: When a 127 congested network resource notifies its level of congestion, 128 should it drop / mark each packet dependent on the byte-size of 129 the particular packet in question? 131 Decoding congestion notification from the wire protocol: When a 132 transport interprets the notification in order to decide how much 133 to respond to congestion, should it take into account the byte-size of each missing or marked packet?
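As a non-normative illustration, the three stages can be sketched as three separate decisions. All function and field names below are invented for illustration, and the byte-mode branch in the second stage models the behaviour this memo goes on to deprecate:

```python
import random

def measure_queue(queue, bit_congestible):
    """Stage 1: measure local congestion.  Queue length is counted
    in bytes if the resource is congested by bytes, otherwise in
    packets."""
    if bit_congestible:
        return sum(pkt["size_bytes"] for pkt in queue)
    return len(queue)

def encode_notification(drop_prob, pkt, byte_mode=False, mtu_bytes=1500):
    """Stage 2: decide whether this packet carries a congestion
    signal (drop or ECN mark).  byte_mode=True deflates the
    probability in proportion to packet size, as RED's byte-mode
    drop variant does."""
    if byte_mode:
        drop_prob *= pkt["size_bytes"] / mtu_bytes
    return random.random() < drop_prob

def decode_notification(lost_or_marked):
    """Stage 3: the transport weighs each indication by the
    byte-size of the missing or marked packet."""
    return sum(pkt["size_bytes"] for pkt in lost_or_marked)
```

For example, with 60B and 1,500B packets, stage 3 counts a marked large packet as a 25 times stronger congestion signal than a marked small one.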
136 Consensus has emerged over the years concerning the first stage: 137 whether queues are measured in bytes or packets, termed byte-mode 138 queue measurement or packet-mode queue measurement. Section 2.1 of 139 this memo records this consensus in the RFC Series. In summary, the 140 choice solely depends on whether the resource is congested by bytes 141 or packets. 143 The controversy is mainly around the last two stages: whether to 144 allow for the size of the specific packet notifying congestion i) 145 when the network encodes or ii) when the transport decodes the 146 congestion notification. 148 Currently, the RFC series is silent on this matter other than a paper 149 trail of advice referenced from [RFC2309], which conditionally 150 recommends byte-mode (packet-size dependent) drop [pktByteEmail]. 151 Reducing drop of small packets certainly has some tempting 152 advantages: i) it drops fewer control packets, which tend to be small 153 and ii) it makes TCP's bit-rate less dependent on packet size. 154 However, there are ways of addressing these issues at the transport 155 layer, rather than reverse engineering network forwarding to fix the 156 problems. 158 This memo updates [RFC2309] to deprecate deliberate preferential 159 treatment of small packets in AQM algorithms. It recommends that (1) 160 packet size should be taken into account when transports read 161 congestion indications, (2) not when network equipment writes them. 163 In particular this means that the byte-mode packet drop variant of 164 Random Early Detection (RED) should not be used to drop fewer small 165 packets, because that creates a perverse incentive for transports to 166 use tiny segments, consequently also opening up a DoS vulnerability.
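The effect of recommendation (2) can be checked with quick arithmetic; the helper below is purely illustrative (its name is not from this memo), and Section 1.2 tabulates the same figures. Under size-independent (packet-mode) drop, flows of 60B and 1,500B packets at the same 48Mb/s bit-rate lose different numbers of packets but the same number of bits per second:

```python
def loss_rates(bit_rate_bps, pkt_size_bits, drop_prob):
    """Expected per-second losses under size-independent drop."""
    pkt_rate = bit_rate_bps / pkt_size_bits   # packets per second
    pkt_loss = drop_prob * pkt_rate           # packets dropped per second
    bit_loss = pkt_loss * pkt_size_bits       # bits dropped per second
    return pkt_loss, bit_loss

small = loss_rates(48e6, 480, 0.001)     # 60B packets:    ~100 pkt/s, ~48 kb/s
large = loss_rates(48e6, 12000, 0.001)   # 1,500B packets: ~4 pkt/s,   ~48 kb/s
```

Either way, about 48kb/s of the 48Mb/s is dropped, so every bit faces the same loss probability whatever size of packet it travels in; deflating the drop probability for small packets (byte-mode drop) would instead shield the flow of small packets.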
167 Fortunately, none of the RED implementers who responded to our 168 admittedly limited survey (Section 4.2.4) followed the earlier advice 169 to use byte-mode drop, so the position this memo argues for seems to 170 already exist in implementations. 172 However, at the transport layer, TCP congestion control is a widely 173 deployed protocol that doesn't scale with packet size. To date this 174 hasn't been a significant problem because most TCP implementations 175 have been used with similar packet sizes. But, as we design new 176 congestion control mechanisms, the current recommendation is that we 177 should build in scaling with packet size rather than assuming we 178 should follow TCP's example. 180 This memo continues as follows. First it discusses terminology and 181 scoping. Section 2 gives the concrete formal recommendations, 182 followed by motivating arguments in Section 3. We then critically 183 survey the advice given previously in the RFC series and the research 184 literature (Section 4), referring to an assessment of whether or not 185 this advice has been followed in production networks (Appendix A). 186 To wrap up, outstanding issues are discussed that will need 187 resolution both to inform future protocol designs and to handle 188 legacy (Section 5). Then security issues are collected together in 189 Section 6 before conclusions are drawn in Section 7. The interested 190 reader can find discussion of more detailed issues on the theme of 191 byte vs. packet in the appendices. 193 This memo intentionally includes a non-negligible amount of material 194 on the subject. For the busy reader, Section 2 summarises the 195 recommendations for the Internet community. 197 1.1. Terminology and Scoping 199 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 200 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 201 document are to be interpreted as described in [RFC2119].
203 Congestion Notification: Congestion notification is a changing 204 signal that aims to communicate the probability that the network 205 resource(s) will not be able to forward the level of traffic load 206 offered (or that there is an impending risk that they will not be 207 able to). 209 The `impending risk' qualifier is added, because AQM systems (e.g. 210 RED, PCN [RFC5670]) set a virtual limit smaller than the actual 211 limit to the resource, then notify when this virtual limit is 212 exceeded in order to avoid uncontrolled congestion of the actual 213 capacity. 215 Congestion notification communicates a real number bounded by the 216 range [0,1]. This ties in with the most well-understood measure 217 of congestion notification: drop probability. 219 Explicit and Implicit Notification: The byte vs. packet dilemma 220 concerns congestion notification irrespective of whether it is 221 signalled implicitly by drop or using explicit congestion 222 notification (ECN [RFC3168] or PCN [RFC5670]). Throughout this 223 document, unless clear from the context, the term marking will be 224 used to mean notifying congestion explicitly, while congestion 225 notification will be used to mean notifying congestion either 226 implicitly by drop or explicitly by marking. 228 Bit-congestible vs. Packet-congestible: If the load on a resource 229 depends on the rate at which packets arrive, it is called packet- 230 congestible. If the load depends on the rate at which bits arrive 231 it is called bit-congestible. 233 Examples of packet-congestible resources are route look-up engines 234 and firewalls, because load depends on how many packet headers 235 they have to process. Examples of bit-congestible resources are 236 transmission links, radio power and most buffer memory, because 237 the load depends on how many bits they have to transmit or store. 
238 Some machine architectures use fixed size packet buffers, so 239 buffer memory in these cases is packet-congestible (see 240 Section 4.1.1). 242 Currently a design goal of network processing equipment such as 243 routers and firewalls is to keep packet processing uncongested 244 even under worst case packet rates with runs of minimum size 245 packets. Therefore, packet-congestion is currently rare [RFC6077; 246 S.3.3], but there is no guarantee that it will not become more 247 common in future. 249 Note that information is generally processed or transmitted with a 250 minimum granularity greater than a bit (e.g. octets). The 251 appropriate granularity for the resource in question should be 252 used, but for the sake of brevity we will talk in terms of bytes 253 in this memo. 255 Coarser Granularity: Resources may be congestible at higher levels 256 of granularity than bits or packets, for instance stateful 257 firewalls are flow-congestible and call-servers are session- 258 congestible. This memo focuses on congestion of connectionless 259 resources, but the same principles may be applicable for 260 congestion notification protocols controlling per-flow and per- 261 session processing or state. 263 RED Terminology: In RED whether to use packets or bytes when 264 measuring queues is called respectively "packet-mode queue 265 measurement" or "byte-mode queue measurement". And whether the 266 probability of dropping a particular packet is independent or 267 dependent on its byte-size is called respectively "packet-mode 268 drop" or "byte-mode drop". The terms byte-mode and packet-mode 269 should not be used without specifying whether they apply to queue 270 measurement or to drop. 272 1.2. Example Comparing Packet-Mode Drop and Byte-Mode Drop 274 A central question addressed by this document is whether to recommend 275 that AQM uses RED's packet-mode drop and to deprecate byte-mode drop. 
276 Table 1 compares how packet-mode and byte-mode drop affect two flows 277 of different size packets. For each it gives the expected number of 278 packets and of bits dropped in one second. Each example flow runs at 279 the same bit-rate of 48Mb/s, but one is broken up into small 60 byte 280 packets and the other into large 1500 byte packets.

282 To keep up the same bit-rate, in one second there are about 25 times 283 more small packets because they are 25 times smaller. As can be seen 284 from the table, the packet rate is 100,000 small packets versus 4,000 285 large packets per second (pps).

287 Parameter              Formula         Small packets   Large packets
288 --------------------   --------------  -------------   -------------
289 Packet size            s/8             60B             1,500B
290 Packet size            s               480b            12,000b
291 Bit-rate               x               48Mbps          48Mbps
292 Packet-rate            u = x/s         100kpps         4kpps

294 Packet-mode Drop
295 Pkt loss probability   p               0.1%            0.1%
296 Pkt loss-rate          p*u             100pps          4pps
297 Bit loss-rate          p*u*s           48kbps          48kbps

299 Byte-mode Drop         MTU, M=12,000b
300 Pkt loss probability   b = p*s/M       0.004%          0.1%
301 Pkt loss-rate          b*u             4pps            4pps
302 Bit loss-rate          b*u*s           1.92kbps        48kbps

304 Table 1: Example Comparing Packet-mode and Byte-mode Drop

306 For packet-mode drop, we illustrate the effect of a drop probability 307 of 0.1%, which the algorithm applies to all packets irrespective of 308 size. Because there are 25 times more small packets in one second, 309 it naturally drops 25 times more small packets, that is 100 small 310 packets but only 4 large packets. But if we count how many bits it 311 drops, there are 48,000 bits in 100 small packets and 48,000 bits in 312 4 large packets--the same number of bits of small packets as large.

314 The packet-mode drop algorithm drops any bit with the same 315 probability whether the bit is in a small or a large packet.

317 For byte-mode drop, again we use an example drop probability of 0.1%, 318 but only for maximum size packets (assuming the link MTU is 1,500B or 319 12,000b).
The byte-mode algorithm reduces the drop probability of 320 smaller packets proportional to their size, making the probability 321 that it drops a small packet 25 times smaller at 0.004%. But there 322 are 25 times more small packets, so dropping them with 25 times lower 323 probability results in dropping the same number of packets: 4 drops 324 in both cases. The 4 small dropped packets contain 25 times fewer 325 bits than the 4 large dropped packets: 1,920 compared to 48,000. 327 The byte-mode drop algorithm drops any bit with a probability 328 proportionate to the size of the packet it is in. 330 2. Recommendations 332 This section gives recommendations related to network equipment in 333 Sections 2.1 and 2.2, and then Sections 2.3 and 2.4 discuss the 334 implications for transport protocols. 336 2.1. Recommendation on Queue Measurement 338 Queue length is usually the most correct and simplest way to measure 339 congestion of a resource. To avoid the pathological effects of drop 340 tail, an AQM function can then be used to transform queue length into 341 the probability of dropping or marking a packet (e.g. RED's 342 piecewise linear function between thresholds). 344 If the resource is bit-congestible, the implementation SHOULD measure 345 the length of the queue in bytes. If the resource is packet-congestible, the implementation SHOULD measure the length of the 347 queue in packets. No other choice makes sense, because the number of 348 packets waiting in the queue isn't relevant if the resource gets 349 congested by bytes and vice versa. 351 What this advice means for the case of RED: 353 1. A RED implementation SHOULD use byte-mode queue measurement for 354 measuring the congestion of bit-congestible resources and packet-mode 355 queue measurement for packet-congestible resources. 357 2.
An implementation SHOULD NOT make it possible to configure the 358 way a queue measures itself, because whether a queue is bit- 359 congestible or packet-congestible is an inherent property of the 360 queue. 362 The recommended approach in less straightforward scenarios, such as 363 fixed size buffers, and resources without a queue, is discussed in 364 Section 4.1. 366 2.2. Recommendation on Encoding Congestion Notification 368 When encoding congestion notification (e.g. by drop, ECN & PCN), a 369 network device SHOULD treat all packets equally, regardless of their 370 size. In other words, the probability that network equipment drops 371 or marks a particular packet to notify congestion SHOULD NOT depend 372 on the size of the packet in question. As the example in Section 1.2 373 illustrates, to drop any bit with probability 0.1% it is only 374 necessary to drop every packet with probability 0.1% without regard 375 to the size of each packet. 377 This approach ensures the network layer offers sufficient congestion 378 information for all known and future transport protocols and also 379 ensures no perverse incentives are created that would encourage 380 transports to use inappropriately small packet sizes. 382 What this advice means for the case of RED: 384 1. AQM algorithms such as RED SHOULD NOT use byte-mode drop, which 385 deflates RED's drop probability for smaller packet sizes. RED's 386 byte-mode drop has no enduring advantages. It is more complex, 387 it creates the perverse incentive to fragment segments into tiny 388 pieces and it reopens the vulnerability to floods of small- 389 packets that drop-tail queues suffered from and AQM was designed 390 to remove. 392 2. If a vendor has implemented byte-mode drop, and an operator has 393 turned it on, it is RECOMMENDED to turn it off. Note that RED as 394 a whole SHOULD NOT be turned off, as without it, a drop tail 395 queue also biases against large packets. 
But note also that 396 turning off byte-mode drop may alter the relative performance of 397 applications using different packet sizes, so it would be 398 advisable to establish the implications before turning it off. 400 Note well that RED's byte-mode drop is completely 401 orthogonal to byte-mode queue measurement and should not be 402 confused with it. If a RED implementation has a byte-mode but 403 does not specify what sort of byte-mode, it is most probably 404 byte-mode queue measurement, which is fine. However, if in 405 doubt, the vendor should be consulted. 407 A survey (Appendix A) showed that there appears to be little, if any, 408 installed base of the byte-mode drop variant of RED. This suggests 409 that deprecating byte-mode drop will have little, if any, incremental 410 deployment impact. 412 2.3. Recommendation on Responding to Congestion 414 When a transport detects that a packet has been lost or congestion 415 marked, it SHOULD consider the strength of the congestion indication 416 as proportionate to the size in octets (bytes) of the missing or 417 marked packet. 419 In other words, when a packet indicates congestion (by being lost or 420 marked) it can be considered conceptually as if there is a congestion 421 indication on every octet of the packet, not just one indication per 422 packet. 424 Therefore, the IETF transport area should continue its programme of: 426 o updating host-based congestion control protocols to take account 427 of packet size 429 o making transports less sensitive to losing control packets like 430 SYNs and pure ACKs. 432 What this advice means for the case of TCP: 434 1. If two TCP flows with different packet sizes are required to run 435 at equal bit rates under the same path conditions, this should be 436 done by altering TCP (Section 4.2.2), not network equipment (the 437 latter affects other transports besides TCP). 439 2.
If it is desired to improve TCP performance by reducing the 440 chance that a SYN or a pure ACK will be dropped, this should be 441 done by modifying TCP (Section 4.2.3), not network equipment. 443 2.4. Recommendation on Handling Congestion Indications when Splitting 444 or Merging Packets 446 Packets carrying congestion indications may be split or merged in 447 some circumstances (e.g. at an RTCP transcoder or during IP fragment 448 reassembly). Splitting and merging only make sense in the context of 449 ECN, not loss. 451 The general rule to follow is that the number of octets in packets 452 with congestion indications SHOULD be equivalent before and after 453 merging or splitting. This is based on the principle used above: 454 that an indication of congestion on a packet can be considered as an 455 indication of congestion on each octet of the packet. 457 The above rule is not phrased with the word "MUST" to allow the 458 following exception. There are cases where pre-existing protocols 459 were not designed to conserve congestion marked octets (e.g. IP 460 fragment reassembly [RFC3168] or loss statistics in RTCP receiver 461 reports [RFC3550] before ECN was added 462 [I-D.ietf-avtcore-ecn-for-rtp]). When any such protocol is updated, 463 it SHOULD comply with the above rule to conserve marked octets. 464 However, the rule may be relaxed if it would otherwise become too 465 complex to interoperate with pre-existing implementations of the 466 protocol. 468 One can think of a splitting or merging process as if all the 469 incoming congestion-marked octets increment a counter and all the 470 outgoing marked octets decrement the same counter. In order to 471 ensure that congestion indications remain timely, even the smallest 472 positive remainder in the conceptual counter should trigger the next 473 outgoing packet to be marked (causing the counter to go negative). 475 3. Motivating Arguments 477 This section is informative.
It justifies the recommendations given 478 in the previous section. 480 3.1. Avoiding Perverse Incentives to (Ab)use Smaller Packets 482 Increasingly, it is being recognised that a protocol design must take 483 care not to cause unintended consequences by giving the parties in 484 the protocol exchange perverse incentives [Evol_cc][RFC3426]. Given 485 there are many good reasons why larger path maximum transmission 486 units (PMTUs) would help solve a number of scaling issues, we do not 487 want to create any bias against large packets that is greater than 488 their true cost. 490 On a bit-congestible link, the same bit rate of packets contributes 491 the same to congestion irrespective of whether it is 492 sent as fewer larger packets or more smaller packets. A protocol 493 design that caused larger packets to be more likely to be dropped 494 than smaller ones would be dangerous in both the following cases: 496 Malicious transports: A queue that gives an advantage to small 497 packets can be used to amplify the force of a flooding attack. By 498 sending a flood of small packets, the attacker can get the queue 499 to discard more traffic in large packets, allowing more attack 500 traffic to get through to cause further damage. Such a queue 501 allows attack traffic to have a disproportionately large effect on 502 regular traffic without the attacker having to do much work. 504 Non-malicious transports: Even if a transport designer is not 505 actually malicious, if over time it is noticed that small packets 506 tend to go faster, designers will act in their own interest and 507 use smaller packets. Queues that give an advantage to small packets 508 create an evolutionary pressure for transports to send at the same 509 bit-rate but break their data stream down into tiny segments to 510 reduce their drop rate.
Encouraging a high volume of tiny packets 511 might in turn unnecessarily overload a completely unrelated part 512 of the system, perhaps more limited by header-processing than 513 bandwidth. 515 Imagine two unresponsive flows arrive at a bit-congestible 516 transmission link each with the same bit rate, say 1Mbps, but one 517 consists of 1500B and the other 60B packets, which are 25x smaller. 518 Consider a scenario where gentle RED [gentle_RED] is used, along with 519 the variant of RED we advise against, i.e. where the RED algorithm is 520 configured to adjust the drop probability of packets in proportion to 521 each packet's size (byte mode packet drop). In this case, RED aims 522 to drop 25x more of the larger packets than the smaller ones. Thus, 523 for example if RED drops 25% of the larger packets, it will aim to 524 drop 1% of the smaller packets (but in practice it may drop more as 525 congestion increases [RFC4828; Appx B.4]). Even though both flows 526 arrive with the same bit rate, the bit rate the RED queue aims to 527 pass to the line will be 750kbps for the flow of larger packets but 528 990kbps for the smaller packets (because of rate variations it will 529 actually be a little less than this target). 531 Note that, although the byte-mode drop variant of RED amplifies small 532 packet attacks, drop-tail queues amplify small packet attacks even 533 more (see Security Considerations in Section 6). Wherever possible 534 neither should be used. 536 3.2. Small != Control 538 Dropping fewer control packets considerably improves performance. It 539 is tempting to drop small packets with lower probability in order to 540 improve performance, because many control packets are small (TCP SYNs 541 & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc). 
542 However, we must not give control packets preference purely by virtue 543 of their smallness, otherwise it is too easy for any data source to 544 get the same preferential treatment simply by sending data in smaller 545 packets. Again, we should not create perverse incentives that favour 546 small packets when what we intend is to favour control packets. 549 Just because many control packets are small does not mean all small 550 packets are control packets. 552 So, rather than fix these problems in the network, we argue that the 553 transport should be made more robust against losses of control 554 packets (see 'Making Transports Robust against Control Packet Losses' 555 in Section 4.2.3). 557 3.3. Transport-Independent Network 559 TCP congestion control ensures that flows competing for the same 560 resource each maintain the same number of segments in flight, 561 irrespective of segment size. So under similar conditions, flows 562 with different segment sizes will get different bit-rates. 564 One motivation for the network biasing congestion notification by 565 packet size is to counter this effect and try to equalise the 566 bit-rates of flows with different packet sizes. However, in order to do 567 this, the queuing algorithm has to make assumptions about the 568 transport, which become embedded in the network. Specifically: 570 o The queuing algorithm has to assume how aggressively the transport 571 will respond to congestion (see Section 4.2.4). If the network 572 assumes the transport responds as aggressively as TCP NewReno, it 573 will be wrong for Compound TCP and differently wrong for Cubic 574 TCP, etc. To achieve equal bit-rates, each transport then has to 575 guess what assumption the network made, and work out how to 576 replace this assumed aggressiveness with its own aggressiveness.
578 o Also, if the network biases congestion notification by packet size 579 it has to assume a baseline packet size--all proposed algorithms 580 use the local MTU. Then transports have to guess which link was 581 congested and what its local MTU was, in order to know how to 582 tailor their congestion response to that link. 584 Even though reducing the drop probability of small packets (e.g. 585 RED's byte-mode drop) helps ensure TCP flows with different packet 586 sizes will achieve similar bit rates, we argue this correction should 587 be made to any future transport protocols based on TCP, not to the 588 network in order to fix one transport, no matter how predominant it 589 is. Effectively, favouring small packets is reverse engineering of 590 network equipment around one particular transport protocol (TCP), 591 contrary to the excellent advice in [RFC3426], which asks designers 592 to question "Why are you proposing a solution at this layer of the 593 protocol stack, rather than at another layer?" 595 In contrast, if the network never takes account of packet size, the 596 transport can be certain it will never need to guess any assumptions 597 the network has made. And the network passes two pieces of 598 information to the transport that are sufficient in all cases: i) 599 congestion notification on the packet and ii) the size of the packet. 600 Both are available for the transport to combine (by taking account of 601 packet size when responding to congestion) or not. Appendix B checks 602 that these two pieces of information are sufficient for all relevant 603 scenarios. 605 When the network does not take account of packet size, it allows 606 transport protocols to choose whether to take account of packet size 607 or not. 
However, if the network were to bias congestion notification 608 by packet size, transport protocols would have no choice; those that 609 did not take account of packet size themselves would unwittingly 610 become dependent on packet size, and those that already took account 611 of packet size would end up taking account of it twice. 613 3.4. Scaling Congestion Control with Packet Size 615 Having so far justified only our recommendations for the network, 616 this section focuses on the host. We construct a scaling argument to 617 justify the recommendation that a host should respond to a dropped or 618 marked packet in proportion to its size, not just as a single 619 congestion event. 621 The argument assumes that we have already sufficiently justified our 622 recommendation that the network should not take account of packet 623 size. 625 Also, we assume bit-congestible links are the predominant source of 626 congestion. As the Internet stands, it is hard if not impossible to 627 know whether congestion notification is from a bit-congestible or a 628 packet-congestible resource (see Appendix B.2) so we have to assume 629 the most prevalent case (see Section 1.1). If this assumption is 630 wrong, and particular congestion indications are actually due to 631 overload of packet-processing, there is no issue of safety at stake. 632 Any congestion control that triggers a multiplicative decrease in 633 response to a congestion indication will bring packet processing back 634 to its operating point just as quickly. The only issue at stake is 635 that the resource could be utilised more efficiently if packet- 636 congestion could be separately identified. 638 Imagine a bit-congestible link shared by many flows, so that each 639 busy period tends to cause packets to be lost from different flows. 640 Consider further two sources that have the same data rate but break 641 the load into large packets in one application (A) and small packets 642 in the other (B). 
Of course, because the load is the same, there will be proportionately more packets in the small packet flow (B).

If a congestion control scales with packet size it should respond in the same way to the same congestion notification, irrespective of the size of the packets containing the bytes that contribute to congestion.

A bit-congestible queue suffering congestion has to drop or mark the same excess bytes whether they are in a few large packets (A) or many small packets (B). So for the same amount of congestion overload, the same number of bytes has to be shed to get the load back to its operating point. For smaller packets (B) more packets will have to be discarded to shed the same bytes.

If both transports interpret each drop/mark as a single loss event irrespective of the size of the packet dropped, the flow of smaller packets (B) will respond more times to the same congestion. On the other hand, if a transport responds proportionately less when smaller packets are dropped/marked, overall it will be able to respond the same to the same amount of congestion.

Therefore, for a congestion control to scale with packet size it should respond to dropped or marked bytes (as TFRC-SP [RFC4828] effectively does), instead of dropped or marked packets (as TCP does).

For the avoidance of doubt, this is not a recommendation that TCP should be changed so that it scales with packet size. It is a recommendation that any future transport protocol proposal should respond to dropped or marked bytes if it wishes to claim that it is scalable.

3.5.
Implementation Efficiency

Allowing for packet size at the transport rather than in the network ensures that neither the network nor the transport needs to do a multiply operation--multiplication by packet size is effectively achieved as a repeated add when the transport adds to its count of marked bytes as each congestion event is fed to it. This isn't a principled reason in itself, but it is a happy consequence of the other principled reasons.

4. A Survey and Critique of Past Advice

This section is informative, not normative.

The original 1993 paper on RED [RED93] proposed two options for the RED active queue management algorithm: packet mode and byte mode. Packet mode measured the queue length in packets and dropped (or marked) individual packets with a probability independent of their size. Byte mode measured the queue length in bytes and marked an individual packet with probability in proportion to its size (relative to the maximum packet size). In the paper's outline of further work, it was stated that no recommendation had been made on whether the queue size should be measured in bytes or packets, but it was noted that the difference could be significant.

When RED was recommended for general deployment in 1998 [RFC2309], the two modes were mentioned, implying that the choice between them was a question of performance, and referring to a 1997 email [pktByteEmail] for advice on tuning. A later addendum to this email introduced the insight that there are in fact two orthogonal choices:

o whether to measure queue length in bytes or packets (Section 4.1)

o whether the drop probability of an individual packet should depend on its own size (Section 4.2).

The rest of this section is structured accordingly.

4.1. Congestion Measurement Advice

The choice of which metric to use to measure queue length was left open in RFC2309.
It is now well understood that queues for bit-congestible resources should be measured in bytes, and queues for packet-congestible resources should be measured in packets [pktByteEmail].

Congestion in some legacy bit-congestible buffers is only measured in packets, not bytes. In such cases, the operator has to set the thresholds mindful of a typical mix of packet sizes. Any AQM algorithm on such a buffer will be oversensitive to high proportions of small packets, e.g. a DoS attack, and undersensitive to high proportions of large packets. However, there is no need to make allowances for the possibility of such legacy in future protocol design. This is safe because any undersensitivity during unusual traffic mixes cannot lead to congestion collapse given the buffer will eventually revert to tail drop, discarding proportionately more large packets.

4.1.1. Fixed Size Packet Buffers

The question of whether to measure queues in bytes or packets seems to be well understood. However, measuring congestion is not straightforward when the resource is bit congestible but the queue is packet congestible or vice versa. This section outlines the approach to take. There is no controversy over what should be done; you just need to be an expert in probability to work it out. And, even if you know what should be done, it's not always easy to find a practical algorithm to implement it.

Some, mostly older, queuing hardware sets aside fixed sized buffers in which to store each packet in the queue. Also, with some hardware, any fixed sized buffers not completely filled by a packet are padded when transmitted to the wire.
If we imagine a theoretical 748 forwarding system with both queuing and transmission in fixed, MTU- 749 sized units, it should clearly be treated as packet-congestible, 750 because the queue length in packets would be a good model of 751 congestion of the lower layer link. 753 If we now imagine a hybrid forwarding system with transmission delay 754 largely dependent on the byte-size of packets but buffers of one MTU 755 per packet, it should strictly require a more complex algorithm to 756 determine the probability of congestion. It should be treated as two 757 resources in sequence, where the sum of the byte-sizes of the packets 758 within each packet buffer models congestion of the line while the 759 length of the queue in packets models congestion of the queue. Then 760 the probability of congesting the forwarding buffer would be a 761 conditional probability--conditional on the previously calculated 762 probability of congesting the line. 764 In systems that use fixed size buffers, it is unusual for all the 765 buffers used by an interface to be the same size. Typically pools of 766 different sized buffers are provided (Cisco uses the term 'buffer 767 carving' for the process of dividing up memory into these pools 768 [IOSArch]). Usually, if the pool of small buffers is exhausted, 769 arriving small packets can borrow space in the pool of large buffers, 770 but not vice versa. However, it is easier to work out what should be 771 done if we temporarily set aside the possibility of such borrowing. 772 Then, with fixed pools of buffers for different sized packets and no 773 borrowing, the size of each pool and the current queue length in each 774 pool would both be measured in packets. So an AQM algorithm would 775 have to maintain the queue length for each pool, and judge whether to 776 drop/mark a packet of a particular size by looking at the pool for 777 packets of that size and using the length (in packets) of its queue. 
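The per-pool arrangement just described (fixed pools of buffers for different packet sizes, no borrowing, each pool's queue measured in packets) can be sketched as follows. This is a sketch only, not from the draft: the pool sizes are hypothetical, and a linear drop curve stands in for a real RED curve with min/max thresholds and queue averaging.

```python
import random

# Hypothetical buffer pools: (largest packet the pool's buffers hold,
# number of buffers in the pool).  Real hardware carves memory into
# several such pools ('buffer carving' [IOSArch]).
POOLS = [
    (128, 200),    # pool of small buffers
    (1500, 100),   # pool of large (MTU-sized) buffers
]

queue_len = [0, 0]  # per-pool queue length, measured in *packets*

def pool_for(pkt_size):
    """Index of the smallest pool whose buffers can hold the packet."""
    for i, (buf_size, _) in enumerate(POOLS):
        if pkt_size <= buf_size:
            return i
    raise ValueError("packet larger than any buffer")

def drop_probability(qlen, qmax):
    """Stand-in AQM curve: drop probability rises linearly with the
    pool's queue length in packets (real RED would use min/max
    thresholds and an averaged queue length)."""
    return min(1.0, qlen / qmax)

def on_arrival(pkt_size):
    """Judge the packet against the queue of the pool for its size;
    return True if it should be dropped/marked, else enqueue it."""
    i = pool_for(pkt_size)
    qmax = POOLS[i][1]
    if random.random() < drop_probability(queue_len[i], qmax):
        return True
    queue_len[i] += 1  # one buffer = one packet, whatever its byte-size
    return False
```

Note that every queue here is measured in packets, not bytes, in line with the rule given at the end of this subsection.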
We now return to the issue we temporarily set aside: small packets borrowing space in larger buffers. In this case, the only difference is that the pools for smaller packets have a maximum queue size that includes all the pools for larger packets. And every time a packet takes a larger buffer, the current queue size has to be incremented for all queues in the pools of buffers less than or equal to the buffer size used.

We will return to borrowing of fixed sized buffers when we discuss biasing the drop/marking probability of a specific packet because of its size in Section 4.2.1. But here we can give at least one simple rule for how to measure the length of queues of fixed buffers: no matter how complicated the scheme is, ultimately any fixed buffer system will need to measure its queue length in packets, not bytes.

4.1.2. Congestion Measurement without a Queue

AQM algorithms are nearly always described assuming there is a queue for a congested resource and the algorithm can use the queue length to determine the probability that it will drop or mark each packet. But not all congested resources lead to queues. For instance, wireless spectrum is usually regarded as bit-congestible (for a given coding scheme). But wireless link protocols do not always maintain a queue that depends on spectrum interference. Similarly, power limited resources are also usually bit-congestible if energy is primarily required for transmission rather than header processing, but it is rare for a link protocol to build a queue as it approaches maximum power.

Nonetheless, AQM algorithms do not require a queue in order to work. For instance spectrum congestion can be modelled by signal quality using target bit-energy-to-noise-density ratio. And, to model radio power exhaustion, transmission power levels can be measured and compared to the maximum power available.
[ECNFixedWireless] proposes 813 a practical and theoretically sound way to combine congestion 814 notification for different bit-congestible resources at different 815 layers along an end to end path, whether wireless or wired, and 816 whether with or without queues. 818 4.2. Congestion Notification Advice 820 4.2.1. Network Bias when Encoding 822 4.2.1.1. Advice on Packet Size Bias in RED 824 The previously mentioned email [pktByteEmail] referred to by 825 [RFC2309] advised that most scarce resources in the Internet were 826 bit-congestible, which is still believed to be true (Section 1.1). 827 But it went on to offer advice that is updated by this memo. It said 828 that drop probability should depend on the size of the packet being 829 considered for drop if the resource is bit-congestible, but not if it 830 is packet-congestible. The argument continued that if packet drops 831 were inflated by packet size (byte-mode dropping), "a flow's fraction 832 of the packet drops is then a good indication of that flow's fraction 833 of the link bandwidth in bits per second". This was consistent with 834 a referenced policing mechanism being worked on at the time for 835 detecting unusually high bandwidth flows, eventually published in 836 1999 [pBox]. However, the problem could and should have been solved 837 by making the policing mechanism count the volume of bytes randomly 838 dropped, not the number of packets. 840 A few months before RFC2309 was published, an addendum was added to 841 the above archived email referenced from the RFC, in which the final 842 paragraph seemed to partially retract what had previously been said. 843 It clarified that the question of whether the probability of 844 dropping/marking a packet should depend on its size was not related 845 to whether the resource itself was bit congestible, but a completely 846 orthogonal question. 
However, the only example given had the queue measured in packets but packet drop depended on the byte-size of the packet in question. No example was given the other way round.

In 2000, Cnodder et al [REDbyte] pointed out that there was an error in the part of the original 1993 RED algorithm that aimed to distribute drops uniformly, because it didn't correctly take into account the adjustment for packet size. They recommended an algorithm called RED_4 to fix this. But they also recommended a further change, RED_5, to adjust drop rate dependent on the square of relative packet size. This was indeed consistent with one implied motivation behind RED's byte mode drop--that we should reverse engineer the network to improve the performance of dominant end-to-end congestion control mechanisms. This memo makes a different recommendation in Section 2.

By 2003, a further change had been made to the adjustment for packet size, this time in the RED algorithm of the ns2 simulator. Instead of taking each packet's size relative to a `maximum packet size' it was taken relative to a `mean packet size', intended to be a static value representative of the `typical' packet size on the link. We have not been able to find a justification in the literature for this change; however, Eddy and Allman conducted experiments [REDbias] that assessed how sensitive RED was to this parameter, amongst other things. In any case, this changed algorithm can often lead to drop probabilities greater than 1 (which gives a hint that there is probably a mistake in the theory somewhere).

On 10-Nov-2004, this variant of byte-mode packet drop was made the default in the ns2 simulator.
It seems unlikely that byte-mode drop has ever been implemented in production networks (Appendix A); therefore ns2 simulations that use RED without disabling byte-mode drop are likely to behave very differently from RED in production networks, and conclusions drawn from such simulations should be treated with caution.

4.2.1.2. Packet Size Bias Regardless of RED

The byte-mode drop variant of RED is, of course, not the only possible bias towards small packets in queueing systems. We have already mentioned that tail-drop queues naturally tend to lock out large packets once they are full. But also queues with fixed sized buffers reduce the probability that small packets will be dropped if (and only if) they allow small packets to borrow buffers from the pools for larger packets. As was explained in Section 4.1.1 on fixed size buffer carving, borrowing effectively makes the maximum queue size for small packets greater than that for large packets, because more buffers can be used by small packets while fewer will fit large packets.

In itself, the bias towards small packets caused by buffer borrowing is perfectly correct. Lower drop probability for small packets is legitimate in buffer borrowing schemes, because small packets genuinely congest the machine's buffer memory less than large packets, given they can fit in more spaces. The bias towards small packets is not artificially added (as it is in RED's byte-mode drop algorithm); it merely reflects the reality of the way fixed buffer memory gets congested. Incidentally, the bias towards small packets from buffer borrowing is nothing like as large as that of RED's byte-mode drop.

Nonetheless, fixed-buffer memory with tail drop is still prone to lock out large packets, purely because of the tail-drop aspect. So a good AQM algorithm like RED with packet-mode drop should be used with fixed buffer memories where possible.
If RED is too complicated to 910 implement with multiple fixed buffer pools, the minimum necessary to 911 prevent large packet lock-out is to ensure smaller packets never use 912 the last available buffer in any of the pools for larger packets. 914 4.2.2. Transport Bias when Decoding 916 The above proposals to alter the network equipment to bias towards 917 smaller packets have largely carried on outside the IETF process. 918 Whereas, within the IETF, there are many different proposals to alter 919 transport protocols to achieve the same goals, i.e. either to make 920 the flow bit-rate take account of packet size, or to protect control 921 packets from loss. This memo argues that altering transport 922 protocols is the more principled approach. 924 A recently approved experimental RFC adapts its transport layer 925 protocol to take account of packet sizes relative to typical TCP 926 packet sizes. This proposes a new small-packet variant of TCP- 927 friendly rate control [RFC5348] called TFRC-SP [RFC4828]. 928 Essentially, it proposes a rate equation that inflates the flow rate 929 by the ratio of a typical TCP segment size (1500B including TCP 930 header) over the actual segment size [PktSizeEquCC]. (There are also 931 other important differences of detail relative to TFRC, such as using 932 virtual packets [CCvarPktSize] to avoid responding to multiple losses 933 per round trip and using a minimum inter-packet interval.) 935 Section 4.5.1 of this TFRC-SP spec discusses the implications of 936 operating in an environment where queues have been configured to drop 937 smaller packets with proportionately lower probability than larger 938 ones. But it only discusses TCP operating in such an environment, 939 only mentioning TFRC-SP briefly when discussing how to define 940 fairness with TCP. 
And it only discusses the byte-mode dropping 941 version of RED as it was before Cnodder et al pointed out it didn't 942 sufficiently bias towards small packets to make TCP independent of 943 packet size. 945 So the TFRC-SP spec doesn't address the issue of which of the network 946 or the transport _should_ handle fairness between different packet 947 sizes. In its Appendix B.4 it discusses the possibility of both 948 TFRC-SP and some network buffers duplicating each other's attempts to 949 deliberately bias towards small packets. But the discussion is not 950 conclusive, instead reporting simulations of many of the 951 possibilities in order to assess performance but not recommending any 952 particular course of action. 954 The paper originally proposing TFRC with virtual packets (VP-TFRC) 955 [CCvarPktSize] proposed that there should perhaps be two variants to 956 cater for the different variants of RED. However, as the TFRC-SP 957 authors point out, there is no way for a transport to know whether 958 some queues on its path have deployed RED with byte-mode packet drop 959 (except if an exhaustive survey found that no-one has deployed it!-- 960 see Appendix A). Incidentally, VP-TFRC also proposed that byte-mode 961 RED dropping should really square the packet-size compensation-factor 962 (like that of Cnodder's RED_5, but apparently unaware of it). 964 Pre-congestion notification [RFC5670] is an IETF technology to use a 965 virtual queue for AQM marking for packets within one Diffserv class 966 in order to give early warning prior to any real queuing. The PCN 967 marking algorithms have been designed not to take account of packet 968 size when forwarding through queues. Instead the general principle 969 has been to take account of the sizes of marked packets when 970 monitoring the fraction of marking at the edge of the network, as 971 recommended here. 973 4.2.3. 
Making Transports Robust against Control Packet Losses

Recently, two RFCs have defined changes to TCP that make it more robust against losing small control packets [RFC5562] [RFC5690]. In both cases they note that the case for these two TCP changes would be weaker if RED were biased against dropping small packets. We argue here that these two proposals are a safer and more principled way to achieve TCP performance improvements than reverse engineering RED to benefit TCP.

Although there are no known proposals, it would also be possible and perfectly valid to make control packets robust against drop by using their Diffserv code point [RFC2474] to explicitly request a scheduling class with a lower drop probability.

Although not brought to the IETF, a simple proposal from Wischik [DupTCP] suggests that the first three packets of every TCP flow should be routinely duplicated after a short delay. It shows that this would greatly improve the chances of short flows completing quickly, but it would hardly increase traffic levels on the Internet, because Internet bytes have always been concentrated in the large flows. It further shows that the performance of many typical applications depends on completion of long serial chains of short messages. It argues that, given most of the value people get from the Internet is concentrated within short flows, this simple expedient would greatly increase the value of the best-efforts Internet at minimal cost.

4.2.4.
Congestion Notification: Summary of Conflicting Advice

   +-----------+----------------+-----------------+--------------------+
   | transport | RED_1 (packet  | RED_4 (linear   | RED_5 (square byte |
   | cc        | mode drop)     | byte mode drop) | mode drop)         |
   +-----------+----------------+-----------------+--------------------+
   | TCP or    | s/sqrt(p)      | sqrt(s/p)       | 1/sqrt(p)          |
   | TFRC      |                |                 |                    |
   | TFRC-SP   | 1/sqrt(p)      | 1/sqrt(sp)      | 1/(s.sqrt(p))      |
   +-----------+----------------+-----------------+--------------------+

   Table 2: Dependence of flow bit-rate per RTT on packet size, s, and
   drop probability, p, when network and/or transport bias towards
   small packets to varying degrees

Table 2 aims to summarise the potential effects of all the advice from different sources. Each column shows a different possible AQM behaviour in different queues in the network, using the terminology of Cnodder et al outlined earlier (RED_1 is basic RED with packet-mode drop). Each row shows a different transport behaviour: TCP [RFC5681] and TFRC [RFC5348] on the top row with TFRC-SP [RFC4828] below. Each cell shows how the bits per round trip of a flow depends on packet size, s, and drop probability, p. In order to declutter the formulae to focus on packet-size dependence they are all given per round trip, which removes any RTT term.

Let us assume that the goal is for the bit-rate of a flow to be independent of packet size. Suppressing all inessential details, the table shows that this should either be achievable by not altering the TCP transport in a RED_5 network, or by using the small packet TFRC-SP transport (or similar) in a network without any byte-mode dropping RED (top right and bottom left). Top left is the `do nothing' scenario, while bottom right is the `do-both' scenario in which bit-rate would become far too biased towards small packets.
Of course, if any form of byte-mode dropping RED has been deployed on a subset of queues that congest, each path through the network will present a different hybrid scenario to its transport.

In any case, we can see that the linear byte-mode drop column in the middle would considerably complicate the Internet. It's a half-way house that doesn't bias enough towards small packets even if one believes the network should be doing the biasing. Section 2 recommends that _all_ bias in network equipment towards small packets should be turned off--if indeed any equipment vendors have implemented it--leaving packet-size bias solely as the preserve of the transport layer (solely the leftmost, packet-mode drop column).

In practice it seems that no deliberate bias towards small packets has been implemented for production networks. Of the 19% of vendors who responded to a survey of 84 equipment vendors, none had implemented byte-mode drop in RED (see Appendix A for details).

5. Outstanding Issues and Next Steps

5.1. Bit-congestible Network

For a connectionless network with nearly all resources being bit-congestible the recommended position is clear--that the network should not make allowance for packet sizes and the transport should. This leaves two outstanding issues:

o How to handle any legacy of AQM with byte-mode drop already deployed;

o The need to start a programme to update transport congestion control protocol standards to take account of packet size.

A survey of equipment vendors (Section 4.2.4) found no evidence that byte-mode packet drop had been implemented, so deployment will be sparse at best. A migration strategy is not really needed to remove an algorithm that may not even be deployed.
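The second outstanding issue is exemplified by TFRC-SP [RFC4828], which, as described in Section 4.2.2, inflates the rate given by its rate equation by the ratio of a typical TCP segment size (1500B) to the actual segment size. A rough sketch, substituting the simplified square-root TCP throughput model for the full TFRC equation, so the absolute numbers are illustrative only:

```python
from math import sqrt

def tcp_rate(s, rtt, p):
    """Simplified square-root TCP throughput model: bytes/second for
    segment size s (bytes), round trip time rtt (seconds) and loss
    event probability p.  TFRC itself uses a fuller equation."""
    return (s / rtt) * sqrt(3.0 / (2.0 * p))

def tfrc_sp_rate(s, rtt, p, s_typical=1500):
    """TFRC-SP-style adjustment: inflate the rate by the ratio of a
    typical TCP segment size to the actual segment size, making the
    resulting rate independent of the flow's own packet size."""
    return (s_typical / s) * tcp_rate(s, rtt, p)
```

With the plain model, a flow of 60B segments gets 1/25 of the rate of a flow of 1500B segments under the same loss probability; with the TFRC-SP-style adjustment the two rates are equal.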
A programme of experimental updates to take account of packet size in transport congestion control protocols has already started with TFRC-SP [RFC4828].

5.2. Bit- & Packet-congestible Network

The position is much less clear-cut if the Internet becomes populated by a more even mix of both packet-congestible and bit-congestible resources (see Appendix B.2). This problem is not pressing, because most Internet resources are designed to be bit-congestible before packet processing starts to congest (see Section 1.1).

The IRTF Internet congestion control research group (ICCRG) has set itself the task of reaching consensus on generic forwarding mechanisms that are necessary and sufficient to support the Internet's future congestion control requirements (the first challenge in [RFC6077]). The research question of whether packet congestion might become common, and what to do if it does, may in the future be explored in the IRTF ("Challenge 3: Packet Size" in [RFC6077]).

6. Security Considerations

This memo recommends that queues do not bias drop probability towards small packets as this creates a perverse incentive for transports to break down their flows into tiny segments. One of the benefits of implementing AQM was meant to be to remove this perverse incentive that drop-tail queues gave to small packets.

In practice, transports cannot all be trusted to respond to congestion. So another reason for recommending that queues do not bias drop probability towards small packets is to avoid the vulnerability to small packet DDoS attacks that would otherwise result. One of the benefits of implementing AQM was meant to be to remove drop-tail's DoS vulnerability to small packets, so we shouldn't add it back again.
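The potency of a small packet attack under byte-mode drop can be illustrated with a toy model (not from this memo): each queue in a chain drops with probability scaled by packet size, using the example figures from earlier in the memo (25% drop for 1500B packets implies 1% drop for 60B packets under byte-mode drop).

```python
def byte_mode_drop_prob(p_max, pkt_size, max_pkt_size=1500):
    """Byte-mode packet drop, as advised against in this memo: the
    drop probability is scaled by packet size relative to the
    maximum packet size."""
    return p_max * pkt_size / max_pkt_size

def survival(pkt_size, p_max, n_queues):
    """Fraction of packets of a given size that survive a chain of
    identically loaded queues all doing byte-mode drop."""
    return (1.0 - byte_mode_drop_prob(p_max, pkt_size)) ** n_queues
```

Across five such queues, 1500B packets survive with probability 0.75^5 (about 0.24) while 60B packets survive with probability 0.99^5 (about 0.95): the flood of small packets pushes through while large packets are pushed aside.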
If most queues implemented AQM with byte-mode drop, the resulting network would amplify the potency of a small packet DDoS attack. At the first queue the stream of packets would push aside a greater proportion of large packets, so more of the small packets would survive to attack the next queue. Thus a flood of small packets would continue on towards the destination, pushing regular traffic with large packets out of the way in one queue after the next, but suffering much less drop itself.

Appendix C explains why the ability of networks to police the response of _any_ transport to congestion depends on bit-congestible network resources only doing packet-mode not byte-mode drop. In summary, it says that making drop probability depend on the size of the packets that bits happen to be divided into simply encourages the bits to be divided into smaller packets. Byte-mode drop would therefore irreversibly complicate any attempt to fix the Internet's incentive structures.

7. Conclusions

This memo identifies the three distinct stages of the congestion notification process where implementations need to decide whether to take packet size into account. The recommendations provided in Section 2 of this memo are different in each case:

o When network equipment measures the length of a queue, whether it counts in bytes or packets depends on whether the network resource is congested respectively by bytes or by packets.

o When network equipment decides whether to drop (or mark) a packet, it is recommended that the size of the particular packet should not be taken into account.

o However, when a transport algorithm responds to a dropped or marked packet, the size of the rate reduction should be proportionate to the size of the packet.
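The third recommendation can be sketched as transport-side accounting: accumulate the bytes of packets that indicate congestion (a repeated add, echoing Section 3.5) rather than counting congestion events. The class and method names below are illustrative, not from this memo.

```python
class CongestedByteCounter:
    """Transport-side sketch: respond to dropped/marked *bytes*, so
    the response is proportionate to the size of each packet that
    indicates congestion (illustrative names, not from the memo)."""

    def __init__(self):
        self.marked_bytes = 0
        self.sent_bytes = 0

    def on_packet_sent(self, pkt_size):
        self.sent_bytes += pkt_size

    def on_congestion_indication(self, pkt_size):
        # Repeated add -- no multiply operation needed (Section 3.5).
        self.marked_bytes += pkt_size

    def congestion_fraction(self):
        """Fraction of bytes congested; this could drive a byte-based
        rate equation such as TFRC-SP's, instead of a per-packet loss
        event rate."""
        if self.sent_bytes == 0:
            return 0.0
        return self.marked_bytes / self.sent_bytes
```

With this accounting, the loss of one 60B packet reduces the rate 25x less than the loss of one 1500B packet, as the scaling argument in Section 3.4 requires.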
In summary, the answers are 'it depends', 'no' and 'yes' respectively.

For the specific case of RED, this means that byte-mode queue measurement will often be appropriate although byte-mode drop is strongly deprecated.

At the transport layer the IETF should continue updating congestion control protocols to take account of the size of each packet that indicates congestion. Also the IETF should continue to make protocols less sensitive to losing control packets like SYNs, pure ACKs and DNS exchanges. Although many control packets happen to be small, the alternative of network equipment favouring all small packets would be dangerous. That would create perverse incentives to split data transfers into smaller packets.

The memo develops these recommendations from principled arguments concerning scaling, layering, incentives, inherent efficiency, security and policeability. But it also addresses practical issues such as specific buffer architectures and incremental deployment. Indeed, a limited survey of RED implementations is discussed, which shows there appears to be little, if any, installed base of RED's byte-mode drop. Therefore it can be deprecated with little, if any, incremental deployment complications.

The recommendations have been developed on the well-founded basis that most Internet resources are bit-congestible not packet-congestible. We need to know the likelihood that this assumption will prevail longer term and, if it might not, what protocol changes will be needed to cater for a mix of the two. The IRTF Internet Congestion Control Research Group (ICCRG) is currently working on these problems [RFC6077].

8. Acknowledgements

Thank you to Sally Floyd, who gave extensive and useful review comments.
Also thanks for the reviews from Philip Eardley, David 1182 Black, Fred Baker, Toby Moncaster, Arnaud Jacquet and Mirja 1183 Kuehlewind as well as helpful explanations of different hardware 1184 approaches from Larry Dunn and Fred Baker. We are grateful to Bruce 1185 Davie and his colleagues for providing a timely and efficient survey 1186 of RED implementation in Cisco's product range. Also grateful thanks 1187 to Toby Moncaster, Will Dormann, John Regnault, Simon Carter and 1188 Stefaan De Cnodder who further helped survey the current status of 1189 RED implementation and deployment and, finally, thanks to the 1190 anonymous individuals who responded. 1192 Bob Briscoe and Jukka Manner were partly funded by Trilogy, a 1193 research project (ICT- 216372) supported by the European Community 1194 under its Seventh Framework Programme. The views expressed here are 1195 those of the authors only. 1197 9. Comments Solicited 1199 Comments and questions are encouraged and very welcome. They can be 1200 addressed to the IETF Transport Area working group mailing list 1201 , and/or to the authors. 1203 10. References 1205 10.1. Normative References 1207 [RFC2119] Bradner, S., "Key words for use in 1208 RFCs to Indicate Requirement Levels", 1209 BCP 14, RFC 2119, March 1997. 1211 [RFC2309] Braden, B., Clark, D., Crowcroft, J., 1212 Davie, B., Deering, S., Estrin, D., 1213 Floyd, S., Jacobson, V., Minshall, 1214 G., Partridge, C., Peterson, L., 1215 Ramakrishnan, K., Shenker, S., 1216 Wroclawski, J., and L. Zhang, 1217 "Recommendations on Queue Management 1218 and Congestion Avoidance in the 1219 Internet", RFC 2309, April 1998. 1221 [RFC3168] Ramakrishnan, K., Floyd, S., and D. 1222 Black, "The Addition of Explicit 1223 Congestion Notification (ECN) to IP", 1224 RFC 3168, September 2001. 1226 [RFC3426] Floyd, S., "General Architectural and 1227 Policy Considerations", RFC 3426, 1228 November 2002. 1230 10.2. 
Informative References 1232 [CCvarPktSize] Widmer, J., Boutremans, C., and J-Y. 1233 Le Boudec, "Congestion Control for 1234 Flows with Variable Packet Size", ACM 1235 CCR 34(2) 137--151, 2004, . 1238 [CHOKe_Var_Pkt] Psounis, K., Pan, R., and B. 1239 Prabhaker, "Approximate Fair Dropping 1240 for Variable Length Packets", IEEE 1241 Micro 21(1):48--56, January- 1242 February 2001, . 1246 [DRQ] Shin, M., Chong, S., and I. Rhee, 1247 "Dual-Resource TCP/AQM for 1248 Processing-Constrained Networks", 1249 IEEE/ACM Transactions on 1250 Networking Vol 16, issue 2, 1251 April 2008, . 1254 [DupTCP] Wischik, D., "Short messages", Royal 1255 Society workshop on networks: 1256 modelling and control , 1257 September 2007, . 1261 [ECNFixedWireless] Siris, V., "Resource Control for 1262 Elastic Traffic in CDMA Networks", 1263 Proc. ACM MOBICOM'02 , 1264 September 2002, . 1268 [Evol_cc] Gibbens, R. and F. Kelly, "Resource 1269 pricing and the evolution of 1270 congestion control", 1271 Automatica 35(12)1969--1985, 1272 December 1999, . 1276 [I-D.ietf-avtcore-ecn-for-rtp] Westerlund, M., Johansson, I., 1277 Perkins, C., O'Hanlon, P., and K. 1278 Carlberg, "Explicit Congestion 1279 Notification (ECN) for RTP over UDP", 1280 draft-ietf-avtcore-ecn-for-rtp-06 1281 (work in progress), February 2012. 1283 [I-D.ietf-conex-concepts-uses] Briscoe, B., Woundy, R., and A. 1284 Cooper, "ConEx Concepts and Use 1285 Cases", 1286 draft-ietf-conex-concepts-uses-03 1287 (work in progress), October 2011. 1289 [IOSArch] Bollapragada, V., White, R., and C. 1291 Murphy, "Inside Cisco IOS Software 1292 Architecture", Cisco Press: CCIE 1293 Professional Development ISBN13: 978- 1294 1-57870-181-0, July 2000. 1296 [PktSizeEquCC] Vasallo, P., "Variable Packet Size 1297 Equation-Based Congestion Control", 1298 ICSI Technical Report tr-00-008, 1299 2000, . 1303 [RED93] Floyd, S. and V. 
Jacobson, "Random 1304 Early Detection (RED) gateways for 1305 Congestion Avoidance", IEEE/ACM 1306 Transactions on Networking 1(4) 397-- 1307 413, August 1993, . 1311 [REDbias] Eddy, W. and M. Allman, "A Comparison 1312 of RED's Byte and Packet Modes", 1313 Computer Networks 42(3) 261--280, 1314 June 2003, . 1317 [REDbyte] De Cnodder, S., Elloumi, O., and K. 1318 Pauwels, "RED behavior with different 1319 packet sizes", Proc. 5th IEEE 1320 Symposium on Computers and 1321 Communications (ISCC) 793--799, 1322 July 2000, . 1325 [RFC2474] Nichols, K., Blake, S., Baker, F., 1326 and D. Black, "Definition of the 1327 Differentiated Services Field (DS 1328 Field) in the IPv4 and IPv6 Headers", 1329 RFC 2474, December 1998. 1331 [RFC3550] Schulzrinne, H., Casner, S., 1332 Frederick, R., and V. Jacobson, "RTP: 1333 A Transport Protocol for Real-Time 1334 Applications", STD 64, RFC 3550, 1335 July 2003. 1337 [RFC3714] Floyd, S. and J. Kempf, "IAB Concerns 1338 Regarding Congestion Control for 1339 Voice Traffic in the Internet", 1340 RFC 3714, March 2004. 1342 [RFC4828] Floyd, S. and E. Kohler, "TCP 1343 Friendly Rate Control (TFRC): The 1344 Small-Packet (SP) Variant", RFC 4828, 1345 April 2007. 1347 [RFC5348] Floyd, S., Handley, M., Padhye, J., 1348 and J. Widmer, "TCP Friendly Rate 1349 Control (TFRC): Protocol 1350 Specification", RFC 5348, 1351 September 2008. 1353 [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, 1354 S., and K. Ramakrishnan, "Adding 1355 Explicit Congestion Notification 1356 (ECN) Capability to TCP's SYN/ACK 1357 Packets", RFC 5562, June 2009. 1359 [RFC5670] Eardley, P., "Metering and Marking 1360 Behaviour of PCN-Nodes", RFC 5670, 1361 November 2009. 1363 [RFC5681] Allman, M., Paxson, V., and E. 1364 Blanton, "TCP Congestion Control", 1365 RFC 5681, September 2009. 1367 [RFC5690] Floyd, S., Arcia, A., Ros, D., and J. 1368 Iyengar, "Adding Acknowledgement 1369 Congestion Control to TCP", RFC 5690, 1370 February 2010. 
1372 [RFC6077] Papadimitriou, D., Welzl, M., Scharf, 1373 M., and B. Briscoe, "Open Research 1374 Issues in Internet Congestion 1375 Control", RFC 6077, February 2011. 1377 [Rate_fair_Dis] Briscoe, B., "Flow Rate Fairness: 1378 Dismantling a Religion", ACM 1379 CCR 37(2)63--74, April 2007, . 1383 [gentle_RED] Floyd, S., "Recommendation on using 1384 the "gentle_" variant of RED", Web 1385 page , March 2000, . 1388 [pBox] Floyd, S. and K. Fall, "Promoting the 1389 Use of End-to-End Congestion Control 1390 in the Internet", IEEE/ACM 1391 Transactions on Networking 7(4) 458-- 1392 472, August 1999, . 1396 [pktByteEmail] Floyd, S., "RED: Discussions of Byte 1397 and Packet Modes", Web page Red Queue 1398 Management, March 1997, . 1402 Appendix A. Survey of RED Implementation Status 1404 This Appendix is informative, not normative. 1406 In May 2007 a survey was conducted of 84 vendors to assess how widely 1407 drop probability based on packet size has been implemented in RED 1408 (Table 3). About 19% of those surveyed replied, giving a sample size 1409 of 16. Although in most cases we do not have permission to identify 1410 the respondents, we can say that those that have responded include 1411 most of the larger equipment vendors, covering a large fraction of 1412 the market. The two who gave permission to be identified were Cisco 1413 and Alcatel-Lucent. The others range across the large network 1414 equipment vendors at L3 & L2, firewall vendors, wireless equipment 1415 vendors, as well as large software businesses with a small selection 1416 of networking products. All those who responded confirmed that they 1417 have not implemented the variant of RED with drop dependent on packet 1418 size (2 were fairly sure they had not but needed to check more 1419 thoroughly). At the time the survey was conducted, Linux did not 1420 implement RED with packet-size bias of drop, although we have not 1421 investigated a wider range of open source code.
1423 +-------------------------------+----------------+-----------------+
1424 | Response                      | No. of vendors | %age of vendors |
1425 +-------------------------------+----------------+-----------------+
1426 | Not implemented               | 14             | 17%             |
1427 | Not implemented (probably)    | 2              | 2%              |
1428 | Implemented                   | 0              | 0%              |
1429 | No response                   | 68             | 81%             |
1430 | Total companies/orgs surveyed | 84             | 100%            |
1431 +-------------------------------+----------------+-----------------+

1433 Table 3: Vendor Survey on byte-mode drop variant of RED (lower drop 1434 probability for small packets)

1436 Where reasons have been given, the extra complexity of packet bias 1437 code has been most prevalent, though one vendor had a more principled 1438 reason for avoiding it--similar to the argument of this document. 1440 Our survey was of vendor implementations, so we cannot be certain 1441 about operator deployment. But we believe many queues in the 1442 Internet are still tail-drop. The company of one of the co-authors 1443 (BT) has widely deployed RED, but many tail-drop queues are bound to 1444 still exist, particularly in access network equipment and on 1445 middleboxes like firewalls, where RED is not always available. 1447 Routers using a memory architecture based on fixed size buffers with 1448 borrowing may also still be prevalent in the Internet. As explained 1449 in Section 4.2.1, these also provide a marginal (but legitimate) bias 1450 towards small packets. So even though RED byte-mode drop is not 1451 prevalent, it is likely there is still some bias towards small 1452 packets in the Internet due to tail drop and fixed buffer borrowing. 1454 Appendix B. Sufficiency of Packet-Mode Drop 1456 This Appendix is informative, not normative. 1458 Here we check that packet-mode drop (or marking) in the network gives 1459 sufficiently generic information for the transport layer to use. We 1460 check against a 2x2 matrix of four scenarios that may occur now or in 1461 the future (Table 4).
The horizontal and vertical dimensions have 1462 been chosen because each tests extremes of sensitivity to packet size 1463 in the transport and in the network respectively. 1465 Note that this section does not consider byte-mode drop at all. 1466 Having deprecated byte-mode drop, the goal here is to check that 1467 packet-mode drop will be sufficient in all cases.

1469 +-------------------------------+-----------------+-----------------+
1470 | Transport                     | a) Independent  | b) Dependent on |
1471 |                               | of packet size  | packet size of  |
1472 | Network                       | of congestion   | congestion      |
1473 |                               | notifications   | notifications   |
1474 +-------------------------------+-----------------+-----------------+
1475 | 1) Predominantly              | Scenario a1)    | Scenario b1)    |
1476 |    bit-congestible network    |                 |                 |
1477 | 2) Mix of bit-congestible and | Scenario a2)    | Scenario b2)    |
1478 |    pkt-congestible network    |                 |                 |
1479 +-------------------------------+-----------------+-----------------+

1481 Table 4: Four Possible Congestion Scenarios

1483 Appendix B.1 focuses on the horizontal dimension of Table 4, checking 1484 that packet-mode drop (or marking) gives sufficient information, 1485 whether or not the transport uses it--scenarios b) and a) 1486 respectively. 1488 Appendix B.2 focuses on the vertical dimension of Table 4, checking 1489 that packet-mode drop gives sufficient information to the transport 1490 whether resources in the network are bit-congestible or packet- 1491 congestible (these terms are defined in Section 1.1). 1493 Notation: To be concrete, we will compare two flows with different 1494 packet sizes, s_1 and s_2. As an example, we will take s_1 = 60B 1495 = 480b and s_2 = 1500B = 12,000b. 1497 A flow's bit rate, x [bps], is related to its packet rate, u 1498 [pps], by 1500 x(t) = s.u(t). 1502 In the bit-congestible case, path congestion will be denoted by 1503 p_b, and in the packet-congestible case by p_p.
When either case 1504 is implied, the letter p alone will denote path congestion. 1506 B.1. Packet-Size (In)Dependence in Transports 1508 In all cases we consider a packet-mode drop queue that indicates 1509 congestion by dropping (or marking) packets with probability p 1510 irrespective of packet size. We use an example value of loss 1511 (marking) probability, p=0.1%. 1513 A transport like RFC5681 TCP treats a congestion notification on any 1514 packet, whatever its size, as one event. However, a network with just 1515 the packet-mode drop algorithm does give more information if the 1516 transport chooses to use it. We will use Table 5 to illustrate this. 1518 We will set aside the last column until later. The columns labelled 1519 "Flow 1" and "Flow 2" compare two flows consisting of 60B and 1500B 1520 packets respectively. The body of the table considers two separate 1521 cases, one where the flows have equal bit-rate and the other with 1522 equal packet-rates. In both cases, the two flows fill a 96Mbps link. 1523 Therefore, in the equal bit-rate case they each have half the bit- 1524 rate (48Mbps). Whereas, with equal packet-rates, flow 1 uses 25 1525 times smaller packets so it gets 25 times less bit-rate--it only gets 1526 1/(1+25) of the link capacity (96Mbps/26 = 4Mbps after rounding). In 1527 contrast flow 2 gets 25 times more bit-rate (92Mbps) in the equal 1528 packet rate case because its packets are 25 times larger. The packet 1529 rate shown for each flow could easily be derived once the bit-rate 1530 was known by dividing bit-rate by packet size, as shown in the column 1531 labelled "Formula".
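The figures in Table 5 below follow from this setup. As a cross-check, the following short sketch (illustrative only; not part of the memo's recommendations) reproduces the headline numbers for both cases:

```python
# Cross-check of the Table 5 arithmetic (illustrative only).
# Packets are dropped with p = 0.1% irrespective of size (packet-mode drop).

p = 0.001                # drop (or marking) probability
s1, s2 = 480, 12_000     # packet sizes in bits (60 B and 1,500 B)
link = 96e6              # shared link capacity in bits per second

# Equal bit-rate case: each flow gets 48 Mbps.
x1 = x2 = link / 2
u1, u2 = x1 / s1, x2 / s2               # packet rates: 100,000 and 4,000 pps
pkt_loss = (p * u1, p * u2)             # absolute pkt-loss-rates: 100 and 4 pps
bit_loss = (p * u1 * s1, p * u2 * s2)   # absolute bit-loss-rates: 48 kbps each

# Equal packet-rate case: both flows send u pps and together fill the link.
u = link / (s1 + s2)                    # ~7,692 pps each (~8 kpps in the table)
rates = (u * s1, u * s2)                # bit-rates: ~4 Mbps and ~92 Mbps
combined_bit_loss = p * (u * s1 + u * s2)  # 96,000 bps = 0.1% of the link

# Loss *ratios* are p in every case, whether counted in packets or bits.
assert abs(p * u1 / u1 - p) < 1e-12
assert abs(p * u1 * s1 / (u1 * s1) - p) < 1e-12
```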
1533 Parameter               Formula       Flow 1    Flow 2  Combined
1534 ----------------------- -----------  -------  --------  --------
1535 Packet size             s/8              60B    1,500B     (Mix)
1536 Packet size             s               480b   12,000b     (Mix)
1537 Pkt loss probability    p               0.1%      0.1%      0.1%

1539 EQUAL BIT-RATE CASE
1540 Bit-rate                x             48Mbps    48Mbps    96Mbps
1541 Packet-rate             u = x/s      100kpps     4kpps   104kpps
1542 Absolute pkt-loss-rate  p*u           100pps      4pps    104pps
1543 Absolute bit-loss-rate  p*u*s         48kbps    48kbps    96kbps
1544 Ratio of lost/sent pkts p*u/u           0.1%      0.1%      0.1%
1545 Ratio of lost/sent bits p*u*s/(u*s)     0.1%      0.1%      0.1%

1547 EQUAL PACKET-RATE CASE
1548 Bit-rate                x              4Mbps    92Mbps    96Mbps
1549 Packet-rate             u = x/s        8kpps     8kpps    15kpps
1550 Absolute pkt-loss-rate  p*u             8pps      8pps     15pps
1551 Absolute bit-loss-rate  p*u*s          4kbps    92kbps    96kbps
1552 Ratio of lost/sent pkts p*u/u           0.1%      0.1%      0.1%
1553 Ratio of lost/sent bits p*u*s/(u*s)     0.1%      0.1%      0.1%

1555 Table 5: Absolute Loss Rates and Loss Ratios for Flows of Small and 1556 Large Packets and Both Combined

1558 So far we have merely set up the scenarios. We now consider 1559 congestion notification in these scenarios. Two TCP flows with the same 1560 round trip time aim to equalise their packet-loss-rates over time. 1561 That is, the number of packets lost in a second is the packets 1562 per second (u) multiplied by the probability that each one is dropped 1563 (p). Thus TCP converges on the "Equal packet-rate" case, where both 1564 flows aim for the same "Absolute packet-loss-rate" (both 8pps in the 1565 table). 1567 Packet-mode drop actually gives flows sufficient information to 1568 measure their loss-rate in bits per second, if they choose, not just 1569 packets per second. Each flow can count the size of a lost or marked 1570 packet and scale its rate-response in proportion (as TFRC-SP does).
1571 The result is shown in the row entitled "Absolute bit-loss-rate", 1572 where the bits lost in a second is the packets per second (u) 1573 multiplied by the probability of losing a packet (p) multiplied by 1574 the packet size (s). Such an algorithm would try to remove any 1575 imbalance in bit-loss-rate such as the wide disparity in the "Equal 1576 packet-rate" case (4kbps vs. 92kbps). Instead, a packet-size- 1577 dependent algorithm would drive both flows towards the "Equal 1578 bit-rate" case, by aiming for equal bit-loss-rates (both 48kbps 1579 in this example). 1581 The explanation so far has assumed that each flow consists of packets 1582 of only one constant size. Nonetheless, it extends naturally to 1583 flows with mixed packet sizes. In the right-most column of Table 5 a 1584 flow of mixed size packets is created simply by considering flow 1 1585 and flow 2 as a single aggregated flow. There is no need for a flow 1586 to maintain an average packet size. It is only necessary for the 1587 transport to scale its response to each congestion indication by the 1588 size of each individual lost (or marked) packet. Taking for example 1589 the "Equal packet-rate" case, in one second about 8 small packets and 1590 8 large packets are lost (more precisely, about 7.7 of each, making 1591 closer to 15 than 16 losses per second). If the transport multiplies 1592 each loss by its size, in one second it responds to about 7.7*480b and 1593 7.7*12,000b lost bits, adding up to 96,000 lost bits in a second. This 1594 double checks correctly, being the same as 0.1% of the total bit-rate 1595 of 96Mbps. For completeness, the formula for absolute bit-loss-rate is 1596 p(u1*s1+ u2*s2). 1598 Incidentally, a transport will always measure the loss probability 1599 the same irrespective of whether it measures in packets or in bytes. 1600 In other words, the ratio of lost to sent packets will be the same as 1601 the ratio of lost to sent bytes.
(This is why TCP's bit rate is 1602 still proportional to packet size even when byte-counting is used, as 1603 recommended for TCP in [RFC5681], mainly for orthogonal security 1604 reasons.) This is intuitively obvious by comparing two example 1605 flows; one with 60B packets, the other with 1500B packets. If both 1606 flows pass through a queue with drop probability 0.1%, each flow will 1607 lose 1 in 1,000 packets. In the stream of 60B packets the ratio of 1608 bytes lost to sent will be 60B in every 60,000B; and in the stream of 1609 1500B packets, the loss ratio will be 1,500B out of 1,500,000B. When 1610 the transport responds to the ratio of lost to sent packets, it will 1611 measure the same ratio whether it measures in packets or bytes: 0.1% 1612 in both cases. The fact that this ratio is the same whether measured 1613 in packets or bytes can be seen in Table 5, where the ratio of lost 1614 to sent packets and the ratio of lost to sent bytes is always 0.1% in 1615 all cases (recall that the scenario was set up with p=0.1%). 1617 This discussion of how the ratio can be measured in packets or bytes 1618 is only raised here to highlight that it is irrelevant to this memo! 1619 Whether a transport depends on packet size or not depends on how this 1620 ratio is used within the congestion control algorithm. 1622 So far we have shown that packet-mode drop passes sufficient 1623 information to the transport layer so that the transport can take 1624 account of bit-congestion, by using the sizes of the packets that 1625 indicate congestion. We have also shown that the transport can 1626 choose not to take packet size into account if it wishes. We will 1627 now consider whether the transport can know which to do. 1629 B.2. Bit-Congestible and Packet-Congestible Indications 1631 As a thought-experiment, imagine an idealised congestion notification 1632 protocol that supports both bit-congestible and packet-congestible 1633 resources. 
It would require at least two ECN flags, one for each of 1634 bit-congestible and packet-congestible resources. 1636 1. A packet-congestible resource trying to code congestion level p_p 1637 into a packet stream should mark the idealised `packet 1638 congestion' field in each packet with probability p_p 1639 irrespective of the packet's size. The transport should then 1640 take a packet with the packet congestion field marked to mean 1641 just one mark, irrespective of the packet size. 1643 2. A bit-congestible resource trying to code time-varying byte- 1644 congestion level p_b into a packet stream should mark the `byte 1645 congestion' field in each packet with probability p_b, again 1646 irrespective of the packet's size. Unlike before, the transport 1647 should take a packet with the byte congestion field marked to 1648 count as a mark on each byte in the packet. 1650 This hides a fundamental problem--much more fundamental than whether 1651 we can magically create header space for yet another ECN flag, or 1652 whether it would work while being deployed incrementally. 1653 Distinguishing drop from delivery naturally provides just one 1654 implicit bit of congestion indication information--the packet is 1655 either dropped or not. It is hard to drop a packet in two ways that 1656 are distinguishable remotely. This is a similar problem to that of 1657 distinguishing wireless transmission losses from congestive losses. 1659 This problem would not be solved even if ECN were universally 1660 deployed. A congestion notification protocol must survive a 1661 transition from low levels of congestion to high. Marking two states 1662 is feasible with explicit marking, but much harder if packets are 1663 dropped. Also, it will not always be cost-effective to implement AQM 1664 at every low level resource, so drop will often have to suffice. 
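As a purely conceptual illustration of the two accounting rules above (the idealised two-flag protocol is a thought experiment, not a proposal, and these variable names are invented here), the expected mark counts for a mixed stream would be:

```python
# Conceptual illustration only: expected marks under the idealised
# two-flag protocol for a stream of n small and n large packets.

p_p, p_b = 0.001, 0.001          # packet- and bit-congestion levels
n = 100_000                      # packets of each size
s_small, s_large = 480, 12_000   # packet sizes in bits

# Packet-congestion field: one mark per marked packet, size-independent,
# so the expected count depends only on the number of packets sent.
expected_pkt_marks = p_p * (n + n)                       # 200 marks

# Byte-congestion field: a mark counts on every bit of the marked
# packet, so the expected count scales with the volume of bits sent.
expected_bit_marks = p_b * (n * s_small + n * s_large)   # 1,248,000 bits
```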
1666 We are not saying two ECN fields will be needed (and we are not 1667 saying that somehow a resource should be able to drop a packet in one 1668 of two different ways so that the transport can distinguish which 1669 sort of drop it was!). These two congestion notification channels 1670 are a conceptual device to illustrate a dilemma we could face in the 1671 future. Section 3 gives four good reasons why it would be a bad idea 1672 to allow for packet size by biasing drop probability in favour of 1673 small packets within the network. The impracticality of our thought 1674 experiment shows that it will be hard to give transports a practical 1675 way to know whether to take account of the size of congestion 1676 indication packets or not. 1678 Fortunately, this dilemma is not pressing because by design most 1679 equipment becomes bit-congested before its packet-processing becomes 1680 congested (as already outlined in Section 1.1). Therefore transports 1681 can be designed on the relatively sound assumption that a congestion 1682 indication will usually imply bit-congestion. 1684 Nonetheless, although the above idealised protocol isn't intended for 1685 implementation, we do want to emphasise that research is needed to 1686 predict whether there are good reasons to believe that packet 1687 congestion might become more common, and if so, to find a way to 1688 somehow distinguish between bit and packet congestion [RFC3714]. 1690 Recently, the dual resource queue (DRQ) proposal [DRQ] has been made 1691 on the premise that, as network processors become more cost 1692 effective, per packet operations will become more complex 1693 (irrespective of whether more function in the network is desirable). 1694 Consequently the premise is that CPU congestion will become more 1695 common. DRQ is a proposed modification to the RED algorithm that 1696 folds both bit congestion and packet congestion into one signal 1697 (either loss or ECN). 
1699 Finally, we note one further complication. Strictly, packet- 1700 congestible resources are often cycle-congestible. For instance, for 1701 routing look-ups, load depends on the complexity of each look-up and 1702 whether the pattern of arrivals is amenable to caching or not. This 1703 also reminds us that any solution must not require a forwarding 1704 engine to use excessive processor cycles in order to decide how to 1705 say it has no spare processor cycles. 1707 Appendix C. Byte-mode Drop Complicates Policing Congestion Response 1709 This section is informative, not normative. 1711 There are two main classes of approach to policing congestion 1712 response: i) policing at each bottleneck link or ii) policing at the 1713 edges of networks. Packet-mode drop in RED is compatible with 1714 either, while byte-mode drop precludes edge policing. 1716 The simplicity of an edge policer relies on one dropped or marked 1717 packet being equivalent to another of the same size without having to 1718 know which link the drop or mark occurred at. However, the byte-mode 1719 drop algorithm has to depend on the local MTU of the line--it needs 1720 to use some concept of a 'normal' packet size. Therefore, one 1721 dropped or marked packet from a byte-mode drop algorithm is not 1722 necessarily equivalent to another from a different link. A policing 1723 function local to the link can know the local MTU where the 1724 congestion occurred. However, a policer at the edge of the network 1725 cannot, at least not without a lot of complexity. 1727 The early research proposals for type (i) policing at a bottleneck 1728 link [pBox] used byte-mode drop, then detected flows that contributed 1729 disproportionately to the number of packets dropped. However, with 1730 no extra complexity, later proposals used packet-mode drop and looked 1731 for flows that contributed a disproportionate amount of dropped bytes 1732 [CHOKe_Var_Pkt].
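The counting side of these two detector styles can be sketched as follows (illustrative only; the drop discipline is abstracted away, the flow names are invented, and both published schemes are far more sophisticated):

```python
# Illustrative only: a bottleneck policer can attribute drops to flows
# by packet count or by byte count.  Counting dropped bytes keeps the
# policer robust if a flow re-divides its bits into smaller packets.

drops = [("flowA", 60), ("flowB", 1500), ("flowA", 60), ("flowA", 60)]

by_packets, by_bytes = {}, {}
for flow, size_bytes in drops:
    by_packets[flow] = by_packets.get(flow, 0) + 1
    by_bytes[flow] = by_bytes.get(flow, 0) + size_bytes

print(by_packets)  # flowA dominates by packet count: {'flowA': 3, 'flowB': 1}
print(by_bytes)    # flowB dominates by byte count: {'flowA': 180, 'flowB': 1500}
```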
1734 Work is progressing on the congestion exposure protocol (ConEx 1735 [I-D.ietf-conex-concepts-uses]), which enables a type (ii) edge 1736 policer located at a user's attachment point. The idea is to be able 1737 to take an integrated view of the effect of all a user's traffic on 1738 any link in the internetwork. However, byte-mode drop would 1739 effectively preclude such edge policing because of the MTU issue 1740 above. 1742 Indeed, making drop probability depend on the size of the packets 1743 that bits happen to be divided into would simply encourage the bits 1744 to be divided into smaller packets in order to confuse policing. In 1745 contrast, as long as a dropped/marked packet is taken to mean that 1746 all the bytes in the packet are dropped/marked, a policer can remain 1747 robust against bits being re-divided into different size packets or 1748 across different size flows [Rate_fair_Dis]. 1750 Appendix D. Changes from Previous Versions 1752 To be removed by the RFC Editor on publication. 1754 Full incremental diffs between each version are available at 1755 1756 (courtesy of the rfcdiff tool): 1758 From -06 to -07: 1760 * A mix-up with the corollaries and their naming in 2.1 to 2.3 1761 fixed. 1763 From -05 to -06: 1765 * Primarily editorial fixes. 1767 From -04 to -05: 1769 * Changed from Informational to BCP and highlighted non-normative 1770 sections and appendices 1772 * Removed language about consensus 1774 * Added "Example Comparing Packet-Mode Drop and Byte-Mode Drop" 1776 * Arranged "Motivating Arguments" into a more logical order and 1777 completely rewrote "Transport-Independent Network" & "Scaling 1778 Congestion Control with Packet Size" arguments. Removed "Why 1779 Now?" 
1781 * Clarified applicability of certain recommendations 1783 * Shifted vendor survey to an Appendix 1785 * Cut down "Outstanding Issues and Next Steps" 1787 * Re-drafted the start of the conclusions to highlight the three 1788 distinct areas of concern 1790 * Completely re-wrote appendices 1792 * Editorial corrections throughout. 1794 From -03 to -04: 1796 * Reordered Sections 2 and 3, and some clarifications here and 1797 there based on feedback from Colin Perkins and Mirja 1798 Kuehlewind. 1800 From -02 to -03 (this version) 1802 * Structural changes: 1804 + Split off text at end of "Scaling Congestion Control with 1805 Packet Size" into new section "Transport-Independent 1806 Network" 1808 + Shifted "Recommendations" straight after "Motivating 1809 Arguments" and added "Conclusions" at end to reinforce 1810 Recommendations 1812 + Added more internal structure to Recommendations, so that 1813 recommendations specific to RED or to TCP are just 1814 corollaries of a more general recommendation, rather than 1815 being listed as a separate recommendation. 1817 + Renamed "State of the Art" as "Critical Survey of Existing 1818 Advice" and retitled a number of subsections with more 1819 descriptive titles. 1821 + Split end of "Congestion Coding: Summary of Status" into a 1822 new subsection called "RED Implementation Status". 1824 + Removed text that had been in the Appendix "Congestion 1825 Notification Definition: Further Justification". 1827 * Reordered the intro text a little. 1829 * Made it clearer when advice being reported is deprecated and 1830 when it is not. 1832 * Described AQM as in network equipment, rather than saying "at 1833 the network layer" (to side-step controversy over whether 1834 functions like AQM are in the transport layer but in network 1835 equipment). 1837 * Minor improvements to clarity throughout 1839 From -01 to -02: 1841 * Restructured the whole document for (hopefully) easier reading 1842 and clarity. 
The concrete recommendation, in RFC2119 language, 1843 is now in Section 7. 1845 From -00 to -01: 1847 * Minor clarifications throughout and updated references 1849 From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00: 1851 * Added note on relationship to existing RFCs 1852 * Posed the question of whether packet-congestion could become 1853 common and deferred it to the IRTF ICCRG. Added ref to the 1854 dual-resource queue (DRQ) proposal. 1856 * Changed PCN references from the PCN charter & architecture to 1857 the PCN marking behaviour draft most likely to imminently 1858 become the standards track WG item. 1860 From -01 to -02: 1862 * Abstract reorganised to align with clearer separation of issue 1863 in the memo. 1865 * Introduction reorganised with motivating arguments removed to 1866 new Section 3. 1868 * Clarified avoiding lock-out of large packets is not the main or 1869 only motivation for RED. 1871 * Mentioned choice of drop or marking explicitly throughout, 1872 rather than trying to coin a word to mean either. 1874 * Generalised the discussion throughout to any packet forwarding 1875 function on any network equipment, not just routers. 1877 * Clarified the last point about why this is a good time to sort 1878 out this issue: because it will be hard / impossible to design 1879 new transports unless we decide whether the network or the 1880 transport is allowing for packet size. 1882 * Added statement explaining the horizon of the memo is long 1883 term, but with short term expediency in mind. 1885 * Added material on scaling congestion control with packet size 1886 (Section 3.4). 1888 * Separated out issue of normalising TCP's bit rate from issue of 1889 preference to control packets (Section 3.2). 
1891 * Divided up Congestion Measurement section for clarity, 1892 including new material on fixed size packet buffers and buffer 1893 carving (Section 4.1.1 & Section 4.2.1) and on congestion 1894 measurement in wireless link technologies without queues 1895 (Section 4.1.2). 1897 * Added section on 'Making Transports Robust against Control 1898 Packet Losses' (Section 4.2.3) with existing & new material 1899 included. 1901 * Added tabulated results of vendor survey on byte-mode drop 1902 variant of RED (Table 3). 1904 From -00 to -01: 1906 * Clarified applicability to drop as well as ECN. 1908 * Highlighted DoS vulnerability. 1910 * Emphasised that drop-tail suffers from similar problems to 1911 byte-mode drop, so only byte-mode drop should be turned off, 1912 not RED itself. 1914 * Clarified the original apparent motivations for recommending 1915 byte-mode drop included protecting SYNs and pure ACKs more than 1916 equalising the bit rates of TCPs with different segment sizes. 1917 Removed some conjectured motivations. 1919 * Added support for updates to TCP in progress (ackcc & ecn-syn- 1920 ack). 1922 * Updated survey results with newly arrived data. 1924 * Pulled all recommendations together into the conclusions. 1926 * Moved some detailed points into two additional appendices and a 1927 note. 1929 * Considerable clarifications throughout. 1931 * Updated references 1933 Authors' Addresses 1935 Bob Briscoe 1936 BT 1937 B54/77, Adastral Park 1938 Martlesham Heath 1939 Ipswich IP5 3RE 1940 UK 1942 Phone: +44 1473 645196 1943 EMail: bob.briscoe@bt.com 1944 URI: http://bobbriscoe.net/ 1945 Jukka Manner 1946 Aalto University 1947 Department of Communications and Networking (Comnet) 1948 P.O. Box 13000 1949 FIN-00076 Aalto 1950 Finland 1952 Phone: +358 9 470 22481 1953 EMail: jukka.manner@aalto.fi 1954 URI: http://www.netlab.tkk.fi/~jmanner/