2 Transport Area Working Group B. Briscoe 3 Internet-Draft BT 4 Updates: 2309 (if approved) J. Manner 5 Intended status: Informational Aalto University 6 Expires: April 27, 2011 October 24, 2010 8 Byte and Packet Congestion Notification 9 draft-ietf-tsvwg-byte-pkt-congest-03 11 Abstract 13 This memo concerns dropping or marking packets using active queue 14 management (AQM) such as random early detection (RED) or pre- 15 congestion notification (PCN). We give three strong recommendations: 16 (1) packet size should be taken into account when transports read 17 congestion indications, (2) packet size should not be taken into 18 account when network equipment creates congestion signals (marking, 19 dropping), and therefore (3) the byte-mode packet drop variant of the 20 RED AQM algorithm that drops fewer small packets should not be used. This memo updates RFC 2309 by deprecating the byte-mode drop advice referenced from it. 22 Status of This Memo 24 This Internet-Draft is submitted in full conformance with the 25 provisions of BCP 78 and BCP 79.
27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF). Note that other groups may also distribute 29 working documents as Internet-Drafts. The list of current Internet- 30 Drafts is at http://datatracker.ietf.org/drafts/current/. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 This Internet-Draft will expire on April 27, 2011. 39 Copyright Notice 41 Copyright (c) 2010 IETF Trust and the persons identified as the 42 document authors. All rights reserved. 44 This document is subject to BCP 78 and the IETF Trust's Legal 45 Provisions Relating to IETF Documents 46 (http://trustee.ietf.org/license-info) in effect on the date of 47 publication of this document. Please review these documents 48 carefully, as they describe your rights and restrictions with respect 49 to this document. Code Components extracted from this document must 50 include Simplified BSD License text as described in Section 4.e of 51 the Trust Legal Provisions and are provided without warranty as 52 described in the Simplified BSD License. 54 Table of Contents 56 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 57 1.1. Terminology and Scoping . . . . . . . . . . . . . . . . . 7 58 1.2. Why now? . . . . . . . . . . . . . . . . . . . . . . . . . 8 59 2. Motivating Arguments . . . . . . . . . . . . . . . . . . . . . 10 60 2.1. Scaling Congestion Control with Packet Size . . . . . . . 10 61 2.2. Transport-Independent Network . . . . . . . . . . . . . . 10 62 2.3. Avoiding Perverse Incentives to (Ab)use Smaller Packets . 11 63 2.4. Small != Control . . . . . . . . . . . . . . . . . . . . . 12 64 2.5. Implementation Efficiency . . . . . . . . . . . . . . . . 13 65 3. Recommendations . . . . . . . . . . . . . . . . . . . . . . . 
13 66 3.1. Recommendation on Queue Measurement . . . . . . . . . . . 13 67 3.2. Recommendation on Notifying Congestion . . . . . . . . . . 13 68 3.3. Recommendation on Responding to Congestion . . . . . . . . 14 69 3.4. Recommended Future Research . . . . . . . . . . . . . . . 15 70 4. A Survey and Critique of Past Advice . . . . . . . . . . . . . 15 71 4.1. Congestion Measurement Advice . . . . . . . . . . . . . . 16 72 4.1.1. Fixed Size Packet Buffers . . . . . . . . . . . . . . 16 73 4.1.2. Congestion Measurement without a Queue . . . . . . . . 17 74 4.2. Congestion Notification Advice . . . . . . . . . . . . . . 18 75 4.2.1. Network Bias when Encoding . . . . . . . . . . . . . . 18 76 4.2.2. Transport Bias when Decoding . . . . . . . . . . . . . 20 77 4.2.3. Making Transports Robust against Control Packet 78 Losses . . . . . . . . . . . . . . . . . . . . . . . . 21 79 4.2.4. Congestion Notification: Summary of Conflicting 80 Advice . . . . . . . . . . . . . . . . . . . . . . . . 22 81 4.2.5. RED Implementation Status . . . . . . . . . . . . . . 23 82 5. Outstanding Issues and Next Steps . . . . . . . . . . . . . . 24 83 5.1. Bit-congestible World . . . . . . . . . . . . . . . . . . 24 84 5.2. Bit- & Packet-congestible World . . . . . . . . . . . . . 25 85 6. Security Considerations . . . . . . . . . . . . . . . . . . . 26 86 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 27 87 8. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 27 88 9. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 28 89 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 28 90 10.1. Normative References . . . . . . . . . . . . . . . . . . . 28 91 10.2. Informative References . . . . . . . . . . . . . . . . . . 29 92 Appendix A. Idealised Wire Protocol . . . . . . . . . . . . . . . 32 93 A.1. Protocol Coding . . . . . . . . . . . . . . . . . . . . . 32 94 A.2. Example Scenarios . . . . . . . . . . . . . . . . . . . . 34 95 A.2.1. 
Notation . . . . . . . . . . . . . . . . . . . . . . . 34 96 A.2.2. Bit-congestible resource, equal bit rates (Ai) . . . . 34 97 A.2.3. Bit-congestible resource, equal packet rates (Bi) . . 35 98 A.2.4. Pkt-congestible resource, equal bit rates (Aii) . . . 36 99 A.2.5. Pkt-congestible resource, equal packet rates (Bii) . . 37 100 Appendix B. Byte-mode Drop Complicates Policing Congestion 101 Response . . . . . . . . . . . . . . . . . . . . . . 37 103 Appendix C. Changes from Previous Versions . . . . . . . . . . . 38 105 1. Introduction 107 This memo is initially concerned with how we should correctly scale 108 congestion control functions with packet size for the long term. But 109 it also recognises that expediency may be necessary to deal with 110 existing widely deployed protocols that don't live up to the long 111 term goal. 113 When notifying congestion, the problem of how (and whether) to take 114 packet sizes into account has exercised the minds of researchers and 115 practitioners for as long as active queue management (AQM) has been 116 discussed. Indeed, one reason AQM was originally introduced was to 117 reduce the lock-out effects that small packets can have on large 118 packets in drop-tail queues. This memo aims to state the principles 119 we should be using and to come to conclusions on what these 120 principles will mean for future protocol design, taking into account 121 the deployments we have already. 123 The byte vs. packet dilemma arises at three stages in the congestion 124 notification process: 126 Measuring congestion: When the congested resource decides locally to 127 measure how congested it is, should the queue measure its length 128 in bytes or packets? 
130 Encoding congestion notification into the wire protocol: When the 131 congested network resource decides whether to notify the level of 132 congestion by dropping or marking a particular packet, should its 133 decision depend on the byte-size of the particular packet being 134 dropped or marked? 136 Decoding congestion notification from the wire protocol: When the 137 transport interprets the notification in order to decide how much 138 to respond to congestion, should it take into account the byte- 139 size of each missing or marked packet? 141 Consensus has emerged over the years concerning the first stage: 142 whether queues are measured in bytes or packets, termed byte-mode 143 queue measurement or packet-mode queue measurement. This memo 144 records this consensus in the RFC Series. In summary, the choice 145 depends solely on whether the resource is congested by bytes or 146 packets. 148 The controversy is mainly around the last two stages: whether to 149 allow for the size of the specific packet notifying congestion i) 150 when the network encodes or ii) when the transport decodes the 151 congestion notification. 153 Currently, the RFC series is silent on this matter other than a paper 154 trail of advice referenced from [RFC2309], which conditionally 155 recommends byte-mode (packet-size dependent) drop [pktByteEmail]. 156 Reducing drop of small packets certainly has some tempting 157 advantages: i) it drops fewer control packets, which tend to be small, 158 and ii) it makes TCP's bit-rate less dependent on packet size. 159 However, there are ways of addressing these issues at the transport 160 layer, rather than reverse engineering network forwarding to fix the 161 problems of one specific transport. 163 The primary purpose of this memo is to build a definitive consensus 164 against deliberate preferential treatment for small packets in AQM 165 algorithms and to record this advice within the RFC series.
It 166 recommends that (1) packet size should be taken into account when 167 transports read congestion indications, but (2) not when network 168 equipment writes them. 170 In particular, this means that the byte-mode packet drop variant of 171 RED should not be used to drop fewer small packets, because that 172 creates a perverse incentive for transports to use tiny segments, 173 consequently also opening up a DoS vulnerability. Fortunately, none of 174 the RED implementers who responded to our survey (Section 4.2.4) have 175 followed the earlier advice to use byte-mode drop, so the 176 consensus this memo argues for seems to already exist in 177 implementations. 179 However, at the transport layer, TCP congestion control is a widely 180 deployed protocol that we argue doesn't scale correctly with packet 181 size. To date, this hasn't been a significant problem because most 182 TCPs have been used with similar packet sizes. But, as we design new 183 congestion controls, we should build in scaling with packet size 184 rather than assuming we should follow TCP's example. 186 This memo continues as follows. First it discusses terminology and 187 scoping and why it is relevant to publish this memo now. Section 2 188 gives motivating arguments for the recommendations that are formally 189 stated in Section 3, which follows. We then critically survey the 190 advice given previously in the RFC series and the research literature 191 (Section 4), followed by an assessment of whether or not this advice 192 has been followed in production networks (Section 4.2.5). To wrap 193 up, outstanding issues are discussed that will need resolution both 194 to inform future protocol designs and to handle legacy (Section 5). 195 Then security issues are collected together in Section 6 before 196 conclusions are drawn in Section 7. The interested reader can find 197 discussion of more detailed issues on the theme of byte vs. packet in 198 the appendices.
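To preview the scaling issue concretely: a bit-congestible queue must shed the same excess bytes during a congestion excursion however the traffic is packetised, so a transport that counts loss events sees a stronger signal the smaller its packets are. The following minimal sketch illustrates this; the excess-byte figure and packet sizes are illustrative assumptions, not values taken from this memo:

```python
# Hypothetical sketch: the same congestion excursion sheds the same
# number of bytes, however the traffic is packetised.  A transport
# that counts loss *events* sees 25x more signal for the small-packet
# flow; one that counts lost *bytes* sees the same signal for both.
EXCESS_BYTES = 15000  # illustrative size of one congestion excursion

for name, pkt_size in [("A: 1500B packets", 1500), ("B: 60B packets", 60)]:
    loss_events = EXCESS_BYTES // pkt_size   # packets discarded
    lost_bytes = loss_events * pkt_size      # bytes discarded
    print(f"{name}: {loss_events} loss events, {lost_bytes} lost bytes")
```

Flow B suffers 250 loss events against flow A's 10, yet both lose the same 15000 bytes, which is why a transport that scales with packet size should respond to dropped or marked bytes rather than to dropped or marked packets.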
200 This memo intentionally includes a non-negligible amount of material 201 on the subject. A busy reader can jump right into Section 3 to read 202 a summary of the recommendations for the Internet community. 204 1.1. Terminology and Scoping 206 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 207 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 208 document are to be interpreted as described in [RFC2119]. 210 Congestion Notification: Rather than attempt what many have 211 tried and failed to achieve, this memo will not try to define congestion. It 212 will give a working definition of what congestion notification 213 should be taken to mean for this document. Congestion 214 notification is a changing signal that aims to communicate the 215 ratio E/L. E is the instantaneous excess load offered to a 216 resource that it is either incapable of serving or unwilling to 217 serve. L is the instantaneous offered load. 219 The phrase `unwilling to serve' is added because AQM systems 220 (e.g. RED, PCN [RFC5670]) set a virtual limit smaller than the 221 actual limit to the resource, then notify when this virtual limit 222 is exceeded in order to avoid congestion of the actual capacity. 224 Note that the denominator is offered load, not capacity. 225 Therefore, congestion notification is a real number in the 226 range [0,1]. This ties in with the most well-understood measure 227 of congestion notification: drop probability (often loosely called 228 loss rate). It also means that congestion has a natural 229 interpretation as a probability: the probability of offered 230 traffic not being served (or being marked as at risk of not being 231 served). 233 Explicit and Implicit Notification: The byte vs. packet dilemma 234 concerns congestion notification irrespective of whether it is 235 signalled implicitly by drop or using explicit congestion 236 notification (ECN [RFC3168] or PCN [RFC5670]).
Throughout this 237 document, unless clear from the context, the term marking will be 238 used to mean notifying congestion explicitly, while congestion 239 notification will be used to mean notifying congestion either 240 implicitly by drop or explicitly by marking. 242 Bit-congestible vs. Packet-congestible: If the load on a resource 243 depends on the rate at which packets arrive, it is called packet- 244 congestible. If the load depends on the rate at which bits arrive, 245 it is called bit-congestible. 247 Examples of packet-congestible resources are route look-up engines 248 and firewalls, because load depends on how many packet headers 249 they have to process. Examples of bit-congestible resources are 250 transmission links, radio power and most buffer memory, because 251 the load depends on how many bits they have to transmit or store. 252 Some machine architectures use fixed size packet buffers, so 253 buffer memory in these cases is packet-congestible (see 254 Section 4.1.1). 256 Currently, a design goal of network processing equipment such as 257 routers and firewalls is to keep packet processing uncongested 258 even under worst case bit rates with minimum packet sizes. 259 Therefore, packet congestion is currently rare 260 [I-D.irtf-iccrg-welzl; S.3.3], but there is no guarantee that it 261 will not become common with future technology trends. 263 Note that information is generally processed or transmitted with a 264 minimum granularity greater than a bit (e.g. octets). The 265 appropriate granularity for the resource in question should be 266 used, but for the sake of brevity we will talk in terms of bytes 267 in this memo. 269 Coarser Granularity: Resources may be congestible at higher levels 270 of granularity than bits or packets; for instance, stateful 271 firewalls are flow-congestible and call-servers are session- 272 congestible.
This memo focuses on congestion of connectionless 273 resources, but the same principles may be applicable for 274 congestion notification protocols controlling per-flow and per- 275 session processing or state. 277 RED Terminology: In RED, whether to use packets or bytes when 278 measuring queues is called respectively packet-mode queue 279 measurement or byte-mode queue measurement. And whether the 280 probability of dropping a packet is independent or dependent on 281 its byte-size is called respectively packet-mode drop or byte-mode 282 drop. The terms byte-mode and packet-mode should not be used 283 without specifying whether they apply to queue measurement or to 284 drop. 286 1.2. Why now? 288 Now is a good time to discuss whether fairness between different 289 sized packets would best be implemented in network equipment, or at 290 the transport, for a number of reasons: 292 1. The IETF pre-congestion notification (PCN) working group is 293 standardising the external behaviour of a PCN congestion 294 notification (AQM) algorithm [RFC5670]; 296 2. [RFC2309] says RED may either take account of packet size or not 297 when dropping, but gives no recommendation between the two, 298 referring instead to advice on the performance implications in an 299 email [pktByteEmail], which recommends byte-mode drop. Further, 300 just before RFC2309 was issued, an addendum was added to the 301 archived email that revisited the issue of packet vs. byte-mode 302 drop in its last paragraph, making the recommendation less clear- 303 cut; 305 3. Without the present memo, the only advice in the RFC series on 306 packet size bias in AQM algorithms would be a reference to an 307 archived email in [RFC2309] (including an addendum at the end of 308 the email to correct the original). 310 4. 
The IRTF Internet Congestion Control Research Group (ICCRG) 311 recently took on the challenge of building consensus on what 312 common congestion control support should be required from network 313 forwarding functions in future [I-D.irtf-iccrg-welzl]. The wider 314 Internet community needs to discuss whether the complexity of 315 adjusting for packet size should be in the network or in 316 transports; 318 5. Given there are many good reasons why larger path max 319 transmission units (PMTUs) would help solve a number of scaling 320 issues, we don't want to create any bias against large packets 321 that is greater than their true cost; 323 6. The IETF audio/video transport (AVT) working group is 324 standardising how the real-time protocol (RTP) should feed back 325 and respond to explicit congestion notification (ECN) 326 [I-D.ietf-avt-ecn-for-rtp]. 328 7. The IETF has started to consider the question of fairness between 329 flows that use different packet sizes (e.g. in the small-packet 330 variant of TCP-friendly rate control, TFRC-SP [RFC4828]). Given 331 transports with different packet sizes, if we don't decide 332 whether the network or the transport should allow for packet 333 size, it will be hard if not impossible to design any transport 334 protocol so that its bit-rate relative to other transports meets 335 design guidelines [RFC5033] (Note however that, if the concern 336 were fairness between users, rather than between flows 337 [Rate_fair_Dis], relative rates between flows would have to come 338 under run-time control rather than being embedded in protocol 339 designs). 341 2. Motivating Arguments 343 In this section, we compare packet-based and byte-based 344 congestion notification and motivate the recommendations given in 345 this document. 347 2.1. Scaling Congestion Control with Packet Size 349 There are two ways of interpreting a dropped or marked packet.
It 350 can either be considered as a single loss event or as loss/marking of 351 the bytes in the packet. 353 Consider a bit-congestible link shared by many flows (bit-congestible 354 is the more common case, see Section 1.1), so that each busy period 355 tends to cause packets to be lost from different flows. Consider 356 further two sources that have the same data rate but break the load 357 into large packets in one application (A) and small packets in the 358 other (B). Of course, because the load is the same, there will be 359 proportionately more packets in the small packet flow (B). 361 If a congestion control scales with packet size, it should respond in 362 the same way to the same congestion excursion, irrespective of the 363 size of the packets that the bytes causing congestion happen to be 364 broken down into. 366 A bit-congestible queue suffering a congestion excursion has to drop 367 or mark the same excess bytes whether they are in a few large packets 368 (A) or many small packets (B). So for the same congestion excursion, 369 the same number of bytes has to be shed to get the load back to its 370 operating point. But, of course, for smaller packets (B) more 371 packets will have to be discarded to shed the same bytes. 373 If all the transports interpret each drop/mark as a single loss event 374 irrespective of the size of the packet dropped, those with smaller 375 packets (B) will respond more to the same congestion excursion. On 376 the other hand, if they respond proportionately less when smaller 377 packets are dropped/marked, overall they will be able to respond the 378 same to the same congestion excursion. 380 Therefore, for a congestion control to scale with packet size, it 381 should respond to dropped or marked bytes (as TFRC-SP [RFC4828] 382 effectively does), instead of dropped or marked packets (as TCP 383 does). 385 2.2.
Transport-Independent Network 387 TCP congestion control ensures that flows competing for the same 388 resource each maintain the same number of segments in flight, 389 irrespective of segment size. So under similar conditions, flows 390 with different segment sizes will get different bit rates. 392 Even though reducing the drop probability of small packets (e.g. 393 RED's byte-mode drop) helps ensure TCPs with different packet sizes 394 will achieve similar bit rates, we argue this correction should be 395 made in any future transport protocols based on TCP, rather than in the 396 network merely to fix one transport, no matter how prominent it is. 397 Effectively, favouring small packets amounts to reverse engineering 398 network equipment around one particular transport protocol (TCP), 399 contrary to the excellent advice in [RFC3426], which asks designers 400 to question "Why are you proposing a solution at this layer of the 401 protocol stack, rather than at another layer?" 403 RFC2309 refers to an email [pktByteEmail] for advice on how RED 404 should allow for different packet sizes. The email says the question 405 of whether a packet's own size should affect its drop probability 406 "depends on the dominant end-to-end congestion control mechanisms". 407 But we argue network equipment should not be specialised for whatever 408 transport is predominant. No matter how convenient it is, we SHOULD 409 NOT hack the network solely to allow for omissions from the design of 410 one transport protocol, even if it is as predominant as TCP. 412 2.3. Avoiding Perverse Incentives to (Ab)use Smaller Packets 414 Increasingly, it is being recognised that a protocol design must take 415 care not to cause unintended consequences by giving the parties in 416 the protocol exchange perverse incentives [Evol_cc][RFC3426].
Again, 417 imagine a scenario where the same bit rate of packets will contribute 418 the same to bit-congestion of a link irrespective of whether it is 419 sent as fewer larger packets or more smaller packets. A protocol 420 design that caused larger packets to be more likely to be dropped 421 than smaller ones would be dangerous in this case: 423 Malicious transports: A queue that gives an advantage to small 424 packets can be used to amplify the force of a flooding attack. By 425 sending a flood of small packets, the attacker can get the queue 426 to discard more traffic in large packets, allowing more attack 427 traffic to get through to cause further damage. Such a queue 428 allows attack traffic to have a disproportionately large effect on 429 regular traffic without the attacker having to do much work. 431 Non-malicious transports: Even if a transport is not actually 432 malicious, if it finds small packets go faster, over time it will 433 tend to act in its own interest and use them. Queues that give 434 advantage to small packets create an evolutionary pressure for 435 transports to send at the same bit-rate but break their data 436 stream down into tiny segments to reduce their drop rate. 438 Encouraging a high volume of tiny packets might in turn 439 unnecessarily overload a completely unrelated part of the system, 440 perhaps more limited by header-processing than bandwidth. 442 Imagine two unresponsive flows arriving at a bit-congestible 443 transmission link, each with the same bit rate, say 1Mbps, but one 444 consisting of 1500B packets and the other of 60B packets, which are 25x smaller. 445 Consider a scenario where gentle RED [gentle_RED] is used, along with 446 the variant of RED we advise against, i.e. where the RED algorithm is 447 configured to adjust the drop probability of packets in proportion to 448 each packet's size (byte mode packet drop).
In this case, if RED 449 drops 25% of the larger packets, it will aim to drop 1% of the 450 smaller packets (but in practice it may drop more as congestion 451 increases [RFC4828; S.B.4]). Even though both flows arrive with the 452 same bit rate, the bit rate the RED queue aims to pass to the line 453 will be 750kbps for the flow of larger packets but 990kbps for the smaller 454 packets (but because of rate variation it will be less than this 455 target). 457 Note that, although the byte-mode drop variant of RED amplifies small 458 packet attacks, drop-tail queues amplify small packet attacks even 459 more (see Security Considerations in Section 6). Wherever possible, 460 neither should be used. 462 2.4. Small != Control 464 It is tempting to drop small packets with lower probability to 465 improve performance, because many control packets are small (TCP SYNs 466 & ACKs, DNS queries & responses, SIP messages, HTTP GETs, etc.) and 467 dropping fewer control packets considerably improves performance. 468 However, we must not give control packets preference purely by virtue 469 of their smallness, otherwise it is too easy for any data source to 470 get the same preferential treatment simply by sending data in smaller 471 packets. Again, we should not create perverse incentives that favour 472 small packets when what we actually intend is to favour control 473 packets. 475 Just because many control packets are small does not mean all small 476 packets are control packets. 478 So again, rather than fix these problems in the network, we argue 479 that the transport should be made more robust against losses of 480 control packets (see 'Making Transports Robust against Control Packet 481 Losses' in Section 4.2.3). 483 2.5.
Implementation Efficiency 485 Allowing for packet size at the transport rather than in the network 486 ensures that neither the network nor the transport needs to do a 487 multiply operation--multiplication by packet size is effectively 488 achieved as a repeated add when the transport adds to its count of 489 marked bytes as each congestion event is fed to it. This isn't a 490 principled reason in itself, but it is a happy consequence of the 491 other principled reasons. 493 3. Recommendations 495 3.1. Recommendation on Queue Measurement 497 Queue length is usually the most correct and simplest way to measure 498 congestion of a resource. To avoid the pathological effects of drop 499 tail, an AQM function can then be used to transform queue length into 500 the probability of dropping or marking a packet (e.g. RED's 501 piecewise linear function between thresholds). 503 If the resource is bit-congestible, the implementation SHOULD measure 504 the length of the queue in bytes. If the resource is packet- 505 congestible, the implementation SHOULD measure the length of the 506 queue in packets. No other choice makes sense, because the number of 507 packets waiting in the queue isn't relevant if the resource gets 508 congested by bytes and vice versa. 510 Corollaries: 512 1. Whether a resource is bit-congestible or packet-congestible is a 513 property of the resource, so an admin should not ever need to, or 514 be able to, configure the way a queue measures itself. 516 2. If RED is used, the implementation SHOULD use byte mode queue 517 measurement for measuring the congestion of bit-congestible 518 resources and packet mode queue measurement for packet- 519 congestible resources. 521 The recommended approach in less straightforward scenarios, such as 522 fixed size buffers, and resources without a queue, is discussed in 523 Section 4.1. 525 3.2. 
Recommendation on Notifying Congestion 527 The Internet's congestion notification protocols (drop, ECN & PCN) 528 SHOULD NOT take account of packet size when congestion is notified by 529 network equipment. Allowance for packet size is only appropriate 530 when the transport responds to congestion (See Recommendation 3.3). 532 This approach offers sufficient and correct congestion information 533 for all known and future transport protocols and also ensures no 534 perverse incentives are created that would encourage transports to 535 use inappropriately small packet sizes. 537 Corollaries: 539 1. AQM algorithms such as RED SHOULD NOT use byte-mode drop, which 540 deflates RED's drop probability for smaller packet sizes. RED's 541 byte-mode drop has no enduring advantages. It is more complex, 542 it creates the perverse incentive to fragment segments into tiny 543 pieces and it reopens the vulnerability to floods of small 544 packets that drop-tail queues suffered from and AQM was designed 545 to remove. 547 2. If a vendor has implemented byte-mode drop, and an operator has 548 turned it on, it is strongly RECOMMENDED that it be turned 549 off. Note that RED as a whole SHOULD NOT be turned off, as 550 without it, a drop tail queue also biases against large packets. 551 But note also that turning off byte-mode drop may alter the 552 relative performance of applications using different packet 553 sizes, so it would be advisable to establish the implications 554 before turning it off. 556 NOTE WELL that RED's byte-mode packet drop is completely 557 orthogonal to byte-mode queue measurement and should not be 558 confused with it. If a RED implementation has a byte-mode but 559 does not specify what sort of byte-mode, it is most probably 560 byte-mode queue measurement, which is fine. However, if in 561 doubt, the vendor should be consulted. 563 The byte mode packet drop variant of RED was recommended in the past 564 (see Section 4.2.1 for how thinking evolved).
However, our survey of 565 84 vendors across the industry (Section 4.2.5) has found that none of 566 the 19% who responded have implemented byte-mode drop in RED. Given 567 there appears to be little, if any, installed base, it seems we can 568 deprecate byte-mode drop in RED with little, if any, incremental 569 deployment impact. 571 3.3. Recommendation on Responding to Congestion 573 Instead of network equipment biasing its congestion notification in 574 favour of small packets, the IETF transport area should continue its 575 programme of: 577 o updating host-based congestion control protocols to take account 578 of packet size; 580 o making transports less sensitive to losing control packets like 581 SYNs and pure ACKs. 583 Corollaries: 585 1. If two TCPs with different packet sizes are required to run at 586 equal bit rates under the same path conditions, this SHOULD be 587 done by altering TCP (Section 4.2.2), not network equipment, 588 which would otherwise affect other transports besides TCP. 590 2. If it is desired to improve TCP performance by reducing the 591 chance that a SYN or a pure ACK will be dropped, this should be 592 done by modifying TCP (Section 4.2.3), not network equipment. 594 3.4. Recommended Future Research 596 The above conclusions cater for the Internet as it is today, with most 597 resources being primarily bit-congestible. A secondary conclusion of 598 this memo is that research is needed to determine whether there might 599 be more packet-congestible resources in the future. Then further 600 research would be needed to extend the Internet's congestion 601 notification (drop or ECN) so that it would be able to handle a more 602 even mix of bit-congestible and packet-congestible resources. 604 4. A Survey and Critique of Past Advice 606 The original 1993 paper on RED [RED93] proposed two options for the 607 RED active queue management algorithm: packet mode and byte mode.
608 Packet mode measured the queue length in packets and dropped (or 609 marked) individual packets with a probability independent of their 610 size. Byte mode measured the queue length in bytes and marked an 611 individual packet with probability in proportion to its size 612 (relative to the maximum packet size). In the paper's outline of 613 further work, the authors stated that they had made no recommendation on 614 whether the queue size should be measured in bytes or packets, but 615 noted that the difference could be significant. 617 When RED was recommended for general deployment in 1998 [RFC2309], 618 the two modes were mentioned, implying that the choice between them was a 619 question of performance, referring to a 1997 email [pktByteEmail] for 620 advice on tuning. A later addendum to this email introduced the 621 insight that there are in fact two orthogonal choices: 623 o whether to measure queue length in bytes or packets (Section 4.1); 625 o whether the drop probability of an individual packet should depend 626 on its own size (Section 4.2). 628 The rest of this section is structured accordingly. 630 4.1. Congestion Measurement Advice 632 The choice of which metric to use to measure queue length was left 633 open in RFC2309. It is now well understood that queues for 634 bit-congestible resources should be measured in bytes, and queues for 635 packet-congestible resources should be measured in packets. 637 Some modern queue implementations give a choice for setting RED's 638 thresholds in byte-mode or packet-mode. This may merely be an 639 administrator-interface preference, not altering how the queue itself 640 is measured, but on some hardware it does actually change the way it 641 measures its queue. Whether a resource is bit-congestible or 642 packet-congestible is a property of the resource, so an admin should not 643 ever need to, or be able to, configure the way a queue measures 644 itself.
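The two orthogonal choices above can be sketched in a few lines of illustrative Python. This is our own naming, not code from any implementation; the piecewise-linear function and the min_th/max_th/max_p parameters follow the RED description referenced above:

```python
def red_prob(avg_q, min_th, max_th, max_p):
    """RED's piecewise-linear function: map average queue length
    to a base drop/mark probability between the two thresholds."""
    if avg_q < min_th:
        return 0.0
    if avg_q >= max_th:
        return 1.0
    return max_p * (avg_q - min_th) / (max_th - min_th)

def queue_length(pkt_sizes, mode):
    """Orthogonal choice 1: measure the queue in bytes
    (bit-congestible resource) or packets (packet-congestible)."""
    return sum(pkt_sizes) if mode == "byte" else len(pkt_sizes)

def drop_prob(base_p, pkt_size, max_pkt_size, byte_mode_drop):
    """Orthogonal choice 2: optionally scale each packet's drop
    probability by its own size (RED's byte-mode drop, which this
    memo deprecates)."""
    return base_p * pkt_size / max_pkt_size if byte_mode_drop else base_p
```

For example, with base_p = 0.1 and a 1500 B maximum packet size, byte-mode drop would give a 64 B packet a drop probability of only about 0.004, which is precisely the bias towards small packets that Section 3.2 deprecates.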
646 NOTE: Congestion in some legacy bit-congestible buffers is only 647 measured in packets, not bytes. In such cases, the operator has to 648 set the thresholds mindful of a typical mix of packet sizes. Any 649 AQM algorithm on such a buffer will be oversensitive to high 650 proportions of small packets, e.g. a DoS attack, and undersensitive 651 to high proportions of large packets. However, there is no need to 652 make allowances for the possibility of such legacy in future protocol 653 design. This is safe because any undersensitivity during unusual 654 traffic mixes cannot lead to congestion collapse, given the buffer 655 will eventually revert to tail drop, discarding proportionately more 656 large packets. 658 4.1.1. Fixed Size Packet Buffers 660 Although the question of whether to measure queues in bytes or 661 packets is fairly well understood these days, measuring congestion is 662 not straightforward when the resource is bit-congestible but the 663 queue is packet-congestible or vice versa. This section outlines the 664 approach to take. There is no controversy over what should be done; 665 you just need to be expert in probability to work it out. And, even 666 if you know what should be done, it's not always easy to find a 667 practical algorithm to implement it. 669 Some, mostly older, queuing hardware sets aside fixed sized buffers 670 in which to store each packet in the queue. Also, with some 671 hardware, any fixed sized buffers not completely filled by a packet 672 are padded when transmitted to the wire. If we imagine a theoretical 673 forwarding system with both queuing and transmission in fixed, 674 MTU-sized units, it should clearly be treated as packet-congestible, 675 because the queue length in packets would be a good model of 676 congestion of the lower layer link.
678 If we now imagine a hybrid forwarding system with transmission delay 679 largely dependent on the byte-size of packets but buffers of one MTU 680 per packet, it should strictly require a more complex algorithm to 681 determine the probability of congestion. It should be treated as two 682 resources in sequence, where the sum of the byte-sizes of the packets 683 within each packet buffer models congestion of the line while the 684 length of the queue in packets models congestion of the queue. Then 685 the probability of congesting the forwarding buffer would be a 686 conditional probability--conditional on the previously calculated 687 probability of congesting the line. 689 In systems that use fixed size buffers, it is unusual for all the 690 buffers used by an interface to be the same size. Typically pools of 691 different sized buffers are provided (Cisco uses the term 'buffer 692 carving' for the process of dividing up memory into these pools 693 [IOSArch]). Usually, if the pool of small buffers is exhausted, 694 arriving small packets can borrow space in the pool of large buffers, 695 but not vice versa. However, it is easier to work out what should be 696 done if we temporarily set aside the possibility of such borrowing. 697 Then, with fixed pools of buffers for different sized packets and no 698 borrowing, the size of each pool and the current queue length in each 699 pool would both be measured in packets. So an AQM algorithm would 700 have to maintain the queue length for each pool, and judge whether to 701 drop/mark a packet of a particular size by looking at the pool for 702 packets of that size and using the length (in packets) of its queue. 704 We now return to the issue we temporarily set aside: small packets 705 borrowing space in larger buffers. In this case, the only difference 706 is that the pools for smaller packets have a maximum queue size that 707 includes all the pools for larger packets. 
And every time a packet 708 takes a larger buffer, the current queue size has to be incremented 709 for all queues in the pools of buffers less than or equal to the 710 buffer size used. 712 We will return to borrowing of fixed sized buffers when we discuss 713 biasing the drop/marking probability of a specific packet because of 714 its size in Section 4.2.1. But here we can give at least one 715 simple rule for how to measure the length of queues of fixed buffers: 716 no matter how complicated the scheme is, ultimately any fixed buffer 717 system will need to measure its queue length in packets, not bytes. 719 4.1.2. Congestion Measurement without a Queue 721 AQM algorithms are nearly always described assuming there is a queue 722 for a congested resource and the algorithm can use the queue length 723 to determine the probability that it will drop or mark each packet. 725 But not all congested resources lead to queues. For instance, 726 wireless spectrum is bit-congestible (for a given coding scheme), 727 because interference increases with the rate at which bits are 728 transmitted. But wireless link protocols do not always maintain a 729 queue that depends on spectrum interference. Similarly, 730 power-limited resources are also usually bit-congestible if energy is 731 primarily required for transmission rather than header processing, 732 but it is rare for a link protocol to build a queue as it approaches 733 maximum power.
[ECNFixedWireless] proposes 740 a practical and theoretically sound way to combine congestion 741 notification for different bit-congestible resources at different 742 layers along an end to end path, whether wireless or wired, and 743 whether with or without queues. 745 4.2. Congestion Notification Advice 747 4.2.1. Network Bias when Encoding 749 The previously mentioned email [pktByteEmail] referred to by 750 [RFC2309] advised that most scarce resources in the Internet were 751 bit-congestible, which is still believed to be true (Section 1.1). 752 But it went on to give advice we now disagree with. It said that 753 drop probability should depend on the size of the packet being 754 considered for drop if the resource is bit-congestible, but not if it 755 is packet-congestible. The argument continued that if packet drops 756 were inflated by packet size (byte-mode dropping), "a flow's fraction 757 of the packet drops is then a good indication of that flow's fraction 758 of the link bandwidth in bits per second". This was consistent with 759 a referenced policing mechanism being worked on at the time for 760 detecting unusually high bandwidth flows, eventually published in 761 1999 [pBox]. However, the problem could and should have been solved 762 by making the policing mechanism count the volume of bytes randomly 763 dropped, not the number of packets. 765 A few months before RFC2309 was published, an addendum was added to 766 the above archived email referenced from the RFC, in which the final 767 paragraph seemed to partially retract what had previously been said. 768 It clarified that the question of whether the probability of 769 dropping/marking a packet should depend on its size was not related 770 to whether the resource itself was bit congestible, but a completely 771 orthogonal question. However the only example given had the queue 772 measured in packets but packet drop depended on the byte-size of the 773 packet in question. 
No example was given the other way round. 775 In 2000, Cnodder et al [REDbyte] pointed out that there was an error 776 in the part of the original 1993 RED algorithm that aimed to 777 distribute drops uniformly, because it didn't correctly take into 778 account the adjustment for packet size. They recommended an 779 algorithm called RED_4 to fix this. But they also recommended a 780 further change, RED_5, to adjust drop rate dependent on the square of 781 relative packet size. This was indeed consistent with one implied 782 motivation behind RED's byte-mode drop--that we should reverse 783 engineer the network to improve the performance of dominant 784 end-to-end congestion control mechanisms. But it is not consistent with the 785 present recommendations of Section 3. 787 By 2003, a further change had been made to the adjustment for packet 788 size, this time in the RED algorithm of the ns2 simulator. Instead 789 of taking each packet's size relative to a `maximum packet size', it 790 was taken relative to a `mean packet size', intended to be a static 791 value representative of the `typical' packet size on the link. We 792 have not been able to find a justification in the literature for this 793 change; however, Eddy and Allman conducted experiments [REDbias] that 794 assessed how sensitive RED was to this parameter, amongst other 795 things. No-one seems to have pointed out that this changed algorithm 796 can often lead to drop probabilities greater than 1 (which should 797 ring alarm bells hinting that there's a mistake in the theory 798 somewhere). 800 On 10-Nov-2004, this variant of byte-mode packet drop was made the 801 default in the ns2 simulator. Our 802 admittedly limited survey of implementers (Section 4.2.5) found no evidence that any 803 variant of byte-mode drop had been implemented. Therefore any 804 conclusions based on ns2 simulations that use RED without disabling 805 byte-mode drop are likely to be highly questionable.
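The overflow problem is easy to demonstrate. In this illustrative sketch (our own naming, not the actual ns2 code), scaling the drop probability by packet size relative to a static `mean packet size' yields a nominal probability greater than 1 as soon as the packet is sufficiently larger than the configured mean:

```python
def ns2_byte_mode_drop_prob(base_p, pkt_size, mean_pkt_size):
    """Drop probability scaled by packet size relative to a static
    'mean packet size' (the post-2003 ns2-style variant discussed
    above).  Note there is no clamping, so the result can exceed 1."""
    return base_p * pkt_size / mean_pkt_size

# A 1500 B packet on a link configured with a 500 B 'mean' size:
p = ns2_byte_mode_drop_prob(0.8, 1500, 500)   # nominally 2.4
```

Any packet larger than mean_pkt_size / base_p gets a nominal drop probability above 1, which is the alarm bell referred to above.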
807 The byte-mode drop variant of RED is, of course, not the only 808 possible bias towards small packets in queueing systems. We have 809 already mentioned that tail-drop queues naturally tend to lock out 810 large packets once they are full. But queues with fixed sized 811 buffers also reduce the probability that small packets will be dropped if 812 (and only if) they allow small packets to borrow buffers from the 813 pools for larger packets. As was explained in Section 4.1.1 on fixed 814 size buffer carving, borrowing effectively makes the maximum queue 815 size for small packets greater than that for large packets, because 816 more buffers can be used by small packets while fewer will fit large 817 packets. 819 In itself, the bias towards small packets caused by buffer borrowing 820 is perfectly correct. Lower drop probability for small packets is 821 legitimate in buffer borrowing schemes, because small packets 822 genuinely congest the machine's buffer memory less than large 823 packets, given they can fit in more spaces. The bias towards small 824 packets is not artificially added (as it is in RED's byte-mode drop 825 algorithm); it merely reflects the reality of the way fixed buffer 826 memory gets congested. Incidentally, the bias towards small packets 827 from buffer borrowing is nothing like as large as that of RED's 828 byte-mode drop. 830 Nonetheless, fixed-buffer memory with tail drop is still prone to 831 lock out large packets, purely because of the tail-drop aspect. So a 832 good AQM algorithm like RED with packet-mode drop should be used with 833 fixed buffer memories where possible. If RED is too complicated to 834 implement with multiple fixed buffer pools, the minimum necessary to 835 prevent large-packet lock-out is to ensure smaller packets never use 836 the last available buffer in any of the pools for larger packets. 838 4.2.2.
Transport Bias when Decoding 840 The above proposals to alter the network equipment to bias towards 841 smaller packets have largely carried on outside the IETF process 842 (unless one counts a reference in an informational RFC to an archived 843 email!). Within the IETF, in contrast, there are many different 844 proposals to alter transport protocols to achieve the same goals, 845 i.e. either to make the flow bit-rate take account of packet size, or 846 to protect control packets from loss. This memo argues that altering 847 transport protocols is the more principled approach. 849 A recently approved experimental RFC adapts its transport layer 850 protocol to take account of packet sizes relative to typical TCP 851 packet sizes, proposing a new small-packet variant of 852 TCP-friendly rate control [RFC3448] called TFRC-SP [RFC4828]. 853 Essentially, it proposes a rate equation that inflates the flow rate 854 by the ratio of a typical TCP segment size (1500B including TCP 855 header) over the actual segment size [PktSizeEquCC]. (There are also 856 other important differences of detail relative to TFRC, such as using 857 virtual packets [CCvarPktSize] to avoid responding to multiple losses 858 per round trip and using a minimum inter-packet interval.) 860 Section 4.5.1 of this TFRC-SP spec discusses the implications of 861 operating in an environment where queues have been configured to drop 862 smaller packets with proportionately lower probability than larger 863 ones. But it only discusses TCP operating in such an environment, 864 only mentioning TFRC-SP briefly when discussing how to define 865 fairness with TCP. And it only discusses the byte-mode dropping 866 version of RED as it was before Cnodder et al pointed out that it didn't 867 sufficiently bias towards small packets to make TCP independent of 868 packet size.
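The inflation idea can be sketched as follows. This is a deliberate simplification under our own assumptions: only the simple square-root form of the TCP throughput relation is used, whereas the real TFRC and TFRC-SP equations [RFC3448] [RFC4828] include retransmission-timeout terms and other details omitted here:

```python
from math import sqrt

def tfrc_rate(s, rtt, p):
    """Very simplified TFRC-style sending rate (bytes/s) for segment
    size s, round-trip time rtt and loss event rate p."""
    return s / (rtt * sqrt(2 * p / 3))

def tfrc_sp_rate(s, rtt, p, typical=1500):
    """TFRC-SP-style rate: inflate the flow rate by the ratio of a
    typical TCP segment size (1500 B) over the actual segment size,
    so the resulting bit-rate no longer depends on s."""
    return (typical / s) * tfrc_rate(s, rtt, p)
```

With these definitions the s factors cancel, so a flow of 100 B segments and a flow of 1500 B segments see the same bit-rate under the same rtt and p: the small-packet flow is no longer penalised for its segment size, which is the point of TFRC-SP.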
870 So the TFRC-SP spec doesn't address the issue of whether the network 871 or the transport _should_ handle fairness between different packet 872 sizes. In its Appendix B.4 it discusses the possibility of both 873 TFRC-SP and some network buffers duplicating each other's attempts to 874 deliberately bias towards small packets. But the discussion is not 875 conclusive, instead reporting simulations of many of the 876 possibilities in order to assess performance but not recommending any 877 particular course of action. 879 The paper originally proposing TFRC with virtual packets (VP-TFRC) 880 [CCvarPktSize] proposed that there should perhaps be two variants to 881 cater for the different variants of RED. However, as the TFRC-SP 882 authors point out, there is no way for a transport to know whether 883 some queues on its path have deployed RED with byte-mode packet drop 884 (except if an exhaustive survey found that no-one has deployed it!-- 885 see Section 4.2.5). Incidentally, VP-TFRC also proposed that 886 byte-mode RED dropping should really square the packet size compensation 887 factor (like that of Cnodder's RED_5, but apparently unaware of it). 889 Pre-congestion notification [RFC5670] is a proposal to use a virtual 890 queue for AQM marking for packets within one Diffserv class in order 891 to give early warning prior to any real queuing. The proposed PCN 892 marking algorithms have been designed not to take account of packet 893 size when forwarding through queues. Instead, the general principle 894 has been to take account of the sizes of marked packets when 895 monitoring the fraction of marking at the edge of the network, as 896 recommended here. 898 4.2.3. Making Transports Robust against Control Packet Losses 900 Recently, two RFCs have defined changes to TCP that make it more 901 robust against losing small control packets [RFC5562] [RFC5690].
In 902 both cases they note that the case for these two TCP changes would be 903 weaker if RED were biased against dropping small packets. We argue 904 here that these two proposals are a safer and more principled way to 905 achieve TCP performance improvements than reverse engineering RED to 906 benefit TCP. 908 Although no proposals exist as far as we know, it would also be 909 possible and perfectly valid to make control packets robust against 910 drop by explicitly requesting a lower drop probability using their 911 Diffserv code point [RFC2474] to request a scheduling class with 912 lower drop. 914 Although not brought to the IETF, a simple proposal from Wischik 915 [DupTCP] suggests that the first three packets of every TCP flow 916 should be routinely duplicated after a short delay. It shows that 917 this would greatly improve the chances of short flows completing 918 quickly, but it would hardly increase traffic levels on the Internet, 919 because Internet bytes have always been concentrated in the large 920 flows. It further shows that the performance of many typical 921 applications depends on completion of long serial chains of short 922 messages. It argues that, given most of the value people get from 923 the Internet is concentrated within short flows, this simple 924 expedient would greatly increase the value of the best efforts 925 Internet at minimal cost. 927 4.2.4. 
Congestion Notification: Summary of Conflicting Advice 929 +-----------+----------------+-----------------+--------------------+ 930 | transport | RED_1 (packet | RED_4 (linear | RED_5 (square byte | 931 | cc | mode drop) | byte mode drop) | mode drop) | 932 +-----------+----------------+-----------------+--------------------+ 933 | TCP or | s/sqrt(p) | sqrt(s/p) | 1/sqrt(p) | 934 | TFRC | | | | 935 | TFRC-SP | 1/sqrt(p) | 1/sqrt(sp) | 1/(s.sqrt(p)) | 936 +-----------+----------------+-----------------+--------------------+ 938 Table 1: Dependence of flow bit-rate per RTT on packet size s and 939 drop rate p when network and/or transport bias towards small packets 940 to varying degrees 942 Table 1 aims to summarise the potential effects of all the advice 943 from different sources. Each column shows a different possible AQM 944 behaviour in different queues in the network, using the terminology 945 of Cnodder et al outlined earlier (RED_1 is basic RED with packet- 946 mode drop). Each row shows a different transport behaviour: TCP 947 [RFC5681] and TFRC [RFC3448] on the top row with TFRC-SP [RFC4828] 948 below. 950 Let us assume that the goal is for the bit-rate of a flow to be 951 independent of packet size. Suppressing all inessential details, the 952 table shows that this should either be achievable by not altering the 953 TCP transport in a RED_5 network, or using the small packet TFRC-SP 954 transport (or similar) in a network without any byte-mode dropping 955 RED (top right and bottom left). Top left is the `do nothing' 956 scenario, while bottom right is the `do-both' scenario in which bit- 957 rate would become far too biased towards small packets. Of course, 958 if any form of byte-mode dropping RED has been deployed on a subset 959 of queues that congest, each path through the network will present a 960 different hybrid scenario to its transport. 
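Table 1 can be checked mechanically. The sketch below is purely illustrative (our own encoding, with constant factors suppressed, as in the table): it records each cell of Table 1 and confirms which network/transport combinations leave the bit-rate independent of packet size s:

```python
from math import sqrt

# Flow bit-rate per RTT as a function of packet size s and drop rate
# p, one entry per cell of Table 1 (Cnodder et al's terminology).
rate = {
    ("TCP/TFRC", "RED_1"): lambda s, p: s / sqrt(p),
    ("TCP/TFRC", "RED_4"): lambda s, p: sqrt(s / p),
    ("TCP/TFRC", "RED_5"): lambda s, p: 1 / sqrt(p),
    ("TFRC-SP",  "RED_1"): lambda s, p: 1 / sqrt(p),
    ("TFRC-SP",  "RED_4"): lambda s, p: 1 / sqrt(s * p),
    ("TFRC-SP",  "RED_5"): lambda s, p: 1 / (s * sqrt(p)),
}

def independent_of_s(f, p=0.01):
    """True if the rate is the same for 64 B and 1500 B packets
    under the same drop rate."""
    return abs(f(64, p) - f(1500, p)) < 1e-9

# Only the 'top right' and 'bottom left' cells meet the goal of a
# bit-rate independent of packet size:
goal_met = {cell for cell, f in rate.items() if independent_of_s(f)}
```

Running this confirms the text's reading of the table: unmodified TCP in a RED_5 network, or TFRC-SP in a network with plain packet-mode drop (RED_1), and no other combination shown.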
962 In any case, we can see that the linear byte-mode drop column in the 963 middle considerably complicates the Internet. It's a half-way house 964 that doesn't bias enough towards small packets, even if one believes 965 the network should be doing the biasing. Section 3 recommends that 966 _all_ bias in network equipment towards small packets should be 967 turned off--if indeed any equipment vendors have implemented it-- 968 leaving packet size bias solely as the preserve of the transport 969 layer (solely the leftmost, packet-mode drop column). 971 4.2.5. RED Implementation Status 973 A survey of 84 vendors has been conducted to assess how widely drop 974 probability based on packet size has been implemented in RED. Prior 975 to the survey, an individual approach to Cisco received confirmation 976 that, having checked the code-base for each of the product ranges, 977 Cisco has not implemented any discrimination based on packet size in 978 any AQM algorithm in any of its products. Also, an individual 979 approach to Alcatel-Lucent drew a confirmation that it was very 980 likely that none of their products contained RED code that 981 implemented any packet-size bias. 983 Turning to our more formal survey (Table 2), about 19% of those 984 surveyed have replied so far, giving a sample size of 16. Although 985 we do not have permission to identify the respondents, we can say 986 that those that have responded include most of the larger vendors, 987 covering a large fraction of the market. They range across the large 988 network equipment vendors at L3 & L2, firewall vendors and wireless 989 equipment vendors, as well as large software businesses with a small 990 selection of networking products. So far, all those who have 991 responded have confirmed that they have not implemented the variant 992 of RED with drop dependent on packet size (2 were fairly sure they 993 had not but needed to check more thoroughly).
We have established 994 that Linux does not implement RED with packet size drop bias, 995 although we have not investigated a wider range of open source code. 997 +-------------------------------+----------------+-----------------+ 998 | Response | No. of vendors | %age of vendors | 999 +-------------------------------+----------------+-----------------+ 1000 | Not implemented | 14 | 17% | 1001 | Not implemented (probably) | 2 | 2% | 1002 | Implemented | 0 | 0% | 1003 | No response | 68 | 81% | 1004 | Total companies/orgs surveyed | 84 | 100% | 1005 +-------------------------------+----------------+-----------------+ 1007 Table 2: Vendor Survey on byte-mode drop variant of RED (lower drop 1008 probability for small packets) 1010 Where reasons have been given, the extra complexity of packet bias 1011 code has been the most prevalent, though one vendor had a more principled 1012 reason for avoiding it--similar to the argument of this document. 1014 Finally, we repeat that RED's byte-mode drop SHOULD be disabled, but 1015 active queue management such as RED SHOULD be enabled wherever 1016 possible if we are to eradicate bias towards small packets--without 1017 any AQM at all, tail drop tends to lock out large packets very 1018 effectively. 1020 Our survey was of vendor implementations, so we cannot be certain 1021 about operator deployment. But we believe many queues in the 1022 Internet are still tail-drop. The company of one of the co-authors 1023 (BT) has widely deployed RED, but many tail-drop queues are 1024 bound to still exist, particularly in access network equipment and on 1025 middleboxes like firewalls, where RED is not always available. 1027 Routers using a memory architecture based on fixed size buffers with 1028 borrowing may also still be prevalent in the Internet. As explained 1029 in Section 4.2.1, these also provide a marginal (but legitimate) bias 1030 towards small packets.
So even though RED byte-mode drop is not 1031 prevalent, it is likely there is still some bias towards small 1032 packets in the Internet due to tail drop and fixed buffer borrowing. 1034 5. Outstanding Issues and Next Steps 1036 5.1. Bit-congestible World 1038 For a connectionless network with nearly all resources being 1039 bit-congestible, we believe the recommended position is now unarguably 1040 clear--that the network should not make allowance for packet sizes 1041 and the transport should. This leaves two outstanding issues: 1043 o How to handle any legacy of AQM with byte-mode drop already 1044 deployed; 1046 o The need to start a programme to update transport congestion 1047 control protocol standards to take account of packet size. 1049 The sample of returns from our vendor survey (Section 4.2.5) suggests 1050 that byte-mode packet drop seems not to be implemented at all, let 1051 alone deployed, or if it is, it is likely to be very sparse. 1052 Therefore, we do not really need a migration strategy from `all but 1053 nothing' to nothing. 1055 A programme of standards updates to take account of packet size in 1056 transport congestion control protocols has started with TFRC-SP 1057 [RFC4828], while weighted TCPs implemented in the research community 1058 [WindowPropFair] could form the basis of a future change to TCP 1059 congestion control [RFC5681] itself. 1061 5.2. Bit- & Packet-congestible World 1063 Nonetheless, the position is much less clear-cut if the Internet 1064 becomes populated by a more even mix of both packet-congestible and 1065 bit-congestible resources. If we believe we should allow for this 1066 possibility in the future, this space contains a truly open research 1067 issue. 1069 We develop the concept of an idealised congestion notification 1070 protocol that supports both bit-congestible and packet-congestible 1071 resources in Appendix A.
This congestion notification requires at 1072 least two flags for congestion of bit-congestible and packet- 1073 congestible resources. This hides a fundamental problem--much more 1074 fundamental than whether we can magically create header space for yet 1075 another ECN flag in IPv4, or whether it would work while being 1076 deployed incrementally. Distinguishing drop from delivery naturally 1077 provides just one congestion flag--it is hard to drop a packet in two 1078 ways that are distinguishable remotely. This is a similar problem to 1079 that of distinguishing wireless transmission losses from congestive 1080 losses. 1082 This problem would not be solved even if ECN were universally 1083 deployed. A congestion notification protocol must survive a 1084 transition from low levels of congestion to high. Marking two states 1085 is feasible with explicit marking, but much harder if packets are 1086 dropped. Also, it will not always be cost-effective to implement AQM 1087 at every low level resource, so drop will often have to suffice. 1089 We should also note that, strictly, packet-congestible resources are 1090 actually cycle-congestible because load also depends on the 1091 complexity of each look-up and whether the pattern of arrivals is 1092 amenable to caching or not. Further, this reminds us that any 1093 solution must not require a forwarding engine to use excessive 1094 processor cycles in order to decide how to say it has no spare 1095 processor cycles. 1097 Recently, the dual resource queue (DRQ) proposal [DRQ] has been made 1098 on the premise that, as network processors become more cost 1099 effective, per packet operations will become more complex 1100 (irrespective of whether more function in the network is desirable). 1101 Consequently the premise is that CPU congestion will become more 1102 common. DRQ is a proposed modification to the RED algorithm that 1103 folds both bit congestion and packet congestion into one signal 1104 (either loss or ECN). 
1106 The problem of signalling packet processing congestion is not 1107 pressing, as most Internet resources are designed to be bit- 1108 congestible before packet processing starts to congest (see 1109 Section 1.1). However, the IRTF Internet congestion control research 1110 group (ICCRG) has set itself the task of reaching consensus on 1111 generic forwarding mechanisms that are necessary and sufficient to 1112 support the Internet's future congestion control requirements (the 1113 first challenge in [I-D.irtf-iccrg-welzl]). Therefore, rather than 1114 not giving this problem any thought at all, just because it is hard 1115 and currently hypothetical, we defer the question of whether packet 1116 congestion might become common and what to do if it does to the IRTF 1117 (the 'Small Packets' challenge in [I-D.irtf-iccrg-welzl]). 1119 6. Security Considerations 1121 This draft recommends that queues do not bias drop probability 1122 towards small packets as this creates a perverse incentive for 1123 transports to break down their flows into tiny segments. One of the 1124 benefits of implementing AQM was meant to be to remove this perverse 1125 incentive that drop-tail queues gave to small packets. Of course, if 1126 transports really want to make the greatest gains, they don't have to 1127 respond to congestion anyway. But we don't want applications that 1128 are trying to behave to discover that they can go faster by using 1129 smaller packets. 1131 In practice, transports cannot all be trusted to respond to 1132 congestion. So another reason for recommending that queues do not 1133 bias drop probability towards small packets is to avoid the 1134 vulnerability to small packet DDoS attacks that would otherwise 1135 result. One of the benefits of implementing AQM was meant to be to 1136 remove drop-tail's DoS vulnerability to small packets, so we 1137 shouldn't add it back again. 
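The size of the perverse incentive can be made concrete with a toy calculation (illustrative numbers and our own naming, not a measurement):

```python
def byte_mode_drop_prob(base_p, pkt_size, max_pkt_size=1500):
    """Per-packet drop probability under RED byte-mode drop."""
    return base_p * pkt_size / max_pkt_size

# Splitting the same data into 100 B packets instead of 1500 B
# packets cuts each packet's drop probability 15-fold, so the flow
# loses far fewer of its bytes while loading the queue with the
# same number of bits:
p_large = byte_mode_drop_prob(0.1, 1500)   # 0.1
p_small = byte_mode_drop_prob(0.1, 100)    # ~0.0067
```

Under packet-mode drop both packet sizes would see the same 0.1, so there is nothing to gain by fragmenting; this is why any allowance for packet size belongs in the transport's response, not in the network's signal.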
1139 If most queues implemented AQM with byte-mode drop, the resulting 1140 network would amplify the potency of a small packet DDoS attack. At 1141 the first queue the stream of packets would push aside a greater 1142 proportion of large packets, so more of the small packets would 1143 survive to attack the next queue. Thus a flood of small packets 1144 would continue on towards the destination, pushing regular traffic 1145 with large packets out of the way in one queue after the next, but 1146 suffering much less drop itself. 1148 Appendix B explains why the ability of networks to police the 1149 response of _any_ transport to congestion depends on bit-congestible 1150 network resources only doing packet-mode not byte-mode drop. In 1151 summary, it says that making drop probability depend on the size of 1152 the packets that bits happen to be divided into simply encourages the 1153 bits to be divided into smaller packets. Byte-mode drop would 1154 therefore irreversibly complicate any attempt to fix the Internet's 1155 incentive structures. 1157 7. Conclusions 1159 This memo strongly recommends that the size of an individual packet 1160 that is dropped or marked should only be taken into account when a 1161 transport reads this as a congestion indication, not when network 1162 equipment writes it. The memo therefore strongly deprecates using 1163 RED's byte-mode of packet drop in network equipment. 1165 Whether network equipment should measure the length of a queue by 1166 counting bytes or counting packets is a different question to whether 1167 it should take into account the size of each packet being dropped or 1168 marked. The answer depends on whether the network resource is 1169 congested respectively by bytes or by packets. This means that RED's 1170 byte-mode queue measurement will often be appropriate even though 1171 byte-mode drop is strongly deprecated. 
At the transport layer the IETF should continue updating congestion control protocols to take account of the size of each packet that indicates congestion.  The IETF should also continue to make transports less sensitive to losing control packets like SYNs, pure ACKs and DNS exchanges.  Although many control packets happen to be small, the alternative of network equipment favouring all small packets would be dangerous, because it would create perverse incentives to split data transfers into smaller packets.

The memo develops these recommendations from principled arguments concerning scaling, layering, incentives, inherent efficiency, security and policeability.  But it also addresses practical issues such as specific buffer architectures and incremental deployment.  Indeed a limited survey of RED implementations is included, which shows there appears to be little, if any, installed base of RED's byte-mode drop.  It can therefore be deprecated with few, if any, incremental deployment complications.

The recommendations have been developed on the well-founded basis that most Internet resources are bit-congestible, not packet-congestible.  We need to know how likely it is that this assumption will prevail in the longer term and, if it might not, what protocol changes will be needed to cater for a mix of the two.  These questions have been delegated to the IRTF.

8.  Acknowledgements

Thank you to Sally Floyd, who gave extensive and useful review comments.  Also thanks for the reviews from Philip Eardley, Toby Moncaster and Arnaud Jacquet, as well as helpful explanations of different hardware approaches from Larry Dunn and Fred Baker.  I am grateful to Bruce Davie and his colleagues for providing a timely and efficient survey of RED implementation in Cisco's product range.
Also grateful thanks to Toby Moncaster, Will Dormann, John Regnault, Simon Carter and Stefaan De Cnodder who further helped survey the current status of RED implementation and deployment and, finally, thanks to the anonymous individuals who responded.

Bob Briscoe and Jukka Manner are partly funded by Trilogy, a research project (ICT-216372) supported by the European Community under its Seventh Framework Programme.  The views expressed here are those of the authors only.

9.  Comments Solicited

Comments and questions are encouraged and very welcome.  They can be addressed to the IETF Transport Area working group mailing list, and/or to the authors.

10.  References

10.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2309]  Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC3426]  Floyd, S., "General Architectural and Policy Considerations", RFC 3426, November 2002.

[RFC5033]  Floyd, S. and M. Allman, "Specifying New Congestion Control Algorithms", BCP 133, RFC 5033, August 2007.

10.2.  Informative References

[CCvarPktSize]  Widmer, J., Boutremans, C., and J-Y. Le Boudec, "Congestion Control for Flows with Variable Packet Size", ACM CCR 34(2) 137--151, 2004.

[DRQ]  Shin, M., Chong, S., and I.
Rhee, "Dual-Resource TCP/AQM for Processing-Constrained Networks", IEEE/ACM Transactions on Networking Vol 16, issue 2, April 2008.

[DupTCP]  Wischik, D., "Short messages", Royal Society workshop on networks: modelling and control, September 2007.

[ECNFixedWireless]  Siris, V., "Resource Control for Elastic Traffic in CDMA Networks", Proc. ACM MOBICOM'02, September 2002.

[Evol_cc]  Gibbens, R. and F. Kelly, "Resource pricing and the evolution of congestion control", Automatica 35(12) 1969--1985, December 1999.

[I-D.conex-concepts-uses]  Briscoe, B., Woundy, R., Moncaster, T., and J. Leslie, "ConEx Concepts and Use Cases", draft-moncaster-conex-concepts-uses-01 (work in progress), July 2010.

[I-D.ietf-avt-ecn-for-rtp]  Westerlund, M., Johansson, I., Perkins, C., and K. Carlberg, "Explicit Congestion Notification (ECN) for RTP over UDP", draft-ietf-avt-ecn-for-rtp-02 (work in progress), July 2010.

[I-D.irtf-iccrg-welzl]  Welzl, M., Scharf, M., Briscoe, B., and D. Papadimitriou, "Open Research Issues in Internet Congestion Control", draft-irtf-iccrg-welzl-congestion-control-open-research-08 (work in progress), September 2010.

[IOSArch]  Bollapragada, V., White, R., and C. Murphy, "Inside Cisco IOS Software Architecture", Cisco Press: CCIE Professional Development, ISBN13: 978-1-57870-181-0, July 2000.

[MulTCP]  Crowcroft, J. and Ph. Oechslin, "Differentiated End to End Internet Services using a Weighted Proportional Fair Sharing TCP", CCR 28(3) 53--69, July 1998.

[PktSizeEquCC]  Vasallo, P., "Variable Packet Size Equation-Based Congestion Control", ICSI Technical Report tr-00-008, 2000.

[RED93]  Floyd, S. and V.
Jacobson, "Random Early Detection (RED) gateways for Congestion Avoidance", IEEE/ACM Transactions on Networking 1(4) 397--413, August 1993.

[REDbias]  Eddy, W. and M. Allman, "A Comparison of RED's Byte and Packet Modes", Computer Networks 42(3) 261--280, June 2003.

[REDbyte]  De Cnodder, S., Elloumi, O., and K. Pauwels, "RED behavior with different packet sizes", Proc. 5th IEEE Symposium on Computers and Communications (ISCC) 793--799, July 2000.

[RFC2474]  Nichols, K., Blake, S., Baker, F., and D. Black, "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers", RFC 2474, December 1998.

[RFC3448]  Handley, M., Floyd, S., Padhye, J., and J. Widmer, "TCP Friendly Rate Control (TFRC): Protocol Specification", RFC 3448, January 2003.

[RFC3714]  Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion Control for Voice Traffic in the Internet", RFC 3714, March 2004.

[RFC4828]  Floyd, S. and E. Kohler, "TCP Friendly Rate Control (TFRC): The Small-Packet (SP) Variant", RFC 4828, April 2007.

[RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K. Ramakrishnan, "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June 2009.

[RFC5670]  Eardley, P., "Metering and Marking Behaviour of PCN-Nodes", RFC 5670, November 2009.

[RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009.

[RFC5690]  Floyd, S., Arcia, A., Ros, D., and J. Iyengar, "Adding Acknowledgement Congestion Control to TCP", RFC 5690, February 2010.

[Rate_fair_Dis]  Briscoe, B., "Flow Rate Fairness: Dismantling a Religion", ACM CCR 37(2) 63--74, April 2007.
[WindowPropFair]  Siris, V., "Service Differentiation and Performance of Weighted Window-Based Congestion Control and Packet Marking Algorithms in ECN Networks", Computer Communications 26(4) 314--326, 2002.

[gentle_RED]  Floyd, S., "Recommendation on using the "gentle_" variant of RED", Web page, March 2000.

[pBox]  Floyd, S. and K. Fall, "Promoting the Use of End-to-End Congestion Control in the Internet", IEEE/ACM Transactions on Networking 7(4) 458--472, August 1999.

[pktByteEmail]  Floyd, S., "RED: Discussions of Byte and Packet Modes", email, March 1997.

Appendix A.  Idealised Wire Protocol

We will start by inventing an idealised congestion notification protocol before discussing how to make it practical.  The idealised protocol is shown to be correct using examples later in this appendix.

A.1.  Protocol Coding

Congestion notification involves the congested resource coding a congestion notification signal into the packet stream and the transports decoding it.  The idealised protocol uses two different (imaginary) fields in each datagram to signal congestion: one for byte congestion and one for packet congestion.

We are not saying two ECN fields will be needed (and we are not saying that somehow a resource should be able to drop a packet in one of two different ways so that the transport can distinguish which sort of drop it was!).  These two congestion notification channels are just a conceptual device.  They allow us to defer having to decide whether to distinguish between byte and packet congestion when the network resource codes the signal or when the transport decodes it.
1437 However, although this idealised mechanism isn't intended for 1438 implementation, we do want to emphasise that we may need to find a 1439 way to implement it, because it could become necessary to somehow 1440 distinguish between bit and packet congestion [RFC3714]. Currently, 1441 packet-congestion is not the common case, but there is no guarantee 1442 that it will not become common with future technology trends. 1444 The idealised wire protocol is given below. It accounts for packet 1445 sizes at the transport layer, not in the network, and then only in 1446 the case of bit-congestible resources. This avoids the perverse 1447 incentive to send smaller packets and the DoS vulnerability that 1448 would otherwise result if the network were to bias towards them (see 1449 the motivating argument about avoiding perverse incentives in 1450 Section 2.3): 1452 1. A packet-congestible resource trying to code congestion level p_p 1453 into a packet stream should mark the idealised `packet 1454 congestion' field in each packet with probability p_p 1455 irrespective of the packet's size. The transport should then 1456 take a packet with the packet congestion field marked to mean 1457 just one mark, irrespective of the packet size. 1459 2. A bit-congestible resource trying to code time-varying byte- 1460 congestion level p_b into a packet stream should mark the `byte 1461 congestion' field in each packet with probability p_b, again 1462 irrespective of the packet's size. Unlike before, the transport 1463 should take a packet with the byte congestion field marked to 1464 count as a mark on each byte in the packet. 1466 The worked examples in Appendix A.2 show that transports can extract 1467 sufficient and correct congestion notification from these protocols 1468 for cases when two flows with different packet sizes have matching 1469 bit rates or matching packet rates. 
Examples are also given that mix 1470 these two flows into one to show that a flow with mixed packet sizes 1471 would still be able to extract sufficient and correct information. 1473 Sufficient and correct congestion information means that there is 1474 sufficient information for the two different types of transport 1475 requirements: 1477 Ratio-based: Established transport congestion controls like TCP's 1478 [RFC5681] aim to achieve equal segment rates per RTT through the 1479 same bottleneck--TCP friendliness [RFC3448]. They work with the 1480 ratio of dropped to delivered segments (or marked to unmarked 1481 segments in the case of ECN). The example scenarios show that 1482 these ratio-based transports are effectively the same whether 1483 counting in bytes or packets, because the units cancel out. 1484 (Incidentally, this is why TCP's bit rate is still proportional to 1485 packet size even when byte-counting is used, as recommended for 1486 TCP in [RFC5681], mainly for orthogonal security reasons.) 1488 Absolute-target-based: Other congestion controls proposed in the 1489 research community aim to limit the volume of congestion caused to 1490 a constant weight parameter. [MulTCP][WindowPropFair] are 1491 examples of weighted proportionally fair transports designed for 1492 cost-fair environments [Rate_fair_Dis]. In this case, the 1493 transport requires a count (not a ratio) of dropped/marked bytes 1494 in the bit-congestible case and of dropped/marked packets in the 1495 packet congestible case. 1497 A.2. Example Scenarios 1499 A.2.1. Notation 1501 To prove our idealised wire protocol (Appendix A.1) is correct, we 1502 will compare two flows with different packet sizes, s_1 and s_2 [bit/ 1503 pkt], to make sure their transports each see the correct congestion 1504 notification. Initially, within each flow we will take all packets 1505 as having equal sizes, but later we will generalise to flows within 1506 which packet sizes vary. 
A flow's bit rate, x [bit/s], is related to 1507 its packet rate, u [pkt/s], by 1509 x(t) = s.u(t). 1511 We will consider a 2x2 matrix of four scenarios: 1513 +-----------------------------+------------------+------------------+ 1514 | resource type and | A) Equal bit | B) Equal pkt | 1515 | congestion level | rates | rates | 1516 +-----------------------------+------------------+------------------+ 1517 | i) bit-congestible, p_b | (Ai) | (Bi) | 1518 | ii) pkt-congestible, p_p | (Aii) | (Bii) | 1519 +-----------------------------+------------------+------------------+ 1521 Table 3 1523 A.2.2. Bit-congestible resource, equal bit rates (Ai) 1525 Starting with the bit-congestible scenario, for two flows to maintain 1526 equal bit rates (Ai) the ratio of the packet rates must be the 1527 inverse of the ratio of packet sizes: u_2/u_1 = s_1/s_2. So, for 1528 instance, a flow of 60B packets would have to send 25x more packets 1529 to achieve the same bit rate as a flow of 1500B packets. If a 1530 congested resource marks proportion p_b of packets irrespective of 1531 size, the ratio of marked packets received by each transport will 1532 still be the same as the ratio of their packet rates, p_b.u_2/p_b.u_1 1533 = s_1/s_2. So of the 25x more 60B packets sent, 25x more will be 1534 marked than in the 1500B packet flow, but 25x more won't be marked 1535 too. 1537 In this scenario, the resource is bit-congestible, so it always uses 1538 our idealised bit-congestion field when it marks packets. Therefore 1539 the transport should count marked bytes not packets. But it doesn't 1540 actually matter for ratio-based transports like TCP (Appendix A.1). 1542 The ratio of marked to unmarked bytes seen by each flow will be p_b, 1543 as will the ratio of marked to unmarked packets. Because they are 1544 ratios, the units cancel out. 
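The arithmetic of scenario Ai can be checked with a short simulation.  This is only an illustrative sketch, not part of the memo: the packet counts and the value of p_b below are chosen arbitrarily to mirror the 1500B/60B example, and a size-blind per-packet marking probability stands in for packet-mode marking at a bit-congestible resource.

```python
import random

def simulate_flow(pkt_size, n_pkts, p_b, rng):
    """Mark each packet with probability p_b irrespective of its size
    (packet-mode marking at a bit-congestible resource).  A marked
    packet counts as a mark on every byte it carries."""
    marked_pkts = marked_bytes = 0
    for _ in range(n_pkts):
        if rng.random() < p_b:
            marked_pkts += 1
            marked_bytes += pkt_size
    return marked_pkts, marked_bytes, n_pkts * pkt_size

rng = random.Random(1)
p_b = 0.01
# Equal bit rates (Ai): the 60B flow must send 25x more packets.
mp1, mb1, tb1 = simulate_flow(1500, 20_000, p_b, rng)
mp2, mb2, tb2 = simulate_flow(60, 500_000, p_b, rng)

# Ratio-based decoding: packet and byte ratios both estimate p_b,
# because the packet size cancels out of the ratio.
print(mp1 / 20_000, mb1 / tb1)    # both ~ 0.01 for the 1500B flow
print(mp2 / 500_000, mb2 / tb2)   # both ~ 0.01 for the 60B flow

# Absolute byte-congestion-volume: roughly equal for the two flows,
# as expected when their bit rates are equal.
print(mb1, mb2)
```

Swapping the byte and packet counts anywhere above leaves the two ratio lines unchanged, which is the "units cancel out" point; only the absolute-volume line depends on counting bytes rather than packets.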
1546 If a flow sent an inconsistent mixture of packet sizes, we have said 1547 it should count the ratio of marked and unmarked bytes not packets in 1548 order to correctly decode the level of congestion. But actually, if 1549 all it is trying to do is decode p_b, it still doesn't matter. For 1550 instance, imagine the two equal bit rate flows were actually one flow 1551 at twice the bit rate sending a mixture of one 1500B packet for every 1552 thirty 60B packets. 25x more small packets will be marked and 25x 1553 more will be unmarked. The transport can still calculate p_b whether 1554 it uses bytes or packets for the ratio. In general, for any 1555 algorithm which works on a ratio of marks to non-marks, either bytes 1556 or packets can be counted interchangeably, because the choice cancels 1557 out in the ratio calculation. 1559 However, where an absolute target rather than relative volume of 1560 congestion caused is important (Appendix A.1), as it is for 1561 congestion accountability [Rate_fair_Dis], the transport must count 1562 marked bytes not packets, in this bit-congestible case. Aside from 1563 the goal of congestion accountability, this is how the bit rate of a 1564 transport can be made independent of packet size; by ensuring the 1565 rate of congestion caused is kept to a constant weight 1566 [WindowPropFair], rather than merely responding to the ratio of 1567 marked and unmarked bytes. 1569 Note the unit of byte-congestion-volume is the byte. 1571 A.2.3. Bit-congestible resource, equal packet rates (Bi) 1573 If two flows send different packet sizes but at the same packet rate, 1574 their bit rates will be in the same ratio as their packet sizes, x_2/ 1575 x_1 = s_2/s_1. For instance, a flow sending 1500B packets at the 1576 same packet rate as another sending 60B packets will be sending at 1577 25x greater bit rate. 
In this case, if a congested resource marks proportion p_b of packets irrespective of size, the ratio of packets received with the byte-congestion field marked by each transport will be the same, p_b.u_2/p_b.u_1 = 1.

Because the byte-congestion field is marked, the transport should count marked bytes not packets.  But because each flow sends consistently sized packets it still doesn't matter for ratio-based transports.  The ratio of marked to unmarked bytes seen by each flow will be p_b, as will the ratio of marked to unmarked packets.  Therefore, if the congestion control algorithm is only concerned with the ratio of marked to unmarked packets (as is TCP), both flows will be able to decode p_b correctly whether they count packets or bytes.

But if the absolute volume of congestion is important, e.g. for congestion accountability, the transport must count marked bytes not packets.  Then the lower bit rate flow using smaller packets will rightly be perceived as causing less byte-congestion even though its packet rate is the same.

If the two flows are mixed into one, of bit rate x_1+x_2, with equal packet rates of each size packet, the ratio p_b will still be measurable by counting the ratio of marked to unmarked bytes (or packets, because the ratio cancels out the units).  However, if the absolute volume of congestion is required, the transport must count the sum of congestion marked bytes, which indeed gives a correct measure of the rate of byte-congestion p_b(x_1 + x_2) caused by the combined bit rate.

A.2.4.  Pkt-congestible resource, equal bit rates (Aii)

Moving to the case of packet-congestible resources, we now take two flows that send different packet sizes at the same bit rate, but this time the pkt-congestion field is marked by the resource with probability p_p.
As in scenario Ai with the same bit rates but a bit-congestible resource, the flow with smaller packets will have a higher packet rate, so more packets will be both marked and unmarked, but in the same proportion.

This time, the transport should only count marks without taking into account packet sizes.  Transports will get the same result, p_p, by decoding the ratio of marked to unmarked packets in either flow.

If one flow mimics the two flows merged together, the bit rate will double, with more small packets than large.  The ratio of marked to unmarked packets will still be p_p.  But if the absolute number of pkt-congestion marked packets is counted, it will accumulate at the combined packet rate times the marking probability, p_p(u_1+u_2), 26x faster than packet congestion accumulates in the single 1500B packet flow of our example, as required.

If instead the transport is interested in the absolute amount of packet congestion, it should just count how many marked packets arrive.  For instance, a flow sending 60B packets will see 25x more marked packets than one sending 1500B packets at the same bit rate, because it is sending more packets through a packet-congestible resource.

Note the unit of packet congestion is a packet.

A.2.5.  Pkt-congestible resource, equal packet rates (Bii)

Finally, if two flows with the same packet rate pass through a packet-congestible resource, they will both suffer the same proportion of marking, p_p, irrespective of their packet sizes.  On detecting that the pkt-congestion field is marked, the transport should count packets, and it will be able to extract the ratio p_p of marked to unmarked packets from both flows, irrespective of packet sizes.
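The equal-packet-rate case (Bii) can be sketched the same way.  Again this is only an illustrative simulation with arbitrary numbers: a size-blind per-packet marking probability p_p stands in for the idealised pkt-congestion field, and the transport simply counts whole packets, ignoring their sizes.

```python
import random

def count_marks(n_pkts, p_p, rng):
    # Pkt-congestion marking: each packet is marked with probability
    # p_p irrespective of its size; the transport counts marked packets.
    return sum(rng.random() < p_p for _ in range(n_pkts))

rng = random.Random(2)
p_p, n = 0.02, 100_000                   # equal packet rates: n packets each
marks_large = count_marks(n, p_p, rng)   # flow of 1500B packets
marks_small = count_marks(n, p_p, rng)   # flow of 60B packets, 25x lower bit rate

# Both flows decode the same ratio p_p, and accumulate roughly the same
# absolute number of marked packets, despite the 25x bit-rate difference.
print(marks_large / n, marks_small / n)  # both ~ 0.02
```

Counting marked bytes here would wrongly make the 1500B flow appear to cause 25x more congestion, which is why the pkt-congestion field is decoded in packets, not bytes.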
Even if the transport is monitoring the absolute amount of packet congestion over a period, it will still see the same amount of packet congestion from either flow.

And if the two equal packet rates of different size packets are mixed together in one flow, the packet rate will double, so the absolute volume of packet-congestion will accumulate at twice the rate of either flow, 2p_p.u_1 = p_p(u_1+u_2).

Appendix B.  Byte-mode Drop Complicates Policing Congestion Response

This appendix explains why the ability of networks to police the response of _any_ transport to congestion depends on bit-congestible network resources doing only packet-mode, not byte-mode, drop.

To be able to police a transport's response to congestion when fairness can only be judged over time and over all an individual's flows, the policer has to have an integrated view of all the congestion an individual (not just one flow) has caused due to all traffic entering the Internet from that individual.  This is termed congestion accountability.

But a byte-mode drop algorithm has to depend on the local MTU of the line - an algorithm needs to use some concept of a 'normal' packet size.  Therefore, one dropped or marked packet is not necessarily equivalent to another unless you know the MTU at the queue where it was dropped/marked.  To have an integrated view of a user, we believe congestion policing has to be located at an individual's attachment point to the Internet [I-D.conex-concepts-uses].  But from there it cannot know the MTU of each remote queue that caused each drop/mark.  Therefore it cannot take an integrated approach to policing all the responses to congestion of all the transports of one individual.  Therefore it cannot police anything.

The security/incentive argument _for_ packet-mode drop is similar.
1681 Firstly, confining RED to packet-mode drop would not preclude 1682 bottleneck policing approaches such as [pBox] as it seems likely they 1683 could work just as well by monitoring the volume of dropped bytes 1684 rather than packets. Secondly packet-mode dropping/marking naturally 1685 allows the congestion notification of packets to be globally 1686 meaningful without relying on MTU information held elsewhere. 1688 Because we recommend that a dropped/marked packet should be taken to 1689 mean that all the bytes in the packet are dropped/marked, a policer 1690 can remain robust against bits being re-divided into different size 1691 packets or across different size flows [Rate_fair_Dis]. Therefore 1692 policing would work naturally with just simple packet-mode drop in 1693 RED. 1695 In summary, making drop probability depend on the size of the packets 1696 that bits happen to be divided into simply encourages the bits to be 1697 divided into smaller packets. Byte-mode drop would therefore 1698 irreversibly complicate any attempt to fix the Internet's incentive 1699 structures. 1701 Appendix C. Changes from Previous Versions 1703 To be removed by the RFC Editor on publication. 1705 Full incremental diffs between each version are available at 1706 1707 or 1708 1709 (courtesy of the rfcdiff tool): 1711 From -02 to -03 (this version) 1713 * Structural changes: 1715 + Split off text at end of "Scaling Congestion Control with 1716 Packet Size" into new section "Transport-Independent 1717 Network" 1719 + Shifted "Recommendations" straight after "Motivating 1720 Arguments" and added "Conclusions" at end to reinforce 1721 Recommendations 1723 + Added more internal structure to Recommendations, so that 1724 recommendations specific to RED or to TCP are just 1725 corollaries of a more general recommendation, rather than 1726 being listed as a separate recommendation. 
1728 + Renamed "State of the Art" as "Critical Survey of Existing 1729 Advice" and retitled a number of subsections with more 1730 descriptive titles. 1732 + Split end of "Congestion Coding: Summary of Status" into a 1733 new subsection called "RED Implementation Status". 1735 + Removed text that had been in the Appendix "Congestion 1736 Notification Definition: Further Justification". 1738 * Reordered the intro text a little. 1740 * Made it clearer when advice being reported is deprecated and 1741 when it is not. 1743 * Described AQM as in network equipment, rather than saying "at 1744 the network layer" (to side-step controversy over whether 1745 functions like AQM are in the transport layer but in network 1746 equipment). 1748 * Minor improvements to clarity throughout 1750 From -01 to -02: 1752 * Restructured the whole document for (hopefully) easier reading 1753 and clarity. The concrete recommendation, in RFC2119 language, 1754 is now in Section 7. 1756 From -00 to -01: 1758 * Minor clarifications throughout and updated references 1760 From briscoe-byte-pkt-mark-02 to ietf-byte-pkt-congest-00: 1762 * Added note on relationship to existing RFCs 1764 * Posed the question of whether packet-congestion could become 1765 common and deferred it to the IRTF ICCRG. Added ref to the 1766 dual-resource queue (DRQ) proposal. 1768 * Changed PCN references from the PCN charter & architecture to 1769 the PCN marking behaviour draft most likely to imminently 1770 become the standards track WG item. 1772 From -01 to -02: 1774 * Abstract reorganised to align with clearer separation of issue 1775 in the memo. 1777 * Introduction reorganised with motivating arguments removed to 1778 new Section 2. 1780 * Clarified avoiding lock-out of large packets is not the main or 1781 only motivation for RED. 1783 * Mentioned choice of drop or marking explicitly throughout, 1784 rather than trying to coin a word to mean either. 
1786 * Generalised the discussion throughout to any packet forwarding 1787 function on any network equipment, not just routers. 1789 * Clarified the last point about why this is a good time to sort 1790 out this issue: because it will be hard / impossible to design 1791 new transports unless we decide whether the network or the 1792 transport is allowing for packet size. 1794 * Added statement explaining the horizon of the memo is long 1795 term, but with short term expediency in mind. 1797 * Added material on scaling congestion control with packet size 1798 (Section 2.1). 1800 * Separated out issue of normalising TCP's bit rate from issue of 1801 preference to control packets (Section 2.4). 1803 * Divided up Congestion Measurement section for clarity, 1804 including new material on fixed size packet buffers and buffer 1805 carving (Section 4.1.1 & Section 4.2.1) and on congestion 1806 measurement in wireless link technologies without queues 1807 (Section 4.1.2). 1809 * Added section on 'Making Transports Robust against Control 1810 Packet Losses' (Section 4.2.3) with existing & new material 1811 included. 1813 * Added tabulated results of vendor survey on byte-mode drop 1814 variant of RED (Table 2). 1816 From -00 to -01: 1818 * Clarified applicability to drop as well as ECN. 1820 * Highlighted DoS vulnerability. 1822 * Emphasised that drop-tail suffers from similar problems to 1823 byte-mode drop, so only byte-mode drop should be turned off, 1824 not RED itself. 1826 * Clarified the original apparent motivations for recommending 1827 byte-mode drop included protecting SYNs and pure ACKs more than 1828 equalising the bit rates of TCPs with different segment sizes. 1829 Removed some conjectured motivations. 1831 * Added support for updates to TCP in progress (ackcc & ecn-syn- 1832 ack). 1834 * Updated survey results with newly arrived data. 1836 * Pulled all recommendations together into the conclusions. 
1838 * Moved some detailed points into two additional appendices and a 1839 note. 1841 * Considerable clarifications throughout. 1843 * Updated references 1845 Authors' Addresses 1847 Bob Briscoe 1848 BT 1849 B54/77, Adastral Park 1850 Martlesham Heath 1851 Ipswich IP5 3RE 1852 UK 1854 Phone: +44 1473 645196 1855 EMail: bob.briscoe@bt.com 1856 URI: http://bobbriscoe.net/ 1858 Jukka Manner 1859 Aalto University 1860 Department of Communications and Networking (Comnet) 1861 P.O. Box 13000 1862 FIN-00076 Aalto 1863 Finland 1865 Phone: +358 9 470 22481 1866 EMail: jukka.manner@tkk.fi 1867 URI: http://www.netlab.tkk.fi/~jmanner/