Network Working Group                                    B. Briscoe, Ed.
Internet-Draft                                               Independent
Intended status: Informational                                  G. White
Expires: 14 November 2022                                      CableLabs
                                                             13 May 2022

    The DOCSIS® Queue Protection Algorithm to Preserve Low Latency
                 draft-briscoe-docsis-q-protection-06

Abstract

   This informational document explains the specification of the queue
   protection algorithm used in DOCSIS technology since version 3.1. A
   shared low latency queue relies on the non-queue-building behaviour
   of every traffic flow using it. However, some flows might not take
   such care, either accidentally or maliciously. If a queue is about
   to exceed a threshold level of delay, the queue protection algorithm
   can rapidly detect the flows most likely to be responsible. It can
   then prevent harm to other traffic in the low latency queue by
   ejecting selected packets (or all packets) of these flows. The
   document is designed for four types of audience: a) congestion
   control designers who need to understand how to keep on the 'good'
   side of the algorithm; b) implementers of the algorithm who want to
   understand it in more depth; c) designers of algorithms with similar
   goals, perhaps for non-DOCSIS scenarios; and d) researchers
   interested in evaluating the algorithm.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts.
The list of current Internet- 37 Drafts is at https://datatracker.ietf.org/drafts/current/. 39 Internet-Drafts are draft documents valid for a maximum of six months 40 and may be updated, replaced, or obsoleted by other documents at any 41 time. It is inappropriate to use Internet-Drafts as reference 42 material or to cite them other than as "work in progress." 44 This Internet-Draft will expire on 14 November 2022. 46 Copyright Notice 48 Copyright (c) 2022 IETF Trust and the persons identified as the 49 document authors. All rights reserved. 51 This document is subject to BCP 78 and the IETF Trust's Legal 52 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 53 license-info) in effect on the date of publication of this document. 54 Please review these documents carefully, as they describe your rights 55 and restrictions with respect to this document. Code Components 56 extracted from this document must include Revised BSD License text as 57 described in Section 4.e of the Trust Legal Provisions and are 58 provided without warranty as described in the Revised BSD License. 60 Table of Contents 62 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 63 1.1. Document Roadmap . . . . . . . . . . . . . . . . . . . . 4 64 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 65 1.3. Copyright Material . . . . . . . . . . . . . . . . . . . 5 66 2. Approach - In Brief . . . . . . . . . . . . . . . . . . . . . 5 67 2.1. Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 6 68 2.2. Policy . . . . . . . . . . . . . . . . . . . . . . . . . 7 69 2.2.1. Policy Conditions . . . . . . . . . . . . . . . . . . 7 70 2.2.2. Policy Action . . . . . . . . . . . . . . . . . . . . 7 71 3. Necessary Flow Behaviour . . . . . . . . . . . . . . . . . . 7 72 4. Pseudocode Walk-Through . . . . . . . . . . . . . . . . . . . 8 73 4.1. Input Parameters, Constants and Variables . . . . . . . . 9 74 4.2. Queue Protection Data Path . . . . . . . . . . . . . . . 12 75 4.2.1. The qprotect() function . . . . . . . . . . . . . . . 13 76 4.2.2. The pick_bucket() function . . . . . . . . . . . . . 14 77 4.2.3. The fill_bucket() function . . . . . . . . . . . . . 17 78 4.2.4. The calcProbNative() function . . . . . . . . . . . . 17 79 5. Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . 18 80 5.1. Rationale: Blame for Queuing, not for Rate in Itself . . 18 81 5.2. Rationale for Aging the Queuing Score . . . . . . . . . . 20 82 5.3. Rationale for Transformed Queuing Score . . . . . . . . . 21 83 5.4. Rationale for Policy Conditions . . . . . . . . . . . . . 22 84 5.5. Rationale for Reclassification as the Policy Action . . . 25 85 6. Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 26 86 7. IANA Considerations (to be removed by RFC Editor) . . . . . . 26 87 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 26 88 9. Security Considerations . . . . . . . . . . . . . . . . . . . 27 89 9.1. Resource Exhaustion Attacks . . . . . . . . . . . . . . . 28 90 9.1.1. Exhausting Flow-State Storage . . . . . . . . . . . . 28 91 9.1.2. Exhausting Processing Resources . . . . . . . . . . . 29 92 10. Comments Solicited . . . . . . . . . . . . . . . . . . . . . 29 93 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 29 94 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 30 95 12.1. Normative References . . . . . . . . . . . . . . . . . . 30 96 12.2. Informative References . . . . . . . . . . . . . . . . . 31 97 Authors' Addresses . . . 
. . . . . . . . . . . . . . . . . . . . 32 99 1. Introduction 101 This informational document explains the specification of the queue 102 protection (QProt) algorithm used in DOCSIS technology since version 103 3.1 [DOCSIS]. 105 Although the algorithm is defined in annex P of [DOCSIS], it relies 106 on cross-references to other parts of the set of specs. This 107 document pulls all the strands together into one self-contained 108 document. The core of the document is a similar pseudocode walk- 109 through to that in the DOCSIS spec, but it also includes additional 110 material: i) a brief overview; ii) a definition of how a data sender 111 needs to behave to avoid triggering queue protection; and iii) a 112 section giving the rationale for the design choices. 114 Low queuing delay depends on hosts sending their data smoothly, 115 either at low rate or responding to explicit congestion notifications 116 (ECN [RFC8311], [I-D.ietf-tsvwg-ecn-l4s-id]). So low queuing latency 117 is something hosts create themselves, not something the network gives 118 them. This tends to ensure that self-interest alone does not drive 119 flows to mis-mark their packets for the low latency queue. However, 120 traffic from an application that does not behave in a non-queue- 121 building way might erroneously be classified into a low latency 122 queue, whether accidentally or maliciously. QProt protects other 123 traffic in the low latency queue from the harm due to excess queuing 124 that would otherwise be caused by such anomalous behaviour. 126 In normal scenarios without misclassified traffic, QProt is not 127 expected to intervene at all in the classification or forwarding of 128 packets. 130 An overview of how low latency support has been added to DOCSIS 131 technology is given in [LLD]. In each direction of a DOCSIS link 132 (upstream and downstream), there are two queues: one for Low Latency 133 (LL) and one for Classic traffic, in an arrangement similar to the 134 IETF's Coupled DualQ AQM [I-D.ietf-tsvwg-aqm-dualq-coupled]. The two 135 queues enable a transition from 'Classic' to 'Scalable' congestion 136 control so that low latency can become the norm for any application, 137 including ones seeking both full throughput and low latency, not just 138 low-rate applications that have been more traditionally associated 139 with a low latency service. The Classic queue is only necessary for 140 traffic such as traditional (Reno/Cubic) TCP that needs about a round 141 trip of buffering to fully utilize the link, and therefore has no 142 incentive to mismark itself as low latency. The QProt function is 143 located at the ingress to the Low Latency queue. Therefore, in the 144 upstream QProt is located on the cable modem (CM), and in the 145 downstream it is located on the cable CMTS (CM Termination System). 146 If an arriving packet triggers queue protection, the QProt algorithm 147 ejects the packet from the Low Latency queue and reclassifies it into 148 the Classic queue. 150 If QProt is used in settings other than DOCSIS links, it would be a 151 simple matter to detect queue-building flows by using slightly 152 different conditions, and/or to trigger a different action as a 153 consequence, as appropriate for the scenario, e.g., dropping instead 154 of reclassifying packets or perhaps accumulating a second per-flow 155 score to decide whether to redirect a whole flow rather than just 156 certain packets. Such work is for future study and out of scope of 157 the present document. 
159 The algorithm is based on a rigorous approach to quantifying how much 160 each flow contributes to congestion, which is used in economics to 161 allocate responsibility for the cost of one party's behaviour on 162 others (the economic externality). Another important feature of the 163 approach is that the metric used for the queuing score is based on 164 the same variable that determines the level of ECN signalling seen by 165 the sender [RFC8311], [I-D.ietf-tsvwg-ecn-l4s-id]. This makes the 166 internal queuing score visible externally as ECN markings. This 167 transparency is necessary to be able to objectively state (in 168 Section 3) how a flow can keep on the 'good' side of the algorithm. 170 1.1. Document Roadmap 172 The core of the document is the walk-through of the DOCSIS QProt 173 algorithm's pseudocode in Section 4. 175 Prior to that, Section 2 summarizes the approach used in the 176 algorithm, then Section 3 considers QProt from the perspective of the 177 end-system, by defining the behaviour that a flow needs to comply 178 with to avoid the QProt algorithm ejecting its packets from the low 179 latency queue. 181 Section 5 gives deeper insight into the principles and rationale 182 behind the algorithm. Then Section 6 explains the limitations of the 183 approach, followed by the usual closing sections. 185 1.2. Terminology 187 The normative language for the DOCSIS QProt algorithm is in the 188 DOCSIS specs [DOCSIS], [DOCSIS-CM-OSS], [DOCSIS-CCAP-OSS] not in this 189 informational guide. If there is any inconsistency, the DOCSIS specs 190 take precedence. 192 The following terms and abbreviations are used: 194 CM: Cable Modem 196 CMTS: CM Termination System 198 Congestion-rate: The transmission rate of bits or bytes contained 199 within packets of a flow that have the CE codepoint set in the IP- 200 ECN field [RFC3168] (including IP headers unless specified 201 otherwise). Congestion-bit-rate and congestion-volume were 202 introduced in [RFC7713] and [RFC6789]. 204 DOCSIS: Data Over Cable System Interface Specification. "DOCSIS" is 205 a registered trademark of Cable Television Laboratories, Inc. 206 ("CableLabs"). 208 Non-queue-building: A flow that tends not to build a queue 210 Queue-building: A flow that builds a queue. If it is classified 211 into the Low Latency queue, it is therefore a candidate for the 212 queue protection algorithm to detect and sanction. 214 ECN: Explicit Congestion Notification 216 QProt: The Queue Protection function 218 1.3. Copyright Material 220 Parts of this document are reproduced from [DOCSIS] with kind 221 permission of the copyright holder, Cable Television Laboratories, 222 Inc. ("CableLabs"). 224 2. Approach - In Brief 226 The algorithm is divided into mechanism and policy. There is only a 227 tiny amount of policy code, but policy might need to be changed in 228 the future. So, where hardware implementation is being considered, 229 it would be advisable to implement the policy aspects in firmware or 230 software: 232 * The mechanism aspects identify flows, maintain flow-state and 233 accumulate per-flow queuing scores; 235 * The policy aspects can be divided into conditions and actions: 237 - The conditions are the logic that determines when action should 238 be taken to avert the risk of queuing delay becoming excessive; 240 - The actions determine how this risk is averted, e.g., by 241 redirecting packets from a flow into another queue, or to 242 reclassify a whole flow that seems to be misclassified. 244 2.1. 
Mechanism

   The algorithm maintains per-flow-state, where 'flow' usually means an
   end-to-end (layer-4) 5-tuple. The flow-state consists of a queuing
   score that decays over time. Indeed, it is transformed into time
   units so that it represents the flow-state's own expiry time
   (explained in Section 5.3). A higher queuing score pushes out the
   expiry time further.

   Non-queue-building flows tend to release their flow-state rapidly ---
   it usually expires reasonably early in the gap between the packets of
   a normal flow. Then the memory can be recycled for packets from
   other flows that arrive in between. So only queue-building flows
   hold flow state persistently.

   The simplicity and effectiveness of the algorithm are due to the
   definition of the queuing score. The queuing score represents the
   share of blame for queuing that each flow bears. The scoring
   algorithm uses the same internal variable, probNative, that the AQM
   for the low latency queue uses to ECN-mark packets (the other two
   forms of marking, Classic and coupled, are driven by Classic traffic
   and therefore not relevant to protection of the LL queue). In this
   way, the queuing score accumulates the size of each arriving packet
   of a flow, scaled by the value of probNative (in the range 0 to 1) at
   the instant the packet arrives. So a flow's score accumulates
   faster, the higher the degree of queuing and the faster the flow's
   packets arrive while there is queuing. Section 5.1 explains further
   why this score represents blame for queuing.

   The algorithm as described so far would accumulate a number that
   would rise at the so-called congestion-rate of the flow (see
   Terminology in Section 1.2), i.e., the rate at which the flow is
   contributing to congestion, or the rate at which the AQM is
   forwarding bytes of the flow that are ECN marked. However, rather
   than growing continually, the queuing score is also reduced (or
   'aged') at a constant rate. This is because it is unavoidable for
   capacity-seeking flows to induce a continuous low level of congestion
   as they track available capacity. Section 5.2 explains why this
   allowance can be set to the same constant for any scalable flow,
   whatever its bit rate.

   For implementation efficiency, the queuing score is transformed into
   time units so that it represents the expiry time of the flow state
   (as already discussed above). Then it does not need to be explicitly
   aged, because the natural passage of time implicitly 'ages' an expiry
   time. The transformation into time units simply involves dividing
   the queuing score of each packet by the constant aging rate
   (explained further in Section 5.3).

2.2. Policy

2.2.1. Policy Conditions

   The algorithm determines whether to eject a packet using its flow's
   queuing score, and it makes that determination only once, at the time
   the packet first arrives. This limits the policies available. For
   instance, when queuing delay exceeds a threshold, it is not possible
   to eject a packet from the flow with the highest queuing score,
   because that would involve searching the queue for such a packet (if
   indeed one was still in the queue). Nonetheless, it is still possible
   to develop a policy that protects the low latency of the queue by
   making the queuing score threshold stricter, the greater the excess
   of queuing delay relative to the threshold (explained in
   Section 5.4).

2.2.2.
Policy Action 310 In the DOCSIS QProt spec at the time of writing, when the policy 311 conditions are met the action taken to protect the low latency queue 312 is to reclassify a packet into the Classic queue (justified in 313 Section 5.5). 315 3. Necessary Flow Behaviour 317 The QProt algorithm described here can be used for responsive and/or 318 unresponsive flows. 320 * It is possible to objectively describe the least responsive way 321 that a flow will need to respond to congestion signals in order to 322 avoid triggering queue protection, no matter the link capacity and 323 no matter how much other traffic there is. 325 * It is not possible to describe how fast or smooth an unresponsive 326 flow should be to avoid queue protection, because this depends on 327 how much other traffic there is and the capacity of the link, 328 which an application is unable to know. However, the more 329 smoothly an unresponsive flow paces its packets and the lower its 330 rate relative to typical broadband link capacities, the less 331 likelihood that it will risk causing enough queueing to trigger 332 queue protection. 334 Responsive low latency flows can use an L4S ECN codepoint 335 [I-D.ietf-tsvwg-ecn-l4s-id] to get classified into the low latency 336 queue. 338 A sender can arrange for flows that are smooth but do not respond to 339 ECN marking to be classified into the low latency queue by using the 340 Non-Queue-Building (NQB) Diffserv codepoint [I-D.ietf-tsvwg-nqb], 341 which the DOCSIS specs support, or an operator can use various other 342 local classifiers. 344 As already explained in Section 2.1, the QProt algorithm is driven 345 from the same variable that drives the ECN marking probability in the 346 low latency queue (see the Immediate Active Queue Management Annex in 347 [DOCSIS]). The algorithm that calculates this internal variable is 348 run on the arrival of every packet, whether it is ECN-capable or not, 349 so that it can be used by the QProt algorithm. But the variable is 350 only used to ECN-mark packets that are ECN-capable. 352 Not only does this dual use of the variable improve processing 353 efficiency, but it also makes the basis of the QProt algorithm 354 visible and transparent, at least for responsive ECN-capable flows. 355 Then it is possible to state objectively that a flow can avoid 356 triggering queue protection by keeping the bit rate of ECN marked 357 packets (the congestion-rate) below AGING, which is a configured 358 constant of the algorithm (default 2^19 B/s ~= 4 Mb/s). Note that it 359 is in a congestion controller's own interest to keep its average 360 congestion-rate well below this level (e.g., ~1 Mb/s), to ensure that 361 it does not trigger queue protection during transient dynamics. 363 If the QProt algorithm is used in other settings, it would still need 364 to be based on the visible level of congestion signalling, in a 365 similar way to the DOCSIS approach. Without transparency of the 366 basis of the algorithm's decisions, end-systems would not be able to 367 avoid triggering queue protection on an objective basis. 369 4. Pseudocode Walk-Through 370 4.1. Input Parameters, Constants and Variables 372 The operator input parameters that set the parameters in the first 373 two blocks of pseudocode below are defined for cable modems (CMs) in 374 [DOCSIS-CM-OSS] and for CMTSs in [DOCSIS-CCAP-OSS]. Then, further 375 constants are either derived from the input parameters or hard-coded. 377 Defaults and units are shown in square brackets. 
Defaults (or indeed 378 any aspect of the algorithm) are subject to change, so the latest 379 DOCSIS specs are the definitive references. Also any operator might 380 set certain parameters to non-default values. 382 383 // Input Parameters 384 MAX_RATE; // Configured maximum sustained rate [b/s] 385 QPROTECT_ON; // Queue Protection is enabled [Default: TRUE] 386 CRITICALqL_us; // L queue threshold delay [us] Default: MAXTH_us 387 CRITICALqLSCORE_us;// The threshold queuing score [Default: 4000us] 388 LG_AGING; // The aging rate of the q'ing score [Default: 19] 389 // as log base 2 of the congestion-rate [lg(B/s)] 391 // Input Parameters for the calcProbNative() algorithm: 392 MAXTH_us; // Max IAQM marking threshold [Default: 1000us] 393 LG_RANGE; // Log base 2 of the range of ramp [lg(ns)] 394 // Default: 2^19 = 524288 ns (roughly 525 us) 395 396 397 // Constants, either derived from input parameters or hard-coded 398 T_RES; // Resolution of t_exp [ns] 399 // Convert units (approx) 400 AGING = pow(2, (LG_AGING-30) ) * T_RES; // lg([B/s]) to [B/T_RES] 401 CRITICALqL = CRITICALqL_us * 1000; // [us] to [ns] 402 CRITICALqLSCORE = CRITICALqLSCORE_us * 1000/T_RES; // [us] to [T_RES] 403 // Threshold for the q'ing score condition 404 CRITICALqLPRODUCT = CRITICALqL * CRITICALqLSCORE; 405 qLSCORE_MAX = 5E9 / T_RES; // Max queuing score = 5 s 407 ATTEMPTS = 2; // Max attempts to pick a bucket (vendor-specific) 408 BI_SIZE = 5; // Bit-width of index number for non-default buckets 409 NBUCKETS = pow(2, BI_SIZE); // No. of non-default buckets 410 MASK = NBUCKETS-1; // convenient constant, with BI_SIZE LSBs set 412 // Queue Protection exit states 413 EXIT_SUCCESS = 0; // Forward the packet 414 EXIT_SANCTION = 1; // Redirect the packet 416 MAX_PROB = 1; // For integer arithmetic, would use a large int 417 // e.g., 2^31, to allow space for overflow 418 MAXTH = MAXTH_us * 1000; // Max marking threshold [ns] 419 MAX_FRAME_SIZE = 2000; // DOCSIS-wide constant [B] 420 // Minimum marking threshold of 2 MTU for slow links [ns] 421 FLOOR = 2 * 8 * MAX_FRAME_SIZE * 10^9 / MAX_RATE; 422 RANGE = (1 << LG_RANGE); // Range of ramp [ns] 423 MINTH = max ( MAXTH - RANGE, FLOOR); 424 MAXTH = MINTH + RANGE; // Max marking threshold [ns] 425 427 Throughout the pseudocode, most variables are integers. The only 428 exceptions are floating point variables representing probabilities 429 (MAX_PROB and probNative) and the AGING parameter. The actual DOCSIS 430 QProt algorithm is defined using integer arithmetic, but in the 431 floating point arithmetic used in this document, (0 <= probNative <= 432 1). Also, the pseudocode omits overflow checking and it would need 433 to be made robust to non-default input parameters. 435 The resolution for expressing time, T_RES, needs to be chosen to 436 ensure that expiry times for buckets can represent times that are a 437 fraction (e.g., 1/10) of the expected packet interarrival time for 438 the system. 440 The following definitions explain the purpose of important variables 441 and functions. 443 444 // Public variables: 445 qdelay; // The current queuing delay of the LL queue [ns] 446 probNative; // Native marking probability of LL queue within [0,1] 448 // External variables 449 packet; // The structure holding packet header fields 450 packet.size; // The size of the current packet [B] 451 packet.uflow; // The flow identifier of the current packet 452 // (e.g., 5-tuple or 4-tuple if IPSec) 454 // Irrelevant details of DOCSIS function to return qdelay are removed 455 qdelayL(...) 
// Returns current delay of the low latency Q [ns] 456 458 Pseudocode for how the algorithm categorizes packets by flow ID to 459 populate the variable packet.uflow is not given in detail here. The 460 application's flow ID is usually defined by a common 5-tuple (or 461 4-tuple) of: 463 * source and destination IP addresses of the innermost IP header 464 found; 466 * the protocol (IPv4) or next header (IPv6) field in this IP header 468 * either of: 470 - source and destination port numbers, for TCP, UDP, UDP-Lite, 471 SCTP, DCCP, etc. 473 - Security Parameters Index (SPI) for IPSec Encapsulating 474 Security Payload (ESP) [RFC4303]. 476 The Microflow Classification section of the Queue Protection Annex of 477 the DOCSIS spec. [DOCSIS] defines various strategies to find these 478 headers by skipping extension headers or encapsulations. If they 479 cannot be found, the spec. defines various less-specific 3-tuples 480 that would be used. The DOCSIS spec. should be referred to for all 481 these strategies, which will not be repeated here. 483 The array of bucket structures defined below is used by all the Queue 484 Protection functions: 486 487 struct bucket { // The leaky bucket structure to hold per-flow state 488 id; // identifier (e.g., 5-tuple) of flow using bucket 489 t_exp; // expiry time in units of T_RES 490 // (t_exp - now) = flow's transformed q'ing score 491 }; 492 struct bucket buckets[NBUCKETS+1]; 493 495 4.2. Queue Protection Data Path 497 All the functions of Queue Protection operate on the data path, 498 driven by packet arrivals. 500 The following functions that maintain per-flow queuing scores and 501 manage per-flow state are considered primarily as mechanism: 503 pick_bucket(uflow_id); // Returns bucket identifier 505 fill_bucket(bucket_id, pkt_size, probNative); // Returns queuing 506 score 508 calcProbNative(qdelay) // Returns probability of ECN-marking 510 The following function is primarily concerned with policy: 512 qprotect(packet, ...); // Returns exit status to either forward or 513 redirect the packet 515 ('...' suppresses distracting detail.) 517 Future modifications to policy aspects are more likely than to 518 mechanisms. Therefore, policy aspects would be less appropriate 519 candidates for any hardware acceleration. 521 The entry point to these functions is qprotect(), which is called 522 from packet classification before each packet is enqueued into the 523 appropriate queue, queue_id, as follows: 525 526 classifier(packet) { 527 // Determine which queue using ECN, DSCP and any local-use fields 528 queue_id = classify(packet); 529 // LQ & CQ are macros for valid queue IDs returned by classify() 530 if (queue_id == LQ) { 531 // if packet classified to Low Latency Service Flow 532 if (QPROTECT_ON) { 533 if (qprotect(packet, ...) == EXIT_SANCTION) { 534 // redirect packet to Classic Service Flow 535 queue_id = CQ; 536 } 537 } 538 return queue_id; 539 } 540 542 4.2.1. The qprotect() function 544 On each packet arrival, qprotect() measures the current queue delay 545 and derives the native marking probability from it. Then it uses 546 pick_bucket to find the bucket already holding the flow's state, or 547 to allocate a new bucket if the flow is new or its state has expired 548 (the most likely case). Then the queuing score is updated by the 549 fill_bucket() function. That completes the mechanism aspects. 551 The comments against the subsequent policy conditions and actions 552 should be self-explanatory at a superficial level. 
The deeper rationale for these conditions is given in Section 5.4.

   // Per packet queue protection
   qprotect(packet, ...) {
       bckt_id;     // bucket index
       qLscore;     // queuing score of pkt's flow in units of T_RES

       qdelay = qdelayL(...);
       probNative = calcProbNative(qdelay);

       bckt_id = pick_bucket(packet.uflow);
       qLscore = fill_bucket(bckt_id, packet.size, probNative);

       // Determine whether to sanction packet
       if ( ( ( qdelay > CRITICALqL )  // Test if qdelay over threshold...
              // ...and if flow's q'ing score scaled by qdelay/CRITICALqL
              // ...exceeds CRITICALqLSCORE
              && ( qdelay * qLscore > CRITICALqLPRODUCT ) )
            // or qLSCORE_MAX reached
            || ( qLscore >= qLSCORE_MAX ) )

           return EXIT_SANCTION;
       else
           return EXIT_SUCCESS;
   }

4.2.2. The pick_bucket() function

   The pick_bucket() function is optimized for flow-state that will
   normally have expired from packet to packet of the same flow. It is
   just one way of finding the bucket associated with the flow ID of
   each packet - it might be possible to develop more efficient
   alternatives.

   The algorithm is arranged so that the bucket holding any live (non-
   expired) flow-state associated with a packet will always be found
   before a new bucket is allocated. The constant ATTEMPTS, defined
   earlier, determines how many hashes are used to find a bucket for
   each flow (actually, only one hash is generated; then, by default, 5
   bits of it at a time are used as the hash value, because by default
   there are 2^5 = 32 buckets).

   The algorithm stores the flow's own ID in its flow-state. So, when a
   packet of a flow arrives, the algorithm tries up to ATTEMPTS times to
   hash to a bucket, looking for the flow's own ID. If found, it uses
   that bucket, first resetting the expiry time to 'now' if it has
   expired.

   If it does not find the flow's ID, and the expiry time is still
   current, the algorithm can tell that another flow is using that
   bucket, and it continues to look for a bucket for the flow. Even if
   it finds another flow's bucket where the expiry time has passed, it
   doesn't immediately use it. It merely remembers it as the potential
   bucket to use. But first it runs through all the ATTEMPTS hashes to
   look for a bucket assigned to the flow ID. Then, if a live bucket is
   not already associated with the packet's flow, the algorithm should
   have already set aside an existing bucket with a score that has aged
   out. Given this bucket is no longer necessary to hold state for its
   previous flow, it can be recycled for use by the present packet's
   flow.

   If all else fails, there is one additional bucket (called the dregs)
   that can be used. If the dregs is still in live use by another flow,
   subsequent flows that cannot find a bucket of their own all share it,
   adding their score to the one in the dregs. A flow might get away
   with using the dregs on its own, but when there are many mis-marked
   flows, multiple flows are more likely to collide in the dregs,
   including innocent flows. The choice of the number of buckets and the
   number of hash attempts determines how likely it will be that this
   undesirable scenario will occur.
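   To illustrate the index derivation described above, before the full
   pick_bucket() pseudocode that follows, here is a minimal, self-
   contained C sketch: a single 32-bit hash of the flow ID is computed
   once per packet, and each of the ATTEMPTS candidate bucket indices is
   taken from the next BI_SIZE bits of that hash. The hash function
   (toy_hash32) and the example flow ID are illustrative assumptions,
   not the hash specified for DOCSIS; the constants mirror the defaults
   in Section 4.1.

   #include <stdint.h>
   #include <stdio.h>

   #define BI_SIZE  5                  /* bit-width of a bucket index */
   #define NBUCKETS (1u << BI_SIZE)    /* 32 non-default buckets */
   #define MASK     (NBUCKETS - 1)     /* BI_SIZE least-signif. bits set */
   #define ATTEMPTS 2                  /* hash attempts per packet */

   /* Illustrative stand-in for the hash32() of the pseudocode; any
      reasonable 32-bit mix would do for this demonstration. */
   static uint32_t toy_hash32(uint32_t flow_id)
   {
       uint32_t h = flow_id * 2654435761u;   /* multiplicative mix */
       return h ^ (h >> 16);
   }

   int main(void)
   {
       uint32_t h32 = toy_hash32(0xC0A80101u); /* stands in for a 5-tuple hash */

       /* Each attempt consumes the next BI_SIZE bits of the same hash. */
       for (int j = 0; j < ATTEMPTS; j++) {
           printf("attempt %d -> candidate bucket %u\n",
                  j, (unsigned)(h32 & MASK));
           h32 >>= BI_SIZE;
       }
       return 0;
   }

   With the defaults (ATTEMPTS = 2, BI_SIZE = 5), only 10 bits of the
   32-bit hash are consumed; the remainder leaves room to increase
   either constant without changing the hash function.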
   // Pick the bucket associated with flow uflw
   pick_bucket(uflw) {

       now;    // current time
       j;      // loop counter
       h32;    // holds hash of the packet's flow IDs
       h;      // bucket index being checked
       hsav;   // interim chosen bucket index

       h32 = hash32(uflw);    // 32-bit hash of flow ID
       hsav = NBUCKETS;       // Default bucket
       now = get_time_now();  // in units of T_RES

       // The for loop checks ATTEMPTS buckets for ownership by flow-ID
       // It also records the 1st bucket, if any, that could be recycled
       //   because it's expired.
       // Must not recycle a bucket until all ownership checks completed
       for (j=0; j<ATTEMPTS; j++) {
           h = h32 & MASK;                 // Use BI_SIZE bits of hash as bucket index
           if (buckets[h].id == uflw) {    // Once a bucket owned by the flow-ID is found...
               if (buckets[h].t_exp <= now) {  // ...if the bucket has expired...
                   buckets[h].t_exp = now;     // ...reset it
               }
               return h;                   // ...use this bucket
           } else if ( (hsav == NBUCKETS)              // If no candidate recorded yet...
                     && (buckets[h].t_exp <= now) ) {  // ...and this bucket has expired...
               hsav = h;                   // ...record it as a candidate for recycling
           }
           h32 >>= BI_SIZE;                // Bit-shift hash for next attempt
       }
       // If reached here, no tested bucket was owned by the flow-ID
       if (hsav != NBUCKETS) {
           // If here, found an expired bucket within the above for loop
           buckets[hsav].t_exp = now;      // Reset expired bucket
       } else {
           // If here, we're having to use the default bucket (the dregs)
           if (buckets[hsav].t_exp <= now) {  // If dregs has expired...
               buckets[hsav].t_exp = now;     // ...reset it
           }
       }
       buckets[hsav].id = uflw;   // In either case, claim the bucket for this flow
       return hsav;
   }

4.2.3. The fill_bucket() function

   The fill_bucket() function both accumulates and ages the queuing
   score over time, as outlined in Section 2.1. To make aging the score
   efficient, the increment of the queuing score is transformed into
   units of time by dividing by AGING, so that the result represents the
   new expiry time of the flow.

   Given that probNative is already used to select which packets to ECN-
   mark, it might be thought that the queuing score could just be
   incremented by the full size of each selected packet, instead of
   incrementing it by the product of every packet's size (pkt_sz) and
   probNative. However, the unpublished experience of one of the
   authors with other congestion policers has found that the score then
   increments far too jumpily, particularly when probNative is low.

   A deeper explanation of the queuing score is given in Section 5.

   fill_bucket(bckt_id, pkt_sz, probNative) {
       now;    // current time
       now = get_time_now();  // in units of T_RES

       // Add packet's queuing score
       // For integer arithmetic, a bit-shift can replace the division
       qLscore = min(buckets[bckt_id].t_exp - now
                     + probNative * pkt_sz / AGING, qLSCORE_MAX);
       buckets[bckt_id].t_exp = now + qLscore;
       return qLscore;
   }

4.2.4. The calcProbNative() function

   To derive probNative, the QProt algorithm uses the linear ramp
   function calcProbNative() to normalize instantaneous queuing delay
   into a probability in the range [0,1].

   calcProbNative(qdelay){
       if ( qdelay >= MAXTH ) {
           probNative = MAX_PROB;
       } else if ( qdelay > MINTH ) {
           probNative = MAX_PROB * (qdelay - MINTH)/RANGE;
           // In practice, the * and the / would use a bit-shift
       } else {
           probNative = 0;
       }
       return probNative;
   }

5. Rationale

5.1. Rationale: Blame for Queuing, not for Rate in Itself

   Figure 1 shows the bit rates of two flows as stacked areas. It poses
   the question of which flow is more to blame for queuing delay: the
   unresponsive constant bit rate flow (c) that is consuming about 80%
   of the capacity, or the flow sending regular short unresponsive
   bursts (b)? The smoothness of c seems better for avoiding queuing,
   but its high rate does not.
However, if flow c was not there, or ran 738 slightly more slowly, b would not cause any queuing. 740 ^ bit rate (stacked areas) 741 | ,-. ,-. ,-. ,-. ,-. 742 |--|b|----------|b|----------|b|----------|b|----------|b|---Capacity 743 |__|_|__________|_|__________|_|__________|_|__________|_|_____ 744 | 745 | c 746 | 747 | 748 | 749 +----------------------------------------------------------------> 750 time 752 Figure 1: Which is More to Blame for Queuing Delay? 754 To explain queuing scores, in the following it will initially be 755 assumed that the QProt algorithm is accumulating queuing scores, but 756 not taking any action as a result. 758 To quantify the responsibility that each flow bears for queuing 759 delay, the QProt algorithm accumulates the product of the rate of 760 each flow and the level of congestion, both measured at the instant 761 each packet arrives. The instantaneous flow rate is represented at 762 each discrete event when a packet arrives by the packet's size, which 763 accumulates faster the more packets arrive within each unit of time. 764 The level of congestion is normalized to a dimensionless number 765 between 0 and 1 (probNative). This fractional congestion level is 766 used in preference to a direct dependence on queuing delay for two 767 reasons: 769 * to be able to ignore very low levels of queuing that contribute 770 insignificantly to delay 772 * to be able to erect a steep barrier against excessive queuing 773 delay 775 The unit of the resulting queue score is "congested-bytes" per 776 second, which distinguishes it from just bytes per second. 778 Then, during the periods between bursts (b), neither flow accumulates 779 any queuing score - the high rate of c is benign. But, during each 780 burst, if we say the rate of c and b are 80% and 45% of capacity, 781 thus causing 25% overload, they each bear (80/125)% and (45/125)% of 782 the responsibility for the queuing delay (64% and 36%). The 783 algorithm does not explicitly calculate these percentages. They are 784 just the outcome of the number of packets arriving from each flow 785 during the burst. 787 To summarize, the queuing score never sanctions rate solely on its 788 own account. It only sanctions rate inasmuch as it causes queuing. 790 ^ bit rate (stacked areas) , 791 | ,-. |\ ,- 792 |------Capacity-|b|----------,-.----------|b|----------|b\----- 793 | __|_|_______ |b| /``\| _...-._-': | ,.-- 794 | ,-. __/ \__|_|_ _/ |/ \|/ 795 | |b| ___/ \___/ __ r 796 | |_|/ v \__/ \_______ _/\____/ 797 | _/ \__/ 798 | 799 +----------------------------------------------------------------> 800 time 802 Figure 2: Responsibility for Queuing: More Complex Scenario 804 Figure 2 gives a more complex illustration of the way the queuing 805 score assigns responsibility for queuing (limited to the precision 806 that ASCII art can illustrate). The figure shows the bit rates of 807 three flows represented as stacked areas labelled b, v and r. The 808 unresponsive bursts (b) are the same as in the previous example, but 809 a variable rate video (v) replaces flow c. It's rate varies as the 810 complexity of the video scene varies. Also on a slower timescale, in 811 response to the level of congestion, the video adapts its quality. 812 However, on a short time-scale it appears to be unresponsive to small 813 amounts of queuing. Also, part-way through, a low latency responsive 814 flow (r) joins in, aiming to fill the balance of capacity left by the 815 other two. 
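   Before walking through the Figure 2 scenario, the simpler Figure 1
   case can be made concrete with a minimal C sketch. All numbers in it
   are illustrative assumptions (a 100 Mb/s link, flow c at 80% and flow
   b at 45% of capacity during a 10 ms burst, and an arbitrary constant
   probNative while the burst lasts), not values from the DOCSIS spec:

   #include <stdio.h>

   /* Illustrative numbers only: flow c is smooth at 80% of a 100 Mb/s
      link, flow b bursts at a further 45%, overloading the link by 25%
      for the 10 ms that one burst lasts.  probNative is assumed to hold
      a constant value p for the duration of the burst. */
   #define RATE_C   80e6    /* b/s, smooth unresponsive flow */
   #define RATE_B   45e6    /* b/s, flow b's rate during a burst */
   #define BURST_T  10e-3   /* s, duration of one burst */

   int main(void)
   {
       double p = 0.3;   /* assumed probNative during the burst */

       /* Each flow's queuing score over the burst is the sum over its
          packets of probNative * packet size, i.e. p * rate * time. */
       double score_c = p * (RATE_C / 8) * BURST_T;   /* congested-bytes */
       double score_b = p * (RATE_B / 8) * BURST_T;

       printf("blame shares: c = %.0f%%, b = %.0f%%\n",
              100 * score_c / (score_c + score_b),
              100 * score_b / (score_c + score_b));
       return 0;
   }

   This prints c = 64% and b = 36% whatever value p takes: the shares
   depend only on the flows' relative arrival rates while there is
   queuing. Between the bursts, probNative is zero, so neither flow
   accumulates any score, however fast c sends.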
817 The combination of the first burst and the low application-limited 818 rate of the video causes neither flow to accumulate queuing score. 819 In contrast, the second burst causes similar excessive overload 820 (125%) to the example in Figure 1. Then, the video happens to reduce 821 its rate (probably due to a less complex scene) so the third burst 822 causes only a little congestion. Let us assume the resulting queue 823 causes probNative to rise to just 1%, then the queuing score will 824 only accumulate 1% of the size of each packet of flows v and b during 825 this burst. 827 The fourth burst happens to arrive just as the new responsive flow 828 (r) has filled the available capacity, so it leads to very rapid 829 growth of the queue. After a round trip the responsive flow rapidly 830 backs off, and the adaptive video also backs off more rapidly than it 831 would normally, because of the very high congestion level. The rapid 832 response to congestion of flow r reduces the queuing score that all 833 three flows accumulate, but they each still bear the cost in 834 proportion to the product of the rates at which their packets arrive 835 at the queue and the value of probNative when they do so. Thus, 836 during the fifth burst, they all accumulate less score than the 837 fourth, because the queuing delay is not as excessive. 839 5.2. Rationale for Aging the Queuing Score 841 Even well-behaved flows will not always be able to respond fast 842 enough to dynamic events. Also well-behaved flows, e.g., DCTCP 843 [RFC8257], TCP Prague [I-D.briscoe-iccrg-prague-congestion-control], 844 BBRv2 [BBRv2] or the L4S variant of SCReAM [SCReAM] for real-time 845 media [RFC8298], can maintain a very shallow queue by continual 846 careful probing for more while also continually subtracting a little 847 from their rate (or congestion window) in response to low levels of 848 ECN signalling. Therefore, the QProt algorithm needs to continually 849 offer a degree of forgiveness to age out the queuing score as it 850 accumulates. 852 Scalable congestion controllers such as those above maintain their 853 congestion window in inverse proportion to the congestion level, 854 probNative. That leads to the important property that on average a 855 scalable flow holds the product of its congestion window and the 856 congestion level constant, no matter the capacity of the link or how 857 many other flows it competes with. For instance, if the link 858 capacity doubles, a scalable flow induces half the congestion 859 probability. Or if three scalable flows compete for the capacity, 860 each flow will reduce to one third of the capacity they would use on 861 their own and increase the congestion level by 3x. 863 This suggests that the QProt algorithm will not sanction a well- 864 behaved scalable flow if it ages out the queuing score at a 865 sufficient constant rate. The constant will need to be somewhat 866 above the average of a well-behaved scalable flow to allow for normal 867 dynamics. 869 Relating QProt's aging constant to a scalable flow does not mean that 870 a flow has to behave like a scalable flow. It can be less 871 aggressive, but not more. For instance, a longer RTT flow can run at 872 a lower congestion-rate than the aging rate, but it can also increase 873 its aggressiveness to equal the rate of short RTT scalable flows 874 [ScalingCC]. 
The constant aging of QProt also means that a long- 875 running unresponsive flow will be prone to trigger QProt if it runs 876 faster than a competing responsive scalable flow would. And, of 877 course, if a flow causes excessive queuing in the short-term, its 878 queuing score will still rise faster than the constant aging process 879 will decrease it. Then QProt will still eject the flow's packets 880 before they harm the low latency of the shared queue. 882 5.3. Rationale for Transformed Queuing Score 884 The QProt algorithm holds a flow's queuing score state in a structure 885 called a bucket, because of its similarity to a classic leaky bucket 886 (except the contents of the bucket does not represent bytes). 888 probNative * pkt_sz probNative * pkt_sz / AGING 889 | | 890 | V | | V | 891 | : | ___ | : | 892 |_____| ___ |_____| 893 | | ___ | | 894 |__ __| |__ __| 895 | | 896 V V 897 AGING * Dt Dt 899 Figure 3: Transformation of Queuing Score 901 The accumulation and aging of the queuing score is shown on the left 902 of Figure 3 in token bucket form. Dt is the difference between the 903 times when the scores of the current and previous packets were 904 processed. 906 A transformed equivalent of this token bucket is shown on the right 907 of Figure 3, dividing both the input and output by the constant AGING 908 rate. The result is a bucket-depth that represents time and it 909 drains at the rate that time passes. 911 As a further optimization, the time the bucket was last updated is 912 not stored with the flow-state. Instead, when the bucket is 913 initialized the queuing score is added to the system time 'now' and 914 the resulting expiry time is written into the bucket. Subsequently, 915 if the bucket has not expired, the incremental queuing score is added 916 to the time already held in the bucket. Then the queuing score 917 always represents the expiry time of the flow-state itself. This 918 means that the queuing score does not need to be aged explicitly 919 because it ages itself implicitly. 921 5.4. Rationale for Policy Conditions 923 Pseudocode for the QProt policy conditions is given in Section 4.1 924 within the second half of the qprotect() function. When each packet 925 arrives, after finding its flow state and updating the queuing score 926 of the packet's flow, the algorithm checks whether the shared queue 927 delay exceeds a constant threshold CRITICALqL (e.g., 2 ms), as 928 repeated below for convenience: 930 931 if ( ( qdelay > CRITICALqL ) // Test if qdelay over threshold... 932 // ...and if flow's q'ing score scaled by qdelay/CRITICALqL 933 // ...exceeds CRITICALqLSCORE 934 && ( qdelay * qLscore > CRITICALqLPRODUCT ) ) 935 // Recall that CRITICALqLPRODUCT = CRITICALqL * CRITICALqLSCORE 936 938 If the queue delay threshold is exceeded, the flow's queuing score is 939 temporarily scaled up by the ratio of the current queue delay to the 940 threshold queuing delay, CRITICALqL (the reason for the scaling is 941 given next). If this scaled up score exceeds another constant 942 threshold CRITICALqLSCORE, the packet is ejected. The actual last 943 line of code above multiplies both sides of the second condition by 944 CRITICALqL to avoid a costly division. 946 This approach allows each packet to be assessed once, as it arrives. 947 Once queue delay exceeds the threshold, it has two implications: 949 * The current packet might be ejected even though there are packets 950 already in the queue from flows with higher queuing scores. 
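   As a worked illustration of how the bar is raised, the following C
   sketch evaluates the ejection condition for a few queue delays, using
   the default parameter values quoted in Section 4.1 (CRITICALqL_us =
   MAXTH_us = 1000 us and CRITICALqLSCORE_us = 4000 us). It uses
   floating-point microseconds purely for readability; the actual
   algorithm works in integer units of T_RES and avoids the division by
   comparing against the pre-computed CRITICALqLPRODUCT instead.

   #include <stdio.h>

   #define CRITICALqL_us      1000.0   /* default: MAXTH_us */
   #define CRITICALqLSCORE_us 4000.0   /* default threshold queuing score */

   int main(void)
   {
       double qdelays_us[] = { 500.0, 1500.0, 2000.0, 4000.0 };

       for (int i = 0; i < 4; i++) {
           double qdelay = qdelays_us[i];
           if (qdelay <= CRITICALqL_us) {
               printf("qdelay %4.0f us: below CRITICALqL, no ejection\n",
                      qdelay);
           } else {
               /* qdelay * qLscore > CRITICALqL * CRITICALqLSCORE is the
                  same as qLscore > CRITICALqLSCORE * CRITICALqL / qdelay,
                  so the queuing score needed for ejection shrinks as the
                  queue delay grows. */
               double score_needed_us =
                   CRITICALqLSCORE_us * CRITICALqL_us / qdelay;
               printf("qdelay %4.0f us: eject if flow's queuing score > %4.0f us\n",
                      qdelay, score_needed_us);
           }
       }
       return 0;
   }

   So at twice the critical delay a flow needs only half the threshold
   score for one of its packets to be ejected, which is the increasingly
   stringent 'judgement of innocence' shown by the boundary of region E
   in Figure 4.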
951 However, any flow that continues to contribute to the queue will 952 have to send further packets, giving an opportunity to eject them 953 as well, as they subsequently arrive. 955 * The next packets to arrive might not be ejected, because they 956 might belong to flows with low queuing scores. In this case, 957 queue delay could continue to rise with no opportunity to eject a 958 packet. This is why the queuing score is scaled up by the current 959 queue delay. Then, the more the queue has grown without ejecting 960 a packet, the more the algorithm 'raises the bar' to further 961 packets. 963 The above approach is preferred over the extra per-packet processing 964 cost of searching the buckets for the flow with the highest queuing 965 score and searching the queue for one of its packets to eject (if one 966 is still in the queue). 968 Note that by default CRITICALqL_us is set to the maximum threshold of 969 the ramp marking algorithm, MAXTH_us. However, there is some debate 970 as to whether setting it to the minimum threshold instead would 971 improve QProt performance. This would roughly double the ratio of 972 qdelay to CRITICALqL, which is compared against the CRITICALqLSCORE 973 threshold. So the threshold would have to be roughly doubled 974 accordingly. 976 Figure 4 explains this approach graphically. On the horizontal axis 977 it shows actual harm, meaning the queuing delay in the shared queue. 978 On the vertical axis it shows the behaviour record of the flow 979 associated with the currently arriving packet, represented in the 980 algorithm by the flow's queuing score. The shaded region represents 981 the combination of actual harm and behaviour record that will lead to 982 the packet being ejected. 984 Behaviour Record: 985 Queueing Score of 986 Arriving Packet's Flow 987 ^ 988 | + |/ / / / / / / / / / / / / / / / / / / 989 | + N | / / / / / / / / / / / / / / / / / / / 990 | + |/ / / / / / / / / / 991 | + | / / / / E (Eject packet) / / / / / 992 | + |/ / / / / / / / / / 993 | + | / / / / / / / / / / / / / / / / / / / 994 | + |/ / / / / / / / / / / / / / / / / / / 995 | +| / / / / / / / / / / / / / / / / / / / 996 | |+ / / / / / / / / / / / / / / / / / / 997 | N | + / / / / / / / / / / / / / / / / / 998 | (No actual | +/ / / / / / / / / / / / / / / 999 | harm) | + / / / / / / / / / / / / 1000 | | P (Pass over) + ,/ / / / / / / / 1001 | | ^ + /./ /_/ 1002 +--------------+------------------------------------------> 1003 CRITICALqL Actual Harm: Shared Queue Delay 1005 Figure 4: Graphical Explanation of the Policy Conditions 1007 The regions labelled 'N' represent cases where the first condition is 1008 not met - no actual harm - queue delay is below the critical 1009 threshold, CRITICALqL. 1011 The region labelled 'E' represents cases where there is actual harm 1012 (queue delay exceeds CRITICALqL) and the queuing score associated 1013 with the arriving packet is high enough to be able to eject it with 1014 certainty. 1016 The region labelled 'P' represents cases where there is actual harm, 1017 but the queuing score of the arriving packet is insufficient to eject 1018 it, so it has to be Passed over. This adds to queuing delay, but the 1019 alternative would be to sanction an innocent flow. It can be seen 1020 that, as actual harm increases, the judgement of innocence becomes 1021 increasingly stringent; the behaviour record of the next packet's 1022 flow does not have to be as bad to eject it. 
1024 Conditioning ejection on actual harm helps prevent VPN packets being 1025 ejected unnecessarily. VPNs consisting of multiple flows can tend to 1026 accumulate queuing score faster than it is aged out, because the 1027 aging rate is intended for a single flow. However, whether or not 1028 some traffic is in a VPN, the queue delay threshold (CRITICALqL) will 1029 be no more likely to be exceeded. So conditioning ejection on actual 1030 harm helps reduce the chance that VPN traffic will be ejected by the 1031 QProt function. 1033 5.5. Rationale for Reclassification as the Policy Action 1035 When the DOCSIS QProt algorithm deems that it is necessary to eject a 1036 packet to protect the Low Latency queue, it redirects the packet to 1037 the Classic queue. In the Low Latency DOCSIS architecture (as in 1038 Coupled DualQ AQMs generally), the Classic queue is expected to 1039 frequently have a larger backlog of packets, caused by classic 1040 congestion controllers interacting with a classic AQM (which has a 1041 latency target of 10ms) as well as other bursty traffic. 1043 Therefore, typically, an ejected packet will experience higher 1044 queuing delay than it would otherwise, and it could be re-ordered 1045 within its flow (assuming QProt does not eject all packets of an 1046 anomalous flow). The mild harm caused to the performance of the 1047 ejected packet's flow is deliberate. It gives senders a slight 1048 incentive to identify their packets correctly. 1050 If there were no such harm, there would be nothing to prevent all 1051 flows from identifying themselves as suitable for classification into 1052 the low latency queue, and just letting QProt sort the resulting 1053 aggregate into queue-building and non-queue-building flows. This 1054 might seem like a useful alternative to requiring senders to 1055 correctly identify their flows. However, handling of mis-classified 1056 flows is not without a cost. The more packets that have to be 1057 reclassified, the more often the delay of the low latency queue would 1058 exceed the threshold. Also more memory would be required to hold the 1059 extra flow state. 1061 When a packet is redirected into the Classic queue, an operator might 1062 want to alter the identifier(s) that originally caused it to be 1063 classified into the Low Latency queue, so that the packet will not be 1064 classified into another low latency queue further downstream. 1065 However, redirection of occasional packets can be due to unusually 1066 high transient load just at the specific bottleneck, not necessarily 1067 at any other bottleneck, and not necessarily due to bad flow 1068 behaviour. Therefore, Section 5.4.1.2 of [I-D.ietf-tsvwg-ecn-l4s-id] 1069 precludes a network node from altering the end-to-end ECN field to 1070 exclude traffic from L4S treatment. Instead a local-use identifier 1071 ought to be used (e.g., Diffserv Codepoint or VLAN tag), so that each 1072 operator can apply its own policy, without prejudging what other 1073 operators ought to do. 1075 Although not supported in the DOCSIS specs, QProt could be extended 1076 to recognize that large numbers of redirected packets belong to the 1077 same flow. This might be detected when the bucket expiry time t_exp 1078 exceeds a threshold. Depending on policy and implementation 1079 capabilities, QProt could then install a classifier to redirect a 1080 whole flow into the Classic queue, with an idle timeout to remove 1081 stale classifiers. 
In these 'persistent offender' cases, QProt might 1082 also overwrite each redirected packet's DSCP or clear its ECN field 1083 to Not-ECT, in order to protect other potential L4S queues 1084 downstream. The DOCSIS specs do not discuss sanctioning whole flows, 1085 so further discussion is beyond the scope of the present document. 1087 6. Limitations 1089 The QProt algorithm groups packets with common layer-4 flow 1090 identifiers. It then uses this grouping to accumulate queuing scores 1091 and to sanction packets. 1093 This choice of identifier for grouping is pragmatic with no 1094 scientific basis. All the packets of a flow certainly pass between 1095 the same two endpoints. But some applications might initiate 1096 multiple flows between the same end-points, e.g., for media, control, 1097 data, etc. Others might use common flow identifiers for all these 1098 streams. Also, a user might group multiple application flows within 1099 the same encrypted VPN between the same layer-4 tunnel end-points. 1100 And even if there were a one-to-one mapping between flows and 1101 applications, there is no reason to believe that the rate at which 1102 congestion can be caused ought to be allocated on a per application 1103 flow basis. 1105 The use of a queuing score that excludes those aspects of flow rate 1106 that do not contribute to queuing (Section 5.1) goes some way to 1107 mitigating this limitation, because the algorithm does not judge 1108 responsibility for queuing delay primarily on the combined rate of a 1109 set of flows grouped under one flow ID. 1111 7. IANA Considerations (to be removed by RFC Editor) 1113 This specification contains no IANA considerations. 1115 8. Implementation Status 1117 +================+================================================+ 1118 | Implementation | DOCSIS models for ns-3 | 1119 | name: | | 1120 +================+================================================+ 1121 | Organization | CableLabs | 1122 +----------------+------------------------------------------------+ 1123 | Web page | https://apps.nsnam.org/app/docsis-ns3/ | 1124 +----------------+------------------------------------------------+ 1125 | Description | ns-3 simulation models developed and used in | 1126 | | support of the Low Latency DOCSIS development, | 1127 | | including models of Dual Queue Coupled AQM, | 1128 | | Queue Protection, and the DOCSIS MAC | 1129 +----------------+------------------------------------------------+ 1130 | Maturity | Simulation models that can also be used in | 1131 | | emulation mode in a testbed context | 1132 +----------------+------------------------------------------------+ 1133 | Coverage | Complete implementation of Annex P of DOCSIS | 1134 | | 3.1 | 1135 +----------------+------------------------------------------------+ 1136 | Version | DOCSIS 3.1, version I21; | 1137 | | https://www.cablelabs.com/specifications/CM- | 1138 | | SP-MULPIv3.1?v=I21 | 1139 +----------------+------------------------------------------------+ 1140 | Licence | GPLv2 | 1141 +----------------+------------------------------------------------+ 1142 | Contact | via web page | 1143 +----------------+------------------------------------------------+ 1144 | Last Impl'n | Mar 2022 | 1145 | update | | 1146 +----------------+------------------------------------------------+ 1147 | Information | 7 Mar 2022 | 1148 | valid at | | 1149 +----------------+------------------------------------------------+ 1151 Table 1 1153 There are also a number of closed source implementations, including 2 
1154 cable modem implementations written by different chipset 1155 manufacturers, and one CMTS implementation by a third manufacturer. 1156 These, as well as the ns-3 implementation, have passed the full suite 1157 of compliance tests developed by CableLabs. 1159 9. Security Considerations 1161 The whole of this document concerns traffic security. It considers 1162 the security question of how to identify and eject traffic that does 1163 not comply with the non-queue-building behaviour required to use a 1164 shared low latency queue, whether accidentally or maliciously. 1166 Section 8.2 of the L4S architecture [I-D.ietf-tsvwg-l4s-arch] 1167 introduces the problem of maintaining low latency by either self- 1168 restraint or enforcement, and places DOCSIS queue protection in 1169 context within a wider set of approaches to the problem. 1171 9.1. Resource Exhaustion Attacks 1173 The algorithm has been designed to fail gracefully in the face of 1174 traffic crafted to overrun the resources used for the algorithm's own 1175 processing and flow state. This means that non-queue-building flows 1176 will always be less likely to be sanctioned than queue-building 1177 flows. But an attack could be contrived to deplete resources in such 1178 a way that the proportion of innocent (non-queue-building) flows that 1179 are incorrectly sanctioned could increase. 1181 Incorrect sanctioning is intended not to be catastrophic; it results 1182 in more packets from well-behaved flows being redirected into the 1183 Classic queue; thus introducing more reordering into innocent flows. 1185 9.1.1. Exhausting Flow-State Storage 1187 To exhaust the number of buckets, the most efficient attack is to 1188 send enough long-running attack flows to increase the chance that an 1189 arriving flow will not find an available bucket, and therefore have 1190 to share the 'dregs' bucket. For instance, if ATTEMPTS=2 and 1191 NBUCKETS=32, it requires about 94 attack flows, each using different 1192 port numbers, to increase the probability to 99% that an arriving 1193 flow will have to share the dregs, where it will share a high degree 1194 of redirection into the C queue with the remainder of the attack 1195 flows. 1197 For an attacker to keep buckets busy, it is more efficient to hold 1198 onto them by cycling regularly through a set of port numbers (94 in 1199 the above example), rather than to keep occupying and releasing 1200 buckets with single packet flows across a much larger number of 1201 ports. 1203 During such an attack, the coupled marking probability will have 1204 saturated at 100%. So, to hold a bucket, the rate of an attack flow 1205 needs to be no less than the AGING rate of each bucket; 4Mb/s by 1206 default. However, for an attack flow to be sure to hold on to its 1207 bucket, it would need to send somewhat faster. Thus an attack with 1208 100 flows would need a total force of more than 100 * 4Mb/s = 400Mb/ 1209 s. 1211 This attack can be mitigated (but not prevented) by increasing the 1212 number of buckets. The required attack force scales linearly with 1213 the number of buckets, NBUCKETS. So, if NBUCKETS were doubled to 64, 1214 twice as many 4Mb/s flows would be needed to maintain the same impact 1215 on innocent flows. 1217 Probably the most effective mitigation would be to implement 1218 redirection of whole-flows once enough of the individual packets of a 1219 certain offending flow had been redirected. 
Probably the most effective mitigation would be to implement
redirection of whole flows once enough of the individual packets of a
particular offending flow had been redirected.  This would free up the
buckets used to maintain the per-packet queuing score of persistent
offenders.  Whole-flow redirection is outside the scope of the current
version of the QProt algorithm specified here, but it is briefly
discussed at the end of Section 5.5.

It might be considered that all the packets of persistently offending
flows ought to be discarded rather than redirected.  However, this is
not recommended, because attack flows might be able to pervert whole-
flow discard, turning it onto at least some innocent flows, thus
amplifying an attack that causes reordering into one that completely
deletes some innocent flows.

9.1.2. Exhausting Processing Resources

The processing time needed to apply the QProt algorithm to each L
packet is small, and it is intended not to exceed the time available
between the packets of a run of fairly small packets.  However, an
attack could use minimum-size packets launched from multiple input
interfaces into a lower capacity output interface.  Whether the QProt
algorithm is vulnerable to processor exhaustion will depend on the
specific implementation.

Addition of a capability to redirect persistently offending flows from
L to C would be the most effective way to reduce the per-packet
processing cost of the QProt algorithm when under attack.  As already
mentioned in Section 9.1.1, it would also be an effective way to
mitigate flow-state exhaustion attacks.  Whole-flow redirection is
outside the scope of the present document, but it is briefly discussed
at the end of Section 5.5.

10. Comments Solicited

Evaluation and assessment of the algorithm by researchers are
solicited.  Comments and questions are also encouraged and welcome.
They can be addressed to the authors.

11. Acknowledgements

Thanks to Tom Henderson, Magnus Westerlund, David Black, Adrian Farrel
and Gorry Fairhurst for their reviews of this document.  The design of
the QProt algorithm and the settings of its parameters benefited from
discussion and critique by the participants of the cable industry
working group on Low Latency DOCSIS.  CableLabs funded Bob Briscoe's
initial work on this document.

12. References

12.1. Normative References

[DOCSIS]   CableLabs, "MAC and Upper Layer Protocols Interface (MULPI)
           Specification, CM-SP-MULPIv3.1", Data-Over-Cable Service
           Interface Specifications DOCSIS® 3.1, Version I17 or later,
           21 January 2019.

[DOCSIS-CCAP-OSS]
           CableLabs, "CCAP Operations Support System Interface Spec",
           Data-Over-Cable Service Interface Specifications DOCSIS®
           3.1, Version I14 or later, 21 January 2019.

[DOCSIS-CM-OSS]
           CableLabs, "Cable Modem Operations Support System Interface
           Spec", Data-Over-Cable Service Interface Specifications
           DOCSIS® 3.1, Version I14 or later, 21 January 2019.

[I-D.ietf-tsvwg-ecn-l4s-id]
           Schepper, K. D. and B. Briscoe, "Explicit Congestion
           Notification (ECN) Protocol for Very Low Queuing Delay
           (L4S)", Work in Progress, Internet-Draft,
           draft-ietf-tsvwg-ecn-l4s-id-25, 4 March 2022.

[I-D.ietf-tsvwg-nqb]
           White, G. and T. Fossati, "A Non-Queue-Building Per-Hop
           Behavior (NQB PHB) for Differentiated Services", Work in
           Progress, Internet-Draft, draft-ietf-tsvwg-nqb-10,
           4 March 2022.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D.
Black, "The Addition 1305 of Explicit Congestion Notification (ECN) to IP", 1306 RFC 3168, DOI 10.17487/RFC3168, September 2001, 1307 . 1309 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 1310 Notification (ECN) Experimentation", RFC 8311, 1311 DOI 10.17487/RFC8311, January 2018, 1312 . 1314 12.2. Informative References 1316 [BBRv2] Cardwell, N., "TCP BBR v2 Alpha/Preview Release", github 1317 repository; Linux congestion control module, 1318 . 1320 [I-D.briscoe-iccrg-prague-congestion-control] 1321 Schepper, K. D., Tilmans, O., and B. Briscoe, "Prague 1322 Congestion Control", Work in Progress, Internet-Draft, 1323 draft-briscoe-iccrg-prague-congestion-control-00, 9 March 1324 2021, . 1327 [I-D.ietf-tsvwg-aqm-dualq-coupled] 1328 Schepper, K. D., Briscoe, B., and G. White, "DualQ Coupled 1329 AQMs for Low Latency, Low Loss and Scalable Throughput 1330 (L4S)", Work in Progress, Internet-Draft, draft-ietf- 1331 tsvwg-aqm-dualq-coupled-23, 4 May 2022, 1332 . 1335 [I-D.ietf-tsvwg-l4s-arch] 1336 Briscoe, B., Schepper, K. D., Bagnulo, M., and G. White, 1337 "Low Latency, Low Loss, Scalable Throughput (L4S) Internet 1338 Service: Architecture", Work in Progress, Internet-Draft, 1339 draft-ietf-tsvwg-l4s-arch-17, 4 March 2022, 1340 . 1343 [LLD] White, G., Sundaresan, K., and B. Briscoe, "Low Latency 1344 DOCSIS: Technology Overview", CableLabs White Paper , 1345 February 2019, . 1348 [RFC4303] Kent, S., "IP Encapsulating Security Payload (ESP)", 1349 RFC 4303, DOI 10.17487/RFC4303, December 2005, 1350 . 1352 [RFC6789] Briscoe, B., Ed., Woundy, R., Ed., and A. Cooper, Ed., 1353 "Congestion Exposure (ConEx) Concepts and Use Cases", 1354 RFC 6789, DOI 10.17487/RFC6789, December 2012, 1355 . 1357 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 1358 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 1359 DOI 10.17487/RFC7713, December 2015, 1360 . 1362 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 1363 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 1364 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 1365 October 2017, . 1367 [RFC8298] Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation 1368 for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December 1369 2017, . 1371 [ScalingCC] 1372 Briscoe, B. and K. De Schepper, "Resolving Tensions 1373 between Congestion Control Scaling Requirements", Simula 1374 Technical Report TR-CS-2016-001 arXiv:1904.07605, July 1375 2017, . 1377 [SCReAM] Johansson, I., "SCReAM", github repository; , 1378 . 1381 Authors' Addresses 1383 Bob Briscoe (editor) 1384 Independent 1385 United Kingdom 1386 Email: ietf@bobbriscoe.net 1387 URI: http://bobbriscoe.net/ 1389 Greg White 1390 CableLabs 1391 United States of America 1392 Email: G.White@CableLabs.com