idnits 2.17.1 

draft-briscoe-docsis-q-protection-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 750 has weird spacing: '... pkt_sz   prob...'

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (July 8, 2019) is 1753 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- Looks like a reference, but probably isn't: '0' on line 587

  -- Looks like a reference, but probably isn't: '1' on line 587

  == Missing Reference: 'B' is mentioned on line 381, but not defined

  == Outdated reference: A later version (-29) exists of
     draft-ietf-tsvwg-ecn-l4s-id-06

  == Outdated reference: A later version (-25) exists of
     draft-ietf-tsvwg-aqm-dualq-coupled-09


     Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Transport Area Working Group                             B. Briscoe, Ed.
3	Internet-Draft                                                  G. White
4	Intended status: Informational                                 CableLabs
5	Expires: January 9, 2020                                    July 8, 2019

7	                Queue Protection to Preserve Low Latency
8	                  draft-briscoe-docsis-q-protection-00

10	Abstract

12	   This informational document defines and explains the specification of
13	   the queue protection algorithm used in DOCSIS 3.1.  It is believed
14	   this algorithm will be useful in scenarios other than DOCSIS.  A
15	   shared low latency queue relies on the non-queue-building behaviour
16	   of every traffic flow using it.  However, some flows might not take
17	   such care, either accidentally or maliciously.  If a queue is about
18	   to exceed a threshold level of delay, the queue protection algorithm
19	   can rapidly detect the flow(s) most likely to be responsible.  It can
20	   then prevent selected packets of these flows (or whole flows) from
21	   harming the queuing delay of other traffic in the low latency queue.

23	Status of This Memo

25	   This Internet-Draft is submitted in full conformance with the
26	   provisions of BCP 78 and BCP 79.

28	   Internet-Drafts are working documents of the Internet Engineering
29	   Task Force (IETF).  Note that other groups may also distribute
30	   working documents as Internet-Drafts.  The list of current Internet-
31	   Drafts is at https://datatracker.ietf.org/drafts/current/.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time.  It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   This Internet-Draft will expire on January 9, 2020.

40	Copyright Notice

42	   Copyright (c) 2019 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (https://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document.  Please review these documents
49	   carefully, as they describe your rights and restrictions with respect
50	   to this document.  Code Components extracted from this document must
51	   include Simplified BSD License text as described in Section 4.e of
52	   the Trust Legal Provisions and are provided without warranty as
53	   described in the Simplified BSD License.

55	Table of Contents

57	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
58	     1.1.  Document Roadmap  . . . . . . . . . . . . . . . . . . . .   3
59	     1.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . .   4
60	   2.  Approach - In Brief . . . . . . . . . . . . . . . . . . . . .   4
61	     2.1.  Mechanism . . . . . . . . . . . . . . . . . . . . . . . .   5
62	     2.2.  Policy  . . . . . . . . . . . . . . . . . . . . . . . . .   6
63	   3.  Necessary Flow Behaviour  . . . . . . . . . . . . . . . . . .   6
64	   4.  Pseudocode Walk-Through . . . . . . . . . . . . . . . . . . .   7
65	     4.1.  Input Parameters, Constants and Variables . . . . . . . .   8
66	     4.2.  Queue Protection Data Path  . . . . . . . . . . . . . . .   9
67	   5.  Rationale . . . . . . . . . . . . . . . . . . . . . . . . . .  14
68	     5.1.  Rationale: Blame for Queuing, not for Rate in Itself  . .  14
69	     5.2.  Rationale for Aging the Queuing Score . . . . . . . . . .  17
70	     5.3.  Rationale for Normalized Queuing Score  . . . . . . . . .  17
71	     5.4.  Rationale for Policy Conditions . . . . . . . . . . . . .  18
72	   6.  Limitations . . . . . . . . . . . . . . . . . . . . . . . . .  20
73	   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  21
74	   8.  Security Considerations . . . . . . . . . . . . . . . . . . .  21
75	   9.  Comments Solicited  . . . . . . . . . . . . . . . . . . . . .  21
76	   10. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  21
77	   11. References  . . . . . . . . . . . . . . . . . . . . . . . . .  21
78	     11.1.  Normative References . . . . . . . . . . . . . . . . . .  21
79	     11.2.  Informative References . . . . . . . . . . . . . . . . .  22
80	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  24

82	1.  Introduction

84	   This informational document defines and explains the specification of
85	   the queue protection (QProt) algorithm used in DOCSIS 3.1
86	   [DOCSIS3.1].  It is believed this algorithm will be useful in
87	   scenarios other than DOCSIS.

89	   Low queuing delay depends on hosts sending their data smoothly either
90	   at low rate or responding to explicit congestion notifications (ECN).
91	   So low latency is something hosts create themselves, not something
92	   the network gives them.  Therefore, there is no incentive for flows
93	   to mis-mark their packets for the low latency queue, However, traffic
94	   from an application that does not behave in a non-queue-building way
95	   might erroneously be classified into a low latency queue, whether
96	   accidentally or maliciously.  QProt prevents such erroneous behavior
97	   from harming the queuing delay of other traffic in the low latency
98	   queue.

100	   In normal scenarios without misclassified traffic, QProt does not
101	   intervene at all in the classification or forwarding of packets.

103	   An overview of how low latency support has been added to DOCSIS is
104	   given in [I-D.white-tsvwg-lld].  In each direction of a DOCSIS link
105	   (upstream and downstream), there are two queues: one for Low Latency
106	   and one for Classic traffic, in an arrangement similar to the IETF's
107	   Coupled DualQ AQM [I-D.ietf-tsvwg-aqm-dualq-coupled].  The Classic
108	   queue is intended for traffic such as traditional (Reno/Cubic) TCP
109	   that needs about a round trip of buffering to fully utilize the link,
110	   and therefore has no incentive to mismark itself as low latency.  The
111	   QProt function is located at the ingress to the Low Latency queue.
112	   Therefore, in the upstream QProt is located on the cable modem (CM),
113	   and in the downstream it is located on the cable CMTS (CM Termination
114	   System).  If an arriving packet triggers queue protection, the DOCSIS
115	   algorithm reclassifies the packet from the Low Latency queue into the
116	   Classic queue.

118	   If QProt is used in settings other than DOCSIS, it would be a simple
119	   matter to detect queue-building flows by using slightly different
120	   conditions, and/or trigger a different action as a consequence, as
121	   appropriate for the scenario, e.g. dropping instead of reclassifying
122	   packets or perhaps accumulating a second per-flow score to decide
123	   whether to redirect a whole flow rather than just certain packets.

125	   The algorithm is based on a principled approach to quantifying how
126	   much each user contributes to congestion, which is used in economics
127	   to allocate responsibility for the cost of one party's behaviour on
128	   others (the economic externality).  Another important feature of the
129	   approach is that the metric used for the queuing score is based on
130	   the same variable that determines the level of ECN signalling seen by
131	   the sender [RFC8311], [I-D.ietf-tsvwg-ecn-l4s-id].  This transparency
132	   is necessary to be able to objectively state (in Section 3) how a
133	   flow can keep on the 'good' side of the algorithm.

135	1.1.  Document Roadmap

137	   The core of the document is the walk-through of the DOCSIS QProt
138	   algorithm's pseudocode in Section 4.

140	   Prior to that, two brief sections provide a "bluffer's guide to
141	   QProt" which should suffice for those who do not need the details or
142	   the insights.  Section 2 summarizes the approach used in the
143	   algorithm.  Then Section 3 considers QProt from the perspective of
144	   the end-system, by defining the behaviour that a flow needs to comply
145	   with to avoid the QProt algorithm ejecting its packets from the low
146	   latency queue.

148	   Section 5 gives deeper insight into the principles and rationale
149	   behind the algorithms.  Then Section 6 explains the limitations of
150	   the approach, followed by the usual closing sections.

152	1.2.  Terminology

154	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
155	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
156	   document are to be interpreted as described in [RFC2119].  In this
157	   document, these words will appear with that interpretation only when
158	   in ALL CAPS.  Lower case uses of these words are not to be
159	   interpreted as carrying RFC-2119 significance.

161	   The normative language for the DOCSIS QProt algorithm is in the
162	   DOCSIS 3.1 specifications [DOCSIS3.1], [DOCSIS3.1-CM-OSS],
163	   [DOCSIS3.1-CCAP-OSS] not in this informational guide.

165	   The following terms and abbreviations are used:

167	   CM:  Cable Modem

169	   CMTS:  CM Termination System

171	   Congestion-rate:  The rate at which a flow induces ECN-marked (or
172	      dropped) bytes, where an ECN-mark on a packet is defined as
173	      marking all the packet's bytes.  Congestion-bit-rate and
174	      congestion-volume were introduced in [RFC7713] and [RFC6789].

176	   Non-queue-building:  A flow that tends not to build a queue

178	   Queue-building:  A flow that builds a queue, and therefore is a
179	      candidate for the queue protection algorithm to detect and
180	      sanction

182	   ECN:  Explicit Congestion Notification

184	   QProt:  The Queue Protection function

186	2.  Approach - In Brief

188	   The algorithm is divided into mechanism and policy.

190	   o  The mechanism aspects identify flows, maintain flow-state and
191	      accumulate per-flow queuing scores;

193	   o  The policy aspects tend to be brief, but more likely to be
194	      modified in future.  They can be divided into conditions and
195	      actions:

197	      *  The conditions are the logic that determines when action should
198	         be taken to avert the risk of queuing delay becoming excessive;

200	      *  The actions determine how this risk is averted, e.g. by
201	         redirecting packets from a flow into another queue, or to
202	         reclassify a whole flow that seems to be misclassified.

204	2.1.  Mechanism

206	   The algorithm maintains per-flow-state, where 'flow' usually means an
207	   end-to-end (layer-4) 5-tuple.  The flow-state consists of a queuing
208	   score normalized to also represent the flow-state's own expiry time
209	   (explained in Section 5.3).  A higher queuing score pushes out the
210	   expiry time further.

212	   Non-queue-building flows tend to release their flow-state rapidly ---
213	   it usually expires reasonably early in the gap between the packets of
214	   a normal flow.  Then the memory can be recycled for packets from
215	   other flows that arrive in between.  So only queue-building flows
216	   hold flow state persistently.

218	   The simplicity and effectiveness of the algorithm is due to the
219	   definition of the queuing score.  It uses the internal variable from
220	   the AQM that determines the ECN marking probability, probNative, of
221	   the low latency queue.  In floating point arithmetic, (0 <=
222	   probNative <= 1).  The algorithm scales the size of each arriving
223	   packet of a flow by the value of probNative.

225	   The algorithm so far would accumulate a number that would rise at the
226	   so-called congestion-rate of the flow, i.e. the rate at which the
227	   flow is contributing to congestion, or the rate at which the AQM is
228	   forwarding bytes of the flow that are ECN marked.  However, rather
229	   than growing continually, the queuing score is also aged at a
230	   constant rate.

232	   In practice, the queuing score is normalized into time units (to
233	   represent the expiry time of the flow state, as above).  Then it does
234	   not need to be explicitly aged, because the natural passage of time
235	   implicitly 'ages' an expiry time.

237	2.2.  Policy

239	   The algorithm uses the queuing score to determine whether to eject
240	   each packet as it arrives, rather than allow it to further increase
241	   queuing delay.  This limits the policies available.  For instance,
242	   when queueing delay exceeds a threshold, it is not possible to eject
243	   a packet from the flow with the highest queuing scoring, because that
244	   would involve searching the queue for such a packet (if indeed there
245	   was still one in the queue).  Nonetheless, it is still possible to
246	   develop a policy that protects the low latency of the queue.

248	   Currently in DOCSIS, when the policy conditions are met, the action
249	   taken to protect the low latency queue is to reclassify a packet into
250	   the Classic queue.

252	3.  Necessary Flow Behaviour

254	   The QProt algorithm described here can be used for responsive and/or
255	   unresponsive flows.

257	   o  It is possible to objectively describe the least responsive way
258	      that a flow will need to respond to congestion signals in order to
259	      avoid triggering queue protection, no matter the link capacity and
260	      no matter how much other traffic there is.

262	   o  It is not possible to describe how fast or smooth an unresponsive
263	      flow should be to avoid queue protection, because this depends on
264	      how much other traffic there is and the capacity of the link,
265	      which an application is unable to know.  However, the smoother an
266	      unresponsive flow paces its packets and the lower its rate
267	      relative to typical broadband link capacities, the less likelihood
268	      that it will risk causing enough queueing to trigger queue
269	      protection.

271	   In DOCSIS, unresponsive flows are classified into the low latency
272	   queue by a Non-Queue-Building (NQB) Diffserv codepoint
273	   [I-D.white-tsvwg-nqb], or an operator can use various other
274	   additional local classifiers.

276	   Responsive low latency flows have to use L4S ECN
277	   [I-D.ietf-tsvwg-ecn-l4s-id] to get classified into the low latency
278	   queue.

280	   The QProt algorithm is driven from the same variable that drives the
281	   ECN marking probability in the low latency queue (Annex N of
282	   [DOCSIS3.1] ).  The algorithm that calculates this internal variable
283	   is run on the arrival of every packet, whether it is ECN-capable or
284	   not, so that it can be used by the QProt algorithm.  But the variable
285	   is only used to ECN-mark packets that are ECN-capable.

287	   Not only does this dual use of the variable improve processing
288	   efficiency, but it also makes the basis of the QProt algorithm
289	   visible and transparent, at least for responsive ECN-capable flows.
290	   Then it is possible to state objectively that a flow can avoid
291	   triggering queue protection by keeping the bit rate of ECN marked
292	   packets (the congestion-rate) below AGING, which is a configured
293	   constant of the algorithm (default 2^19 B/s ~= 4.2 Mb/s).  Note that
294	   a congestion controller is advised to keep the average congestion-
295	   rate well below this level (e.g. ~1 Mb/s), to ensure that queue
296	   protection is not triggered during transient dynamics.

298	   If the QProt algorithm is used in other settings, it would need to be
299	   based on the visible level of congestion signalling, in a similar way
300	   to DOCSIS.  Without transparency of the basis of the algorithm's
301	   decisions, end-systems would not be able to avoid triggering queue
302	   protection on an objective basis.

304	4.  Pseudocode Walk-Through

306	   The algorithm categorizes packets into flows, usually defined by a
307	   common 5-tuple (or 4-tuple) of:

309	   o  source and destination IP addresses of the innermost IP header
310	      found;

312	   o  protocol of the layer above this IP header

314	   o  either of:

316	      *  source and destination port numbers, for TCP, UDP, UDP-Lite,
317	         SCTP, DCCP, etc.

319	      *  Security Parameters Index (SPI) for IPSec Encapsulating
320	         Security Payload (ESP) [RFC4303].

322	   Annex P.3 of DOCSIS 3.1 [DOCSIS3.1] defines various strategies to
323	   find these headers by skipping extension headers or encapsulations.
324	   If they cannot be found the spec defines various less-specific
325	   3-tuples that would be used.  DOCSIS 3.1 should be referred to for
326	   all these strategies, which will not be repeated here.

328	4.1.  Input Parameters, Constants and Variables

330	   The operator input parameters that set the parameters in the first
331	   two blocks of pseudocode below are defined for cable modems (CMs) in
332	   [DOCSIS3.1-CM-OSS] and for CMTSs in [DOCSIS3.1-CCAP-OSS].  The
333	   constants below that are derived from them or hard-coded.

335	   // Input Parameters
336	   QPROTECT_ON         // Queue Protection is enabled if TRUE
337	   CRITICALqL_us       // Threshold delay of L queue [us]
338	   CRITICALqLSCORE_us  // The threshold queuing score [us]
339	   LG_AGING            // The aging rate of the q'ing score, as
340	                       //  log base 2 of the congestion-rate [lg(B/s)]

342	   // Input Parameters for the calcProbNative() algorithm:
343	   MAXTH_us            // Max marking threshold [us] for IAQM
344	   LG_RANGE            // Log base 2 of the range of ramp [lg(ns)]
345	                       //  Default: 2^19 = 524288 ns (roughly 525 us)

347	   // Constants, either derived from input parameters or hard-coded
348	   AGING = pow(2, (LG_AGING-30) );       // Convert lg([B/s]) to [B/ns]
349	   CRITICALqL = CRITICALqL_us * 1000             // Convert [us] to [ns]
350	   CRITICALqLSCORE = CRITICALqLSCORE_us * 1000   // Convert [us] to [ns]
351	   // Threshold for the q'ing score condition
352	   CRITICALqLPRODUCT = CRITICALqL * CRITICALqLSCORE

354	   ATTEMPTS = 2; // Max attempts to pick a bucket (vendor-specific)
355	   BI_SIZE = 5;  // Bit-width of index number for non-default buckets
356	   NBUCKETS = pow(2, BI_SIZE);  // No. of non-default buckets
357	   MASK = NBUCKETS-1;           // convenient constant, filled with ones

359	                          // Queue Protection exit states
360	   EXIT_SUCCESS  = 0;     // Forward the packet
361	   EXIT_SANCTION = 1;     // Redirect the packet

363	   MAX_PROB      = 1; // For integer arithmetic, would use a large int
364	                      //  e.g., 2^31, to allow space for overflow
365	   MAXTH = MAXTH_us * 1000;   // Max marking threshold [ns]
366	   // Minimum marking threshold of 2 MTU for slow links [ns]
367	   FLOOR =  2 * 8 * MAX FRAME SIZE * 10^9 / MAX RATE;
368	   RANGE = (1 << LG_RANGE);      // Range of ramp [ns]
369	   MINTH = max ( MAXTH - RANGE, FLOOR);
370	   MAXTH = MINTH + RANGE;           // Max marking threshold [ns]

372	   The following definitions explain the purpose of important variables
373	   and functions.

375	   // Public variables:
376	   qdelay         // The current queuing delay of the LL queue [ns]
377	   probNative     // The native probability of the LL queue within [0,1]

379	   // External variables
380	   packet             // The structure holding packet header fields
381	   packet.size        // The size of the current packet [B]
382	   packet.uflow       // The flow identifier of the current packet
383	                      //  (e.g. 5-tuple or 4-tuple if IPSec)

385	   // Irrelevant details of DOCSIS function to return qdelay are removed
386	   qdelayL(...)      // Returns current delay of the low latency Q [ns]

388	   The array of bucket structures defined below is used by all the Queue
389	   Protection functions:

391	   struct bucket { // The leaky bucket structure to hold per-flow state
392	      id;          // identifier (e.g. 5-tuple) of the flow using bucket
393	      t_exp;       // expiry time;
394	                   // (t_exp - now) = flow's normalized q'ing score [ns]
395	   };
396	   struct bucket buckets[NBUCKETS+1];

398	4.2.  Queue Protection Data Path

400	   All the functions of Queue Protection operate on the data path,
401	   driven by packet arrivals.

403	   The following functions that maintain per-flow queuing scores and
404	   manage per-flow state are considered primarily as mechanism:

406	      pick_bucket(uflow_id); // Returns bucket identifier

408	      fill_bucket(bucket_id, pkt_size, probNative); // Returns queuing
409	      score

411	      calcProbNative(qdelay) // Returns probability of ECN-marking

413	   The following function is primarily concerned with policy:

415	      qprotect(packet, ...); // Returns exit status to either forward or
416	      redirect the packet

418	   It is more likely that there might be future modifications to policy
419	   aspects.  Therefore, policy aspects would be less appropriate
420	   candidates for any hardware acceleration.

422	   The entry point to these functions is qprotect(), which would be
423	   called from packet classification as follows:

425	   classifier(packet) {
426	      // ...
427	      // Classify packet
428	         // if packet classified to Low Latency Service Flow
429	         if (QPROTECT_ON) {
430	            if (qprotect(packet, qL.byte_length) == EXIT_SANCTION) {
431	               // redirect packet to    Classic Service Flow
432	            }
433	         }
434	         // Forward packet to Low Latency Service Flow
435	      // Continue...
436	   }

438	   On each packet arrival, qprotect() measures the current queue delay
439	   and derives the native probability from it.  Then it uses pick_bucket
440	   to find the bucket already holding the flow's state, or to allocate a
441	   new bucket if the flow is new or its state has expired (the most
442	   likely case).  Then the queuing score is updated by the fill_bucket()
443	   function.  That completes the mechanism aspects.

445	   The comments against the subsequent policy conditions and actions
446	   should be self-explanatory at a superficial level.  The deeper
447	   rationale for these conditions is given in Section 5.4.

449	   // Per packet queue protection
450	   qprotect(packet, ...) {

452	      bckt_id;   // bucket index
453	      qLscore;   // queuing score of pkt's flow [ns]

455	      qdelay = qL.qdelay(...);
456	      probNative = calcProbNative(qdelay);

458	      bckt_id = pick_bucket(packet.uflow);
459	      // Not shown: if (bckt_id->t_exp risks overflow) EXIT_SANCTION
460	      qLscore = fill_bucket(buckets[bckt_id], packet.size, probNative);

462	      // Determine whether to sanction packet
463	      if (  ( qdelay > CRITICALqL )  // Test if qdelay over threshold...
464	         // ...and if flow's q'ing score scaled by qdelay/CRITICALqL
465	         // ...exceeds CRITICALqLSCORE
466	         && ( qdelay * qLscore > CRITICALqLPRODUCT ) )

468	         return EXIT_SANCTION;

470	      else
471	         return EXIT_SUCCESS;
472	   }

474	   The pick_bucket() function is optimized for flow-state that will
475	   normally have expired from packet to packet of the same flow. it is
476	   just one way of finding the bucket associated with the flow ID of
477	   each packet - it might be possible to develop more efficient
478	   alternatives.

480	   The algorithm is arranged so that the bucket holding any live (non-
481	   expired) flow-state associated with a packet will always be found
482	   before a new bucket is allocated.  The constant ATTEMPTS, defined
483	   earlier, determines how many hashes are used to find a bucket for
484	   each flow (actually, only one hash is generated; then, by default, 5
485	   bits of it at a time are used as the hash value, because by default
486	   there are 2^5 = 32 buckets).

488	   The algorithm stores the flow's own ID in its flow-state.  So, when
489	   the next packet of a flow arrives, if it finds its own ID, but the
490	   flow-state has expired, the algorithm just adds the packet's queuing
491	   score to 'now', as a new flow would, If it does not find the flow's
492	   ID, and the expiry time is still current, the algorithm can tell that
493	   another flow is using that bucket, and it continues to look for a
494	   bucket for the flow.  Even if it finds a bucket where the expiry time
495	   has passed, it doesn't immediately use it.  It merely remembers it as
496	   the potential bucket to use.  But first it runs through all the
497	   ATTEMPTS hashes to check for another bucket assigned to the flow, in
498	   case it is still live.

500	   If a live bucket is not already associated with the packet's flow,
501	   the algorithm should then have already set aside an existing bucket
502	   with a score that has aged out.  Given this bucket is no longer
503	   necessary to hold state for its previous flow, it can be recycled for
504	   use by the present packet's flow.

506	   If all else fails, there is one additional bucket (called the dregs)
507	   that can be used.  If the dregs is still in live use by another flow,
508	   subsequent flows that cannot find a bucket of their own all share it,
509	   adding their score to the one in the dregs.  A flow might get away
510	   with using the dregs on its own, but when there are many mis-marked
511	   flows, multiple flows are more likely to collide in the dregs,
512	   including innocent flows.  The choice of number of buckets and number
513	   of hash attempts determines how likely it will be that this
514	   undesirable scenario will occur.

516	   // Pick the bucket associated with flow uflw
517	   pick_bucket(uflw) {

519	      now;                      // current time [ns]
520	      j;                        // loop counter
521	      h32;                      // holds hash of the packet's flow IDs
522	      h;                        // bucket index being checked
523	      hsav;                     // interim chosen bucket index

525	      h32   = hash32(uflw);     // 32-bit hash of flow ID
526	      hsav  = NBUCKETS;         // Default bucket
527	      now   = get_time_now();

529	      // The for loop checks ATTEMPTS buckets for ownership by flow-ID
530	      // It also records the 1st bucket, if any, that could be recycled
531	      // because it's expired.
532	      // Must not recycle a bucket until all ownership checks completed
533	      for (j=0; j<ATTEMPTS; j++) {
534	         // Use least signif. BI_SIZE bits of hash for each attempt
535	         h = h32 & MASK;
536	         if (buckets[h].id == uflw) {    // Once uflw's bucket found...
537	            if (buckets[h].t_exp <= now) // ...if bucket has expired...
538	               buckets[h].t_exp = now;   // ...reset it
539	            return h;                    // ...use it
540	         }
541	         else if ( (hsav == NBUCKETS)  // If not seen expired bucket yet
542	                                       //  and this bucket has expired
543	              && (buckets[h].t_exp <= now) ) {
544	            hsav = h;                  // set it as the interim bucket
545	         }
546	         h32 >>= BI_SIZE;          // Bit-shift hash for next attempt
547	      }
548	      // If reached here, no tested bucket was owned by the flow-ID
549	      if (hsav != NBUCKETS) {
550	         // If here, found an expired bucket within the above for loop
551	         buckets[hsav].t_exp = now;              // Reset expired bucket
552	      } else {
553	         // If here, we're having to use the default bucket (the dregs)
554	         if (buckets[hsav].t_exp <= now) {   // If dregs has expired...
555	            buckets[hsav].t_exp = now;       // ...reset it
556	         }
557	      }
558	      buckets[hsav].id = uflw; // In either case, claim for recycling
559	      return hsav;
560	   }

562	   The fill_bucket() function both accumulates and ages the queuing
563	   score over time, as outlined in Section 2.1.  To make aging the score
564	   efficient, the increment of the queuing score is normalized to units
565	   of time by dividing by AGING, so that the result represents the new
566	   expiry time of the flow.

568	   It might be thought that, instead of multiplying the packet size
569	   (pkt_sz) by probNative, packets could be selected by the AQM with
570	   probability probNative, as they are for ECN-marking.  Then the full
571	   size of those selected packets would be used to increment the queuing
572	   score.  However, experience with other congestion policers has found
573	   that the score then increments far too jumpily, particularly when
574	   probNative is low.

576	   A deeper explanation of the queuing score is given in Section 5.

578	   fill_bucket(bckt_id, pkt_sz, probNative) {
579	      // Add packet's queuing score
580	      // For integer arithmetic, a bit-shift can replace the division
581	      buckets[bckt_id].t_exp += probNative * pkt_sz / AGING;
582	      return  (buckets[bckt_id].t_exp - now);
583	   }

585	   To derive this queuing score, the QProt algorithm uses the linear
586	   ramp function calcProbNative() to normalize instantaneous queuing
587	   delay into a probability in the range [0,1], which it assigns to
588	   probNative.

590	   calcProbNative(qdelay){
591	         if ( qdelay >= MAXTH ) {
592	            probNative = MAX_PROB;
593	         } else if ( qdelay > MINTH ) {
594	            probNative = MAX_PROB * (qdelay - MINTH)/RANGE;
595	            // In practice, the * and the / would use a bit-shift
596	         } else {
597	            probNative = 0;
598	         }
599	         return probNative;
600	   }

602	5.  Rationale

604	5.1.  Rationale: Blame for Queuing, not for Rate in Itself

606	   Figure 1 poses the question of which flow is more to blame for
607	   queuing delay; the unresponsive constant bit rate flow (c) that is
608	   consuming about 80% of the capacity, or the flow sending regular
609	   short unresponsive bursts (b)?  The smoothness of c seems better for
610	   avoiding queuing, but its high rate does not.  However, if flow c was
611	   not there, or ran slightly more slowly, b would not cause any
612	   queuing.

614	   ^ bit-rate
615	   |  ,-.          ,-.          ,-.          ,-.          ,-.
616	   |--|b|----------|b|----------|b|----------|b|----------|b|---Capacity
617	   |__|_|__________|_|__________|_|__________|_|__________|_|_____
618	   |
619	   |                       c
620	   |
621	   |
622	   |
623	   +---------------------------------------------------------------->
624	                                                                 time

626	            Figure 1: Which is More to Blame for Queuing Delay?

628	   To explain queuing scores, in the following it will initially be
629	   assumed that the QProt algorithm is accumulating queuing scores, but
630	   not taking any action as a result.

632	   To quantify the responsibility that each flow bears for queuing
633	   delay, the QProt algorithm accumulates the product of the level of
634	   congestion and the rate of each flow, both measured at the instant
635	   each packet arrives.  The level of congestion is normalized to a
636	   dimensionless number between 0 and 1 (nativeProb).  The instantaneous
637	   flow rate is represented at each discrete event when a packet arrives
638	   by the packet's size, which accumulates faster the more packets
639	   arrive within each unit of time.  The unit of the resulting queue
640	   score is "congested-bytes" per second, which distinguishes it from
641	   just bytes per second.

643	   Then, during the periods between bursts (b), neither flow accumulates
644	   any queuing score - the high rate of c is benign.  But, during each
645	   burst, if we say the rate of c and b are 80% and 45% of capacity,
646	   thus causing 125% overload, they each bear (80/125)% and (45/125)% of
647	   the responsibility for the queuing delay (64% and 36%).  The
648	   algorithm does not explicit calculate these percentages.  They are
649	   just the outcome of the number of packets arriving from each flow
650	   during the burst.

652	   To summarize, the queuing score never sanctions rate solely on its
653	   own account.  It only sanctions rate inasmuch as it causes queuing.

655	   ^ bit-rate                                ,
656	   |               ,-.                       |\           ,-
657	   |------Capacity-|b|----------,-.----------|b|----------|b\-----
658	   |             __|_|_______   |b|        /``\| _...-._-' \| ,.--
659	   |  ,-.     __/            \__|_|_     _/    |/           |/
660	   |  |b| ___/                      \___/   __       r
661	   |  |_|/                v             \__/  \_______    _/\____/
662	   | _/                                               \__/
663	   |
664	   +---------------------------------------------------------------->
665	                                                                 time

667	        Figure 2: Responsibility for Queuing: More Complex Scenario

669	   Figure 1 gives a more complex illustration of the way the queuing
670	   score assigns responsibility for queuing (limited to the precision
671	   that ASCII art can illustrate).  The unresponsive bursts (b) are the
672	   same as in the previous example, but a variable rate video (v)
673	   replaces flow c.  It's rate varies as the complexity of the video
674	   scene varies.  Also on a slower timescale, in response to the level
675	   of congestion, the video adapts its quality.  However, on a short
676	   time-scale it appears to be unresponsive to small amounts of queuing.
677	   Also, part-way through, a low latency responsive flow (r) joins in,
678	   aiming to fill the balance of capacity left by the other two.

680	   The combination of the first burst and the low application-limited
681	   rate of the video causes neither flow to accumulate queuing score.
682	   In contrast, the second burst causes similar excessive overload
683	   (125%) to the example in Figure 1.  Then, the video happens to reduce
684	   its rate (probably due to a less complex scene) so the third burst
685	   causes only a little congestion.  Let us assume the resulting queue
686	   causes probNative to rise to just 1%, then the queuing score will
687	   only accumulate 1% of the size of each packet of flows v and b during
688	   this burst.

690	   The fourth burst happens to arrive just as the new responsive flow
691	   (r) has filled the available capacity, so it leads to very rapid
692	   growth of the queue.  After a round trip the responsive flow rapidly
693	   backs off, and the adaptive video also backs off more rapidly than it
694	   would normally, because of the very high congestion level.  The rapid
695	   response to congestion of flow r reduces the queuing score that all
696	   three flows accumulate, but they each still bear the cost in
697	   proportion to the product of the rates at which their packets arrive
698	   at the queue and the value of probNative when they do so.  Thus,
699	   during the fifth burst, they all accumulate less score than the
700	   fourth, because the queuing delay is not as excessive.

702	5.2.  Rationale for Aging the Queuing Score

704	   Even well-behaved flows will not always be able to respond fast
705	   enough to dynamic events.  Also well-behaved flow(s), e.g.  DCTCP
706	   [RFC8257], TCP Prague [PragueLinux] or the L4S variant of SCReAM for
707	   real-time media [RFC8298], can maintain a very shallow queue by
708	   continual careful probing for more while also continually subtracting
709	   a little from their rate (or congestion window) in response to low
710	   levels of ECN signalling.  Therefore, the QProt algorithm needs to
711	   continually offer a degree of forgiveness to age out the queuing
712	   score as it accumulates.

714	   Scalable congestion controllers such as those above maintain their
715	   congestion window in inverse proportion to the congestion level,
716	   probNative, That leads to the important property that on average a
717	   scalable flow holds the product of its congestion window and the
718	   congestion level constant, no matter the capacity of the link or how
719	   many other flows it competes with.  For instance, if the link
720	   capacity doubles, a scalable flow induces half the congestion
721	   probability.  Or if three scalable flows compete for the capacity,
722	   each flow will reduce to one third of the capacity and increase the
723	   congestion level by 3x.

725	   This suggests that the QProt algorithm will not sanction a well-
726	   behaved scalable flow if it ages out the queuing score at a
727	   sufficient constant rate.  The constant will need to be somewhat
728	   about the average of a well-behaved scalable flow to allow for normal
729	   dynamics.

731	   Relating QProt's aging constant to a scalable flow does not mean that
732	   a flow has to behave like a scalable flow.  It can be less
733	   aggressive, but not more.  For instance, a longer RTT flow can run at
734	   a lower congestion-rate than the aging rate, but it can also increase
735	   its aggressiveness to equal the rate of short RTT scalable flows
736	   [ScalingCC].  The constant aging of QProt also means that a long-
737	   running unresponsive flow will be prone to trigger QProt if it runs
738	   faster than a competing responsive scalable flow would.  And, of
739	   course, if a flow causes excessive queuing in the short-term, its
740	   queuing score will still rise faster than the constant aging process
741	   will decrease it.  Then QProt will still eject the flow's packets
742	   before they harm the low latency of the shared queue.

744	5.3.  Rationale for Normalized Queuing Score

746	   The QProt algorithm holds a flow's queuing score state in a structure
747	   called a bucket, because of its similarity to a classic leaky bucket
748	   (except the contents of the bucket does not represent bytes).

750	   probNative * pkt_sz   probNative * pkt_sz / AGING
751	             |                        |
752	          |  V  |                  |  V  |
753	          |  :  |        ___       |  :  |
754	          |_____|        ___       |_____|
755	          |     |        ___       |     |
756	          |__ __|                  |__ __|
757	             |                        |
758	             V                        V
759	        AGING * Dt                    Dt

761	                 Figure 3: Normalization of Queuing Score

763	   The accumulation and aging of the queuing score is shown on the left
764	   of Figure 3 in token bucket form.  Dt is the difference between the
765	   times when the scores of the current and previous packets were
766	   processed.

768	   A normalized equivalent of this token bucket is shown on the right of
769	   Figure 3, dividing both the input and output by the constant AGING
770	   rate.  The result is a bucket-depth that represents time and it
771	   drains at the rate that time passes.

773	   As a further optimization, the time the bucket was last updated is
774	   not stored with the flow-state.  Instead, when the bucket is
775	   initialized the queuing score is added to the system time 'now' and
776	   the resulting expiry time is written into the bucket.  Subsequently,
777	   if the bucket has not expired, the incremental queuing score is added
778	   to the time already held in the bucket.  Then the queuing score
779	   always represents the expiry time of the flow-state itself.  This
780	   means that the queuing score does not need to be aged explicitly
781	   because it ages itself implicitly.

783	5.4.  Rationale for Policy Conditions

785	   Pseudocode for the QProt policy conditions is given in Section 4.1
786	   within the second half of the qprotect() function.  When each packet
787	   arrives, after finding its flow state and updating the queuing score
788	   of the packet's flow, the algorithm checks whether the shared queue
789	   delay exceeds a constant threshold CRITICALqL (e.g. 2 ms), as
790	   repeated below for convenience:

792	      if (  ( qdelay > CRITICALqL )  // Test if qdelay over threshold...
793	         // ...and if flow's q'ing score scaled by qdelay/CRITICALqL
794	         // ...exceeds CRITICALqLSCORE
795	         && ( qdelay * qLscore > CRITICALqLPRODUCT ) )
796	         // Recall that CRITICALqLPRODUCT = CRITICALqL * CRITICALqLSCORE

798	   If the queue delay threshold is exceeded, the flow's queuing score is
799	   temporarily scaled up by the current queue delay normalized as a
800	   ratio of the threshold queuing delay CRITICALqL.  If this scaled up
801	   score exceeds another constant threshold CRITICALqLSCORE, the packet
802	   is ejected.  The actual last line of code above multiplies both sides
803	   of the second condition by CRITICALqLSCORE to avoid a costly
804	   division.

806	   This approach allows each packet to be assessed once, as it arrives.
807	   Once queue delay exceeds the threshold, it has two implications:

809	   o  The current packet might be ejected even though there are packets
810	      already in the queue from flows with higher queuing scores.
811	      However, any flow that continues to contribute to the queue will
812	      have to send further packets, giving an opportunity to eject them
813	      as well, as they subsequently arrive.

815	   o  The next packets to arrive might not be ejected, because they
816	      might belong to flows with low queuing scores.  In this case,
817	      queue delay could continue to rise with no opportunity to eject a
818	      packet.  This is why the queuing score is scaled up by the current
819	      queue delay.  Then, the more the queue has grown without ejecting
820	      a packet, the more the algorithm 'raises the bar' to further
821	      packets.

823	   The above approach is preferred over searching for the flow with the
824	   highest queuing score and searching for one of its packets to eject
825	   from the queue (if one is still there).

827	   Figure 4 explains this approach graphically.  On the horizontal axis
828	   it shows actual harm, meaning the queuing delay in the shared queue.
829	   On the vertical axis it shows the behaviour record of the flow
830	   associated with the currently arriving packet, represented in the
831	   algorithm by the flow's queuing score.  The shaded region represents
832	   the combination of actual harm and behaviour record that will lead to
833	   the packet being ejected.

835	   Behaviour Record:
836	   Queueing Score of
837	   Arriving Packet's Flow
838	   ^
839	   |   +          |/ / / / / / / / / / / / / / / / / / /
840	   |    +   N     | / / / / / / / / / / / / / / / / / / /
841	   |     +        |/ / / / /                   / / / / /
842	   |      +       | / / / /  E (Eject packet)   / / / / /
843	   |       +      |/ / / / /                   / / / / /
844	   |         +    | / / / / / / / / / / / / / / / / / / /
845	   |           +  |/ / / / / / / / / / / / / / / / / / /
846	   |             +| / / / / / / / / / / / / / / / / / / /
847	   |              |+ / / / / / / / / / / / / / / / / / /
848	   |    N         |   + / / / / / / / / / / / / / / / / /
849	   | (No actual   |       +/ / / / / / / / / / / / / / /
850	   |   harm)      |            +  / / / / / / / / / / / /
851	   |              | P (Pass over)   +   ,/ / / / / / / /
852	   |              |                           ^ + /./ /_/
853	   +--------------+------------------------------------------>
854	             CRITICALqL        Actual Harm: Shared Queue Delay

856	         Figure 4: Graphical Explanation of the Policy Conditions

858	   The region labelled 'N' represents cases where the first condition is
859	   not met - No actual harm - queue delay is below the critical
860	   threshold, CRITICALqL.

862	   The region labelled 'E' represents cases where there is actual harm
863	   (queue delay exceeds CRITICALqL) and the queuing score associated
864	   with the arriving packet is high enough to be able to eject it with
865	   certainty.

867	   The region labelled 'P' represents cases where there is actual harm,
868	   but the queuing score of the arriving packet is insufficient to eject
869	   it, so it has to be Passed over.  This adds to queuing delay, but the
870	   alternative would be to sanction an innocent flow.  It can be seen
871	   that, as actual harm increases, the judgement of innocence becomes
872	   increasingly stringent; the behaviour record of the next packet's
873	   flow does not have to be as bad to eject it.

875	6.  Limitations

877	   The QProt algorithm groups packets with common layer-4 flow
878	   identifiers.  It then uses this grouping to accumulate queuing scores
879	   and to sanction packets.

881	   Some applications might initiate multiple flows between the same end-
882	   points, e.g. for media, control, data, etc.  Others might use common
883	   flow identifiers for all these streams.  Also, a user might group
884	   multiple application flows within the same encrypted VPN between the
885	   same layer-4 tunnel end-points.

887	   The use of a queuing score that excludes those aspects of flow rate
888	   that do not contribute to queuing Section 5.1 goes some way to
889	   mitigating this limitation.  However, ultimately this choice of flow
890	   identifiers is pragmatic and not particularly principled.

892	7.  IANA Considerations

894	   This specification contains no IANA considerations.

896	8.  Security Considerations

898	   The whole of this document considers the security concern of how to
899	   identify traffic that does not comply with the non-queue-building
900	   behaviour required to use a shared low latency queue, whether
901	   accidentally or maliciously.

903	   The algorithm has been designed to be fail gracefully in the face of
904	   traffic crafted to overrun the resources of the algorithm.  This
905	   means that non-queue-building flows will always be less likely to be
906	   sanctioned than queue-building flows.  But an attack could be
907	   contrived to deplete resources in such as way that the proportion of
908	   innocent (non-queue-building) flows that are incorrectly sanctioned
909	   could increase.

911	9.  Comments Solicited

913	   Comments and questions are encouraged and very welcome.  They can be
914	   addressed to the IETF Transport Area mailing list <tsv-
915	   area@ietf.org>, and/or to the authors.

917	10.  Acknowledgements

919	   Thanks to Tom Henderson for his review of this document.  The design
920	   of the QProt algorithm and the settings of the parameters benefited
921	   from discussion and critique from the participants of the cable
922	   industry working group on Low Latency DOCSIS.

924	11.  References

926	11.1.  Normative References

928	   [DOCSIS3.1]
929	              CableLabs, "MAC and Upper Layer Protocols Interface
930	              (MULPI) Specification, CM-SP-MULPIv3.1", Data-Over-Cable
931	              Service Interface Specifications DOCSIS(R) 3.1 Version i17
932	              or later, January 2019, <https://specification-
933	              search.cablelabs.com/CM-SP-MULPIv3.1>.

935	   [DOCSIS3.1-CCAP-OSS]
936	              CableLabs, "CCAP Operations Support System Interface
937	              Spec", Data-Over-Cable Service Interface Specifications
938	              DOCSIS(R) 3.1 Version i14 or later, January 2019,
939	              <https://specification-search.cablelabs.com/
940	              CM-SP-CM-OSSIv3.1>.

942	   [DOCSIS3.1-CM-OSS]
943	              CableLabs, "Cable Modem Operations Support System
944	              Interface Spec", Data-Over-Cable Service Interface
945	              Specifications DOCSIS(R) 3.1 Version i14 or later, January
946	              2019, <https://specification-search.cablelabs.com/
947	              CM-SP-CM-OSSIv3.1>.

949	   [I-D.ietf-tsvwg-ecn-l4s-id]
950	              Schepper, K. and B. Briscoe, "Identifying Modified
951	              Explicit Congestion Notification (ECN) Semantics for
952	              Ultra-Low Queuing Delay (L4S)", draft-ietf-tsvwg-ecn-l4s-
953	              id-06 (work in progress), March 2019.

955	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
956	              Requirement Levels", BCP 14, RFC 2119,
957	              DOI 10.17487/RFC2119, March 1997,
958	              <https://www.rfc-editor.org/info/rfc2119>.

960	   [RFC8311]  Black, D., "Relaxing Restrictions on Explicit Congestion
961	              Notification (ECN) Experimentation", RFC 8311,
962	              DOI 10.17487/RFC8311, January 2018,
963	              <https://www.rfc-editor.org/info/rfc8311>.

965	11.2.  Informative References

967	   [I-D.ietf-tsvwg-aqm-dualq-coupled]
968	              Schepper, K., Briscoe, B., and G. White, "DualQ Coupled
969	              AQMs for Low Latency, Low Loss and Scalable Throughput
970	              (L4S)", draft-ietf-tsvwg-aqm-dualq-coupled-09 (work in
971	              progress), July 2019.

973	   [I-D.white-tsvwg-lld]
974	              White, G., Sundaresan, K., and B. Briscoe, "Low Latency
975	              DOCSIS - Technology Overview", draft-white-tsvwg-lld-00
976	              (work in progress), March 2019.

978	   [I-D.white-tsvwg-nqb]
979	              White, G. and T. Fossati, "Identifying and Handling Non
980	              Queue Building Flows in a Bottleneck Link", draft-white-
981	              tsvwg-nqb-02 (work in progress), June 2019.

983	   [PragueLinux]
984	              Briscoe, B., De Schepper, K., Albisser, O., Misund, J.,
985	              Tilmans, O., Kuehlewind, M., and A. Ahmed, "Implementing
986	              the `TCP Prague' Requirements for Low Latency Low Loss
987	              Scalable Throughput (L4S)", Proc. Linux Netdev 0x13 ,
988	              March 2019, <https://www.netdevconf.org/0x13/
989	              session.html?talk-tcp-prague-l4s>.

991	   [RFC4303]  Kent, S., "IP Encapsulating Security Payload (ESP)",
992	              RFC 4303, DOI 10.17487/RFC4303, December 2005,
993	              <https://www.rfc-editor.org/info/rfc4303>.

995	   [RFC6789]  Briscoe, B., Ed., Woundy, R., Ed., and A. Cooper, Ed.,
996	              "Congestion Exposure (ConEx) Concepts and Use Cases",
997	              RFC 6789, DOI 10.17487/RFC6789, December 2012,
998	              <https://www.rfc-editor.org/info/rfc6789>.

1000	   [RFC7713]  Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
1001	              Concepts, Abstract Mechanism, and Requirements", RFC 7713,
1002	              DOI 10.17487/RFC7713, December 2015,
1003	              <https://www.rfc-editor.org/info/rfc7713>.

1005	   [RFC8257]  Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L.,
1006	              and G. Judd, "Data Center TCP (DCTCP): TCP Congestion
1007	              Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257,
1008	              October 2017, <https://www.rfc-editor.org/info/rfc8257>.

1010	   [RFC8298]  Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation
1011	              for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December
1012	              2017, <https://www.rfc-editor.org/info/rfc8298>.

1014	   [ScalingCC]
1015	              Briscoe, B. and K. De Schepper, "Resolving Tensions
1016	              between Congestion Control Scaling Requirements", Simula
1017	              Technical Report TR-CS-2016-001 arXiv:1904.07605, July
1018	              2017, <https://arxiv.org/abs/1904.07605>.

1020	Authors' Addresses

1022	   Bob Briscoe (editor)
1023	   CableLabs
1024	   UK

1026	   Email: ietf@bobbriscoe.net
1027	   URI:   http://bobbriscoe.net/

1029	   Greg White
1030	   CableLabs
1031	   US

1033	   Email: G.White@CableLabs.com