Network Working Group                                    B. Briscoe, Ed.
Internet-Draft                                               Independent
Intended status: Informational                                  G. White
Expires: 14 November 2022                                      CableLabs
                                                             13 May 2022

    The DOCSIS® Queue Protection Algorithm to Preserve Low Latency
                 draft-briscoe-docsis-q-protection-06

Abstract

   This informational document explains the specification of the queue
   protection algorithm used in DOCSIS technology since version 3.1. A
   shared low latency queue relies on the non-queue-building behaviour
   of every traffic flow using it. However, some flows might not take
   such care, either accidentally or maliciously. If a queue is about
   to exceed a threshold level of delay, the queue protection algorithm
   can rapidly detect the flows most likely to be responsible. It can
   then prevent harm to other traffic in the low latency queue by
   ejecting selected packets (or all packets) of these flows. The
   document is designed for four types of audience: a) congestion
   control designers who need to understand how to keep on the 'good'
   side of the algorithm; b) implementers of the algorithm who want to
   understand it in more depth; c) designers of algorithms with similar
   goals, perhaps for non-DOCSIS scenarios; and d) researchers
   interested in evaluating the algorithm.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts.
The list of current Internet- 37 Drafts is at https://datatracker.ietf.org/drafts/current/. 39 Internet-Drafts are draft documents valid for a maximum of six months 40 and may be updated, replaced, or obsoleted by other documents at any 41 time. It is inappropriate to use Internet-Drafts as reference 42 material or to cite them other than as "work in progress." 44 This Internet-Draft will expire on 14 November 2022. 46 Copyright Notice 48 Copyright (c) 2022 IETF Trust and the persons identified as the 49 document authors. All rights reserved. 51 This document is subject to BCP 78 and the IETF Trust's Legal 52 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 53 license-info) in effect on the date of publication of this document. 54 Please review these documents carefully, as they describe your rights 55 and restrictions with respect to this document. Code Components 56 extracted from this document must include Revised BSD License text as 57 described in Section 4.e of the Trust Legal Provisions and are 58 provided without warranty as described in the Revised BSD License. 60 Table of Contents 62 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 63 1.1. Document Roadmap . . . . . . . . . . . . . . . . . . . . 4 64 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5 65 1.3. Copyright Material . . . . . . . . . . . . . . . . . . . 5 66 2. Approach - In Brief . . . . . . . . . . . . . . . . . . . . . 5 67 2.1. Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 6 68 2.2. Policy . . . . . . . . . . . . . . . . . . . . . . . . . 7 69 2.2.1. Policy Conditions . . . . . . . . . . . . . . . . . . 7 70 2.2.2. Policy Action . . . . . . . . . . . . . . . . . . . . 7 71 3. Necessary Flow Behaviour . . . . . . . . . . . . . . . . . . 7 72 4. Pseudocode Walk-Through . . . . . . . . . . . . . . . . . . . 8 73 4.1. Input Parameters, Constants and Variables . . . . . . . . 9 74 4.2. Queue Protection Data Path . . . . . . . . . . . . . . . 12 75 4.2.1. The qprotect() function . . . . . . . . . . . . . . . 13 76 4.2.2. The pick_bucket() function . . . . . . . . . . . . . 14 77 4.2.3. The fill_bucket() function . . . . . . . . . . . . . 17 78 4.2.4. The calcProbNative() function . . . . . . . . . . . . 17 79 5. Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . 18 80 5.1. Rationale: Blame for Queuing, not for Rate in Itself . . 18 81 5.2. Rationale for Aging the Queuing Score . . . . . . . . . . 20 82 5.3. Rationale for Transformed Queuing Score . . . . . . . . . 21 83 5.4. Rationale for Policy Conditions . . . . . . . . . . . . . 22 84 5.5. Rationale for Reclassification as the Policy Action . . . 25 85 6. Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 26 86 7. IANA Considerations (to be removed by RFC Editor) . . . . . . 26 87 8. Implementation Status . . . . . . . . . . . . . . . . . . . . 26 88 9. Security Considerations . . . . . . . . . . . . . . . . . . . 27 89 9.1. Resource Exhaustion Attacks . . . . . . . . . . . . . . . 28 90 9.1.1. Exhausting Flow-State Storage . . . . . . . . . . . . 28 91 9.1.2. Exhausting Processing Resources . . . . . . . . . . . 29 92 10. Comments Solicited . . . . . . . . . . . . . . . . . . . . . 29 93 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 29 94 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 30 95 12.1. Normative References . . . . . . . . . . . . . . . . . . 30 96 12.2. Informative References . . . . . . . . . . . . . . . . . 31 97 Authors' Addresses . . . 
. . . . . . . . . . . . . . . . . . . . 32 99 1. Introduction 101 This informational document explains the specification of the queue 102 protection (QProt) algorithm used in DOCSIS technology since version 103 3.1 [DOCSIS]. 105 Although the algorithm is defined in annex P of [DOCSIS], it relies 106 on cross-references to other parts of the set of specs. This 107 document pulls all the strands together into one self-contained 108 document. The core of the document is a similar pseudocode walk- 109 through to that in the DOCSIS spec, but it also includes additional 110 material: i) a brief overview; ii) a definition of how a data sender 111 needs to behave to avoid triggering queue protection; and iii) a 112 section giving the rationale for the design choices. 114 Low queuing delay depends on hosts sending their data smoothly, 115 either at low rate or responding to explicit congestion notifications 116 (ECN [RFC8311], [I-D.ietf-tsvwg-ecn-l4s-id]). So low queuing latency 117 is something hosts create themselves, not something the network gives 118 them. This tends to ensure that self-interest alone does not drive 119 flows to mis-mark their packets for the low latency queue. However, 120 traffic from an application that does not behave in a non-queue- 121 building way might erroneously be classified into a low latency 122 queue, whether accidentally or maliciously. QProt protects other 123 traffic in the low latency queue from the harm due to excess queuing 124 that would otherwise be caused by such anomalous behaviour. 126 In normal scenarios without misclassified traffic, QProt is not 127 expected to intervene at all in the classification or forwarding of 128 packets. 130 An overview of how low latency support has been added to DOCSIS 131 technology is given in [LLD]. In each direction of a DOCSIS link 132 (upstream and downstream), there are two queues: one for Low Latency 133 (LL) and one for Classic traffic, in an arrangement similar to the 134 IETF's Coupled DualQ AQM [I-D.ietf-tsvwg-aqm-dualq-coupled]. The two 135 queues enable a transition from 'Classic' to 'Scalable' congestion 136 control so that low latency can become the norm for any application, 137 including ones seeking both full throughput and low latency, not just 138 low-rate applications that have been more traditionally associated 139 with a low latency service. The Classic queue is only necessary for 140 traffic such as traditional (Reno/Cubic) TCP that needs about a round 141 trip of buffering to fully utilize the link, and therefore has no 142 incentive to mismark itself as low latency. The QProt function is 143 located at the ingress to the Low Latency queue. Therefore, in the 144 upstream QProt is located on the cable modem (CM), and in the 145 downstream it is located on the cable CMTS (CM Termination System). 146 If an arriving packet triggers queue protection, the QProt algorithm 147 ejects the packet from the Low Latency queue and reclassifies it into 148 the Classic queue. 150 If QProt is used in settings other than DOCSIS links, it would be a 151 simple matter to detect queue-building flows by using slightly 152 different conditions, and/or to trigger a different action as a 153 consequence, as appropriate for the scenario, e.g., dropping instead 154 of reclassifying packets or perhaps accumulating a second per-flow 155 score to decide whether to redirect a whole flow rather than just 156 certain packets. Such work is for future study and out of scope of 157 the present document. 
159 The algorithm is based on a rigorous approach to quantifying how much 160 each flow contributes to congestion, which is used in economics to 161 allocate responsibility for the cost of one party's behaviour on 162 others (the economic externality). Another important feature of the 163 approach is that the metric used for the queuing score is based on 164 the same variable that determines the level of ECN signalling seen by 165 the sender [RFC8311], [I-D.ietf-tsvwg-ecn-l4s-id]. This makes the 166 internal queuing score visible externally as ECN markings. This 167 transparency is necessary to be able to objectively state (in 168 Section 3) how a flow can keep on the 'good' side of the algorithm. 170 1.1. Document Roadmap 172 The core of the document is the walk-through of the DOCSIS QProt 173 algorithm's pseudocode in Section 4. 175 Prior to that, Section 2 summarizes the approach used in the 176 algorithm, then Section 3 considers QProt from the perspective of the 177 end-system, by defining the behaviour that a flow needs to comply 178 with to avoid the QProt algorithm ejecting its packets from the low 179 latency queue. 181 Section 5 gives deeper insight into the principles and rationale 182 behind the algorithm. Then Section 6 explains the limitations of the 183 approach, followed by the usual closing sections. 185 1.2. Terminology 187 The normative language for the DOCSIS QProt algorithm is in the 188 DOCSIS specs [DOCSIS], [DOCSIS-CM-OSS], [DOCSIS-CCAP-OSS] not in this 189 informational guide. If there is any inconsistency, the DOCSIS specs 190 take precedence. 192 The following terms and abbreviations are used: 194 CM: Cable Modem 196 CMTS: CM Termination System 198 Congestion-rate: The transmission rate of bits or bytes contained 199 within packets of a flow that have the CE codepoint set in the IP- 200 ECN field [RFC3168] (including IP headers unless specified 201 otherwise). Congestion-bit-rate and congestion-volume were 202 introduced in [RFC7713] and [RFC6789]. 204 DOCSIS: Data Over Cable System Interface Specification. "DOCSIS" is 205 a registered trademark of Cable Television Laboratories, Inc. 206 ("CableLabs"). 208 Non-queue-building: A flow that tends not to build a queue 210 Queue-building: A flow that builds a queue. If it is classified 211 into the Low Latency queue, it is therefore a candidate for the 212 queue protection algorithm to detect and sanction. 214 ECN: Explicit Congestion Notification 216 QProt: The Queue Protection function 218 1.3. Copyright Material 220 Parts of this document are reproduced from [DOCSIS] with kind 221 permission of the copyright holder, Cable Television Laboratories, 222 Inc. ("CableLabs"). 224 2. Approach - In Brief 226 The algorithm is divided into mechanism and policy. There is only a 227 tiny amount of policy code, but policy might need to be changed in 228 the future. So, where hardware implementation is being considered, 229 it would be advisable to implement the policy aspects in firmware or 230 software: 232 * The mechanism aspects identify flows, maintain flow-state and 233 accumulate per-flow queuing scores; 235 * The policy aspects can be divided into conditions and actions: 237 - The conditions are the logic that determines when action should 238 be taken to avert the risk of queuing delay becoming excessive; 240 - The actions determine how this risk is averted, e.g., by 241 redirecting packets from a flow into another queue, or to 242 reclassify a whole flow that seems to be misclassified. 244 2.1. 
Mechanism

   The algorithm maintains per-flow-state, where 'flow' usually means an
   end-to-end (layer-4) 5-tuple. The flow-state consists of a queuing
   score that decays over time. Indeed, it is transformed into time
   units so that it represents the flow-state's own expiry time
   (explained in Section 5.3). A higher queuing score pushes out the
   expiry time further.

   Non-queue-building flows tend to release their flow-state rapidly ---
   it usually expires reasonably early in the gap between the packets of
   a normal flow. Then the memory can be recycled for packets from
   other flows that arrive in between. So only queue-building flows
   hold flow state persistently.

   The simplicity and effectiveness of the algorithm are due to the
   definition of the queuing score. The queuing score represents the
   share of blame for queuing that each flow bears. The scoring
   algorithm uses the same internal variable, probNative, that the AQM
   for the low latency queue uses to ECN-mark packets (the other two
   forms of marking, Classic and coupled, are driven by Classic traffic
   and therefore not relevant to protection of the LL queue). In this
   way, the queuing score accumulates the size of each arriving packet
   of a flow, scaled by the value of probNative (in the range 0 to 1) at
   the instant the packet arrives. So a flow's score accumulates
   faster, the higher the degree of queuing and the faster the flow's
   packets arrive while there is queuing. Section 5.1 explains further
   why this score represents blame for queuing.

   The algorithm as described so far would accumulate a number that
   would rise at the so-called congestion-rate of the flow (see
   Terminology in Section 1.2), i.e., the rate at which the flow is
   contributing to congestion, or the rate at which the AQM is
   forwarding bytes of the flow that are ECN marked. However, rather
   than growing continually, the queuing score is also reduced (or
   'aged') at a constant rate. This is because it is unavoidable for
   capacity-seeking flows to induce a continuous low level of congestion
   as they track available capacity. Section 5.2 explains why this
   allowance can be set to the same constant for any scalable flow,
   whatever its bit rate.

   For implementation efficiency, the queuing score is transformed into
   time units so that it represents the expiry time of the flow state
   (as already discussed above). Then it does not need to be explicitly
   aged, because the natural passage of time implicitly 'ages' an expiry
   time. The transformation into time units simply involves dividing
   the queuing score of each packet by the constant aging rate
   (explained further in Section 5.3).

2.2. Policy

2.2.1. Policy Conditions

   The algorithm determines whether to eject a packet using its flow's
   queuing score, and it makes that determination only once, at the time
   the packet first arrives. This limits the policies available. For
   instance, when queuing delay exceeds a threshold, it is not possible
   to eject a packet from the flow with the highest queuing score,
   because that would involve searching the queue for such a packet (if
   indeed one was still in the queue). Nonetheless, it is still possible
   to develop a policy that protects the low latency of the queue by
   making the queuing score threshold stricter, the greater the excess
   of queuing delay relative to the threshold (explained in
   Section 5.4).

2.2.2.
Policy Action 310 In the DOCSIS QProt spec at the time of writing, when the policy 311 conditions are met the action taken to protect the low latency queue 312 is to reclassify a packet into the Classic queue (justified in 313 Section 5.5). 315 3. Necessary Flow Behaviour 317 The QProt algorithm described here can be used for responsive and/or 318 unresponsive flows. 320 * It is possible to objectively describe the least responsive way 321 that a flow will need to respond to congestion signals in order to 322 avoid triggering queue protection, no matter the link capacity and 323 no matter how much other traffic there is. 325 * It is not possible to describe how fast or smooth an unresponsive 326 flow should be to avoid queue protection, because this depends on 327 how much other traffic there is and the capacity of the link, 328 which an application is unable to know. However, the more 329 smoothly an unresponsive flow paces its packets and the lower its 330 rate relative to typical broadband link capacities, the less 331 likelihood that it will risk causing enough queueing to trigger 332 queue protection. 334 Responsive low latency flows can use an L4S ECN codepoint 335 [I-D.ietf-tsvwg-ecn-l4s-id] to get classified into the low latency 336 queue. 338 A sender can arrange for flows that are smooth but do not respond to 339 ECN marking to be classified into the low latency queue by using the 340 Non-Queue-Building (NQB) Diffserv codepoint [I-D.ietf-tsvwg-nqb], 341 which the DOCSIS specs support, or an operator can use various other 342 local classifiers. 344 As already explained in Section 2.1, the QProt algorithm is driven 345 from the same variable that drives the ECN marking probability in the 346 low latency queue (see the Immediate Active Queue Management Annex in 347 [DOCSIS]). The algorithm that calculates this internal variable is 348 run on the arrival of every packet, whether it is ECN-capable or not, 349 so that it can be used by the QProt algorithm. But the variable is 350 only used to ECN-mark packets that are ECN-capable. 352 Not only does this dual use of the variable improve processing 353 efficiency, but it also makes the basis of the QProt algorithm 354 visible and transparent, at least for responsive ECN-capable flows. 355 Then it is possible to state objectively that a flow can avoid 356 triggering queue protection by keeping the bit rate of ECN marked 357 packets (the congestion-rate) below AGING, which is a configured 358 constant of the algorithm (default 2^19 B/s ~= 4 Mb/s). Note that it 359 is in a congestion controller's own interest to keep its average 360 congestion-rate well below this level (e.g., ~1 Mb/s), to ensure that 361 it does not trigger queue protection during transient dynamics. 363 If the QProt algorithm is used in other settings, it would still need 364 to be based on the visible level of congestion signalling, in a 365 similar way to the DOCSIS approach. Without transparency of the 366 basis of the algorithm's decisions, end-systems would not be able to 367 avoid triggering queue protection on an objective basis. 369 4. Pseudocode Walk-Through 370 4.1. Input Parameters, Constants and Variables 372 The operator input parameters that set the parameters in the first 373 two blocks of pseudocode below are defined for cable modems (CMs) in 374 [DOCSIS-CM-OSS] and for CMTSs in [DOCSIS-CCAP-OSS]. Then, further 375 constants are either derived from the input parameters or hard-coded. 377 Defaults and units are shown in square brackets. 
Defaults (or indeed 378 any aspect of the algorithm) are subject to change, so the latest 379 DOCSIS specs are the definitive references. Also any operator might 380 set certain parameters to non-default values. 382 383 // Input Parameters 384 MAX_RATE; // Configured maximum sustained rate [b/s] 385 QPROTECT_ON; // Queue Protection is enabled [Default: TRUE] 386 CRITICALqL_us; // L queue threshold delay [us] Default: MAXTH_us 387 CRITICALqLSCORE_us;// The threshold queuing score [Default: 4000us] 388 LG_AGING; // The aging rate of the q'ing score [Default: 19] 389 // as log base 2 of the congestion-rate [lg(B/s)] 391 // Input Parameters for the calcProbNative() algorithm: 392 MAXTH_us; // Max IAQM marking threshold [Default: 1000us] 393 LG_RANGE; // Log base 2 of the range of ramp [lg(ns)] 394 // Default: 2^19 = 524288 ns (roughly 525 us) 395 396 397 // Constants, either derived from input parameters or hard-coded 398 T_RES; // Resolution of t_exp [ns] 399 // Convert units (approx) 400 AGING = pow(2, (LG_AGING-30) ) * T_RES; // lg([B/s]) to [B/T_RES] 401 CRITICALqL = CRITICALqL_us * 1000; // [us] to [ns] 402 CRITICALqLSCORE = CRITICALqLSCORE_us * 1000/T_RES; // [us] to [T_RES] 403 // Threshold for the q'ing score condition 404 CRITICALqLPRODUCT = CRITICALqL * CRITICALqLSCORE; 405 qLSCORE_MAX = 5E9 / T_RES; // Max queuing score = 5 s 407 ATTEMPTS = 2; // Max attempts to pick a bucket (vendor-specific) 408 BI_SIZE = 5; // Bit-width of index number for non-default buckets 409 NBUCKETS = pow(2, BI_SIZE); // No. of non-default buckets 410 MASK = NBUCKETS-1; // convenient constant, with BI_SIZE LSBs set 412 // Queue Protection exit states 413 EXIT_SUCCESS = 0; // Forward the packet 414 EXIT_SANCTION = 1; // Redirect the packet 416 MAX_PROB = 1; // For integer arithmetic, would use a large int 417 // e.g., 2^31, to allow space for overflow 418 MAXTH = MAXTH_us * 1000; // Max marking threshold [ns] 419 MAX_FRAME_SIZE = 2000; // DOCSIS-wide constant [B] 420 // Minimum marking threshold of 2 MTU for slow links [ns] 421 FLOOR = 2 * 8 * MAX_FRAME_SIZE * 10^9 / MAX_RATE; 422 RANGE = (1 << LG_RANGE); // Range of ramp [ns] 423 MINTH = max ( MAXTH - RANGE, FLOOR); 424 MAXTH = MINTH + RANGE; // Max marking threshold [ns] 425 427 Throughout the pseudocode, most variables are integers. The only 428 exceptions are floating point variables representing probabilities 429 (MAX_PROB and probNative) and the AGING parameter. The actual DOCSIS 430 QProt algorithm is defined using integer arithmetic, but in the 431 floating point arithmetic used in this document, (0 <= probNative <= 432 1). Also, the pseudocode omits overflow checking and it would need 433 to be made robust to non-default input parameters. 435 The resolution for expressing time, T_RES, needs to be chosen to 436 ensure that expiry times for buckets can represent times that are a 437 fraction (e.g., 1/10) of the expected packet interarrival time for 438 the system. 440 The following definitions explain the purpose of important variables 441 and functions. 443 444 // Public variables: 445 qdelay; // The current queuing delay of the LL queue [ns] 446 probNative; // Native marking probability of LL queue within [0,1] 448 // External variables 449 packet; // The structure holding packet header fields 450 packet.size; // The size of the current packet [B] 451 packet.uflow; // The flow identifier of the current packet 452 // (e.g., 5-tuple or 4-tuple if IPSec) 454 // Irrelevant details of DOCSIS function to return qdelay are removed 455 qdelayL(...) 
// Returns current delay of the low latency Q [ns] 456 458 Pseudocode for how the algorithm categorizes packets by flow ID to 459 populate the variable packet.uflow is not given in detail here. The 460 application's flow ID is usually defined by a common 5-tuple (or 461 4-tuple) of: 463 * source and destination IP addresses of the innermost IP header 464 found; 466 * the protocol (IPv4) or next header (IPv6) field in this IP header 468 * either of: 470 - source and destination port numbers, for TCP, UDP, UDP-Lite, 471 SCTP, DCCP, etc. 473 - Security Parameters Index (SPI) for IPSec Encapsulating 474 Security Payload (ESP) [RFC4303]. 476 The Microflow Classification section of the Queue Protection Annex of 477 the DOCSIS spec. [DOCSIS] defines various strategies to find these 478 headers by skipping extension headers or encapsulations. If they 479 cannot be found, the spec. defines various less-specific 3-tuples 480 that would be used. The DOCSIS spec. should be referred to for all 481 these strategies, which will not be repeated here. 483 The array of bucket structures defined below is used by all the Queue 484 Protection functions: 486 487 struct bucket { // The leaky bucket structure to hold per-flow state 488 id; // identifier (e.g., 5-tuple) of flow using bucket 489 t_exp; // expiry time in units of T_RES 490 // (t_exp - now) = flow's transformed q'ing score 491 }; 492 struct bucket buckets[NBUCKETS+1]; 493 495 4.2. Queue Protection Data Path 497 All the functions of Queue Protection operate on the data path, 498 driven by packet arrivals. 500 The following functions that maintain per-flow queuing scores and 501 manage per-flow state are considered primarily as mechanism: 503 pick_bucket(uflow_id); // Returns bucket identifier 505 fill_bucket(bucket_id, pkt_size, probNative); // Returns queuing 506 score 508 calcProbNative(qdelay) // Returns probability of ECN-marking 510 The following function is primarily concerned with policy: 512 qprotect(packet, ...); // Returns exit status to either forward or 513 redirect the packet 515 ('...' suppresses distracting detail.) 517 Future modifications to policy aspects are more likely than to 518 mechanisms. Therefore, policy aspects would be less appropriate 519 candidates for any hardware acceleration. 521 The entry point to these functions is qprotect(), which is called 522 from packet classification before each packet is enqueued into the 523 appropriate queue, queue_id, as follows: 525 526 classifier(packet) { 527 // Determine which queue using ECN, DSCP and any local-use fields 528 queue_id = classify(packet); 529 // LQ & CQ are macros for valid queue IDs returned by classify() 530 if (queue_id == LQ) { 531 // if packet classified to Low Latency Service Flow 532 if (QPROTECT_ON) { 533 if (qprotect(packet, ...) == EXIT_SANCTION) { 534 // redirect packet to Classic Service Flow 535 queue_id = CQ; 536 } 537 } 538 return queue_id; 539 } 540 542 4.2.1. The qprotect() function 544 On each packet arrival, qprotect() measures the current queue delay 545 and derives the native marking probability from it. Then it uses 546 pick_bucket to find the bucket already holding the flow's state, or 547 to allocate a new bucket if the flow is new or its state has expired 548 (the most likely case). Then the queuing score is updated by the 549 fill_bucket() function. That completes the mechanism aspects. 551 The comments against the subsequent policy conditions and actions 552 should be self-explanatory at a superficial level. 
The deeper rationale for these conditions is given in Section 5.4.

   // Per packet queue protection
   qprotect(packet, ...) {
       bckt_id;     // bucket index
       qLscore;     // queuing score of pkt's flow in units of T_RES

       qdelay = qdelayL(...);
       probNative = calcProbNative(qdelay);

       bckt_id = pick_bucket(packet.uflow);
       qLscore = fill_bucket(bckt_id, packet.size, probNative);

       // Determine whether to sanction packet
       if ( ( ( qdelay > CRITICALqL )  // Test if qdelay over threshold...
              // ...and if flow's q'ing score scaled by qdelay/CRITICALqL
              // ...exceeds CRITICALqLSCORE
              && ( qdelay * qLscore > CRITICALqLPRODUCT ) )
            // or qLSCORE_MAX reached
            || ( qLscore >= qLSCORE_MAX ) )

           return EXIT_SANCTION;
       else
           return EXIT_SUCCESS;
   }

4.2.2. The pick_bucket() function

   The pick_bucket() function is optimized for flow-state that will
   normally have expired from packet to packet of the same flow. It is
   just one way of finding the bucket associated with the flow ID of
   each packet - it might be possible to develop more efficient
   alternatives.

   The algorithm is arranged so that the bucket holding any live (non-
   expired) flow-state associated with a packet will always be found
   before a new bucket is allocated. The constant ATTEMPTS, defined
   earlier, determines how many hashes are used to find a bucket for
   each flow (actually, only one hash is generated; then, by default, 5
   bits of it at a time are used as the hash value, because by default
   there are 2^5 = 32 buckets).

   The algorithm stores the flow's own ID in its flow-state. So, when a
   packet of a flow arrives, the algorithm tries up to ATTEMPTS times to
   hash to a bucket, looking for the flow's own ID. If found, it uses
   that bucket, first resetting the expiry time to 'now' if it has
   expired.

   If it does not find the flow's ID, and the expiry time is still
   current, the algorithm can tell that another flow is using that
   bucket, and it continues to look for a bucket for the flow. Even if
   it finds another flow's bucket where the expiry time has passed, it
   doesn't immediately use it. It merely remembers it as the potential
   bucket to use. But first it runs through all the ATTEMPTS hashes to
   look for a bucket assigned to the flow ID. Then, if a live bucket is
   not already associated with the packet's flow, the algorithm should
   have already set aside an existing bucket with a score that has aged
   out. Given this bucket is no longer necessary to hold state for its
   previous flow, it can be recycled for use by the present packet's
   flow.

   If all else fails, there is one additional bucket (called the dregs)
   that can be used. If the dregs is still in live use by another flow,
   subsequent flows that cannot find a bucket of their own all share it,
   adding their score to the one in the dregs. A flow might get away
   with using the dregs on its own, but when there are many mis-marked
   flows, multiple flows are more likely to collide in the dregs,
   including innocent flows. The choice of the number of buckets and the
   number of hash attempts determines how likely it will be that this
   undesirable scenario will occur.
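   To illustrate the index derivation described above, before the full
   pick_bucket() pseudocode that follows, here is a minimal, self-
   contained C sketch: a single 32-bit hash of the flow ID is computed
   once per packet, and each of the ATTEMPTS candidate bucket indices is
   taken from the next BI_SIZE bits of that hash. The hash function
   (toy_hash32) and the example flow ID are illustrative assumptions,
   not the hash specified for DOCSIS; the constants mirror the defaults
   in Section 4.1.

   #include <stdint.h>
   #include <stdio.h>

   #define BI_SIZE  5                  /* bit-width of a bucket index */
   #define NBUCKETS (1u << BI_SIZE)    /* 32 non-default buckets */
   #define MASK     (NBUCKETS - 1)     /* BI_SIZE least-signif. bits set */
   #define ATTEMPTS 2                  /* hash attempts per packet */

   /* Illustrative stand-in for the hash32() of the pseudocode; any
      reasonable 32-bit mix would do for this demonstration. */
   static uint32_t toy_hash32(uint32_t flow_id)
   {
       uint32_t h = flow_id * 2654435761u;   /* multiplicative mix */
       return h ^ (h >> 16);
   }

   int main(void)
   {
       uint32_t h32 = toy_hash32(0xC0A80101u); /* stands in for a 5-tuple hash */

       /* Each attempt consumes the next BI_SIZE bits of the same hash. */
       for (int j = 0; j < ATTEMPTS; j++) {
           printf("attempt %d -> candidate bucket %u\n",
                  j, (unsigned)(h32 & MASK));
           h32 >>= BI_SIZE;
       }
       return 0;
   }

   With the defaults (ATTEMPTS = 2, BI_SIZE = 5), only 10 bits of the
   32-bit hash are consumed; the remainder leaves room to increase
   either constant without changing the hash function.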
   // Pick the bucket associated with flow uflw
   pick_bucket(uflw) {

       now;    // current time
       j;      // loop counter
       h32;    // holds hash of the packet's flow IDs
       h;      // bucket index being checked
       hsav;   // interim chosen bucket index

       h32 = hash32(uflw);    // 32-bit hash of flow ID
       hsav = NBUCKETS;       // Default bucket
       now = get_time_now();  // in units of T_RES

       // The for loop checks ATTEMPTS buckets for ownership by flow-ID
       // It also records the 1st bucket, if any, that could be recycled
       //   because it's expired.
       // Must not recycle a bucket until all ownership checks completed
       for (j=0; j<ATTEMPTS; j++) {
           h = h32 & MASK;                 // Use BI_SIZE bits of hash as bucket index
           if (buckets[h].id == uflw) {    // Once a bucket owned by the flow-ID is found...
               if (buckets[h].t_exp <= now) {  // ...if the bucket has expired...
                   buckets[h].t_exp = now;     // ...reset it
               }
               return h;                   // ...use this bucket
           } else if ( (hsav == NBUCKETS)              // If no candidate recorded yet...
                     && (buckets[h].t_exp <= now) ) {  // ...and this bucket has expired...
               hsav = h;                   // ...record it as a candidate for recycling
           }
           h32 >>= BI_SIZE;                // Bit-shift hash for next attempt
       }
       // If reached here, no tested bucket was owned by the flow-ID
       if (hsav != NBUCKETS) {
           // If here, found an expired bucket within the above for loop
           buckets[hsav].t_exp = now;      // Reset expired bucket
       } else {
           // If here, we're having to use the default bucket (the dregs)
           if (buckets[hsav].t_exp <= now) {  // If dregs has expired...
               buckets[hsav].t_exp = now;     // ...reset it
           }
       }
       buckets[hsav].id = uflw;   // In either case, claim the bucket for this flow
       return hsav;
   }

4.2.3. The fill_bucket() function

   The fill_bucket() function both accumulates and ages the queuing
   score over time, as outlined in Section 2.1. To make aging the score
   efficient, the increment of the queuing score is transformed into
   units of time by dividing by AGING, so that the result represents the
   new expiry time of the flow.

   Given that probNative is already used to select which packets to ECN-
   mark, it might be thought that the queuing score could just be
   incremented by the full size of each selected packet, instead of
   incrementing it by the product of every packet's size (pkt_sz) and
   probNative. However, the unpublished experience of one of the
   authors with other congestion policers has found that the score then
   increments far too jumpily, particularly when probNative is low.

   A deeper explanation of the queuing score is given in Section 5.

   fill_bucket(bckt_id, pkt_sz, probNative) {
       now;    // current time
       now = get_time_now();  // in units of T_RES

       // Add packet's queuing score
       // For integer arithmetic, a bit-shift can replace the division
       qLscore = min(buckets[bckt_id].t_exp - now
                     + probNative * pkt_sz / AGING, qLSCORE_MAX);
       buckets[bckt_id].t_exp = now + qLscore;
       return qLscore;
   }

4.2.4. The calcProbNative() function

   To derive probNative, the QProt algorithm uses the linear ramp
   function calcProbNative() to normalize instantaneous queuing delay
   into a probability in the range [0,1].

   calcProbNative(qdelay){
       if ( qdelay >= MAXTH ) {
           probNative = MAX_PROB;
       } else if ( qdelay > MINTH ) {
           probNative = MAX_PROB * (qdelay - MINTH)/RANGE;
           // In practice, the * and the / would use a bit-shift
       } else {
           probNative = 0;
       }
       return probNative;
   }

5. Rationale

5.1. Rationale: Blame for Queuing, not for Rate in Itself

   Figure 1 shows the bit rates of two flows as stacked areas. It poses
   the question of which flow is more to blame for queuing delay: the
   unresponsive constant bit rate flow (c) that is consuming about 80%
   of the capacity, or the flow sending regular short unresponsive
   bursts (b)? The smoothness of c seems better for avoiding queuing,
   but its high rate does not.
However, if flow c was not there, or ran 738 slightly more slowly, b would not cause any queuing. 740 ^ bit rate (stacked areas) 741 | ,-. ,-. ,-. ,-. ,-. 742 |--|b|----------|b|----------|b|----------|b|----------|b|---Capacity 743 |__|_|__________|_|__________|_|__________|_|__________|_|_____ 744 | 745 | c 746 | 747 | 748 | 749 +----------------------------------------------------------------> 750 time 752 Figure 1: Which is More to Blame for Queuing Delay? 754 To explain queuing scores, in the following it will initially be 755 assumed that the QProt algorithm is accumulating queuing scores, but 756 not taking any action as a result. 758 To quantify the responsibility that each flow bears for queuing 759 delay, the QProt algorithm accumulates the product of the rate of 760 each flow and the level of congestion, both measured at the instant 761 each packet arrives. The instantaneous flow rate is represented at 762 each discrete event when a packet arrives by the packet's size, which 763 accumulates faster the more packets arrive within each unit of time. 764 The level of congestion is normalized to a dimensionless number 765 between 0 and 1 (probNative). This fractional congestion level is 766 used in preference to a direct dependence on queuing delay for two 767 reasons: 769 * to be able to ignore very low levels of queuing that contribute 770 insignificantly to delay 772 * to be able to erect a steep barrier against excessive queuing 773 delay 775 The unit of the resulting queue score is "congested-bytes" per 776 second, which distinguishes it from just bytes per second. 778 Then, during the periods between bursts (b), neither flow accumulates 779 any queuing score - the high rate of c is benign. But, during each 780 burst, if we say the rate of c and b are 80% and 45% of capacity, 781 thus causing 25% overload, they each bear (80/125)% and (45/125)% of 782 the responsibility for the queuing delay (64% and 36%). The 783 algorithm does not explicitly calculate these percentages. They are 784 just the outcome of the number of packets arriving from each flow 785 during the burst. 787 To summarize, the queuing score never sanctions rate solely on its 788 own account. It only sanctions rate inasmuch as it causes queuing. 790 ^ bit rate (stacked areas) , 791 | ,-. |\ ,- 792 |------Capacity-|b|----------,-.----------|b|----------|b\----- 793 | __|_|_______ |b| /``\| _...-._-': | ,.-- 794 | ,-. __/ \__|_|_ _/ |/ \|/ 795 | |b| ___/ \___/ __ r 796 | |_|/ v \__/ \_______ _/\____/ 797 | _/ \__/ 798 | 799 +----------------------------------------------------------------> 800 time 802 Figure 2: Responsibility for Queuing: More Complex Scenario 804 Figure 2 gives a more complex illustration of the way the queuing 805 score assigns responsibility for queuing (limited to the precision 806 that ASCII art can illustrate). The figure shows the bit rates of 807 three flows represented as stacked areas labelled b, v and r. The 808 unresponsive bursts (b) are the same as in the previous example, but 809 a variable rate video (v) replaces flow c. It's rate varies as the 810 complexity of the video scene varies. Also on a slower timescale, in 811 response to the level of congestion, the video adapts its quality. 812 However, on a short time-scale it appears to be unresponsive to small 813 amounts of queuing. Also, part-way through, a low latency responsive 814 flow (r) joins in, aiming to fill the balance of capacity left by the 815 other two. 
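   Before walking through the Figure 2 scenario, the simpler Figure 1
   case can be made concrete with a minimal C sketch. All numbers in it
   are illustrative assumptions (a 100 Mb/s link, flow c at 80% and flow
   b at 45% of capacity during a 10 ms burst, and an arbitrary constant
   probNative while the burst lasts), not values from the DOCSIS spec:

   #include <stdio.h>

   /* Illustrative numbers only: flow c is smooth at 80% of a 100 Mb/s
      link, flow b bursts at a further 45%, overloading the link by 25%
      for the 10 ms that one burst lasts.  probNative is assumed to hold
      a constant value p for the duration of the burst. */
   #define RATE_C   80e6    /* b/s, smooth unresponsive flow */
   #define RATE_B   45e6    /* b/s, flow b's rate during a burst */
   #define BURST_T  10e-3   /* s, duration of one burst */

   int main(void)
   {
       double p = 0.3;   /* assumed probNative during the burst */

       /* Each flow's queuing score over the burst is the sum over its
          packets of probNative * packet size, i.e. p * rate * time. */
       double score_c = p * (RATE_C / 8) * BURST_T;   /* congested-bytes */
       double score_b = p * (RATE_B / 8) * BURST_T;

       printf("blame shares: c = %.0f%%, b = %.0f%%\n",
              100 * score_c / (score_c + score_b),
              100 * score_b / (score_c + score_b));
       return 0;
   }

   This prints c = 64% and b = 36% whatever value p takes: the shares
   depend only on the flows' relative arrival rates while there is
   queuing. Between the bursts, probNative is zero, so neither flow
   accumulates any score, however fast c sends.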
817 The combination of the first burst and the low application-limited 818 rate of the video causes neither flow to accumulate queuing score. 819 In contrast, the second burst causes similar excessive overload 820 (125%) to the example in Figure 1. Then, the video happens to reduce 821 its rate (probably due to a less complex scene) so the third burst 822 causes only a little congestion. Let us assume the resulting queue 823 causes probNative to rise to just 1%, then the queuing score will 824 only accumulate 1% of the size of each packet of flows v and b during 825 this burst. 827 The fourth burst happens to arrive just as the new responsive flow 828 (r) has filled the available capacity, so it leads to very rapid 829 growth of the queue. After a round trip the responsive flow rapidly 830 backs off, and the adaptive video also backs off more rapidly than it 831 would normally, because of the very high congestion level. The rapid 832 response to congestion of flow r reduces the queuing score that all 833 three flows accumulate, but they each still bear the cost in 834 proportion to the product of the rates at which their packets arrive 835 at the queue and the value of probNative when they do so. Thus, 836 during the fifth burst, they all accumulate less score than the 837 fourth, because the queuing delay is not as excessive. 839 5.2. Rationale for Aging the Queuing Score 841 Even well-behaved flows will not always be able to respond fast 842 enough to dynamic events. Also well-behaved flows, e.g., DCTCP 843 [RFC8257], TCP Prague [I-D.briscoe-iccrg-prague-congestion-control], 844 BBRv2 [BBRv2] or the L4S variant of SCReAM [SCReAM] for real-time 845 media [RFC8298], can maintain a very shallow queue by continual 846 careful probing for more while also continually subtracting a little 847 from their rate (or congestion window) in response to low levels of 848 ECN signalling. Therefore, the QProt algorithm needs to continually 849 offer a degree of forgiveness to age out the queuing score as it 850 accumulates. 852 Scalable congestion controllers such as those above maintain their 853 congestion window in inverse proportion to the congestion level, 854 probNative. That leads to the important property that on average a 855 scalable flow holds the product of its congestion window and the 856 congestion level constant, no matter the capacity of the link or how 857 many other flows it competes with. For instance, if the link 858 capacity doubles, a scalable flow induces half the congestion 859 probability. Or if three scalable flows compete for the capacity, 860 each flow will reduce to one third of the capacity they would use on 861 their own and increase the congestion level by 3x. 863 This suggests that the QProt algorithm will not sanction a well- 864 behaved scalable flow if it ages out the queuing score at a 865 sufficient constant rate. The constant will need to be somewhat 866 above the average of a well-behaved scalable flow to allow for normal 867 dynamics. 869 Relating QProt's aging constant to a scalable flow does not mean that 870 a flow has to behave like a scalable flow. It can be less 871 aggressive, but not more. For instance, a longer RTT flow can run at 872 a lower congestion-rate than the aging rate, but it can also increase 873 its aggressiveness to equal the rate of short RTT scalable flows 874 [ScalingCC]. 
The constant aging of QProt also means that a long- 875 running unresponsive flow will be prone to trigger QProt if it runs 876 faster than a competing responsive scalable flow would. And, of 877 course, if a flow causes excessive queuing in the short-term, its 878 queuing score will still rise faster than the constant aging process 879 will decrease it. Then QProt will still eject the flow's packets 880 before they harm the low latency of the shared queue. 882 5.3. Rationale for Transformed Queuing Score 884 The QProt algorithm holds a flow's queuing score state in a structure 885 called a bucket, because of its similarity to a classic leaky bucket 886 (except the contents of the bucket does not represent bytes). 888 probNative * pkt_sz probNative * pkt_sz / AGING 889 | | 890 | V | | V | 891 | : | ___ | : | 892 |_____| ___ |_____| 893 | | ___ | | 894 |__ __| |__ __| 895 | | 896 V V 897 AGING * Dt Dt 899 Figure 3: Transformation of Queuing Score 901 The accumulation and aging of the queuing score is shown on the left 902 of Figure 3 in token bucket form. Dt is the difference between the 903 times when the scores of the current and previous packets were 904 processed. 906 A transformed equivalent of this token bucket is shown on the right 907 of Figure 3, dividing both the input and output by the constant AGING 908 rate. The result is a bucket-depth that represents time and it 909 drains at the rate that time passes. 911 As a further optimization, the time the bucket was last updated is 912 not stored with the flow-state. Instead, when the bucket is 913 initialized the queuing score is added to the system time 'now' and 914 the resulting expiry time is written into the bucket. Subsequently, 915 if the bucket has not expired, the incremental queuing score is added 916 to the time already held in the bucket. Then the queuing score 917 always represents the expiry time of the flow-state itself. This 918 means that the queuing score does not need to be aged explicitly 919 because it ages itself implicitly. 921 5.4. Rationale for Policy Conditions 923 Pseudocode for the QProt policy conditions is given in Section 4.1 924 within the second half of the qprotect() function. When each packet 925 arrives, after finding its flow state and updating the queuing score 926 of the packet's flow, the algorithm checks whether the shared queue 927 delay exceeds a constant threshold CRITICALqL (e.g., 2 ms), as 928 repeated below for convenience: 930 931 if ( ( qdelay > CRITICALqL ) // Test if qdelay over threshold... 932 // ...and if flow's q'ing score scaled by qdelay/CRITICALqL 933 // ...exceeds CRITICALqLSCORE 934 && ( qdelay * qLscore > CRITICALqLPRODUCT ) ) 935 // Recall that CRITICALqLPRODUCT = CRITICALqL * CRITICALqLSCORE 936 938 If the queue delay threshold is exceeded, the flow's queuing score is 939 temporarily scaled up by the ratio of the current queue delay to the 940 threshold queuing delay, CRITICALqL (the reason for the scaling is 941 given next). If this scaled up score exceeds another constant 942 threshold CRITICALqLSCORE, the packet is ejected. The actual last 943 line of code above multiplies both sides of the second condition by 944 CRITICALqL to avoid a costly division. 946 This approach allows each packet to be assessed once, as it arrives. 947 Once queue delay exceeds the threshold, it has two implications: 949 * The current packet might be ejected even though there are packets 950 already in the queue from flows with higher queuing scores. 
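   As a worked illustration of how the bar is raised, the following C
   sketch evaluates the ejection condition for a few queue delays, using
   the default parameter values quoted in Section 4.1 (CRITICALqL_us =
   MAXTH_us = 1000 us and CRITICALqLSCORE_us = 4000 us). It uses
   floating-point microseconds purely for readability; the actual
   algorithm works in integer units of T_RES and avoids the division by
   comparing against the pre-computed CRITICALqLPRODUCT instead.

   #include <stdio.h>

   #define CRITICALqL_us      1000.0   /* default: MAXTH_us */
   #define CRITICALqLSCORE_us 4000.0   /* default threshold queuing score */

   int main(void)
   {
       double qdelays_us[] = { 500.0, 1500.0, 2000.0, 4000.0 };

       for (int i = 0; i < 4; i++) {
           double qdelay = qdelays_us[i];
           if (qdelay <= CRITICALqL_us) {
               printf("qdelay %4.0f us: below CRITICALqL, no ejection\n",
                      qdelay);
           } else {
               /* qdelay * qLscore > CRITICALqL * CRITICALqLSCORE is the
                  same as qLscore > CRITICALqLSCORE * CRITICALqL / qdelay,
                  so the queuing score needed for ejection shrinks as the
                  queue delay grows. */
               double score_needed_us =
                   CRITICALqLSCORE_us * CRITICALqL_us / qdelay;
               printf("qdelay %4.0f us: eject if flow's queuing score > %4.0f us\n",
                      qdelay, score_needed_us);
           }
       }
       return 0;
   }

   So at twice the critical delay a flow needs only half the threshold
   score for one of its packets to be ejected, which is the increasingly
   stringent 'judgement of innocence' shown by the boundary of region E
   in Figure 4.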
951 However, any flow that continues to contribute to the queue will 952 have to send further packets, giving an opportunity to eject them 953 as well, as they subsequently arrive. 955 * The next packets to arrive might not be ejected, because they 956 might belong to flows with low queuing scores. In this case, 957 queue delay could continue to rise with no opportunity to eject a 958 packet. This is why the queuing score is scaled up by the current 959 queue delay. Then, the more the queue has grown without ejecting 960 a packet, the more the algorithm 'raises the bar' to further 961 packets. 963 The above approach is preferred over the extra per-packet processing 964 cost of searching the buckets for the flow with the highest queuing 965 score and searching the queue for one of its packets to eject (if one 966 is still in the queue). 968 Note that by default CRITICALqL_us is set to the maximum threshold of 969 the ramp marking algorithm, MAXTH_us. However, there is some debate 970 as to whether setting it to the minimum threshold instead would 971 improve QProt performance. This would roughly double the ratio of 972 qdelay to CRITICALqL, which is compared against the CRITICALqLSCORE 973 threshold. So the threshold would have to be roughly doubled 974 accordingly. 976 Figure 4 explains this approach graphically. On the horizontal axis 977 it shows actual harm, meaning the queuing delay in the shared queue. 978 On the vertical axis it shows the behaviour record of the flow 979 associated with the currently arriving packet, represented in the 980 algorithm by the flow's queuing score. The shaded region represents 981 the combination of actual harm and behaviour record that will lead to 982 the packet being ejected. 984 Behaviour Record: 985 Queueing Score of 986 Arriving Packet's Flow 987 ^ 988 | + |/ / / / / / / / / / / / / / / / / / / 989 | + N | / / / / / / / / / / / / / / / / / / / 990 | + |/ / / / / / / / / / 991 | + | / / / / E (Eject packet) / / / / / 992 | + |/ / / / / / / / / / 993 | + | / / / / / / / / / / / / / / / / / / / 994 | + |/ / / / / / / / / / / / / / / / / / / 995 | +| / / / / / / / / / / / / / / / / / / / 996 | |+ / / / / / / / / / / / / / / / / / / 997 | N | + / / / / / / / / / / / / / / / / / 998 | (No actual | +/ / / / / / / / / / / / / / / 999 | harm) | + / / / / / / / / / / / / 1000 | | P (Pass over) + ,/ / / / / / / / 1001 | | ^ + /./ /_/ 1002 +--------------+------------------------------------------> 1003 CRITICALqL Actual Harm: Shared Queue Delay 1005 Figure 4: Graphical Explanation of the Policy Conditions 1007 The regions labelled 'N' represent cases where the first condition is 1008 not met - no actual harm - queue delay is below the critical 1009 threshold, CRITICALqL. 1011 The region labelled 'E' represents cases where there is actual harm 1012 (queue delay exceeds CRITICALqL) and the queuing score associated 1013 with the arriving packet is high enough to be able to eject it with 1014 certainty. 1016 The region labelled 'P' represents cases where there is actual harm, 1017 but the queuing score of the arriving packet is insufficient to eject 1018 it, so it has to be Passed over. This adds to queuing delay, but the 1019 alternative would be to sanction an innocent flow. It can be seen 1020 that, as actual harm increases, the judgement of innocence becomes 1021 increasingly stringent; the behaviour record of the next packet's 1022 flow does not have to be as bad to eject it. 
1024 Conditioning ejection on actual harm helps prevent VPN packets being 1025 ejected unnecessarily. VPNs consisting of multiple flows can tend to 1026 accumulate queuing score faster than it is aged out, because the 1027 aging rate is intended for a single flow. However, whether or not 1028 some traffic is in a VPN, the queue delay threshold (CRITICALqL) will 1029 be no more likely to be exceeded. So conditioning ejection on actual 1030 harm helps reduce the chance that VPN traffic will be ejected by the 1031 QProt function. 1033 5.5. Rationale for Reclassification as the Policy Action 1035 When the DOCSIS QProt algorithm deems that it is necessary to eject a 1036 packet to protect the Low Latency queue, it redirects the packet to 1037 the Classic queue. In the Low Latency DOCSIS architecture (as in 1038 Coupled DualQ AQMs generally), the Classic queue is expected to 1039 frequently have a larger backlog of packets, caused by classic 1040 congestion controllers interacting with a classic AQM (which has a 1041 latency target of 10ms) as well as other bursty traffic. 1043 Therefore, typically, an ejected packet will experience higher 1044 queuing delay than it would otherwise, and it could be re-ordered 1045 within its flow (assuming QProt does not eject all packets of an 1046 anomalous flow). The mild harm caused to the performance of the 1047 ejected packet's flow is deliberate. It gives senders a slight 1048 incentive to identify their packets correctly. 1050 If there were no such harm, there would be nothing to prevent all 1051 flows from identifying themselves as suitable for classification into 1052 the low latency queue, and just letting QProt sort the resulting 1053 aggregate into queue-building and non-queue-building flows. This 1054 might seem like a useful alternative to requiring senders to 1055 correctly identify their flows. However, handling of mis-classified 1056 flows is not without a cost. The more packets that have to be 1057 reclassified, the more often the delay of the low latency queue would 1058 exceed the threshold. Also more memory would be required to hold the 1059 extra flow state. 1061 When a packet is redirected into the Classic queue, an operator might 1062 want to alter the identifier(s) that originally caused it to be 1063 classified into the Low Latency queue, so that the packet will not be 1064 classified into another low latency queue further downstream. 1065 However, redirection of occasional packets can be due to unusually 1066 high transient load just at the specific bottleneck, not necessarily 1067 at any other bottleneck, and not necessarily due to bad flow 1068 behaviour. Therefore, Section 5.4.1.2 of [I-D.ietf-tsvwg-ecn-l4s-id] 1069 precludes a network node from altering the end-to-end ECN field to 1070 exclude traffic from L4S treatment. Instead a local-use identifier 1071 ought to be used (e.g., Diffserv Codepoint or VLAN tag), so that each 1072 operator can apply its own policy, without prejudging what other 1073 operators ought to do. 1075 Although not supported in the DOCSIS specs, QProt could be extended 1076 to recognize that large numbers of redirected packets belong to the 1077 same flow. This might be detected when the bucket expiry time t_exp 1078 exceeds a threshold. Depending on policy and implementation 1079 capabilities, QProt could then install a classifier to redirect a 1080 whole flow into the Classic queue, with an idle timeout to remove 1081 stale classifiers. 
In these 'persistent offender' cases, QProt might 1082 also overwrite each redirected packet's DSCP or clear its ECN field 1083 to Not-ECT, in order to protect other potential L4S queues 1084 downstream. The DOCSIS specs do not discuss sanctioning whole flows, 1085 so further discussion is beyond the scope of the present document. 1087 6. Limitations 1089 The QProt algorithm groups packets with common layer-4 flow 1090 identifiers. It then uses this grouping to accumulate queuing scores 1091 and to sanction packets. 1093 This choice of identifier for grouping is pragmatic with no 1094 scientific basis. All the packets of a flow certainly pass between 1095 the same two endpoints. But some applications might initiate 1096 multiple flows between the same end-points, e.g., for media, control, 1097 data, etc. Others might use common flow identifiers for all these 1098 streams. Also, a user might group multiple application flows within 1099 the same encrypted VPN between the same layer-4 tunnel end-points. 1100 And even if there were a one-to-one mapping between flows and 1101 applications, there is no reason to believe that the rate at which 1102 congestion can be caused ought to be allocated on a per application 1103 flow basis. 1105 The use of a queuing score that excludes those aspects of flow rate 1106 that do not contribute to queuing (Section 5.1) goes some way to 1107 mitigating this limitation, because the algorithm does not judge 1108 responsibility for queuing delay primarily on the combined rate of a 1109 set of flows grouped under one flow ID. 1111 7. IANA Considerations (to be removed by RFC Editor) 1113 This specification contains no IANA considerations. 1115 8. Implementation Status 1117 +================+================================================+ 1118 | Implementation | DOCSIS models for ns-3 | 1119 | name: | | 1120 +================+================================================+ 1121 | Organization | CableLabs | 1122 +----------------+------------------------------------------------+ 1123 | Web page | https://apps.nsnam.org/app/docsis-ns3/ | 1124 +----------------+------------------------------------------------+ 1125 | Description | ns-3 simulation models developed and used in | 1126 | | support of the Low Latency DOCSIS development, | 1127 | | including models of Dual Queue Coupled AQM, | 1128 | | Queue Protection, and the DOCSIS MAC | 1129 +----------------+------------------------------------------------+ 1130 | Maturity | Simulation models that can also be used in | 1131 | | emulation mode in a testbed context | 1132 +----------------+------------------------------------------------+ 1133 | Coverage | Complete implementation of Annex P of DOCSIS | 1134 | | 3.1 | 1135 +----------------+------------------------------------------------+ 1136 | Version | DOCSIS 3.1, version I21; | 1137 | | https://www.cablelabs.com/specifications/CM- | 1138 | | SP-MULPIv3.1?v=I21 | 1139 +----------------+------------------------------------------------+ 1140 | Licence | GPLv2 | 1141 +----------------+------------------------------------------------+ 1142 | Contact | via web page | 1143 +----------------+------------------------------------------------+ 1144 | Last Impl'n | Mar 2022 | 1145 | update | | 1146 +----------------+------------------------------------------------+ 1147 | Information | 7 Mar 2022 | 1148 | valid at | | 1149 +----------------+------------------------------------------------+ 1151 Table 1 1153 There are also a number of closed source implementations, including 2 
1154 cable modem implementations written by different chipset 1155 manufacturers, and one CMTS implementation by a third manufacturer. 1156 These, as well as the ns-3 implementation, have passed the full suite 1157 of compliance tests developed by CableLabs. 1159 9. Security Considerations 1161 The whole of this document concerns traffic security. It considers 1162 the security question of how to identify and eject traffic that does 1163 not comply with the non-queue-building behaviour required to use a 1164 shared low latency queue, whether accidentally or maliciously. 1166 Section 8.2 of the L4S architecture [I-D.ietf-tsvwg-l4s-arch] 1167 introduces the problem of maintaining low latency by either self- 1168 restraint or enforcement, and places DOCSIS queue protection in 1169 context within a wider set of approaches to the problem. 1171 9.1. Resource Exhaustion Attacks 1173 The algorithm has been designed to fail gracefully in the face of 1174 traffic crafted to overrun the resources used for the algorithm's own 1175 processing and flow state. This means that non-queue-building flows 1176 will always be less likely to be sanctioned than queue-building 1177 flows. But an attack could be contrived to deplete resources in such 1178 a way that the proportion of innocent (non-queue-building) flows that 1179 are incorrectly sanctioned could increase. 1181 Incorrect sanctioning is intended not to be catastrophic; it results 1182 in more packets from well-behaved flows being redirected into the 1183 Classic queue; thus introducing more reordering into innocent flows. 1185 9.1.1. Exhausting Flow-State Storage 1187 To exhaust the number of buckets, the most efficient attack is to 1188 send enough long-running attack flows to increase the chance that an 1189 arriving flow will not find an available bucket, and therefore have 1190 to share the 'dregs' bucket. For instance, if ATTEMPTS=2 and 1191 NBUCKETS=32, it requires about 94 attack flows, each using different 1192 port numbers, to increase the probability to 99% that an arriving 1193 flow will have to share the dregs, where it will share a high degree 1194 of redirection into the C queue with the remainder of the attack 1195 flows. 1197 For an attacker to keep buckets busy, it is more efficient to hold 1198 onto them by cycling regularly through a set of port numbers (94 in 1199 the above example), rather than to keep occupying and releasing 1200 buckets with single packet flows across a much larger number of 1201 ports. 1203 During such an attack, the coupled marking probability will have 1204 saturated at 100%. So, to hold a bucket, the rate of an attack flow 1205 needs to be no less than the AGING rate of each bucket; 4Mb/s by 1206 default. However, for an attack flow to be sure to hold on to its 1207 bucket, it would need to send somewhat faster. Thus an attack with 1208 100 flows would need a total force of more than 100 * 4Mb/s = 400Mb/ 1209 s. 1211 This attack can be mitigated (but not prevented) by increasing the 1212 number of buckets. The required attack force scales linearly with 1213 the number of buckets, NBUCKETS. So, if NBUCKETS were doubled to 64, 1214 twice as many 4Mb/s flows would be needed to maintain the same impact 1215 on innocent flows. 1217 Probably the most effective mitigation would be to implement 1218 redirection of whole-flows once enough of the individual packets of a 1219 certain offending flow had been redirected. 
Probably the most effective mitigation would be to implement
redirection of whole flows once enough of the individual packets of a
particular offending flow had been redirected.  This would free up the
buckets used to maintain the per-packet queuing score of persistent
offenders.  Whole-flow redirection is outside the scope of the current
version of the QProt algorithm specified here, but it is briefly
discussed at the end of Section 5.5.

It might be considered that all the packets of persistently offending
flows ought to be discarded rather than redirected.  However, this is
not recommended, because attack flows might be able to pervert whole-
flow discard, turning it onto at least some innocent flows, thus
amplifying an attack that causes reordering into one that completely
deletes some innocent flows.

9.1.2. Exhausting Processing Resources

The processing time needed to apply the QProt algorithm to each L
packet is small, and it is intended not to exceed the time available
between the packets of a run of fairly small packets.  However, an
attack could use minimum-size packets launched from multiple input
interfaces into a lower capacity output interface.  Whether the QProt
algorithm is vulnerable to processor exhaustion will depend on the
specific implementation.

Addition of a capability to redirect persistently offending flows from
L to C would be the most effective way to reduce the per-packet
processing cost of the QProt algorithm when under attack.  As already
mentioned in Section 9.1.1, it would also be an effective way to
mitigate flow-state exhaustion attacks.  Whole-flow redirection is
outside the scope of the present document, but it is briefly discussed
at the end of Section 5.5.

10. Comments Solicited

Evaluation and assessment of the algorithm by researchers are
solicited.  Comments and questions are also encouraged and welcome.
They can be addressed to the authors.

11. Acknowledgements

Thanks to Tom Henderson, Magnus Westerlund, David Black, Adrian Farrel
and Gorry Fairhurst for their reviews of this document.  The design of
the QProt algorithm and the settings of its parameters benefited from
discussion and critique by the participants of the cable industry
working group on Low Latency DOCSIS.  CableLabs funded Bob Briscoe's
initial work on this document.

12. References

12.1. Normative References

[DOCSIS]   CableLabs, "MAC and Upper Layer Protocols Interface (MULPI)
           Specification, CM-SP-MULPIv3.1", Data-Over-Cable Service
           Interface Specifications DOCSIS® 3.1, Version I17 or later,
           21 January 2019.

[DOCSIS-CCAP-OSS]
           CableLabs, "CCAP Operations Support System Interface Spec",
           Data-Over-Cable Service Interface Specifications DOCSIS®
           3.1, Version I14 or later, 21 January 2019.

[DOCSIS-CM-OSS]
           CableLabs, "Cable Modem Operations Support System Interface
           Spec", Data-Over-Cable Service Interface Specifications
           DOCSIS® 3.1, Version I14 or later, 21 January 2019.

[I-D.ietf-tsvwg-ecn-l4s-id]
           Schepper, K. D. and B. Briscoe, "Explicit Congestion
           Notification (ECN) Protocol for Very Low Queuing Delay
           (L4S)", Work in Progress, Internet-Draft,
           draft-ietf-tsvwg-ecn-l4s-id-25, 4 March 2022.

[I-D.ietf-tsvwg-nqb]
           White, G. and T. Fossati, "A Non-Queue-Building Per-Hop
           Behavior (NQB PHB) for Differentiated Services", Work in
           Progress, Internet-Draft, draft-ietf-tsvwg-nqb-10,
           4 March 2022.

[RFC3168]  Ramakrishnan, K., Floyd, S., and D.
Black, "The Addition 1305 of Explicit Congestion Notification (ECN) to IP", 1306 RFC 3168, DOI 10.17487/RFC3168, September 2001, 1307 . 1309 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 1310 Notification (ECN) Experimentation", RFC 8311, 1311 DOI 10.17487/RFC8311, January 2018, 1312 . 1314 12.2. Informative References 1316 [BBRv2] Cardwell, N., "TCP BBR v2 Alpha/Preview Release", github 1317 repository; Linux congestion control module, 1318 . 1320 [I-D.briscoe-iccrg-prague-congestion-control] 1321 Schepper, K. D., Tilmans, O., and B. Briscoe, "Prague 1322 Congestion Control", Work in Progress, Internet-Draft, 1323 draft-briscoe-iccrg-prague-congestion-control-00, 9 March 1324 2021, . 1327 [I-D.ietf-tsvwg-aqm-dualq-coupled] 1328 Schepper, K. D., Briscoe, B., and G. White, "DualQ Coupled 1329 AQMs for Low Latency, Low Loss and Scalable Throughput 1330 (L4S)", Work in Progress, Internet-Draft, draft-ietf- 1331 tsvwg-aqm-dualq-coupled-23, 4 May 2022, 1332 . 1335 [I-D.ietf-tsvwg-l4s-arch] 1336 Briscoe, B., Schepper, K. D., Bagnulo, M., and G. White, 1337 "Low Latency, Low Loss, Scalable Throughput (L4S) Internet 1338 Service: Architecture", Work in Progress, Internet-Draft, 1339 draft-ietf-tsvwg-l4s-arch-17, 4 March 2022, 1340 . 1343 [LLD] White, G., Sundaresan, K., and B. Briscoe, "Low Latency 1344 DOCSIS: Technology Overview", CableLabs White Paper , 1345 February 2019, . 1348 [RFC4303] Kent, S., "IP Encapsulating Security Payload (ESP)", 1349 RFC 4303, DOI 10.17487/RFC4303, December 2005, 1350 . 1352 [RFC6789] Briscoe, B., Ed., Woundy, R., Ed., and A. Cooper, Ed., 1353 "Congestion Exposure (ConEx) Concepts and Use Cases", 1354 RFC 6789, DOI 10.17487/RFC6789, December 2012, 1355 . 1357 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 1358 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 1359 DOI 10.17487/RFC7713, December 2015, 1360 . 1362 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 1363 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 1364 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 1365 October 2017, . 1367 [RFC8298] Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation 1368 for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December 1369 2017, . 1371 [ScalingCC] 1372 Briscoe, B. and K. De Schepper, "Resolving Tensions 1373 between Congestion Control Scaling Requirements", Simula 1374 Technical Report TR-CS-2016-001 arXiv:1904.07605, July 1375 2017, . 1377 [SCReAM] Johansson, I., "SCReAM", github repository; , 1378 . 1381 Authors' Addresses 1383 Bob Briscoe (editor) 1384 Independent 1385 United Kingdom 1386 Email: ietf@bobbriscoe.net 1387 URI: http://bobbriscoe.net/ 1389 Greg White 1390 CableLabs 1391 United States of America 1392 Email: G.White@CableLabs.com