idnits 2.17.1 draft-briscoe-docsis-q-protection-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 750 has weird spacing: '... pkt_sz prob...' == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document date (July 8, 2019) is 1753 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '0' on line 587 -- Looks like a reference, but probably isn't: '1' on line 587 == Missing Reference: 'B' is mentioned on line 381, but not defined == Outdated reference: A later version (-29) exists of draft-ietf-tsvwg-ecn-l4s-id-06 == Outdated reference: A later version (-25) exists of draft-ietf-tsvwg-aqm-dualq-coupled-09 Summary: 0 errors (**), 0 flaws (~~), 6 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Transport Area Working Group B. Briscoe, Ed. 3 Internet-Draft G. White 4 Intended status: Informational CableLabs 5 Expires: January 9, 2020 July 8, 2019 7 Queue Protection to Preserve Low Latency 8 draft-briscoe-docsis-q-protection-00 10 Abstract 12 This informational document defines and explains the specification of 13 the queue protection algorithm used in DOCSIS 3.1. It is believed 14 this algorithm will be useful in scenarios other than DOCSIS. A 15 shared low latency queue relies on the non-queue-building behaviour 16 of every traffic flow using it. However, some flows might not take 17 such care, either accidentally or maliciously. If a queue is about 18 to exceed a threshold level of delay, the queue protection algorithm 19 can rapidly detect the flow(s) most likely to be responsible. It can 20 then prevent selected packets of these flows (or whole flows) from 21 harming the queuing delay of other traffic in the low latency queue. 23 Status of This Memo 25 This Internet-Draft is submitted in full conformance with the 26 provisions of BCP 78 and BCP 79. 28 Internet-Drafts are working documents of the Internet Engineering 29 Task Force (IETF). Note that other groups may also distribute 30 working documents as Internet-Drafts. The list of current Internet- 31 Drafts is at https://datatracker.ietf.org/drafts/current/. 33 Internet-Drafts are draft documents valid for a maximum of six months 34 and may be updated, replaced, or obsoleted by other documents at any 35 time. It is inappropriate to use Internet-Drafts as reference 36 material or to cite them other than as "work in progress." 38 This Internet-Draft will expire on January 9, 2020. 40 Copyright Notice 42 Copyright (c) 2019 IETF Trust and the persons identified as the 43 document authors. All rights reserved. 45 This document is subject to BCP 78 and the IETF Trust's Legal 46 Provisions Relating to IETF Documents 47 (https://trustee.ietf.org/license-info) in effect on the date of 48 publication of this document. Please review these documents 49 carefully, as they describe your rights and restrictions with respect 50 to this document. Code Components extracted from this document must 51 include Simplified BSD License text as described in Section 4.e of 52 the Trust Legal Provisions and are provided without warranty as 53 described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 58 1.1. Document Roadmap . . . . . . . . . . . . . . . . . . . . 3 59 1.2. Terminology . . . . . . . . . . . . . . . . . . . . . . . 4 60 2. Approach - In Brief . . . . . . . . . . . . . . . . . . . . . 4 61 2.1. Mechanism . . . . . . . . . . . . . . . . . . . . . . . . 5 62 2.2. Policy . . . . . . . . . . . . . . . . . . . . . . . . . 6 63 3. Necessary Flow Behaviour . . . . . . . . . . . . . . . . . . 6 64 4. Pseudocode Walk-Through . . . . . . . . . . . . . . . . . . . 7 65 4.1. Input Parameters, Constants and Variables . . . . . . . . 8 66 4.2. Queue Protection Data Path . . . . . . . . . . . . . . . 9 67 5. Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . 14 68 5.1. Rationale: Blame for Queuing, not for Rate in Itself . . 14 69 5.2. Rationale for Aging the Queuing Score . . . . . . . . . . 17 70 5.3. Rationale for Normalized Queuing Score . . . . . . . . . 17 71 5.4. Rationale for Policy Conditions . . . . . . . . . . . . . 18 72 6. Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 20 73 7. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 21 74 8. Security Considerations . . . . . . . . . . . . . . . . . . . 21 75 9. Comments Solicited . . . . . . . . . . . . . . . . . . . . . 21 76 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 21 77 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 21 78 11.1. Normative References . . . . . . . . . . . . . . . . . . 21 79 11.2. Informative References . . . . . . . . . . . . . . . . . 22 80 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 24 82 1. Introduction 84 This informational document defines and explains the specification of 85 the queue protection (QProt) algorithm used in DOCSIS 3.1 86 [DOCSIS3.1]. It is believed this algorithm will be useful in 87 scenarios other than DOCSIS. 89 Low queuing delay depends on hosts sending their data smoothly either 90 at low rate or responding to explicit congestion notifications (ECN). 91 So low latency is something hosts create themselves, not something 92 the network gives them. Therefore, there is no incentive for flows 93 to mis-mark their packets for the low latency queue, However, traffic 94 from an application that does not behave in a non-queue-building way 95 might erroneously be classified into a low latency queue, whether 96 accidentally or maliciously. QProt prevents such erroneous behavior 97 from harming the queuing delay of other traffic in the low latency 98 queue. 100 In normal scenarios without misclassified traffic, QProt does not 101 intervene at all in the classification or forwarding of packets. 103 An overview of how low latency support has been added to DOCSIS is 104 given in [I-D.white-tsvwg-lld]. In each direction of a DOCSIS link 105 (upstream and downstream), there are two queues: one for Low Latency 106 and one for Classic traffic, in an arrangement similar to the IETF's 107 Coupled DualQ AQM [I-D.ietf-tsvwg-aqm-dualq-coupled]. The Classic 108 queue is intended for traffic such as traditional (Reno/Cubic) TCP 109 that needs about a round trip of buffering to fully utilize the link, 110 and therefore has no incentive to mismark itself as low latency. The 111 QProt function is located at the ingress to the Low Latency queue. 112 Therefore, in the upstream QProt is located on the cable modem (CM), 113 and in the downstream it is located on the cable CMTS (CM Termination 114 System). If an arriving packet triggers queue protection, the DOCSIS 115 algorithm reclassifies the packet from the Low Latency queue into the 116 Classic queue. 118 If QProt is used in settings other than DOCSIS, it would be a simple 119 matter to detect queue-building flows by using slightly different 120 conditions, and/or trigger a different action as a consequence, as 121 appropriate for the scenario, e.g. dropping instead of reclassifying 122 packets or perhaps accumulating a second per-flow score to decide 123 whether to redirect a whole flow rather than just certain packets. 125 The algorithm is based on a principled approach to quantifying how 126 much each user contributes to congestion, which is used in economics 127 to allocate responsibility for the cost of one party's behaviour on 128 others (the economic externality). Another important feature of the 129 approach is that the metric used for the queuing score is based on 130 the same variable that determines the level of ECN signalling seen by 131 the sender [RFC8311], [I-D.ietf-tsvwg-ecn-l4s-id]. This transparency 132 is necessary to be able to objectively state (in Section 3) how a 133 flow can keep on the 'good' side of the algorithm. 135 1.1. Document Roadmap 137 The core of the document is the walk-through of the DOCSIS QProt 138 algorithm's pseudocode in Section 4. 140 Prior to that, two brief sections provide a "bluffer's guide to 141 QProt" which should suffice for those who do not need the details or 142 the insights. Section 2 summarizes the approach used in the 143 algorithm. Then Section 3 considers QProt from the perspective of 144 the end-system, by defining the behaviour that a flow needs to comply 145 with to avoid the QProt algorithm ejecting its packets from the low 146 latency queue. 148 Section 5 gives deeper insight into the principles and rationale 149 behind the algorithms. Then Section 6 explains the limitations of 150 the approach, followed by the usual closing sections. 152 1.2. Terminology 154 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 155 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 156 document are to be interpreted as described in [RFC2119]. In this 157 document, these words will appear with that interpretation only when 158 in ALL CAPS. Lower case uses of these words are not to be 159 interpreted as carrying RFC-2119 significance. 161 The normative language for the DOCSIS QProt algorithm is in the 162 DOCSIS 3.1 specifications [DOCSIS3.1], [DOCSIS3.1-CM-OSS], 163 [DOCSIS3.1-CCAP-OSS] not in this informational guide. 165 The following terms and abbreviations are used: 167 CM: Cable Modem 169 CMTS: CM Termination System 171 Congestion-rate: The rate at which a flow induces ECN-marked (or 172 dropped) bytes, where an ECN-mark on a packet is defined as 173 marking all the packet's bytes. Congestion-bit-rate and 174 congestion-volume were introduced in [RFC7713] and [RFC6789]. 176 Non-queue-building: A flow that tends not to build a queue 178 Queue-building: A flow that builds a queue, and therefore is a 179 candidate for the queue protection algorithm to detect and 180 sanction 182 ECN: Explicit Congestion Notification 184 QProt: The Queue Protection function 186 2. Approach - In Brief 188 The algorithm is divided into mechanism and policy. 190 o The mechanism aspects identify flows, maintain flow-state and 191 accumulate per-flow queuing scores; 193 o The policy aspects tend to be brief, but more likely to be 194 modified in future. They can be divided into conditions and 195 actions: 197 * The conditions are the logic that determines when action should 198 be taken to avert the risk of queuing delay becoming excessive; 200 * The actions determine how this risk is averted, e.g. by 201 redirecting packets from a flow into another queue, or to 202 reclassify a whole flow that seems to be misclassified. 204 2.1. Mechanism 206 The algorithm maintains per-flow-state, where 'flow' usually means an 207 end-to-end (layer-4) 5-tuple. The flow-state consists of a queuing 208 score normalized to also represent the flow-state's own expiry time 209 (explained in Section 5.3). A higher queuing score pushes out the 210 expiry time further. 212 Non-queue-building flows tend to release their flow-state rapidly --- 213 it usually expires reasonably early in the gap between the packets of 214 a normal flow. Then the memory can be recycled for packets from 215 other flows that arrive in between. So only queue-building flows 216 hold flow state persistently. 218 The simplicity and effectiveness of the algorithm is due to the 219 definition of the queuing score. It uses the internal variable from 220 the AQM that determines the ECN marking probability, probNative, of 221 the low latency queue. In floating point arithmetic, (0 <= 222 probNative <= 1). The algorithm scales the size of each arriving 223 packet of a flow by the value of probNative. 225 The algorithm so far would accumulate a number that would rise at the 226 so-called congestion-rate of the flow, i.e. the rate at which the 227 flow is contributing to congestion, or the rate at which the AQM is 228 forwarding bytes of the flow that are ECN marked. However, rather 229 than growing continually, the queuing score is also aged at a 230 constant rate. 232 In practice, the queuing score is normalized into time units (to 233 represent the expiry time of the flow state, as above). Then it does 234 not need to be explicitly aged, because the natural passage of time 235 implicitly 'ages' an expiry time. 237 2.2. Policy 239 The algorithm uses the queuing score to determine whether to eject 240 each packet as it arrives, rather than allow it to further increase 241 queuing delay. This limits the policies available. For instance, 242 when queueing delay exceeds a threshold, it is not possible to eject 243 a packet from the flow with the highest queuing scoring, because that 244 would involve searching the queue for such a packet (if indeed there 245 was still one in the queue). Nonetheless, it is still possible to 246 develop a policy that protects the low latency of the queue. 248 Currently in DOCSIS, when the policy conditions are met, the action 249 taken to protect the low latency queue is to reclassify a packet into 250 the Classic queue. 252 3. Necessary Flow Behaviour 254 The QProt algorithm described here can be used for responsive and/or 255 unresponsive flows. 257 o It is possible to objectively describe the least responsive way 258 that a flow will need to respond to congestion signals in order to 259 avoid triggering queue protection, no matter the link capacity and 260 no matter how much other traffic there is. 262 o It is not possible to describe how fast or smooth an unresponsive 263 flow should be to avoid queue protection, because this depends on 264 how much other traffic there is and the capacity of the link, 265 which an application is unable to know. However, the smoother an 266 unresponsive flow paces its packets and the lower its rate 267 relative to typical broadband link capacities, the less likelihood 268 that it will risk causing enough queueing to trigger queue 269 protection. 271 In DOCSIS, unresponsive flows are classified into the low latency 272 queue by a Non-Queue-Building (NQB) Diffserv codepoint 273 [I-D.white-tsvwg-nqb], or an operator can use various other 274 additional local classifiers. 276 Responsive low latency flows have to use L4S ECN 277 [I-D.ietf-tsvwg-ecn-l4s-id] to get classified into the low latency 278 queue. 280 The QProt algorithm is driven from the same variable that drives the 281 ECN marking probability in the low latency queue (Annex N of 282 [DOCSIS3.1] ). The algorithm that calculates this internal variable 283 is run on the arrival of every packet, whether it is ECN-capable or 284 not, so that it can be used by the QProt algorithm. But the variable 285 is only used to ECN-mark packets that are ECN-capable. 287 Not only does this dual use of the variable improve processing 288 efficiency, but it also makes the basis of the QProt algorithm 289 visible and transparent, at least for responsive ECN-capable flows. 290 Then it is possible to state objectively that a flow can avoid 291 triggering queue protection by keeping the bit rate of ECN marked 292 packets (the congestion-rate) below AGING, which is a configured 293 constant of the algorithm (default 2^19 B/s ~= 4.2 Mb/s). Note that 294 a congestion controller is advised to keep the average congestion- 295 rate well below this level (e.g. ~1 Mb/s), to ensure that queue 296 protection is not triggered during transient dynamics. 298 If the QProt algorithm is used in other settings, it would need to be 299 based on the visible level of congestion signalling, in a similar way 300 to DOCSIS. Without transparency of the basis of the algorithm's 301 decisions, end-systems would not be able to avoid triggering queue 302 protection on an objective basis. 304 4. Pseudocode Walk-Through 306 The algorithm categorizes packets into flows, usually defined by a 307 common 5-tuple (or 4-tuple) of: 309 o source and destination IP addresses of the innermost IP header 310 found; 312 o protocol of the layer above this IP header 314 o either of: 316 * source and destination port numbers, for TCP, UDP, UDP-Lite, 317 SCTP, DCCP, etc. 319 * Security Parameters Index (SPI) for IPSec Encapsulating 320 Security Payload (ESP) [RFC4303]. 322 Annex P.3 of DOCSIS 3.1 [DOCSIS3.1] defines various strategies to 323 find these headers by skipping extension headers or encapsulations. 324 If they cannot be found the spec defines various less-specific 325 3-tuples that would be used. DOCSIS 3.1 should be referred to for 326 all these strategies, which will not be repeated here. 328 4.1. Input Parameters, Constants and Variables 330 The operator input parameters that set the parameters in the first 331 two blocks of pseudocode below are defined for cable modems (CMs) in 332 [DOCSIS3.1-CM-OSS] and for CMTSs in [DOCSIS3.1-CCAP-OSS]. The 333 constants below that are derived from them or hard-coded. 335 // Input Parameters 336 QPROTECT_ON // Queue Protection is enabled if TRUE 337 CRITICALqL_us // Threshold delay of L queue [us] 338 CRITICALqLSCORE_us // The threshold queuing score [us] 339 LG_AGING // The aging rate of the q'ing score, as 340 // log base 2 of the congestion-rate [lg(B/s)] 342 // Input Parameters for the calcProbNative() algorithm: 343 MAXTH_us // Max marking threshold [us] for IAQM 344 LG_RANGE // Log base 2 of the range of ramp [lg(ns)] 345 // Default: 2^19 = 524288 ns (roughly 525 us) 347 // Constants, either derived from input parameters or hard-coded 348 AGING = pow(2, (LG_AGING-30) ); // Convert lg([B/s]) to [B/ns] 349 CRITICALqL = CRITICALqL_us * 1000 // Convert [us] to [ns] 350 CRITICALqLSCORE = CRITICALqLSCORE_us * 1000 // Convert [us] to [ns] 351 // Threshold for the q'ing score condition 352 CRITICALqLPRODUCT = CRITICALqL * CRITICALqLSCORE 354 ATTEMPTS = 2; // Max attempts to pick a bucket (vendor-specific) 355 BI_SIZE = 5; // Bit-width of index number for non-default buckets 356 NBUCKETS = pow(2, BI_SIZE); // No. of non-default buckets 357 MASK = NBUCKETS-1; // convenient constant, filled with ones 359 // Queue Protection exit states 360 EXIT_SUCCESS = 0; // Forward the packet 361 EXIT_SANCTION = 1; // Redirect the packet 363 MAX_PROB = 1; // For integer arithmetic, would use a large int 364 // e.g., 2^31, to allow space for overflow 365 MAXTH = MAXTH_us * 1000; // Max marking threshold [ns] 366 // Minimum marking threshold of 2 MTU for slow links [ns] 367 FLOOR = 2 * 8 * MAX FRAME SIZE * 10^9 / MAX RATE; 368 RANGE = (1 << LG_RANGE); // Range of ramp [ns] 369 MINTH = max ( MAXTH - RANGE, FLOOR); 370 MAXTH = MINTH + RANGE; // Max marking threshold [ns] 372 The following definitions explain the purpose of important variables 373 and functions. 375 // Public variables: 376 qdelay // The current queuing delay of the LL queue [ns] 377 probNative // The native probability of the LL queue within [0,1] 379 // External variables 380 packet // The structure holding packet header fields 381 packet.size // The size of the current packet [B] 382 packet.uflow // The flow identifier of the current packet 383 // (e.g. 5-tuple or 4-tuple if IPSec) 385 // Irrelevant details of DOCSIS function to return qdelay are removed 386 qdelayL(...) // Returns current delay of the low latency Q [ns] 388 The array of bucket structures defined below is used by all the Queue 389 Protection functions: 391 struct bucket { // The leaky bucket structure to hold per-flow state 392 id; // identifier (e.g. 5-tuple) of the flow using bucket 393 t_exp; // expiry time; 394 // (t_exp - now) = flow's normalized q'ing score [ns] 395 }; 396 struct bucket buckets[NBUCKETS+1]; 398 4.2. Queue Protection Data Path 400 All the functions of Queue Protection operate on the data path, 401 driven by packet arrivals. 403 The following functions that maintain per-flow queuing scores and 404 manage per-flow state are considered primarily as mechanism: 406 pick_bucket(uflow_id); // Returns bucket identifier 408 fill_bucket(bucket_id, pkt_size, probNative); // Returns queuing 409 score 411 calcProbNative(qdelay) // Returns probability of ECN-marking 413 The following function is primarily concerned with policy: 415 qprotect(packet, ...); // Returns exit status to either forward or 416 redirect the packet 418 It is more likely that there might be future modifications to policy 419 aspects. Therefore, policy aspects would be less appropriate 420 candidates for any hardware acceleration. 422 The entry point to these functions is qprotect(), which would be 423 called from packet classification as follows: 425 classifier(packet) { 426 // ... 427 // Classify packet 428 // if packet classified to Low Latency Service Flow 429 if (QPROTECT_ON) { 430 if (qprotect(packet, qL.byte_length) == EXIT_SANCTION) { 431 // redirect packet to Classic Service Flow 432 } 433 } 434 // Forward packet to Low Latency Service Flow 435 // Continue... 436 } 438 On each packet arrival, qprotect() measures the current queue delay 439 and derives the native probability from it. Then it uses pick_bucket 440 to find the bucket already holding the flow's state, or to allocate a 441 new bucket if the flow is new or its state has expired (the most 442 likely case). Then the queuing score is updated by the fill_bucket() 443 function. That completes the mechanism aspects. 445 The comments against the subsequent policy conditions and actions 446 should be self-explanatory at a superficial level. The deeper 447 rationale for these conditions is given in Section 5.4. 449 // Per packet queue protection 450 qprotect(packet, ...) { 452 bckt_id; // bucket index 453 qLscore; // queuing score of pkt's flow [ns] 455 qdelay = qL.qdelay(...); 456 probNative = calcProbNative(qdelay); 458 bckt_id = pick_bucket(packet.uflow); 459 // Not shown: if (bckt_id->t_exp risks overflow) EXIT_SANCTION 460 qLscore = fill_bucket(buckets[bckt_id], packet.size, probNative); 462 // Determine whether to sanction packet 463 if ( ( qdelay > CRITICALqL ) // Test if qdelay over threshold... 464 // ...and if flow's q'ing score scaled by qdelay/CRITICALqL 465 // ...exceeds CRITICALqLSCORE 466 && ( qdelay * qLscore > CRITICALqLPRODUCT ) ) 468 return EXIT_SANCTION; 470 else 471 return EXIT_SUCCESS; 472 } 474 The pick_bucket() function is optimized for flow-state that will 475 normally have expired from packet to packet of the same flow. it is 476 just one way of finding the bucket associated with the flow ID of 477 each packet - it might be possible to develop more efficient 478 alternatives. 480 The algorithm is arranged so that the bucket holding any live (non- 481 expired) flow-state associated with a packet will always be found 482 before a new bucket is allocated. The constant ATTEMPTS, defined 483 earlier, determines how many hashes are used to find a bucket for 484 each flow (actually, only one hash is generated; then, by default, 5 485 bits of it at a time are used as the hash value, because by default 486 there are 2^5 = 32 buckets). 488 The algorithm stores the flow's own ID in its flow-state. So, when 489 the next packet of a flow arrives, if it finds its own ID, but the 490 flow-state has expired, the algorithm just adds the packet's queuing 491 score to 'now', as a new flow would, If it does not find the flow's 492 ID, and the expiry time is still current, the algorithm can tell that 493 another flow is using that bucket, and it continues to look for a 494 bucket for the flow. Even if it finds a bucket where the expiry time 495 has passed, it doesn't immediately use it. It merely remembers it as 496 the potential bucket to use. But first it runs through all the 497 ATTEMPTS hashes to check for another bucket assigned to the flow, in 498 case it is still live. 500 If a live bucket is not already associated with the packet's flow, 501 the algorithm should then have already set aside an existing bucket 502 with a score that has aged out. Given this bucket is no longer 503 necessary to hold state for its previous flow, it can be recycled for 504 use by the present packet's flow. 506 If all else fails, there is one additional bucket (called the dregs) 507 that can be used. If the dregs is still in live use by another flow, 508 subsequent flows that cannot find a bucket of their own all share it, 509 adding their score to the one in the dregs. A flow might get away 510 with using the dregs on its own, but when there are many mis-marked 511 flows, multiple flows are more likely to collide in the dregs, 512 including innocent flows. The choice of number of buckets and number 513 of hash attempts determines how likely it will be that this 514 undesirable scenario will occur. 516 // Pick the bucket associated with flow uflw 517 pick_bucket(uflw) { 519 now; // current time [ns] 520 j; // loop counter 521 h32; // holds hash of the packet's flow IDs 522 h; // bucket index being checked 523 hsav; // interim chosen bucket index 525 h32 = hash32(uflw); // 32-bit hash of flow ID 526 hsav = NBUCKETS; // Default bucket 527 now = get_time_now(); 529 // The for loop checks ATTEMPTS buckets for ownership by flow-ID 530 // It also records the 1st bucket, if any, that could be recycled 531 // because it's expired. 532 // Must not recycle a bucket until all ownership checks completed 533 for (j=0; j>= BI_SIZE; // Bit-shift hash for next attempt 547 } 548 // If reached here, no tested bucket was owned by the flow-ID 549 if (hsav != NBUCKETS) { 550 // If here, found an expired bucket within the above for loop 551 buckets[hsav].t_exp = now; // Reset expired bucket 552 } else { 553 // If here, we're having to use the default bucket (the dregs) 554 if (buckets[hsav].t_exp <= now) { // If dregs has expired... 555 buckets[hsav].t_exp = now; // ...reset it 556 } 557 } 558 buckets[hsav].id = uflw; // In either case, claim for recycling 559 return hsav; 560 } 562 The fill_bucket() function both accumulates and ages the queuing 563 score over time, as outlined in Section 2.1. To make aging the score 564 efficient, the increment of the queuing score is normalized to units 565 of time by dividing by AGING, so that the result represents the new 566 expiry time of the flow. 568 It might be thought that, instead of multiplying the packet size 569 (pkt_sz) by probNative, packets could be selected by the AQM with 570 probability probNative, as they are for ECN-marking. Then the full 571 size of those selected packets would be used to increment the queuing 572 score. However, experience with other congestion policers has found 573 that the score then increments far too jumpily, particularly when 574 probNative is low. 576 A deeper explanation of the queuing score is given in Section 5. 578 fill_bucket(bckt_id, pkt_sz, probNative) { 579 // Add packet's queuing score 580 // For integer arithmetic, a bit-shift can replace the division 581 buckets[bckt_id].t_exp += probNative * pkt_sz / AGING; 582 return (buckets[bckt_id].t_exp - now); 583 } 585 To derive this queuing score, the QProt algorithm uses the linear 586 ramp function calcProbNative() to normalize instantaneous queuing 587 delay into a probability in the range [0,1], which it assigns to 588 probNative. 590 calcProbNative(qdelay){ 591 if ( qdelay >= MAXTH ) { 592 probNative = MAX_PROB; 593 } else if ( qdelay > MINTH ) { 594 probNative = MAX_PROB * (qdelay - MINTH)/RANGE; 595 // In practice, the * and the / would use a bit-shift 596 } else { 597 probNative = 0; 598 } 599 return probNative; 600 } 602 5. Rationale 604 5.1. Rationale: Blame for Queuing, not for Rate in Itself 606 Figure 1 poses the question of which flow is more to blame for 607 queuing delay; the unresponsive constant bit rate flow (c) that is 608 consuming about 80% of the capacity, or the flow sending regular 609 short unresponsive bursts (b)? The smoothness of c seems better for 610 avoiding queuing, but its high rate does not. However, if flow c was 611 not there, or ran slightly more slowly, b would not cause any 612 queuing. 614 ^ bit-rate 615 | ,-. ,-. ,-. ,-. ,-. 616 |--|b|----------|b|----------|b|----------|b|----------|b|---Capacity 617 |__|_|__________|_|__________|_|__________|_|__________|_|_____ 618 | 619 | c 620 | 621 | 622 | 623 +----------------------------------------------------------------> 624 time 626 Figure 1: Which is More to Blame for Queuing Delay? 628 To explain queuing scores, in the following it will initially be 629 assumed that the QProt algorithm is accumulating queuing scores, but 630 not taking any action as a result. 632 To quantify the responsibility that each flow bears for queuing 633 delay, the QProt algorithm accumulates the product of the level of 634 congestion and the rate of each flow, both measured at the instant 635 each packet arrives. The level of congestion is normalized to a 636 dimensionless number between 0 and 1 (nativeProb). The instantaneous 637 flow rate is represented at each discrete event when a packet arrives 638 by the packet's size, which accumulates faster the more packets 639 arrive within each unit of time. The unit of the resulting queue 640 score is "congested-bytes" per second, which distinguishes it from 641 just bytes per second. 643 Then, during the periods between bursts (b), neither flow accumulates 644 any queuing score - the high rate of c is benign. But, during each 645 burst, if we say the rate of c and b are 80% and 45% of capacity, 646 thus causing 125% overload, they each bear (80/125)% and (45/125)% of 647 the responsibility for the queuing delay (64% and 36%). The 648 algorithm does not explicit calculate these percentages. They are 649 just the outcome of the number of packets arriving from each flow 650 during the burst. 652 To summarize, the queuing score never sanctions rate solely on its 653 own account. It only sanctions rate inasmuch as it causes queuing. 655 ^ bit-rate , 656 | ,-. |\ ,- 657 |------Capacity-|b|----------,-.----------|b|----------|b\----- 658 | __|_|_______ |b| /``\| _...-._-' \| ,.-- 659 | ,-. __/ \__|_|_ _/ |/ |/ 660 | |b| ___/ \___/ __ r 661 | |_|/ v \__/ \_______ _/\____/ 662 | _/ \__/ 663 | 664 +----------------------------------------------------------------> 665 time 667 Figure 2: Responsibility for Queuing: More Complex Scenario 669 Figure 1 gives a more complex illustration of the way the queuing 670 score assigns responsibility for queuing (limited to the precision 671 that ASCII art can illustrate). The unresponsive bursts (b) are the 672 same as in the previous example, but a variable rate video (v) 673 replaces flow c. It's rate varies as the complexity of the video 674 scene varies. Also on a slower timescale, in response to the level 675 of congestion, the video adapts its quality. However, on a short 676 time-scale it appears to be unresponsive to small amounts of queuing. 677 Also, part-way through, a low latency responsive flow (r) joins in, 678 aiming to fill the balance of capacity left by the other two. 680 The combination of the first burst and the low application-limited 681 rate of the video causes neither flow to accumulate queuing score. 682 In contrast, the second burst causes similar excessive overload 683 (125%) to the example in Figure 1. Then, the video happens to reduce 684 its rate (probably due to a less complex scene) so the third burst 685 causes only a little congestion. Let us assume the resulting queue 686 causes probNative to rise to just 1%, then the queuing score will 687 only accumulate 1% of the size of each packet of flows v and b during 688 this burst. 690 The fourth burst happens to arrive just as the new responsive flow 691 (r) has filled the available capacity, so it leads to very rapid 692 growth of the queue. After a round trip the responsive flow rapidly 693 backs off, and the adaptive video also backs off more rapidly than it 694 would normally, because of the very high congestion level. The rapid 695 response to congestion of flow r reduces the queuing score that all 696 three flows accumulate, but they each still bear the cost in 697 proportion to the product of the rates at which their packets arrive 698 at the queue and the value of probNative when they do so. Thus, 699 during the fifth burst, they all accumulate less score than the 700 fourth, because the queuing delay is not as excessive. 702 5.2. Rationale for Aging the Queuing Score 704 Even well-behaved flows will not always be able to respond fast 705 enough to dynamic events. Also well-behaved flow(s), e.g. DCTCP 706 [RFC8257], TCP Prague [PragueLinux] or the L4S variant of SCReAM for 707 real-time media [RFC8298], can maintain a very shallow queue by 708 continual careful probing for more while also continually subtracting 709 a little from their rate (or congestion window) in response to low 710 levels of ECN signalling. Therefore, the QProt algorithm needs to 711 continually offer a degree of forgiveness to age out the queuing 712 score as it accumulates. 714 Scalable congestion controllers such as those above maintain their 715 congestion window in inverse proportion to the congestion level, 716 probNative, That leads to the important property that on average a 717 scalable flow holds the product of its congestion window and the 718 congestion level constant, no matter the capacity of the link or how 719 many other flows it competes with. For instance, if the link 720 capacity doubles, a scalable flow induces half the congestion 721 probability. Or if three scalable flows compete for the capacity, 722 each flow will reduce to one third of the capacity and increase the 723 congestion level by 3x. 725 This suggests that the QProt algorithm will not sanction a well- 726 behaved scalable flow if it ages out the queuing score at a 727 sufficient constant rate. The constant will need to be somewhat 728 about the average of a well-behaved scalable flow to allow for normal 729 dynamics. 731 Relating QProt's aging constant to a scalable flow does not mean that 732 a flow has to behave like a scalable flow. It can be less 733 aggressive, but not more. For instance, a longer RTT flow can run at 734 a lower congestion-rate than the aging rate, but it can also increase 735 its aggressiveness to equal the rate of short RTT scalable flows 736 [ScalingCC]. The constant aging of QProt also means that a long- 737 running unresponsive flow will be prone to trigger QProt if it runs 738 faster than a competing responsive scalable flow would. And, of 739 course, if a flow causes excessive queuing in the short-term, its 740 queuing score will still rise faster than the constant aging process 741 will decrease it. Then QProt will still eject the flow's packets 742 before they harm the low latency of the shared queue. 744 5.3. Rationale for Normalized Queuing Score 746 The QProt algorithm holds a flow's queuing score state in a structure 747 called a bucket, because of its similarity to a classic leaky bucket 748 (except the contents of the bucket does not represent bytes). 750 probNative * pkt_sz probNative * pkt_sz / AGING 751 | | 752 | V | | V | 753 | : | ___ | : | 754 |_____| ___ |_____| 755 | | ___ | | 756 |__ __| |__ __| 757 | | 758 V V 759 AGING * Dt Dt 761 Figure 3: Normalization of Queuing Score 763 The accumulation and aging of the queuing score is shown on the left 764 of Figure 3 in token bucket form. Dt is the difference between the 765 times when the scores of the current and previous packets were 766 processed. 768 A normalized equivalent of this token bucket is shown on the right of 769 Figure 3, dividing both the input and output by the constant AGING 770 rate. The result is a bucket-depth that represents time and it 771 drains at the rate that time passes. 773 As a further optimization, the time the bucket was last updated is 774 not stored with the flow-state. Instead, when the bucket is 775 initialized the queuing score is added to the system time 'now' and 776 the resulting expiry time is written into the bucket. Subsequently, 777 if the bucket has not expired, the incremental queuing score is added 778 to the time already held in the bucket. Then the queuing score 779 always represents the expiry time of the flow-state itself. This 780 means that the queuing score does not need to be aged explicitly 781 because it ages itself implicitly. 783 5.4. Rationale for Policy Conditions 785 Pseudocode for the QProt policy conditions is given in Section 4.1 786 within the second half of the qprotect() function. When each packet 787 arrives, after finding its flow state and updating the queuing score 788 of the packet's flow, the algorithm checks whether the shared queue 789 delay exceeds a constant threshold CRITICALqL (e.g. 2 ms), as 790 repeated below for convenience: 792 if ( ( qdelay > CRITICALqL ) // Test if qdelay over threshold... 793 // ...and if flow's q'ing score scaled by qdelay/CRITICALqL 794 // ...exceeds CRITICALqLSCORE 795 && ( qdelay * qLscore > CRITICALqLPRODUCT ) ) 796 // Recall that CRITICALqLPRODUCT = CRITICALqL * CRITICALqLSCORE 798 If the queue delay threshold is exceeded, the flow's queuing score is 799 temporarily scaled up by the current queue delay normalized as a 800 ratio of the threshold queuing delay CRITICALqL. If this scaled up 801 score exceeds another constant threshold CRITICALqLSCORE, the packet 802 is ejected. The actual last line of code above multiplies both sides 803 of the second condition by CRITICALqLSCORE to avoid a costly 804 division. 806 This approach allows each packet to be assessed once, as it arrives. 807 Once queue delay exceeds the threshold, it has two implications: 809 o The current packet might be ejected even though there are packets 810 already in the queue from flows with higher queuing scores. 811 However, any flow that continues to contribute to the queue will 812 have to send further packets, giving an opportunity to eject them 813 as well, as they subsequently arrive. 815 o The next packets to arrive might not be ejected, because they 816 might belong to flows with low queuing scores. In this case, 817 queue delay could continue to rise with no opportunity to eject a 818 packet. This is why the queuing score is scaled up by the current 819 queue delay. Then, the more the queue has grown without ejecting 820 a packet, the more the algorithm 'raises the bar' to further 821 packets. 823 The above approach is preferred over searching for the flow with the 824 highest queuing score and searching for one of its packets to eject 825 from the queue (if one is still there). 827 Figure 4 explains this approach graphically. On the horizontal axis 828 it shows actual harm, meaning the queuing delay in the shared queue. 829 On the vertical axis it shows the behaviour record of the flow 830 associated with the currently arriving packet, represented in the 831 algorithm by the flow's queuing score. The shaded region represents 832 the combination of actual harm and behaviour record that will lead to 833 the packet being ejected. 835 Behaviour Record: 836 Queueing Score of 837 Arriving Packet's Flow 838 ^ 839 | + |/ / / / / / / / / / / / / / / / / / / 840 | + N | / / / / / / / / / / / / / / / / / / / 841 | + |/ / / / / / / / / / 842 | + | / / / / E (Eject packet) / / / / / 843 | + |/ / / / / / / / / / 844 | + | / / / / / / / / / / / / / / / / / / / 845 | + |/ / / / / / / / / / / / / / / / / / / 846 | +| / / / / / / / / / / / / / / / / / / / 847 | |+ / / / / / / / / / / / / / / / / / / 848 | N | + / / / / / / / / / / / / / / / / / 849 | (No actual | +/ / / / / / / / / / / / / / / 850 | harm) | + / / / / / / / / / / / / 851 | | P (Pass over) + ,/ / / / / / / / 852 | | ^ + /./ /_/ 853 +--------------+------------------------------------------> 854 CRITICALqL Actual Harm: Shared Queue Delay 856 Figure 4: Graphical Explanation of the Policy Conditions 858 The region labelled 'N' represents cases where the first condition is 859 not met - No actual harm - queue delay is below the critical 860 threshold, CRITICALqL. 862 The region labelled 'E' represents cases where there is actual harm 863 (queue delay exceeds CRITICALqL) and the queuing score associated 864 with the arriving packet is high enough to be able to eject it with 865 certainty. 867 The region labelled 'P' represents cases where there is actual harm, 868 but the queuing score of the arriving packet is insufficient to eject 869 it, so it has to be Passed over. This adds to queuing delay, but the 870 alternative would be to sanction an innocent flow. It can be seen 871 that, as actual harm increases, the judgement of innocence becomes 872 increasingly stringent; the behaviour record of the next packet's 873 flow does not have to be as bad to eject it. 875 6. Limitations 877 The QProt algorithm groups packets with common layer-4 flow 878 identifiers. It then uses this grouping to accumulate queuing scores 879 and to sanction packets. 881 Some applications might initiate multiple flows between the same end- 882 points, e.g. for media, control, data, etc. Others might use common 883 flow identifiers for all these streams. Also, a user might group 884 multiple application flows within the same encrypted VPN between the 885 same layer-4 tunnel end-points. 887 The use of a queuing score that excludes those aspects of flow rate 888 that do not contribute to queuing Section 5.1 goes some way to 889 mitigating this limitation. However, ultimately this choice of flow 890 identifiers is pragmatic and not particularly principled. 892 7. IANA Considerations 894 This specification contains no IANA considerations. 896 8. Security Considerations 898 The whole of this document considers the security concern of how to 899 identify traffic that does not comply with the non-queue-building 900 behaviour required to use a shared low latency queue, whether 901 accidentally or maliciously. 903 The algorithm has been designed to be fail gracefully in the face of 904 traffic crafted to overrun the resources of the algorithm. This 905 means that non-queue-building flows will always be less likely to be 906 sanctioned than queue-building flows. But an attack could be 907 contrived to deplete resources in such as way that the proportion of 908 innocent (non-queue-building) flows that are incorrectly sanctioned 909 could increase. 911 9. Comments Solicited 913 Comments and questions are encouraged and very welcome. They can be 914 addressed to the IETF Transport Area mailing list , and/or to the authors. 917 10. Acknowledgements 919 Thanks to Tom Henderson for his review of this document. The design 920 of the QProt algorithm and the settings of the parameters benefited 921 from discussion and critique from the participants of the cable 922 industry working group on Low Latency DOCSIS. 924 11. References 926 11.1. Normative References 928 [DOCSIS3.1] 929 CableLabs, "MAC and Upper Layer Protocols Interface 930 (MULPI) Specification, CM-SP-MULPIv3.1", Data-Over-Cable 931 Service Interface Specifications DOCSIS(R) 3.1 Version i17 932 or later, January 2019, . 935 [DOCSIS3.1-CCAP-OSS] 936 CableLabs, "CCAP Operations Support System Interface 937 Spec", Data-Over-Cable Service Interface Specifications 938 DOCSIS(R) 3.1 Version i14 or later, January 2019, 939 . 942 [DOCSIS3.1-CM-OSS] 943 CableLabs, "Cable Modem Operations Support System 944 Interface Spec", Data-Over-Cable Service Interface 945 Specifications DOCSIS(R) 3.1 Version i14 or later, January 946 2019, . 949 [I-D.ietf-tsvwg-ecn-l4s-id] 950 Schepper, K. and B. Briscoe, "Identifying Modified 951 Explicit Congestion Notification (ECN) Semantics for 952 Ultra-Low Queuing Delay (L4S)", draft-ietf-tsvwg-ecn-l4s- 953 id-06 (work in progress), March 2019. 955 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 956 Requirement Levels", BCP 14, RFC 2119, 957 DOI 10.17487/RFC2119, March 1997, 958 . 960 [RFC8311] Black, D., "Relaxing Restrictions on Explicit Congestion 961 Notification (ECN) Experimentation", RFC 8311, 962 DOI 10.17487/RFC8311, January 2018, 963 . 965 11.2. Informative References 967 [I-D.ietf-tsvwg-aqm-dualq-coupled] 968 Schepper, K., Briscoe, B., and G. White, "DualQ Coupled 969 AQMs for Low Latency, Low Loss and Scalable Throughput 970 (L4S)", draft-ietf-tsvwg-aqm-dualq-coupled-09 (work in 971 progress), July 2019. 973 [I-D.white-tsvwg-lld] 974 White, G., Sundaresan, K., and B. Briscoe, "Low Latency 975 DOCSIS - Technology Overview", draft-white-tsvwg-lld-00 976 (work in progress), March 2019. 978 [I-D.white-tsvwg-nqb] 979 White, G. and T. Fossati, "Identifying and Handling Non 980 Queue Building Flows in a Bottleneck Link", draft-white- 981 tsvwg-nqb-02 (work in progress), June 2019. 983 [PragueLinux] 984 Briscoe, B., De Schepper, K., Albisser, O., Misund, J., 985 Tilmans, O., Kuehlewind, M., and A. Ahmed, "Implementing 986 the `TCP Prague' Requirements for Low Latency Low Loss 987 Scalable Throughput (L4S)", Proc. Linux Netdev 0x13 , 988 March 2019, . 991 [RFC4303] Kent, S., "IP Encapsulating Security Payload (ESP)", 992 RFC 4303, DOI 10.17487/RFC4303, December 2005, 993 . 995 [RFC6789] Briscoe, B., Ed., Woundy, R., Ed., and A. Cooper, Ed., 996 "Congestion Exposure (ConEx) Concepts and Use Cases", 997 RFC 6789, DOI 10.17487/RFC6789, December 2012, 998 . 1000 [RFC7713] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) 1001 Concepts, Abstract Mechanism, and Requirements", RFC 7713, 1002 DOI 10.17487/RFC7713, December 2015, 1003 . 1005 [RFC8257] Bensley, S., Thaler, D., Balasubramanian, P., Eggert, L., 1006 and G. Judd, "Data Center TCP (DCTCP): TCP Congestion 1007 Control for Data Centers", RFC 8257, DOI 10.17487/RFC8257, 1008 October 2017, . 1010 [RFC8298] Johansson, I. and Z. Sarker, "Self-Clocked Rate Adaptation 1011 for Multimedia", RFC 8298, DOI 10.17487/RFC8298, December 1012 2017, . 1014 [ScalingCC] 1015 Briscoe, B. and K. De Schepper, "Resolving Tensions 1016 between Congestion Control Scaling Requirements", Simula 1017 Technical Report TR-CS-2016-001 arXiv:1904.07605, July 1018 2017, . 1020 Authors' Addresses 1022 Bob Briscoe (editor) 1023 CableLabs 1024 UK 1026 Email: ietf@bobbriscoe.net 1027 URI: http://bobbriscoe.net/ 1029 Greg White 1030 CableLabs 1031 US 1033 Email: G.White@CableLabs.com