Network Working Group                                           A. Clemm
Internet-Draft                                              J. Strassner
Intended status: Standards Track                               Futurewei
Expires: April 23, 2022                                      J. Francois
                                                                   Inria
                                                        October 20, 2021

                     High-Precision Service Metrics
                     draft-csfx-ippm-hipmetrics-00

Abstract

   This document defines a set of metrics for high-precision networking
   services.
   These metrics can be used to assess the service levels
   that are being delivered for a networking flow. Specifically, they
   can be used to determine the degree of compliance with which service
   levels are being delivered relative to service level objectives that
   were defined for the flow. The metrics can be used as part of flow
   records and/or accounting records. They can also be used to
   continuously monitor the quality with which high-precision networking
   services are being delivered.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 23, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Key Words
   3.  Definitions and Acronyms
   4.  Metrics
   5.  Discussion Items
   6.  IANA Considerations
   7.  Security Considerations
   8.  Normative References
   Authors' Addresses

1.  Introduction

   Many networking applications increasingly rely on high-precision
   networking services that have clearly defined service level
   objectives (SLOs), for example with regard to end-to-end latency.
   Applications requiring such services include industrial networks
   (for example, cloud-based industrial controllers for precision
   machinery), vehicular applications (for example, tele-driving in
   which a vehicle is remotely controlled by a human operator), and
   Augmented Reality / Virtual Reality (AR/VR) applications involving
   remote rendering of point clouds. Many of those applications are not
   tolerant of degrading service levels. A slight miss in SLOs does not
   merely result in a slight deterioration of the Quality of Experience
   to end users, but may render the application inoperable. At the same
   time, many of those applications are mission critical, so that
   sudden failures can jeopardize safety or have other adverse
   consequences. Nevertheless, those applications clearly represent a
   significant business opportunity that demands dependable technical
   solutions.

   Because of this, efforts such as Deterministic Networking (DetNet)
   [RFC8655] are attempting to create solutions in which clear bounds on
   parameters such as end-to-end latency and jitter can be defined in
   order to make the delivered service levels predictable and,
   ideally, deterministic.
   However, one area that has not kept pace concerns metrics that can
   account for the service levels with which services are delivered,
   specifically the degree of precision with which agreed-upon service
   level objectives are met. Such metrics, and the instrumentation to
   support them, are important for a number of purposes, including
   monitoring (to ensure that networking services are performing
   according to their objectives) as well as accounting (to maintain a
   record of service levels actually delivered, important for
   monetization of such services as well as for triaging of problems).

   The current state of the art of such metrics includes (for example)
   interface metrics, which are useful to obtain data on traffic volume
   and behavior that can be observed at an interface [RFC2863] [RFC8343]
   but are agnostic of actual end-to-end service levels and not specific
   to distinct flows. Flow records [RFC7011] [RFC7012] maintain
   statistics about flows, including flow volume and flow duration, but
   again contain very little information about end-to-end service
   levels, let alone whether the service levels delivered meet their
   targets, i.e., their associated SLOs.

   This specification introduces a new set of metrics aimed at capturing
   end-to-end service levels for a flow, specifically the degree to
   which flows comply with the SLOs that are in effect.

   It should be noted that at this point, the set of metrics proposed
   here is a "starter set" that is intended to spark further
   discussion. Other metrics are certainly conceivable; we expect that
   the list of metrics will evolve over time as part of Working Group
   discussions.

2.  Key Words

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

3.  Definitions and Acronyms

   MTBF: Mean Time Between Failures

   SL: Service Level

   SLA: Service Level Agreement

   SLO: Service Level Objective

4.  Metrics

   This section proposes a set of accounting metrics that focus on
   end-to-end latency objectives. They indicate whether any violations
   of end-to-end latency objectives occurred at the packet level. These
   metrics are intended to be applied on a per-flow basis and to assess
   the degree to which a flow's end-to-end service levels comply with
   the SLO in effect for that flow.

   While the focus in this document is on end-to-end latency
   objectives, analogous metrics could also be defined for other
   end-to-end service level parameters, such as loss (which is distinct
   from loss occurring at any one given interface) or delay variation.

   o  Violated Packets. This indicates the number of packets for which
      a violation of a latency SLO occurred.

   o  Violated Time Units (e.g., violated seconds, violated
      milliseconds). This indicates the number of time units during
      which one or more violations of SLOs were observed, regardless of
      how many violations took place during the same interval. This
      measure is useful in scenarios where bursts of violations might
      suddenly occur (e.g., due to temporary network congestion or
      during route convergence) and the count of violated packets by
      itself might paint a misleading picture.

   The following additional set of metrics may be useful in certain
   scenarios as well.
   However, their precise definition may be subject to policy, and
   further discussion is needed:

   o  Significantly Violated Packets. This indicates the number of
      packets for which a "significant" violation occurred, where
      "significant" implies a violation that was not merely a near miss
      but that missed the objective by a margin deemed especially
      significant.

   o  Significantly Violated Time Units (e.g., significantly violated
      seconds, significantly violated milliseconds). This indicates the
      number of time units during which any significant violation
      occurred.

   o  Severely Violated Time Units (e.g., severely violated seconds,
      severely violated milliseconds). "Severe" here refers to the
      occurrence of multiple violations within the same time unit. The
      definition of "severe" may be subject to policy; it may also take
      into account the significance of the violations that occur.

   Note that there is no definition of Severely Violated Packets. The
   term "severe" is used in conjunction with the occurrence of multiple
   violations related to multiple packets, not any one packet in
   isolation.

   From these first-order metrics, second-order metrics can be defined
   that build on the first set of metrics. Some of these metrics are
   modeled after Mean Time Between Failures (MTBF) metrics, where a
   "failure" in this context refers to a failure to deliver a packet
   according to its SLO.

   o  Time since last violated time unit (e.g., since last violated ms,
      since last violated second). (This parameter is particularly
      useful for monitoring current health.)

   o  Packets since last violated packet. (This parameter is
      particularly useful for monitoring current health.)

   o  Mean time between violated time units (e.g., between violated
      milliseconds, between violated seconds). This refers to the
      arithmetic mean of the time between consecutive violated time
      units.

   o  Mean packets between violations. This refers to the arithmetic
      mean of the number of SLO-compliant packets between SLO
      violations. (Another variation of "MTBF" in a service setting.)

   The same set of metrics can also be applied to significant
   violations and to severe violations:

   o  Time since last significantly violated time unit (e.g., since last
      significantly violated ms, since last significantly violated
      second).

   o  Time since last severely violated time unit (e.g., since last
      severely violated ms, since last severely violated second).

   o  Packets since last significantly violated packet.

   o  Mean time between significantly violated time units (e.g., between
      significantly violated milliseconds, between significantly
      violated seconds).

   o  Mean time between severely violated time units (e.g., between
      severely violated milliseconds, between severely violated
      seconds).

   o  Mean packets between significant violations. This refers to the
      arithmetic mean of the number of SLO-compliant packets between
      significant SLO violations.

   The next set of metrics relates violations to non-violations. It is
   intended to provide a measure analogous to availability, which is
   typically derived from the number of time units during which a
   system (or service) is unavailable divided by the total number of
   time units. By analogy, a time unit that is "violated" can be viewed
   as one in which a service is not available with the advertised
   precision:

   o  Precision availability (of milliseconds, of seconds): the ratio of
      violated time units (seconds, milliseconds) to the total number of
      time units for the duration of the service.

   o  Analogous precision availability metrics for severely violated
      time units and for significantly violated time units.
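
   As a non-normative illustration, the first-order metrics, the
   second-order metrics, and the ratio of violated to total time units
   described above could be derived from per-packet latency samples
   roughly as sketched below. The sketch is in Python. The names
   FlowMetrics and compute_flow_metrics, the (arrival_ms, latency_ms)
   sample format, and the choices to measure "time since" at the end of
   the observation window and "time between" from the start of one
   violated time unit to the start of the next are assumptions made for
   illustration only; none of them is defined by this document.
   Significant and severe violations are omitted because their
   definition is subject to policy.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class FlowMetrics:
    """Hypothetical container for the per-flow metrics (illustration only)."""
    violated_packets: int
    violated_time_units: int
    packets_since_last_violated_packet: Optional[int]
    time_units_since_last_violated_unit: Optional[int]
    mean_packets_between_violations: Optional[float]
    mean_ms_between_violated_units: Optional[float]
    violated_unit_ratio: float


def _mean(values):
    """Arithmetic mean, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def compute_flow_metrics(
    samples: Sequence[Tuple[int, float]],  # (arrival_ms since flow start, latency_ms)
    slo_ms: float,                         # latency SLO in effect for this flow
    unit_ms: int,                          # length of one time unit (1000 = seconds)
    duration_ms: int,                      # observation window from flow start (> 0)
) -> FlowMetrics:
    total_units = math.ceil(duration_ms / unit_ms)

    # Positions (in arrival order) of packets whose latency missed the SLO.
    violated_idx = [i for i, (_, lat) in enumerate(samples) if lat > slo_ms]

    # Time units containing at least one violation; several violations in
    # the same time unit are counted once.
    violated_units = sorted({samples[i][0] // unit_ms for i in violated_idx})

    # SLO-compliant packets between consecutive violated packets.
    packet_gaps = [b - a - 1 for a, b in zip(violated_idx, violated_idx[1:])]

    # Assumption: "time between" runs from the start of one violated time
    # unit to the start of the next violated time unit.
    unit_gaps_ms = [(b - a) * unit_ms
                    for a, b in zip(violated_units, violated_units[1:])]

    return FlowMetrics(
        violated_packets=len(violated_idx),
        violated_time_units=len(violated_units),
        # Assumption: both "since last" values are taken at the end of the
        # sample list / observation window; None means no violation seen.
        packets_since_last_violated_packet=(
            len(samples) - 1 - violated_idx[-1] if violated_idx else None),
        time_units_since_last_violated_unit=(
            total_units - 1 - violated_units[-1] if violated_units else None),
        mean_packets_between_violations=_mean(packet_gaps),
        mean_ms_between_violated_units=_mean(unit_gaps_ms),
        violated_unit_ratio=len(violated_units) / total_units,
    )


# Made-up samples: a 20 ms SLO, one-second time units, a 5-second window.
samples = [(100, 12), (400, 25), (450, 31), (1200, 10),
           (2300, 22), (2900, 15), (4100, 9)]
print(compute_flow_metrics(samples, slo_ms=20, unit_ms=1000, duration_ms=5000))
```

   In the made-up samples, the violations at 400 ms and 450 ms fall
   into the same one-second time unit and are therefore counted as a
   single violated second, which is the effect that the Violated Time
   Units metric is meant to capture.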

   It should be noted that certain Service Level Agreements may be
   statistical in nature, requiring the service levels of packets in a
   flow to adhere to certain distributions. For example, an SLA might
   state that any given SLO applies only to a certain percentage of
   packets, allowing for a certain number of violations to take place.
   A "violated packet" in that case does not necessarily constitute an
   SLO violation. However, it is still useful to maintain those
   statistics, as the number of violated packets still matters when
   looked at in proportion to the total number of packets.

   In that vein, an SLA might establish an SLO under which end-to-end
   latency is not to exceed 20 ms for 99% of packets, is not to exceed
   25 ms for 99.999% of packets, and is never to exceed 30 ms for any
   packet. In that case, any individual packet missing the 20 ms latency
   target cannot be considered an SLO violation in itself; instead,
   compliance with the SLO may need to be assessed after the fact.

   To support statistical SLAs more directly, it is feasible to support
   additional metrics, such as metrics that represent histograms for
   service level parameters with buckets corresponding to individual
   service level objectives. For the example just given, a histogram
   for a given flow could be maintained with four buckets: one
   containing the count of packets within 20 ms, a second with a count
   of packets between 20 and 25 ms (or simply all packets within 25 ms),
   a third with a count of packets between 25 and 30 ms (or simply all
   packets within 30 ms), and a fourth with a count of anything beyond
   (or simply a total count). Of course, the number of buckets and the
   boundaries between those buckets should correspond to the needs of
   the application and its SLA, i.e., to the specific guarantees and
   SLOs that were provided. The definition of histogram metrics is for
   further study.

5.  Discussion Items

   The following is a list of items for which further discussion is
   needed as to whether they should be included in the scope of this
   specification:

   o  A YANG data model

   o  A set of IPFIX Information Elements

   o  Statistical metrics, e.g., histograms/buckets

   o  Policies regarding the definition of "significant" and "severe"
      violations

   o  Additional second-order metrics, such as "longest disruption of
      service time" (measuring consecutive time units with violations)

6.  IANA Considerations

   TBD

7.  Security Considerations

   Instrumentation for metrics that are used to assess compliance with
   SLOs constitutes an attractive target for an attacker. By interfering
   with the maintenance of such metrics, services could be falsely
   identified as being in compliance (when they are not) or, vice
   versa, flagged as being non-compliant (when in fact they are
   compliant). While this document does not specify how networks should
   be instrumented to maintain the identified metrics, such
   instrumentation needs to be properly secured to ensure accurate
   measurements and prevent tampering with the metrics being kept.

   Where metrics are being defined relative to an SLO, the configuration
   of those SLOs needs to be properly secured. Likewise, where SLOs can
   be adjusted, it needs to be clear which particular SLO any given
   metrics instance refers to. The same service levels that constitute
   SLO violations for one flow, and that should be maintained as part of
   the "violated time units", "violated packets", and related metrics,
   may be perfectly compliant for another flow. Where it is not
   possible to properly tie together SLOs and violation metrics, it will
   be preferable to merely maintain statistics about service levels that
   were delivered (for example, overall histograms of end-to-end
   latency), without assessing which of these constitute violations.

   By the same token, where the definition of what constitutes a
   "severe" violation or a "significant" violation depends on policy or
   context, the configuration of such policy or context needs to be
   specially secured, and the configuration of this policy needs to be
   bound to the metrics being maintained. This way it will be clear
   which policy was in effect when those metrics were being assessed.
   An attacker that is able to tamper with such policies will render the
   corresponding metrics useless (in the best case) or misleading (in
   the worst case).

8.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC2863]  McCloghrie, K. and F. Kastenholz, "The Interfaces Group
              MIB", RFC 2863, DOI 10.17487/RFC2863, June 2000.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013.

   [RFC7012]  Claise, B., Ed. and B. Trammell, Ed., "Information Model
              for IP Flow Information Export (IPFIX)", RFC 7012,
              DOI 10.17487/RFC7012, September 2013.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
              RFC 7950, DOI 10.17487/RFC7950, August 2016.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8343]  Bjorklund, M., "A YANG Data Model for Interface
              Management", RFC 8343, DOI 10.17487/RFC8343, March 2018.

   [RFC8655]  Finn, N., Thubert, P., Varga, B., and J. Farkas,
              "Deterministic Networking Architecture", RFC 8655,
              DOI 10.17487/RFC8655, October 2019.

Authors' Addresses

   Alexander Clemm
   Futurewei
   2330 Central Expressway
   Santa Clara CA 95050
   USA

   Email: ludwig@clemm.org

   John Strassner
   Futurewei
   2330 Central Expressway
   Santa Clara CA 95050
   USA

   Email: strazpdj@gmail.com

   Jerome Francois
   Inria
   615 Rue du Jardin Botanique
   Villers-les-Nancy 54600
   France

   Email: jerome.francois@inria.fr