idnits 2.17.1 

draft-csfx-ippm-hipmetrics-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (October 20, 2021) is 918 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Unused Reference: 'RFC7950' is defined on line 362, but no explicit
     reference was found in the text


     Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------


2	Network Working Group                                           A. Clemm
3	Internet-Draft                                              J. Strassner
4	Intended status: Standards Track                               Futurewei
5	Expires: April 23, 2022                                      J. Francois
6	                                                                   Inria
7	                                                        October 20, 2021

9	                     High-Precision Service Metrics
10	                     draft-csfx-ippm-hipmetrics-00

12	Abstract

14	   This document defines a set of metrics for high-precision networking
15	   services.  These metrics can be used to assess the service levels
16	   that are being delivered for a networking flow.  Specifically, they
17	   can be used to determine the degree of compliance with which service
18	   levels are being delivered relative to service level objectives that
19	   were defined for the flow.  The metrics can be used as part of flow
20	   records and/or accounting records.  They can also be used to
21	   continuously monitor the quality with which high-precision networking
22	   services are being delivered.

24	Status of This Memo

26	   This Internet-Draft is submitted in full conformance with the
27	   provisions of BCP 78 and BCP 79.

29	   Internet-Drafts are working documents of the Internet Engineering
30	   Task Force (IETF).  Note that other groups may also distribute
31	   working documents as Internet-Drafts.  The list of current Internet-
32	   Drafts is at https://datatracker.ietf.org/drafts/current/.

34	   Internet-Drafts are draft documents valid for a maximum of six months
35	   and may be updated, replaced, or obsoleted by other documents at any
36	   time.  It is inappropriate to use Internet-Drafts as reference
37	   material or to cite them other than as "work in progress."

39	   This Internet-Draft will expire on April 23, 2022.

41	Copyright Notice

43	   Copyright (c) 2021 IETF Trust and the persons identified as the
44	   document authors.  All rights reserved.

46	   This document is subject to BCP 78 and the IETF Trust's Legal
47	   Provisions Relating to IETF Documents
48	   (https://trustee.ietf.org/license-info) in effect on the date of
49	   publication of this document.  Please review these documents
50	   carefully, as they describe your rights and restrictions with respect
51	   to this document.  Code Components extracted from this document must
52	   include Simplified BSD License text as described in Section 4.e of
53	   the Trust Legal Provisions and are provided without warranty as
54	   described in the Simplified BSD License.

56	Table of Contents

58	   1.  Introduction
59	   2.  Key Words
60	   3.  Definitions and Acronyms
61	   4.  Metrics
62	   5.  Discussion Items
63	   6.  IANA Considerations
64	   7.  Security Considerations
65	   8.  Normative References
66	   Authors' Addresses

68	1.  Introduction

70	   Many networking applications increasingly rely on high-precision
71	   networking services that have clearly defined service level
72	   objectives (SLOs), for example with regard to end-to-end latency.
73	   Applications requiring such services include industrial networks, for
74	   example cloud-based industrial controllers for precision machinery,
75	   vehicular applications, for example tele-driving in which a vehicle
76	   is remotely controlled by a human operator, or Augmented Reality /
77	   Virtual Reality (AR/VR) applications involving rendering of point
78	   clouds remotely.  Many of those applications are not tolerant of
79	   degrading service levels.  A slight miss in SLOs does not merely
80	   result in a slight deterioration of the Quality of Experience to end
81	   users, but may render the application inoperable.  At the same time,
82	   many of those applications are mission critical, so that sudden
83	   failures can jeopardize safety or have other adverse consequences.
84	   Nevertheless, those applications represent a significant business
85	   opportunity that demands dependable technical solutions.

87	   Because of this, efforts such as Deterministic Networking (DetNet)
88	   [RFC8655] are attempting to create solutions in which clear bounds on
89	   parameters such as end-to-end latency and jitter can be defined in
90	   order to make service levels being delivered predictable and,
91	   ideally, deterministic.  However, one area that has not kept pace
92	   concerns metrics that can account for service levels with which
93	   services are delivered, specifically the degree of precision for
94	   agreed-upon service level objectives.  Such metrics, and the
95	   instrumentation to support them, are important for a number of
96	   purposes, including monitoring (to ensure that networking services
97	   are performing according to their objectives) as well as accounting
98	   (to maintain a record of service levels actually delivered, important
99	   for monetization of such services as well as for triaging of
100	   problems).

102	   The current state-of-the-art of such metrics includes (for example)
103	   interface metrics, useful to obtain data on traffic volume and
104	   behavior that can be observed at an interface [RFC2863] [RFC8343] but
105	   agnostic of actual end-to-end service levels and not specific to
106	   distinct flows.  Flow records [RFC7011] [RFC7012] maintain statistics
107	   about flows, including flow volume and flow duration, but again
108	   contain very little information about end-to-end service levels, let
109	   alone whether the service levels delivered meet their targets, i.e.
110	   their associated SLOs.

112	   This specification introduces a new set of metrics aimed at capturing
113	   end-to-end service levels for a flow, specifically the degree to
114	   which flows comply with the SLOs that are in effect.

116	   It should be noted that at this point, the set of metrics proposed
117	   here is a "starter set" that is intended to spark further
118	   discussion.  Other metrics are certainly conceivable; we expect that
119	   the list of metrics will evolve over time as part of Working Group
120	   discussions.

122	2.  Key Words

124	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
125	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
126	   "OPTIONAL" in this document are to be interpreted as described in BCP
127	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
128	   capitals, as shown here.

130	3.  Definitions and Acronyms

132	      MTBF: Mean Time Between Failures

134	      SL: Service Level

136	      SLA: Service Level Agreement

138	      SLO: Service Level Objective

140	4.  Metrics

142	   This section proposes a set of accounting metrics that focus on
143	   end-to-end latency objectives.  They indicate whether any violations
144	   of end-to-end latency occurred at the packet level.  These metrics
145	   are intended to be applied on a per-flow basis and are intended to
146	   assess the degree to which a flow's end-to-end service levels comply
147	   with the SLO in effect for that flow.

149	   While the focus in this document concerns end-to-end latency
150	   objectives, analogous metrics could also be defined for other end-to-
151	   end service level parameters, such as loss (which is distinct from
152	   loss occurring at any one given interface) or delay variation.

154	   o  Violated Packets.  This indicates the number of packets for which
155	      a violation of a latency SLO occurred.

157	   o  Violated Time Units (e.g. violated seconds, violated
158	      milliseconds).  This indicates the number of time units during
159	      which one or more violations of SLOs were observed, regardless of
160	      how many violations took place during the same interval.  This
161	      measure is useful in scenarios where bursts of violations might
162	      suddenly occur (e.g. due to temporary network congestion, during
163	      route convergence etc.) and the count of violated packets by
164	      itself might paint a misleading picture.
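
   As an illustration only, the two first-order metrics above could be
   derived from per-packet latency samples as sketched below.  The
   sample format (arrival time in seconds, measured latency in
   milliseconds), the function name, and the default time unit of one
   second are assumptions made for this sketch; none of them is defined
   by this document.

```python
from typing import Iterable, Set, Tuple

# Hypothetical sample format: (arrival time in s, latency in ms).
Sample = Tuple[float, float]


def first_order_metrics(samples: Iterable[Sample],
                        slo_latency_ms: float,
                        time_unit_s: float = 1.0) -> Tuple[int, int]:
    """Return (violated packets, violated time units) for one flow."""
    violated_packets = 0
    violated_units: Set[int] = set()
    for arrival_s, latency_ms in samples:
        if latency_ms > slo_latency_ms:
            violated_packets += 1
            # Record the time unit in which the violation was observed;
            # several violations in one unit are counted only once.
            violated_units.add(int(arrival_s // time_unit_s))
    return violated_packets, len(violated_units)


# A burst of three late packets within one second, plus one late packet
# three seconds later, measured against a 20 ms latency SLO.
samples = [(0.10, 5.0), (1.20, 25.0), (1.30, 27.0), (1.90, 30.0),
           (2.50, 6.0), (4.05, 21.0)]
print(first_order_metrics(samples, slo_latency_ms=20.0))  # (4, 2)
```

   In this example the burst contributes three violated packets but
   only one violated second, which is the distinction that the Violated
   Time Units metric is meant to capture.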

166	   The following additional set of metrics may be useful in certain
167	   scenarios as well.  However, their precise definition may be subject
168	   to policy and further discussion is needed:

170	   o  Significantly Violated Packets.  This indicates the number of
171	      packets for which a "significant" violation occurred, where
172	      "significant" implies a violation that was not merely a near-miss
173	      but that missed the objective by a margin deemed especially
174	      significant.

176	   o  Significantly Violated Time Units (e.g. significantly violated
177	      seconds, significantly violated milliseconds).  This indicates the
178	      number of time units during which any significant violation
179	      occurred.

181	   o  Severely Violated Time Units (e.g. severely violated seconds,
182	      severely violated milliseconds).  "Severe" here refers to the
183	      occurrence of multiple violations within the same time unit.  The
184	      definition of "severe" may be subject to policy; it may also take
185	      into account the significance of the violations that occur.

187	   Note that there is no definition of Severely Violated Packets.  The
188	   term "severe" is used in conjunction with the occurrence of multiple
189	   violations related to multiple packets, not any one packet in
190	   isolation.
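
   Because "significant" and "severe" are left to policy, any
   computation of these metrics has to assume one.  The sketch below
   assumes, purely for illustration, that a violation is significant
   when the latency exceeds the SLO by more than a fixed margin, and
   that a time unit is severely violated when it contains at least a
   given number of violations.  Both policy parameters, the sample
   format, and all names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical sample format: (arrival time in s, latency in ms).
Sample = Tuple[float, float]


def policy_metrics(samples: Iterable[Sample],
                   slo_latency_ms: float,
                   significant_margin_ms: float,
                   severe_min_violations: int,
                   time_unit_s: float = 1.0) -> Dict[str, int]:
    """Count significant and severe violations under an assumed policy."""
    violations_per_unit: Counter = Counter()
    significant_units: Set[int] = set()
    significant_packets = 0
    for arrival_s, latency_ms in samples:
        if latency_ms <= slo_latency_ms:
            continue  # the packet met its SLO
        unit = int(arrival_s // time_unit_s)
        violations_per_unit[unit] += 1
        # Assumed policy: "significant" means exceeding the SLO by more
        # than a fixed margin.
        if latency_ms > slo_latency_ms + significant_margin_ms:
            significant_packets += 1
            significant_units.add(unit)
    # Assumed policy: "severe" means at least N violations in one unit.
    severe_units = sum(1 for count in violations_per_unit.values()
                       if count >= severe_min_violations)
    return {
        "significantly_violated_packets": significant_packets,
        "significantly_violated_time_units": len(significant_units),
        "severely_violated_time_units": severe_units,
    }


example = [(0.10, 5.0), (1.20, 25.0), (1.30, 27.0), (1.90, 30.0),
           (2.50, 6.0), (4.05, 21.0)]
print(policy_metrics(example, slo_latency_ms=20.0,
                     significant_margin_ms=5.0,
                     severe_min_violations=2))
```

   Consistent with the note above, the sketch counts severe violations
   only per time unit and never per packet.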

192	   From these first-order metrics, second-order metrics can be defined
193	   that build on the first set of metrics.  Some of these metrics are
194	   modeled after Mean Time Between Failures (MTBF) metrics, with a
195	   "failure" in this context referring to a failure to deliver a packet
196	   according to its SLO.

198	   o  Time since last violated time unit (e.g., since last violated ms,
199	      since last violated second).  (This parameter is particularly
200	      useful for the monitoring of the current health.)

202	   o  Packets since last violated packet.  (This parameter is
203	      particularly useful for the monitoring of the current health.)

205	   o  Mean time between violated time units (e.g. between violated
206	      milliseconds, between violated seconds).  This refers to the
207	      arithmetic mean of the time that elapses between successive
208	      violated time units.

210	   o  Mean packets between violations.  This refers to the arithmetic
211	      mean of the number of SLO-compliant packets between SLO
212	      violations.  (Another variation of "MTBF" in a service setting.)
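
   The four second-order metrics above can be sketched as follows.  The
   inputs are assumptions of this sketch: violated time units are
   represented by integer indices, packets by a sequence of booleans in
   arrival order (True meaning "violated"), and the time between two
   violated time units is taken as the difference of their indices.
   All function names are hypothetical.

```python
from typing import Optional, Sequence


def time_since_last_violated_unit(violated_units: Sequence[int],
                                  current_unit: int) -> Optional[int]:
    """Time units elapsed since the most recent violated time unit."""
    if not violated_units:
        return None  # no violation has been observed so far
    return current_unit - max(violated_units)


def packets_since_last_violated_packet(
        flags: Sequence[bool]) -> Optional[int]:
    """Number of packets received after the last violated packet."""
    for offset, violated in enumerate(reversed(flags)):
        if violated:
            return offset
    return None


def mean_time_between_violated_units(
        violated_units: Sequence[int]) -> Optional[float]:
    """Arithmetic mean of the gaps between successive violated units."""
    units = sorted(set(violated_units))
    if len(units) < 2:
        return None  # at least two violated units are needed for a gap
    gaps = [later - earlier for earlier, later in zip(units, units[1:])]
    return sum(gaps) / len(gaps)


def mean_packets_between_violations(
        flags: Sequence[bool]) -> Optional[float]:
    """Mean count of SLO-compliant packets between violations."""
    positions = [i for i, violated in enumerate(flags) if violated]
    if len(positions) < 2:
        return None
    runs = [later - earlier - 1
            for earlier, later in zip(positions, positions[1:])]
    return sum(runs) / len(runs)


units = [1, 4, 10]
flags = [False, True, False, False, True,
         False, False, False, False, True, False]
print(time_since_last_violated_unit(units, current_unit=12))  # 2
print(packets_since_last_violated_packet(flags))              # 1
print(mean_time_between_violated_units(units))                # 4.5
print(mean_packets_between_violations(flags))                 # 3.0
```

   Returning None when fewer than two violations exist is a choice made
   for this sketch; an implementation might instead report the metric
   as undefined in some other way.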

214	   The same set of metrics can also be applied to significant
215	   violations, and to severe violations:

217	   o  Time since last significantly violated time unit (e.g., since last
218	      significantly violated ms, since last significantly violated
219	      second).

221	   o  Time since last severely violated time unit (e.g., since last
222	      severely violated ms, since last severely violated second).

224	   o  Packets since last significantly violated packet.

226	   o  Mean time between significantly violated time units (e.g. between
227	      significantly violated milliseconds, between significantly
228	      violated seconds).

230	   o  Mean time between severely violated time units (e.g. between
231	      severely violated milliseconds, between severely violated
232	      seconds).

234	   o  Mean packets between significant violations.  This refers to the
235	      arithmetic mean of the number of SLO-compliant packets between
236	      significant SLO violations.

238	   The next set of metrics puts the violations in relationship to non-
239	   violations.  It is intended to provide an analogous measure to that
240	   of availability, typically defined as the number of time units during
241	   which a system (or service) is available divided by the total
242	   number of time units.  In analogy, a time unit that is "violated" can
243	   be viewed as one in which a service is not available with the
244	   advertised precision:

246	   o  Precision availability (of milliseconds, of seconds): the ratio
247	      between the time units (seconds, milliseconds) that were not
248	      violated and the total time units for the duration of the service.

250	   o  Analogous precision availability metrics with respect to severely
251	      violated time units and significantly violated time units.
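
   A minimal sketch of the ratio computation follows.  It reports both
   the share of violated time units and its complement.  The function
   names, and the choice to treat the complement as the availability
   figure (in line with the conventional notion of availability), are
   assumptions of this sketch.

```python
def violated_ratio(violated_units: int, total_units: int) -> float:
    """Share of time units in which at least one violation occurred."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    if not 0 <= violated_units <= total_units:
        raise ValueError("violated_units must lie in [0, total_units]")
    return violated_units / total_units


def precision_availability(violated_units: int,
                           total_units: int) -> float:
    """Share of time units delivered with the advertised precision."""
    return 1.0 - violated_ratio(violated_units, total_units)


# One violated second during a four-second service.
print(violated_ratio(1, 4))          # 0.25
print(precision_availability(1, 4))  # 0.75
```

   The same two functions apply unchanged to counts of significantly
   or severely violated time units.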

253	   It should be noted that certain Service Level Agreements may be
254	   statistical in nature, requiring the service levels of packets in a
255	   flow to adhere to certain distributions.  For example, an SLA might
256	   state that any given SLO applies only to a certain percentage of
257	   packets, allowing for a certain number of violations to take place.
258	   A "violated packet" in that case does not necessarily constitute an
259	   SLO violation.  However, it is still useful to maintain those
260	   statistics, as the number of violated packets still matters when
261	   looked at in proportion to the total number of packets.

263	   In that vein, an SLA might establish an SLO of, say, end-to-end
264	   latency to not exceed 20ms for 99% of packets, to not exceed 25ms for
265	   99.999% of packets, and to never exceed 30ms for anything beyond.  In
266	   that case, any individual packet missing the 20 ms latency target
267	   cannot be considered an SLO violation in itself, but compliance with
268	   the SLO may need to be assessed after the fact.
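
   The after-the-fact assessment for this example could look as
   follows.  This is a sketch under stated assumptions: the SLO is
   encoded as pairs of (latency bound, minimum fraction of packets
   within that bound), "never exceed" is encoded as a fraction of 1.0,
   and the names EXAMPLE_SLO and complies are hypothetical.

```python
from typing import Sequence, Tuple

# (latency bound in ms, minimum fraction of packets within the bound),
# following the example in the text.
EXAMPLE_SLO: Tuple[Tuple[float, float], ...] = (
    (20.0, 0.99),
    (25.0, 0.99999),
    (30.0, 1.0),
)


def complies(latencies_ms: Sequence[float],
             slo: Sequence[Tuple[float, float]] = EXAMPLE_SLO) -> bool:
    """Check, after the fact, whether a flow met every target."""
    total = len(latencies_ms)
    if total == 0:
        return True  # assumption: an empty flow is trivially compliant
    for bound_ms, min_fraction in slo:
        within = sum(1 for latency in latencies_ms
                     if latency <= bound_ms)
        # Compare counts rather than ratios to avoid a division.
        if within < min_fraction * total:
            return False
    return True


# 995 packets at 10 ms and 5 packets at 22 ms: the five packets miss
# the 20 ms target, yet the flow as a whole still meets the SLO.
print(complies([10.0] * 995 + [22.0] * 5))  # True
```

   A flow in which 20 of 1000 packets take 22 ms would fail the first
   target, and a single packet at 31 ms would fail the 30 ms bound.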

270	   To support statistical SLAs more directly, it is feasible to support
271	   additional metrics, such as metrics that represent histograms for
272	   service level parameters with buckets corresponding to individual
273	   service level objectives.  For the example just given, a histogram
274	   for a given flow could be maintained with four buckets: one
275	   containing the count of packets within 20ms, a second with a count of
276	   packets between 20 and 25ms (or simply all within 25ms), a third with
277	   a count of packets between 25 and 30ms (or simply all packets within
278	   30ms), and a fourth with a count of anything beyond (or simply a
279	   total count).  The number of buckets and the boundaries between
280	   those buckets should correspond to the needs of the application and
281	   its SLA, i.e. to the specific guarantees and SLOs that were
282	   provided.  The definition of histogram metrics is for further study.
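
   Since histogram metrics are left for further study, the following is
   only one possible shape for them: one bucket per latency bound plus
   one bucket for anything beyond, with an optional cumulative view.
   The function names and the default bounds (taken from the example
   above) are assumptions of this sketch.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List, Sequence


def latency_histogram(
        latencies_ms: Sequence[float],
        bounds_ms: Sequence[float] = (20.0, 25.0, 30.0)) -> List[int]:
    """Count packets per bucket; the last bucket is 'anything beyond'."""
    counts = [0] * (len(bounds_ms) + 1)
    for latency in latencies_ms:
        # bisect_left puts a latency equal to a bound into the lower
        # bucket, so "within 20 ms" includes exactly 20 ms.
        counts[bisect_left(bounds_ms, latency)] += 1
    return counts


def cumulative(counts: Sequence[int]) -> List[int]:
    """Cumulative view: all within 20 ms, all within 25 ms, and so on."""
    return list(accumulate(counts))


observed = [5.0, 20.0, 22.0, 25.0, 27.0, 30.0, 31.0, 40.0]
histogram = latency_histogram(observed)
print(histogram)              # [2, 2, 2, 2]
print(cumulative(histogram))  # [2, 4, 6, 8]
```

   In the cumulative view the last entry equals the total packet
   count, matching the "or simply a total count" variant mentioned
   above.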

284	5.  Discussion Items

286	   The following is a list of items for which further discussion is
287	   needed as to whether they should be included in the scope of this
288	   specification:

290	   o  A YANG data model

292	   o  A set of IPFIX Information Elements

294	   o  Statistical metrics: e.g. histograms/buckets

296	   o  Policies regarding the definition of "significant" and "severe"
297	      violations

299	   o  Additional second-order metrics, such as "longest disruption of
300	      service time" (measuring consecutive time units with violations)
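
   As an example of the last item, the longest disruption could be
   computed as the longest run of consecutive violated time units.  A
   minimal sketch, assuming that violated time units are identified by
   integer indices; the function name is hypothetical.

```python
from typing import Iterable, Optional


def longest_disruption(violated_units: Iterable[int]) -> int:
    """Length of the longest run of consecutive violated time units."""
    longest = 0
    current = 0
    previous: Optional[int] = None
    for unit in sorted(set(violated_units)):
        if previous is not None and unit == previous + 1:
            current += 1  # the run continues
        else:
            current = 1   # a new run starts at this unit
        longest = max(longest, current)
        previous = unit
    return longest


print(longest_disruption([1, 2, 3, 7, 8, 12]))  # 3
```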

302	6.  IANA Considerations

304	   TBD

306	7.  Security Considerations

308	   Instrumentation for metrics that are used to assess compliance with
309	   SLOs constitutes an interesting target for an attacker.  By
310	   interfering with the maintenance of such metrics, services could be
311	   falsely identified as being in compliance (when they are not) or,
312	   vice versa, flagged as being non-compliant (when indeed they are
313	   compliant).  While this document does not specify how networks
314	   should be instrumented to maintain the identified metrics, such
315	   instrumentation needs to be properly secured to ensure accurate
316	   measurements and prevent tampering with the metrics being kept.

318	   Where metrics are being defined relative to an SLO, the configuration
319	   of those SLOs needs to be properly secured.  Likewise, where SLOs can
320	   be adjusted, it needs to be clear which particular SLO any given
321	   metrics instance refers to.  The same service levels that constitute
322	   SLO violations for one flow, and that should be maintained as part of
323	   the "violated time units", "violated packets", and related metrics,
324	   may be perfectly compliant for another flow.  Where it is not
325	   possible to properly tie together SLOs and violation metrics, it will
326	   be preferable to merely maintain statistics about service levels that
327	   were delivered (for example, overall histograms of end-to-end
328	   latency), without assessing which of these constitute violations.

330	   By the same token, where the definition of what constitutes a
331	   "severe" violation or a "significant" violation depends on policy or
332	   context, the configuration of such policy or context needs to be
333	   specially secured and the configuration of this policy be bound to
334	   the metrics being maintained.  This way it will be clear which policy
335	   was in effect when those metrics were being assessed.  An attacker
336	   that is able to tamper with such policies will render the
337	   corresponding metrics useless (in the best case) or misleading (in
338	   the worst case).

340	8.  Normative References

342	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
343	              Requirement Levels", BCP 14, RFC 2119,
344	              DOI 10.17487/RFC2119, March 1997,
345	              <https://www.rfc-editor.org/info/rfc2119>.

347	   [RFC2863]  McCloghrie, K. and F. Kastenholz, "The Interfaces Group
348	              MIB", RFC 2863, DOI 10.17487/RFC2863, June 2000,
349	              <https://www.rfc-editor.org/info/rfc2863>.

351	   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
352	              "Specification of the IP Flow Information Export (IPFIX)
353	              Protocol for the Exchange of Flow Information", STD 77,
354	              RFC 7011, DOI 10.17487/RFC7011, September 2013,
355	              <https://www.rfc-editor.org/info/rfc7011>.

357	   [RFC7012]  Claise, B., Ed. and B. Trammell, Ed., "Information Model
358	              for IP Flow Information Export (IPFIX)", RFC 7012,
359	              DOI 10.17487/RFC7012, September 2013,
360	              <https://www.rfc-editor.org/info/rfc7012>.

362	   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
363	              RFC 7950, DOI 10.17487/RFC7950, August 2016,
364	              <https://www.rfc-editor.org/info/rfc7950>.

366	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
367	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
368	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

370	   [RFC8343]  Bjorklund, M., "A YANG Data Model for Interface
371	              Management", RFC 8343, DOI 10.17487/RFC8343, March 2018,
372	              <https://www.rfc-editor.org/info/rfc8343>.

374	   [RFC8655]  Finn, N., Thubert, P., Varga, B., and J. Farkas,
375	              "Deterministic Networking Architecture", RFC 8655,
376	              DOI 10.17487/RFC8655, October 2019,
377	              <https://www.rfc-editor.org/info/rfc8655>.

379	Authors' Addresses

381	   Alexander Clemm
382	   Futurewei
383	   2330 Central Expressway
384	   Santa Clara  CA 95050
385	   USA

387	   Email: ludwig@clemm.org

389	   John Strassner
390	   Futurewei
391	   2330 Central Expressway
392	   Santa Clara  CA 95050
393	   USA

395	   Email: strazpdj@gmail.com

397	   Jerome Francois
398	   Inria
399	   615 Rue du Jardin Botanique
400	   Villers-les-Nancy  54600
401	   France

403	   Email: jerome.francois@inria.fr