OPSAWG                                                         B. Claise
Internet-Draft                                               J. Quilbeuf
Intended status: Informational                                    Huawei
Expires: 30 December 2022                                       D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                             T. Arumugam
                                                      Cisco Systems, Inc.
                                                             28 June 2022

      Service Assurance for Intent-based Networking Architecture
          draft-ietf-opsawg-service-assurance-architecture-06

Abstract

This document describes an architecture that aims at assuring that service instances are running as expected. As services rely upon multiple sub-services provided by a variety of elements including the underlying network devices and functions, getting the assurance of a healthy service is only possible with a holistic view of all involved elements. This architecture not only helps to correlate the service degradation with symptoms of a specific network component but also to list the services impacted by the failure or degradation of a specific network component.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 30 December 2022.

Copyright Notice

Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components 55 extracted from this document must include Revised BSD License text as 56 described in Section 4.e of the Trust Legal Provisions and are 57 provided without warranty as described in the Revised BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 62 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 63 3. A Functional Architecture . . . . . . . . . . . . . . . . . . 6 64 3.1. Inferring a Service Instance Configuration into an 65 Assurance Graph . . . . . . . . . . . . . . . . . . . . . 9 66 3.1.1. Circular Dependencies . . . . . . . . . . . . . . . . 11 67 3.2. Intent and Assurance Graph . . . . . . . . . . . . . . . 15 68 3.3. Subservices . . . . . . . . . . . . . . . . . . . . . . . 16 69 3.4. Building the Expression Graph from the Assurance Graph . 16 70 3.5. Open Interfaces with YANG Modules . . . . . . . . . . . . 18 71 3.6. Handling Maintenance Windows . . . . . . . . . . . . . . 18 72 3.7. Flexible Functional Architecture . . . . . . . . . . . . 19 73 3.8. Timing . . . . . . . . . . . . . . . . . . . . . . . . . 20 74 3.9. New Assurance Graph Generation . . . . . . . . . . . . . 21 75 4. Security Considerations . . . . . . . . . . . . . . . . . . . 21 76 5. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22 77 6. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 22 78 7. References . . . . . . . . . . . . . . . . . . . . . . . . . 22 79 7.1. Normative References . . . . . . . . . . . . . . . . . . 22 80 7.2. Informative References . . . . . . . . . . . . . . . . . 22 81 Appendix A. Changes between revisions . . . . . . . . . . . . . 24 82 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 24 83 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 24 85 1. Introduction 87 Network service YANG modules [RFC8199] describe the configuration, 88 state data, operations, and notifications of abstract representations 89 of services implemented on one or multiple network elements. 91 Service orchestrators use Network service YANG modules that will 92 infer network-wide configuration and, therefore the invocation of the 93 appropriate device modules (Section 3 of [RFC8969]). Knowing that a 94 configuration is applied doesn't imply that the service is up and 95 running as expected. For instance, the service might be degraded 96 because of a failure in the network, the experience quality is 97 distorted, or a service function may be reachable at the IP level but 98 does not provide its intended function. Thus, the network operator 99 must monitor the service operational data at the same time as the 100 configuration (Section 3.3 of [RFC8969]. To feed that task, the 101 industry has been standardizing on telemetry to push network element 102 performance information. 104 A network administrator needs to monitor their network and services 105 as a whole, independently of the management protocols. With 106 different protocols come different data models, and different ways to 107 model the same type of information. When network administrators deal 108 with multiple management protocols, the network management entities 109 have to perform the difficult and time-consuming job of mapping data 110 models: e.g. the model used for configuration with the model used for 111 monitoring when separate models or protocols are used. 
This problem is compounded by a large, disparate set of data sources (MIB modules, YANG models [RFC7950], IPFIX information elements [RFC7011], syslog plain text [RFC5424], TACACS+ [RFC8907], RADIUS [RFC2865], etc.). In order to avoid this data model mapping, the industry converged on model-driven telemetry to stream the service operational data, reusing the YANG models used for configuration. Model-driven telemetry greatly facilitates the notion of closed-loop automation, whereby events/status from the network drive remediation changes back into the network.

However, it proves difficult for network operators to correlate the service degradation with the network root cause. For example, "Why does my L3VPN fail to connect?" or "Why is this specific service not highly responsive?". The reverse, i.e., which services are impacted when a network component fails or degrades, is also important for operators. For example, "Which services are impacted when the dBm level of this specific optic begins to degrade?", "Which applications are impacted by this ECMP imbalance?", or "Is that issue actually impacting any other customers?". This task usually falls under the so-called "Service Impact Analysis" functional block.

Intent-based approaches are often declarative, starting from a statement of "The service works as expected" and trying to enforce it. Such approaches are mainly suited for greenfield deployments.

In this document, we propose an architecture implementing Service Assurance for Intent-Based Networking (SAIN). Aligned with Section 3.3 of [RFC7149], instead of approaching intent in a declarative way, this architecture focuses on already defined services and tries to infer the meaning of "The service works as expected". To do so, the architecture works from an assurance graph, deduced from the configuration pushed to the device for enabling the service instance. If the SAIN orchestrator supports it, the service model (Section 2 of [RFC8309]) or the network model (Section 2.1 of [RFC8309]) can also be used to build the assurance graph. In some cases, the assurance graph may also be explicitly completed to add an intent not exposed in the service model itself (e.g., the service must rely upon a backup physical path). This assurance graph is decomposed into components, which are then assured independently. The root of the assurance graph represents the service to assure, and its children represent components identified as its direct dependencies; each component can have dependencies as well. The SAIN orchestrator automatically updates the assurance graph when services are modified.

When a service is degraded, the SAIN architecture will highlight where in the assurance graph to look, as opposed to going hop by hop to troubleshoot the issue. More precisely, the SAIN architecture will associate with each service a list of symptoms originating from specific components of the network. These components are good candidates for explaining the source of a service degradation. Not only can this architecture help to correlate service degradation with network root cause/symptoms, but it can also deduce from the assurance graph the number and type of services impacted by a component degradation/failure. This added value informs the operational team where to focus its attention for maximum return.
Indeed, the operational team should focus its attention first on the degrading/failing components impacting the highest number of customers, especially those with SLA contracts involving penalties in case of failure.

This architecture provides the building blocks to assure both physical and virtual entities and is flexible in terms of services and subservices, of (distributed) graphs, and of components (Section 3.7).

The architecture presented in this document is completed by a set of YANG modules defined in a companion document [I-D.ietf-opsawg-service-assurance-yang]. These YANG modules properly define the interfaces between the various components of the architecture in order to foster interoperability.

2. Terminology

SAIN agent: A functional component that communicates with a device, a set of devices, or another agent to build an expression graph from a received assurance graph and perform the corresponding computation of the health status and symptoms.

Assurance case: "An assurance case is a structured argument, supported by evidence, intended to justify that a system is acceptably assured relative to a concern (such as safety or security) in the intended operating environment" [Piovesan2017].

Service instance: A specific instance of a service.

Subservice: A part or a functionality of the network system that can be independently assured as a single entity in the assurance graph.

Assurance graph: A Directed Acyclic Graph (DAG) representing the assurance case for one or several service instances. The nodes (also known as vertices in the context of a DAG) are the service instances themselves and the subservices; the edges indicate dependency relations.

SAIN collector: A functional component that fetches or receives the computer-consumable output of the SAIN agent(s) and processes it locally (including displaying it in a user-friendly form).

DAG: Directed Acyclic Graph.

ECMP: Equal-Cost Multipath.

Expression graph: A generic term for a DAG representing a computation in SAIN. More specific terms are:

* Subservice expressions: An expression graph representing all the computations to execute for a subservice.

* Service expressions: An expression graph representing all the computations to execute for a service instance, i.e., including the computations for all dependent subservices.

* Global computation graph: An expression graph representing all the computations to execute for all service instances (i.e., all computations performed).

Dependency: The directed relationship between subservice instances in the assurance graph.

Metric: A piece of information retrieved from the network running the assured service.

Metric engine: A functional component, part of the SAIN agent, that maps metrics to a list of candidate metric implementations depending on the network element.

Metric implementation: The actual way of retrieving a metric from a network element.

Network service YANG module: A YANG module that describes the characteristics of a service as agreed upon with consumers of that service [RFC8199].

Service orchestrator: Quoting [RFC8199], "Network Service YANG Modules describe the characteristics of a service, as agreed upon with consumers of that service.
That is, a service module does not expose the detailed configuration parameters of all participating network elements and features but describes an abstract model that allows instances of the service to be decomposed into instance data according to the Network Element YANG Modules of the participating network elements. The service-to-element decomposition is a separate process; the details depend on how the network operator chooses to realize the service. For the purpose of this document, the term "orchestrator" is used to describe a system implementing such a process."

SAIN orchestrator: A functional component that is in charge of fetching the configuration specific to each service instance and converting it into an assurance graph.

Health status: Score and symptoms indicating whether a service instance or a subservice is "healthy". A non-maximal score must always be explained by one or more symptoms.

Health score: Integer ranging from 0 to 100 indicating the health of a subservice. A score of 0 means that the subservice is broken, while a score of 100 means that the subservice in question is operating as expected.

Strongly connected component: A subset of a directed graph such that there is a (directed) path from any node of the subset to any other node of the subset. A DAG does not contain any strongly connected component with more than one node.

Symptom: Reason explaining why a service instance or a subservice is not completely healthy.

3. A Functional Architecture

The goal of SAIN is to assure that service instances are operating as expected (i.e., the observed service matches the expected service) and, if not, to pinpoint what is wrong. More precisely, SAIN computes a score for each service instance and outputs symptoms explaining that score. The only valid situation where no symptoms are returned is when the score is maximal, indicating that no issues were detected for that service. The score augmented with the symptoms is called the health status.

The SAIN architecture is generic: it applies to multiple environments (e.g., wireline, wireless), to different domains (e.g., 5G, an NFV domain with a Virtual Infrastructure Manager (VIM)), and, as already noted, to physical or virtual devices as well as virtual functions. Thanks to the distributed graph design principle, graphs from different environments/orchestrators can be combined.

As an example of a service, let us consider a point-to-point L2VPN. [RFC8466] specifies the parameters for such a service. Examples of symptoms might be symptoms reported by specific subservices, such as "Interface has high error rate", "Interface flapping", or "Device almost out of memory", as well as symptoms more specific to the service, such as "Site disconnected from VPN".

To compute the health status of such a service, the service definition is decomposed into an assurance graph formed by subservices linked through dependencies. Each subservice is then turned into an expression graph that details how to fetch metrics from the devices and compute the health status of the subservice. The subservice expressions are combined according to the dependencies between the subservices in order to obtain the expression graph which computes the health status of the service.

The overall SAIN architecture is presented in Figure 1.
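Before the architecture of Figure 1 is described in detail, the following minimal sketch illustrates how a health status (a score plus a list of symptoms, as defined in Section 2) could be combined along impacting dependencies. The sketch is purely illustrative: the data structures, the names, and the min()-based combination rule are assumptions of this example, not part of the SAIN specification.

   # Illustrative only: propagate health statuses (score + symptoms)
   # along impacting dependencies of the tunnel example.
   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class HealthStatus:
       score: int                        # 0 (broken) .. 100 (as expected)
       symptoms: List[str] = field(default_factory=list)

   def combine(own: HealthStatus, deps: List[HealthStatus]) -> HealthStatus:
       # One possible heuristic: a service cannot be healthier than its
       # worst impacting dependency; symptoms are aggregated to explain
       # the resulting score.
       score = min([own.score] + [d.score for d in deps])
       symptoms = own.symptoms + [s for d in deps for s in d.symptoms]
       return HealthStatus(score, symptoms)

   peer1_itf = HealthStatus(50, ["Interface has high error rate"])
   peer2_itf = HealthStatus(100)
   ip_conn = HealthStatus(100)
   tunnel = combine(HealthStatus(100), [peer1_itf, peer2_itf, ip_conn])
   print(tunnel)   # score 50, explained by the interface symptom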
Based on the 315 service configuration provided by the service orchestrator, the SAIN 316 orchestrator decomposes the assurance graph. It then sends to the 317 SAIN agents the assurance graph along some other configuration 318 options. The SAIN agents are responsible for building the expression 319 graph and computing the health statuses in a distributed manner. The 320 collector is in charge of collecting and displaying the current 321 inferred health status of the service instances and subservices. 322 Finally, the automation loop is closed by having the SAIN collector 323 providing feedback to the network/service orchestrator. 325 In order to make agents, orchestrators and collectors from different 326 vendors interoperable, their interface is defined as a YANG model in 327 a companion document [I-D.ietf-opsawg-service-assurance-yang]. In 328 Figure 1, the communications that are normalized by this YANG model 329 are tagged with a "Y". The use of this YANG model is further 330 explained in Section 3.5. 332 +-----------------+ 333 | Service | 334 | Orchestrator |<--------------------+ 335 | | | 336 +-----------------+ | 337 | ^ | 338 | | Network | 339 | | Service | Feedback 340 | | Instance | Loop 341 | | Configuration | 342 | | | 343 | V | 344 | +-----------------+ +-------------------+ 345 | | SAIN | | SAIN | 346 | | Orchestrator | | Collector | 347 | +-----------------+ +-------------------+ 348 | | ^ 349 | Y| Configuration | Health Status 350 | | (assurance graph) Y| (Score + Symptoms) 351 | V | Streamed 352 | +-------------------+ | via Telemetry 353 | |+-------------------+ | 354 | ||+-------------------+ | 355 | +|| SAIN |---------+ 356 | +| agent | 357 | +-------------------+ 358 | ^ ^ ^ 359 | | | | 360 | | | | Metric Collection 361 V V V V 362 +-------------------------------------------------------------+ 363 | Network System | 364 | | 365 +-------------------------------------------------------------+ 367 Figure 1: SAIN Architecture 369 In order to produce the score assigned to a service instance, the 370 various involved components perform the following tasks: 372 * Analyze the configuration pushed to the network device(s) for 373 configuring the service instance and decide: which information is 374 needed from the device(s), such a piece of information being 375 called a metric, which operations to apply to the metrics for 376 computing the health status. 378 * Stream (via telemetry [RFC8641]) operational and config metric 379 values when possible, else continuously poll. 381 * Continuously compute the health status of the service instances, 382 based on the metric values. 384 3.1. Inferring a Service Instance Configuration into an Assurance Graph 386 In order to structure the assurance of a service instance, the SAIN 387 orchestrator decomposes the service instance into so-called 388 subservice instances. Each subservice instance focuses on a specific 389 feature or subpart of the service. 391 The decomposition into subservices is an important function of the 392 architecture, for the following reasons: 394 * The result of this decomposition provides a relational picture of 395 a service instance, that can be represented as a graph (called 396 assurance graph) to the operator. 398 * Subservices provide a scope for particular expertise and thereby 399 enable contribution from external experts. For instance, the 400 subservice dealing with the optics health should be reviewed and 401 extended by an expert in optical interfaces. 
403 * Subservices that are common to several service instances are 404 reused for reducing the amount of computation needed. 406 The assurance graph of a service instance is a DAG representing the 407 structure of the assurance case for the service instance. The nodes 408 of this graph are service instances or subservice instances. Each 409 edge of this graph indicates a dependency between the two nodes at 410 its extremities: the service or subservice at the source of the edge 411 depends on the service or subservice at the destination of the edge. 413 Figure 2 depicts a simplistic example of the assurance graph for a 414 tunnel service. The node at the top is the service instance, the 415 nodes below are its dependencies. In the example, the tunnel service 416 instance depends on the "peer1" and "peer2" tunnel interfaces, which 417 in turn depend on the respective physical interfaces, which finally 418 depend on the respective "peer1" and "peer2" devices. The tunnel 419 service instance also depends on the IP connectivity that depends on 420 the IS-IS routing protocol. 422 +------------------+ 423 | Tunnel | 424 | Service Instance | 425 +------------------+ 426 | 427 +--------------------+-------------------+ 428 | | | 429 v v v 430 +-------------+ +--------------+ +-------------+ 431 | Peer1 | | IP | | Peer2 | 432 | Tunnel | | Connectivity | | Tunnel | 433 | Interface | | | | Interface | 434 +-------------+ +--------------+ +-------------+ 435 | | | 436 | +-------------+--------------+ | 437 | | | | | 438 v v v v v 439 +-------------+ +-------------+ +-------------+ 440 | Peer1 | | IS-IS | | Peer2 | 441 | Physical | | Routing | | Physical | 442 | Interface | | Protocol | | Interface | 443 +-------------+ +-------------+ +-------------+ 444 | | 445 v v 446 +-------------+ +-------------+ 447 | | | | 448 | Peer1 | | Peer2 | 449 | Device | | Device | 450 +-------------+ +-------------+ 452 Figure 2: Assurance Graph Example 454 Depicting the assurance graph helps the operator to understand (and 455 assert) the decomposition. The assurance graph shall be maintained 456 during normal operation with addition, modification and removal of 457 service instances. A change in the network configuration or topology 458 shall automatically be reflected in the assurance graph. As a first 459 example, a change of routing protocol from IS-IS to OSPF would change 460 the assurance graph accordingly. As a second example, assuming that 461 ECMP is in place for the source router for that specific tunnel; in 462 that case, multiple interfaces must now be monitored, on top of the 463 monitoring the ECMP health itself. 465 3.1.1. Circular Dependencies 467 The edges of the assurance graph represent dependencies. An 468 assurance graph is a DAG if and only if there are no circular 469 dependencies among the subservices, and every assurance graph should 470 avoid circular dependencies. However, in some cases, circular 471 dependencies might appear in the assurance graph. 473 First, the assurance graph of a whole system is obtained by combining 474 the assurance graph of every service running on that system. Here 475 combining means that two subservices having the same type and the 476 same parameters are in fact the same subservice and thus a single 477 node in the graph. For instance, the subservice of type "device" 478 with the only parameter (the device id) set to "PE1" will appear only 479 once in the whole assurance graph even if several services rely on 480 that device. 
Now, if two engineers design assurance graphs for two different services, and Engineer A decides that an interface depends on the link it is connected to, but Engineer B decides that the link depends on the interface it is connected to, then when combining the two assurance graphs, we will have a circular dependency interface -> link -> interface.

Another case possibly resulting in circular dependencies is when subservices are not properly identified. Assume that we want to assure a Kubernetes cluster. If we represent the cluster by a subservice and the network service by another subservice, we will likely model that the network service depends on the cluster, because the network service is orchestrated by Kubernetes, and that the cluster depends on the network service because it implements the communications. A finer decomposition might distinguish between the resources for executing containers (a part of our cluster subservice) and the communication between the containers (which could be modelled in the same way as communication between routers).

In any case, it is likely that circular dependencies will show up in the assurance graph. A first step would be to detect circular dependencies as soon as possible in the SAIN architecture. Such a detection could be carried out by the SAIN orchestrator. Whenever a circular dependency is detected, the newly added service would not be monitored until more careful modelling, or alignment between the different teams (Engineers A and B), removes the circular dependency.

As a more elaborate solution, we could consider a graph transformation:

* Decompose the graph into strongly connected components.

* For each strongly connected component:

- Remove all edges between nodes of the strongly connected component

- Add a new "top" node for the strongly connected component

- For each edge pointing to a node in the strongly connected component, change the destination to the "top" node

- Add a dependency from the top node to every node in the strongly connected component.

Such an algorithm would include all symptoms detected by any subservice in one of the strongly connected components and make them available to any subservice that depends on it. Figure 3 shows an example of such a transformation. On the left-hand side, the nodes c, d, e and f form a strongly connected component. The status of a should depend on the status of c, d, e, f, g, and h, but this is hard to compute because of the circular dependency. On the right-hand side, a depends on all these nodes as well, but there the circular dependency has been removed.

   +---+    +---+  |      +---+    +---+
   | a |    | b |  |      | a |    | b |
   +---+    +---+  |      +---+    +---+
     |        |    |        |        |
     v        v    |        v        v
   +---+    +---+  |     +------------+
   | c |--->| d |  |     |    top     |
   +---+    +---+  |     +------------+
     ^        |    |      /    |     |   \
     |        |    |     /     |     |    \
     |        v    |     v     v     v     v
   +---+    +---+  |   +---+ +---+ +---+ +---+
   | f |<---| e |  |   | f | | c | | d | | e |
   +---+    +---+  |   +---+ +---+ +---+ +---+
     |        |    |     |                 |
     v        v    |     v                 v
   +---+    +---+  |   +---+             +---+
   | g |    | h |  |   | g |             | h |
   +---+    +---+  |   +---+             +---+

       Before                     After
   Transformation            Transformation

            Figure 3: Graph transformation

The sketch below illustrates how such a transformation could be implemented; a concrete example then follows.
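The following sketch (in Python) shows one possible implementation of this transformation. It is an illustration only: the graph representation, the node naming, and the use of Tarjan's algorithm for finding strongly connected components are assumptions of the sketch, not requirements of the SAIN architecture.

   # Sketch only: remove circular dependencies from an assurance graph.
   # The graph maps each node to the set of nodes it depends on.

   def strongly_connected_components(graph):
       # Recursive Tarjan algorithm; returns a list of SCCs (as sets).
       index, low, on_stack = {}, {}, set()
       stack, sccs, counter = [], [], [0]

       def visit(v):
           index[v] = low[v] = counter[0]
           counter[0] += 1
           stack.append(v)
           on_stack.add(v)
           for w in graph.get(v, ()):
               if w not in index:
                   visit(w)
                   low[v] = min(low[v], low[w])
               elif w in on_stack:
                   low[v] = min(low[v], index[w])
           if low[v] == index[v]:
               scc = set()
               while True:
                   w = stack.pop()
                   on_stack.discard(w)
                   scc.add(w)
                   if w == v:
                       break
               sccs.append(scc)

       for v in list(graph):
           if v not in index:
               visit(v)
       return sccs

   def break_cycles(graph):
       new_graph = {v: set(deps) for v, deps in graph.items()}
       for n, scc in enumerate(strongly_connected_components(graph)):
           if len(scc) < 2:
               continue                    # no circular dependency here
           top = "top-%d" % n              # new aggregating node
           for v in scc:                   # remove intra-SCC edges
               new_graph[v] -= scc
           for v, deps in new_graph.items():
               if v not in scc and deps & scc:
                   deps -= scc             # redirect incoming edges ...
                   deps.add(top)           # ... to the "top" node
           new_graph[top] = set(scc)       # "top" depends on all SCC nodes
       return new_graph

   # Graph of Figure 3 (before): c -> d -> e -> f -> c is a cycle.
   graph = {"a": {"c"}, "b": {"d"}, "c": {"d"}, "d": {"e"},
            "e": {"f", "h"}, "f": {"c", "g"}, "g": set(), "h": set()}
   print(break_cycles(graph))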
560 Let's assume that Engineer A is building an assurance graph dealing 561 with IS-IS and Engineer B is building an assurance graph dealing with 562 OSPF. The graph from Engineer A could contain the following: 564 +------------+ 565 | IS-IS Link | 566 +------------+ 567 | 568 v 569 +------------+ 570 | Phys. Link | 571 +------------+ 572 | | 573 v v 574 +-------------+ +-------------+ 575 | Interface 1 | | Interface 2 | 576 +-------------+ +-------------+ 578 Figure 4: Fragment of assurance graph from Engineer A 580 The graph from Engineer B could contain the following: 582 +------------+ 583 | OSPF Link | 584 +------------+ 585 | | | 586 v | v 587 +-------------+ | +-------------+ 588 | Interface 1 | | | Interface 2 | 589 +-------------+ | +-------------+ 590 | | | 591 v v v 592 +------------+ 593 | Phys. Link | 594 +------------+ 596 Figure 5: Fragment of assurance graph from Engineer B 598 Each Interface subservice and the Physical Link subservice are common 599 to both fragments above. Each of these subservice appears only once 600 in the graph merging the two fragments. Dependencies from both 601 fragments are included in the merged graph, resulting in a circular 602 dependency: 604 +------------+ +------------+ 605 | IS-IS Link | | OSPF Link |---+ 606 +------------+ +------------+ | 607 | | | | 608 | +-------- + | | 609 v v | | 610 +------------+ | | 611 | Phys. Link |<-------+ | | 612 +------------+ | | | 613 | ^ | | | | 614 | | +-------+ | | | 615 v | v | v | 616 +-------------+ +-------------+ | 617 | Interface 1 | | Interface 2 | | 618 +-------------+ +-------------+ | 619 ^ | 620 | | 621 +------------------------------+ 623 Figure 6: Merging graphs from A and B 625 The solution presented above would result in graph looking as 626 follows, where a new "empty" node is included. Using that 627 transformation, all dependencies are indirectly satisfied for the 628 nodes outside the circular dependency, in the sense that both IS-IS 629 and OSPF links have indirect dependencies to the two interfaces and 630 the link. However, the dependencies between the link and the 631 interfaces are lost as they were causing the circular dependency. 633 +------------+ +------------+ 634 | IS-IS Link | | OSPF Link | 635 +------------+ +------------+ 636 | | 637 v v 638 +------------+ 639 | empty | 640 +------------+ 641 | 642 +-----------+-------------+ 643 | | | 644 v v v 645 +-------------+ +------------+ +-------------+ 646 | Interface 1 | | Phys. Link | | Interface 2 | 647 +-------------+ +------------+ +-------------+ 649 Figure 7: Removing circular dependencies after merging graphs 650 from A and B 652 3.2. Intent and Assurance Graph 654 The SAIN orchestrator analyzes the configuration of a service 655 instance to: 657 * Try to capture the intent of the service instance, i.e., what is 658 the service instance trying to achieve. At least, this requires 659 the SAIN orchestrator to know the YANG modules that are being 660 configured on the devices to enable the service. Note that if the 661 service model or the network model is known to the SAIN 662 orchestrator, the latter can exploit it. In that case, the intent 663 could be directly extracted and include more details, such as the 664 notion of sites for a VPN, which is out of scope of the device 665 configuration. 667 * Decompose the service instance into subservices representing the 668 network features on which the service instance relies. 
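As a minimal illustration of these two steps, the sketch below derives an assurance graph (subservice instances plus dependencies) for the virtual tunnel example discussed below. The subservice types, parameters, and data structures are hypothetical and chosen to mirror Figure 2; they are not the structures defined in [I-D.ietf-opsawg-service-assurance-yang].

   # Hypothetical sketch: decompose a two-device tunnel configuration
   # into an assurance graph similar to Figure 2.

   def decompose_tunnel(service_id, peers):
       nodes = {}            # node id -> (subservice type, parameters)
       dependencies = []     # (dependent node, dependency node)

       def add(node_id, node_type, **parameters):
           nodes[node_id] = (node_type, parameters)
           return node_id

       service = add(service_id, "tunnel-service", peers=list(peers))
       ip = add("ip-connectivity", "ip-connectivity")
       isis = add("is-is", "routing-protocol", protocol="is-is")
       dependencies += [(service, ip), (ip, isis)]

       for device, interface in peers.items():
           dev = add("device:" + device, "device", device=device)
           phys = add("interface:%s/%s" % (device, interface), "interface",
                      device=device, interface=interface)
           tun = add("tunnel-interface:" + device, "tunnel-interface",
                     device=device)
           dependencies += [(service, tun), (tun, phys), (phys, dev)]

       return nodes, dependencies

   nodes, dependencies = decompose_tunnel("tunnel-1",
                                          {"peer1": "eth0/0",
                                           "peer2": "eth0/0"})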
The SAIN orchestrator must be able to analyze the configuration pushed to various devices for configuring a service instance and produce the assurance graph for that service instance.

To schematize what a SAIN orchestrator does, assume that the configuration for a service instance touches two devices and configures a virtual tunnel interface on each device. Then:

* Capturing the intent would start by detecting that the service instance is actually a tunnel between the two devices, and stating that this tunnel must be functional. This solution is minimally invasive as it does not require modifying or even knowing the service model. If the service model or network model is known by the SAIN orchestrator, it can be used to further capture the intent and include more information, such as Service Level Objectives (SLOs), for instance the latency and bandwidth requirements for the tunnel, if present in the service model.

* Decomposing the service instance into subservices would result in the assurance graph depicted in Figure 2, for instance.

To be applied, SAIN requires a mechanism mapping a service instance to the configuration actually required on the devices for that service instance to run. While Figure 1 makes a distinction between the SAIN orchestrator and a different component providing the service instance configuration, in practice those two components are most likely combined. The internals of the orchestrator are currently out of scope of this document.

3.3. Subservices

A subservice corresponds to a subpart or a feature of the network system that is needed for a service instance to function properly. In the context of SAIN, a subservice also defines its assurance, that is, the method for assuring that the subservice behaves correctly.

Subservices, just like services, have high-level parameters that specify the type and specific instance to be assured. For example, assuring a device requires a specific deviceId as a parameter; assuring an interface requires a specific combination of deviceId and interfaceId.

A subservice is also characterized by a list of metrics to fetch and a list of operations to apply to these metrics in order to infer a health status.

3.4. Building the Expression Graph from the Assurance Graph

From the assurance graph is derived a so-called global computation graph. First, each subservice instance is transformed into a set of subservice expressions that take metrics and constants as input (i.e., sources of the DAG) and produce the status of the subservice, based on some heuristics. For instance, the health of an interface is 0 (minimal score) with the symptom "interface admin-down" if the interface is disabled in the configuration. Then, for each service instance, the service expressions are constructed by combining the subservice expressions of its dependencies. The way service expressions are combined depends on the dependency types (impacting or informational). Finally, the global computation graph is built by combining the service expressions. In other words, the global computation graph encodes all the operations needed to produce health statuses from the collected metrics.
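As an illustration of a subservice expression, the following sketch computes the health status of an "interface" subservice from a few collected metrics. The metric names and the scoring heuristic are examples only, not normative definitions.

   # Sketch of a subservice expression for an "interface" subservice:
   # map collected metrics to a health score and a list of symptoms.

   def interface_health(metrics):
       if metrics.get("admin-status") != "up":
           # The heuristic mentioned above: disabled in the configuration.
           return 0, ["interface admin-down"]
       score, symptoms = 100, []
       if metrics.get("oper-status") != "up":
           score, symptoms = 0, ["interface oper-down"]
       if metrics.get("in-error-rate", 0.0) > 0.01:
           score = min(score, 50)
           symptoms.append("interface has high error rate")
       return score, symptoms

   print(interface_health({"admin-status": "up", "oper-status": "up",
                           "in-error-rate": 0.02}))    # (50, [...])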
The two types of dependencies for combining subservices are:

Informational Dependency: Type of dependency whose health score does not impact the health score of its parent subservice or service instance(s) in the assurance graph. However, the symptoms should be taken into account in the parent service instance or subservice instance(s), for informational reasons.

Impacting Dependency: Type of dependency whose score impacts the score of its parent subservice or service instance(s) in the assurance graph. The symptoms are taken into account in the parent service instance or subservice instance(s), as the impacting reasons.

The set of dependency types presented here is not exhaustive. More specific dependency types can be defined by extending the YANG model. Adding these new dependency types requires defining the corresponding operation for combining statuses of subservices.

Subservices shall not be dependent on the protocol used to retrieve the metrics. To justify this, let's consider the interface operational status. Depending on the device capabilities, this status can be collected by an industry-accepted YANG module (IETF, OpenConfig), by a vendor-specific YANG module, or even by a MIB module. If the subservice was dependent on the mechanism to collect the operational status, then we would need multiple subservice definitions in order to support all the different mechanisms. This also implies that, while waiting for all the metrics to be available via standard YANG modules, SAIN agents might have to retrieve metric values via non-standard YANG models, via MIB modules, the Command Line Interface (CLI), etc., effectively implementing a normalization layer between data models and information models.

In order to keep subservices independent from the metric collection method, or, expressed differently, to support multiple combinations of platforms, OSes, and even vendors, the architecture introduces the concept of "metric engine". The metric engine maps each device-independent metric used in the subservices to a list of device-specific metric implementations that precisely define how to fetch values for that metric. The mapping is parameterized by the characteristics (model, OS version, etc.) of the device from which the metrics are fetched. This metric engine is included in the SAIN agent.

3.5. Open Interfaces with YANG Modules

The interfaces between the architecture components are open thanks to the YANG modules specified in [I-D.ietf-opsawg-service-assurance-yang]; they specify objects for assuring network services based on their decomposition into so-called subservices, according to the SAIN architecture.

These modules are intended for the following use cases:

* Assurance graph configuration:

- Subservices: configure a set of subservices to assure, by specifying their types and parameters.

- Dependencies: configure the dependencies between the subservices, along with their types.

* Assurance telemetry: export the health status of the subservices, along with the observed symptoms.

Some examples of YANG instances can be found in Appendix A of [I-D.ietf-opsawg-service-assurance-yang].

3.6. Handling Maintenance Windows

Whenever network components are under maintenance, the operator wants to inhibit the emission of symptoms from those components.
A typical use case is device maintenance, during which the device is not supposed to be operational. As such, symptoms related to the device health should be ignored, as well as symptoms related to the device-specific subservices, such as the interfaces, as their state changes are probably a consequence of the maintenance.

To configure network components as "under maintenance" in the SAIN architecture, the ietf-service-assurance model proposed in [I-D.ietf-opsawg-service-assurance-yang] specifies an "under-maintenance" flag per service or subservice instance. When, and only when, this flag is set, the companion field "maintenance-contact" must be set to a string that identifies the person or process who requested the maintenance. When a service or subservice is flagged as under maintenance, it may report a generic "Under Maintenance" symptom, for propagation towards subservices that depend on this specific subservice; any other symptom from this service or subservice, or from one of its impacting dependencies, must not be reported.

We illustrate this mechanism with three independent examples based on the assurance graph depicted in Figure 2:

* Device maintenance, for instance upgrading the device OS. The operator sets the "under-maintenance" flag for the subservice "Peer1" device. This inhibits the emission of symptoms from "Peer1 Physical Interface", "Peer1 Tunnel Interface" and "Tunnel Service Instance". All other subservices are unaffected.

* Interface maintenance, for instance replacing a broken optic. The operator sets the "under-maintenance" flag for the subservice "Peer1 Physical Interface". This inhibits the emission of symptoms from "Peer1 Tunnel Interface" and "Tunnel Service Instance". All other subservices are unaffected.

* Routing protocol maintenance, for instance modifying parameters or redistribution. The operator sets the "under-maintenance" flag for the subservice "IS-IS Routing Protocol". This inhibits the emission of symptoms from "IP Connectivity" and "Tunnel Service Instance". All other subservices are unaffected.

3.7. Flexible Functional Architecture

The SAIN architecture is flexible in terms of components. While the SAIN architecture in Figure 1 makes a distinction between two components, the service orchestrator and the SAIN orchestrator, in practice those two components are most likely combined. Similarly, the SAIN agents are displayed in Figure 1 as being separate components. Practically, the SAIN agents could be either independent components or directly integrated into monitored entities. A practical example is an agent in a router.

The SAIN architecture is also flexible in terms of services and subservices. In the proposed architecture, the SAIN orchestrator is coupled to a service orchestrator that defines the kinds of services that the architecture handles. Most examples in this document deal with the notion of Network Service YANG modules, with well-known services such as L2VPN or tunnels. However, the concept of services is general enough to cross into different domains. One of them is the domain of service management on network elements, which also requires its own assurance. Examples include a DHCP server on a Linux server, a data plane, an IPFIX export, etc.
The notion of "service" is generic in this architecture and depends on the service orchestrator and underlying network system. In other words, if a main service orchestrator coordinates several lower-level controllers, a service for a controller can be a subservice from the point of view of the main orchestrator, exactly as a DHCP server, a data plane, or an IPFIX export can be considered subservices for a device, a routing instance can be considered a subservice for an L3VPN, a tunnel can be considered a subservice for an application in the cloud, and a service function can be considered a subservice for a service function chain [RFC7665]. The assurance graph is created to be flexible and open, regardless of the subservice types, locations, or domains.

The SAIN architecture is also flexible in terms of distributed graphs. As shown in Figure 1, the architecture comprises several agents. Each agent is responsible for handling a subgraph of the assurance graph. The collector is responsible for fetching the subgraphs from the different agents and gluing them together. As an example, in the graph of Figure 2, the subservices relative to Peer1 might be handled by a different agent than the subservices relative to Peer2, and the IP Connectivity and IS-IS subservices might be handled by yet another agent. The agents will export their partial graphs and the collector will stitch them together as dependencies of the service instance.

And finally, the SAIN architecture is flexible in terms of what it monitors. Most, if not all, examples in this document refer to physical components, but this is not a constraint. Indeed, the assurance of virtual components would follow the same principles, and an assurance graph composed of virtualized components (or a mix of virtualized and physical ones) is entirely possible within this architecture.

3.8. Timing

The SAIN architecture requires time synchronization, with the Network Time Protocol (NTP) [RFC5905] as a candidate, between all elements: monitored entities, SAIN agents, the service orchestrator, the SAIN collector, as well as the SAIN orchestrator. This guarantees that all symptoms in the system can be correlated with each other and with the right assurance graph version.

The SAIN agent might have to remove some symptoms for specific subservice instances, because they are outdated and no longer relevant, or simply because the SAIN agent needs to free up some space. Regardless of the reason, it is important for a SAIN collector (re-)connecting to a SAIN agent to understand the effect of this garbage collection. Therefore, the SAIN agent contains a YANG object specifying the date and time at which the symptoms history starts for the subservice instances.

3.9. New Assurance Graph Generation

The assurance graph will change over time, because services and subservices come and go (changing the dependencies between subservices), or simply because a subservice is now under maintenance. Therefore, an assurance graph version must be maintained, along with the date and time of its last generation. The date and time of the last change to a particular subservice instance (again, a change in its dependencies or maintenance state) might also be kept.
From a client point of view, an 926 assurance graph change is triggered by the value of the assurance- 927 graph-version and assurance-graph-last-change YANG leaves. At that 928 point in time, the client (collector) follows the following process: 930 * Keep the previous assurance-graph-last-change value (let's call it 931 time T) 933 * Run through all subservice instance and process the subservice 934 instances for which the last-change is newer that the time T 936 * Keep the new assurance-graph-last-change as the new referenced 937 date and time 939 4. Security Considerations 941 The SAIN architecture helps operators to reduce the mean time to 942 detect and mean time to repair. As such, it should not cause any 943 security threats. However, the SAIN agents must be secured: a 944 compromised SAIN agent may be sending wrong root causes or symptoms 945 to the management systems. 947 Except for the configuration of telemetry, the agents do not need 948 "write access" to the devices they monitor. This configuration is 949 applied with a YANG module, whose protection is covered by Secure 950 Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF. 952 The data collected by SAIN could potentially be compromising to the 953 network or provide more insight into how the network is designed. 954 Considering the data that SAIN requires (including CLI access in some 955 cases), one should weigh data access concerns with the impact that 956 reduced visibility will have on being able to rapidly identify root 957 causes. 959 If a closed loop system relies on this architecture then the well 960 known issue of those system also applies, i.e., a lying device or 961 compromised agent could trigger partial reconfiguration of the 962 service or network. The SAIN architecture neither augments or 963 reduces this risk. 965 5. IANA Considerations 967 This document includes no request to IANA. 969 6. Contributors 971 * Youssef El Fathi 973 * Eric Vyncke 975 7. References 977 7.1. Normative References 979 [I-D.ietf-opsawg-service-assurance-yang] 980 Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T. 981 Arumugam, "YANG Modules for Service Assurance", Work in 982 Progress, Internet-Draft, draft-ietf-opsawg-service- 983 assurance-yang-06, 24 June 2022, 984 . 987 7.2. Informative References 989 [Piovesan2017] 990 Piovesan, A. and E. Griffor, "Reasoning About Safety and 991 Security: The Logic of Assurance", 2017. 993 [RFC2865] Rigney, C., Willens, S., Rubens, A., and W. Simpson, 994 "Remote Authentication Dial In User Service (RADIUS)", 995 RFC 2865, DOI 10.17487/RFC2865, June 2000, 996 . 998 [RFC5424] Gerhards, R., "The Syslog Protocol", RFC 5424, 999 DOI 10.17487/RFC5424, March 2009, 1000 . 1002 [RFC5905] Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch, 1003 "Network Time Protocol Version 4: Protocol and Algorithms 1004 Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010, 1005 . 1007 [RFC6242] Wasserman, M., "Using the NETCONF Protocol over Secure 1008 Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011, 1009 . 1011 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 1012 "Specification of the IP Flow Information Export (IPFIX) 1013 Protocol for the Exchange of Flow Information", STD 77, 1014 RFC 7011, DOI 10.17487/RFC7011, September 2013, 1015 . 1017 [RFC7149] Boucadair, M. and C. Jacquenet, "Software-Defined 1018 Networking: A Perspective from within a Service Provider 1019 Environment", RFC 7149, DOI 10.17487/RFC7149, March 2014, 1020 . 1022 [RFC7665] Halpern, J., Ed. 
and C. Pignataro, Ed., "Service Function 1023 Chaining (SFC) Architecture", RFC 7665, 1024 DOI 10.17487/RFC7665, October 2015, 1025 . 1027 [RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", 1028 RFC 7950, DOI 10.17487/RFC7950, August 2016, 1029 . 1031 [RFC8199] Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module 1032 Classification", RFC 8199, DOI 10.17487/RFC8199, July 1033 2017, . 1035 [RFC8309] Wu, Q., Liu, W., and A. Farrel, "Service Models 1036 Explained", RFC 8309, DOI 10.17487/RFC8309, January 2018, 1037 . 1039 [RFC8446] Rescorla, E., "The Transport Layer Security (TLS) Protocol 1040 Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018, 1041 . 1043 [RFC8466] Wen, B., Fioccola, G., Ed., Xie, C., and L. Jalil, "A YANG 1044 Data Model for Layer 2 Virtual Private Network (L2VPN) 1045 Service Delivery", RFC 8466, DOI 10.17487/RFC8466, October 1046 2018, . 1048 [RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications 1049 for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, 1050 September 2019, . 1052 [RFC8907] Dahm, T., Ota, A., Medway Gash, D.C., Carrel, D., and L. 1053 Grant, "The Terminal Access Controller Access-Control 1054 System Plus (TACACS+) Protocol", RFC 8907, 1055 DOI 10.17487/RFC8907, September 2020, 1056 . 1058 [RFC8969] Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and 1059 L. Geng, "A Framework for Automating Service and Network 1060 Management with YANG", RFC 8969, DOI 10.17487/RFC8969, 1061 January 2021, . 1063 Appendix A. Changes between revisions 1065 v03 - v04 1067 * Address comments from Mohamed Boucadair 1069 v00 - v01 1071 * Cover the feedback received during the WG call for adoption 1073 Acknowledgements 1075 The authors would like to thank Stephane Litkowski, Charles Eckel, 1076 Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin, 1077 Eric Vyncke, and Mohamed Boucadair for their reviews and feedback. 1079 Authors' Addresses 1081 Benoit Claise 1082 Huawei 1083 Email: benoit.claise@huawei.com 1085 Jean Quilbeuf 1086 Huawei 1087 Email: jean.quilbeuf@huawei.com 1089 Diego R. Lopez 1090 Telefonica I+D 1091 Don Ramon de la Cruz, 82 1092 Madrid 28006 1093 Spain 1094 Email: diego.r.lopez@telefonica.com 1096 Dan Voyer 1097 Bell Canada 1098 Canada 1099 Email: daniel.voyer@bell.ca 1100 Thangam Arumugam 1101 Cisco Systems, Inc. 1102 Milpitas (California), 1103 United States of America 1104 Email: tarumuga@cisco.com