idnits 2.17.1

draft-claise-opsawg-service-assurance-architecture-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does
     not match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning.  Boilerplate error?  (The document does seem to
     have the reference to RFC 2119 which the ID-Checklist requires).

  -- The document date (January 2, 2021) is 1209 days in the past.  Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  -- No information found for draft-claise-opsawg-service-assurance-yang -
     is the name correct?

  -- Obsolete informational reference (is this intentional?): RFC 3164
     (Obsoleted by RFC 5424)

     Summary: 0 errors (**), 0 flaws (~~), 2 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

OPSAWG                                                         B. Claise
Internet-Draft                                       Cisco Systems, Inc.
Intended status: Informational                               J. Quilbeuf
Expires: July 6, 2021                                        Independent
                                                                D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                             T. Arumugam
                                                      Cisco Systems, Inc.
                                                         January 2, 2021

       Service Assurance for Intent-based Networking Architecture
          draft-claise-opsawg-service-assurance-architecture-04

Abstract

   This document describes an architecture for Service Assurance for
   Intent-based Networking (SAIN).  This architecture aims at assuring
   that service instances are correctly running.  As services rely on
   multiple sub-services provided by the underlying network devices,
   getting the assurance of a healthy service is only possible with a
   holistic view of the network devices.  This architecture not only
   helps to correlate a service degradation with the network root cause
   but also identifies the services impacted when a network component
   fails or degrades.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 6, 2021.
Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Terminology
   2.  Introduction
   3.  Architecture
     3.1.  Decomposing a Service Instance Configuration into an
           Assurance Graph
     3.2.  Intent and Assurance Graph
     3.3.  Subservices
     3.4.  Building the Expression Graph from the Assurance Graph
     3.5.  Building the Expression from a Subservice
     3.6.  Open Interfaces with YANG Modules
     3.7.  Handling Maintenance Windows
     3.8.  Flexible Architecture
     3.9.  Timing
     3.10. New Assurance Graph Generation
   4.  Security Considerations
   5.  IANA Considerations
   6.  Contributors
   7.  Open Issues
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Appendix A.  Changes between revisions
   Acknowledgements
   Authors' Addresses

1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   SAIN Agent: Component that communicates with a device, a set of
   devices, or another agent to build an expression graph from a
   received assurance graph and perform the corresponding computation.

   Assurance Graph: DAG representing the assurance case for one or
   several service instances.  The nodes (also known as vertices in the
   context of a DAG) are the service instances themselves and the
   subservices; the edges indicate dependency relations.

   SAIN Collector: Component that fetches or receives the computer-
   consumable output of the agent(s) and displays it in a user-friendly
   form or processes it locally.

   DAG: Directed Acyclic Graph.
   ECMP: Equal-Cost Multipath.

   Expression Graph: Generic term for a DAG representing a computation
   in SAIN.  More specific terms are:

   o  Subservice Expressions: expression graph representing all the
      computations to execute for a subservice.

   o  Service Expressions: expression graph representing all the
      computations to execute for a service instance, i.e., including
      the computations for all dependent subservices.

   o  Global Computation Graph: expression graph representing all the
      computations to execute for all service instances (i.e., all
      computations performed).

   Dependency: The directed relationship between subservice instances
   in the assurance graph.

   Informational Dependency: Type of dependency whose score does not
   impact the score of its parent subservice or service instance(s) in
   the assurance graph.  However, the symptoms should be taken into
   account in the parent service instance or subservice instance(s),
   for informational reasons.

   Impacting Dependency: Type of dependency whose score impacts the
   score of its parent subservice or service instance(s) in the
   assurance graph.  The symptoms are taken into account in the parent
   service instance or subservice instance(s), as the impacting
   reasons.

   Metric: Information retrieved from a network device.

   Metric Engine: Maps metrics to a list of candidate metric
   implementations depending on the target model.

   Metric Implementation: Actual way of retrieving a metric from a
   device.

   Network Service YANG Module: Describes the characteristics of a
   service, as agreed upon with consumers of that service [RFC8199].

   Service Instance: A specific instance of a service.

   Service Configuration Orchestrator: Quoting RFC 8199, "Network
   Service YANG Modules describe the characteristics of a service, as
   agreed upon with consumers of that service.  That is, a service
   module does not expose the detailed configuration parameters of all
   participating network elements and features but describes an
   abstract model that allows instances of the service to be decomposed
   into instance data according to the Network Element YANG Modules of
   the participating network elements.  The service-to-element
   decomposition is a separate process; the details depend on how the
   network operator chooses to realize the service.  For the purpose of
   this document, the term "orchestrator" is used to describe a system
   implementing such a process."

   SAIN Orchestrator: Component of SAIN in charge of fetching the
   configuration specific to each service instance and converting it
   into an assurance graph.

   Health Status: Score and symptoms indicating whether a service
   instance or a subservice is healthy.  A non-maximal score MUST
   always be explained by one or more symptoms.

   Health Score: Integer ranging from 0 to 100 indicating the health of
   a subservice.  A score of 0 means that the subservice is broken; a
   score of 100 means that the subservice is operating perfectly.

   Subservice: Part of an assurance graph that assures a specific
   feature or subpart of the network system.

   Symptom: Reason explaining why a service instance or a subservice is
   not completely healthy.
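   To make the health status and dependency-type definitions above more
   concrete, the following short Python sketch shows one possible in-
   memory representation.  It is purely illustrative and not part of
   any SAIN specification; all names are hypothetical.

      from dataclasses import dataclass, field
      from enum import Enum
      from typing import List

      class DependencyType(Enum):
          """Dependency types defined in the terminology above."""
          IMPACTING = "impacting"          # score affects the parent's score
          INFORMATIONAL = "informational"  # only symptoms are propagated

      @dataclass
      class HealthStatus:
          """Health status of a service or subservice instance.

          The score ranges from 0 (broken) to 100 (perfectly
          operational).  A non-maximal score must be explained by at
          least one symptom.
          """
          score: int
          symptoms: List[str] = field(default_factory=list)

          def __post_init__(self) -> None:
              if not 0 <= self.score <= 100:
                  raise ValueError("health score must be between 0 and 100")
              if self.score < 100 and not self.symptoms:
                  raise ValueError("a non-maximal score requires a symptom")

      # Example: a degraded subservice with an explaining symptom.
      status = HealthStatus(score=40,
                            symptoms=["Interface has high error rate"])
      print(status)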
2.  Introduction

   Network Service YANG Modules [RFC8199] describe the configuration,
   state data, operations, and notifications of abstract
   representations of services implemented on one or multiple network
   elements.

   Quoting RFC 8199: "Network Service YANG Modules describe the
   characteristics of a service, as agreed upon with consumers of that
   service.  That is, a service module does not expose the detailed
   configuration parameters of all participating network elements and
   features but describes an abstract model that allows instances of
   the service to be decomposed into instance data according to the
   Network Element YANG Modules of the participating network elements.
   The service-to-element decomposition is a separate process; the
   details depend on how the network operator chooses to realize the
   service.  For the purpose of this document, the term "orchestrator"
   is used to describe a system implementing such a process."

   In other words, service configuration orchestrators deploy Network
   Service YANG Modules through the configuration of Network Element
   YANG Modules.  Network configuration is based on those YANG data
   models, with protocols/encodings such as NETCONF/XML [RFC6241],
   RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf, etc.  Since knowing
   that a configuration is applied does not imply that the service is
   running correctly (for example, the service might be degraded
   because of a failure in the network), the network operator must
   monitor the service operational data at the same time as the
   configuration.  The industry has been standardizing on telemetry to
   push network element performance information.

   A network administrator needs to monitor her network and services as
   a whole, independently of the use cases or the management protocols.
   With different protocols come different data models, and different
   ways to model the same type of information.  When network
   administrators deal with multiple protocols, the network management
   systems must perform the difficult and time-consuming job of mapping
   data models: the model used for configuration with the model used
   for monitoring.  This problem is compounded by a large, disparate
   set of data sources (MIB modules, YANG models [RFC7950], IPFIX
   information elements [RFC7011], syslog plain text [RFC3164], TACACS+
   [I-D.ietf-opsawg-tacacs], RADIUS [RFC2865], etc.).  In order to
   avoid this data model mapping, the industry converged on model-
   driven telemetry to stream the service operational data, reusing the
   YANG models used for configuration.  Model-driven telemetry greatly
   facilitates the notion of closed-loop automation, whereby events
   from the network drive remediation changes back into the network.

   However, it proves difficult for network operators to correlate the
   service degradation with the network root cause.  For example, why
   does my L3VPN fail to connect?  Why is this specific service slow?
   The reverse, i.e., which services are impacted when a network
   component fails or degrades, is even more interesting for the
   operators.  For example, which service(s) is(are) impacted when this
   specific optic dBM begins to degrade?  Which application is impacted
   by this ECMP imbalance?  Is that issue actually impacting any other
   customers?
   Intent-based approaches are often declarative, starting from a
   statement such as "The service works correctly" and trying to
   enforce it.  Such approaches are mainly suited for greenfield
   deployments.

   Instead of approaching intent in a declarative way, this framework
   focuses on already defined services and tries to infer the meaning
   of "The service works correctly".  To do so, the framework works
   from an assurance graph, deduced from the service definition and
   from the network configuration.  This assurance graph is decomposed
   into components, which are then assured independently.  The root of
   the assurance graph represents the service to assure, and its
   children represent components identified as its direct
   dependencies; each component can have dependencies as well.  The
   SAIN architecture maintains the correct assurance graph when
   services are modified or when the network conditions change.

   When a service is degraded, the framework will highlight where in
   the assurance graph to look, as opposed to going hop by hop to
   troubleshoot the issue.  Not only can this framework help to
   correlate service degradation with network root cause/symptoms, but
   it can deduce from the assurance graph the number and type of
   services impacted by a component degradation/failure.  This added
   value informs the operational team where to focus its attention for
   maximum return.

   This architecture provides the building blocks to assure both
   physical and virtual entities and is flexible with respect to
   services and subservices, (distributed) graphs, and components
   (Section 3.8).

3.  Architecture

   SAIN aims at assuring that service instances are operating correctly
   and, if not, at pinpointing what is wrong.  More precisely, SAIN
   computes a score for each service instance and outputs symptoms
   explaining that score, especially why the score is not maximal.  The
   score augmented with the symptoms is called the health status.

   The SAIN architecture is a generic architecture, applicable to
   multiple environments: wireline obviously, but also wireless,
   including 5G, virtual infrastructure managers (VIM), and even
   virtual functions.  Thanks to the distributed graph design
   principle, graphs from different environments/orchestrators can be
   combined together.

   As an example of a service, let us consider a point-to-point L2VPN
   connection (i.e., a pseudowire).  Such a service would take as
   parameters the two ends of the connection (device, interface or
   subinterface, and address of the other end) and configure both
   devices (and maybe more) so that an L2VPN connection is established
   between the two devices.  Examples of symptoms might be "Interface
   has high error rate", "Interface flapping", or "Device almost out of
   memory".

   To compute the health status of such a service, the service is
   decomposed into an assurance graph formed by subservices linked
   through dependencies.  Each subservice is then turned into an
   expression graph that details how to fetch metrics from the devices
   and compute the health status of the subservice.  The subservice
   expressions are combined according to the dependencies between the
   subservices in order to obtain the expression graph that computes
   the health status of the service.
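   As an illustration of how subservice health statuses might be
   combined according to impacting and informational dependencies, the
   following Python sketch implements one possible heuristic (taking
   the minimum score over impacting dependencies).  This heuristic is
   an assumption of the sketch, not a rule mandated by this
   architecture, and all names are hypothetical.

      from dataclasses import dataclass, field
      from typing import List, Tuple

      IMPACTING = "impacting"
      INFORMATIONAL = "informational"

      @dataclass
      class Status:
          score: int                      # 0 (broken) .. 100 (healthy)
          symptoms: List[str] = field(default_factory=list)

      def combine(own: Status, deps: List[Tuple[str, Status]]) -> Status:
          """Combine a node's own status with those of its dependencies.

          Impacting dependencies may lower the score; informational
          dependencies only contribute their symptoms.
          """
          score = own.score
          symptoms = list(own.symptoms)
          for dep_type, dep_status in deps:
              symptoms.extend(dep_status.symptoms)
              if dep_type == IMPACTING:
                  score = min(score, dep_status.score)
          return Status(score=score, symptoms=symptoms)

      # A tunnel service with a flapping interface (impacting) and a
      # device low on memory (informational, in this toy example).
      tunnel = combine(
          Status(100),
          [(IMPACTING, Status(30, ["Interface flapping"])),
           (INFORMATIONAL, Status(60, ["Device almost out of memory"]))])
      print(tunnel.score, tunnel.symptoms)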
   The overall architecture of our solution is presented in Figure 1.
   Based on the service configuration, the SAIN orchestrator deduces
   the assurance graph.  It then sends the assurance graph, along with
   some other configuration options, to the SAIN agents.  The SAIN
   agents are responsible for building the expression graph and
   computing the health statuses in a distributed manner.  The
   collector is in charge of collecting and displaying the current
   inferred health status of the service instances and subservices.
   Finally, the automation loop is closed by having the SAIN Collector
   provide feedback to the network orchestrator.

     +-----------------+
     |     Service     |
     |  Configuration  |<--------------------+
     |  Orchestrator   |                     |
     +-----------------+                     |
       |     |                               |
       |     | Network                       |
       |     | Service                       | Feedback
       |     | Instance                      | Loop
       |     | Configuration                 |
       |     |                               |
       |     V                               |
       | +-----------------+     +-------------------+
       | |      SAIN       |     |       SAIN        |
       | |  Orchestrator   |     |     Collector     |
       | +-----------------+     +-------------------+
       |     |                             ^
       |     | Configuration               | Health Status
       |     | (assurance graph)           | (Score + Symptoms)
       |     V                             | Streamed
       | +-------------------+             | via Telemetry
       | |+-------------------+            |
       | ||+-------------------+           |
       | +||       SAIN        |-----------+
       |  +|       agent       |
       |   +-------------------+
       |        ^  ^  ^
       |        |  |  |
       |        |  |  |  Metric Collection
       V        V  V  V
     +-------------------------------------------------------------+
     |                     Monitored Entities                      |
     |                                                             |
     +-------------------------------------------------------------+

                       Figure 1: SAIN Architecture

   In order to produce the score assigned to a service instance, the
   architecture performs the following tasks:

   o  Analyze the configuration pushed to the network device(s) for
      configuring the service instance and decide which information is
      needed from the device(s) (such a piece of information is called
      a metric) and which operations to apply to the metrics to compute
      the health status.

   o  Stream (via telemetry [RFC8641]) operational and config metric
      values when possible, else continuously poll.

   o  Continuously compute the health status of the service instances,
      based on the metric values.

3.1.  Decomposing a Service Instance Configuration into an Assurance
      Graph

   In order to structure the assurance of a service instance, the
   service instance is decomposed into so-called subservice instances.
   Each subservice instance focuses on a specific feature or subpart of
   the network system.

   The decomposition into subservices is an important function of this
   architecture, for the following reasons:

   o  The result of this decomposition provides a relational picture of
      a service instance, which can be represented as a graph (called
      the assurance graph) to the operator.

   o  Subservices provide a scope for particular expertise and thereby
      enable contribution from external experts.  For instance, the
      subservice dealing with the optics health should be reviewed and
      extended by an expert in optical interfaces.
   o  Subservices that are common to several service instances are
      reused, reducing the amount of computation needed.

   The assurance graph of a service instance is a DAG representing the
   structure of the assurance case for the service instance.  The nodes
   of this graph are service instances or subservice instances.  Each
   edge of this graph indicates a dependency between the two nodes at
   its extremities: the service or subservice at the source of the edge
   depends on the service or subservice at the destination of the edge.

   Figure 2 depicts a simplistic example of the assurance graph for a
   tunnel service.  The node at the top is the service instance, the
   nodes below are its dependencies.  In the example, the tunnel
   service instance depends on the peer1 and peer2 tunnel interfaces,
   which in turn depend on the respective physical interfaces, which
   finally depend on the respective peer1 and peer2 devices.  The
   tunnel service instance also depends on the IP connectivity, which
   depends on the IS-IS routing protocol.

                         +------------------+
                         |      Tunnel      |
                         | Service Instance |
                         +------------------+
                                  |
              +-------------------+-------------------+
              |                   |                   |
       +-------------+     +-------------+     +--------------+
       |    Peer1    |     |    Peer2    |     |      IP      |
       |    Tunnel   |     |    Tunnel   |     | Connectivity |
       |  Interface  |     |  Interface  |     |              |
       +-------------+     +-------------+     +--------------+
              |                   |                   |
       +-------------+     +-------------+     +-------------+
       |    Peer1    |     |    Peer2    |     |    IS-IS    |
       |   Physical  |     |   Physical  |     |   Routing   |
       |  Interface  |     |  Interface  |     |   Protocol  |
       +-------------+     +-------------+     +-------------+
              |                   |
       +-------------+     +-------------+
       |    Peer1    |     |    Peer2    |
       |    Device   |     |    Device   |
       +-------------+     +-------------+

                   Figure 2: Assurance Graph Example

   Depicting the assurance graph helps the operator to understand (and
   assert) the decomposition.  The assurance graph shall be maintained
   during normal operation, with addition, modification, and removal of
   service instances.  A change in the network configuration or
   topology shall be reflected in the assurance graph.  As a first
   example, a change of routing protocol from IS-IS to OSPF would
   change the assurance graph accordingly.  As a second example, assume
   that ECMP is in place for the source router of that specific tunnel;
   in that case, multiple interfaces must now be monitored, on top of
   monitoring the ECMP health itself.
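   As an illustration only, the assurance graph of Figure 2 could be
   represented in memory as the Python structure below.  The node
   identifiers, subservice type names, and parameters are hypothetical,
   and all dependencies are assumed to be impacting for simplicity; the
   actual data model for assurance graphs is specified in
   [I-D.claise-opsawg-service-assurance-yang].

      # Nodes are (sub)service instances identified by a type and
      # parameters; each dependency edge points from the dependent node
      # to the node it depends on, with its dependency type.
      assurance_graph = {
          "nodes": {
              "tunnel-svc":    {"type": "service",
                                "params": {"name": "tunnel1"}},
              "peer1-tun-if":  {"type": "interface",
                                "params": {"device": "peer1", "interface": "tu0"}},
              "peer2-tun-if":  {"type": "interface",
                                "params": {"device": "peer2", "interface": "tu0"}},
              "ip-conn":       {"type": "ip-connectivity",
                                "params": {"src": "peer1", "dst": "peer2"}},
              "peer1-phys-if": {"type": "interface",
                                "params": {"device": "peer1", "interface": "eth0"}},
              "peer2-phys-if": {"type": "interface",
                                "params": {"device": "peer2", "interface": "eth0"}},
              "isis":          {"type": "routing-protocol",
                                "params": {"protocol": "is-is"}},
              "peer1-dev":     {"type": "device", "params": {"device": "peer1"}},
              "peer2-dev":     {"type": "device", "params": {"device": "peer2"}},
          },
          "dependencies": [
              ("tunnel-svc",    "peer1-tun-if",  "impacting"),
              ("tunnel-svc",    "peer2-tun-if",  "impacting"),
              ("tunnel-svc",    "ip-conn",       "impacting"),
              ("peer1-tun-if",  "peer1-phys-if", "impacting"),
              ("peer2-tun-if",  "peer2-phys-if", "impacting"),
              ("ip-conn",       "isis",          "impacting"),
              ("peer1-phys-if", "peer1-dev",     "impacting"),
              ("peer2-phys-if", "peer2-dev",     "impacting"),
          ],
      }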
3.2.  Intent and Assurance Graph

   The SAIN orchestrator analyzes the configuration of a service
   instance to:

   o  Try to capture the intent of the service instance, i.e., what the
      service instance is trying to achieve,

   o  Decompose the service instance into subservices representing the
      network features on which the service instance relies.

   The SAIN orchestrator must be able to analyze configuration from
   various devices and produce the assurance graph.

   To schematize what a SAIN orchestrator does, assume that the
   configuration for a service instance touches two devices and
   configures a virtual tunnel interface on each of them.  Then:

   o  Capturing the intent would start by detecting that the service
      instance is actually a tunnel between the two devices, and
      stating that this tunnel must be functional.  This is the current
      state of SAIN; however, it does not completely capture the
      intent, which might additionally include, for instance, the
      latency and bandwidth requirements of this tunnel.

   o  Decomposing the service instance into subservices would result in
      the assurance graph depicted in Figure 2, for instance.

   In order for SAIN to be applied, the configuration necessary for
   each service instance should be identifiable and thus should come
   from a "service-aware" source.  While Figure 1 makes a distinction
   between the SAIN orchestrator and a different component providing
   the service instance configuration, in practice those two components
   are most likely combined.  The internals of the orchestrator are
   currently out of scope of this document.

3.3.  Subservices

   A subservice corresponds to a subpart or a feature of the network
   system that is needed for a service instance to function properly.
   In the context of SAIN, "subservice" is actually a shortcut for
   subservice assurance, that is, the method for assuring that a
   subservice behaves correctly.

   Subservices, just like services, have high-level parameters that
   specify the type and specific instance to be assured.  For example,
   assuring a device requires the specific deviceId as a parameter, and
   assuring an interface requires the specific combination of deviceId
   and interfaceId.

   A subservice is also characterized by a list of metrics to fetch and
   a list of computations to apply to these metrics in order to infer a
   health status.

3.4.  Building the Expression Graph from the Assurance Graph

   From the assurance graph is derived a so-called global computation
   graph.  First, each subservice instance is transformed into a set of
   subservice expressions that take metrics and constants as input
   (i.e., sources of the DAG) and produce the status of the subservice,
   based on some heuristics.  Then, for each service instance, the
   service expressions are constructed by combining the subservice
   expressions of its dependencies.  The way service expressions are
   combined depends on the dependency types (impacting or
   informational).  Finally, the global computation graph is built by
   combining the service expressions.  In other words, the global
   computation graph encodes all the operations needed to produce
   health statuses from the collected metrics.

   Subservices shall be device independent.  To justify this, let's
   consider the interface operational status.  Depending on the device
   capabilities, this status can be collected by an industry-accepted
   YANG module (IETF, OpenConfig), by a vendor-specific YANG module, or
   even by a MIB module.  If the subservice were dependent on the
   mechanism used to collect the operational status, then we would need
   multiple subservice definitions in order to support all the
   different mechanisms.  This also implies that, while waiting for all
   the metrics to be available via standard YANG modules, SAIN agents
   might have to retrieve metric values via non-standard YANG models,
   via MIB modules, the Command Line Interface (CLI), etc., effectively
   implementing a normalization layer between data models and
   information models.

   In order to keep subservices independent from the metric collection
   method, or, expressed differently, to support multiple combinations
   of platforms, OSes, and even vendors, the framework introduces the
   concept of "metric engine".  The metric engine maps each device-
   independent metric used in the subservices to a list of device-
   specific metric implementations that precisely define how to fetch
   values for that metric.  The mapping is parameterized by the
   characteristics (model, OS version, etc.) of the device from which
   the metrics are fetched.
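   The following Python sketch illustrates this mapping idea for a
   single device-independent metric.  The platform keys, vendor YANG
   path, and helper names are hypothetical examples, not an actual
   catalog defined by this architecture.

      from dataclasses import dataclass
      from typing import Dict

      @dataclass
      class MetricImplementation:
          """How to fetch one metric from one class of device."""
          protocol: str   # e.g. "netconf", "snmp", "cli"
          locator: str    # e.g. a YANG path, an OID name, a CLI command

      # Device-independent metric name -> per-platform implementations.
      METRIC_CATALOG: Dict[str, Dict[str, MetricImplementation]] = {
          "interface-oper-status": {
              "ietf-yang":   MetricImplementation(
                  "netconf",
                  "/ietf-interfaces:interfaces/interface/oper-status"),
              "vendor-yang": MetricImplementation(
                  "netconf",
                  "/vendor-if:interfaces/interface/state/oper"),
              "mib-only":    MetricImplementation(
                  "snmp", "IF-MIB::ifOperStatus"),
          },
      }

      def resolve(metric: str, platform: str) -> MetricImplementation:
          """Map a device-independent metric to a device-specific
          implementation, based on the characteristics of the target
          device (reduced here to a single 'platform' string)."""
          try:
              return METRIC_CATALOG[metric][platform]
          except KeyError as exc:
              raise LookupError(
                  f"no implementation of {metric!r} for {platform!r}") from exc

      print(resolve("interface-oper-status", "mib-only"))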
3.5.  Building the Expression from a Subservice

   In addition to the list of metrics, each subservice defines a list
   of expressions to apply to the metrics in order to compute the
   health status of the subservice.  The definition or the
   standardization of those expressions (also known as heuristics) is
   currently out of scope of this document.

3.6.  Open Interfaces with YANG Modules

   The interfaces between the architecture components are open thanks
   to the YANG modules specified in "YANG Modules for Service
   Assurance" [I-D.claise-opsawg-service-assurance-yang]; they specify
   objects for assuring network services based on their decomposition
   into so-called subservices, according to the SAIN architecture.

   This module is intended for the following use cases:

   o  Assurance graph configuration:

      *  Subservices: configure a set of subservices to assure, by
         specifying their types and parameters.

      *  Dependencies: configure the dependencies between the
         subservices, along with their types.

   o  Assurance telemetry: export the health status of the subservices,
      along with the observed symptoms.

3.7.  Handling Maintenance Windows

   Whenever network components are under maintenance, the operator
   wants to inhibit the emission of symptoms from those components.  A
   typical use case is device maintenance, during which the device is
   not supposed to be operational.  As such, symptoms related to the
   device health should be ignored, as well as symptoms related to the
   device-specific subservices, such as the interfaces, as their state
   changes are probably the consequence of the maintenance.

   To configure network components as "under maintenance" in the SAIN
   architecture, the ietf-service-assurance model proposed in
   [I-D.claise-opsawg-service-assurance-yang] specifies an "under-
   maintenance" flag per service or subservice instance.  When, and
   only when, this flag is set, the companion field "maintenance-
   contact" must be set to a string that identifies the person or
   process who requested the maintenance.  Any symptom produced by a
   service or subservice under maintenance, or by one of its
   dependencies, MUST NOT be reported.  A service or subservice under
   maintenance MAY propagate an "Under Maintenance" symptom towards
   services or subservices that depend on it.

   We illustrate this mechanism with three independent examples based
   on the assurance graph depicted in Figure 2:

   o  Device maintenance, for instance upgrading the device OS.  The
      operator sets the "under-maintenance" flag for the subservice
      "Peer1 Device".  This inhibits the emission of symptoms from
      "Peer1 Physical Interface", "Peer1 Tunnel Interface", and "Tunnel
      Service Instance".  All other subservices are unaffected.

   o  Interface maintenance, for instance replacing a broken optic.
      The operator sets the "under-maintenance" flag for the subservice
      "Peer1 Physical Interface".  This inhibits the emission of
      symptoms from "Peer1 Tunnel Interface" and "Tunnel Service
      Instance".  All other subservices are unaffected.

   o  Routing protocol maintenance, for instance modifying parameters
      or redistribution.  The operator sets the "under-maintenance"
      flag for the subservice "IS-IS Routing Protocol".  This inhibits
      the emission of symptoms from "IP Connectivity" and "Tunnel
      Service Instance".  All other subservices are unaffected.
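   The Python sketch below illustrates the behaviour described in the
   examples above: when a node is flagged as under maintenance, the
   symptoms of that node and of the nodes that (transitively) depend on
   it are not reported, and may be replaced by a single "Under
   Maintenance" symptom.  This is an illustrative simplification under
   those assumptions, not the normative behaviour defined in
   [I-D.claise-opsawg-service-assurance-yang]; the node names are taken
   from Figure 2 and the symptoms are invented for the example.

      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class Node:
          symptoms: List[str] = field(default_factory=list)
          under_maintenance: bool = False
          maintenance_contact: str = ""   # who requested the maintenance

      def is_inhibited(name: str, nodes: Dict[str, Node],
                       deps: Dict[str, List[str]]) -> bool:
          """True if the node itself, or any node it (transitively)
          depends on, is under maintenance."""
          if nodes[name].under_maintenance:
              return True
          return any(is_inhibited(d, nodes, deps) for d in deps.get(name, []))

      def reported_symptoms(name: str, nodes: Dict[str, Node],
                            deps: Dict[str, List[str]]) -> List[str]:
          if is_inhibited(name, nodes, deps):
              # Optionally propagate a single "Under Maintenance" symptom.
              return ["Under Maintenance"]
          return list(nodes[name].symptoms)

      # Device maintenance example: upgrading the OS of the Peer1 device.
      nodes = {
          "Tunnel Service Instance": Node(symptoms=["Tunnel down"]),
          "Peer1 Tunnel Interface": Node(symptoms=["Interface down"]),
          "Peer1 Physical Interface": Node(symptoms=["Interface down"]),
          "Peer1 Device": Node(symptoms=["Device unreachable"],
                               under_maintenance=True,
                               maintenance_contact="ops@example.com"),
          "Peer2 Physical Interface":
              Node(symptoms=["Interface has high error rate"]),
      }
      deps = {
          "Tunnel Service Instance": ["Peer1 Tunnel Interface"],
          "Peer1 Tunnel Interface": ["Peer1 Physical Interface"],
          "Peer1 Physical Interface": ["Peer1 Device"],
      }
      for n in nodes:
          print(n, "->", reported_symptoms(n, nodes, deps))
      # "Peer2 Physical Interface" is unaffected and keeps its symptom.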
3.8.  Flexible Architecture

   The SAIN architecture is flexible in terms of components.  While the
   SAIN architecture in Figure 1 makes a distinction between two
   components, the service configuration orchestrator and the SAIN
   orchestrator, in practice those two components are most likely
   combined.  Similarly, the SAIN agents are displayed in Figure 1 as
   separate components.  Practically, the SAIN agents could be either
   independent components or directly integrated in the monitored
   entities.  A practical example is an agent in a router.

   The SAIN architecture is also flexible in terms of services and
   subservices.  Most examples in this document deal with the notion of
   Network Service YANG modules, with well-known services such as L2VPN
   or tunnels.  However, the concept of service is general enough to
   cross into different domains.  One of them is the domain of service
   management on network elements, which also requires its own
   assurance.  Examples include a DHCP server on a Linux server, a data
   plane, an IPFIX export, etc.  The notion of "service" is generic in
   this architecture.  Indeed, a configured service can itself be a
   service for someone else.  Exactly like a DHCP server, a data plane,
   or an IPFIX export can be considered as services for a device, a
   routing instance can be considered as a service for an L3VPN, and a
   tunnel can be considered as a service for an application in the
   cloud.  The assurance graph is created to be flexible and open,
   regardless of the subservice types, locations, or domains.

   The SAIN architecture is also flexible in terms of distributed
   graphs.  As shown in Figure 1, our architecture comprises several
   agents.  Each agent is responsible for handling a subgraph of the
   assurance graph.  The collector is responsible for fetching the
   subgraphs from the different agents and gluing them together.  As an
   example, in the graph from Figure 2, the subservices relative to
   Peer1 might be handled by a different agent than the subservices
   relative to Peer2, and the IP Connectivity and IS-IS subservices
   might be handled by yet another agent.  The agents will export their
   partial graphs and the collector will stitch them together as
   dependencies of the service instance.

   And finally, the SAIN architecture is flexible in terms of what it
   monitors.  Most, if not all, examples in this document refer to
   physical components, but this is not a constraint.  Indeed, the
   assurance of virtual components would follow the same principles,
   and an assurance graph composed of virtualized components (or a mix
   of virtualized and physical ones) is possible within this
   architecture.
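   As an illustration of the distributed graph handling described
   above, the following Python sketch shows a collector gluing together
   the partial graphs exported by three hypothetical agents (one per
   peer and one for the core subservices).  The agent split, node
   names, and statuses are assumptions made for the example.

      from typing import Dict, List

      # Each SAIN agent exports the subgraph it handles: its nodes (with
      # their health statuses) and the dependency edges between them.
      agent_peer1 = {
          "nodes": {"Peer1 Physical Interface": {"score": 100},
                    "Peer1 Device": {"score": 100}},
          "edges": [("Peer1 Physical Interface", "Peer1 Device",
                     "impacting")],
      }
      agent_peer2 = {
          "nodes": {"Peer2 Physical Interface":
                        {"score": 70,
                         "symptoms": ["Interface has high error rate"]},
                    "Peer2 Device": {"score": 100}},
          "edges": [("Peer2 Physical Interface", "Peer2 Device",
                     "impacting")],
      }
      agent_core = {
          "nodes": {"IP Connectivity": {"score": 100},
                    "IS-IS Routing Protocol": {"score": 100}},
          "edges": [("IP Connectivity", "IS-IS Routing Protocol",
                     "impacting")],
      }

      def stitch(subgraphs: List[Dict]) -> Dict:
          """Glue the partial graphs exported by the agents into one
          graph, as a collector would before attaching them as
          dependencies of the service instance."""
          full = {"nodes": {}, "edges": []}
          for sg in subgraphs:
              full["nodes"].update(sg["nodes"])
              full["edges"].extend(sg["edges"])
          return full

      graph = stitch([agent_peer1, agent_peer2, agent_core])
      print(len(graph["nodes"]), "nodes,", len(graph["edges"]), "edges")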
3.9.  Timing

   The SAIN architecture requires the Network Time Protocol (NTP)
   [RFC5905] between all elements: monitored entities, SAIN agents, the
   Service Configuration Orchestrator, the SAIN Collector, as well as
   the SAIN Orchestrator.  This guarantees that all symptoms in the
   system can be correlated, and correlated with the right assurance
   graph version.

   The SAIN agent might have to remove some symptoms for specific
   subservices, because they are outdated and no longer relevant, or
   simply because the SAIN agent needs to free up some space.
   Regardless of the reason, it is important for a SAIN collector
   (re-)connecting to a SAIN agent to understand the effect of this
   garbage collection.  Therefore, the SAIN agent contains a YANG
   object specifying the date and time at which the symptoms history
   starts for the subservice instances.

3.10.  New Assurance Graph Generation

   The assurance graph will change over time, because services and
   subservices come and go (changing the dependencies between
   subservices), or simply because a subservice is now under
   maintenance.  Therefore, an assurance graph version must be
   maintained, along with the date and time of its last generation.
   The date and time of the last change of a particular subservice
   instance (again, its dependencies or its under-maintenance state)
   might be kept as well.  From a client point of view, an assurance
   graph change is signaled by the values of the assurance-graph-
   version and assurance-graph-last-change YANG leafs.  At that point
   in time, the client (collector) applies the following process,
   illustrated by the sketch after this list:

   o  Keep the previous assurance-graph-last-change value (let's call
      it time T)

   o  Run through all subservice instances and process those whose
      last-change is newer than time T

   o  Keep the new assurance-graph-last-change as the new reference
      date and time
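   The sketch below illustrates this client-side process in Python.
   The data layout and helper names are hypothetical; the actual leafs
   are defined in [I-D.claise-opsawg-service-assurance-yang].

      from dataclasses import dataclass
      from datetime import datetime, timezone
      from typing import Dict

      @dataclass
      class SubserviceInfo:
          last_change: datetime  # last time this subservice instance changed

      def process_graph_update(previous_last_change: datetime,
                               new_graph_last_change: datetime,
                               subservices: Dict[str, SubserviceInfo]
                               ) -> datetime:
          """Keep the previous assurance-graph-last-change value (time T),
          re-read only the subservice instances whose last-change is newer
          than T, and return the new reference date and time."""
          for name, info in subservices.items():
              if info.last_change > previous_last_change:
                  print(f"re-reading subservice {name}")  # placeholder action
          return new_graph_last_change

      # Hypothetical example: only the IP Connectivity subservice changed
      # since the previous synchronization at time T.
      t = datetime(2021, 1, 1, tzinfo=timezone.utc)
      subservices = {
          "Peer1 Device": SubserviceInfo(
              last_change=datetime(2020, 12, 1, tzinfo=timezone.utc)),
          "IP Connectivity": SubserviceInfo(
              last_change=datetime(2021, 1, 2, tzinfo=timezone.utc)),
      }
      t = process_graph_update(
          t, datetime(2021, 1, 2, tzinfo=timezone.utc), subservices)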
4.  Security Considerations

   The SAIN architecture helps operators to reduce the mean time to
   detect and the mean time to repair.  As such, it should not cause
   any security threats.  However, the SAIN agents must be secure: a
   compromised SAIN agent could send wrong root causes or symptoms to
   the management systems.

   Except for the configuration of telemetry, the agents do not need
   "write access" to the devices they monitor.  This configuration is
   applied with a YANG module, whose protection is covered by Secure
   Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF.

   The data collected by SAIN could potentially be compromising to the
   network or provide more insight into how the network is designed.
   Considering the data that SAIN requires (including CLI access in
   some cases), one should weigh data access concerns against the
   impact that reduced visibility will have on being able to rapidly
   identify root causes.

   If a closed-loop system relies on this architecture, then the well-
   known issues of such systems also apply: a lying device or a
   compromised agent could trigger partial reconfiguration of the
   service or network.  The SAIN architecture neither augments nor
   reduces this risk.

5.  IANA Considerations

   This document includes no request to IANA.

6.  Contributors

   o  Youssef El Fathi

   o  Eric Vyncke

7.  Open Issues

   Refer to the Intent-based Networking NMRG documents.

8.  References

8.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
              "Network Time Protocol Version 4: Protocol and Algorithms
              Specification", RFC 5905, DOI 10.17487/RFC5905, June
              2010, <https://www.rfc-editor.org/info/rfc5905>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

8.2.  Informative References

   [I-D.claise-opsawg-service-assurance-yang]
              Claise, B. and J. Quilbeuf, "YANG Modules for Service
              Assurance", draft-claise-opsawg-service-assurance-yang
              (work in progress), February 2020.

   [I-D.ietf-opsawg-tacacs]
              Dahm, T., Ota, A., dcmgash@cisco.com, d., Carrel, D., and
              L. Grant, "The TACACS+ Protocol", draft-ietf-opsawg-
              tacacs-18 (work in progress), March 2020.

   [RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
              "Remote Authentication Dial In User Service (RADIUS)",
              RFC 2865, DOI 10.17487/RFC2865, June 2000,
              <https://www.rfc-editor.org/info/rfc2865>.

   [RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
              DOI 10.17487/RFC3164, August 2001,
              <https://www.rfc-editor.org/info/rfc3164>.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J.,
              Ed., and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
              <https://www.rfc-editor.org/info/rfc6241>.

   [RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
              Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
              <https://www.rfc-editor.org/info/rfc6242>.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013,
              <https://www.rfc-editor.org/info/rfc7011>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling
              Language", RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/info/rfc7950>.

   [RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
              Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
              <https://www.rfc-editor.org/info/rfc8040>.

   [RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
              Classification", RFC 8199, DOI 10.17487/RFC8199, July
              2017, <https://www.rfc-editor.org/info/rfc8199>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS)
              Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446,
              August 2018, <https://www.rfc-editor.org/info/rfc8446>.

   [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG
              Notifications for Datastore Updates", RFC 8641,
              DOI 10.17487/RFC8641, September 2019,
              <https://www.rfc-editor.org/info/rfc8641>.

Appendix A.  Changes between revisions

   v02 - v03

   o  Timing Concepts

   o  New Assurance Graph Generation

   v01 - v02

   o  Handling maintenance windows

   o  Flexible architecture better explained

   o  Improved the terminology

   o  Notion of mapping information model to data model, while waiting
      for YANG to be everywhere

   o  Started a security considerations section

   v00 - v01

   o  Terminology clarifications

   o  Figure 1 improved

Acknowledgements

   The authors would like to thank Stephane Litkowski, Charles Eckel,
   Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin,
   and Eric Vyncke for their reviews and feedback.
Authors' Addresses

   Benoit Claise
   Cisco Systems, Inc.
   De Kleetlaan 6a b1
   1831 Diegem
   Belgium

   Email: bclaise@cisco.com

   Jean Quilbeuf
   Independent

   Email: jean@quilbeuf.net

   Diego R. Lopez
   Telefonica I+D
   Don Ramon de la Cruz, 82
   Madrid  28006
   Spain

   Email: diego.r.lopez@telefonica.com

   Dan Voyer
   Bell Canada
   Canada

   Email: daniel.voyer@bell.ca

   Thangam Arumugam
   Cisco Systems, Inc.
   Milpitas (California)
   United States

   Email: tarumuga@cisco.com