OPSAWG                                                         B. Claise
Internet-Draft                                               J. Quilbeuf
Intended status: Informational                       Cisco Systems, Inc.
Expires: September 10, 2020                                  Y. El Fathi
                                                Orange Business Services
                                                                D. Lopez
                                                          Telefonica I+D
                                                                D. Voyer
                                                             Bell Canada
                                                           March 9, 2020


      Service Assurance for Intent-based Networking Architecture
        draft-claise-opsawg-service-assurance-architecture-02

Abstract

This document describes an architecture for Service Assurance for
Intent-based Networking (SAIN).  This architecture aims at assuring
that service instances are correctly running.  As services rely on
multiple subservices provided by the underlying network devices,
assuring a healthy service is only possible with a holistic view of
the network devices.  This architecture not only helps to correlate a
service degradation with its network root cause but also identifies
which services are impacted when a network component fails or
degrades.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on September 10, 2020.

Copyright Notice

Copyright (c) 2020 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.  Code Components
extracted from this document must include Simplified BSD License text
as described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Simplified BSD License.

Table of Contents

1.  Terminology
2.  Introduction
3.  Architecture
  3.1.  Decomposing a Service Instance Configuration into an
        Assurance Graph
  3.2.  Intent and Assurance Graph
  3.3.  Subservices
  3.4.  Building the Expression Graph from the Assurance Graph
  3.5.  Building the Expression from a Subservice
  3.6.  Open Interfaces with YANG Modules
  3.7.  Handling Maintenance Windows
  3.8.  Flexible Architecture
4.  Security Considerations
5.  IANA Considerations
6.  Open Issues
7.  References
  7.1.  Normative References
  7.2.  Informative References
Appendix A.  Changes between revisions
Acknowledgements
Authors' Addresses

1.  Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.

SAIN Agent: Component that communicates with a device, a set of
devices, or another agent to build an expression graph from a
received assurance graph and perform the corresponding computation.

Assurance Graph: DAG representing the assurance case for one or
several service instances.  The nodes (also known as vertices in the
context of a DAG) are the service instances themselves and the
subservices; the edges indicate dependency relations.

SAIN Collector: Component that fetches or receives the computer-
consumable output of the agent(s) and displays it in a user-friendly
form or processes it locally.

DAG: Directed Acyclic Graph.

ECMP: Equal-Cost Multipath.

Expression Graph: Generic term for a DAG representing a computation
in SAIN.  More specific terms are:

o  Subservice Expressions: expression graph representing all the
   computations to execute for a subservice.

o  Service Expressions: expression graph representing all the
   computations to execute for a service instance, i.e., including
   the computations for all dependent subservices.

o  Global Computation Graph: expression graph representing all the
   computations to execute for all service instances (i.e., all
   computations performed).

Dependency: The directed relationship between subservice instances in
the assurance graph.
Informational Dependency: Type of dependency whose score does not
impact the score of its parent subservice or service instance(s) in
the assurance graph.  However, the symptoms should be taken into
account in the parent service instance or subservice instance(s), for
informational reasons.

Impacting Dependency: Type of dependency whose score impacts the
score of its parent subservice or service instance(s) in the
assurance graph.  The symptoms are taken into account in the parent
service instance or subservice instance(s), as the impacting reasons.

Metric: Information retrieved from a network device.

Metric Engine: Maps metrics to a list of candidate metric
implementations depending on the target model.

Metric Implementation: Actual way of retrieving a metric from a
device.

Network Service YANG Module: Describes the characteristics of a
service, as agreed upon with consumers of that service [RFC8199].

Service Instance: A specific instance of a service.

Service Configuration Orchestrator: Quoting RFC 8199, "Network
Service YANG Modules describe the characteristics of a service, as
agreed upon with consumers of that service.  That is, a service
module does not expose the detailed configuration parameters of all
participating network elements and features but describes an
abstract model that allows instances of the service to be decomposed
into instance data according to the Network Element YANG Modules of
the participating network elements.  The service-to-element
decomposition is a separate process; the details depend on how the
network operator chooses to realize the service.  For the purpose of
this document, the term "orchestrator" is used to describe a system
implementing such a process."

SAIN Orchestrator: Component of SAIN in charge of fetching the
configuration specific to each service instance and converting it
into an assurance graph.

Health Status: Score and symptoms indicating whether a service
instance or a subservice is healthy.  A non-maximal score MUST always
be explained by one or more symptoms.

Health Score: Integer ranging from 0 to 100 indicating the health of
a subservice.  A score of 0 means that the subservice is broken; a
score of 100 means that the subservice is perfectly operational.

Subservice: Part of an assurance graph that assures a specific
feature or subpart of the network system.

Symptom: Reason explaining why a service instance or a subservice is
not completely healthy.

2.  Introduction

Network Service YANG Modules [RFC8199] describe the configuration,
state data, operations, and notifications of abstract representations
of services implemented on one or multiple network elements.

Quoting RFC 8199: "Network Service YANG Modules describe the
characteristics of a service, as agreed upon with consumers of that
service.  That is, a service module does not expose the detailed
configuration parameters of all participating network elements and
features but describes an abstract model that allows instances of the
service to be decomposed into instance data according to the Network
Element YANG Modules of the participating network elements.  The
service-to-element decomposition is a separate process; the details
depend on how the network operator chooses to realize the service.
For the purpose of this document, the term "orchestrator" is used to
describe a system implementing such a process."

In other words, service configuration orchestrators deploy Network
Service YANG Modules through the configuration of Network Element
YANG Modules.  Network configuration is based on those YANG data
models, with protocols/encodings such as NETCONF/XML [RFC6241],
RESTCONF/JSON [RFC8040], gNMI/gRPC/protobuf, etc.  Since knowing that
a configuration has been applied does not imply that the service is
running correctly (for example, the service might be degraded because
of a failure in the network), the network operator must monitor the
service operational data at the same time as the configuration.  The
industry has been standardizing on telemetry to push network element
performance information.

A network administrator needs to monitor her network and services as
a whole, independently of the use cases or the management protocols.
With different protocols come different data models, and different
ways to model the same type of information.  When network
administrators deal with multiple protocols, the network management
system must perform the difficult and time-consuming job of mapping
data models: the model used for configuration with the model used
for monitoring.  This problem is compounded by a large, disparate set
of data sources (MIB modules, YANG models [RFC7950], IPFIX
information elements [RFC7011], syslog plain text [RFC3164], TACACS+
[I-D.ietf-opsawg-tacacs], RADIUS [RFC2865], etc.).  In order to avoid
this data model mapping, the industry converged on model-driven
telemetry to stream the service operational data, reusing the YANG
models used for configuration.  Model-driven telemetry greatly
facilitates the notion of closed-loop automation, whereby events from
the network drive remediation changes back into the network.

However, it proves difficult for network operators to correlate a
service degradation with its network root cause.  For example, why
does my L3VPN fail to connect?  Why is this specific service slow?
The reverse, i.e., which services are impacted when a network
component fails or degrades, is even more interesting for the
operators.  For example, which service(s) is (are) impacted when the
dBm of this specific optic begins to degrade?  Which application is
impacted by this ECMP imbalance?  Is that issue actually impacting
any other customers?

Intent-based approaches are often declarative, starting from a
statement such as "The service works correctly" and trying to
enforce it.  Such approaches are mainly suited for greenfield
deployments.

Instead of approaching intent in a declarative way, this framework
focuses on already defined services and tries to infer the meaning of
"The service works correctly".  To do so, the framework works from an
assurance graph, deduced from the service definition and from the
network configuration.  This assurance graph is decomposed into
components, which are then assured independently.  The root of the
assurance graph represents the service to assure, and its children
represent components identified as its direct dependencies; each
component can have dependencies as well.  The SAIN architecture
maintains the correct assurance graph when services are modified or
when the network conditions change.
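As a rough illustration of this decomposition (a sketch only; the
class and field names below are hypothetical and are not taken from
any SAIN specification), an assurance graph can be modeled as nodes
linked by typed, directed dependencies:

   from dataclasses import dataclass, field
   from enum import Enum
   from typing import List

   class DependencyType(Enum):
       IMPACTING = "impacting"          # score propagates to parent
       INFORMATIONAL = "informational"  # only symptoms propagate

   @dataclass
   class Node:
       """A service instance or subservice instance in the graph."""
       name: str
       dependencies: List["Dependency"] = field(default_factory=list)

       def depends_on(self, child: "Node",
                      dep_type: DependencyType) -> None:
           self.dependencies.append(Dependency(child, dep_type))

   @dataclass
   class Dependency:
       target: Node
       dep_type: DependencyType

   # The root is the service to assure; its children are the
   # components identified as its direct dependencies.
   tunnel = Node("Tunnel Service Instance")
   tunnel.depends_on(Node("Peer1 Tunnel Interface"),
                     DependencyType.IMPACTING)
   tunnel.depends_on(Node("IP Connectivity"),
                     DependencyType.IMPACTING)

The two dependency types (impacting versus informational), defined in
Section 1, determine how scores and symptoms propagate upwards in the
graph.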
When a service is degraded, the framework will highlight where in the
assurance graph to look, as opposed to going hop by hop to
troubleshoot the issue.  Not only can this framework help to
correlate service degradation with network root causes/symptoms, but
it can also deduce from the assurance graph the number and type of
services impacted by a component degradation/failure.  This added
value informs the operational team where to focus its attention for
maximum return.

This architecture provides the building blocks to assure both
physical and virtual entities and is flexible in terms of services
and subservices, of (distributed) graphs, and of components
(Section 3.8).

3.  Architecture

SAIN aims at assuring that service instances are operating correctly
and, if not, at pinpointing what is wrong.  More precisely, SAIN
computes a score for each service instance and outputs symptoms
explaining that score, especially why the score is not maximal.  The
score augmented with the symptoms is called the health status.

As an example of a service, let us consider a point-to-point L2VPN
connection (i.e., a pseudowire).  Such a service would take as
parameters the two ends of the connection (device, interface or
subinterface, and address of the other end) and configure both
devices (and maybe more) so that an L2VPN connection is established
between the two devices.  Examples of symptoms might be "Interface
has high error rate", "Interface flapping", or "Device almost out of
memory".

To compute the health status of such a service, the service is
decomposed into an assurance graph formed by subservices linked
through dependencies.  Each subservice is then turned into an
expression graph that details how to fetch metrics from the devices
and compute the health status of the subservice.  The subservice
expressions are combined according to the dependencies between the
subservices in order to obtain the expression graph that computes the
health status of the service, as sketched below.
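As a minimal sketch of this computation, assume a simple combination
rule in which the worst score among impacting dependencies wins and
symptoms accumulate (the actual heuristics are deliberately left open
by this architecture, and the metric names below are hypothetical):

   from typing import Dict, List, Tuple

   # A health status is a score (0..100) plus the explaining symptoms.
   HealthStatus = Tuple[int, List[str]]

   def interface_health(metrics: Dict[str, float]) -> HealthStatus:
       """Hypothetical subservice expression: infer an interface's
       health from two metrics fetched from the device."""
       score, symptoms = 100, []
       if metrics["error_rate"] > 0.01:
           score, symptoms = 50, ["Interface has high error rate"]
       if metrics["oper_status"] != 1:  # 1 means "up"
           score, symptoms = 0, symptoms + ["Interface is down"]
       return score, symptoms

   def combine(children: List[HealthStatus]) -> HealthStatus:
       """Combine the statuses of impacting dependencies into the
       parent's status: worst score wins, symptoms accumulate."""
       score = min((s for s, _ in children), default=100)
       symptoms = [sym for _, syms in children for sym in syms]
       return score, symptoms

   # A tunnel service depending on two interface subservices.
   peer1 = interface_health({"error_rate": 0.02, "oper_status": 1})
   peer2 = interface_health({"error_rate": 0.0, "oper_status": 1})
   print(combine([peer1, peer2]))
   # -> (50, ['Interface has high error rate'])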
The overall architecture of our solution is presented in Figure 1.
Based on the service configuration, the SAIN orchestrator deduces the
assurance graph.  It then sends the assurance graph, along with some
other configuration options, to the SAIN agents.  The SAIN agents are
responsible for building the expression graph and computing the
health statuses in a distributed manner.  The collector is in charge
of collecting and displaying the current inferred health status of
the service instances and subservices.  Finally, the automation loop
is closed by having the SAIN Collector provide feedback to the
network orchestrator.

   +-----------------+
   |     Service     |
   |  Configuration  |<--------------------+
   |  Orchestrator   |                     |
   +-----------------+                     |
      |    |                               |
      |    | Network                       |
      |    | Service                       | Feedback
      |    | Instance                      | Loop
      |    | Configuration                 |
      |    |                               |
      |    V                               |
      |  +-----------------+     +-------------------+
      |  |      SAIN       |     |       SAIN        |
      |  |  Orchestrator   |     |     Collector     |
      |  +-----------------+     +-------------------+
      |    |                               ^
      |    | Configuration                 | Health Status
      |    | (assurance graph)             | (Score + Symptoms)
      |    V                               | Streamed
      |   +-------------------+            | via Telemetry
      |   |+-------------------+           |
      |   ||+-------------------+          |
      |   +||       SAIN        |----------+
      |    +|      agent        |
      |     +-------------------+
      |        ^   ^   ^
      |        |   |   |
      |        |   |   |  Metric Collection
      V        V   V   V
   +-------------------------------------------------------------+
   |                     Monitored Entities                      |
   |                                                             |
   +-------------------------------------------------------------+

                     Figure 1: SAIN Architecture

In order to produce the score assigned to a service instance, the
architecture performs the following tasks:

o  Analyze the configuration pushed to the network device(s) for
   configuring the service instance and decide which information is
   needed from the device(s) (such a piece of information is called a
   metric) and which operations to apply to the metrics for computing
   the health status.

o  Stream (via telemetry [RFC8641]) operational and config metric
   values when possible; otherwise, continuously poll.

o  Continuously compute the health status of the service instances,
   based on the metric values.

3.1.  Decomposing a Service Instance Configuration into an Assurance
      Graph

In order to structure the assurance of a service instance, the
service instance is decomposed into so-called subservice instances.
Each subservice instance focuses on a specific feature or subpart of
the network system.

The decomposition into subservices is an important function of this
architecture, for the following reasons:

o  The result of this decomposition is the assurance case of a
   service instance, which can be represented as a graph (called the
   assurance graph) to the operator.

o  Subservices provide a scope for particular expertise and thereby
   enable contributions from external experts.  For instance, the
   subservice dealing with the optics health should be reviewed and
   extended by an expert in optical interfaces.

o  Subservices that are common to several service instances are
   reused, reducing the amount of computation needed.

The assurance graph of a service instance is a DAG representing the
structure of the assurance case for the service instance.  The nodes
of this graph are service instances or subservice instances.  Each
edge of this graph indicates a dependency between the two nodes at
its extremities: the service or subservice at the source of the edge
depends on the service or subservice at the destination of the edge.

Figure 2 depicts a simplified example of the assurance graph for a
tunnel service.  The node at the top is the service instance; the
nodes below are its dependencies.  In the example, the tunnel service
instance depends on the Peer1 and Peer2 tunnel interfaces, which in
turn depend on the respective physical interfaces, which finally
depend on the respective Peer1 and Peer2 devices.  The tunnel service
instance also depends on the IP connectivity, which depends on the
IS-IS routing protocol.
                      +------------------+
                      |      Tunnel      |
                      | Service Instance |
                      +------------------+
                               |
         +---------------------+---------------------+
         |                     |                     |
  +-------------+       +-------------+       +--------------+
  |    Peer1    |       |    Peer2    |       |      IP      |
  |    Tunnel   |       |    Tunnel   |       | Connectivity |
  |  Interface  |       |  Interface  |       |              |
  +-------------+       +-------------+       +--------------+
         |                     |                     |
  +-------------+       +-------------+       +-------------+
  |    Peer1    |       |    Peer2    |       |    IS-IS    |
  |   Physical  |       |   Physical  |       |   Routing   |
  |  Interface  |       |  Interface  |       |   Protocol  |
  +-------------+       +-------------+       +-------------+
         |                     |
  +-------------+       +-------------+
  |    Peer1    |       |    Peer2    |
  |    Device   |       |    Device   |
  +-------------+       +-------------+

                  Figure 2: Assurance Graph Example

Depicting the assurance graph helps the operator to understand (and
assert) the decomposition.  The assurance graph shall be maintained
during normal operation with addition, modification, and removal of
service instances.  A change in the network configuration or topology
shall be reflected in the assurance graph.  As a first example, a
change of routing protocol from IS-IS to OSPF would change the
assurance graph accordingly.  As a second example, assume that ECMP
is in place for the source router for that specific tunnel; in that
case, multiple interfaces must now be monitored, on top of monitoring
the ECMP health itself.

3.2.  Intent and Assurance Graph

The SAIN orchestrator analyzes the configuration of a service
instance to:

o  Try to capture the intent of the service instance, i.e., what the
   service instance is trying to achieve,

o  Decompose the service instance into subservices representing the
   network features on which the service instance relies.

The SAIN orchestrator must be able to analyze configuration from
various devices and produce the assurance graph.

To schematize what a SAIN orchestrator does, assume that the
configuration for a service instance touches two devices and
configures a virtual tunnel interface on each device.  Then:

o  Capturing the intent would start by detecting that the service
   instance is actually a tunnel between the two devices, and stating
   that this tunnel must be functional.  This is the current state of
   SAIN; however, it does not completely capture the intent, which
   might additionally include, for instance, the latency and
   bandwidth requirements of this tunnel.

o  Decomposing the service instance into subservices would result in
   the assurance graph depicted in Figure 2, for instance.

In order for SAIN to be applied, the configuration necessary for each
service instance should be identifiable and thus should come from a
"service-aware" source.  While Figure 1 makes a distinction between
the SAIN orchestrator and a different component providing the service
instance configuration, in practice those two components are most
likely combined.  The internals of the orchestrator are currently out
of the scope of this document.

3.3.  Subservices

A subservice corresponds to a subpart or a feature of the network
system that is needed for a service instance to function properly.
In the context of SAIN, "subservice" is actually a shortcut for
"subservice assurance", that is, the method for assuring that a
subservice behaves correctly.
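To make the notion more concrete, the following sketch (with purely
illustrative names; the actual modeling is defined in the companion
YANG document) shows a subservice identified by a type and by the
parameters that pin it to a specific instance:

   from dataclasses import dataclass
   from typing import Tuple

   @dataclass(frozen=True)
   class Subservice:
       """A subservice instance: a type plus the parameters that pin
       it to a specific feature or subpart of the network system."""
       type: str                                # e.g., "interface"
       parameters: Tuple[Tuple[str, str], ...]  # hashable key/values

   device = Subservice("device", (("deviceId", "peer1"),))
   interface = Subservice("interface", (("deviceId", "peer1"),
                                        ("interfaceId", "ge-0/0/1")))

   # Subservices common to several service instances collapse to a
   # single node, so their health is computed only once (Section 3.1).
   nodes = {device, interface,
            Subservice("device", (("deviceId", "peer1"),))}
   assert len(nodes) == 2  # the duplicate "device" subservice is reused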
Subservices, exactly like services, have high-level parameters that
specify the type and specific instance to be assured.  For example,
assuring a device requires the specific deviceId as a parameter;
assuring an interface requires the specific combination of deviceId
and interfaceId.

A subservice is also characterized by a list of metrics to fetch and
a list of computations to apply to these metrics in order to infer a
health status.

3.4.  Building the Expression Graph from the Assurance Graph

From the assurance graph is derived a so-called global computation
graph.  First, each subservice instance is transformed into a set of
subservice expressions that take metrics and constants as input
(i.e., sources of the DAG) and produce the status of the subservice,
based on some heuristics.  Then, for each service instance, the
service expressions are constructed by combining the subservice
expressions of its dependencies.  The way service expressions are
combined depends on the dependency types (impacting or
informational).  Finally, the global computation graph is built by
combining the service expressions.  In other words, the global
computation graph encodes all the operations needed to produce health
statuses from the collected metrics.

Subservices shall be device independent.  To justify this, let's
consider the interface operational status.  Depending on the device
capabilities, this status can be collected by an industry-accepted
YANG module (IETF, OpenConfig), by a vendor-specific YANG module, or
even by a MIB module.  If the subservice were dependent on the
mechanism to collect the operational status, then we would need
multiple subservice definitions in order to support all the different
mechanisms.  This also implies that, while waiting for all the
metrics to be available via standard YANG modules, SAIN agents might
have to retrieve metric values via non-standard YANG models, via MIB
modules, via the Command Line Interface (CLI), etc., effectively
implementing a normalization layer between data models and
information models.

In order to keep subservices independent from the metric collection
method, or, expressed differently, to support multiple combinations
of platforms, OSes, and even vendors, the framework introduces the
concept of "metric engine".  The metric engine maps each device-
independent metric used in the subservices to a list of device-
specific metric implementations that precisely define how to fetch
values for that metric.  The mapping is parameterized by the
characteristics (model, OS version, etc.) of the device from which
the metrics are fetched.
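A possible sketch of such a metric engine, assuming a simple
capability-based lookup (the mapping table and function are
illustrative; only the YANG paths and MIB object shown are real
identifiers):

   from typing import Dict, List

   # Device-independent metric name -> candidate implementations,
   # keyed by the data-model family a device may support.
   METRIC_IMPLEMENTATIONS: Dict[str, Dict[str, str]] = {
       "interface-oper-status": {
           "ietf": "/ietf-interfaces:interfaces-state/interface"
                   "/oper-status",
           "openconfig": "/interfaces/interface/state/oper-status",
           "mib": "IF-MIB::ifOperStatus",
       },
   }

   def resolve_metric(metric: str, capabilities: List[str]) -> str:
       """Map a device-independent metric to the first implementation
       supported by the target device, keeping the subservice
       definitions device independent."""
       for family in ("ietf", "openconfig", "mib"):
           if family in capabilities:
               return METRIC_IMPLEMENTATIONS[metric][family]
       raise LookupError(f"no implementation for {metric}")

   # A device supporting only OpenConfig and SNMP gets the
   # OpenConfig path.
   print(resolve_metric("interface-oper-status", ["openconfig", "mib"]))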
3.5.  Building the Expression from a Subservice

In addition to the list of metrics, each subservice defines a list of
expressions to apply to the metrics in order to compute the health
status of the subservice.  The definition or the standardization of
those expressions (also known as heuristics) is currently out of the
scope of this document.

3.6.  Open Interfaces with YANG Modules

The interfaces between the architecture components are open thanks to
the YANG modules specified in YANG Modules for Service Assurance
[I-D.claise-opsawg-service-assurance-yang]; they specify objects for
assuring network services based on their decomposition into so-called
subservices, according to the SAIN architecture.

This module is intended for the following use cases:

o  Assurance graph configuration:

   *  Subservices: configure a set of subservices to assure, by
      specifying their types and parameters.

   *  Dependencies: configure the dependencies between the
      subservices, along with their types.

o  Assurance telemetry: export the health status of the subservices,
   along with the observed symptoms.

3.7.  Handling Maintenance Windows

Whenever network components are under maintenance, the operator wants
to inhibit the emission of symptoms from those components.  A typical
use case is device maintenance, during which the device is not
supposed to be operational.  As such, symptoms related to the device
health should be ignored, as should symptoms related to the device-
specific subservices, such as the interfaces, as their state changes
are probably a consequence of the maintenance.

To configure network components as "under maintenance" in the SAIN
architecture, the ietf-service-assurance model proposed in
[I-D.claise-opsawg-service-assurance-yang] specifies an "under-
maintenance" flag per service or subservice instance.  When, and only
when, this flag is set, the companion field "maintenance-contact"
must be set to a string that identifies the person or process who
requested the maintenance.  Any symptom produced by a service or
subservice under maintenance, or by one of its dependencies, MUST NOT
be reported.  A service or subservice under maintenance MAY propagate
a symptom "Under Maintenance" towards services or subservices that
depend on it.

We illustrate this mechanism with three independent examples based on
the assurance graph depicted in Figure 2; a sketch of the resulting
symptom inhibition follows the examples:

o  Device maintenance, for instance upgrading the device OS.  The
   operator sets the "under-maintenance" flag for the subservice
   "Peer1 Device".  This inhibits the emission of symptoms from
   "Peer1 Physical Interface", "Peer1 Tunnel Interface", and "Tunnel
   Service Instance".  All other subservices are unaffected.

o  Interface maintenance, for instance replacing a broken optic.  The
   operator sets the "under-maintenance" flag for the subservice
   "Peer1 Physical Interface".  This inhibits the emission of
   symptoms from "Peer1 Tunnel Interface" and "Tunnel Service
   Instance".  All other subservices are unaffected.

o  Routing protocol maintenance, for instance modifying parameters or
   redistribution.  The operator sets the "under-maintenance" flag
   for the subservice "IS-IS Routing Protocol".  This inhibits the
   emission of symptoms from "IP Connectivity" and "Tunnel Service
   Instance".  All other subservices are unaffected.
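The following sketch illustrates the inhibition logic in the device
maintenance example (the traversal and naming are illustrative; the
normative behavior is the one stated above):

   from typing import Dict, List, Set

   def inhibited(node: str, deps: Dict[str, List[str]],
                 under_maintenance: Set[str]) -> bool:
       """Symptoms of a node are inhibited if the node itself, or any
       node it (transitively) depends on, is under maintenance."""
       if node in under_maintenance:
           return True
       return any(inhibited(d, deps, under_maintenance)
                  for d in deps.get(node, []))

   # Dependencies from Figure 2 (parent -> children).
   deps = {
       "Tunnel Service Instance": ["Peer1 Tunnel Interface",
                                   "Peer2 Tunnel Interface",
                                   "IP Connectivity"],
       "Peer1 Tunnel Interface": ["Peer1 Physical Interface"],
       "Peer1 Physical Interface": ["Peer1 Device"],
       "IP Connectivity": ["IS-IS Routing Protocol"],
   }

   # Device maintenance: flag "Peer1 Device" as under maintenance.
   maint = {"Peer1 Device"}
   for n in ("Peer1 Physical Interface", "Peer1 Tunnel Interface",
             "Tunnel Service Instance", "Peer2 Tunnel Interface"):
       print(n, "inhibited:", inhibited(n, deps, maint))
   # Peer1's chain and the service are inhibited; Peer2's is not.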
3.8.  Flexible Architecture

The SAIN architecture is flexible in terms of components.  While the
SAIN architecture in Figure 1 makes a distinction between two
components, the service configuration orchestrator and the SAIN
orchestrator, in practice those two components are most likely
combined.  Similarly, the SAIN agents are displayed in Figure 1 as
separate components.  Practically, the SAIN agents could be either
independent components or directly integrated into the monitored
entities.  A practical example is an agent in a router.

The SAIN architecture is also flexible in terms of services and
subservices.  Most examples in this document deal with the notion of
Network Service YANG modules, with well-known services such as L2VPNs
or tunnels.  However, the concept of service is general enough to
cross into different domains.  One of them is the domain of service
management on network elements, which also requires its own
assurance.  Examples include a DHCP server on a Linux server, a data
plane, an IPFIX export, etc.  The notion of "service" is generic in
this architecture.  Indeed, a configured service can itself be a
service for someone else: a DHCP server, a data plane, or an IPFIX
export can be considered a service for a device, a routing instance
can be considered a service for an L3VPN, and a tunnel can be
considered a service for an application in the cloud.  The assurance
graph is created to be flexible and open, regardless of the
subservice types, locations, or domains.

The SAIN architecture is also flexible in terms of distributed
graphs.  As shown in Figure 1, our architecture comprises several
agents.  Each agent is responsible for handling a subgraph of the
assurance graph.  The collector is responsible for fetching the
subgraphs from the different agents and gluing them together.  As an
example, in the graph from Figure 2, the subservices relative to
Peer1 might be handled by a different agent than the subservices
relative to Peer2, and the IP Connectivity and IS-IS subservices
might be handled by yet another agent.  The agents will export their
partial graphs, and the collector will stitch them together as
dependencies of the service instance.

Finally, the SAIN architecture is flexible in terms of what it
monitors.  Most, if not all, examples in this document refer to
physical components, but this is not a constraint.  Indeed, the
assurance of virtual components would follow the same principles, and
an assurance graph composed of virtualized components (or a mix of
virtualized and physical ones) is entirely possible within this
architecture.

4.  Security Considerations

The SAIN architecture helps operators to reduce the mean time to
detect and the mean time to repair.  As such, it should not cause any
security threats.  However, the SAIN agents must be secure: a
compromised SAIN agent could send wrong root causes or symptoms to
the management systems.

Except for the configuration of telemetry, the agents do not need
"write access" to the devices they monitor.  This configuration is
applied with a YANG module, whose protection is covered by Secure
Shell (SSH) [RFC6242] for NETCONF or TLS [RFC8446] for RESTCONF.

If a closed-loop system relies on this architecture, then the well-
known issues of such systems also apply, i.e., a lying device or a
compromised agent could trigger partial reconfiguration of the
service or network.  The SAIN architecture neither augments nor
reduces this risk.

5.  IANA Considerations

This document includes no request to IANA.

6.  Open Issues

o  Security Considerations to be completed.

7.  References

7.1.  Normative References

[RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
           Requirement Levels", BCP 14, RFC 2119,
           DOI 10.17487/RFC2119, March 1997,
           <https://www.rfc-editor.org/info/rfc2119>.

[RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
           2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
           May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2.  Informative References
[I-D.claise-opsawg-service-assurance-yang]
           Claise, B. and J. Quilbeuf, "YANG Modules for Service
           Assurance", draft-claise-opsawg-service-assurance-yang
           (work in progress), February 2020.

[I-D.ietf-opsawg-tacacs]
           Dahm, T., Ota, A., Medway Gash, D., Carrel, D., and
           L. Grant, "The TACACS+ Protocol", draft-ietf-opsawg-
           tacacs-17 (work in progress), November 2019.

[RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
           "Remote Authentication Dial In User Service (RADIUS)",
           RFC 2865, DOI 10.17487/RFC2865, June 2000,
           <https://www.rfc-editor.org/info/rfc2865>.

[RFC3164]  Lonvick, C., "The BSD Syslog Protocol", RFC 3164,
           DOI 10.17487/RFC3164, August 2001,
           <https://www.rfc-editor.org/info/rfc3164>.

[RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
           and A. Bierman, Ed., "Network Configuration Protocol
           (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011,
           <https://www.rfc-editor.org/info/rfc6241>.

[RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
           Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
           <https://www.rfc-editor.org/info/rfc6242>.

[RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
           "Specification of the IP Flow Information Export (IPFIX)
           Protocol for the Exchange of Flow Information", STD 77,
           RFC 7011, DOI 10.17487/RFC7011, September 2013,
           <https://www.rfc-editor.org/info/rfc7011>.

[RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
           RFC 7950, DOI 10.17487/RFC7950, August 2016,
           <https://www.rfc-editor.org/info/rfc7950>.

[RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
           Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
           <https://www.rfc-editor.org/info/rfc8040>.

[RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
           Classification", RFC 8199, DOI 10.17487/RFC8199, July
           2017, <https://www.rfc-editor.org/info/rfc8199>.

[RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
           Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018,
           <https://www.rfc-editor.org/info/rfc8446>.

[RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG Notifications
           for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641,
           September 2019, <https://www.rfc-editor.org/info/rfc8641>.

Appendix A.  Changes between revisions

v00 - v01

o  Terminology clarifications

o  Figure 1 improved

Acknowledgements

The authors would like to thank Stephane Litkowski, Charles Eckel,
Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, and Stefan
Vallin for their reviews and feedback.

Authors' Addresses

Benoit Claise
Cisco Systems, Inc.
De Kleetlaan 6a b1
1831 Diegem
Belgium

Email: bclaise@cisco.com

Jean Quilbeuf
Cisco Systems, Inc.
1, rue Camille Desmoulins
92782 Issy Les Moulineaux
France

Email: jquilbeu@cisco.com

Youssef El Fathi
Orange Business Services
61 rue des archives
75003 Paris
France

Email: io@elfathi.net

Diego R. Lopez
Telefonica I+D
Don Ramon de la Cruz, 82
Madrid 28006
Spain

Email: diego.r.lopez@telefonica.com

Dan Voyer
Bell Canada
Canada

Email: daniel.voyer@bell.ca