OPSAWG                                                         B. Claise
Internet-Draft                                               J. Quilbeuf
Intended status: Informational                       Cisco Systems, Inc.
Expires: May 6, 2020                                    November 3, 2019

       Service Assurance for Intent-based Networking Architecture
         draft-claise-opsawg-service-assurance-architecture-00

Abstract

   This document describes the architecture for Service Assurance for
   Intent-based Networking (SAIN).  This architecture aims at assuring
   that service instances are correctly running.  As services rely on
   multiple subservices provided by the underlying network devices,
   getting the assurance of a healthy service is only possible with a
   holistic view of all the network devices involved.  This
   architecture not only helps to correlate a service degradation with
   its network root cause, but also to identify which services are
   impacted when a network component fails or degrades.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 6, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Terminology
   2.  Introduction
   3.  Architecture
     3.1.  Decomposing a Service Instance Configuration into an
           Assurance Tree
     3.2.  Intent and Assurance Tree
     3.3.  Subservices
     3.4.  Building the Expression Tree from the Assurance Tree
     3.5.  Building the Expression from a Subservice
     3.6.  Open Interfaces with YANG Modules
   4.  Security Considerations
   5.  IANA Considerations
   6.  Open Issues
   7.  References
     7.1.  Normative References
     7.2.  Informative References
   Appendix A.  Changes between revisions
   Acknowledgements
   Authors' Addresses

1.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   Agent (SAIN Agent): Component that communicates with a device, a
   set of devices, or another agent to build an expression tree from a
   received assurance tree and perform the corresponding computation.

   Assurance Tree: DAG representing the assurance case for one or
   several service instances.  The nodes are the service instances
   themselves and the subservices; the edges indicate dependency
   relations.

   Collector (SAIN Collector): Component that fetches the computer-
   consumable output of the agent(s) and displays it in a user-
   friendly form or processes it locally.

   DAG: Directed Acyclic Graph.

   ECMP: Equal-Cost Multipath.

   Expression Tree: Generic term for a DAG representing a computation
   in SAIN.  More specific terms are:

   o  Subservice Expressions: expression tree representing all the
      computations to execute for a subservice.

   o  Service Expressions: expression tree representing all the
      computations to execute for a service instance, i.e., including
      the computations for all dependent subservices.

   o  Global Computation Forest: expression tree representing all the
      computations to execute for all service instances in an instance
      of SAIN (i.e., all computations performed within an instance of
      SAIN).

   Impacting Dependency: Type of dependency in the assurance tree.
   The status of the dependency is completely taken into account by
   the dependent service instance or subservice.

   Informational Dependency: Type of dependency in the assurance tree.
   Only the symptoms of the dependency are taken into account in the
   dependent service instance or subservice (i.e., for informational
   reasons).  In particular, the score is not taken into account.

   Metric: Information retrieved from a network device.

   Metric Engine: Maps metrics to a list of candidate metric
   implementations depending on the target model.

   Metric Implementation: Actual way of retrieving a metric from a
   device.

   Network Service YANG Module: Describes the characteristics of a
   service, as agreed upon with consumers of that service [RFC8199].

   Service Instance: A specific instance of a service.

   Orchestrator (SAIN Orchestrator): Component of SAIN in charge of
   fetching the configuration specific to each service instance and
   converting it into an assurance tree.

   Health Status: Score and symptoms indicating whether a service
   instance or a subservice is healthy.  A non-maximal score MUST
   always be explained by one or more symptoms.

   Subservice: Part of an assurance tree that assures a specific
   feature or subpart of the network system.

   Symptom: Reason explaining why a service instance or a subservice
   is not completely healthy.

2.  Introduction

   Network Service YANG Modules [RFC8199] describe the configuration,
   state data, operations, and notifications of abstract
   representations of services implemented on one or multiple network
   elements.

   Quoting [RFC8199]: "Network Service YANG Modules describe the
   characteristics of a service, as agreed upon with consumers of that
   service.  That is, a service module does not expose the detailed
   configuration parameters of all participating network elements and
   features but describes an abstract model that allows instances of
   the service to be decomposed into instance data according to the
   Network Element YANG Modules of the participating network elements.
   The service-to-element decomposition is a separate process; the
   details depend on how the network operator chooses to realize the
   service.  For the purpose of this document, the term "orchestrator"
   is used to describe a system implementing such a process."

   In other words, orchestrators deploy Network Service YANG Modules
   through the configuration of Network Element YANG Modules.  Network
   configuration is based on those YANG data models, with protocol/
   encoding combinations such as NETCONF/XML [RFC6241], RESTCONF/JSON
   [RFC8040], gNMI/gRPC/protobuf, etc.  Since knowing that a
   configuration has been applied does not imply that the service is
   running correctly (for example, the service might be degraded
   because of a failure in the network), the network operator must
   monitor the service operational data at the same time as the
   configuration.  The industry has been standardizing on telemetry to
   push network element performance information.

   A network administrator needs to monitor her network and services
   as a whole, independently of the use cases or the management
   protocols.  With different protocols come different data models,
   and different ways to model the same type of information.  When
   network administrators deal with multiple protocols, the network
   management system must perform the difficult and time-consuming job
   of mapping data models: the model used for configuration with the
   model used for monitoring.
   This problem is compounded by a large, disparate set of data
   sources (MIB modules, YANG models [RFC7950], IPFIX information
   elements [RFC7011], syslog plain text [RFC3164], TACACS+
   [I-D.ietf-opsawg-tacacs], RADIUS [RFC2138], etc.).  In order to
   avoid this data model mapping, the industry converged on model-
   driven telemetry to stream the service operational data, reusing
   the YANG models used for configuration.  Model-driven telemetry
   greatly facilitates the notion of closed-loop automation, whereby
   events from the network drive remediation changes back into the
   network.

   However, it proves difficult for network operators to correlate a
   service degradation with its network root cause.  For example, why
   does my L3VPN fail to connect?  Why is this specific service slow?
   The reverse question, i.e., which services are impacted when a
   network component fails or degrades, is even more interesting for
   the operators.  For example, which services are impacted when the
   dBm level of this specific optic begins to degrade?  Which
   application is impacted by this ECMP imbalance?  Is that issue
   actually impacting any other customers?

   Intent-based approaches are often declarative, starting from a
   statement such as "The service works correctly" and trying to
   enforce it.  Such approaches are mainly suited for greenfield
   deployments.

   Instead of approaching intent in a declarative way, this framework
   focuses on already defined services and tries to infer the meaning
   of "The service works correctly".  To do so, the framework works
   from an assurance tree, deduced from the service definition and
   from the network configuration.  This assurance tree is decomposed
   into components, which are then assured independently.  The root of
   the assurance tree represents the service to assure, and its
   children represent components identified as its direct
   dependencies; each component can have dependencies as well.

   When a service is degraded, the framework will highlight where in
   the assurance tree to look, as opposed to going hop by hop to
   troubleshoot the issue.  Not only can this framework help to
   correlate service degradation with network root cause/symptoms, but
   it can deduce from the assurance tree the number and type of
   services impacted by a component degradation/failure.  This added
   value informs the operational team where to focus its attention for
   maximum return.

3.  Architecture

   The goal of SAIN is to assure that service instances are operating
   correctly and, if not, to pinpoint what is wrong.  More precisely,
   SAIN computes a score for each service instance and outputs
   symptoms explaining that score, especially why the score is not
   maximal.  The score augmented with the symptoms is called the
   health status.

   As an example of a service, let us consider a point-to-point L2VPN
   connection (i.e., a pseudowire).  Such a service would take as
   parameters the two ends of the connection (device, interface or
   subinterface, and address of the other end) and configure both
   devices (and maybe more) so that an L2VPN connection is established
   between the two devices.  Examples of symptoms might be "Interface
   has high error rate", "Interface flapping", or "Device almost out
   of memory".
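   As an informal illustration of the health status concept, the
   following sketch (in Python, with hypothetical names that are not
   defined by this architecture) shows one possible in-memory
   representation of a score together with its symptoms:

      # Minimal sketch, assuming a score from 0 (broken) to 100
      # (healthy).  All names are illustrative, not normative.
      from dataclasses import dataclass, field
      from typing import List

      @dataclass
      class HealthStatus:
          score: int                      # 0..100
          symptoms: List[str] = field(default_factory=list)

          def validate(self) -> None:
              # A non-maximal score MUST always be explained by one
              # or more symptoms (see the Terminology section).
              assert self.score == 100 or self.symptoms

      status = HealthStatus(score=60,
                            symptoms=["Interface has high error rate"])
      status.validate()

   In this sketch, the invariant from the Terminology section (a non-
   maximal score MUST be explained by at least one symptom) is encoded
   as an assertion.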
   The overall architecture of our solution is presented in Figure 1.
   The assurance tree, along with some other configuration options, is
   sent to the SAIN agents, which are responsible for building the
   expression tree and computing the statuses in a distributed manner.
   The collector is in charge of collecting and displaying the current
   status of the assured service instances.

      Network         +-----------------+     +-------------------+
      Service ------> |     (SAIN)      |     |      (SAIN)       |
      Instance        |  Orchestrator   |     |     Collector     |
      Configuration   +-----------------+     +-------------------+
                              |                         ^
                              | Configuration           | health status
                              | (assurance tree)        | (score + symptoms)
                              V                         | streamed
                      +-------------------+             | via telemetry
                      |+-------------------+            |
                      ||+-------------------+           |
                      +||      (SAIN)       |-----------+
                       +|       agent       |
                        +-------------------+
                           ^      ^      ^
                           |      |      |
                           |      |      | Metric Collection
                           V      V      V
      +-------------------------------------------------------------+
      |                           Network                           |
      |                                                             |
      +-------------------------------------------------------------+

                        Figure 1: SAIN Architecture

   In order to produce the score assigned to a service instance, the
   architecture performs the following tasks:

   o  Analyze the configuration pushed to the network device(s) for
      configuring the service instance and decide which information is
      needed from the device(s) (such a piece of information being
      called a metric) and which operations to apply to the metrics
      for computing the health status.

   o  Stream (via telemetry [RFC8641]) operational and config metric
      values when possible, else continuously fetch them.

   o  Continuously compute the health status of the service instances,
      based on the metric values.

   As said above, the goal of SAIN is to produce a health status for
   each service instance to assure, by collecting some metrics and
   applying operations to them.  To meet that goal, the service is
   decomposed into an assurance tree formed by subservices linked
   through dependencies.  Each subservice is then turned into
   expressions that are combined according to the dependencies between
   the subservices in order to obtain the expression tree, which
   details how to fetch the metrics and how to compute the health
   status for each service instance.  The expression tree is then
   implemented by the SAIN agents.  The architecture also exports the
   health status of each subservice.

3.1.  Decomposing a Service Instance Configuration into an Assurance
      Tree

   In order to structure the assurance of a service instance, the
   service instance is decomposed into so-called subservices.  Each
   subservice focuses on a specific feature or subpart of the network
   system.

   The decomposition into subservices is at the heart of this
   architecture, for the following reasons:

   o  The result of this decomposition is the assurance case of a
      service instance, which can be represented as a graph (called
      the assurance tree) to the operator.

   o  Subservices provide a scope for particular expertise and thereby
      enable contributions from external experts.  For instance, the
      subservice dealing with the optics health should be reviewed and
      extended by an expert in optical interfaces.

   o  Subservices that are common to several service instances are
      reused, reducing the amount of computation needed (see the
      sketch after this list).
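   As an informal illustration of this reuse, the following sketch (in
   Python, with hypothetical names) represents an assurance tree as a
   DAG in which a subservice shared by two service instances appears
   only once, so that its health status is computed only once:

      # Minimal sketch, assuming each node carries a type and a set
      # of parameters.  All names are illustrative, not normative.
      from dataclasses import dataclass, field
      from typing import Dict, List

      @dataclass
      class Node:
          kind: str                  # "service" or "subservice"
          parameters: Dict[str, str]
          dependencies: List["Node"] = field(default_factory=list)

      # One physical interface subservice shared by two tunnel
      # service instances.
      phys = Node("subservice", {"type": "interface",
                                 "device": "peer1",
                                 "name": "GigabitEthernet0/0"})
      tunnel_a = Node("service", {"type": "tunnel", "id": "a"}, [phys])
      tunnel_b = Node("service", {"type": "tunnel", "id": "b"}, [phys])

      # Both service instances point to the very same subservice node,
      # whose health status is therefore computed a single time.
      assert tunnel_a.dependencies[0] is tunnel_b.dependencies[0]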
   The assurance tree of a service instance is a DAG representing the
   structure of the assurance case for the service instance.  The
   nodes of this graph are service instances or subservice instances.
   Each edge of this graph indicates a dependency between the two
   nodes at its extremities: the service or subservice at the source
   of the edge depends on the service or subservice at the destination
   of the edge.

   Figure 2 depicts a simplistic example of the assurance tree for a
   tunnel service.  The node at the top is the service instance; the
   nodes below are its dependencies.  In the example, the tunnel
   service instance depends on the peer1 and peer2 tunnel interfaces,
   which in turn depend on the respective physical interfaces, which
   finally depend on the respective peer1 and peer2 devices.  The
   tunnel service instance also depends on the IP connectivity, which
   depends on the IS-IS routing protocol.

                         +------------------+
                         |      Tunnel      |
                         | Service Instance |
                         +------------------+
                                   |
               +-------------------+-------------------+
               |                   |                   |
        +-------------+     +-------------+     +--------------+
        |    Peer1    |     |    Peer2    |     |      IP      |
        |   Tunnel    |     |   Tunnel    |     | Connectivity |
        |  Interface  |     |  Interface  |     |              |
        +-------------+     +-------------+     +--------------+
               |                   |                   |
        +-------------+     +-------------+     +-------------+
        |    Peer1    |     |    Peer2    |     |    IS-IS    |
        |  Physical   |     |  Physical   |     |   Routing   |
        |  Interface  |     |  Interface  |     |  Protocol   |
        +-------------+     +-------------+     +-------------+
               |                   |
        +-------------+     +-------------+
        |             |     |             |
        |    Peer1    |     |    Peer2    |
        |   Device    |     |   Device    |
        +-------------+     +-------------+

                    Figure 2: Assurance Tree Example

   Depicting the assurance tree helps the operator to understand (and
   assert) the decomposition.  The assurance tree shall be maintained
   during normal operation with addition, modification, and removal of
   service instances.  A change in the network configuration or
   topology shall be reflected in the assurance tree.  As a first
   example, a change of routing protocol from IS-IS to OSPF would
   change the assurance tree accordingly.  As a second example, assume
   that ECMP is in place at the source router for that specific
   tunnel; in that case, multiple interfaces must now be monitored, on
   top of monitoring the ECMP health itself.

3.2.  Intent and Assurance Tree

   The SAIN orchestrator analyzes the configuration of a service
   instance to:

   o  Try to capture the intent of the service instance, i.e., what
      the service instance is trying to achieve,

   o  Decompose the service instance into subservices representing the
      network features on which the service instance relies.

   The SAIN orchestrator must be able to analyze configuration from
   various devices and produce the assurance tree.

   To schematize what a SAIN orchestrator does, assume that the
   configuration for a service instance touches two devices and
   configures a virtual tunnel interface on each device.  Then:

   o  Capturing the intent would start by detecting that the service
      instance is actually a tunnel between the two devices, and
      stating that this tunnel must be functional.  This is the
      current state of SAIN; however, it does not completely capture
      the intent, which might additionally include, for instance, the
      latency and bandwidth requirements of this tunnel.

   o  Decomposing the service instance into subservices is what the
      assurance tree depicted in Figure 2 does (a sketch of such a
      decomposition follows this list).
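   As an informal illustration (reusing the hypothetical Node class
   from the previous sketch), the following Python sketch shows how an
   orchestrator might decompose the tunnel service of Figure 2 into
   its assurance tree; the decomposition logic itself is an
   assumption, not something defined by this architecture:

      # Minimal sketch: build the assurance tree of Figure 2 for a
      # tunnel between two peers.  All names are illustrative.
      def decompose_tunnel(peer1: str, peer2: str,
                           if1: str, if2: str) -> Node:
          dependencies = []
          for device, interface in ((peer1, if1), (peer2, if2)):
              dev = Node("subservice", {"type": "device",
                                        "name": device})
              phy = Node("subservice", {"type": "interface",
                                        "device": device,
                                        "name": interface}, [dev])
              tun = Node("subservice", {"type": "tunnel-interface",
                                        "device": device}, [phy])
              dependencies.append(tun)
          isis = Node("subservice", {"type": "is-is"})
          dependencies.append(Node("subservice",
                                   {"type": "ip-connectivity"},
                                   [isis]))
          return Node("service", {"type": "tunnel"}, dependencies)

      tree = decompose_tunnel("peer1", "peer2", "Tunnel0", "Tunnel0")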
   In order for SAIN to be applied, the configuration necessary for
   each service instance should be identifiable and thus should come
   from a "service-aware" source.  While Figure 1 makes a distinction
   between the SAIN orchestrator and a different component providing
   the service instance configuration, in practice those two
   components are most likely combined.  The internals of the
   orchestrator are currently out of scope of this standardization.

3.3.  Subservices

   A subservice corresponds to a subpart or a feature of the network
   system that is needed for a service instance to function properly.
   In the context of SAIN, "subservice" is actually a shortcut for
   "subservice assurance", that is, the method for assuring that a
   subservice behaves correctly.

   A subservice is characterized by a list of metrics to fetch and a
   list of computations to apply to these metrics in order to produce
   a health status.  Subservices, like services, have high-level
   parameters that define which object should be assured.

3.4.  Building the Expression Tree from the Assurance Tree

   From the assurance tree is derived a so-called expression tree,
   which is actually a DAG whose sources are constants or metrics and
   whose other nodes are operators.  The expression tree encodes all
   the operations needed to produce health statuses from the collected
   metrics.

   Subservices shall be device independent.  To justify this, let's
   consider the interface operational status.  Depending on the device
   capabilities, this status can be collected via an industry-accepted
   YANG module (IETF, OpenConfig), via a vendor-specific YANG module,
   or even via a MIB module.  If the subservice were dependent on the
   mechanism used to collect the operational status, then we would
   need multiple subservice definitions in order to support all the
   different mechanisms.

   In order to keep subservices independent from the metric collection
   method, or, expressed differently, to support multiple combinations
   of platforms, OSes, and even vendors, the framework introduces the
   concept of a "metric engine".  The metric engine maps each device-
   independent metric used in the subservices to a list of device-
   specific metric implementations that precisely define how to fetch
   values for that metric.  The mapping is parameterized by the
   characteristics (model, OS version, etc.) of the device from which
   the metrics are fetched.
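   As an informal illustration of such a mapping, the following Python
   sketch (with hypothetical names) maps the device-independent metric
   "interface-oper-status" to candidate implementations, selected
   according to the device characteristics:

      # Minimal sketch, assuming devices are described by a dictionary
      # of characteristics.  All names are illustrative, not normative.
      from typing import Callable, Dict, List, NamedTuple

      class MetricImplementation(NamedTuple):
          protocol: str    # e.g., "netconf" or "snmp"
          path: str        # YANG path or OID used to fetch the value
          applies_to: Callable[[Dict[str, str]], bool]

      CANDIDATES: Dict[str, List[MetricImplementation]] = {
          "interface-oper-status": [
              MetricImplementation(
                  "netconf",
                  "/ietf-interfaces:interfaces-state"
                  "/interface/oper-status",
                  lambda dev: "ietf-interfaces" in
                              dev.get("modules", "")),
              MetricImplementation(
                  "snmp",
                  "1.3.6.1.2.1.2.2.1.8",  # IF-MIB::ifOperStatus
                  lambda dev: True),      # fallback for older devices
          ],
      }

      def implementations_for(metric: str, device: Dict[str, str]):
          # Return the candidate implementations usable on the device.
          return [impl for impl in CANDIDATES[metric]
                  if impl.applies_to(device)]

      impls = implementations_for(
          "interface-oper-status",
          {"model": "exampleRouter", "modules": "ietf-interfaces"})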
3.5.  Building the Expression from a Subservice

   In addition to the list of metrics, each subservice defines a list
   of expressions to apply to the metrics in order to compute the
   health status of the subservice.  The definition or the
   standardization of those expressions (also known as heuristics) is
   currently out of scope of this standardization.

3.6.  Open Interfaces with YANG Modules

   The interfaces between the architecture components are open thanks
   to YANG modules; the companion YANG module
   [I-D.claise-opsawg-service-assurance-yang] defines objects for
   assuring network services based on their decomposition into so-
   called subservices, according to the SAIN architecture.

   This module is intended for the following use cases:

   o  Assurance tree configuration:

      *  Subservices: configure a set of subservices to assure, by
         specifying their types and parameters.

      *  Dependencies: configure the dependencies between the
         subservices, along with their type.

   o  Assurance telemetry: export the health status of the
      subservices, along with the observed symptoms.

4.  Security Considerations

   TO BE COMPLETED

5.  IANA Considerations

   This document includes no request to IANA.

6.  Open Issues

   -  Security Considerations to be completed

7.  References

7.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

7.2.  Informative References

   [I-D.ietf-opsawg-tacacs]
              Dahm, T., Ota, A., Gash, D., Carrel, D., and L. Grant,
              "The TACACS+ Protocol", draft-ietf-opsawg-tacacs-15
              (work in progress), September 2019.

   [RFC2138]  Rigney, C., Rubens, A., Simpson, W., and S. Willens,
              "Remote Authentication Dial In User Service (RADIUS)",
              RFC 2138, DOI 10.17487/RFC2138, April 1997,
              <https://www.rfc-editor.org/info/rfc2138>.

   [RFC3164]  Lonvick, C., "The BSD syslog Protocol", RFC 3164,
              DOI 10.17487/RFC3164, August 2001,
              <https://www.rfc-editor.org/info/rfc3164>.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J.,
              Ed., and A. Bierman, Ed., "Network Configuration
              Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241,
              June 2011, <https://www.rfc-editor.org/info/rfc6241>.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013,
              <https://www.rfc-editor.org/info/rfc7011>.

   [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling
              Language", RFC 7950, DOI 10.17487/RFC7950, August 2016,
              <https://www.rfc-editor.org/info/rfc7950>.

   [RFC8040]  Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF
              Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017,
              <https://www.rfc-editor.org/info/rfc8040>.

   [RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
              Classification", RFC 8199, DOI 10.17487/RFC8199, July
              2017, <https://www.rfc-editor.org/info/rfc8199>.

   [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG
              Notifications for Datastore Updates", RFC 8641,
              DOI 10.17487/RFC8641, September 2019,
              <https://www.rfc-editor.org/info/rfc8641>.

Appendix A.  Changes between revisions

   v00 - v01

   o  Placeholder for next version.

Acknowledgements

   The authors would like to thank ...

Authors' Addresses

   Benoit Claise
   Cisco Systems, Inc.
   De Kleetlaan 6a b1
   1831 Diegem
   Belgium

   Email: bclaise@cisco.com

   Jean Quilbeuf
   Cisco Systems, Inc.
   1, rue Camille Desmoulins
   92782 Issy Les Moulineaux
   France

   Email: jquilbeu@cisco.com