OPSAWG                                                      H. Song, Ed.
Internet-Draft                                                   T. Zhou
Intended status: Informational                                    ZB. Li
Expires: June 17, 2019                                            Huawei
                                                                  ZQ. Li
                                                            China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                                   Nokia
                                                                 A. Wang
                                                           China Telecom
                                                       December 14, 2018

                      Network Telemetry Framework
                        draft-song-opsawg-ntf-02

Abstract

   This document provides an architectural framework for network
   telemetry to address the current and future network operation
   challenges and requirements. The defining characteristics of network
   telemetry show a clear distinction from the conventional network
   Operations, Administration, and Maintenance (OAM). Network telemetry
   promises better scalability, accuracy, coverage, and performance and
   allows automated control loops to suit both today's and tomorrow's
   network operation requirements. This document clarifies the
   terminologies and classifies the modules and components of a network
   telemetry system. The framework and taxonomy help to set a common
   ground for the collection of related work and provide guidance for
   future technique and standard developments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 17, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Motivation
     2.1.  Use Cases
     2.2.  Challenges
     2.3.  Glossary
     2.4.  Network Telemetry
   3.  The Necessity of a Network Telemetry Framework
   4.  Network Telemetry Framework
     4.1.  Existing Works Mapped in the Framework
   5.  Evolution of Network Telemetry
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgments
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Appendix A.  A Survey on Existing Network Telemetry Techniques
     A.1.  Management Plane Telemetry
       A.1.1.  Requirements and Challenges
       A.1.2.  Push Extensions for NETCONF
       A.1.3.  gRPC Network Management Interface
     A.2.  Control Plane Telemetry
       A.2.1.  Requirements and Challenges
       A.2.2.  BGP Monitoring Protocol
     A.3.  Data Plane Telemetry
       A.3.1.  Requirements and Challenges
       A.3.2.  Technique Taxonomy
       A.3.3.  The IPFPM technology
       A.3.4.  Dynamic Network Probe
       A.3.5.  IP Flow Information Export (IPFIX) protocol
       A.3.6.  In-Situ OAM
     A.4.  External Data and Event Telemetry
       A.4.1.  Requirements and Challenges
       A.4.2.  Sources of External Events
       A.4.3.  Connectors and Interfaces
   Authors' Addresses

1.  Introduction

   Network visibility is essential for network operation. Network
   telemetry has been widely considered as an ideal means to gain
   sufficient network visibility with better scalability, accuracy,
   coverage, and performance than conventional OAM technologies.
   However, confusion and misunderstandings about network telemetry
   remain (e.g., the scope and coverage of the term). We need an
   unambiguous concept and a clear architectural framework for network
   telemetry so we can better align the related technology and standard
   work.

   First, we show some key characteristics of network telemetry which
   set a clear distinction from the conventional network OAM. We then
   provide an architectural framework for network telemetry to meet the
   current and future network operation requirements. Following the
   framework, we classify the components of a network telemetry system
   so we can easily map the existing and emerging techniques and
   protocols into the framework. Finally, we outline a roadmap for the
   evolution of the network telemetry system.

   The purpose of the framework and taxonomy is to set a common ground
   for the collection of related work and provide guidance for future
   technique and standard developments.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119][RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Motivation

   Thanks to the advance of computing and storage technologies,
   today's big data analytics and machine learning-based Artificial
   Intelligence (AI) give network operators an unprecedented opportunity
   to gain network insights and move towards network autonomy. Software
   tools can use the network data to detect and react to network faults,
   anomalies, and policy violations, as well as predict future events.
   In turn, network policy updates for planning, intrusion prevention,
   optimization, and self-healing may be applied.

   It is conceivable that an intent-driven autonomous network is the
   logical next step for network evolution following Software Defined
   Network (SDN), aiming to reduce (or even eliminate) human labor, make
   the most efficient use of network resources, and provide better
   services more aligned with customer requirements.
   Although it takes time to reach the ultimate goal, the journey has
   started nevertheless.

   However, the system bottleneck is shifting from data consumption to
   data supply. Both the number of network nodes and the traffic
   bandwidth keep increasing at a fast pace; the network configuration
   and policy change on a much shorter time frame than ever before; more
   subtle events and fine-grained data through all network planes need
   to be captured and exported in real time. In a nutshell, it is
   challenging to get enough high-quality data out of the network in an
   efficient, timely, and flexible manner. Therefore, we need to examine
   the existing network technologies and protocols, and identify any
   potential gaps based on the real network and device architectures.

   In the remainder of this section, we first discuss several key use
   cases for today's and future network operations. Next, we show why
   the current network OAM techniques and protocols are insufficient for
   these use cases. The discussion underlines the need for new methods,
   techniques, and protocols, which we group under the umbrella term
   "network telemetry".

2.1.  Use Cases

   These use cases are essential for network operations. While the list
   is by no means exhaustive, it is enough to highlight the requirements
   of data velocity, variety, and volume.

   Policy and Intent Compliance:  Network policies are the rules that
      constrain the services for network access, provide service
      differentiation, or enforce specific treatment on the traffic.
      For example, a service function chain is a policy that requires
      the selected flows to pass through a set of ordered network
      functions. An intent is a high-level abstract policy which
      requires a complex translation and mapping process before being
      applied on networks. While a policy is enforced, the compliance
      needs to be verified and monitored continuously.

   SLA Compliance:  A Service-Level Agreement (SLA) defines the level of
      service a user expects from a network operator, which includes the
      metrics for the service measurement and remedy/penalty procedures
      when the service level misses the agreement. Users need to check
      if they get the service as promised and network operators need to
      evaluate how they can deliver the services that can meet the SLA.

   Root Cause Analysis:  Any network failure can be the cause or effect
      of a sequence of chained events. Troubleshooting and recovery
      require quick identification of the root cause of any observable
      issues. However, the root cause is not always straightforward to
      identify, especially when the failure is sporadic and the related
      and unrelated events are overwhelming. While machine learning
      technologies can be used for root cause analysis, it is up to the
      network to sense and provide all the relevant data.

   Network Optimization:  This covers all short-term and long-term
      network optimization techniques, including load balancing, Traffic
      Engineering (TE), and network planning. Network operators are
      motivated to optimize their network utilization and differentiate
      services for better ROI or lower CAPEX. The first step is to know
      the real-time network conditions before applying policies for
      traffic manipulation. In some cases, micro-bursts need to be
      detected in a very short time frame so that fine-grained traffic
      control can be applied to avoid network congestion. The long-term
      network capacity planning and topology augmentation also rely on
      the accumulated data of the network operations.

   Event Tracking and Prediction:  The visibility of user traffic path
      and performance is critical for healthy network operation.
      Numerous related network events are of interest to network
      operators.
      For example, network operators always want to learn where and why
      packets are dropped for an application flow. They also want to be
      warned of issues in advance so proactive actions can be taken to
      avoid catastrophic consequences.

2.2.  Challenges

   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog to monitor the network. Some
   other OAM techniques as described in [RFC7276] are also used to
   facilitate network troubleshooting. These conventional techniques are
   not sufficient to support the above use cases for the following
   reasons:

   o  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real time and
      interactively. The poll-based low-frequency data collection is
      ill-suited for these applications. Subscription-based streaming
      data directly pushed from the data source (e.g., the forwarding
      chip) is preferred to provide enough data quantity and precision
      at scale.

   o  Comprehensive data is needed from packet processing engine to
      traffic manager, from line cards to main control board, from user
      flows to control protocol packets, from device configurations to
      operations, and from physical layer to application layer.
      Conventional OAM only covers a narrow range of data (e.g., SNMP
      only handles data from the Management Information Base (MIB)).
      Traditional network devices cannot provide all the necessary
      probes. An open and programmable network device is therefore
      needed.

   o  Many application scenarios need to correlate data from multiple
      sources (i.e., from distributed network devices, different
      components of a network device, or different network planes). A
      piecemeal solution often lacks the capability to consolidate the
      data from multiple sources.
      The composition of a complete solution, as partly proposed by
      Autonomic Resource Control Architecture (ARCA)
      [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
      guided by a comprehensive framework.

   o  Some of the conventional OAM techniques (e.g., CLI and Syslog)
      lack a formal data model. The unstructured data hinder tool
      automation and application extensibility. Standardized data
      models are essential to support programmable networks.

   o  Although some conventional OAM techniques support data push (e.g.,
      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data
      are limited to only predefined management plane warnings (e.g.,
      SNMP Trap) or sampled user packets (e.g., sFlow). We require
      data with arbitrary source, granularity, and precision, which is
      beyond the capability of the existing techniques.

   o  The conventional passive measurement techniques can either consume
      too many network resources and produce too much redundant data, or
      lead to inaccurate results; the conventional active measurement
      techniques can interfere with the user traffic and their results
      are indirect. We need techniques that can collect direct and
      on-demand data from user traffic.

2.3.  Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document. We make an intentional distinction between
   network telemetry and network OAM.

   AI:  Artificial Intelligence. Use machine-learning based
      technologies to automate network operation.

   BMP:  BGP Monitoring Protocol

   DNP:  Dynamic Network Probe

   DPI:  Deep Packet Inspection

   gNMI:  gRPC Network Management Interface

   gRPC:  gRPC Remote Procedure Call

   IDN:  Intent-Driven Network

   IPFIX:  IP Flow Information Export Protocol

   IPFPM:  IP Flow Performance Measurement

   IOAM:  In-situ OAM

   NETCONF:  Network Configuration Protocol

   Network Telemetry:  A general term for a new breed of network
      visibility techniques and protocols, with the characteristics
      defined in this document. Network telemetry enables smooth
      evolution toward intent-driven autonomous networks.

   NMS:  Network Management System

   OAM:  Operations, Administration, and Maintenance. A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions. Most conventional network monitoring
      techniques and protocols belong to network OAM.

   SNMP:  Simple Network Management Protocol

   YANG:  A data modeling language for NETCONF

   YANG FSM:  A YANG model to define device side finite state machine

   YANG PUSH:  A method to subscribe to data pushed from a remote YANG
      datastore

2.4.  Network Telemetry

   Network telemetry has emerged as a mainstream technical term to refer
   to the newer data collection and consumption techniques,
   distinguishing itself from the conventional techniques for network
   OAM. The representative techniques and protocols include IPFIX
   [RFC7011] and gRPC [I-D.kumar-rtgwg-grpc-protocol]. It is expected
   that network telemetry can provide the necessary network visibility
   for autonomous networks, address the shortcomings of conventional OAM
   techniques, and allow for the emergence of new techniques bearing
   certain characteristics.

   One difference between network telemetry and network OAM is that
   network telemetry assumes machines as data consumers, while
   conventional network OAM assumes human operators. Hence, network
   telemetry can directly trigger automated network operation, but the
   conventional OAM tools only help human operators to monitor and
   diagnose the networks and guide manual network operations. The
   difference leads to very different techniques.

   Although the network telemetry techniques are just emerging and
   subject to continuous evolution, several defining characteristics of
   network telemetry have been well accepted:

   o  Push and Streaming: Instead of polling data from network devices,
      the telemetry collector subscribes to the streaming data pushed
      from data sources in network devices.

   o  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by humans. Therefore, the data volume is
      huge and the processing is often in real time.

   o  Normalization and Unification: Telemetry aims to address the
      overall network automation needs. The piecemeal solutions offered
      by the conventional OAM approach are no longer suitable. Efforts
      need to be made to normalize the data representation and unify the
      protocols.

   o  Model-based: The telemetry data is modeled in advance, which
      allows applications to configure and consume data with ease.

   o  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross-domain, cross-device, and
      cross-layer) and needs to be correlated to take effect.

   o  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.

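
   The "Push and Streaming" characteristic can be illustrated with a
   minimal in-process sketch. It is not taken from any protocol named in
   this document: the DataSource and Collector classes, the device name
   "linecard-1", and the queue-depth values are all hypothetical and
   serve only to show a collector that subscribes once and then receives
   pushed samples instead of polling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
Callback = Callable[[str, Sample], None]


@dataclass
class DataSource:
    """Hypothetical data source inside a network device."""
    name: str
    subscribers: List[Callback] = field(default_factory=list)

    def subscribe(self, callback: Callback) -> None:
        """Register a collector callback; no polling is involved."""
        self.subscribers.append(callback)

    def emit(self, sample: Sample) -> None:
        """Push a new sample to every subscriber as soon as it exists."""
        for callback in self.subscribers:
            callback(self.name, sample)


@dataclass
class Collector:
    """Hypothetical collector that only reacts to pushed data."""
    received: List[Tuple[str, Sample]] = field(default_factory=list)

    def on_sample(self, source: str, sample: Sample) -> None:
        self.received.append((source, sample))


collector = Collector()
source = DataSource("linecard-1")
source.subscribe(collector.on_sample)

# The device pushes three made-up queue-depth samples.
for depth in (10.0, 12.0, 95.0):
    source.emit({"queue_depth": depth})
```

   In a real deployment the callback boundary would be a network
   transport and an agreed encoding; the sketch only captures the
   direction of the data flow.
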
   Note that a technique does not need to have all the above
   characteristics to be qualified as telemetry. An ideal network
   telemetry solution may also have the following features or
   properties:

   o  In-Network Customization: The data can be customized in network at
      run-time to cater to the specific need of applications. This
      needs the support of a programmable data plane which allows probes
      to be deployed at flexible locations.

   o  Direct Data Plane Export: The data originated from data plane can
      be directly exported to the data consumer for efficiency,
      especially when the data bandwidth is large and real-time
      processing is required.

   o  In-band Data Collection: In addition to the passive and active
      data collection approaches, the new hybrid approach allows data to
      be collected directly for any target flow along its entire
      forwarding path.

   o  Non-intrusive: The telemetry system should avoid the pitfall of
      the "observer effect". That is, it should not change the network
      behavior or affect the forwarding performance.

3.  The Necessity of a Network Telemetry Framework

   Big data analytics and machine-learning based AI technologies are
   applied for network operation automation, relying on abundant data
   from networks. The single-sourced and static data acquisition cannot
   meet the data requirements. It is desirable to have a framework that
   integrates multiple telemetry approaches from different layers. This
   allows flexible combinations for different applications. The
   framework would benefit application development for the following
   reasons:

   o  The future autonomous networks will require a holistic view on
      network visibility. All the use cases and applications need to be
      supported uniformly and coherently under a single intelligent
      agent. Therefore, the protocols and mechanisms should be
      consolidated into a minimum yet comprehensive set.
      A telemetry framework can help to normalize the technique
      developments.

   o  Network visibility presents multiple viewpoints. For example, the
      device viewpoint takes the network infrastructure as the
      monitoring object from which the network topology and device
      status can be acquired; the traffic viewpoint takes the flows or
      packets as the monitoring object from which the traffic quality
      and path can be acquired. An application may need to switch its
      viewpoint during operation. It may also need to correlate a
      service with its impact on network experience to acquire
      comprehensive information.

   o  Applications require network telemetry to be elastic in order to
      efficiently use the network resource and reduce the performance
      impact. Routine network monitoring covers the entire network with
      a low data sampling rate. When issues arise or trends emerge, the
      telemetry data source can be modified and the data rate can be
      boosted.

   o  Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

   So far, some telemetry related work has been done within the IETF.
   However, the work is fragmented and scattered in different working
   groups. The lack of coherence makes it difficult to assemble a
   comprehensive network telemetry system and causes repetitive and
   redundant work.

   A formal network telemetry framework is needed for constructing a
   working system. The framework should cover the concepts and
   components from the standardization perspective. This document
   clarifies the layers on which the telemetry is exerted and decomposes
   the telemetry system into a set of distinct components that the
   existing and future work can easily map to.

4.  Network Telemetry Framework

   Telemetry can be applied on the forwarding plane, the control plane,
   and the management plane in a network, as well as other sources
   outside the network, as shown in Figure 1. Therefore, we categorize
   the network telemetry into four distinct modules.

          +------------------------------+
          |                              |
          |       Network Operation      |<-------+
          |          Applications        |        |
          |                              |        |
          +------------------------------+        |
                ^     ^           ^               |
                |     |           |               |
                V     |           V               V
          +-----------|---+--------------+  +-----------+
          |           |   |              |  |           |
          | Control Pl|ane|              |  | External  |
          | Telemetry | <--->            |  | Data and  |
          |           |   |              |  | Event     |
          |      ^    V   |  Management  |  | Telemetry |
          +------|--------+  Plane       |  |           |
          |      V        |  Telemetry   |  +-----------+
          | Forwarding    |              |
          | Plane       <--->            |
          | Telemetry     |              |
          |               |              |
          +---------------+--------------+

      Figure 1: Layer Category of the Network Telemetry Framework

   The rationale of this partition lies in the different telemetry data
   objects which result in different data source and export locations.
   Such differences have profound implications on in-network data
   programming and processing capability, data encoding and transport
   protocol, and data bandwidth and latency.

   We summarize the major differences of the four modules in the
   following table:

   +---------+--------------+--------------+--------------+----------+
   | Module  | Control      | Management   | Forwarding   | External |
   |         | Plane        | Plane        | Plane        | Data     |
   +---------+--------------+--------------+--------------+----------+
   |Object   | control      | config. &    | flow & packet| terminal,|
   |         | protocol &   | operation    | QoS, traffic | social & |
   |         | signaling,   | state, MIB   | stat., buffer| environ- |
   |         | RIB, ACL     |              | & queue stat.| mental   |
   +---------+--------------+--------------+--------------+----------+
   |Export   | main control | main control | fwding chip  | various  |
   |Location | CPU,         | CPU          | or linecard  |          |
   |         | linecard CPU |              | CPU; main    |          |
   |         | or fwding    |              | control CPU  |          |
   |         | chip         |              | unlikely     |          |
   +---------+--------------+--------------+--------------+----------+
   |Model    | YANG,        | MIB, syslog, | template,    | TBD      |
   |         | custom       | YANG,        | YANG,        |          |
   |         |              | custom       | custom       |          |
   +---------+--------------+--------------+--------------+----------+
   |Encoding | GPB, JSON,   | GPB, JSON,   | plain        | TBD      |
   |         | XML, plain   | XML          |              |          |
   +---------+--------------+--------------+--------------+----------+
   |Protocol | gRPC,NETCONF,| gRPC,NETCONF | IPFIX, mirror| TBD      |
   |         | IPFIX,mirror |              |              |          |
   +---------+--------------+--------------+--------------+----------+
   |Transport| HTTP, TCP,   | HTTP, TCP    | UDP          | TCP, UDP |
   |         | UDP          |              |              |          |
   +---------+--------------+--------------+--------------+----------+

      Figure 2: Layer Category of the Network Telemetry Framework

   Note that the interaction with the network operation applications can
   be indirect. For example, in the management plane telemetry, the
   management plane may need to acquire data from the data plane. Some
   of the operational states can only be derived from the data plane
   such as the interface status and statistics. For another example,
   the control plane telemetry may need to access the FIB in data plane.
   On the other hand, an application may involve more than one plane
   simultaneously. For example, an SLA compliance application may
   require both the data plane telemetry and the control plane
   telemetry.

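
   The Protocol and Transport rows of the table above can also be read
   as a small lookup structure. The following sketch is only an
   illustration: the MODULES dictionary and the modules_supporting()
   helper are hypothetical names, and the "TBD" protocol entry for
   external data is represented as an empty set.

```python
from typing import Dict, List, Set

# Protocol and transport rows transcribed from the module table.
MODULES: Dict[str, Dict[str, Set[str]]] = {
    "control plane": {
        "protocol": {"gRPC", "NETCONF", "IPFIX", "mirror"},
        "transport": {"HTTP", "TCP", "UDP"},
    },
    "management plane": {
        "protocol": {"gRPC", "NETCONF"},
        "transport": {"HTTP", "TCP"},
    },
    "forwarding plane": {
        "protocol": {"IPFIX", "mirror"},
        "transport": {"UDP"},
    },
    "external data": {
        "protocol": set(),  # listed as TBD in the table
        "transport": {"TCP", "UDP"},
    },
}


def modules_supporting(attribute: str, value: str) -> List[str]:
    """Return the modules whose table row for `attribute` lists `value`."""
    return sorted(
        name for name, attrs in MODULES.items() if value in attrs[attribute]
    )


ipfix_modules = modules_supporting("protocol", "IPFIX")
udp_modules = modules_supporting("transport", "UDP")
```

   An application that spans more than one plane, such as the SLA
   compliance example, could use a lookup of this kind to find which
   export protocols it has to speak.
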
   At each plane, the telemetry can be further partitioned into five
   distinct components:

   Data Query, Analysis, and Storage:  This component works at the
      application layer. On the one hand, it is responsible for issuing
      data queries. The queries can be for modeled data through
      configuration or custom data through programming. The queries can
      be one-shot requests or subscriptions for events or streaming
      data. On the other hand, it receives, stores, and processes the
      returned data from network devices. Data analysis can be
      interactive to initiate further data queries.

   Data Configuration and Subscription:  This component deploys data
      queries on devices. It determines the protocol and channel for
      applications to acquire desired data. This component is also
      responsible for configuring the desired data that might not be
      directly available from data sources. The subscription data can
      be described by models, templates, or programs.

   Data Encoding and Export:  This component determines how telemetry
      data are delivered to the data analysis and storage component.
      The data encoding and the transport protocol may vary due to the
      data exporting location.

   Data Generation and Processing:  The requested data needs to be
      captured, processed, and formatted in network devices from raw
      data sources. This may involve in-network computing and
      processing on either the fast path or the slow path in network
      devices.

   Data Object and Source:  This component determines the monitoring
      object and original data source. The data source usually just
      provides raw data which needs further processing. A data source
      can be considered a probe. A probe can be statically installed or
      dynamically installed.

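
   As a rough illustration of how the five components described above
   hand data to one another, the following sketch chains them as plain
   functions. All names, the JSON encoding choice, the "queue_depth"
   metric, and the sample values are assumptions made for this example
   only; the framework itself does not prescribe any of them.

```python
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class Subscription:
    """Data Configuration and Subscription: what the application wants."""
    metric: str
    threshold: float


def data_source() -> Iterator[Dict[str, float]]:
    """Data Object and Source: raw, unprocessed samples (made-up values)."""
    yield from (
        {"queue_depth": 10.0},
        {"queue_depth": 80.0},
        {"queue_depth": 95.0},
    )


def generate(raw: Iterable[Dict[str, float]],
             sub: Subscription) -> Iterator[Dict[str, float]]:
    """Data Generation and Processing: keep samples above the threshold."""
    for sample in raw:
        if sample.get(sub.metric, 0.0) > sub.threshold:
            yield {sub.metric: sample[sub.metric]}


def export(records: Iterable[Dict[str, float]]) -> List[str]:
    """Data Encoding and Export: encode each record (JSON assumed here)."""
    return [json.dumps(record, sort_keys=True) for record in records]


def query_and_store(sub: Subscription) -> List[Dict[str, float]]:
    """Data Query, Analysis, and Storage: issue the query, keep results."""
    messages = export(generate(data_source(), sub))
    return [json.loads(message) for message in messages]


stored = query_and_store(Subscription(metric="queue_depth", threshold=50.0))
```

   The filtering step runs before export, which mirrors the idea that
   processing close to the data source reduces the amount of data that
   has to be encoded and transported.
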
577 +----------------------------------------+ 578 | | 579 | Data Query, Analysis, & Storage | 580 | | 581 +----------------------------------------+ 582 | ^ 583 | | 584 V | 585 +---------------------+------------------+ 586 | Data Configuration | | 587 | & Subscription | Data Encoding | 588 | (model, template, | & Export | 589 | & program) | | 590 +---------------------+------------------| 591 | | 592 | Data Generation | 593 | & Processing | 594 | | 595 +----------------------------------------| 596 | | 597 | Data Object and Source | 598 | | 599 +----------------------------------------+ 601 Figure 3: Components in the Network Telemetry Framework 603 Since most existing standard-related work belongs to the first four 604 components, in the remainder of the document, we focus on these 605 components only. 607 4.1. Existing Works Mapped in the Framework 609 The following table provides a non-exhaustive list of existing works 610 (mainly published in IETF and with the emphasis on the latest new 611 technologies) and shows their positions in the framework. The 612 details about the mentioned work can be found in Appendix A. 614 +--------------+---------------+----------------+---------------+ 615 | | Management | Control | Forwardidng | 616 | | Plane | Plane | Plane | 617 +--------------+---------------+----------------+---------------+ 618 | data Config. | gRPC, NETCONF,| NETCONF/YANG | NETCONF/YANG, | 619 | & subscrib. | YANG PUSH | | YANG FSM | 620 +--------------+---------------+----------------+---------------+ 621 | data gen. & | DNP, | DNP, | In-situ OAM, | 622 | processing | YANG | YANG | PBT, IPFPM, | 623 | | | | DNP | 624 +--------------+---------------+----------------+---------------+ 625 | data | gRPC, NETCONF | BMP, NETCONF | IPFIX | 626 | export | YANG PUSH | | | 627 +--------------+---------------+----------------+---------------+ 629 Figure 4: Existing Work 631 5. 
Evolution of Network Telemetry 633 As the network is evolving towards the automated operation, network 634 telemetry also undergoes several levels of evolution. 636 Level 0 - Static Telemetry: The telemetry data is determined at 637 design time. The network operator can only configure how to use 638 it with limited flexibility. 640 Level 1 - Dynamic Telemetry: The telemetry data can be dynamically 641 programmed or configured at runtime, allowing a tradeoff among 642 resource, performance, flexibility, and coverage. DNP is an 643 effort towards this direction. 645 Level 2 - Interactive Telemetry: The network operator can 646 continuously customize the telemetry data in real time to reflect 647 the network operation's visibility requirements. At this level, 648 some tasks can be automated, although ultimately human operators 649 will still need to sit in the middle to make decisions. 651 Level 3 - Closed-loop Telemetry: Human operators are completely 652 excluded from the control loop. The intelligent network operation 653 engine automatically issues the telemetry data request, analyzes 654 the data, and updates the network operations in closed control 655 loops. 657 While most of the existing technologies belong to level 0 and level 658 1, with the help of a clearly defined network telemetry framework, we 659 can assemble the technologies to support level 2 and make solid steps 660 towards level 3. 662 6. Security Considerations 664 TBD 666 7. IANA Considerations 668 This document includes no request to IANA. 670 8. Acknowledgments 672 We would like to thank Adrian Farrel, Randy Presuhn, Vi-->ctor Liu, 673 James Guichard, Uri Blumenthal, Giuseppe Fioccola, Daniel King, Yunan 674 Gu, and many others who have provided helpful comments and 675 suggestions to improve this document. 677 9. References 679 9.1. 
Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

9.2.  Informative References

   [I-D.brockners-inband-oam-requirements]
              Brockners, F., Bhandari, S., Dara, S., Pignataro, C.,
              Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi,
              T., Lapukhov, P., and r. remy@barefootnetworks.com,
              "Requirements for In-situ OAM",
              draft-brockners-inband-oam-requirements-03 (work in
              progress), March 2017.

   [I-D.fioccola-ippm-multipoint-alt-mark]
              Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto,
              "Multipoint Alternate Marking method for passive and
              hybrid performance monitoring",
              draft-fioccola-ippm-multipoint-alt-mark-04 (work in
              progress), June 2018.

   [I-D.ietf-grow-bmp-adj-rib-out]
              Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S.
              Zhuang, "Support for Adj-RIB-Out in BGP Monitoring
              Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-02 (work
              in progress), September 2018.

   [I-D.ietf-grow-bmp-local-rib]
              Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
              "Support for Local RIB in BGP Monitoring Protocol (BMP)",
              draft-ietf-grow-bmp-local-rib-02 (work in progress),
              September 2018.

   [I-D.ietf-netconf-udp-pub-channel]
              Zheng, G., Zhou, T., and A. Clemm, "UDP based Publication
              Channel for Streaming Telemetry",
              draft-ietf-netconf-udp-pub-channel-04 (work in progress),
              October 2018.

   [I-D.ietf-netconf-yang-push]
              Clemm, A., Voit, E., Prieto, A., Tripathy, A.,
              Nilsen-Nygaard, E., Bierman, A., and B. Lengyel,
              "Subscription to YANG Datastores",
              draft-ietf-netconf-yang-push-20 (work in progress),
              October 2018.

   [I-D.kumar-rtgwg-grpc-protocol]
              Kumar, A., Kolhe, J., Ghemawat, S., and L.
              Ryan, "gRPC Protocol",
              draft-kumar-rtgwg-grpc-protocol-00 (work in progress),
              July 2016.

   [I-D.openconfig-rtgwg-gnmi-spec]
              Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack,
              C., and C. Morrow, "gRPC Network Management Interface
              (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in
              progress), March 2018.

   [I-D.pedro-nmrg-anticipated-adaptation]
              Martinez-Julia, P., "Exploiting External Event Detectors
              to Anticipate Resource Requirements for the Elastic
              Adaptation of SDN/NFV Systems",
              draft-pedro-nmrg-anticipated-adaptation-02 (work in
              progress), June 2018.

   [I-D.song-opsawg-dnp4iq]
              Song, H. and J. Gong, "Requirements for Interactive Query
              with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01
              (work in progress), June 2017.

   [I-D.zhou-netconf-multi-stream-originators]
              Zhou, T., Zheng, G., Voit, E., Clemm, A., and A. Bierman,
              "Subscription to Multiple Stream Originators",
              draft-zhou-netconf-multi-stream-originators-03 (work in
              progress), October 2018.

   [RFC1157]  Case, J., Fedor, M., Schoffstall, M., and J. Davin,
              "Simple Network Management Protocol (SNMP)", RFC 1157,
              DOI 10.17487/RFC1157, May 1990.

   [RFC2981]  Kavasseri, R., Ed., "Event MIB", RFC 2981,
              DOI 10.17487/RFC2981, October 2000.

   [RFC3416]  Presuhn, R., Ed., "Version 2 of the Protocol Operations
              for the Simple Network Management Protocol (SNMP)",
              STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002.

   [RFC3877]  Chisholm, S. and D. Romascanu, "Alarm Management
              Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
              September 2004.

   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
              Zekauskas, "A One-way Active Measurement Protocol
              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
              and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013.

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014.

   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
              DOI 10.17487/RFC7540, May 2015.

   [RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with
              Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799,
              May 2016.

   [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
              Monitoring Protocol (BMP)", RFC 7854,
              DOI 10.17487/RFC7854, June 2016.

   [RFC8321]  Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli,
              L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi,
              "Alternate-Marking Method for Passive and Hybrid
              Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321,
              January 2018.

Appendix A.  A Survey on Existing Network Telemetry Techniques

   We provide an overview of the challenges and existing solutions for
   each network telemetry module.

A.1.  Management Plane Telemetry

A.1.1.
Requirements and Challenges

   The management plane of the network element interacts with the
   Network Management System (NMS), and provides information such as
   performance data, network logging data, network warning and defect
   data, and network statistics and state data.  Some legacy protocols
   are widely used for the management plane, such as SNMP and Syslog.
   However, these protocols are insufficient to meet the requirements of
   automatic network operation applications.

   New management plane telemetry protocols should consider the
   following requirements:

   Convenient Data Subscription: An application should have the freedom
      to choose the data export means, such as the data types and the
      export frequency.

   Structured Data: For automatic network operation, machines will
      replace humans in comprehending network data.  Schema languages
      such as YANG can efficiently describe structured data and
      normalize data encoding and transformation.

   High Speed Data Transport: In order to retain the information, a
      server needs to send a large amount of data at high frequency.
      Compact encoding formats are needed to compress the data and
      improve the data transport efficiency.  Replacing the poll mode
      with the push mode also reduces the interactions between clients
      and servers, which helps to improve the server's efficiency.

A.1.2.  Push Extensions for NETCONF

   NETCONF [RFC6241] is a popular network management protocol, which
   is also recommended by the IETF.  Although it can be used for data
   collection, NETCONF is best suited to configuration.  YANG Push
   [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber
   applications to request a continuous, customized stream of updates
   from a YANG datastore.
   Providing such visibility into changes made to YANG configuration
   and operational objects enables new capabilities based on the remote
   mirroring of configuration and operational state.  Moreover, a
   distributed data collection mechanism
   [I-D.zhou-netconf-multi-stream-originators] via a UDP-based
   publication channel [I-D.ietf-netconf-udp-pub-channel] provides
   enhanced efficiency for NETCONF-based telemetry.

A.1.3.  gRPC Network Management Interface

   gRPC Network Management Interface (gNMI)
   [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol
   based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote
   Procedure Call) framework.  With a single gRPC service definition,
   both configuration and telemetry can be covered.  gRPC is an
   open-source microservice communication framework based on HTTP/2
   [RFC7540].  It provides a number of capabilities that are well
   suited for network telemetry, including:

   o  A full-duplex streaming transport model combined with a binary
      encoding mechanism, which further improves telemetry efficiency.

   o  Consistent higher-level features across platforms, which common
      HTTP/2 libraries typically do not provide.  This characteristic
      is especially valuable because telemetry data collectors normally
      reside on a large variety of platforms.

   o  Built-in load-balancing and failover mechanisms.

A.2.  Control Plane Telemetry

A.2.1.  Requirements and Challenges

   Control plane telemetry refers to monitoring the health of different
   network protocols, covering Layer 2 to Layer 7.  Keeping track of
   the running status of these protocols helps to detect, localize, and
   even predict various network issues, and to optimize the network, in
   real time and at fine granularity.

   One of the most challenging problems for control plane telemetry is
   how to correlate the E2E Key Performance Indicators (KPIs) to a
   specific layer's KPIs.  For example, an IPTV user may describe their
   User Experience (UE) in terms of video smoothness and definition.
   In case of an unusually poor UE KPI or a service disconnection, it
   is non-trivial to delimit and localize the issue to the responsible
   protocol layer (e.g., the Transport Layer or the Network Layer), the
   responsible protocol (e.g., IS-IS or BGP at the Network Layer), and
   finally the responsible device(s) with specific reasons.

   Traditional OAM-based approaches for control plane KPI measurement
   include Ping (L3), Traceroute (L3), and Y.1731 (L2).  One common
   issue behind these methods is that they only measure the KPIs
   instead of reflecting the actual running status of these protocols,
   making them less effective or efficient for control plane
   troubleshooting and network optimization.  An example of control
   plane telemetry is the BGP Monitoring Protocol (BMP), which is
   currently used to monitor BGP routes and enables rich applications,
   such as BGP peer analysis, AS analysis, prefix analysis, and
   security analysis.  However, the monitoring of other layers and
   protocols and the cross-layer, cross-protocol KPI correlations are
   still in their infancy (e.g., IGP monitoring is missing) and require
   substantial further research.

A.2.2.  BGP Monitoring Protocol

   BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP
   sessions and is intended to provide a convenient interface for
   obtaining route views.

   BGP routing information is sent from the monitored device(s) to the
   BMP monitoring station over a BMP TCP session.  The BGP peers are
   monitored through the BMP Peer Up and Peer Down Notifications.
   The BGP routes (including Adj-RIB-In [RFC7854], Adj-RIB-Out
   [I-D.ietf-grow-bmp-adj-rib-out], and Local RIB
   [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
   Monitoring Message and the BMP Route Mirroring Message, in the form
   of both an initial table dump and real-time route updates.  In
   addition, BGP statistics are reported through the BMP Stats Report
   Message, which can be either timer-triggered or event-driven.  More
   BMP extensions can be explored to enrich the applications of BGP
   monitoring.

A.3.  Data Plane Telemetry

A.3.1.  Requirements and Challenges

   An effective data plane telemetry system relies on the data that the
   network device can expose.  The data's quality, quantity, and
   timeliness must meet some stringent requirements.  This raises some
   challenges for the network data plane devices where the firsthand
   data originate.

   o  A data plane device's main function is user traffic processing and
      forwarding.  While supporting network visibility is important,
      telemetry is just an auxiliary function, and it should not impede
      normal traffic processing and forwarding (i.e., the performance is
      not lowered and the behavior is not altered due to the telemetry
      functions).

   o  Network operation applications require end-to-end visibility
      from various sources, which results in a huge volume of data.
      However, the sheer data quantity should not stress the network
      bandwidth, regardless of the data delivery approach (i.e., through
      in-band or out-of-band channels).

   o  The data plane devices must provide timely data with the minimum
      possible delay.  Long processing, transport, storage, and analysis
      delays can impact the effectiveness of the control loop and even
      render the data useless.

   o  The data should be structured and labeled, and easy for
      applications to parse and consume.
      At the same time, the data types needed by applications can vary
      significantly.  The data plane devices need to provide enough
      flexibility and programmability to support the precise data
      provision for applications.

   o  The data plane telemetry should support incremental deployment and
      work even when some devices are unaware of the system.  This
      challenge is highly relevant to standards and legacy networks.

   The industry broadly agrees that data plane programmability is
   essential to support network telemetry.  Newer data plane chips are
   increasingly equipped with advanced telemetry features and provide
   flexibility to support customized telemetry functions.

A.3.2.  Technique Taxonomy

   The data plane telemetry techniques can be classified along multiple
   dimensions.

   Active and Passive: The active and passive methods (as well as the
      hybrid types) are well documented in [RFC7799].  The passive
      methods include tcpdump, IPFIX [RFC7011], sFlow, and traffic
      mirroring.  These methods usually have low data coverage, and
      improving the coverage comes at a very high bandwidth cost.  On
      the other hand, the active methods include Ping, Traceroute,
      OWAMP [RFC4656], and TWAMP [RFC5357].  These methods are intrusive
      and only provide indirect network measurement results.  The hybrid
      methods, including in-situ OAM
      [I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and
      Multipoint Alternate Marking
      [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced
      and more flexible approach.  However, these methods are also more
      complex to implement.

   In-Band and Out-of-Band: The telemetry data, before being exported
      to some collector, can be carried in user packets.  Such methods
      are considered in-band (e.g., in-situ OAM
      [I-D.brockners-inband-oam-requirements]).
      If the telemetry data is directly exported to some collector
      without modifying the user packets, such methods are considered
      out-of-band (e.g., postcard-based INT).  Hybrid methods are also
      possible.  For example, only the telemetry instruction or partial
      data is carried by user packets (e.g., IPFPM [RFC8321]).

   E2E and In-Network: E2E methods start from and end at the network
      end hosts (e.g., Ping).  In-network methods work inside the
      network and are transparent to end hosts.  However, if needed,
      in-network methods can be easily extended to end hosts.

   Flow, Path, and Node: Depending on the telemetry objective, the
      methods can be flow-based (e.g., in-situ OAM
      [I-D.brockners-inband-oam-requirements]), path-based (e.g.,
      Traceroute), and node-based (e.g., IPFIX [RFC7011]).

A.3.3.  The IPFPM technology

   The Alternate Marking method efficiently performs packet loss,
   delay, and jitter measurements in both IP and overlay networks, as
   presented in IPFPM [RFC8321] and
   [I-D.fioccola-ippm-multipoint-alt-mark].

   This technique can be applied to point-to-point and
   multipoint-to-multipoint flows.  Alternate Marking creates batches
   of packets by alternating the value of 1 bit (or a label) of the
   packet header.  These batches of packets are unambiguously
   recognized over the network, and comparing the packet counters for
   each batch allows the packet loss to be calculated.  The same idea
   can be applied to delay measurement by selecting ad hoc packets with
   a marking bit dedicated to delay measurements.

   The Alternate Marking method needs two counters per marking period
   for each monitored flow.  For instance, with n measurement points
   and m monitored flows, the number of packet counters for each time
   interval is on the order of n*m*2 (one per color).
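
   The batch comparison and the counter arithmetic described above can
   be illustrated with a minimal sketch.  It assumes that the color is
   switched after a fixed number of packets rather than on a timer, and
   the names MeasurementPoint, color_for, batch_loss, and counter_count
   are hypothetical helpers introduced only for illustration; none of
   them comes from [RFC8321].

```python
from collections import defaultdict
from dataclasses import dataclass, field

COLORS = (0, 1)  # the two values of the single marking bit


@dataclass
class MeasurementPoint:
    """Hypothetical counter store: one counter per (flow, color)."""
    counters: dict = field(default_factory=lambda: defaultdict(int))

    def observe(self, flow_id: str, color: int) -> None:
        # Count one packet of the given flow carrying the given color.
        self.counters[(flow_id, color)] += 1

    def read_and_reset(self, flow_id: str, color: int) -> int:
        # Return the counter for a closed batch and clear it for reuse.
        return self.counters.pop((flow_id, color), 0)


def color_for(packet_index: int, batch_size: int) -> int:
    # Assumption: the marking bit flips every batch_size packets.
    return (packet_index // batch_size) % 2


def batch_loss(upstream: MeasurementPoint, downstream: MeasurementPoint,
               flow_id: str, color: int) -> int:
    # Loss for one batch is the difference between the packets counted
    # upstream and downstream. Read only after the batch has closed,
    # that is, while the other color is being transmitted.
    sent = upstream.read_and_reset(flow_id, color)
    received = downstream.read_and_reset(flow_id, color)
    return sent - received


def counter_count(n_points: int, m_flows: int) -> int:
    # n measurement points * m flows * 2 colors, as stated in the text.
    return n_points * m_flows * len(COLORS)
```

   For example, with 3 measurement points and 4 monitored flows,
   counter_count(3, 4) yields 24 counters per time interval.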

   Since networks offer rich sets of network performance measurement
   data (e.g., packet counters), traditional approaches run into
   limitations.  One reason is that the bottleneck lies in the
   generation and export of the data and in the amount of data that can
   reasonably be collected from the network.  In addition, management
   tasks related to determining and configuring which data to generate
   lead to significant deployment challenges.

   The Multipoint Alternate Marking approach, described in
   [I-D.fioccola-ippm-multipoint-alt-mark], aims to resolve this issue
   and makes performance monitoring more flexible when a detailed
   analysis is not needed.

   An application orchestrates network performance measurement tasks
   across the network to optimize monitoring.  It can calibrate how
   detailed the monitoring data obtained from the network is by
   configuring the measurement points coarsely or finely.

   Using Alternate Marking, it is possible to monitor a multipoint
   network without examining it in depth by using network clustering,
   where clusters are subnetworks that preserve the same property as
   the entire network.  If there is packet loss or the delay is too
   high, the filtering criteria can be refined to perform a more
   detailed analysis by using a different combination of clusters, up
   to a per-flow measurement as described in IPFPM [RFC8321].

   In summary, an application can configure end-to-end network
   monitoring.  If the network does not experience issues, this
   approximate monitoring is good enough and is very cheap in terms of
   network resources.  However, in case of problems, the application
   becomes aware of the issues from this approximate monitoring and, in
   order to localize the portion of the network that has issues,
   configures the measurement points more exhaustively.
   A new, detailed monitoring is then performed.  After the problem is
   detected and resolved, the initial approximate monitoring can be
   used again.

A.3.4.  Dynamic Network Probe

   Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq]
   provides a programmable means to customize the data that an
   application collects from the data plane.  A direct benefit of DNP is
   the reduction of the exported data.  A full DNP solution covers
   several components, including data source, data subscription, and
   data generation.  The data subscription needs to define the custom
   data, which can be composed and derived from the raw data sources.
   The data generation takes advantage of moderate in-network computing
   to produce the desired data.

   While DNP can bring great flexibility to data plane telemetry, it
   also faces some challenges.  It requires a flexible data plane that
   can be dynamically reprogrammed at runtime.  The programming API is
   yet to be defined.

A.3.5.  IP Flow Information Export (IPFIX) protocol

   Traffic on a network can be seen as a set of flows passing through
   network elements.  IP Flow Information Export (IPFIX) [RFC7011]
   provides a means of transmitting traffic flow information for
   administrative or other purposes.  A typical IPFIX-enabled system
   includes a pool of Metering Processes that collect data packets at
   one or more Observation Points, optionally filter them, and
   aggregate information about these packets.  An Exporter then gathers
   each of the Observation Points together into an Observation Domain
   and sends this information via the IPFIX protocol to a Collector.

A.3.6.  In-Situ OAM

   Traditional passive and active monitoring and measurement techniques
   are either inaccurate or resource-consuming.
   It is preferable to directly acquire data associated with a flow's
   packets when the packets pass through a network.  In-situ OAM (iOAM)
   [I-D.brockners-inband-oam-requirements], a data generation technique,
   embeds a new instruction header into user packets, and the
   instruction directs the network nodes to add the requested data to
   the packets.  Thus, at the path end, the packet's experience gained
   on the entire forwarding path can be collected.  Such firsthand data
   is invaluable to many network OAM applications.

   However, iOAM also faces some challenges.  Issues of performance
   impact, security, scalability and overhead limits, encapsulation
   difficulties in some protocols, and cross-domain deployment need to
   be addressed.

A.4.  External Data and Event Telemetry

   Events that occur outside the boundaries of the network system are
   another important source of telemetry information.  Correlating both
   internal telemetry data and external events with the requirements of
   network systems, as presented in Exploiting External Event Detectors
   to Anticipate Resource Requirements for the Elastic Adaptation of
   SDN/NFV Systems [I-D.pedro-nmrg-anticipated-adaptation], provides a
   strategic and functional advantage to management operations.

A.4.1.  Requirements and Challenges

   As with other sources of telemetry information, the data and events
   must meet strict requirements, especially in terms of timeliness,
   which is essential to properly incorporate external event information
   into management cycles.  The specific challenges are described as
   follows:

   o  The role of external event detector can be played by multiple
      elements, including hardware (e.g., physical sensors, such as
      seismometers) and software (e.g., Big Data sources that analyze
      streams of information, such as Twitter messages).
      Thus, the transmitted data must support different shapes but, at
      the same time, follow a common but extensible ontology.

   o  Since the main function of the external event detectors is to
      issue notifications, their timeliness is assumed.  However, once
      messages have been dispatched, they must be quickly collected
      and inserted into the control plane with variable priority, which
      will be high for important sources and/or important events and low
      for secondary ones.

   o  The ontology used by external detectors must be easily adopted by
      current and future devices and applications.  Therefore, it must
      be easily mapped to current information models, such as those
      expressed in YANG.

   Organizing internal and external telemetry information together will
   be key for the general exploitation of the management possibilities
   of current and future network systems, as reflected in the
   incorporation of cognitive capabilities into new hardware and
   software (virtual) elements.

A.4.2.  Sources of External Events

   To ensure that the information provided by external event detectors
   and used by the network management solutions is meaningful for
   management purposes, the network telemetry framework must ensure that
   such detectors (sources) are easily connected to the management
   solutions (sinks).  This requires specifying a simple taxonomy of
   detectors and matching it to the connectors and/or interfaces
   required to connect them.

   Once detectors are classified in such a taxonomy, their definitions
   are extended with the qualities and other aspects used to handle them
   and are represented in the ontology and information model (e.g.,
   YANG).  Therefore, differentiating several types of detectors as
   potential sources of external events is essential for the integrity
   of the management framework.
   We thus differentiate the following source types of external events:

   o  Smart objects and sensors.  With the consolidation of the Internet
      of Things (IoT), any network system will have many smart objects
      attached to its physical surroundings and logical operation
      environments.  Most of these objects will be essentially based on
      sensors of many kinds (e.g., temperature, humidity, presence), and
      the information they provide can be very useful for the management
      of the network, even when they are not specifically deployed for
      that purpose.  Elements of this source type will usually provide a
      specific protocol for interaction, especially one of the
      protocols related to IoT, such as the Constrained Application
      Protocol (CoAP).  It will be used by the telemetry framework to
      interact with the relevant objects.

   o  Online news reporters.  Several online news services are able to
      provide an enormous quantity of information about different events
      occurring in the world.  Some of those events can impact the
      network system managed by a specific framework, which will
      therefore be interested in getting such information.  For
      instance, diverse security reports, such as the Common
      Vulnerabilities and Exposures (CVE), can be issued by the
      corresponding authority and used by the management solution to
      update the managed system if needed.  Instead of a specific
      protocol and data format, the sources of this kind of information
      usually follow a relaxed but structured format.  This format will
      be part of both the ontology and information model of the
      telemetry framework.

   o  Global event analyzers.  The advance of Big Data analyzers
      provides a huge amount of information and, more interestingly, the
      identification of events detected by analyzing many data streams
      from different origins.
      In contrast with the other types of sources, which are focused on
      specific events, the detectors of this source type will detect
      very generic events.  For example, a sports event takes place,
      some unexpected development makes it highly interesting, and many
      people connect to sites that are covering the event.  The systems
      supporting the services that cover the event can be affected by
      such a situation, so their management solutions should be aware of
      it.  In contrast with the other source types, a new information
      model, format, and reporting protocol are required to integrate
      the detectors of this type with the management solution.

   Additional detector types can be added to the system, but they will
   generally be the result of composing the properties offered by these
   main classes.  In any case, future revisions of the network
   telemetry framework will include the required types that cover new
   circumstances and that cannot be obtained by composition.

A.4.3.  Connectors and Interfaces

   To allow external event detectors to be properly integrated with
   other management solutions, both elements must expose interfaces and
   protocols that are subject to their particular objective.  Since
   external event detectors will be focused on providing their
   information to their main consumers, which generally will not be
   limited to the network management solutions, the framework must
   include the definition of the required connectors to ensure that the
   interconnection between detectors (sources) and their consumers
   within the management systems (sinks) is effective.

   In some situations, the interconnection between the external event
   detectors and the management system is via the management plane.
   For those situations there will be a special connector that provides
   the typical interfaces found in most other elements connected to the
   management plane.  For instance, the interfaces will comply with a
   specific information model (YANG) and a specific telemetry protocol,
   such as NETCONF, SNMP, or gRPC.

Authors' Addresses

   Haoyu Song (editor)
   Huawei
   2330 Central Expressway
   Santa Clara
   USA

   Email: haoyu.song@huawei.com


   Tianran Zhou
   Huawei
   156 Beiqing Road
   Beijing, 100095
   P.R. China

   Email: zhoutianran@huawei.com


   Zhenbin Li
   Huawei
   156 Beiqing Road
   Beijing, 100095
   P.R. China

   Email: lizhenbin@huawei.com


   Zhenqiang Li
   China Mobile
   No. 32 Xuanwumenxi Ave., Xicheng District
   Beijing, 100032
   P.R. China

   Email: lizhenqiang@chinamobile.com


   Pedro Martinez-Julia
   NICT
   4-2-1, Nukui-Kitamachi
   Koganei, Tokyo 184-8795
   Japan

   Email: pedro@nict.go.jp


   Laurent Ciavaglia
   Nokia
   Villarceaux 91460
   France

   Email: laurent.ciavaglia@nokia.com


   Aijun Wang
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   P.R. China

   Email: wangaj.bri@chinatelecom.cn