OPSAWG                                                           H. Song
Internet-Draft                                                 Futurewei
Intended status: Informational                                    F. Qin
Expires: 12 May 2022                                        China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                          Rakuten Mobile
                                                                 A. Wang
                                                           China Telecom
                                                         8 November 2021


                      Network Telemetry Framework
                        draft-ietf-opsawg-ntf-10

Abstract

   Network telemetry is a technology for gaining network insight and
   facilitating efficient and automated network management. It
   encompasses various techniques for remote data generation,
   collection, correlation, and consumption. This document describes an
   architectural framework for network telemetry, motivated by
   challenges that are encountered as part of the operation of networks
   and by the requirements that ensue. This document clarifies the
   terminologies and classifies the modules and components of a network
   telemetry system from different perspectives. The framework and
   taxonomy help to set a common ground for the collection of related
   work and provide guidance for related technique and standard
   developments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 12 May 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document. Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Glossary
   3.  Background
     3.1.  Telemetry Data Coverage
     3.2.  Use Cases
     3.3.  Challenges
     3.4.  Network Telemetry
     3.5.  The Necessity of a Network Telemetry Framework
   4.  Network Telemetry Framework
     4.1.  Top Level Modules
       4.1.1.  Management Plane Telemetry
       4.1.2.  Control Plane Telemetry
       4.1.3.  Forwarding Plane Telemetry
       4.1.4.  External Data Telemetry
     4.2.  Second Level Function Components
     4.3.  Data Acquisition Mechanism and Type Abstraction
     4.4.  Mapping Existing Mechanisms into the Framework
   5.  Evolution of Network Telemetry Applications
   6.  Security Considerations
   7.  IANA Considerations
   8.  Contributors
   9.  Acknowledgments
   10. Informative References
   Appendix A.  A Survey on Existing Network Telemetry Techniques
     A.1.  Management Plane Telemetry
       A.1.1.  Push Extensions for NETCONF
       A.1.2.  gRPC Network Management Interface
     A.2.  Control Plane Telemetry
       A.2.1.  BGP Monitoring Protocol
     A.3.  Data Plane Telemetry
       A.3.1.  The Alternate Marking (AM) technology
       A.3.2.  Dynamic Network Probe
       A.3.3.  IP Flow Information Export (IPFIX) Protocol
       A.3.4.  In-Situ OAM
       A.3.5.  Postcard Based Telemetry
       A.3.6.  Existing OAM for Specific Data Planes
     A.4.  External Data and Event Telemetry
       A.4.1.  Sources of External Events
       A.4.2.  Connectors and Interfaces
   Authors' Addresses

1.  Introduction

   Network visibility is the ability of management tools to see the
   state and behavior of a network, which is essential for successful
   network operation.
   Network Telemetry revolves around network data
   that can help provide insights about the current state of the
   network, including network devices, forwarding, control, and
   management planes, and that can be generated and obtained through a
   variety of techniques, including but not limited to network
   instrumentation and measurements, and that can be processed for
   purposes ranging from service assurance to network security using a
   wide variety of techniques including machine learning, data analysis,
   and correlation. In this document, Network Telemetry refers to both
   the data itself (i.e., "Network Telemetry Data") and the techniques
   and processes used to generate, export, collect, and consume that
   data for use by potentially automated management applications.
   Network telemetry extends beyond the historical network Operations,
   Administration, and Maintenance (OAM) techniques and is expected to
   support better flexibility, scalability, accuracy, coverage, and
   performance.

   However, the term "network telemetry" lacks an unambiguous
   definition. Its scope and coverage cause confusion and
   misunderstandings. It is beneficial to clarify the concept and
   provide a clear architectural framework for network telemetry, so we
   can articulate the technical field and better align the related
   techniques and standards work.

   To fulfill such an undertaking, we first discuss some key
   characteristics of network telemetry which set a clear distinction
   from the conventional network OAM and show that some conventional OAM
   technologies can be considered a subset of the network telemetry
   technologies. We then provide an architectural framework for network
   telemetry which includes four modules, each concerned with a
   different category of telemetry data and corresponding procedures.
   All the modules are internally structured in the same way, including
   components that configure data sources with regard to what
   data to generate and how to make it available to client
   applications, components that instrument the underlying data sources,
   and components that perform the actual rendering, encoding, and
   exporting of the generated data. We show how the network telemetry
   framework can benefit current and future network operations.
   Based on the distinction of modules and function components, we can
   map the existing and emerging techniques and protocols into the
   framework. The framework can also simplify the tasks of designing,
   maintaining, and understanding a network telemetry system. Finally,
   we outline the evolution stages of the network telemetry system and
   discuss the potential security concerns.

   The purpose of the framework and taxonomy is to set a common ground
   for the collection of related work and provide guidance for future
   technique and standard developments. To the best of our knowledge,
   this document is the first such effort for network telemetry in
   industry standards organizations.

2.  Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document. We intentionally differentiate between
   the terms network telemetry and OAM. However, it should be
   understood that there is no hard-line distinction between the two
   concepts. Rather, network telemetry is considered an extension of
   OAM. It covers all the existing OAM protocols but puts more emphasis
   on the newer and emerging techniques and protocols concerning all
   aspects of network data from acquisition to consumption.

   AI:  Artificial Intelligence. In the network domain, AI refers to the
      machine-learning based technologies for automated network
      operation and other tasks.

   AM:  Alternate Marking, a flow performance measurement method,
      specified in [RFC8321].

   BMP:  BGP Monitoring Protocol, specified in [RFC7854].

   DPI:  Deep Packet Inspection, referring to the techniques that
      examine packets beyond the L3/L4 headers.

   gNMI:  gRPC Network Management Interface, a network management
      protocol from OpenConfig Operator Working Group, mainly
      contributed by Google. See [gnmi] for details.

   GPB:  Google Protocol Buffer, an extensible mechanism for serializing
      structured data.

   gRPC:  gRPC Remote Procedure Call, an open source high performance
      RPC framework that gNMI is based on. See [grpc] for details.

   IPFIX:  IP Flow Information Export Protocol, specified in [RFC7011].

   IOAM:  In-situ OAM, a dataplane on-path telemetry technique.

   JSON:  An open standard file format and data interchange format that
      uses human-readable text to store and transmit data objects,
      specified in [RFC8259].

   MIB:  Management Information Base, a database used for managing the
      entities in a network.

   NETCONF:  Network Configuration Protocol, specified in [RFC6241].

   NetFlow:  A Cisco protocol for flow record collecting, described in
      [RFC3954].

   Network Telemetry:  The process and instrumentation for acquiring and
      utilizing network data remotely for network monitoring and
      operation. A general term for a large set of network visibility
      techniques and protocols, concerning aspects like data generation,
      collection, correlation, and consumption. Network telemetry
      addresses the current network operation issues and enables smooth
      evolution toward future intent-driven autonomous networks.

   NMS:  Network Management System, referring to applications that allow
      network administrators to manage a network.

   OAM:  Operations, Administration, and Maintenance.
      A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions. Most conventional network monitoring
      techniques and protocols belong to network OAM.

   PBT:  Postcard-Based Telemetry, a dataplane on-path telemetry
      technique.

   RESTCONF:  An HTTP-based protocol that provides a programmatic
      interface for accessing data defined in YANG, using the datastore
      concepts defined in NETCONF, as specified in [RFC8040].

   SMIv2:  Structure of Management Information Version 2, defining MIB
      objects, specified in [RFC2578].

   SNMP:  Simple Network Management Protocol. Versions 1 and 2 are
      specified in [RFC1157] and [RFC3416], respectively.

   XML:  Extensible Markup Language is a markup language for data
      encoding that is both human-readable and machine-readable,
      specified by W3C [xml].

   YANG:  YANG is a data modeling language for the definition of data
      sent over network management protocols such as NETCONF and
      RESTCONF. YANG is defined in [RFC6020] and [RFC7950].

   YANG ECA:  A YANG model for Event-Condition-Action policies, defined
      in [I-D.wwx-netmod-event-yang].

   YANG-Push:  A mechanism that allows subscriber applications to
      request a stream of updates from a YANG datastore on a network
      device. Details are specified in [RFC8641] and [RFC8639].

3.  Background

   The term "big data" is used to describe the extremely large volume of
   data sets that can be analyzed computationally to reveal patterns,
   trends, and associations. Networks are undoubtedly a source of big
   data because of their scale and the volume of network traffic they
   forward. When a network's endpoints do not represent individual
   users (e.g., in industrial, datacenter, and infrastructure contexts),
   network operations can often benefit from large-scale data collection
   without breaching user privacy.

   Today one can access advanced big data analytics capability through a
   plethora of commercial and open source platforms (e.g., Apache
   Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine
   learning). Thanks to advances in computing and storage
   technologies, network big data analytics gives network operators an
   opportunity to gain network insights and move towards network
   autonomy. Some operators have started to explore the application of
   Artificial Intelligence (AI) to make sense of network data. Software
   tools can use the network data to detect and react to network faults,
   anomalies, and policy violations, as well as to predict future
   events. In turn, network policy updates for planning, intrusion
   prevention, optimization, and self-healing may be applied.

   It is conceivable that an autonomic network [RFC7575] is the logical
   next step for network evolution following Software-Defined Networking
   (SDN), aiming to reduce (or even eliminate) human labor, make more
   efficient use of network resources, and provide better services more
   aligned with customer requirements. The related technique of
   Intent-based Networking (IBN)
   [I-D.irtf-nmrg-ibn-concepts-definitions] requires network visibility
   and telemetry data in order to ensure that the network is behaving as
   intended.

   However, while data processing capability has improved and
   applications are hungry for more data, networks lag behind in
   extracting and translating network data into useful and actionable
   information in efficient ways. The system bottleneck is shifting
   from data consumption to data supply. Both the number of network
   nodes and the traffic bandwidth keep increasing at a fast pace.
   Network configurations and policies change at shorter intervals than
   before. More subtle events and fine-grained data through all network
   planes need to be captured and exported in real time.
   In a nutshell,
   it is a challenge to get enough high-quality data out of the network
   in a manner that is efficient, timely, and flexible. Therefore, we
   need to survey the existing technologies and protocols and identify
   any potential gaps.

   In the remainder of this section, first we clarify the scope of
   network data (i.e., telemetry data) considered in this context. Then,
   we discuss several key use cases for today's and future network
   operations. Next, we show why the current network OAM techniques and
   protocols are insufficient for these use cases. The discussion
   underlines the need for new methods, techniques, and protocols, as
   well as extensions of existing ones, which we group under the
   umbrella term "Network Telemetry".

3.1.  Telemetry Data Coverage

   Any information that can be extracted from networks (including data
   plane, control plane, and management plane) and used to gain
   visibility or as a basis for actions is considered telemetry data. It
   includes statistics, event records and logs, snapshots of state,
   configuration data, etc. It also covers the outputs of any active
   and passive measurements [RFC7799]. In some cases, raw data is
   processed in the network before being sent to a data consumer. Such
   processed data is also considered telemetry data. The value of
   telemetry data varies. Less data of higher quality is often better
   than lots of low-quality data. A classification of telemetry data is
   provided in Section 4.

3.2.  Use Cases

   The following set of use cases is essential for network operations.
   While the list is by no means exhaustive, it is enough to highlight
   the requirements for data velocity, variety, volume, and veracity in
   networks.

   *  Security: Network intrusion detection and prevention systems need
      to monitor network traffic and activities and act upon anomalies.
      Given increasingly sophisticated attack vectors coupled with
      increasingly severe consequences of security breaches, new tools
      and techniques need to be developed, relying on wider and deeper
      visibility into networks. The ultimate goal is to achieve the
      ideal security with no, or only minimal, human intervention.

   *  Policy and Intent Compliance: Network policies are the rules that
      constrain the services for network access, provide service
      differentiation, or enforce specific treatment on the traffic.
      For example, a service function chain is a policy that requires
      the selected flows to pass through a set of ordered network
      functions. Intent, as defined in
      [I-D.irtf-nmrg-ibn-concepts-definitions], is a set of operational
      goals that a network should meet and outcomes that a network is
      supposed to deliver, defined in a declarative manner without
      specifying how to achieve or implement them. An intent requires a
      complex translation and mapping process before being applied on
      networks. While a policy or intent is enforced, the compliance
      needs to be verified and monitored continuously by relying on
      visibility that is provided through network telemetry data. Any
      violation must be reported immediately, potentially resulting in
      updates to how the policy or intent is applied in the network to
      ensure that it remains in force, or otherwise alerting the network
      administrator to the policy or intent violation.

   *  SLA Compliance: A Service-Level Agreement (SLA) defines the level
      of service a user expects from a network operator, which includes
      the metrics for the service measurement and remedy/penalty
      procedures when the service level misses the agreement.
      Users
      need to check whether they get the service as promised, and
      network operators need to evaluate how they can deliver services
      that meet the SLA, based on real-time network telemetry data,
      including data from network measurements.

   *  Root Cause Analysis: Any network failure can be the effect of a
      sequence of chained events. Troubleshooting and recovery require
      quick identification of the root cause of any observable issues.
      However, the root cause is not always straightforward to identify,
      especially when the failure is sporadic and the number of event
      messages, both related and unrelated to the same cause, is
      overwhelming. While machine learning technologies can be used for
      root cause analysis, it is up to the network to sense and provide
      the relevant diagnostic data, which are either actively fed into,
      or passively retrieved by, machine learning applications.

   *  Network Optimization: This covers all short-term and long-term
      network optimization techniques, including load balancing, Traffic
      Engineering (TE), and network planning. Network operators are
      motivated to optimize their network utilization and differentiate
      services for better Return On Investment (ROI) or lower Capital
      Expenditures (CAPEX). The first step is to know the real-time
      network conditions before applying policies for traffic
      manipulation. In some cases, micro-bursts need to be detected in
      a very short time-frame so that fine-grained traffic control can
      be applied to avoid network congestion. Long-term planning of
      network capacity and topology requires analysis of real-world
      network telemetry data that is obtained over long periods of time.

   *  Event Tracking and Prediction: The visibility into traffic path
      and performance is critical for services and applications that
      rely on healthy network operation. Numerous related network
      events are of interest to network operators.
      For example, network
      operators want to learn where and why packets are dropped for an
      application flow. They also want to be warned of issues in
      advance so proactive actions can be taken to avoid catastrophic
      consequences.

3.3.  Challenges

   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog to monitor the network. Some
   other OAM techniques as described in [RFC7276] are also used to
   facilitate network troubleshooting. These conventional techniques
   are not sufficient to support the above use cases for the following
   reasons:

   *  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real time. The
      poll-based low-frequency data collection is ill-suited for these
      applications. Subscription-based streaming data directly pushed
      from the data source (e.g., the forwarding chip) is preferred to
      provide enough data quantity and precision at scale.

   *  Comprehensive data is needed from packet processing engine to
      traffic manager, from line cards to main control board, from user
      flows to control protocol packets, from device configurations to
      operations, and from physical layer to application layer.
      Conventional OAM only covers a narrow range of data (e.g., SNMP
      only handles data from the Management Information Base (MIB)).
      Traditional network devices cannot provide all the necessary
      probes. More open and programmable network devices are therefore
      needed.

   *  Many application scenarios need to correlate network-wide data
      from multiple sources (i.e., from distributed network devices,
      different components of a network device, or different network
      planes). A piecemeal solution often lacks the capability to
      consolidate the data from multiple sources.
      The composition of a
      complete solution, as partly proposed by Autonomic Resource
      Control Architecture (ARCA)
      [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
      guided by a comprehensive framework.

   *  Some conventional OAM techniques (e.g., CLI and Syslog) lack a
      formal data model. The unstructured data hinders tool
      automation and application extensibility. Standardized data
      models are essential to support programmable networks.

   *  Although some conventional OAM techniques support data push (e.g.,
      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data
      are limited to only predefined management plane warnings (e.g.,
      SNMP Trap) or sampled user packets (e.g., sFlow). Network
      operators require data with arbitrary source, granularity, and
      precision, which is beyond the capability of the existing
      techniques.

   *  The conventional passive measurement techniques can either consume
      excessive network resources and render excessive redundant data,
      or lead to inaccurate results; on the other hand, the conventional
      active measurement techniques can interfere with the user traffic
      and their results are indirect. Techniques that can collect
      direct and on-demand data from user traffic are more favorable.

   These challenges have been addressed by newer standards and
   techniques (e.g., IPFIX/NetFlow, PSAMP, IOAM, and YANG-Push), and
   more are emerging. These standards and techniques need to be
   recognized and accommodated in a new framework.

3.4.  Network Telemetry

   Network telemetry has emerged as a mainstream technical term to refer
   to the network data collection and consumption techniques. Several
   network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and
   gRPC [grpc]) have been widely deployed.
   Network telemetry allows
   separate entities to acquire data from network devices so that data
   can be visualized and analyzed to support network monitoring and
   operation. Network telemetry covers the conventional network OAM and
   has a wider scope. It is expected that network telemetry can provide
   the necessary network insight for autonomous networks and address the
   shortcomings of conventional OAM techniques.

   Network telemetry usually assumes machines as data consumers rather
   than human operators. Hence, network telemetry can directly
   trigger automated network operation, while in contrast some
   conventional OAM tools are designed and used to help human operators
   monitor and diagnose the networks and guide manual network
   operations. Such a proposition leads to very different techniques.

   Although new network telemetry techniques are emerging and subject to
   continuous evolution, several characteristics of network telemetry
   have been well accepted. Note that network telemetry is intended to
   be an umbrella term covering a wide spectrum of techniques, so the
   following characteristics are not expected to be held by every
   specific technique.

   *  Push and Streaming: Instead of polling data from network devices,
      telemetry collectors subscribe to streaming data pushed from data
      sources in network devices.

   *  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by human beings. Therefore, the data
      volume can be huge and the processing is optimized for the needs
      of automation in real time.

   *  Normalization and Unification: Telemetry aims to address the
      overall network automation needs. Efforts are made to normalize
      the data representation and unify the protocols, so as to simplify
      data analysis and provide integrated analysis across heterogeneous
      devices and data sources across a network.

   *  Model-based: The telemetry data is modeled in advance, which
      allows applications to configure and consume data with ease.

   *  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross-domain, cross-device, and
      cross-layer) and needs to be correlated to take effect.

   *  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.

   In addition, an ideal network telemetry solution may also have the
   following features or properties:

   *  In-Network Customization: The data that is generated can be
      customized in the network at run-time to cater to the specific
      needs of applications. This needs the support of a programmable
      data plane which allows probes with custom functions to be
      deployed at flexible locations.

   *  In-Network Data Aggregation and Correlation: Network devices and
      aggregation points can work out which events and what data needs
      to be stored, reported, or discarded, thus reducing the load on
      the central collection and processing points while still ensuring
      that the right information is ready to be processed in a timely
      way.

   *  In-Network Processing: Sometimes it is not necessary or feasible
      to gather all information to a central point to be processed and
      acted upon. It is possible for the data processing to be done in
      the network, allowing reactive actions to be taken locally.

   *  Direct Data Plane Export: The data originated from the data plane
      forwarding chips can be directly exported to the data consumer for
      efficiency, especially when the data bandwidth is large and
      real-time processing is required.

   *  In-band Data Collection: In addition to the passive and active
      data collection approaches, the new hybrid approach allows data
      for any target flow to be collected directly along its entire
      forwarding path [I-D.song-opsawg-ifit-framework].

   It is worth noting that a network telemetry system should not be
   intrusive to normal network operations, avoiding the pitfall of the
   "observer effect". That is, it should not change the network
   behavior or affect the forwarding performance. Moreover,
   high-volume telemetry traffic may cause network congestion unless
   proper isolation or traffic engineering techniques are in place, or
   congestion control mechanisms ensure that telemetry traffic backs off
   if it exceeds the network capacity. [RFC8084] and [RFC8085] are
   relevant Best Current Practices (BCP) in this space.

   Although in many cases a system for network telemetry involves a
   remote data collecting and consuming entity, it is important to
   understand that there are no inherent assumptions about how a system
   should be architected. While a network architecture with a
   centralized controller (e.g., SDN) seems a natural fit for network
   telemetry, network telemetry can work in distributed fashions as
   well. For example, telemetry data producers and consumers can have a
   peer-to-peer relationship, in which a network node can be the direct
   consumer of telemetry data from other nodes.

3.5.  The Necessity of a Network Telemetry Framework

   Network data analytics and machine-learning technologies are applied
   for network operation automation, relying on abundant and coherent
   data from networks. Data acquisition that is limited to a single
   source and static in nature will in many cases not be sufficient to
   meet an application's telemetry data needs. As a result, multiple
   data sources, involving a variety of techniques and standards, will
   need to be integrated.
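
   As a rough illustration of what integrating multiple data sources
   can involve, the following Python sketch merges records from
   different sources into one view per device and time window. It is
   not taken from this document or from any telemetry protocol: the
   Record type, the fuse() function, the source labels, and the
   10-second window are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One telemetry record. All field names are hypothetical."""
    source: str       # illustrative label, e.g. "management" or "forwarding"
    device: str       # identifier of the reporting device
    timestamp: float  # seconds since an arbitrary epoch
    metrics: dict     # metric name -> value


def fuse(records, window=10.0):
    """Merge records from several sources per (device, time window).

    Metric names are prefixed with their source so that values from
    different sources never overwrite each other.
    """
    fused = defaultdict(dict)
    for rec in records:
        bucket = int(rec.timestamp // window)  # index of the time window
        for name, value in rec.metrics.items():
            fused[(rec.device, bucket)][f"{rec.source}.{name}"] = value
    return dict(fused)


records = [
    Record("management", "r1", 3.0, {"if_errors": 2}),
    Record("forwarding", "r1", 7.5, {"drops": 40}),
    Record("forwarding", "r1", 12.0, {"drops": 0}),
]
view = fuse(records)
```

   In this sketch the first two records fall into the same window and
   end up in one merged entry, while the third lands in the next
   window. A real system would also have to reconcile clocks, units,
   and data models across sources, which is part of what a common
   framework is meant to help with.
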
   It is desirable to have a framework that classifies and organizes
   different telemetry data sources and types, defines the different
   components of a network telemetry system and their interactions, and
   helps coordinate and integrate multiple telemetry approaches across
   layers.  This allows flexible combinations of data for different
   applications, while normalizing and simplifying interfaces.  In
   detail, such a framework would benefit application development for
   the following reasons:

   *  Future networks, autonomous or otherwise, depend on holistic and
      comprehensive network visibility.  All use cases and applications
      are better supported uniformly and coherently under a single
      intelligent agent using an integrated, converged mechanism and
      common telemetry data representations wherever feasible.
      Therefore, the protocols and mechanisms should be consolidated
      into a minimum yet comprehensive set.  A telemetry framework can
      help to normalize the technique developments.

   *  Network visibility presents multiple viewpoints.  For example,
      the device viewpoint takes the network infrastructure as the
      monitoring object from which the network topology and device
      status can be acquired; the traffic viewpoint takes the flows or
      packets as the monitoring object from which the traffic quality
      and path can be acquired.  An application may need to switch its
      viewpoint during operation.  It may also need to correlate a
      service and its impact on user experience to acquire
      comprehensive information.

   *  Applications require network telemetry to be elastic in order to
      make efficient use of network resources and reduce the impact of
      processing related to network telemetry on network performance.
      For example, routine network monitoring should cover the entire
      network with a low data sampling rate.
      Only when issues arise or critical trends emerge should the
      telemetry data source be modified and telemetry data rates
      boosted as needed.

   *  Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

   A telemetry framework brings together the telemetry-related work
   from different sources and working groups within the IETF.  This
   makes it possible to assemble a comprehensive network telemetry
   system and to avoid repetitious or redundant work.  The framework
   should cover the concepts and components from the standardization
   perspective.  This document describes the modules which make up a
   network telemetry framework and decomposes the telemetry system into
   a set of distinct components that existing and future work can
   easily map to.

4.  Network Telemetry Framework

   The top-level network telemetry framework partitions network
   telemetry into four modules based on the telemetry data object
   source and represents their relationship.  At the next level, the
   framework decomposes each module into separate components.  Each of
   the modules follows the same underlying structure, with one
   component dedicated to the configuration of data subscriptions and
   data sources, a second component dedicated to encoding and exporting
   data, and a third component instrumenting the generation of
   telemetry related to the underlying resources.  Throughout the
   framework, the same set of abstract data acquiring mechanisms and
   data types (Section 4.3) is applied.  The two-level architecture
   with the uniform data abstraction helps accurately pinpoint a
   protocol or technique to its position in a network telemetry system
   or disaggregate a network telemetry system into manageable parts.

4.1.  Top Level Modules

   Telemetry can be applied on the forwarding plane, the control plane,
   and the management plane in a network, as well as on other sources
   outside the network, as shown in Figure 1.  Therefore, we categorize
   network telemetry into four distinct modules, each having its own
   interface to Network Operation Applications.

     +------------------------------+
     |                              |
     |      Network Operation       |<-------+
     |        Applications          |        |
     |                              |        |
     +------------------------------+        |
            ^            ^      ^            |
            |            |      |            |
            V            V      |            V
     +--------------+-----------|---+  +-----------+
     |              | Control   |   |  |           |
     |              | Plane     |   |  | External  |
     |            <--->         |   |  | Data and  |
     |              | Telemetry |   |  | Event     |
     |  Management  |       ^   V   |  | Telemetry |
     |  Plane       +-------|-------+  |           |
     |  Telemetry   |       V       |  +-----------+
     |              | Forwarding    |
     |              | Plane         |
     |            <--->             |
     |              | Telemetry     |
     |              |               |
     +--------------+---------------+

              Figure 1: Modules in Layer Category of NTF

   The rationale of this partition lies in the different telemetry data
   objects, which result in different data sources and export
   locations.  Such differences have profound implications on
   in-network data programming and processing capability, data encoding
   and transport protocol, and required data bandwidth and latency.
   Data can be sent directly, or proxied via the control and management
   planes.  There are advantages and disadvantages to both approaches.

   We summarize the major differences of the four modules in the
   following table.  They are compared from six angles:

   *  Data Object

   *  Data Export Location

   *  Data Model

   *  Data Encoding

   *  Telemetry Protocol

   *  Transport Method

   Data Object is the target and source of each module.  Because the
   data source varies, the location where data is most conveniently
   exported also varies.
   For example, forwarding plane data mainly originates as data
   exported from the forwarding ASICs, while control plane data mainly
   originates from the protocol daemons running on the control CPU(s).
   For convenience and efficiency, it is preferred to export the data
   off the device from locations near the source.  Because the
   locations that can export data have different capabilities,
   different choices of data model, encoding, and transport method are
   made to balance performance and cost.  For example, the forwarding
   chip has high throughput but limited capacity for processing complex
   data and maintaining states, while the main control CPU is capable
   of complex data and state processing but has limited bandwidth for
   high-throughput data.  As a result, the suitable telemetry protocol
   for each module can be different.  Some representative techniques
   are shown in the corresponding table blocks to highlight the
   technical diversity of these modules.  Note that the selected
   techniques just reflect the de facto state of the art and are by no
   means exhaustive (e.g., IPFIX can also be implemented over TCP and
   SCTP, but that is not recommended for the forwarding plane).  The
   key point is that one cannot expect to use a universal protocol to
   cover all the network telemetry requirements.

   +-----------+-------------+-------------+--------------+----------+
   | Module    |Management   |Control      |Forwarding    |External  |
   |           |Plane        |Plane        |Plane         |Data      |
   +-----------+-------------+-------------+--------------+----------+
   |Object     |config. &    |control      |flow & packet |terminal, |
   |           |operation    |protocol &   |QoS, traffic  |social &  |
   |           |state        |signaling,   |stat., buffer |environ-  |
   |           |             |RIB          |& queue stat.,|mental    |
   |           |             |             |ACL, FIB      |          |
   +-----------+-------------+-------------+--------------+----------+
   |Export     |main control |main control |fwding chip   |various   |
   |Location   |CPU          |CPU,         |or linecard   |          |
   |           |             |linecard CPU |CPU; main     |          |
   |           |             |or forwarding|control CPU   |          |
   |           |             |chip         |unlikely      |          |
   +-----------+-------------+-------------+--------------+----------+
   |Data       |YANG, MIB,   |YANG,        |template,     |YANG,     |
   |Model      |syslog       |custom       |YANG,         |custom    |
   |           |             |             |custom        |          |
   +-----------+-------------+-------------+--------------+----------+
   |Data       |GPB, JSON,   |GPB, JSON,   |plain         |GPB, JSON,|
   |Encoding   |XML          |XML, plain   |              |XML, plain|
   +-----------+-------------+-------------+--------------+----------+
   |Application|gRPC,NETCONF,|gRPC,NETCONF,|IPFIX, mirror,|gRPC      |
   |Protocol   |RESTCONF     |IPFIX, mirror|gRPC, NETFLOW |          |
   +-----------+-------------+-------------+--------------+----------+
   |Data       |HTTP, TCP    |HTTP, TCP,   |UDP           |HTTP,TCP, |
   |Transport  |             |UDP          |              |UDP       |
   +-----------+-------------+-------------+--------------+----------+

           Figure 2: Comparison of the Data Object Modules

   Note that the interaction with the applications that consume network
   telemetry data can be indirect.  Some in-device data transfer is
   possible.  For example, in management plane telemetry, the
   management plane will need to acquire data from the data plane.
   Some operational states, such as the interface status and
   statistics, can only be derived from data plane data sources.  As
   another example, obtaining control plane telemetry data may require
   the ability to access the Forwarding Information Base (FIB) of the
   data plane.
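
   As a rough illustration of why no single protocol or transport
   covers every module, the sketch below encodes the "Application
   Protocol" and "Data Transport" rows of the comparison table as a
   lookup structure.  The names MODULES, shared(), and needed() are
   hypothetical and exist only for this example; the entries are copied
   from the table and are no more exhaustive than the table itself.

```python
# Hypothetical encoding of two rows of the comparison table (Figure 2).
# MODULES, shared() and needed() are illustrative names, not part of
# the framework.
MODULES = {
    "management": {
        "protocol": {"gRPC", "NETCONF", "RESTCONF"},
        "transport": {"HTTP", "TCP"},
    },
    "control": {
        "protocol": {"gRPC", "NETCONF", "IPFIX", "mirror"},
        "transport": {"HTTP", "TCP", "UDP"},
    },
    "forwarding": {
        "protocol": {"IPFIX", "mirror", "gRPC", "NETFLOW"},
        "transport": {"UDP"},
    },
    "external": {
        "protocol": {"gRPC"},
        "transport": {"HTTP", "TCP", "UDP"},
    },
}


def shared(modules, row):
    """Entries of `row` that the table lists for every given module."""
    return set.intersection(*(MODULES[m][row] for m in modules))


def needed(modules, row):
    """Entries of `row` that the table lists for at least one given module."""
    return set.union(*(MODULES[m][row] for m in modules))


if __name__ == "__main__":
    # Transports common to the management and forwarding plane rows.
    print(shared(["management", "forwarding"], "transport"))
    # Protocols a consumer of control and forwarding plane data may meet.
    print(sorted(needed(["control", "forwarding"], "protocol")))
```

   With the table entries above, the management and forwarding plane
   rows have no transport in common, so a consumer of both modules
   would have to handle more than one transport.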

   On the other hand, an application may involve more than one plane
   and interact with multiple planes simultaneously.  For example, an
   SLA compliance application may require both data plane telemetry and
   control plane telemetry.

   The requirements and challenges for each module are summarized as
   follows (note that the requirements may pertain across all telemetry
   modules; however, we emphasize those that are most pronounced for a
   particular plane).

4.1.1.  Management Plane Telemetry

   The management plane of network elements interacts with the Network
   Management System (NMS) and provides information such as performance
   data, network logging data, network warning and defects data, and
   network statistics and state data.  The management plane includes
   many protocols, including some that are considered "legacy", such as
   SNMP and syslog.  Regardless of the protocol, management plane
   telemetry must address the following requirements:

   *  Convenient Data Subscription: An application should have the
      freedom to choose which data is exported (see Section 4.3) and
      the means and frequency of how that data is exported (e.g.,
      on-change or periodic subscription).

   *  Structured Data: For automatic network operation, machines will
      replace humans for network data comprehension.  Data modeling
      languages, such as YANG, can efficiently describe structured data
      and normalize data encoding and transformation.

   *  High Speed Data Transport: In order to keep up with the velocity
      of information, a server needs to be able to send large amounts
      of data at high frequency.  Compact encoding formats or data
      compression schemes are needed to reduce the quantity of data and
      improve the data transport efficiency.  The subscription mode, by
      replacing the query mode, reduces the interactions between
      clients and servers and helps to improve the server's efficiency.

   *  Network Congestion Avoidance: The application must protect the
      network from congestion by using congestion control mechanisms
      or, at least, circuit breakers.  [RFC8084] and [RFC8085] provide
      some solutions in this space.

4.1.2.  Control Plane Telemetry

   Control plane telemetry refers to the health condition monitoring of
   different network control protocols at all layers of the protocol
   stack.  Keeping track of the operational status of these protocols
   is beneficial for detecting, localizing, and even predicting various
   network issues, as well as for network optimization, in real time
   and with fine granularity.  Some particular challenges and issues
   faced by control plane telemetry are as follows:

   *  One challenging problem for control plane telemetry is how to
      correlate the End-to-End (E2E) Key Performance Indicators (KPIs)
      to a specific layer's KPIs.  For example, an IPTV user may
      describe their User Experience (UE) in terms of video fluency and
      definition.  Then, in case of an unusually poor UE KPI or a
      service disconnection, it is non-trivial to delimit and pinpoint
      the issue in the responsible protocol layer (e.g., the Transport
      Layer or the Network Layer), the responsible protocol (e.g.,
      IS-IS or BGP at the Network Layer), and finally the responsible
      device(s) with specific reasons.

   *  Traditional OAM-based approaches for control plane KPI
      measurement include Ping (L3), Traceroute (L3), Y.1731 (L2), and
      so on.  One common issue behind these methods is that they only
      measure the KPIs instead of reflecting the actual running status
      of these protocols, making them less effective or efficient for
      control plane troubleshooting and network optimization.

   *  An example of control plane telemetry is the BGP Monitoring
      Protocol (BMP), which is currently used for monitoring BGP routes
      and enables rich applications, such as BGP peer analysis, AS
      analysis, prefix analysis, and security analysis.  However, the
      monitoring of other layers and protocols, as well as cross-layer
      and cross-protocol KPI correlation, is still in its infancy
      (e.g., IGP monitoring is not as extensive as BMP) and requires
      further research.

   *  The requirements and solutions for network congestion avoidance
      are also applicable to control plane telemetry.

4.1.3.  Forwarding Plane Telemetry

   An effective forwarding plane telemetry system relies on the data
   that the network device can expose.  The quality, quantity, and
   timeliness of data must meet some stringent requirements.  This
   raises some challenges for the network data plane devices where the
   first-hand data originates.

   *  A data plane device's main function is user traffic processing
      and forwarding.  While supporting network visibility is
      important, telemetry is just an auxiliary function, and it should
      strive to not impede normal traffic processing and forwarding
      (i.e., the forwarding behavior should not be altered and the
      trade-off between forwarding performance and telemetry should be
      well-balanced).

   *  Network operation applications require end-to-end visibility
      across various sources, which can result in a huge volume of
      data.  However, the sheer quantity of data must not exhaust the
      network bandwidth, regardless of the data delivery approach
      (i.e., whether through in-band or out-of-band channels).

   *  The data plane devices must provide timely data with the minimum
      possible delay.  Long processing, transport, storage, and
      analysis delays can impact the effectiveness of the control loop
      and even render the data useless.

   *  The data should be structured and labeled, and easy for
      applications to parse and consume.  At the same time, the data
      types needed by applications can vary significantly.  The data
      plane devices need to provide enough flexibility and
      programmability to support the precise data provision for
      applications.

   *  Data plane telemetry should support incremental deployment and
      work even if some devices are unaware of the system.

   *  The requirements and solutions for network congestion avoidance
      are also applicable to forwarding plane telemetry.

   Although not specific to the forwarding plane, these challenges are
   more difficult for the forwarding plane because of its limited
   resources and flexibility.  Data plane programmability is essential
   to support network telemetry.  Newer data plane forwarding chips are
   equipped with advanced telemetry features and provide flexibility to
   support customized telemetry functions.

   Technique Taxonomy: With regard to how telemetry is instrumented,
   the forwarding plane telemetry techniques can be classified along
   multiple dimensions.

   *  Active, Passive, and Hybrid: This dimension concerns end-to-end
      measurement.  Active and passive methods (as well as the hybrid
      types) are well documented in [RFC7799].  Passive methods include
      TCPDUMP, IPFIX [RFC7011], sFlow, and traffic mirroring.  These
      methods usually have low data coverage, and improving the data
      coverage comes at a very high bandwidth cost.  On the other hand,
      active methods include Ping, OWAMP [RFC4656], TWAMP [RFC5357],
      STAMP [RFC8762], and Cisco's SLA Protocol [RFC6812].  These
      methods are intrusive and only provide indirect network
      measurements.
      Hybrid methods, including in-situ OAM [I-D.ietf-ippm-ioam-data],
      Alternate-Marking (AM) [RFC8321], and Multipoint Alternate
      Marking [I-D.ietf-ippm-multipoint-alt-mark], provide a
      well-balanced and more flexible approach.  However, these methods
      are also more complex to implement.

   *  In-Band and Out-of-Band: Telemetry data carried in user packets
      before being exported to a data collector is considered in-band
      (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]).  Telemetry data
      that is directly exported to a data collector without modifying
      user packets is considered out-of-band (e.g., the postcard-based
      approach described in Appendix A.3.5).  It is also possible to
      have hybrid methods, where only the telemetry instruction or
      partial data is carried by user packets (e.g., AM [RFC8321]).

   *  End-to-End and In-Network: End-to-End methods start from, and end
      at, the network end hosts (e.g., Ping).  In-Network methods work
      in networks and are transparent to end hosts.  However, if
      needed, In-Network methods can be easily extended into end hosts.

   *  Data Subject: Depending on the telemetry objective, the methods
      can be flow-based (e.g., in-situ OAM [I-D.ietf-ippm-ioam-data]),
      path-based (e.g., Traceroute), or node-based (e.g., IPFIX
      [RFC7011]).  The various data objects can be packets, flow
      records, measurements, states, and signals.

4.1.4.  External Data Telemetry

   Events that occur outside the boundaries of the network system are
   another important source of network telemetry.  Correlating both
   internal telemetry data and external events with the requirements of
   network systems, as presented in
   [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and
   functional advantage to management operations.

   As with other sources of telemetry information, the data and events
   must meet strict requirements, especially in terms of timeliness,
   which is essential to properly incorporate external event
   information into network management applications.  The specific
   challenges are described as follows:

   *  The role of the external event detector can be played by multiple
      elements, including hardware (e.g., physical sensors, such as
      seismometers) and software (e.g., Big Data sources that analyze
      streams of information, such as Twitter messages).  Thus, the
      transmitted data must support different shapes but, at the same
      time, follow a common but extensible schema.

   *  Since the main function of the external event detectors is to
      perform the notifications, their timeliness is assumed.  However,
      once messages have been dispatched, they must be quickly
      collected and inserted into the control plane with variable
      priority, which is higher for important sources and events and
      lower for secondary ones.

   *  The schema used by external detectors must be easily adopted by
      current and future devices and applications.  Therefore, it must
      be easily mapped to current data models, such as those expressed
      in YANG.

   *  As the communication with external entities outside the boundary
      of a provider network may be realized over the Internet, the risk
      of congestion is even more relevant in this context and proper
      countermeasures must be taken.  Solutions such as network
      transport circuit breakers are needed as well.

   Organizing both internal and external telemetry information together
   will be key for the general exploitation of the management
   possibilities of current and future network systems, as reflected in
   the incorporation of cognitive capabilities into new hardware and
   software (virtual) elements.

4.2.  Second Level Function Components

   The telemetry module at each plane can be further partitioned into
   five distinct conceptual components:

   *  Data Query, Analysis, and Storage: This component works at the
      application layer.  It is normally a part of the network
      management system at the receiver side.  On the one hand, it is
      responsible for issuing data requirements.  The data of interest
      can be modeled data through configuration or custom data through
      programming.  The data requirements can be queries for one-shot
      data or subscriptions for events or streaming data.  On the other
      hand, it receives, stores, and processes the returned data from
      network devices.  Data analysis can be interactive to initiate
      further data queries.  This component can reside in either
      network devices or remote controllers.  It can be centralized or
      distributed, and involve one or more instances.

   *  Data Configuration and Subscription: This component manages data
      queries on devices.  It determines the protocol and channel for
      applications to acquire desired data.  This component is also
      responsible for configuring the desired data that might not be
      directly available from data sources.  The subscription data can
      be described by models, templates, or programs.

   *  Data Encoding and Export: This component determines how telemetry
      data is delivered to the data analysis and storage component with
      access control.  The data encoding and the transport protocol may
      vary due to the data export location.

   *  Data Generation and Processing: The requested data needs to be
      captured, filtered, processed, and formatted in network devices
      from raw data sources.  This may involve in-network computing and
      processing on either the fast path or the slow path in network
      devices.

   *  Data Object and Source: This component determines the monitoring
      objects and original data sources provisioned in the device.  A
      data source usually just provides raw data which needs further
      processing.  Each data source can be considered a probe.  Some
      data sources can be dynamically installed, while others will be
      more static.

       +----------------------------------------+
     +----------------------------------------+ |
     |                                        | |
     |    Data Query, Analysis, & Storage     | |
     |                                        | +
     +-------+++ -----------------------------+
             |||                   ^^^
             |||                   |||
             ||V                   |||
          +--+V--------------------+++------------+
       +-----V---------------------+------------+ |
     +---------------------+-------+----------+ | |
     | Data Configuration  |                  | | |
     | & Subscription      | Data Encoding    | | |
     | (model, template,   | & Export         | | |
     | & program)          |                  | | |
     +---------------------+------------------| | |
     |                                        | | |
     |           Data Generation              | | |
     |           & Processing                 | | |
     |                                        | | |
     +----------------------------------------| | |
     |                                        | | |
     |         Data Object and Source         | |-+
     |                                        |-+
     +----------------------------------------+

        Figure 3: Components in the Network Telemetry Framework

4.3.  Data Acquisition Mechanism and Type Abstraction

   Broadly speaking, network data can be acquired through subscription
   (push) and query (poll).  A subscription is a contract between
   publisher and subscriber.  After initial setup, the subscribed data
   is automatically delivered to registered subscribers until the
   subscription expires.  There are two variations of subscription:
   subscriptions can either be predefined, or subscribers can be
   allowed to configure and tailor the published data to their specific
   needs.

   In contrast, queries are used when a client expects immediate and
   one-off feedback from network devices.
   The queried data may be directly extracted from some specific data
   source, or synthesized and processed from raw data.  Queries work
   well for interactive network telemetry applications.

   In general, data can be pulled (i.e., queried) whenever needed, but
   in many cases, pushing the data (i.e., subscription) is more
   efficient and can reduce the latency of a client detecting a change.
   From the data consumer point of view, there are four types of data
   from network devices that a telemetry data consumer can subscribe to
   or query:

   *  Simple Data: The data that are steadily available from some
      datastore or static probes in network devices.

   *  Derived Data: The data that need to be synthesized or processed
      in the network from raw data from one or more network devices.
      The data processing function can be statically or dynamically
      loaded into network devices.

   *  Event-triggered Data: The data that are conditionally acquired
      based on the occurrence of some events.  An example of
      event-triggered data could be an interface changing operational
      state between up and down.  Such data can be actively pushed
      through subscription or passively polled through query.  There
      are many ways to model events, including using Finite State
      Machine (FSM) or Event Condition Action (ECA)
      [I-D.wwx-netmod-event-yang].

   *  Streaming Data: The data that are continuously generated.  They
      can be time series or database dumps.  For example, an interface
      packet counter is exported every second.  The streaming data
      reflect real-time network states and metrics and require large
      bandwidth and processing power.  The streaming data are always
      actively pushed to the subscribers.

   The above data types are not mutually exclusive.  Rather, they are
   often composite.
   Derived data is composed of simple data; event-triggered data can be
   simple or derived; streaming data can be based on some recurring
   event.  The relationships of these data types are illustrated in
   Figure 4.

     +----------------------+     +-----------------+
     | Event-triggered Data |<----+ Streaming Data  |
     +-------+---+----------+     +-----+---+-------+
             |   |                      |   |
             |   |                      |   |
             |   |   +--------------+   |   |
             |   +-->| Derived Data |<--+   |
             |       +------+-------+       |
             |              |               |
             |              V               |
             |       +--------------+       |
             +------>| Simple Data  |<------+
                     +--------------+

                  Figure 4: Data Type Relationship

   Subscription usually deals with event-triggered data and streaming
   data, and query usually deals with simple data and derived data.
   However, other combinations are also possible.  Advanced network
   telemetry techniques are designed mainly for event-triggered or
   streaming data subscription, and derived data query.

4.4.  Mapping Existing Mechanisms into the Framework

   The following table shows how the existing mechanisms (mainly
   published by the IETF, with an emphasis on the latest technologies)
   are positioned in the framework.  Given the vast body of existing
   work, we cannot provide an exhaustive list, so the mechanisms in the
   table should be considered as just examples.  Also, some
   comprehensive protocols and techniques may cover multiple aspects or
   modules of the framework, so a name in a block only emphasizes one
   particular characteristic of it.  More details about some listed
   mechanisms can be found in Appendix A.

   +-------------+-----------------+---------------+--------------+
   |             | Management      | Control       | Forwarding   |
   |             | Plane           | Plane         | Plane        |
   +-------------+-----------------+---------------+--------------+
   | data config.| gNMI, NETCONF,  | gNMI, NETCONF,| NETCONF,     |
   | & subscribe | RESTCONF, SNMP, | RESTCONF,     | RESTCONF,    |
   |             | YANG-Push       | YANG-Push     | YANG-Push    |
   +-------------+-----------------+---------------+--------------+
   | data gen. & | MIB,            | YANG          | IOAM, PSAMP, |
   | process     | YANG            |               | PBT, AM      |
   +-------------+-----------------+---------------+--------------+
   | data encode.| gRPC, HTTP, TCP | BMP, TCP      | IPFIX, UDP   |
   | & export    |                 |               |              |
   +-------------+-----------------+---------------+--------------+

                   Figure 5: Existing Work Mapping

5.  Evolution of Network Telemetry Applications

   Network telemetry is an evolving technical area.  As networks move
   toward automated operation, network telemetry applications undergo
   several stages of evolution, each of which adds a new layer of
   requirements to the underlying network telemetry techniques.  Each
   stage is built upon the techniques adopted by the previous stages
   plus some new requirements.

   Stage 0 - Static Telemetry:  The telemetry data source and type are
      determined at design time.  The network operator can only
      configure how to use it with limited flexibility.

   Stage 1 - Dynamic Telemetry:  The custom telemetry data can be
      dynamically programmed or configured at runtime without
      interrupting the network operation, allowing a trade-off among
      resource, performance, flexibility, and coverage.

   Stage 2 - Interactive Telemetry:  The network operator can
      continuously customize and fine-tune the telemetry data in real
      time to reflect the network operation's visibility requirements.
      Compared with Stage 1, the changes are frequent and based on
      real-time feedback.
      At this stage, some tasks can be automated, but human operators
      still need to be in the loop to make decisions.

   Stage 3 - Closed-loop Telemetry:  The telemetry is free from the
      interference of human operators, except for generating the
      reports.  The intelligent network operation engine automatically
      issues the telemetry data requests, analyzes the data, and
      updates the network operations in closed control loops.

   Existing technologies are ready for stage 0 and stage 1.  Individual
   stage 2 and stage 3 applications are also possible now.  However,
   future autonomic networks may need a comprehensive operation
   management system which works at stage 2 and stage 3 to cover all
   the network operation tasks.  A well-defined network telemetry
   framework is the first step in this direction.

6.  Security Considerations

   The complexity of network telemetry raises significant security
   implications.  For example, telemetry data can be manipulated to
   exhaust various network resources at each plane as well as the data
   consumer; falsified or tampered data can mislead decision-making and
   paralyze networks; wrong configuration and programming for telemetry
   is equally harmful.  Telemetry data is highly sensitive because it
   exposes a lot of information about the network and its
   configuration.  Some of that information can make designing attacks
   against the network much easier (e.g., exact details of what
   software and patches have been installed) and allows an attacker to
   determine whether a device may be subject to unprotected security
   vulnerabilities.
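
   As one illustration of how a data consumer might detect falsified
   or tampered telemetry data, the sketch below attaches a keyed
   integrity tag to a telemetry record and verifies it on receipt.
   HMAC-SHA256 with a pre-shared key, the JSON serialization, and the
   function names sign_record() and verify_record() are assumptions
   made for this example only; this document does not prescribe any
   particular mechanism.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> dict:
    """Serialize a record and attach an HMAC-SHA256 tag (illustrative)."""
    # Canonical JSON so that producer and consumer hash the same bytes.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_record(message: dict, key: bytes) -> bool:
    """Return True only if the tag matches the payload under `key`."""
    expected = hmac.new(
        key, message["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how many characters match.
    return hmac.compare_digest(expected, message["tag"])


if __name__ == "__main__":
    key = b"example-pre-shared-key"  # hypothetical key for the demo
    message = sign_record({"interface": "eth0", "rx_packets": 1024}, key)
    print(verify_record(message, key))

    # A modified payload no longer matches its tag.
    forged = dict(message, payload=message["payload"].replace("1024", "9"))
    print(verify_record(forged, key))
```

   A scheme of this kind only addresses integrity and origin
   authentication between parties that share the key; confidentiality
   and key distribution would need to be handled separately.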

   Given that this document has proposed a framework for network
   telemetry and the telemetry mechanisms discussed are more extensive
   (in both message frequency and traffic amount) than the conventional
   network OAM concepts, we must also recognize that various new
   security considerations may arise.  A number of techniques already
   exist for securing the forwarding plane, the control plane, and the
   management plane in a network, but it is important to consider
   whether any new threat vectors are now being enabled via the use of
   network telemetry procedures and mechanisms.

   Security considerations for networks that use telemetry methods may
   include:

   *  Telemetry framework trust and policy model;

   *  Role management and access control for enabling and disabling
      telemetry capabilities;

   *  Protocol transport used for telemetry data and its inherent
      security capabilities;

   *  Telemetry data stores, storage encryption, and methods of access;

   *  Tracking telemetry events and any abnormalities that might
      identify malicious attacks using telemetry interfaces;

   *  Authentication and signing of telemetry data to make data more
      trustworthy;

   *  Segregating the telemetry data traffic from the data traffic
      carried over the network (e.g., historically management access
      and management data may be carried via an independent management
      network).

   Some security considerations highlighted above may be minimized or
   negated with policy management of network telemetry.  In a network
   telemetry deployment, it would be advantageous to separate telemetry
   capabilities into different classes of policies, i.e., Role Based
   Access Control and Event-Condition-Action policies.
Also, potential conflicts between network telemetry mechanisms must be detected accurately and resolved quickly to avoid unnecessary network telemetry traffic propagation escalating into an unintended or intended denial-of-service attack.

Further study of the security issues will be required, and it is expected that security mechanisms and protocols will be developed and deployed along with a network telemetry system.

In addition to security, privacy is also an important issue. Large-scale network data collection is a major threat to user privacy [RFC7258]. The Network Telemetry Framework is not applicable to networks whose endpoints represent individual users, such as general-purpose access networks. Any collection or retention of data in those networks must be tightly limited to protect user privacy.

7. IANA Considerations

This document includes no request to IANA.

8. Contributors

The other contributors to this document are listed as follows.

* Tianran Zhou

* Zhenbin Li

* Zhenqiang Li

* Daniel King

* Adrian Farrel

* Alexander Clemm

9. Acknowledgments

We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra, Ben Schwartz, Alexey Melnikov, Michael Scharf, and many others who have provided helpful comments and suggestions to improve this document.

10. Informative References

[gnmi] "gNMI - gRPC Network Management Interface".

[grpc] "gRPC, A high performance, open-source universal RPC framework".

[I-D.ietf-grow-bmp-adj-rib-out] Evens, T., Bayraktar, S., Lucente, P., Mi, P., and S. Zhuang, "Support for Adj-RIB-Out in the BGP Monitoring Protocol (BMP)", Work in Progress, Internet-Draft, draft-ietf-grow-bmp-adj-rib-out-07, 5 August 2019.

[I-D.ietf-grow-bmp-local-rib] Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, "Support for Local RIB in BGP Monitoring Protocol (BMP)", Work in Progress, Internet-Draft, draft-ietf-grow-bmp-local-rib-13, 31 August 2021.

[I-D.ietf-ippm-ioam-data] Brockners, F., Bhandari, S., and T. Mizrahi, "Data Fields for In-situ OAM", Work in Progress, Internet-Draft, draft-ietf-ippm-ioam-data-16, 8 November 2021.

[I-D.ietf-ippm-multipoint-alt-mark] Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto, "Multipoint Alternate-Marking Method for Passive and Hybrid Performance Monitoring", Work in Progress, Internet-Draft, draft-ietf-ippm-multipoint-alt-mark-09, 23 March 2020.

[I-D.ietf-netconf-distributed-notif] Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois, "Subscription to Distributed Notifications", Work in Progress, Internet-Draft, draft-ietf-netconf-distributed-notif-02, 6 May 2021.

[I-D.ietf-netconf-udp-notif] Zheng, G., Zhou, T., Graf, T., Francois, P., Feng, A. H., and P. Lucente, "UDP-based Transport for Configured Subscriptions", Work in Progress, Internet-Draft, draft-ietf-netconf-udp-notif-04, 21 October 2021.

[I-D.irtf-nmrg-ibn-concepts-definitions] Clemm, A., Ciavaglia, L., Granville, L. Z., and J. Tantsura, "Intent-Based Networking - Concepts and Definitions", Work in Progress, Internet-Draft, draft-irtf-nmrg-ibn-concepts-definitions-05, 2 September 2021.

[I-D.kumar-rtgwg-grpc-protocol] Kumar, A., Kolhe, J., Ghemawat, S., and L. Ryan, "gRPC Protocol", Work in Progress, Internet-Draft, draft-kumar-rtgwg-grpc-protocol-00, 8 July 2016.

[I-D.openconfig-rtgwg-gnmi-spec] Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack, C., and C. Morrow, "gRPC Network Management Interface (gNMI)", Work in Progress, Internet-Draft, draft-openconfig-rtgwg-gnmi-spec-01, 5 March 2018.

[I-D.pedro-nmrg-anticipated-adaptation] Martinez-Julia, P., "Exploiting External Event Detectors to Anticipate Resource Requirements for the Elastic Adaptation of SDN/NFV Systems", Work in Progress, Internet-Draft, draft-pedro-nmrg-anticipated-adaptation-02, 29 June 2018.

[I-D.song-ippm-postcard-based-telemetry] Song, H., Mirsky, G., Filsfils, C., Abdelsalam, A., Zhou, T., Li, Z., Shin, J., and K. Lee, "Postcard-based On-Path Flow Data Telemetry using Packet Marking", Work in Progress, Internet-Draft, draft-song-ippm-postcard-based-telemetry-10, 9 July 2021.

[I-D.song-opsawg-dnp4iq] Song, H. and J. Gong, "Requirements for Interactive Query with Dynamic Network Probes", Work in Progress, Internet-Draft, draft-song-opsawg-dnp4iq-01, 19 June 2017.

[I-D.song-opsawg-ifit-framework] Song, H., Qin, F., Chen, H., Jin, J., and J. Shin, "In-situ Flow Information Telemetry", Work in Progress, Internet-Draft, draft-song-opsawg-ifit-framework-16, 21 October 2021.

[I-D.wwx-netmod-event-yang] Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise, "A YANG Data model for ECA Policy Management", Work in Progress, Internet-Draft, draft-wwx-netmod-event-yang-10, 1 November 2020.

[RFC1157] Case, J., Fedor, M., Schoffstall, M., and J. Davin, "Simple Network Management Protocol (SNMP)", RFC 1157, DOI 10.17487/RFC1157, May 1990.

[RFC2578] McCloghrie, K., Ed., Perkins, D., Ed., and J. Schoenwaelder, Ed., "Structure of Management Information Version 2 (SMIv2)", STD 58, RFC 2578, DOI 10.17487/RFC2578, April 1999.

[RFC2981] Kavasseri, R., Ed., "Event MIB", RFC 2981, DOI 10.17487/RFC2981, October 2000.

[RFC3416] Presuhn, R., Ed., "Version 2 of the Protocol Operations for the Simple Network Management Protocol (SNMP)", STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002.

[RFC3594] Duffy, P., "PacketCable Security Ticket Control Sub-Option for the DHCP CableLabs Client Configuration (CCC) Option", RFC 3594, DOI 10.17487/RFC3594, September 2003.

[RFC3877] Chisholm, S. and D. Romascanu, "Alarm Management Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877, September 2004.

[RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. Zekauskas, "A One-way Active Measurement Protocol (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

[RFC5085] Nadeau, T., Ed. and C. Pignataro, Ed., "Pseudowire Virtual Circuit Connectivity Verification (VCCV): A Control Channel for Pseudowires", RFC 5085, DOI 10.17487/RFC5085, December 2007.

[RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", RFC 5357, DOI 10.17487/RFC5357, October 2008.

[RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for the Network Configuration Protocol (NETCONF)", RFC 6020, DOI 10.17487/RFC6020, October 2010.

[RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., and A. Bierman, Ed., "Network Configuration Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011.

[RFC6812] Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare, S., and E. Yedavalli, "Cisco Service-Level Assurance Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013.

[RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, DOI 10.17487/RFC7011, September 2013.

[RFC7258] Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an Attack", BCP 188, RFC 7258, DOI 10.17487/RFC7258, May 2014.

[RFC7276] Mizrahi, T., Sprecher, N., Bellagamba, E., and Y. Weingarten, "An Overview of Operations, Administration, and Maintenance (OAM) Tools", RFC 7276, DOI 10.17487/RFC7276, June 2014.

[RFC7540] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext Transfer Protocol Version 2 (HTTP/2)", RFC 7540, DOI 10.17487/RFC7540, May 2015.

[RFC7575] Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A., Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic Networking: Definitions and Design Goals", RFC 7575, DOI 10.17487/RFC7575, June 2015.

[RFC7799] Morton, A., "Active and Passive Metrics and Methods (with Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799, May 2016.

[RFC7854] Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP Monitoring Protocol (BMP)", RFC 7854, DOI 10.17487/RFC7854, June 2016.

[RFC7950] Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", RFC 7950, DOI 10.17487/RFC7950, August 2016.

[RFC8040] Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF Protocol", RFC 8040, DOI 10.17487/RFC8040, January 2017.

[RFC8084] Fairhurst, G., "Network Transport Circuit Breakers", BCP 208, RFC 8084, DOI 10.17487/RFC8084, March 2017.

[RFC8085] Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage Guidelines", BCP 145, RFC 8085, DOI 10.17487/RFC8085, March 2017.

[RFC8259] Bray, T., Ed., "The JavaScript Object Notation (JSON) Data Interchange Format", STD 90, RFC 8259, DOI 10.17487/RFC8259, December 2017.

[RFC8321] Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli, L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi, "Alternate-Marking Method for Passive and Hybrid Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321, January 2018.

[RFC8639] Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard, E., and A. Tripathy, "Subscription to YANG Notifications", RFC 8639, DOI 10.17487/RFC8639, September 2019.

[RFC8641] Clemm, A. and E. Voit, "Subscription to YANG Notifications for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641, September 2019.

[RFC8762] Mirsky, G., Jun, G., Nydell, H., and R. Foote, "Simple Two-Way Active Measurement Protocol", RFC 8762, DOI 10.17487/RFC8762, March 2020.

[RFC8924] Aldrin, S., Pignataro, C., Ed., Kumar, N., Ed., Krishnan, R., and A. Ghanwani, "Service Function Chaining (SFC) Operations, Administration, and Maintenance (OAM) Framework", RFC 8924, DOI 10.17487/RFC8924, October 2020.

[xml] "Extensible Markup Language (XML) 1.0 (Fifth Edition)".

Appendix A. A Survey on Existing Network Telemetry Techniques

In this non-normative appendix, we provide an overview of some existing techniques and standard proposals for each network telemetry module.

A.1. Management Plane Telemetry

A.1.1. Push Extensions for NETCONF

NETCONF [RFC6241] is a popular network management protocol recommended by the IETF. Its core strength is managing configuration, but it can also be used for data collection. YANG-Push [RFC8641] [RFC8639] extends NETCONF and enables subscriber applications to request a continuous, customized stream of updates from a YANG datastore. Providing such visibility into changes made to YANG configuration and operational objects enables new capabilities based on the remote mirroring of configuration and operational state.
Moreover, a distributed data collection mechanism [I-D.ietf-netconf-distributed-notif] via a UDP-based publication channel [I-D.ietf-netconf-udp-notif] improves the efficiency of NETCONF-based telemetry.

A.1.2. gRPC Network Management Interface

gRPC Network Management Interface (gNMI) [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote Procedure Call) framework. With a single gRPC service definition, both configuration and telemetry can be covered. gRPC is an HTTP/2 [RFC7540] based open-source micro-service communication framework. It provides a number of capabilities that are well-suited for network telemetry, including:

* A full-duplex streaming transport model combined with a binary encoding mechanism, which provides good telemetry efficiency.

* Consistent higher-level features across platforms, which common HTTP/2 libraries typically do not provide. This characteristic is especially valuable because telemetry data collectors normally reside on a large variety of platforms.

* Built-in load-balancing and failover mechanisms.

A.2. Control Plane Telemetry

A.2.1. BGP Monitoring Protocol

BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP sessions and is intended to provide a convenient interface for obtaining route views.

BGP routing information is sent from the monitored device(s) to the BMP monitoring station over a BMP TCP session. BGP peers are monitored through the BMP Peer Up and Peer Down Notifications.
The BGP routes (including Adjacency_RIB_In [RFC7854], Adjacency_RIB_Out [I-D.ietf-grow-bmp-adj-rib-out], and Local_RIB [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route Monitoring Message and the BMP Route Mirroring Message, providing both an initial table dump and real-time route updates. In addition, BGP statistics are reported through the BMP Stats Report Message, which can be either timer-triggered or event-driven. Future BMP extensions could further enrich BGP monitoring applications.

A.3. Data Plane Telemetry

A.3.1. The Alternate Marking (AM) technology

The Alternate Marking method enables efficient measurements of packet loss, delay, and jitter in both IP and Overlay Networks, as presented in [RFC8321] and [I-D.ietf-ippm-multipoint-alt-mark].

This technique can be applied to point-to-point and multipoint-to-multipoint flows. Alternate Marking creates batches of packets by alternating the value of 1 bit (or a label) of the packet header. These batches of packets are unambiguously recognized over the network, and comparing the packet counters for each batch at different measurement points yields the packet loss. The same idea can be applied to delay measurement by selecting ad hoc packets with a marking bit dedicated to delay measurements.

The Alternate Marking method needs two counters per marking period for each monitored flow. For instance, with n measurement points and m monitored flows, the number of packet counters for each time interval is on the order of n*m*2 (1 per color).

Since networks offer rich sets of network performance measurement data (e.g., packet counters), traditional approaches run into limitations. The bottleneck is the generation and export of the data and the amount of data that can be reasonably collected from the network.
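
To make the per-batch counter comparison and the n*m*2 counter cost described earlier in this section concrete, the following non-normative Python sketch computes packet loss per marking period between two measurement points and the number of counters needed per time interval. The function names and the list-based counter layout are assumptions made for illustration and are not taken from [RFC8321].

```python
def loss_per_batch(sent_counts, received_counts):
    """Compare per-batch packet counters from two measurement points.

    Each list holds one counter value per marking period, in order;
    the difference for a period is the packet loss for that batch.
    """
    return [sent - received for sent, received in zip(sent_counts, received_counts)]


def counter_count(n_points, m_flows, colors=2):
    """Counters per time interval: one per color, per flow, per point."""
    return n_points * m_flows * colors


# Hypothetical counters for three consecutive marking periods.
sent = [1000, 980, 1010]
received = [1000, 975, 1010]
print(loss_per_batch(sent, received))  # [0, 5, 0]
print(counter_count(4, 10))            # 80
```

The second function shows why the counter volume grows quickly: 4 measurement points monitoring 10 flows already require 80 counters per interval.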
In addition, management tasks related to determining and configuring which data to generate lead to significant deployment challenges.

The Multipoint Alternate Marking approach, described in [I-D.ietf-ippm-multipoint-alt-mark], aims to resolve this issue and make performance monitoring more flexible when a detailed analysis is not needed.

An application orchestrates network performance measurement tasks across the network to allow for optimized monitoring. The application can choose how roughly or precisely to configure measurement points depending on its requirements.

Using Alternate Marking, it is possible to monitor a Multipoint Network without in-depth examination by using Network Clustering (clusters are subnetworks that are portions of the entire network and preserve the same property as the entire network). If there is packet loss or the delay is too high, specific filtering criteria can be applied to gather a more detailed analysis by using a different combination of clusters, up to a per-flow measurement as described in Alternate-Marking (AM) [RFC8321].

In summary, an application can configure end-to-end network monitoring. If the network does not experience issues, this approximate monitoring is good enough and is very cheap in terms of network resources. However, in case of problems, the application becomes aware of the issues from this approximate monitoring and, in order to localize the portion of the network that has issues, configures the measurement points more extensively, allowing more detailed monitoring to be performed. After the detection and resolution of the problem, the initial approximate monitoring can be used again.

A.3.2. Dynamic Network Probe

Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq] proposes a programmable means to customize the data that an application collects from the data plane. A direct benefit of DNP is the reduction of the exported data. A full DNP solution covers several components, including data source, data subscription, and data generation. The data subscription needs to define the derived data, which can be composed and derived from the raw data sources. The data generation takes advantage of moderate in-network computing to produce the desired data.

While DNP can introduce unprecedented flexibility to data plane telemetry, it also faces some challenges. It requires a flexible data plane that can be dynamically reprogrammed at run-time. The programming API is yet to be defined.

A.3.3. IP Flow Information Export (IPFIX) Protocol

Traffic on a network can be seen as a set of flows passing through network elements. IP Flow Information Export (IPFIX) [RFC7011] provides a means of transmitting traffic flow information for administrative or other purposes. A typical IPFIX-enabled system includes a pool of Metering Processes that collect data packets at one or more Observation Points, optionally filter them, and aggregate information about these packets. An Exporter then gathers each of the Observation Points together into an Observation Domain and sends this information via the IPFIX protocol to a Collector.

A.3.4. In-Situ OAM

Traditional passive and active monitoring and measurement techniques are either inaccurate or resource-consuming. It is preferable to directly acquire data associated with a flow's packets when the packets pass through a network.
In-situ OAM (IOAM) [I-D.ietf-ippm-ioam-data], a data generation technique, embeds a new instruction header in user packets; the instruction directs the network nodes to add the requested data to the packets. Thus, at the path end, the packet's experience gained on the entire forwarding path can be collected. Such firsthand data is invaluable to many network OAM applications.

However, IOAM also faces some challenges. Issues of performance impact, security, scalability and overhead limits, encapsulation difficulties in some protocols, and cross-domain deployment need to be addressed.

A.3.5. Postcard Based Telemetry

Postcard-Based Telemetry (PBT) [I-D.song-ippm-postcard-based-telemetry] is a proposed complementary technique to IOAM. PBT directly exports data at each node through an independent packet. At the cost of higher bandwidth overhead and the need for data correlation, PBT shows several advantages over IOAM. It can also help to identify the packet drop location in case a packet is dropped on its forwarding path.

A.3.6. Existing OAM for Specific Data Planes

Different data planes raise unique OAM requirements. The IETF has published OAM technique and framework documents (e.g., [RFC8924] and [RFC5085]) targeting different data planes such as MPLS, L2-VPN, NVO3, VXLAN, BIER, SFC, and DETNET. The aforementioned data plane telemetry techniques can be used to enhance the OAM capability on such data planes.

A.4. External Data and Event Telemetry

A.4.1. Sources of External Events

To ensure that the information provided by external event detectors and used by the network management solutions is meaningful for management purposes, the network telemetry framework must ensure that such detectors (sources) are easily connected to the management solutions (sinks).
This requires specifying a list of potential external data sources that could be of interest in network management and matching it to the connectors and/or interfaces required to connect them.

Categories of external event sources that may be of interest to network management include:

* Smart objects and sensors. With the consolidation of the Internet of Things (IoT), any network system will have many smart objects attached to its physical surroundings and logical operation environments. Most of these objects will be essentially based on sensors of many kinds (e.g., temperature, humidity, presence), and the information they provide can be very useful for the management of the network, even when they are not specifically deployed for such purpose. Elements of this source type will usually provide a specific protocol for interaction, especially one of those protocols related to IoT, such as the Constrained Application Protocol (CoAP).

* Online news reporters. Several online news services can provide an enormous quantity of information about different events occurring in the world. Some of those events can impact the network system managed by a specific framework and, therefore, such information may be of interest to the management solution. For instance, diverse security reports, such as the Common Vulnerabilities and Exposures (CVE), can be issued by the corresponding authority and used by the management solution to update the managed system if needed. Instead of a specific protocol and data format, the sources of this kind of information usually follow a relaxed but structured format. This format will be part of both the ontology and information model of the telemetry framework.

* Global event analyzers. The advance of Big Data analyzers provides a huge amount of information and, more interestingly, the identification of events detected by analyzing many data streams from different origins. In contrast with the other types of sources, which are focused on specific events, the detectors of this source type will detect generic events. For example, during a sports event, an unexpected development may draw many people to sites that are reporting on the event. The underlying networks supporting the services that cover the event can be affected by such a situation, so their management solutions should be aware of it. In contrast with the other source types, a new information model, format, and reporting protocol are required to integrate the detectors of this type with the management solution.

Additional detector types can be added to the system, but they will generally result from composing the properties offered by these main classes.

A.4.2. Connectors and Interfaces

To allow external event detectors to be properly integrated with other management solutions, both elements must expose interfaces and protocols that are subject to their particular objective. Since external event detectors will be focused on providing their information to their main consumers, which generally will not be limited to the network management solutions, the framework must include the definition of the required connectors for ensuring that the interconnection between detectors (sources) and their consumers within the management systems (sinks) is effective.

In some situations, the interconnection between the external event detectors and the management system is via the management plane.
For those situations, there will be a special connector that provides the typical interfaces found in most other elements connected to the management plane. For instance, the interfaces could accomplish this with a specific data model (YANG) and a specific telemetry protocol, such as NETCONF, YANG-Push, or gRPC.

Authors' Addresses

Haoyu Song
Futurewei
United States of America
Email: haoyu.song@futurewei.com

Fengwei Qin
China Mobile
P.R. China
Email: qinfengwei@chinamobile.com

Pedro Martinez-Julia
NICT
Japan
Email: pedro@nict.go.jp

Laurent Ciavaglia
Rakuten Mobile
France
Email: laurent.ciavaglia@rakuten.com

Aijun Wang
China Telecom
P.R. China
Email: wangaj.bri@chinatelecom.cn