OPSAWG                                                           H. Song
Internet-Draft                                                 Futurewei
Intended status: Informational                                    F. Qin
Expires: October 15, 2020                                   China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                                   Nokia
                                                                 A. Wang
                                                           China Telecom
                                                          April 13, 2020


                      Network Telemetry Framework
                        draft-ietf-opsawg-ntf-03

Abstract

   Network telemetry is the technology for gaining network insight and
   facilitating efficient and automated network management.  It engages
   various techniques for remote data collection, correlation, and
   consumption.  This document provides an architectural framework for
   network telemetry, motivated by the network operation challenges and
   requirements.  As evidenced by some key characteristics and industry
   practices, network telemetry covers technologies and protocols beyond
   the conventional network Operations, Administration, and Maintenance
   (OAM).  It promises better flexibility, scalability, accuracy,
   coverage, and performance and allows automated control loops to suit
   both today's and tomorrow's network operation.  This document
   clarifies the terminology and classifies the modules and components
   of a network telemetry system from several different perspectives.
   The framework and taxonomy help to set a common ground for the
   collection of related work and provide guidance for the development
   of related techniques and standards.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 15, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Motivation
     2.1.  Use Cases
     2.2.  Challenges
     2.3.  Glossary
     2.4.  Network Telemetry
   3.  The Necessity of a Network Telemetry Framework
   4.  Network Telemetry Framework
     4.1.  Data Acquiring Mechanisms and Data Types
     4.2.  Data Object Modules
       4.2.1.  Requirements and Challenges for each Module
     4.3.  Function Components
     4.4.  Existing Works Mapped in the Framework
   5.  Evolution of Network Telemetry
   6.  Security Considerations
   7.  IANA Considerations
   8.  Contributors
   9.  Acknowledgments
   10. Informative References
   Appendix A.  A Survey on Existing Network Telemetry Techniques
     A.1.  Management Plane Telemetry
       A.1.1.  Push Extensions for NETCONF
       A.1.2.  gRPC Network Management Interface
     A.2.  Control Plane Telemetry
       A.2.1.  BGP Monitoring Protocol
     A.3.  Data Plane Telemetry
       A.3.1.  The IPFPM technology
       A.3.2.  Dynamic Network Probe
       A.3.3.  IP Flow Information Export (IPFIX) protocol
       A.3.4.  In-Situ OAM
       A.3.5.  Postcard Based Telemetry
     A.4.  External Data and Event Telemetry
       A.4.1.  Sources of External Events
       A.4.2.  Connectors and Interfaces
   Authors' Addresses

1.  Introduction

   Network visibility is the ability of management tools to see the
   state and behavior of a network.  It is essential for successful
   network operation.  Network telemetry is the process of measuring,
   correlating, recording, and distributing information about the
   behavior of a network.  Network telemetry is considered an ideal
   means to gain sufficient network visibility with better flexibility,
   scalability, accuracy, coverage, and performance than some
   conventional network Operations, Administration, and Maintenance
   (OAM) techniques.

   However, the term "network telemetry" lacks a solid and unambiguous
   definition.  Its scope and coverage cause confusion and
   misunderstandings.
   It is beneficial to clarify the concept and provide a clear
   architectural framework for network telemetry, so that we can
   articulate the technical field and better align the related
   techniques and standards work.

   To fulfill such an undertaking, we first discuss some key
   characteristics of network telemetry which set a clear distinction
   from the conventional network OAM and show that some conventional OAM
   technologies can be considered a subset of the network telemetry
   technologies.  We then provide an architectural framework for network
   telemetry from three different perspectives.  We show how network
   telemetry can meet the current and future network operation
   requirements, and the challenges each telemetry module is facing.
   Based on the distinction of modules and function components, we can
   map the existing and emerging techniques and protocols into the
   framework.  Finally, we outline a roadmap for the evolution of the
   network telemetry system and discuss the potential security concerns
   for network telemetry.

   The purpose of the framework and taxonomy is to set a common ground
   for the collection of related work and provide guidance for future
   technique and standard developments.  To the best of our knowledge,
   this document is the first such effort for network telemetry in
   industry standards organizations.

2.  Motivation

   The term "big data" is used to describe the extremely large volume of
   data sets that can be analyzed computationally to reveal patterns,
   trends, and associations.  A network is undoubtedly a source of big
   data because of its scale and all the traffic that goes through it.
   It is easy to see that network OAM can benefit from network big data.

   Today one can access advanced big data analytics capability through a
   plethora of commercial and open source platforms (e.g., Apache
   Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine
   learning).  Thanks to the advance of computing and storage
   technologies, network big data analytics gives network operators an
   opportunity to gain network insights and move towards network
   autonomy.  Some operators have started to explore the application of
   Artificial Intelligence (AI) to make sense of network data.  Software
   tools can use the network data to detect and react to network faults,
   anomalies, and policy violations, as well as to predict future
   events.  In turn, the network policy updates for planning, intrusion
   prevention, optimization, and self-healing may be applied.

   It is conceivable that an intent-driven autonomic network [RFC7575]
   is the logical next step for network evolution following Software
   Defined Networking (SDN), aiming to reduce (or even eliminate) human
   labor, make more efficient use of network resources, and provide
   better services more aligned with customer requirements.  Although it
   takes time to reach the ultimate goal, the journey has started
   nevertheless.

   However, while the data processing capability is improved and
   applications are hungry for more data, the networks lag behind in
   extracting and translating network data into useful and actionable
   information in efficient ways.  The system bottleneck is shifting
   from data consumption to data supply.  Both the number of network
   nodes and the traffic bandwidth keep increasing at a fast pace.  The
   network configuration and policy change at shorter intervals than
   before.  More subtle events and fine-grained data through all network
   planes need to be captured and exported in real time.
   In a nutshell, it is a challenge to get enough high-quality data out
   of the network in an efficient, timely, and flexible manner.
   Therefore, we need to examine the existing network technologies and
   protocols, and identify any potential gaps in techniques and
   standards based on the real network and device architectures.

   In the remainder of this section, we first discuss several key use
   cases for today's and future network operations.  Next, we show why
   the current network OAM techniques and protocols are insufficient for
   these use cases.  The discussion underlines the need for new methods,
   techniques, and protocols, which we group under the umbrella term
   "network telemetry".

2.1.  Use Cases

   These use cases are essential for network operations.  While the list
   is by no means exhaustive, it is enough to highlight the requirements
   for data velocity, variety, volume, and veracity in networks.

   Policy and Intent Compliance:  Network policies are the rules that
      constrain the services for network access, provide service
      differentiation, or enforce specific treatment on the traffic.
      For example, a service function chain is a policy that requires
      the selected flows to pass through a set of ordered network
      functions.  An intent is a high-level abstract policy which
      requires a complex translation and mapping process before being
      applied to networks.  While a policy is enforced, the compliance
      needs to be verified and monitored continuously, and any violation
      needs to be reported immediately.

   SLA Compliance:  A Service-Level Agreement (SLA) defines the level of
      service a user expects from a network operator, which includes the
      metrics for the service measurement and remedy/penalty procedures
      when the service level misses the agreement.
      Users need to check if they get the service as promised, and
      network operators need to evaluate how they can deliver services
      that meet the SLA based on real-time network measurement.

   Root Cause Analysis:  Any network failure can be the cause or effect
      of a sequence of chained events.  Troubleshooting and recovery
      require quick identification of the root cause of any observable
      issues.  However, the root cause is not always straightforward to
      identify, especially when the failure is sporadic and the related
      and unrelated events are overwhelming and interleaved.  While
      machine learning technologies can be used for root cause analysis,
      it is up to the network to sense and provide the relevant data.

   Network Optimization:  This covers all short-term and long-term
      network optimization techniques, including load balancing, Traffic
      Engineering (TE), and network planning.  Network operators are
      motivated to optimize their network utilization and differentiate
      services for better Return On Investment (ROI) or lower Capital
      Expenditures (CAPEX).  The first step is to know the real-time
      network conditions before applying policies for traffic
      manipulation.  In some cases, micro-bursts need to be detected in
      a very short time-frame so that fine-grained traffic control can
      be applied to avoid network congestion.  The long-term network
      capacity planning and topology augmentation rely on the
      accumulated data of network operations.

   Event Tracking and Prediction:  The visibility of traffic path and
      performance is critical for services and applications that rely on
      healthy network operation.  Numerous related network events are of
      interest to network operators.  For example, network operators
      want to learn where and why packets are dropped for an application
      flow.
      They also want to be warned of issues in advance so that
      proactive actions can be taken to avoid catastrophic consequences.

2.2.  Challenges

   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog to monitor the network.  Some
   other OAM techniques as described in [RFC7276] are also used to
   facilitate network troubleshooting.  These conventional techniques
   are not sufficient to support the above use cases for the following
   reasons:

   o  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real time.  The
      poll-based low-frequency data collection is ill-suited for these
      applications.  Subscription-based streaming data directly pushed
      from the data source (e.g., the forwarding chip) is preferred to
      provide enough data quantity and precision at scale.

   o  Comprehensive data is needed from packet processing engine to
      traffic manager, from line cards to main control board, from user
      flows to control protocol packets, from device configurations to
      operations, and from physical layer to application layer.
      Conventional OAM only covers a narrow range of data (e.g., SNMP
      only handles data from the Management Information Base (MIB)).
      Traditional network devices cannot provide all the necessary
      probes.  More open and programmable network devices are therefore
      needed.

   o  Many application scenarios need to correlate network-wide data
      from multiple sources (i.e., from distributed network devices,
      different components of a network device, or different network
      planes).  A piecemeal solution often lacks the capability to
      consolidate the data from multiple sources.  The composition of a
      complete solution, as partly proposed by Autonomic Resource
      Control Architecture (ARCA)
      [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
      guided by a comprehensive framework.

   o  Some of the conventional OAM techniques (e.g., CLI and Syslog)
      lack a formal data model.  The unstructured data hinders tool
      automation and application extensibility.  Standardized data
      models are essential to support the programmable networks.

   o  Although some conventional OAM techniques support data push (e.g.,
      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data
      are limited to only predefined management plane warnings (e.g.,
      SNMP Trap) or sampled user packets (e.g., sFlow).  Network
      operators require data with arbitrary source, granularity, and
      precision, which is beyond the capability of the existing
      techniques.

   o  The conventional passive measurement techniques can either consume
      excessive network resources and produce excessive redundant data,
      or lead to inaccurate results; on the other hand, the conventional
      active measurement techniques can interfere with the user traffic
      and their results are indirect.  Techniques that can collect
      direct and on-demand data from user traffic are more favorable.

2.3.  Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document.  We make a deliberate distinction between
   network telemetry and network OAM.

   AI:  Artificial Intelligence.  In the network domain, AI refers to
      the machine-learning based technologies for automated network
      operation and other tasks.

   BMP:  BGP Monitoring Protocol, specified in [RFC7854].

   DNP:  Dynamic Network Probe, referring to programmable in-network
      sensors for network monitoring and measurement.

   DPI:  Deep Packet Inspection, referring to the techniques that
      examine packets beyond the L3/L4 headers.

   gNMI:  gRPC Network Management Interface, a network management
      protocol from the OpenConfig Operator Working Group, mainly
      contributed by Google.  See [gnmi] for details.

   gRPC:  gRPC Remote Procedure Call, an open source high performance
      RPC framework that gNMI is based on.  See [grpc] for details.

   IPFIX:  IP Flow Information Export Protocol, specified in [RFC7011].

   IPFPM:  IP Flow Performance Measurement method, specified in
      [RFC8321].

   IOAM:  In-situ OAM, a dataplane on-path telemetry technique.

   NETCONF:  Network Configuration Protocol, specified in [RFC6241].

   Network Telemetry:  Acquiring and processing network data remotely
      for network monitoring and operation.  A general term for a large
      set of network visibility techniques and protocols, with the
      characteristics defined in this document.  Network telemetry
      addresses the current network operation issues and enables smooth
      evolution toward future intent-driven autonomous networks.

   NMS:  Network Management System, referring to applications that allow
      network administrators to manage a network's software and hardware
      components.  It usually records data from a network's remote
      points to carry out central reporting to a system administrator.

   OAM:  Operations, Administration, and Maintenance.  A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions.  Most conventional network monitoring
      techniques and protocols belong to network OAM.

   PBT:  Postcard-Based Telemetry, a dataplane on-path telemetry
      technique.

   SNMP:  Simple Network Management Protocol.  Versions 1 and 2 are
      specified in [RFC1157] and [RFC3416], respectively.

   YANG:  The abbreviation of "Yet Another Next Generation".  YANG is a
      data modeling language for the definition of data sent over
      network management protocols such as NETCONF and RESTCONF.
      YANG is defined in [RFC6020].

   YANG FSM:  A YANG model that describes events, operations, and finite
      state machines of YANG-defined network elements.

   YANG PUSH:  A method to subscribe to data pushed from remote YANG
      datastores on network devices.

2.4.  Network Telemetry

   Network telemetry has emerged as a mainstream technical term to refer
   to the newer data collection and consumption techniques,
   distinguishing itself from the conventional techniques for network
   OAM.  The representative techniques and protocols include IPFIX
   [RFC7011] and gRPC [grpc].  Network telemetry allows separate
   entities to acquire data from network devices so that data can be
   visualized and analyzed to support network monitoring and operation.
   Network telemetry overlaps with the conventional network OAM and has
   a wider scope.  It is expected that network telemetry can provide the
   necessary network insight for autonomous networks and address the
   shortcomings of conventional OAM techniques.

   One difference between network telemetry and network OAM is that, in
   general, network telemetry assumes machines as data consumers rather
   than human operators.  Hence, network telemetry can directly trigger
   automated network operation, while the conventional OAM tools
   usually help human operators to monitor and diagnose the networks
   and guide manual network operations.  The difference leads to very
   different techniques.

   Although the network telemetry techniques are just emerging and
   subject to continuous evolution, several characteristics of network
   telemetry have been well accepted.  Note that network telemetry is
   intended to be an umbrella term covering a wide spectrum of
   techniques, so the following characteristics are not expected to be
   held by every specific technique.

   o  Push and Streaming: Instead of polling data from network devices,
      the telemetry collector subscribes to the streaming data pushed
      from data sources in network devices.

   o  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by humans.  Therefore, the data volume is
      huge and the processing is often in real time.

   o  Normalization and Unification: Telemetry aims to address the
      overall network automation needs.  The piecemeal solutions offered
      by the conventional OAM approach are no longer suitable.  Efforts
      need to be made to normalize the data representation and unify the
      protocols.

   o  Model-based: The telemetry data is modeled in advance, which
      allows applications to configure and consume data with ease.

   o  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross-domain, cross-device, and
      cross-layer) and needs to be correlated to take effect.

   o  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.

   In addition, an ideal network telemetry solution may also have the
   following features or properties:

   o  In-Network Customization: The data can be customized in the
      network at run-time to cater to the specific needs of
      applications.  This requires the support of a programmable data
      plane which allows probes with custom functions to be deployed at
      flexible locations.

   o  In-Network Data Aggregation and Correlation: Network devices and
      aggregation points can work out which events and what data need
      to be stored, reported, or discarded, thus reducing the load on
      the central collection and processing points while still ensuring
      that the right information is ready to be processed in a timely
      way.

   o  In-Network Processing and Action: Sometimes it is not necessary or
      feasible to gather all information to a central point to be
      processed and acted upon.
      It is possible for the data processing to be done in the network,
      and for actions to be taken locally.

   o  Direct Data Plane Export: The data originating from the data plane
      forwarding chips can be directly exported to the data consumer for
      efficiency, especially when the data bandwidth is large and
      real-time processing is required.

   o  In-band Data Collection: In addition to the passive and active
      data collection approaches, the new hybrid approach allows data to
      be collected directly for any target flow on its entire forwarding
      path [I-D.song-opsawg-ifit-framework].

   It is worth noting that a network telemetry system should not be
   intrusive to normal network operations; it should avoid the pitfall
   of the "observer effect".  That is, it should not change the network
   behavior or affect the forwarding performance.  Otherwise, the whole
   purpose of network telemetry is defeated.

   Although in many cases a network telemetry system is akin to the SDN
   architecture, it is important to understand that network telemetry
   does not imply the need for any centralized data processing and
   analytics engine.  Telemetry data producers and consumers can work
   perfectly well in distributed or peer-to-peer fashions instead.

3.  The Necessity of a Network Telemetry Framework

   Big data analytics and machine-learning based AI technologies are
   applied for network operation automation, relying on abundant data
   from networks.  The single-sourced and static data acquisition cannot
   meet the data requirements.  It is desirable to have a framework that
   integrates multiple telemetry approaches from different layers.  This
   allows flexible combinations for different applications.  The
   framework would benefit application development for the following
   reasons:

   o  The future autonomous networks will require a holistic view of
      network visibility.
      All the use cases and applications need to be supported uniformly
      and coherently under a single intelligent agent.  Therefore, the
      protocols and mechanisms should be consolidated into a minimal yet
      comprehensive set.  A telemetry framework can help to normalize
      the technique developments.

   o  Network visibility presents multiple viewpoints.  For example, the
      device viewpoint takes the network infrastructure as the
      monitoring object from which the network topology and device
      status can be acquired; the traffic viewpoint takes the flows or
      packets as the monitoring object from which the traffic quality
      and path can be acquired.  An application may need to switch its
      viewpoint during operation.  It may also need to correlate a
      service and its impact on network experience to acquire the
      comprehensive information.

   o  Applications require network telemetry to be elastic in order to
      efficiently use the network resource and reduce the performance
      impact.  Routine network monitoring covers the entire network with
      low data sampling rate.  When issues arise or trends emerge, the
      telemetry data source can be modified and the data rate can be
      boosted.

   o  Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

   A telemetry framework brings together all of the telemetry-related
   work from different sources and working groups within the IETF.  This
   makes it possible to assemble a comprehensive network telemetry
   system and to avoid repetitious or redundant work.  The framework
   should cover the concepts and components from the standardization
   perspective.  This document clarifies the layered modules on which
   the telemetry is exerted and decomposes the telemetry system into a
   set of distinct components that the existing and future work can
   easily map to.

4.  Network Telemetry Framework

   Network telemetry techniques can be classified along multiple
   dimensions.  In this document, we provide three unique perspectives:
   data acquiring mechanisms, data objects, and function components.

4.1.  Data Acquiring Mechanisms and Data Types

   Broadly speaking, network data can be acquired through subscription
   (push) and query (poll).  A subscriber may request that data be
   delivered when it is ready.  Subscription follows a
   Publish-Subscription (Pub-Sub) mode or a Subscription-Publish
   (Sub-Pub) mode.  In the Pub-Sub mode, pre-defined data are published
   and multiple qualified subscribers can subscribe to the data.  In
   the Sub-Pub mode, a subscriber designates what data are of interest
   and requests that the network devices deliver the data when
   available.

   In contrast, query is used when a querier expects immediate feedback
   from network devices.  The queried data may be directly extracted
   from some specific data source, or synthesized and processed from raw
   data.  Query suits interactive network telemetry applications.

   There are four types of data from network devices:

   Simple Data:  The data that are steadily available from some data
      store or static probes in network devices.  Such data can be
      specified by a YANG model.

   Complex Data:  The data that need to be synthesized or processed in
      the network from raw data from one or more network devices.  The
      data processing function can be statically or dynamically loaded
      into network devices.

   Event-triggered Data:  The data are conditionally acquired based on
      the occurrence of some events.  An event can be modeled as a
      Finite State Machine (FSM).

   Streaming Data:  The data are continuously or periodically generated.
      They can be time series or the dump of databases.  The streaming
      data reflect real-time network states and metrics and require
      large bandwidth and processing power.

   The above data types are not mutually exclusive.
   For example, event-triggered data can be simple or complex, and
   streaming data can be event triggered.  The relationships of these
   data types are illustrated in Figure 1.

                      +--------------------------+
                      | +----------------------+ |
                      | | +-----------------+  | |
                      | | | +-------------+ |  | |
                      | | | | Simple Data | |  | |
                      | | | +-------------+ |  | |
                      | | |  Complex Data   |  | |
                      | | +-----------------+  | |
                      | | Event-triggered Data | |
                      | +----------------------+ |
                      | Streaming Data           |
                      +--------------------------+

                    Figure 1: Data Type Relationship

   Subscription usually deals with event-triggered data and streaming
   data, and query usually deals with simple data and complex data.  The
   conventional OAM techniques are mostly about querying simple data.
   While these techniques are still useful, more advanced network
   telemetry techniques are designed mainly for event-triggered or
   streaming data subscription, and complex data query.

4.2.  Data Object Modules

   Telemetry can be applied on the forwarding plane, the control plane,
   and the management plane in a network, as well as other sources
   outside the network, as shown in Figure 2.  Therefore, we categorize
   the network telemetry into four distinct modules with each having its
   own interface to Network Operation Applications.

       +------------------------------+
       |                              |
       |      Network Operation       |<-------+
       |         Applications         |        |
       |                              |        |
       +------------------------------+        |
            ^      ^           ^               |
            |      |           |               |
            V      |           V               V
       +-----------|---+--------------+  +-----------+
       |           |   |              |  |           |
       | Control Pl|ane|              |  | External  |
       | Telemetry | <--->            |  | Data and  |
       |           |   |              |  | Event     |
       |      ^    V   |  Management  |  | Telemetry |
       +------|--------+  Plane       |  |           |
       |      V        |  Telemetry   |  +-----------+
       | Forwarding    |              |
       | Plane       <--->            |
       | Telemetry     |              |
       |               |              |
       +---------------+--------------+

               Figure 2: Modules in Layer Category of NTF

   The rationale of this partition lies in the different telemetry data
   objects which result in different data source and export locations.
   Such differences have profound implications on in-network data
   programming and processing capability, data encoding and transport
   protocol, and data bandwidth and latency.

   We summarize the major differences of the four modules in the
   following table. They are compared from six aspects: data object,
   data export location, data model, data encoding, telemetry protocol,
   and transport method. Data object is the target and source of each
   module. Because the data source varies, the data export location
   varies. Because each data export location has different capability,
   the proper data model, encoding, and transport method cannot be kept
   the same. As a result, the suitable telemetry protocol for each
   module can be different. Some representative techniques are shown in
   the corresponding table blocks to highlight the technical diversity
   of these modules. The key point is that one cannot expect to use a
   universal protocol to cover all the network telemetry requirements.

   +---------+--------------+--------------+--------------+-----------+
   | Module  | Control      | Management   | Forwarding   | External  |
   |         | Plane        | Plane        | Plane        | Data      |
   +---------+--------------+--------------+--------------+-----------+
   |Object   | control      | config. &    | flow & packet| terminal, |
   |         | protocol &   | operation    | QoS, traffic | social &  |
   |         | signaling,   | state, MIB   | stat., buffer| environ-  |
   |         | RIB, ACL     |              | & queue stat.| mental    |
   +---------+--------------+--------------+--------------+-----------+
   |Export   | main control | main control | fwding chip  | various   |
   |Location | CPU,         | CPU          | or linecard  |           |
   |         | linecard CPU |              | CPU; main    |           |
   |         | or fwding    |              | control CPU  |           |
   |         | chip         |              | unlikely     |           |
   +---------+--------------+--------------+--------------+-----------+
   |Data     | YANG,        | MIB, syslog, | template,    | YANG      |
   |Model    | custom       | YANG,        | YANG,        |           |
   |         |              | custom       | custom       |           |
   +---------+--------------+--------------+--------------+-----------+
   |Data     | GPB, JSON,   | GPB, JSON,   | plain        | GPB, JSON,|
   |Encoding | XML, plain   | XML          |              | XML, plain|
   +---------+--------------+--------------+--------------+-----------+
   |Protocol | gRPC,NETCONF,| gRPC,NETCONF,| IPFIX, mirror| gRPC      |
   |         | IPFIX,mirror |              |              |           |
   +---------+--------------+--------------+--------------+-----------+
   |Transport| HTTP, TCP,   | HTTP, TCP    | UDP          | HTTP, TCP,|
   |         | UDP          |              |              | UDP       |
   +---------+--------------+--------------+--------------+-----------+

            Figure 3: Comparison of the Data Object Modules

   Note that the interaction with the network operation applications can
   be indirect. Some in-device data transfer is possible. For example,
   in the management plane telemetry, the management plane may need to
   acquire data from the data plane. Some of the operational states can
   only be derived from the data plane, such as the interface status and
   statistics.
   For another example, the control plane telemetry may need to access
   the Forwarding Information Base (FIB) in the data plane.

   On the other hand, an application may involve more than one plane and
   interact with multiple planes simultaneously. For example, an SLA
   compliance application may require both the data plane telemetry and
   the control plane telemetry.

4.2.1.  Requirements and Challenges for each Module

4.2.1.1.  Management Plane Telemetry

   The management plane of network elements interacts with the Network
   Management System (NMS), and provides information such as performance
   data, network logging data, network warning and defect data, and
   network statistics and state data. Some legacy protocols, such as
   SNMP and Syslog, are widely used for the management plane. However,
   these protocols are insufficient to meet the requirements of the
   future automated network operation applications.

   New management plane telemetry protocols should consider the
   following requirements:

   Convenient Data Subscription: An application should have the freedom
      to choose the data export means, such as the data types and the
      export frequency.

   Structured Data: For automatic network operation, machines will
      replace humans for network data comprehension. Schema languages
      such as YANG can efficiently describe structured data and
      normalize data encoding and transformation.

   High Speed Data Transport: In order to retain the information, a
      server needs to send a large amount of data at high frequency.
      Compact encoding formats are needed to compress the data and
      improve the data transport efficiency. The subscription mode, by
      replacing the query mode, reduces the interactions between clients
      and servers and helps to improve the server's efficiency.

4.2.1.2.
Control Plane Telemetry

   The control plane telemetry refers to the health condition monitoring
   of different network control protocols covering Layer 2 to Layer 7.
   Keeping track of the running status of these protocols is beneficial
   for detecting, localizing, and even predicting various network
   issues, as well as for network optimization, in real time and at fine
   granularity.

   One of the most challenging problems for the control plane telemetry
   is how to correlate the End-to-End (E2E) Key Performance Indicators
   (KPI) to a specific layer's KPIs. For example, an IPTV user may
   describe the User Experience (UE) by the video fluency and
   definition. Then, in case of an unusually poor UE KPI or a service
   disconnection, it is non-trivial to delimit and pinpoint the issue in
   the responsible protocol layer (e.g., the Transport Layer or the
   Network Layer), the responsible protocol (e.g., IS-IS or BGP at the
   Network Layer), and finally the responsible device(s) with specific
   reasons.

   Traditional OAM-based approaches for control plane KPI measurement
   include Ping (L3), Traceroute (L3), Y.1731 (L2), and so on. One
   common issue behind these methods is that they only measure the KPIs
   instead of reflecting the actual running status of these protocols,
   making them less effective or efficient for control plane
   troubleshooting and network optimization.

   An example of the control plane telemetry is the BGP Monitoring
   Protocol (BMP), which is currently used to monitor BGP routes and
   enables rich applications, such as BGP peer analysis, AS analysis,
   prefix analysis, security analysis, and so on. However, the
   monitoring of other layers and protocols, and the cross-layer,
   cross-protocol KPI correlations, are still in their infancy (e.g.,
   IGP monitoring is missing) and require further research.

4.2.1.3.
Data Plane Telemetry

   An effective data plane telemetry system relies on the data that the
   network device can expose. The data's quality, quantity, and
   timeliness must meet some stringent requirements. This raises some
   challenges for the network data plane devices where the first-hand
   data originate.

   o  A data plane device's main function is user traffic processing
      and forwarding. While supporting network visibility is
      important, the telemetry is just an auxiliary function, and it
      should not impede normal traffic processing and forwarding (i.e.,
      the performance is not lowered and the behavior is not altered
      due to the telemetry functions).

   o  The network operation applications require end-to-end visibility
      from various sources, which results in a huge volume of data.
      However, the sheer data quantity should not stress the network
      bandwidth, regardless of the data delivery approach (i.e.,
      through in-band or out-of-band channels).

   o  The data plane devices must provide timely data with the minimum
      possible delay. Long processing, transport, storage, and
      analysis delay can impact the effectiveness of the control loop
      and even render the data useless.

   o  The data should be structured and labeled, and easy for
      applications to parse and consume. At the same time, the data
      types needed by applications can vary significantly. The data
      plane devices need to provide enough flexibility and
      programmability to support the precise data provision for
      applications.

   o  The data plane telemetry should support incremental deployment
      and work even though some devices are unaware of the system.
      This challenge is highly relevant to the standards and legacy
      networks.

   The data plane programmability is essential to support network
   telemetry.
   Newer data plane forwarding chips are equipped with advanced
   telemetry features and provide flexibility to support customized
   telemetry functions.

4.2.1.3.1.  Technique Taxonomy

   There can be multiple possible dimensions to classify the data plane
   telemetry techniques.

   Active, Passive, and Hybrid: The active and passive methods (as well
      as the hybrid types) are well documented in [RFC7799]. The
      passive methods include TCPDUMP, IPFIX [RFC7011], sFlow, and
      traffic mirroring. These methods usually have low data coverage,
      and improving the coverage comes at a very high bandwidth cost.
      On the other hand, the active methods include Ping, Traceroute,
      OWAMP [RFC4656], and TWAMP [RFC5357]. These methods are intrusive
      and only provide indirect network measurement results. The hybrid
      methods, including in-situ OAM [I-D.ietf-ippm-ioam-data], IPFPM
      [RFC8321], and Multipoint Alternate Marking
      [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced
      and more flexible approach. However, these methods are also more
      complex to implement.

   In-Band and Out-of-Band: The telemetry data, before being exported
      to some collector, can be carried in user packets. Such methods
      are considered in-band (e.g., in-situ OAM
      [I-D.ietf-ippm-ioam-data]). If the telemetry data are directly
      exported to some collector without modifying the user packets,
      such methods are considered out-of-band (e.g., postcard-based
      INT). It is possible to have hybrid methods. For example, only
      the telemetry instruction or partial data are carried by user
      packets (e.g., IPFPM [RFC8321]).

   E2E and In-Network: Some E2E methods start from and end at the
      network end hosts (e.g., Ping). The other methods work in
      networks and are transparent to end hosts. However, if needed,
      the in-network methods can be easily extended into end hosts.

   Flow, Path, and Node: Depending on the telemetry objective, the
      methods can be flow-based (e.g., in-situ OAM
      [I-D.ietf-ippm-ioam-data]), path-based (e.g., Traceroute), or
      node-based (e.g., IPFIX [RFC7011]).

4.2.1.4.  External Data Telemetry

   Events that occur outside the boundaries of the network system are
   another important source of network telemetry. Correlating both
   internal telemetry data and external events with the requirements of
   network systems, as presented in
   [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and
   functional advantage to management operations.

   As with other sources of telemetry information, the data and events
   must meet strict requirements, especially in terms of timeliness,
   which is essential to properly incorporate external event information
   into management cycles. The specific challenges are described as
   follows:

   o  The role of external event detector can be played by multiple
      elements, including hardware (e.g., physical sensors, such as
      seismometers) and software (e.g., Big Data sources that analyze
      streams of information, such as Twitter messages). Thus, the
      transmitted data must support different shapes but, at the same
      time, follow a common but extensible schema.

   o  Since the main function of the external event detectors is to
      perform the notifications, their timeliness is assumed. However,
      once messages have been dispatched, they must be quickly
      collected and inserted into the control plane with variable
      priority, which will be high for important sources and/or
      important events and low for secondary ones.

   o  The schema used by external detectors must be easily adopted by
      current and future devices and applications. Therefore, it must
      be easily mapped to current information models, such as in terms
      of YANG.
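
   The first two challenges above (a common but extensible schema, and
   collection with variable priority) can be sketched as follows. This
   is only an illustration under stated assumptions: the ExternalEvent
   fields, the EventCollector class, and the convention that a smaller
   number means higher priority are all hypothetical and are not
   defined by this document or by any referenced specification.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ExternalEvent:
    """Common envelope shared by all detectors (field names are assumed)."""
    source: str        # detector identity, e.g. a sensor or a data feed
    kind: str          # event class reported by the detector
    timestamp: float   # detection time in seconds since the epoch
    priority: int      # assumed convention: 0 is most urgent
    # Extension point: detector-specific payload of any shape.
    attributes: Dict[str, Any] = field(default_factory=dict)


class EventCollector:
    """Releases events by priority, first-in first-out within a priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, ExternalEvent]] = []
        self._seq = itertools.count()  # tie-breaker keeps arrival order

    def submit(self, event: ExternalEvent) -> None:
        heapq.heappush(self._heap, (event.priority, next(self._seq), event))

    def next_event(self) -> ExternalEvent:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    collector = EventCollector()
    collector.submit(ExternalEvent("social-feed", "traffic-surge-rumor",
                                   100.0, 5, {"confidence": 0.4}))
    collector.submit(ExternalEvent("seismometer-7", "earthquake",
                                   101.0, 0, {"magnitude": 6.1}))
    while len(collector):
        ev = collector.next_event()
        print(ev.priority, ev.source, ev.kind, ev.attributes)
```

   The fixed fields play the role of the common schema, while the
   free-form attributes mapping leaves room for hardware and software
   detectors to report differently shaped data. A mapping of such an
   envelope to YANG is left open here.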

   Organizing together both internal and external telemetry information
   will be key for the general exploitation of the management
   possibilities of current and future network systems, as reflected in
   the incorporation of cognitive capabilities into new hardware and
   software (virtual) elements.

4.3.  Function Components

   The telemetry module at each plane can be further partitioned into
   five distinct components:

   Data Query, Analysis, and Storage: This component works at the
      application layer. On the one hand, it is responsible for issuing
      data requirements. The data of interest can be modeled data
      through configuration or custom data through programming. The
      data requirements can be queries for one-shot data or
      subscriptions for events or streaming data. On the other hand, it
      receives, stores, and processes the returned data from network
      devices. Data analysis can be interactive to initiate further
      data queries. This component can reside in either network devices
      or remote controllers.

   Data Configuration and Subscription: This component deploys data
      queries on devices. It determines the protocol and channel for
      applications to acquire desired data. This component is also
      responsible for configuring the desired data that might not be
      directly available from data sources. The subscription data can
      be described by models, templates, or programs.

   Data Encoding and Export: This component determines how telemetry
      data are delivered to the data analysis and storage component.
      The data encoding and the transport protocol may vary due to the
      data exporting location.

   Data Generation and Processing: The requested data need to be
      captured, processed, and formatted in network devices from raw
      data sources. This may involve in-network computing and
      processing on either the fast path or the slow path in network
      devices.

   Data Object and Source: This component determines the monitoring
      object and original data source. The data source usually just
      provides raw data which need further processing. A data source
      can be considered a probe. A probe can be statically installed or
      dynamically installed.

             +----------------------------------------+
             |                                        |
             |    Data Query, Analysis, & Storage     |
             |                                        |
             +----------------------------------------+
                       |                 ^
                       |                 |
                       V                 |
             +---------------------+------------------+
             | Data Configuration  |                  |
             | & Subscription      | Data Encoding    |
             | (model, template,   | & Export         |
             | & program)          |                  |
             +---------------------+------------------+
             |                                        |
             |           Data Generation              |
             |           & Processing                 |
             |                                        |
             +----------------------------------------+
             |                                        |
             |        Data Object and Source          |
             |                                        |
             +----------------------------------------+

          Figure 4: Components in the Network Telemetry Framework

4.4.  Existing Works Mapped in the Framework

   The following two tables provide a non-exhaustive list of existing
   works (mainly published in IETF and with the emphasis on the latest
   new technologies) and show their positions in the framework. More
   details can be found in Appendix A.

   The first table is based on the data acquiring mechanisms and data
   types.

         +-----------------+---------------+----------------+
         |                 | Query         | Subscription   |
         |                 |               |                |
         +-----------------+---------------+----------------+
         | Simple Data     | SNMP, NETCONF,| SNMP, NETCONF, |
         |                 | YANG, BMP,    | YANG, gRPC     |
         |                 | gRPC          |                |
         +-----------------+---------------+----------------+
         | Complex Data    | DNP, YANG FSM,| DNP, YANG PUSH,|
         |                 | gRPC, NETCONF | gRPC, NETCONF  |
         +-----------------+---------------+----------------+
         | Event-triggered |               | gRPC, NETCONF, |
         | Data            | N/A           | YANG PUSH, DNP,|
         |                 |               | YANG FSM       |
         +-----------------+---------------+----------------+
         | Streaming Data  |               | gRPC, NETCONF, |
         |                 | N/A           | IOAM, PBT, DNP,|
         |                 |               | IPFIX, IPFPM   |
         +-----------------+---------------+----------------+

                   Figure 5: Existing Work Mapping I

   The second table is based on the telemetry modules and components.

   +--------------+---------------+----------------+---------------+
   |              | Management    | Control        | Forwarding    |
   |              | Plane         | Plane          | Plane         |
   +--------------+---------------+----------------+---------------+
   | data config. | gRPC, NETCONF,| NETCONF/YANG   | NETCONF/YANG, |
   | & subscrib.  | YANG PUSH     |                | YANG FSM      |
   +--------------+---------------+----------------+---------------+
   | data gen. &  | DNP,          | DNP,           | IOAM,         |
   | processing   | YANG          | YANG           | PBT, IPFPM,   |
   |              |               |                | DNP           |
   +--------------+---------------+----------------+---------------+
   | data         | gRPC, NETCONF,| BMP, NETCONF   | IPFIX         |
   | export       | YANG PUSH     |                |               |
   +--------------+---------------+----------------+---------------+

                   Figure 6: Existing Work Mapping II

5.  Evolution of Network Telemetry

   Network telemetry is a fast-evolving technical area. As the network
   moves towards automated operation, network telemetry undergoes
   several levels of evolution.

   Level 0 - Static Telemetry: The telemetry data source and type are
      determined at design time.
      The network operator can only configure how to use it with
      limited flexibility.

   Level 1 - Dynamic Telemetry: The telemetry data can be dynamically
      programmed or configured at runtime, allowing a tradeoff among
      resource, performance, flexibility, and coverage. DNP is an
      effort in this direction.

   Level 2 - Interactive Telemetry: The network operator can
      continuously customize the telemetry data in real time to reflect
      the network operation's visibility requirements. At this level,
      some tasks can be automated, although ultimately human operators
      will still need to sit in the middle to make decisions.

   Level 3 - Closed-loop Telemetry: Human operators are completely
      excluded from the control loop. The intelligent network operation
      engine automatically issues the telemetry data requests, analyzes
      the data, and updates the network operations in closed control
      loops.

   While most of the existing technologies belong to level 0 and level
   1, with the help of a clearly defined network telemetry framework, it
   is now possible to assemble the technologies to support level 2 and
   make solid steps towards level 3.

6.  Security Considerations

   Given that this document has proposed a framework for network
   telemetry and the telemetry mechanisms discussed are distinct (in
   both message frequency and traffic amount) from the conventional
   network OAM concepts, we must also recognize that various new
   security considerations may arise. A number of techniques already
   exist for securing the forwarding plane, the control plane, and the
   management plane in a network, but it is important to consider
   whether any new threat vectors are now being enabled via the use of
   network telemetry procedures and mechanisms.

   Security considerations for networks that use telemetry methods may
   include:

   o  Telemetry framework trust and policy model;

   o  Role management and access control for enabling and disabling
      telemetry capabilities;

   o  Protocol transport used for telemetry data and inherent security
      capabilities;

   o  Telemetry data stores, storage encryption, and methods of access;

   o  Tracking telemetry events and any abnormalities that might
      identify malicious attacks using telemetry interfaces.

   Some of the security considerations highlighted above may be
   minimized or negated with policy management of network telemetry. In
   a network telemetry deployment, it would be advantageous to separate
   telemetry capabilities into different classes of policies, i.e., Role
   Based Access Control and Event-Condition-Action policies. Also,
   potential conflicts between network telemetry mechanisms must be
   detected accurately and resolved quickly to avoid unnecessary network
   telemetry traffic propagation escalating into an unintended or
   intended denial of service attack.

   Further study of the security issues will be required, and it is
   expected that the security mechanisms and protocols will be developed
   and deployed along with a network telemetry system.

7.  IANA Considerations

   This document includes no request to IANA.

8.  Contributors

   The other contributors of this document are listed as follows.

   o  Tianran Zhou

   o  Zhenbin Li

   o  Zhenqiang Li

   o  Daniel King

   o  Adrian Farrel

9.  Acknowledgments

   We would like to thank Randy Presuhn, Joe Clarke, Victor Liu, James
   Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz Yegani,
   Young Lee, Alexander Clemm, Qin Wu, and many others who have provided
   helpful comments and suggestions to improve this document.

10.  Informative References

   [gnmi]     "gNMI - gRPC Network Management Interface".

   [grpc]     "gRPC, A high performance, open-source universal RPC
              framework".

   [I-D.fioccola-ippm-multipoint-alt-mark]
              Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto,
              "Multipoint Alternate Marking method for passive and
              hybrid performance monitoring",
              draft-fioccola-ippm-multipoint-alt-mark-04 (work in
              progress), June 2018.

   [I-D.ietf-grow-bmp-adj-rib-out]
              Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S.
              Zhuang, "Support for Adj-RIB-Out in BGP Monitoring
              Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-07 (work
              in progress), August 2019.

   [I-D.ietf-grow-bmp-local-rib]
              Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
              "Support for Local RIB in BGP Monitoring Protocol (BMP)",
              draft-ietf-grow-bmp-local-rib-06 (work in progress),
              November 2019.

   [I-D.ietf-ippm-ioam-data]
              Brockners, F., Bhandari, S., Pignataro, C., Gredler, H.,
              Leddy, J., Youell, S., Mizrahi, T., Mozes, D., Lapukhov,
              P., remy@barefootnetworks.com, r., daniel.bernier@bell.ca,
              d., and J. Lemon, "Data Fields for In-situ OAM",
              draft-ietf-ippm-ioam-data-09 (work in progress), March
              2020.

   [I-D.ietf-netconf-udp-pub-channel]
              Zheng, G., Zhou, T., and A. Clemm, "UDP based Publication
              Channel for Streaming Telemetry",
              draft-ietf-netconf-udp-pub-channel-05 (work in progress),
              March 2019.

   [I-D.ietf-netconf-yang-push]
              Clemm, A. and E. Voit, "Subscription to YANG Datastores",
              draft-ietf-netconf-yang-push-25 (work in progress), May
              2019.

   [I-D.kumar-rtgwg-grpc-protocol]
              Kumar, A., Kolhe, J., Ghemawat, S., and L. Ryan, "gRPC
              Protocol", draft-kumar-rtgwg-grpc-protocol-00 (work in
              progress), July 2016.

   [I-D.openconfig-rtgwg-gnmi-spec]
              Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack,
              C., and C.
              Morrow, "gRPC Network Management Interface (gNMI)",
              draft-openconfig-rtgwg-gnmi-spec-01 (work in progress),
              March 2018.

   [I-D.pedro-nmrg-anticipated-adaptation]
              Martinez-Julia, P., "Exploiting External Event Detectors
              to Anticipate Resource Requirements for the Elastic
              Adaptation of SDN/NFV Systems",
              draft-pedro-nmrg-anticipated-adaptation-02 (work in
              progress), June 2018.

   [I-D.song-ippm-postcard-based-telemetry]
              Song, H., Zhou, T., Li, Z., Shin, J., and K. Lee,
              "Postcard-based On-Path Flow Data Telemetry",
              draft-song-ippm-postcard-based-telemetry-06 (work in
              progress), October 2019.

   [I-D.song-opsawg-dnp4iq]
              Song, H. and J. Gong, "Requirements for Interactive Query
              with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01
              (work in progress), June 2017.

   [I-D.song-opsawg-ifit-framework]
              Song, H., Qin, F., Chen, H., Jin, J., and J. Shin,
              "In-situ Flow Information Telemetry",
              draft-song-opsawg-ifit-framework-11 (work in progress),
              March 2020.

   [I-D.zhou-netconf-multi-stream-originators]
              Zhou, T., Zheng, G., Voit, E., and A. Clemm, "Subscription
              to Multiple Stream Originators",
              draft-zhou-netconf-multi-stream-originators-10 (work in
              progress), November 2019.

   [RFC1157]  Case, J., Fedor, M., Schoffstall, M., and J. Davin,
              "Simple Network Management Protocol (SNMP)", RFC 1157,
              DOI 10.17487/RFC1157, May 1990.

   [RFC2981]  Kavasseri, R., Ed., "Event MIB", RFC 2981,
              DOI 10.17487/RFC2981, October 2000.

   [RFC3416]  Presuhn, R., Ed., "Version 2 of the Protocol Operations
              for the Simple Network Management Protocol (SNMP)",
              STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002.

   [RFC3877]  Chisholm, S. and D. Romascanu, "Alarm Management
              Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
              September 2004.

   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
              Zekauskas, "A One-way Active Measurement Protocol
              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC6020]  Bjorklund, M., Ed., "YANG - A Data Modeling Language for
              the Network Configuration Protocol (NETCONF)", RFC 6020,
              DOI 10.17487/RFC6020, October 2010.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
              and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013.

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014.

   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
              DOI 10.17487/RFC7540, May 2015.

   [RFC7575]  Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A.,
              Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic
              Networking: Definitions and Design Goals", RFC 7575,
              DOI 10.17487/RFC7575, June 2015.

   [RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with
              Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799,
              May 2016.

   [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
              Monitoring Protocol (BMP)", RFC 7854,
              DOI 10.17487/RFC7854, June 2016.

   [RFC8321]  Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli,
              L., Chen, M., Zheng, L., Mirsky, G., and T.
              Mizrahi, "Alternate-Marking Method for Passive and Hybrid
              Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321,
              January 2018.

Appendix A.  A Survey on Existing Network Telemetry Techniques

   In this non-normative appendix, we provide an overview of some
   existing techniques and standard proposals for each network telemetry
   module.

A.1.  Management Plane Telemetry

A.1.1.  Push Extensions for NETCONF

   NETCONF [RFC6241] is a popular network management protocol, which is
   also recommended by the IETF. Although it can be used for data
   collection, its strength is configuration. YANG Push
   [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber
   applications to request a continuous, customized stream of updates
   from a YANG datastore. Providing such visibility into changes made
   upon YANG configuration and operational objects enables new
   capabilities based on the remote mirroring of configuration and
   operational state. Moreover, the distributed data collection
   mechanism [I-D.zhou-netconf-multi-stream-originators] via a UDP-based
   publication channel [I-D.ietf-netconf-udp-pub-channel] provides
   enhanced efficiency for NETCONF-based telemetry.

A.1.2.  gRPC Network Management Interface

   gRPC Network Management Interface (gNMI)
   [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol
   based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote
   Procedure Call) framework. With a single gRPC service definition,
   both configuration and telemetry can be covered. gRPC is an HTTP/2
   [RFC7540] based open source micro service communication framework.
   It provides a number of capabilities which are well-suited for
   network telemetry, including:

   o  A full-duplex streaming transport model combined with a binary
      encoding mechanism provides further improved telemetry efficiency.

   o  gRPC provides higher-level feature consistency across platforms
      that common HTTP/2 libraries typically do not. This
      characteristic is especially valuable because telemetry data
      collectors normally reside on a large variety of platforms.

   o  A built-in load-balancing and failover mechanism.

A.2.  Control Plane Telemetry

A.2.1.  BGP Monitoring Protocol

   BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP
   sessions and is intended to provide a convenient interface for
   obtaining route views.

   The BGP routing information is collected from the monitored device(s)
   to the BMP monitoring station by setting up the BMP TCP session. The
   BGP peers are monitored by the BMP Peer Up and Peer Down
   Notifications. The BGP routes (including Adjacency_RIB_In [RFC7854],
   Adjacency_RIB_Out [I-D.ietf-grow-bmp-adj-rib-out], and Local_RIB
   [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
   Monitoring Message and the BMP Route Mirroring Message, in the form
   of both initial table dump and real-time route update. In addition,
   BGP statistics are reported through the BMP Stats Report Message,
   which could be either timer triggered or event-driven. More BMP
   extensions can be explored to enrich the applications of BGP
   monitoring.

A.3.  Data Plane Telemetry

A.3.1.  The IPFPM technology

   The Alternate Marking method is an efficient way to perform packet
   loss, delay, and jitter measurements in both IP and overlay networks,
   as presented in IPFPM [RFC8321] and
   [I-D.fioccola-ippm-multipoint-alt-mark].

   This technique can be applied to point-to-point and
   multipoint-to-multipoint flows. Alternate Marking creates batches of
   packets by alternating the value of 1 bit (or a label) of the packet
   header.
1324 These batches of packets are unambiguously recognized over the 1325 network, and comparing the packet counters for each batch allows 1326 packet loss to be calculated. The same idea can be applied to delay 1327 measurement by selecting ad hoc packets with a marking bit dedicated 1328 to delay measurements. 1330 The Alternate Marking method needs two counters per marking period for 1331 each monitored flow. For instance, with n measurement 1332 points and m monitored flows, the number of packet 1333 counters for each time interval is on the order of n*m*2 (one per color). 1335 Since networks offer rich sets of network performance measurement 1336 data (e.g., packet counters), traditional approaches run into 1337 limitations. One reason is that the bottleneck lies in the 1338 generation and export of the data and in the amount of data that can 1339 reasonably be collected from the network. In addition, management tasks 1340 related to determining and configuring which data to generate lead to 1341 significant deployment challenges. 1343 The Multipoint Alternate Marking approach, described in 1344 [I-D.fioccola-ippm-multipoint-alt-mark], aims to resolve this issue 1345 and makes performance monitoring more flexible when a detailed 1346 analysis is not needed. 1348 An application orchestrates network performance measurement tasks 1349 across the network to optimize monitoring, and it can 1350 calibrate how deeply monitoring data is obtained from the network 1351 by configuring measurement points coarsely or finely. 1353 Using Alternate Marking, it is possible to monitor a multipoint 1354 network without examining it in depth by using network clustering 1355 (clusters are subnetworks, i.e., portions of the entire network, that preserve 1356 the same property as the entire network).
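As a non-normative illustration of the per-batch counter comparison and the n*m*2 counter estimate described in this section, the following Python sketch may help. It is only a sketch under stated assumptions: the names BatchCount, batch_loss, and counters_per_interval are hypothetical and do not come from RFC 8321 or any implementation, and the counters are assumed to have already been collected from two measurement points for the same flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchCount:
    """Hypothetical record: packets counted for one batch of one flow."""
    period: int   # index of the marking period
    color: int    # value of the marking bit during that period (0 or 1)
    packets: int  # packets of that color seen at the measurement point


def batch_loss(upstream, downstream):
    """Return {period: packets lost between two measurement points}.

    Only batches reported by both points are compared, since a batch
    can be compared only after its marking period has ended.
    """
    down = {(b.period, b.color): b.packets for b in downstream}
    losses = {}
    for b in upstream:
        key = (b.period, b.color)
        if key in down:
            losses[b.period] = b.packets - down[key]
    return losses


def counters_per_interval(n_points, m_flows):
    """Counters needed per interval: one per color, flow, and point."""
    return n_points * m_flows * 2


if __name__ == "__main__":
    up = [BatchCount(0, 0, 1000), BatchCount(1, 1, 980)]
    dn = [BatchCount(0, 0, 997), BatchCount(1, 1, 980)]
    print(batch_loss(up, dn))            # {0: 3, 1: 0}
    print(counters_per_interval(4, 10))  # 80
```

In this sketch, a nonzero loss for a batch is the kind of signal an application could use to decide that measurement points need to be configured more finely.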
If there is 1357 packet loss or the delay is too high, the filtering 1358 criteria can be made more specific in order to perform a detailed 1359 analysis by using a different combination of clusters, up to a per- 1360 flow measurement as described in IPFPM [RFC8321]. 1362 In summary, an application can configure end-to-end network 1363 monitoring. If the network does not experience issues, this 1364 approximate monitoring is good enough and is very cheap in terms of 1365 network resources. However, in case of problems, the application 1366 becomes aware of the issues from this approximate monitoring and, in 1367 order to localize the portion of the network that has issues, 1368 configures the measurement points more exhaustively. A new, 1369 detailed round of monitoring is then performed. After the detection and resolution 1370 of the problem, the initial approximate monitoring can be used again. 1372 A.3.2. Dynamic Network Probe 1374 Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq] 1375 provides a programmable means to customize the data that an 1376 application collects from the data plane. A direct benefit of DNP is 1377 the reduction of the exported data. A full DNP solution covers 1378 several components including data source, data subscription, and data 1379 generation. The data subscription needs to define the complex data 1380 which can be composed and derived from the raw data sources. The 1381 data generation takes advantage of moderate in-network computing 1382 to produce the desired data. 1384 While DNP can introduce unforeseeable flexibility to data plane 1385 telemetry, it also faces some challenges. It requires a flexible 1386 data plane that can be dynamically reprogrammed at run-time. The 1387 programming API is yet to be defined. 1389 A.3.3. IP Flow Information Export (IPFIX) protocol 1391 Traffic on a network can be seen as a set of flows passing through 1392 network elements.
IP Flow Information Export (IPFIX) [RFC7011] 1393 provides a means of transmitting traffic flow information for 1394 administrative or other purposes. A typical IPFIX-enabled system 1395 includes a pool of Metering Processes that collect data packets at one or 1396 more Observation Points, optionally filter them, and aggregate 1397 information about these packets. An Exporter then gathers each of 1398 the Observation Points together into an Observation Domain and sends 1399 this information via the IPFIX protocol to a Collector. 1401 A.3.4. In-Situ OAM 1403 Traditional passive and active monitoring and measurement techniques 1404 are either inaccurate or resource-consuming. It is preferable to 1405 directly acquire data associated with a flow's packets when the 1406 packets pass through a network. In-situ OAM (IOAM) 1407 [I-D.ietf-ippm-ioam-data], a data generation technique, embeds a new 1408 instruction header in user packets, and the instruction directs the 1409 network nodes to add the requested data to the packets. Thus, at the 1410 path end, the packet's experience gained on the entire forwarding 1411 path can be collected. Such firsthand data is invaluable to many 1412 network OAM applications. 1414 However, IOAM also faces some challenges. Issues of performance 1415 impact, security, scalability and overhead limits, encapsulation 1416 difficulties in some protocols, and cross-domain deployment need to 1417 be addressed. 1419 A.3.5. Postcard Based Telemetry 1421 Postcard-Based Telemetry (PBT) [I-D.song-ippm-postcard-based-telemetry] is an alternative to 1422 IOAM. PBT directly exports data at each node through an independent 1423 packet. PBT addresses several issues of IOAM. It can also help 1424 identify where a packet was dropped on its 1425 forwarding path. 1427 A.4. External Data and Event Telemetry 1429 A.4.1.
Sources of External Events 1431 To ensure that the information provided by external event detectors 1432 and used by the network management solutions is meaningful for 1433 management purposes, the network telemetry framework must ensure that 1434 such detectors (sources) are easily connected to the management 1435 solutions (sinks). This requires specifying a simple 1436 taxonomy of detectors and matching it to the connectors and/or 1437 interfaces required to connect them. 1439 Once detectors are classified in such a taxonomy, their definitions are 1440 extended with the qualities and other aspects used to handle them and 1441 are represented in the ontology and information model (e.g., YANG). 1442 Therefore, differentiating several types of detectors as potential 1443 sources of external events is essential for the integrity of the 1444 management framework. We thus differentiate the following source 1445 types of external events: 1447 o Smart objects and sensors. With the consolidation of the Internet 1448 of Things (IoT), any network system will have many smart objects 1449 attached to its physical surroundings and logical operation 1450 environments. Most of these objects will be essentially based on 1451 sensors of many kinds (e.g., temperature, humidity, presence), and 1452 the information they provide can be very useful for the management 1453 of the network, even when they are not specifically deployed for 1454 that purpose. Elements of this source type will usually provide a 1455 specific protocol for interaction, especially one of those 1456 protocols related to IoT, such as the Constrained Application 1457 Protocol (CoAP). It will be used by the telemetry framework to 1458 interact with the relevant objects. 1460 o Online news reporters. Several online news services have the 1461 ability to provide an enormous quantity of information about 1462 different events occurring in the world.
Some of those events can 1463 affect the network system managed by a specific framework, which 1464 will therefore be interested in getting such information. For 1465 instance, diverse security reports, such as the Common 1466 Vulnerabilities and Exposures (CVE), can be issued by the 1467 corresponding authority and used by the management solution to 1468 update the managed system if needed. Instead of a specific 1469 protocol and data format, the sources of this kind of information 1470 usually follow a relaxed but structured format. This format will 1471 be part of both the ontology and information model of the 1472 telemetry framework. 1474 o Global event analyzers. The advance of Big Data analyzers 1475 provides a huge amount of information and, more interestingly, the 1476 identification of events detected by analyzing many data streams 1477 from different origins. In contrast with the other types of 1478 sources, which are focused on specific events, the detectors of 1479 this source type will detect very generic events. For example, a 1480 sports event takes place, some unexpected development makes it 1481 highly interesting, and many people connect to sites that are 1482 covering the event. The systems supporting the services that 1483 cover the event can be affected by such a situation, so their 1484 management solutions should be aware of it. In contrast with the 1485 other source types, a new information model, format, and reporting 1486 protocol are required to integrate the detectors of this type with 1487 the management solution. 1489 Additional detector types can be added to the system, but 1490 they will generally be the result of composing the properties offered 1491 by these main classes. In any case, future revisions of the network 1492 telemetry framework will include the required types that cover new 1493 circumstances and that cannot be obtained by composition. 1495 A.4.2.
Connectors and Interfaces 1497 To allow external event detectors to be properly integrated with 1498 other management solutions, both elements must expose interfaces and 1499 protocols that are appropriate to their particular objective. Since 1500 external event detectors will be focused on providing their 1501 information to their main consumers, which generally will not be 1502 limited to the network management solutions, the framework must 1503 include the definition of the required connectors to ensure that the 1504 interconnection between detectors (sources) and their consumers 1505 within the management systems (sinks) is effective. 1507 In some situations, the interconnection between the external event 1508 detectors and the management system is via the management plane. For 1509 those situations, there will be a special connector that provides the 1510 typical interfaces found in most other elements connected to the 1511 management plane. For instance, the interfaces will comply with 1512 a specific information model (YANG) and a specific telemetry protocol, 1513 such as NETCONF, SNMP, or gRPC. 1515 Authors' Addresses 1516 Haoyu Song 1517 Futurewei 1518 2330 Central Expressway 1519 Santa Clara 1520 USA 1522 Email: hsong@futurewei.com 1524 Fengwei Qin 1525 China Mobile 1526 No. 32 Xuanwumenxi Ave., Xicheng District 1527 Beijing, 100032 1528 P.R. China 1530 Email: qinfengwei@chinamobile.com 1532 Pedro Martinez-Julia 1533 NICT 1534 4-2-1, Nukui-Kitamachi 1535 Koganei, Tokyo 184-8795 1536 Japan 1538 Email: pedro@nict.go.jp 1540 Laurent Ciavaglia 1541 Nokia 1542 Villarceaux 91460 1543 France 1545 Email: laurent.ciavaglia@nokia.com 1547 Aijun Wang 1548 China Telecom 1549 Beiqijia Town, Changping District 1550 Beijing, 102209 1551 P.R. China 1553 Email: wangaj.bri@chinatelecom.cn