OPSAWG                                                      H. Song, Ed.
Internet-Draft                                                 Futurewei
Intended status: Informational                                    F. Qin
Expires: April 10, 2020                                     China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                                   Nokia
                                                                 A. Wang
                                                           China Telecom
                                                         October 8, 2019


                      Network Telemetry Framework
                        draft-ietf-opsawg-ntf-02

Abstract

   Network telemetry is the technology for gaining network insight and
   facilitating efficient and automated network management. It engages
   various techniques for remote data collection, correlation, and
   consumption. This document provides an architectural framework for
   network telemetry, motivated by the network operation challenges and
   requirements. As evidenced by some key characteristics and industry
   practices, network telemetry covers technologies and protocols beyond
   the conventional network Operations, Administration, and Maintenance
   (OAM). It promises better flexibility, scalability, accuracy,
   coverage, and performance and allows automated control loops to suit
   both today's and tomorrow's network operation. This document
   clarifies the terminology and classifies the modules and components
   of a network telemetry system from several different perspectives.
   To the best of our knowledge, this document is the first such effort
   for network telemetry in industry standards organizations. The
   framework and taxonomy help to set a common ground for the collection
   of related work and provide guidance for future technique and
   standard developments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 10, 2020.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Motivation
     2.1.  Use Cases
     2.2.  Challenges
     2.3.  Glossary
     2.4.  Network Telemetry
   3.  The Necessity of a Network Telemetry Framework
   4.  Network Telemetry Framework
     4.1.  Data Acquiring Mechanisms and Data Types
     4.2.  Data Object Modules
       4.2.1.  Requirements and Challenges for each Module
     4.3.  Function Components
     4.4.  Existing Works Mapped in the Framework
   5.  Evolution of Network Telemetry
   6.  Security Considerations
   7.  IANA Considerations
   8.  Contributors
   9.  Acknowledgments
   10. Informative References
   Appendix A.  A Survey on Existing Network Telemetry Techniques
     A.1.  Management Plane Telemetry
       A.1.1.  Push Extensions for NETCONF
       A.1.2.  gRPC Network Management Interface
     A.2.  Control Plane Telemetry
       A.2.1.  BGP Monitoring Protocol
     A.3.  Data Plane Telemetry
       A.3.1.  The IPFPM technology
       A.3.2.  Dynamic Network Probe
       A.3.3.  IP Flow Information Export (IPFIX) protocol
       A.3.4.  In-Situ OAM
       A.3.5.  Postcard Based Telemetry
     A.4.  External Data and Event Telemetry
       A.4.1.  Sources of External Events
       A.4.2.  Connectors and Interfaces
   Authors' Addresses

1.  Introduction

   Network visibility is the ability of management tools to see the
   state and behavior of a network. It is essential for successful
   network operation. Network telemetry is the process of measuring,
   correlating, recording, and distributing information about the
   behavior of a network. Network telemetry is considered an ideal
   means to gain sufficient network visibility with better flexibility,
   scalability, accuracy, coverage, and performance than some
   conventional network Operations, Administration, and Maintenance
   (OAM) techniques.

   However, so far the term network telemetry lacks a solid and
   unambiguous definition. Its scope and coverage cause confusion
   and misunderstandings.
   It is beneficial to clarify the concept and provide a clear
   architectural framework for network telemetry, so we can articulate
   the technical field and better align the related techniques and
   standards work.

   To fulfill such an undertaking, we first discuss some key
   characteristics of network telemetry which set a clear distinction
   from the conventional network OAM and show that some conventional OAM
   technologies can be considered a subset of the network telemetry
   technologies. We then provide an architectural framework from three
   different perspectives for network telemetry. We show how network
   telemetry can meet the current and future network operation
   requirements, and the challenges each telemetry module is facing.
   Based on the distinction of modules and function components, we can
   easily map the existing and emerging techniques and protocols into
   the framework. Finally, we outline a road-map for the evolution of
   the network telemetry system and discuss the potential security
   concerns for network telemetry.

   The purpose of the framework and taxonomy is to set a common ground
   for the collection of related work and provide guidance for future
   technique and standard developments. To the best of our knowledge,
   this document is the first such effort for network telemetry in
   industry standards organizations.

2.  Motivation

   The term big data is used to describe the extremely large volume
   of data sets that can be analyzed computationally to reveal patterns,
   trends, and associations. A network is undoubtedly a source of big
   data because of its scale and all the traffic that goes through it.
   It is easy to see that network OAM can benefit from network big data.

   Today one can easily access advanced big data analytics capability
   through a plethora of commercial and open source platforms (e.g.,
   Apache Hadoop), tools (e.g., Apache Spark), and techniques (e.g.,
   machine learning). Thanks to the advance of computing and storage
   technologies, network big data analytics gives network operators an
   unprecedented opportunity to gain network insights and move towards
   network autonomy. Some operators have started to explore the
   application of Artificial Intelligence (AI) to make sense of network
   data. Software tools can use the network data to detect and react to
   network faults, anomalies, and policy violations, as well as to
   predict future events. In turn, the network policy updates for
   planning, intrusion prevention, optimization, and self-healing may be
   applied.

   It is conceivable that an intent-driven autonomic network [RFC7575]
   is the logical next step for network evolution following Software-
   Defined Networking (SDN), aiming to reduce (or even eliminate) human
   labor, make the most efficient usage of network resources, and
   provide better services more aligned with customer requirements.
   Although it takes time to reach the ultimate goal, the journey has
   started nevertheless.

   However, while the data processing capability is improved and
   applications are hungry for more data, the networks lag behind in
   extracting and translating network data into useful and actionable
   information. The system bottleneck is shifting from data consumption
   to data supply. Both the number of network nodes and the traffic
   bandwidth keep increasing at a fast pace. The network configuration
   and policy change at much shorter intervals than ever before. More
   subtle events and fine-grained data through all network planes need
   to be captured and exported in real time.
   In a nutshell, it is a challenge to get enough high-quality data out
   of the network in an efficient, timely, and flexible manner.
   Therefore, we need to examine the existing network technologies and
   protocols, and identify any potential technique and standard gaps
   based on the real network and device architectures.

   In the remainder of this section, first we discuss several key use
   cases for today's and future network operations. Next, we show why
   the current network OAM techniques and protocols are insufficient for
   these use cases. The discussion underlines the need for new methods,
   techniques, and protocols which we may assign under an umbrella term:
   network telemetry.

2.1.  Use Cases

   These use cases are essential for network operations. While the list
   is by no means exhaustive, it is enough to highlight the requirements
   for data velocity, variety, volume, and veracity in networks.

   Policy and Intent Compliance:  Network policies are the rules that
      constrain the services for network access, provide service
      differentiation, or enforce specific treatment on the traffic.
      For example, a service function chain is a policy that requires
      the selected flows to pass through a set of ordered network
      functions. An intent is a high-level abstract policy which
      requires a complex translation and mapping process before being
      applied on networks. While a policy is enforced, the compliance
      needs to be verified and monitored continuously.

   SLA Compliance:  A Service-Level Agreement (SLA) defines the level of
      service a user expects from a network operator, which includes the
      metrics for the service measurement and remedy/penalty procedures
      when the service level misses the agreement. Users need to check
      if they get the service as promised and network operators need to
      evaluate how they can deliver the services that can meet the SLA.

   Root Cause Analysis:  Any network failure can be the cause or effect
      of a sequence of chained events. Troubleshooting and recovery
      require quick identification of the root cause of any observable
      issues. However, the root cause is not always straightforward to
      identify, especially when the failure is sporadic and the related
      and unrelated events are overwhelming. While machine learning
      technologies can be used for root cause analysis, it is up to the
      network to sense and provide all the relevant data.

   Network Optimization:  This covers all short-term and long-term
      network optimization techniques, including load balancing, Traffic
      Engineering (TE), and network planning. Network operators are
      motivated to optimize their network utilization and differentiate
      services for better Return On Investment (ROI) or lower Capital
      Expenditures (CAPEX). The first step is to know the real-time
      network conditions before applying policies for traffic
      manipulation. In some cases, micro-bursts need to be detected in
      a very short time-frame so that fine-grained traffic control can
      be applied to avoid network congestion. The long-term network
      capacity planning and topology augmentation also rely on the
      accumulated data of the network operations.

   Event Tracking and Prediction:  The visibility of user traffic path
      and performance is critical for healthy network operation.
      Numerous related network events are of interest to network
      operators. For example, network operators always want to learn
      where and why packets are dropped for an application flow. They
      also want to be warned of issues in advance so proactive actions
      can be taken to avoid catastrophic consequences.

2.2.  Challenges

   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog to monitor the network.
   Some other OAM techniques as described in [RFC7276] are also used to
   facilitate network troubleshooting. These conventional techniques
   are not sufficient to support the above use cases for the following
   reasons:

   o  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real-time and
      interactively. The poll-based low-frequency data collection is
      ill-suited for these applications. Subscription-based streaming
      data directly pushed from the data source (e.g., the forwarding
      chip) is preferred to provide enough data quantity and precision
      at scale.

   o  Comprehensive data is needed from packet processing engine to
      traffic manager, from line cards to main control board, from user
      flows to control protocol packets, from device configurations to
      operations, and from physical layer to application layer.
      Conventional OAM only covers a narrow range of data (e.g., SNMP
      only handles data from the Management Information Base (MIB)).
      Traditional network devices cannot provide all the necessary
      probes. An open and programmable network device is therefore
      needed.

   o  Many application scenarios need to correlate network-wide data
      from multiple sources (i.e., from distributed network devices,
      different components of a network device, or different network
      planes). A piecemeal solution often lacks the capability to
      consolidate the data from multiple sources. The composition of a
      complete solution, as partly proposed by Autonomic Resource
      Control Architecture (ARCA)
      [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
      guided by a comprehensive framework.

   o  Some of the conventional OAM techniques (e.g., CLI and Syslog)
      lack a formal data model. The unstructured data hinder tool
      automation and application extensibility. Standardized data
      models are essential to support the programmable networks.

   o  Although some conventional OAM techniques support data push (e.g.,
      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data
      are limited to only predefined management plane warnings (e.g.,
      SNMP Trap) or sampled user packets (e.g., sFlow). We require
      data with arbitrary source, granularity, and precision, which are
      beyond the capability of the existing techniques.

   o  The conventional passive measurement techniques can either consume
      too many network resources and render too much redundant data, or
      lead to inaccurate results; the conventional active measurement
      techniques can interfere with the user traffic and their results
      are indirect. We need techniques that can collect direct and
      on-demand data from user traffic.

2.3.  Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document. We make an intentional distinction between
   network telemetry and network OAM.

   AI:  Artificial Intelligence. In the network domain, AI refers to
      the machine-learning based technologies for automated network
      operation and other tasks.

   BMP:  BGP Monitoring Protocol, specified in [RFC7854].

   DNP:  Dynamic Network Probe, referring to programmable in-network
      sensors for network monitoring and measurement.

   DPI:  Deep Packet Inspection, referring to the techniques that
      examine packets beyond the L3/L4 headers.

   gNMI:  gRPC Network Management Interface, a network management
      protocol from OpenConfig Operator Working Group, mainly
      contributed by Google. See [gnmi] for details.

   gRPC:  gRPC Remote Procedure Call, an open source high performance
      RPC framework that gNMI is based on. See [grpc] for details.

   IPFIX:  IP Flow Information Export Protocol, specified in [RFC7011].

   IPFPM:  IP Flow Performance Measurement method, specified in
      [RFC8321].

   IOAM:  In-situ OAM, a dataplane on-path telemetry technique.

   NETCONF:  Network Configuration Protocol, specified in [RFC6241].

   Network Telemetry:  Acquiring and processing network data remotely
      for network monitoring and operation. A general term for a large
      set of network visibility techniques and protocols, with the
      characteristics defined in this document. Network telemetry
      addresses the current network operation issues and enables smooth
      evolution toward intent-driven autonomous networks.

   NMS:  Network Management System, referring to applications that allow
      network administrators to manage a network's software and hardware
      components. It usually records data from a network's remote
      points to carry out central reporting to a system administrator.

   OAM:  Operations, Administration, and Maintenance. A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions. Most conventional network monitoring
      techniques and protocols belong to network OAM.

   PBT:  Postcard-Based Telemetry, a dataplane on-path telemetry
      technique.

   SNMP:  Simple Network Management Protocol. Versions 1 and 2 are
      specified in [RFC1157] and [RFC3416], respectively.

   YANG:  The abbreviation of "Yet Another Next Generation". YANG is a
      data modeling language for the definition of data sent over
      network management protocols such as NETCONF and RESTCONF.
      YANG is defined in [RFC6020].

   YANG FSM:  A YANG model that describes events, operations, and finite
      state machine of YANG-defined network elements.

   YANG PUSH:  A method to subscribe to pushed data from remote YANG
      datastores on network devices.

2.4.  Network Telemetry

   Network telemetry has emerged as a mainstream technical term to refer
   to the newer data collection and consumption techniques,
   distinguishing itself from the conventional techniques for network
   OAM.
   The representative techniques and protocols include IPFIX [RFC7011]
   and gRPC [grpc]. Network telemetry allows separate entities to
   acquire data from network devices so that data can be visualized and
   analyzed to support network monitoring and operation. Network
   telemetry overlaps with the conventional network OAM and has a wider
   scope than it. It is expected that network telemetry can provide the
   necessary network insight for autonomous networks and address the
   shortcomings of conventional OAM techniques.

   One difference between network telemetry and network OAM is that
   network telemetry assumes machines as data consumers rather than
   human operators. Hence, network telemetry can directly trigger
   automated network operation, while the conventional OAM tools usually
   help human operators to monitor and diagnose the networks and guide
   manual network operations. The difference leads to very different
   techniques.

   Although the network telemetry techniques are just emerging and
   subject to continuous evolution, several characteristics of network
   telemetry have been well accepted (Note that network telemetry is
   intended to be an umbrella term covering a wide spectrum of
   techniques, so the following characteristics are not expected to be
   held by every specific technique):

   o  Push and Streaming: Instead of polling data from network devices,
      the telemetry collector subscribes to the streaming data pushed
      from data sources in network devices.

   o  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by human beings. Therefore, the data
      volume is huge and the processing is often in real time.

   o  Normalization and Unification: Telemetry aims to address the
      overall network automation needs. The piecemeal solutions offered
      by the conventional OAM approach are no longer suitable.
      Efforts need to be made to normalize the data representation and
      unify the protocols.

   o  Model-based: The telemetry data is modeled in advance which allows
      applications to configure and consume data with ease.

   o  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross-domain, cross-device, and
      cross-layer) and needs to be correlated to take effect.

   o  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.

   In addition, an ideal network telemetry solution may also have the
   following features or properties:

   o  In-Network Customization: The data can be customized in network at
      run-time to cater to the specific need of applications. This
      needs the support of a programmable data plane which allows probes
      to be deployed at flexible locations.

   o  In-Network Data Aggregation and Correlation: Network devices and
      aggregation points can work out which events and what data need
      to be stored, reported, or discarded, thus reducing the load on
      the central collection and processing points while still ensuring
      that the right information is ready to be processed in a timely
      way.

   o  In-Network Processing and Action: Sometimes it is not necessary or
      feasible to gather all information to a central point so that it
      can be processed and acted upon. It is possible for the data
      processing to be done in the network, and actions taken more
      locally and more responsively.

   o  Direct Data Plane Export: The data originated from data plane can
      be directly exported to the data consumer for efficiency,
      especially when the data bandwidth is large and real-time
      processing is required.

   o  In-band Data Collection: In addition to the passive and active
      data collection approaches, the new hybrid approach allows data
      to be collected directly for any target flow on its entire
      forwarding path.

   It is worth noting that, no matter how sophisticated a network
   telemetry system is, it should not be intrusive to networks and
   should avoid the pitfall of the "observer effect". That is, it
   should not change the network behavior or affect the forwarding
   performance.

   Although in many cases a network telemetry system is akin to the SDN
   architecture, it is important to understand that network telemetry
   does not imply the need for any centralized data processing and
   analytics engine. Telemetry data producers and consumers can
   perfectly work in distributed or peer-to-peer fashions instead.

3.  The Necessity of a Network Telemetry Framework

   Big data analytics and machine-learning based AI technologies are
   applied for network operation automation, relying on abundant data
   from networks. The single-sourced and static data acquisition cannot
   meet the data requirements. It is desirable to have a framework that
   integrates multiple telemetry approaches from different layers. This
   allows flexible combinations for different applications. The
   framework would benefit application development for the following
   reasons:

   o  The future autonomous networks will require a holistic view on
      network visibility. All the use cases and applications need to be
      supported uniformly and coherently under a single intelligent
      agent. Therefore, the protocols and mechanisms should be
      consolidated into a minimum yet comprehensive set. A telemetry
      framework can help to normalize the technique developments.

   o  Network visibility presents multiple viewpoints.
      For example, the device viewpoint takes the network
      infrastructure as the monitoring object from which the network
      topology and device status can be acquired; the traffic viewpoint
      takes the flows or packets as the monitoring object from which the
      traffic quality and path can be acquired. An application may need
      to switch its viewpoint during operation. It may also need to
      correlate a service and its impact on the network experience to
      acquire the comprehensive information.

   o  Applications require network telemetry to be elastic in order to
      efficiently use the network resource and reduce the performance
      impact. Routine network monitoring covers the entire network with
      low data sampling rate. When issues arise or trends emerge, the
      telemetry data source can be modified and the data rate can be
      boosted.

   o  Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

   A telemetry framework collects together all of the telemetry-related
   work from different sources and working groups within the IETF. This
   makes it possible to assemble a comprehensive network telemetry
   system and to avoid repetitious or redundant work. The framework
   should cover the concepts and components from the standardization
   perspective. This document clarifies the layered modules on which
   the telemetry is exerted and decomposes the telemetry system into a
   set of distinct components that the existing and future work can
   easily map to.

4.  Network Telemetry Framework

   Network telemetry techniques can be classified along multiple
   dimensions. In this document, we provide three unique perspectives:
   data acquiring mechanisms, data objects, and function components.

4.1.  Data Acquiring Mechanisms and Data Types

   Broadly speaking, network data can be acquired through subscription
   (push) and query (poll).
   A subscriber may request data when it is ready. It follows a
   Publish-Subscribe (Pub-Sub) mode or a Subscribe-Publish (Sub-Pub)
   mode. In the Pub-Sub mode, pre-defined data are published and
   multiple qualified subscribers can subscribe to the data. In the
   Sub-Pub mode, a subscriber designates what data are of interest and
   requests that the network devices deliver the data when they are
   available.

   In contrast, a querier expects immediate feedback from network
   devices. It is usually used in a more interactive environment. The
   queried data may be directly extracted from some specific data
   source, or synthesized and processed from raw data.

   There are four types of data from network devices:

   Simple Data:  The data that are steadily available from some data
      store or static probes in network devices. Such data can be
      specified by a YANG model.

   Complex Data:  The data that need to be synthesized or processed from
      raw data from one or more network devices. The data processing
      function can be statically or dynamically loaded into network
      devices.

   Event-triggered Data:  The data that are conditionally acquired based
      on the occurrence of some event. An event can be modeled as a
      Finite State Machine (FSM).

   Streaming Data:  The data that are continuously or periodically
      generated. It can be time series or the dump of databases. The
      streaming data reflect real-time network states and metrics and
      require large bandwidth and processing power.

   The above data types are not mutually exclusive. For example, event-
   triggered data can be simple or complex, and streaming data can be
   event triggered.
   The relationships of these data types are illustrated in Figure 1.

                     +--------------------------+
                     | +----------------------+ |
                     | | +-----------------+  | |
                     | | | +-------------+ |  | |
                     | | | | Simple Data | |  | |
                     | | | +-------------+ |  | |
                     | | |  Complex Data   |  | |
                     | | +-----------------+  | |
                     | | Event-triggered Data | |
                     | +----------------------+ |
                     |      Streaming Data      |
                     +--------------------------+

                    Figure 1: Data Type Relationship

   Subscription usually deals with event-triggered data and streaming
   data, and query usually deals with simple data and complex data. It
   is easy to see that conventional OAM techniques are mostly about
   querying simple data only. While these techniques are still useful,
   advanced network telemetry techniques pay more attention to the other
   three data types, and prefer event/streaming data subscription and
   complex data query over simple data query.

4.2.  Data Object Modules

   Telemetry can be applied on the forwarding plane, the control plane,
   and the management plane in a network, as well as other sources out
   of the network, as shown in Figure 2. Therefore, we categorize
   network telemetry into four distinct modules, each having its own
   interface to Network Operation Applications.
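
   As a rough illustration of the two points above (four modules, each
   with its own interface, and the push and poll mechanisms of Section
   4.1), the following minimal Python sketch gives every module both a
   query (poll) call and a subscription (push) call. The class
   TelemetryModule, its methods, and the "queue_depth" key are
   hypothetical names chosen for this example; they are not defined by
   this document or by any telemetry protocol.

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Callable, Dict, List


class Module(Enum):
    """The four data object modules named in Section 4.2."""
    FORWARDING_PLANE = "forwarding plane"
    CONTROL_PLANE = "control plane"
    MANAGEMENT_PLANE = "management plane"
    EXTERNAL = "external data and event"


# Hypothetical callback shape: (module, key, value) -> None
Callback = Callable[[Module, str, Any], None]


class TelemetryModule:
    """Hypothetical per-module interface offering both acquisition modes."""

    def __init__(self, module: Module) -> None:
        self.module = module
        self._store: Dict[str, Any] = {}
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def query(self, key: str) -> Any:
        """Query (poll): return the current value at once, or None."""
        return self._store.get(key)

    def subscribe(self, key: str, callback: Callback) -> None:
        """Subscription (push): register interest; data arrive when ready."""
        self._subscribers[key].append(callback)

    def update(self, key: str, value: Any) -> None:
        """Record a new value and push it to every subscriber of the key."""
        self._store[key] = value
        for callback in self._subscribers[key]:
            callback(self.module, key, value)


# One independent interface per module, as the text describes.
modules = {m: TelemetryModule(m) for m in Module}

received = []
fwd = modules[Module.FORWARDING_PLANE]
fwd.subscribe("queue_depth",
              lambda mod, key, value: received.append((mod.name, key, value)))
fwd.update("queue_depth", 42)  # pushed to the subscriber registered above

print(received)
print(fwd.query("queue_depth"))
print(modules[Module.CONTROL_PLANE].query("queue_depth"))
```

   In this sketch a subscriber names the data of interest before any
   value exists, which corresponds to the Sub-Pub mode; the control
   plane module returns nothing for the same key because each module
   keeps its own data.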

          +------------------------------+
          |                              |
          |       Network Operation      |<-------+
          |         Applications         |        |
          |                              |        |
          +------------------------------+        |
               ^      ^           ^               |
               |      |           |               |
               V      |           V               V
          +-----------|---+--------------+  +-----------+
          |           |   |              |  |           |
          | Control Pl|ane|              |  | External  |
          | Telemetry | <--->            |  | Data and  |
          |           |   |              |  | Event     |
          |      ^    V   |  Management  |  | Telemetry |
          +------|--------+  Plane       |  |           |
          |      V        |  Telemetry   |  +-----------+
          | Forwarding    |              |
          | Plane       <--->            |
          | Telemetry     |              |
          |               |              |
          +---------------+--------------+

                Figure 2: Modules in Layer Category of NTF

   The rationale of this partition lies in the different telemetry data
   objects, which result in different data sources and export
   locations. Such differences have profound implications for in-network
   data programming and processing capability, data encoding and
   transport protocol, and data bandwidth and latency.

   We summarize the major differences of the four modules in the
   following table. They are compared mainly in six aspects: data
   object, data export location, data model, data encoding, telemetry
   protocol, and transport method. Data object is the target and source
   of each module. Because the data source varies, the data export
   location varies. Because each data export location has different
   capabilities, the proper data model, encoding, and transport method
   cannot be kept the same. As a result, the suitable telemetry
   protocol for each module can be different. Some representative
   techniques are shown in some table blocks to highlight the technical
   diversity of these modules. One cannot expect to use a universal
   protocol to cover all the network telemetry requirements.

   +---------+--------------+--------------+--------------+-----------+
   | Module  | Control      | Management   | Forwarding   | External  |
   |         | Plane        | Plane        | Plane        | Data      |
   +---------+--------------+--------------+--------------+-----------+
   |Object   | control      | config. &    | flow & packet| terminal, |
   |         | protocol &   | operation    | QoS, traffic | social &  |
   |         | signaling,   | state, MIB   | stat., buffer| environ-  |
   |         | RIB, ACL     |              | & queue stat.| mental    |
   +---------+--------------+--------------+--------------+-----------+
   |Export   | main control | main control | fwding chip  | various   |
   |Location | CPU,         | CPU          | or linecard  |           |
   |         | linecard CPU |              | CPU; main    |           |
   |         | or fwding    |              | control CPU  |           |
   |         | chip         |              | unlikely     |           |
   +---------+--------------+--------------+--------------+-----------+
   |Data     | YANG,        | MIB, syslog, | template,    | YANG      |
   |Model    | custom       | YANG,        | YANG,        |           |
   |         |              | custom       | custom       |           |
   +---------+--------------+--------------+--------------+-----------+
   |Data     | GPB, JSON,   | GPB, JSON,   | plain        | GPB, JSON,|
   |Encoding | XML, plain   | XML          |              | XML, plain|
   +---------+--------------+--------------+--------------+-----------+
   |Protocol | gRPC,NETCONF,| gRPC,NETCONF,| IPFIX, mirror| gRPC      |
   |         | IPFIX,mirror |              |              |           |
   +---------+--------------+--------------+--------------+-----------+
   |Transport| HTTP, TCP,   | HTTP, TCP    | UDP          | HTTP,TCP, |
   |         | UDP          |              |              | UDP       |
   +---------+--------------+--------------+--------------+-----------+

             Figure 3: Comparison of the Data Object Modules

   Note that the interaction with the network operation applications can
   be indirect. For example, in the management plane telemetry, the
   management plane may need to acquire data from the data plane. Some
   of the operational states, such as the interface status and
   statistics, can only be derived from the data plane.
   As another example,
   the control plane telemetry may need to access the Forwarding
   Information Base (FIB) in the data plane. On the other hand, an
   application may involve more than one plane simultaneously. For
   example, an SLA compliance application may require both the data
   plane telemetry and the control plane telemetry.

4.2.1. Requirements and Challenges for Each Module

4.2.1.1. Management Plane Telemetry

   The management plane of network elements interacts with the Network
   Management System (NMS), and provides information such as performance
   data, network logging data, network warning and defect data, and
   network statistics and state data. Some legacy protocols, such as
   SNMP and Syslog, are widely used for the management plane. However,
   these protocols are insufficient to meet the requirements of the
   future automated network operation applications.

   New management plane telemetry protocols should consider the
   following requirements:

   Convenient Data Subscription: An application should have the freedom
      to choose the data export means such as the data types and the
      export frequency.

   Structured Data: For automatic network operation, machines will
      replace humans in comprehending network data. Schema languages
      such as YANG can efficiently describe structured data and
      normalize data encoding and transformation.

   High Speed Data Transport: In order to retain the information, a
      server needs to send a large amount of data at high frequency.
      Compact encoding formats are needed to compress the data and
      improve the data transport efficiency. The push mode, by
      replacing the poll mode, can also reduce the interactions between
      clients and servers, which helps to improve the server's
      efficiency.

4.2.1.2. Control Plane Telemetry

   The control plane telemetry refers to the health condition monitoring
   of different network protocols, which covers Layer 2 to Layer 7.
   Keeping track of the running status of these protocols is beneficial
   for detecting, localizing, and even predicting various network
   issues, as well as for network optimization, in real time and at
   fine granularity.

   One of the most challenging problems for the control plane telemetry
   is how to correlate the end-to-end (E2E) Key Performance Indicators
   (KPIs) to a specific layer's KPIs. For example, an IPTV user may
   describe their User Experience (UE) by the video fluency and
   definition. In case of an unusually poor UE KPI or a service
   disconnection, it is non-trivial to delimit and localize the issue
   to the responsible protocol layer (e.g., the Transport Layer or the
   Network Layer), the responsible protocol (e.g., IS-IS or BGP at the
   Network Layer), and finally the responsible device(s) with specific
   reasons.

   Traditional OAM-based approaches for control plane KPI measurement
   include Ping (L3), Traceroute (L3), Y.1731 (L2), and so on. One
   common issue behind these methods is that they only measure the KPIs
   instead of reflecting the actual running status of these protocols,
   making them less effective or efficient for control plane
   troubleshooting and network optimization. An example of the control
   plane telemetry is the BGP Monitoring Protocol (BMP), which is
   currently used to monitor BGP routes and enables rich applications,
   such as BGP peer analysis, AS analysis, prefix analysis, security
   analysis, and so on. However, the monitoring of other layers and
   protocols, and the cross-layer, cross-protocol KPI correlations, are
   still in their infancy (e.g., the IGP monitoring is missing) and
   require substantial further research.

4.2.1.3. Data Plane Telemetry

   An effective data plane telemetry system relies on the data that the
   network device can expose. The data's quality, quantity, and
   timeliness must meet some stringent requirements. This raises some
   challenges to the network data plane devices where the first-hand
   data originate.

   o  A data plane device's main function is user traffic processing
      and forwarding. While supporting network visibility is important,
      the telemetry is just an auxiliary function, and it should not
      impede normal traffic processing and forwarding (i.e., the
      performance is not lowered and the behavior is not altered due to
      the telemetry functions).

   o  The network operation applications require end-to-end visibility
      from various sources, which results in a huge volume of data.
      However, the sheer data quantity should not stress the network
      bandwidth, regardless of the data delivery approach (i.e.,
      through in-band or out-of-band channels).

   o  The data plane devices must provide timely data with the minimum
      possible delay. Long processing, transport, storage, and analysis
      delays can impact the effectiveness of the control loop and even
      render the data useless.

   o  The data should be structured and labeled, and easy for
      applications to parse and consume. At the same time, the data
      types needed by applications can vary significantly. The data
      plane devices need to provide enough flexibility and
      programmability to support the precise data provision for
      applications.

   o  The data plane telemetry should support incremental deployment
      and work even though some devices are unaware of the system.
      This challenge is highly relevant to the standards and legacy
      networks.

   The industry has agreed that data plane programmability is
   essential to support network telemetry.
   Newer data plane chips are
   all equipped with advanced telemetry features and provide flexibility
   to support customized telemetry functions.

4.2.1.3.1. Technique Taxonomy

   The data plane telemetry techniques can be classified along
   multiple dimensions.

   Active and Passive: The active and passive methods (as well as the
      hybrid types) are well documented in [RFC7799]. The passive
      methods include tcpdump, IPFIX [RFC7011], sFlow, and traffic
      mirroring. These methods usually have low data coverage, and
      improving the coverage comes at a very high bandwidth cost.
      On the other hand, the active methods include Ping, Traceroute,
      OWAMP [RFC4656], and TWAMP [RFC5357]. These methods are intrusive
      and only provide indirect network measurement results. The hybrid
      methods, including in-situ OAM
      [I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and
      Multipoint Alternate Marking
      [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced
      and more flexible approach. However, these methods are also more
      complex to implement.

   In-Band and Out-of-Band: The telemetry data, before being exported
      to some collector, can be carried in user packets. Such methods
      are considered in-band (e.g., in-situ OAM
      [I-D.brockners-inband-oam-requirements]). If the telemetry data
      is directly exported to some collector without modifying the user
      packets, such methods are considered out-of-band (e.g., postcard-
      based INT). It is possible to have hybrid methods. For example,
      only the telemetry instruction or partial data is carried by user
      packets (e.g., IPFPM [RFC8321]).

   E2E and In-Network: Some E2E methods start from and end at the
      network end hosts (e.g., Ping). The other methods work in
      networks and are transparent to end hosts. However, if needed,
      the in-network methods can be easily extended into end hosts.

   Flow, Path, and Node: Depending on the telemetry objective, the
      methods can be flow-based (e.g., in-situ OAM
      [I-D.brockners-inband-oam-requirements]), path-based (e.g.,
      Traceroute), and node-based (e.g., IPFIX [RFC7011]).

4.2.1.4. External Data Telemetry

   Events that occur outside the boundaries of the network system are
   another important source of telemetry information. Correlating both
   internal telemetry data and external events with the requirements of
   network systems, as presented in Exploiting External Event Detectors
   to Anticipate Resource Requirements for the Elastic Adaptation of
   SDN/NFV Systems [I-D.pedro-nmrg-anticipated-adaptation], provides a
   strategic and functional advantage to management operations.

   As with other sources of telemetry information, the data and events
   must meet strict requirements, especially in terms of timeliness,
   which is essential to properly incorporate external event information
   into management cycles. Thus, the specific challenges are described
   as follows:

   o  The role of the external event detector can be played by multiple
      elements, including hardware (e.g., physical sensors, such as
      seismometers) and software (e.g., Big Data sources that analyze
      streams of information, such as Twitter messages). Thus, the
      transmitted data must support different shapes but, at the same
      time, follow a common but extensible ontology.

   o  Since the main function of the external event detectors is to
      perform the notifications, their timeliness is assumed. However,
      once messages have been dispatched, they must be quickly collected
      and inserted into the control plane with variable priority, which
      will be high for important sources and/or important events and low
      for secondary ones.

   o  The ontology used by external detectors must be easily adopted by
      current and future devices and applications.
      Therefore, it must
      be easily mapped to current information models, such as YANG
      models.

   Organizing both internal and external telemetry information together
   will be key for the general exploitation of the management
   possibilities of current and future network systems, as reflected in
   the incorporation of cognitive capabilities into new hardware and
   software (virtual) elements.

4.3. Function Components

   At each plane, the telemetry can be further partitioned into five
   distinct components:

   Data Query, Analysis, and Storage: This component works at the
      application layer. On the one hand, it is responsible for issuing
      data queries. The queries can be for modeled data through
      configuration or custom data through programming. The queries can
      be one shot or subscriptions for events or streaming data. On the
      other hand, it receives, stores, and processes the returned data
      from network devices. Data analysis can be interactive to
      initiate further data queries. Note that this component can
      reside in either network devices or remote controllers.

   Data Configuration and Subscription: This component deploys data
      queries on devices. It determines the protocol and channel for
      applications to acquire desired data. This component is also
      responsible for configuring the desired data that might not be
      directly available from data sources. The subscription data can
      be described by models, templates, or programs.

   Data Encoding and Export: This component determines how telemetry
      data are delivered to the data analysis and storage component.
      The data encoding and the transport protocol may vary due to the
      data exporting location.

   Data Generation and Processing: The requested data need to be
      captured, processed, and formatted in network devices from raw
      data sources.
      This may involve in-network computing and
      processing on either the fast path or the slow path in network
      devices.

   Data Object and Source: This component determines the monitoring
      object and original data source. The data source usually just
      provides raw data which needs further processing. A data source
      can be considered a probe. A probe can be statically installed or
      dynamically installed.

               +----------------------------------------+
               |                                        |
               |    Data Query, Analysis, & Storage     |
               |                                        |
               +----------------------------------------+
                         |                  ^
                         |                  |
                         V                  |
               +---------------------+------------------+
               | Data Configuration  |                  |
               | & Subscription      |  Data Encoding   |
               | (model, template,   |  & Export        |
               | & program)          |                  |
               +---------------------+------------------+
               |                                        |
               |            Data Generation             |
               |            & Processing                |
               |                                        |
               +----------------------------------------+
               |                                        |
               |         Data Object and Source         |
               |                                        |
               +----------------------------------------+

         Figure 4: Components in the Network Telemetry Framework

4.4. Existing Works Mapped in the Framework

   The following two tables provide a non-exhaustive list of existing
   works (mainly published in IETF and with the emphasis on the latest
   new technologies) and show their positions in the framework. The
   details about the mentioned work can be found in Appendix A.

          +-----------------+---------------+----------------+
          |                 | Query         | Subscription   |
          |                 |               |                |
          +-----------------+---------------+----------------+
          | Simple Data     | SNMP, NETCONF,|                |
          |                 | YANG, BMP,    |                |
          |                 | IOAM, PBT,gRPC|                |
          +-----------------+---------------+----------------+
          | Complex Data    | DNP, YANG FSM,|                |
          |                 | gRPC, NETCONF |                |
          +-----------------+---------------+----------------+
          | Event-triggered |               | gRPC, NETCONF, |
          | Data            |               | YANG PUSH, DNP,|
          |                 |               | IOAM, PBT,     |
          |                 |               | YANG FSM       |
          +-----------------+---------------+----------------+
          | Streaming Data  |               | gRPC, NETCONF, |
          |                 |               | IOAM, PBT, DNP,|
          |                 |               | IPFIX, IPFPM   |
          +-----------------+---------------+----------------+

                    Figure 5: Existing Work Mapping I

    +--------------+---------------+----------------+---------------+
    |              | Management    | Control        | Forwarding    |
    |              | Plane         | Plane          | Plane         |
    +--------------+---------------+----------------+---------------+
    | data config. | gRPC, NETCONF,| NETCONF/YANG   | NETCONF/YANG, |
    | & subscrib.  | YANG PUSH     |                | YANG FSM      |
    +--------------+---------------+----------------+---------------+
    | data gen. &  | DNP,          | DNP,           | IOAM,         |
    | processing   | YANG          | YANG           | PBT, IPFPM,   |
    |              |               |                | DNP           |
    +--------------+---------------+----------------+---------------+
    | data         | gRPC, NETCONF,| BMP, NETCONF   | IPFIX         |
    | export       | YANG PUSH     |                |               |
    +--------------+---------------+----------------+---------------+

                    Figure 6: Existing Work Mapping II

5. Evolution of Network Telemetry

   As networks evolve toward automated operation, network telemetry
   also undergoes several levels of evolution.

   Level 0 - Static Telemetry: The telemetry data source and type are
      determined at design time. The network operator can only
      configure how to use it with limited flexibility.

   Level 1 - Dynamic Telemetry: The telemetry data can be dynamically
      programmed or configured at runtime, allowing a tradeoff among
      resource, performance, flexibility, and coverage. DNP is an
      effort in this direction.

   Level 2 - Interactive Telemetry: The network operator can
      continuously customize the telemetry data in real time to reflect
      the network operation's visibility requirements. At this level,
      some tasks can be automated, although ultimately human operators
      will still need to sit in the middle to make decisions.

   Level 3 - Closed-loop Telemetry: Human operators are completely
      excluded from the control loop. The intelligent network operation
      engine automatically issues the telemetry data request, analyzes
      the data, and updates the network operations in closed control
      loops.

   While most of the existing technologies belong to level 0 and level
   1, with the help of a clearly defined network telemetry framework, we
   can assemble the technologies to support level 2 and make solid steps
   towards level 3.

6. Security Considerations

   Given that this document has proposed a framework for network
   telemetry and the telemetry mechanisms discussed are distinct (in
   both message frequency and traffic amount) from the conventional
   network OAM concepts, we must also reflect that various new security
   considerations may arise. A number of techniques already exist
   for securing the data plane, the control plane, and the management
   plane in a network, but it is important to consider whether any new
   threat vectors are now being enabled via the use of network telemetry
   procedures and mechanisms.

   Security considerations for networks that use telemetry methods may
   include:

   o  Telemetry framework trust and policy model;

   o  Role management and access control for enabling and disabling
      telemetry capabilities;

   o  Protocol transport used for telemetry data and its inherent
      security capabilities;

   o  Telemetry data stores, storage encryption, and methods of access;

   o  Tracking telemetry events and any abnormalities that might
      identify malicious attacks using telemetry interfaces.

   Some of the security considerations highlighted above may be
   minimized or negated with policy management of network telemetry. In
   a network telemetry deployment it would be advantageous to separate
   telemetry capabilities into different classes of policies, i.e., Role
   Based Access Control and Event-Condition-Action policies. Also,
   potential conflicts between network telemetry mechanisms must be
   detected accurately and resolved quickly to avoid unnecessary network
   telemetry traffic propagation escalating into an unintended or
   intended denial of service attack.

   Further discussion of this section will be required, and it is
   expected that this security section and a subsequent policy section
   will be developed further.

7. IANA Considerations

   This document includes no request to IANA.

8. Contributors

   The other contributors of this document are listed as follows.

   o  Tianran Zhou

   o  Zhenbin Li

   o  Daniel King

   o  Adrian Farrel

9. Acknowledgments

   We would like to thank Randy Presuhn, Joe Clarke, Victor Liu, James
   Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz Yegani,
   Young Lee, Alexander Clemm, Qin Wu, and many others who have provided
   helpful comments and suggestions to improve this document.

10. Informative References

   [gnmi]     "gNMI - gRPC Network Management Interface".

   [grpc]     "gRPC, A high performance, open-source universal RPC
              framework".

   [I-D.brockners-inband-oam-requirements]
              Brockners, F., Bhandari, S., Dara, S., Pignataro, C.,
              Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi,
              T., Lapukhov, P., and r. remy@barefootnetworks.com,
              "Requirements for In-situ OAM",
              draft-brockners-inband-oam-requirements-03 (work in
              progress), March 2017.

   [I-D.fioccola-ippm-multipoint-alt-mark]
              Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto,
              "Multipoint Alternate Marking method for passive and
              hybrid performance monitoring",
              draft-fioccola-ippm-multipoint-alt-mark-04 (work in
              progress), June 2018.

   [I-D.ietf-grow-bmp-adj-rib-out]
              Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S.
              Zhuang, "Support for Adj-RIB-Out in BGP Monitoring
              Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-07 (work
              in progress), August 2019.

   [I-D.ietf-grow-bmp-local-rib]
              Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
              "Support for Local RIB in BGP Monitoring Protocol (BMP)",
              draft-ietf-grow-bmp-local-rib-05 (work in progress),
              August 2019.

   [I-D.ietf-netconf-udp-pub-channel]
              Zheng, G., Zhou, T., and A. Clemm, "UDP based Publication
              Channel for Streaming Telemetry",
              draft-ietf-netconf-udp-pub-channel-05 (work in progress),
              March 2019.

   [I-D.ietf-netconf-yang-push]
              Clemm, A. and E. Voit, "Subscription to YANG Datastores",
              draft-ietf-netconf-yang-push-25 (work in progress), May
              2019.

   [I-D.kumar-rtgwg-grpc-protocol]
              Kumar, A., Kolhe, J., Ghemawat, S., and L. Ryan, "gRPC
              Protocol", draft-kumar-rtgwg-grpc-protocol-00 (work in
              progress), July 2016.

   [I-D.openconfig-rtgwg-gnmi-spec]
              Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack,
              C., and C. Morrow, "gRPC Network Management Interface
              (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in
              progress), March 2018.

   [I-D.pedro-nmrg-anticipated-adaptation]
              Martinez-Julia, P., "Exploiting External Event Detectors
              to Anticipate Resource Requirements for the Elastic
              Adaptation of SDN/NFV Systems",
              draft-pedro-nmrg-anticipated-adaptation-02 (work in
              progress), June 2018.

   [I-D.song-ippm-postcard-based-telemetry]
              Song, H., Zhou, T., Li, Z., Shin, J., and K. Lee,
              "Postcard-based On-Path Flow Data Telemetry",
              draft-song-ippm-postcard-based-telemetry-05 (work in
              progress), September 2019.

   [I-D.song-opsawg-dnp4iq]
              Song, H. and J. Gong, "Requirements for Interactive Query
              with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01
              (work in progress), June 2017.

   [I-D.zhou-netconf-multi-stream-originators]
              Zhou, T., Zheng, G., Voit, E., Clemm, A., and A. Bierman,
              "Subscription to Multiple Stream Originators",
              draft-zhou-netconf-multi-stream-originators-06 (work in
              progress), July 2019.

   [RFC1157]  Case, J., Fedor, M., Schoffstall, M., and J. Davin,
              "Simple Network Management Protocol (SNMP)", RFC 1157,
              DOI 10.17487/RFC1157, May 1990.

   [RFC2981]  Kavasseri, R., Ed., "Event MIB", RFC 2981,
              DOI 10.17487/RFC2981, October 2000.

   [RFC3416]  Presuhn, R., Ed., "Version 2 of the Protocol Operations
              for the Simple Network Management Protocol (SNMP)",
              STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002.

   [RFC3877]  Chisholm, S. and D. Romascanu, "Alarm Management
              Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
              September 2004.

   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
              Zekauskas, "A One-way Active Measurement Protocol
              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC6020]  Bjorklund, M., Ed., "YANG - A Data Modeling Language for
              the Network Configuration Protocol (NETCONF)", RFC 6020,
              DOI 10.17487/RFC6020, October 2010.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
              and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013.

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014.

   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
              DOI 10.17487/RFC7540, May 2015.

   [RFC7575]  Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A.,
              Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic
              Networking: Definitions and Design Goals", RFC 7575,
              DOI 10.17487/RFC7575, June 2015.

   [RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with
              Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799,
              May 2016.

   [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
              Monitoring Protocol (BMP)", RFC 7854,
              DOI 10.17487/RFC7854, June 2016.

   [RFC8321]  Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli,
              L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi,
              "Alternate-Marking Method for Passive and Hybrid
              Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321,
              January 2018.

Appendix A. A Survey on Existing Network Telemetry Techniques

   In this non-normative appendix, we provide an overview of some
   existing techniques and standard proposals for each network telemetry
   module.

A.1. Management Plane Telemetry

A.1.1. Push Extensions for NETCONF

   NETCONF [RFC6241] is one popular network management protocol, which
   is also recommended by IETF. Although it can be used for data
   collection, NETCONF is primarily good at configuration. YANG Push
   [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber
   applications to request a continuous, customized stream of updates
   from a YANG datastore. Providing such visibility into changes made
   upon YANG configuration and operational objects enables new
   capabilities based on the remote mirroring of configuration and
   operational state. Moreover, the distributed data collection
   mechanism [I-D.zhou-netconf-multi-stream-originators] via a UDP based
   publication channel [I-D.ietf-netconf-udp-pub-channel] provides
   enhanced efficiency for the NETCONF based telemetry.

A.1.2. gRPC Network Management Interface

   gRPC Network Management Interface (gNMI)
   [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol
   based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote
   Procedure Call) framework. With a single gRPC service definition,
   both configuration and telemetry can be covered. gRPC is an HTTP/2
   [RFC7540] based open source micro service communication framework.
   It provides a number of capabilities which are well-suited for
   network telemetry, including:

   o  A full-duplex streaming transport model combined with a binary
      encoding mechanism, which further improves telemetry efficiency.

   o  gRPC provides higher-level feature consistency across platforms
      that common HTTP/2 libraries typically do not.
      This
      characteristic is especially valuable because telemetry
      data collectors normally reside on a large variety of platforms.

   o  The built-in load-balancing and failover mechanism.

A.2. Control Plane Telemetry

A.2.1. BGP Monitoring Protocol

   BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP
   sessions and is intended to provide a convenient interface for
   obtaining route views.

   The BGP routing information is collected from the monitored device(s)
   to the BMP monitoring station by setting up the BMP TCP session. The
   BGP peers are monitored by the BMP Peer Up and Peer Down
   Notifications. The BGP routes (including Adjacency_RIB_In [RFC7854],
   Adjacency_RIB_out [I-D.ietf-grow-bmp-adj-rib-out], and Local_Rib
   [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
   Monitoring Message and the BMP Route Mirroring Message, in the form
   of both initial table dump and real-time route update. In addition,
   BGP statistics are reported through the BMP Stats Report Message,
   which could be either timer triggered or event-driven. More BMP
   extensions can be explored to enrich the applications of BGP
   monitoring.

A.3. Data Plane Telemetry

A.3.1. The IPFPM technology

   The Alternate Marking method is an efficient way to perform packet
   loss, delay, and jitter measurements in both IP and overlay networks,
   as presented in IPFPM [RFC8321] and
   [I-D.fioccola-ippm-multipoint-alt-mark].

   This technique can be applied to point-to-point and multipoint-to-
   multipoint flows. Alternate Marking creates batches of packets by
   alternating the value of 1 bit (or a label) of the packet header.
   These batches of packets are unambiguously recognized over the
   network and the comparison of packet counters for each batch allows
   the packet loss calculation.
   The same idea can be applied to delay
   measurement by selecting ad hoc packets with a marking bit dedicated
   to delay measurements.

   The Alternate Marking method needs two counters per marking period
   for each monitored flow. For instance, by considering n measurement
   points and m monitored flows, the order of magnitude of the packet
   counters for each time interval is n*m*2 (1 per color).

   Since networks offer rich sets of network performance measurement
   data (e.g., packet counters), traditional approaches run into
   limitations. One reason is that the bottleneck lies in the
   generation and export of the data and in the amount of data that can
   reasonably be collected from the network. In addition, management
   tasks related to determining and configuring which data to generate
   lead to significant deployment challenges.

   The Multipoint Alternate Marking approach, described in
   [I-D.fioccola-ippm-multipoint-alt-mark], aims to resolve this issue
   and makes the performance monitoring more flexible in case a detailed
   analysis is not needed.

   An application orchestrates network performance measurement tasks
   across the network to allow optimized monitoring, and it can
   calibrate how deeply monitoring data is obtained from the network
   by configuring measurement points coarsely or finely.

   Using Alternate Marking, it is possible to monitor a multipoint
   network without examining it in depth by using network clustering
   (clusters are subnetworks, i.e., portions of the entire network that
   preserve the same property as the entire network). So in case there
   is packet loss or the delay is too high, the filtering criteria can
   be made more specific in order to perform a detailed analysis by
   using a different combination of clusters, up to a per-flow
   measurement as described in IPFPM [RFC8321].
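   The counter comparison behind Alternate Marking can be sketched in
   a few lines of Python. This is a minimal illustration under stated
   assumptions, not the procedure specified in [RFC8321]: the
   BatchCount record, the batch_loss and counters_per_interval
   function names, and the idea of identifying a batch by a sequential
   batch_id are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class BatchCount:
    """Packets counted for one marking batch at one measurement point.

    Hypothetical record: batch_id numbers the marking periods and
    color is the value of the alternating marking bit (0 or 1).
    """
    batch_id: int
    color: int
    packets: int


def batch_loss(upstream: Iterable[BatchCount],
               downstream: Iterable[BatchCount]) -> Dict[int, int]:
    """Return packets lost per batch between two measurement points.

    Only batches reported by both points with the same color are
    compared; a batch that the downstream point has not reported yet
    is skipped rather than counted as fully lost.
    """
    down = {b.batch_id: b for b in downstream}
    loss = {}
    for up in upstream:
        match = down.get(up.batch_id)
        if match is None or match.color != up.color:
            continue
        loss[up.batch_id] = up.packets - match.packets
    return loss


def counters_per_interval(n_points: int, m_flows: int) -> int:
    """Counters needed per interval: one per color, point, and flow."""
    return n_points * m_flows * 2
```

   For example, if the upstream point counts 980 packets in a batch
   and the downstream point counts 975 packets in the same batch,
   batch_loss reports 5 lost packets for that batch, and
   counters_per_interval(4, 10) gives the 80 counters implied by the
   n*m*2 estimate above.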

   In summary, an application can configure end-to-end network
   monitoring. If the network does not experience issues, this
   approximate monitoring is good enough and is very cheap in terms of
   network resources. However, in case of problems, the application
   becomes aware of the issues from this approximate monitoring and,
   in order to localize the portion of the network that has issues,
   configures the measurement points more exhaustively, so that a new,
   detailed monitoring is performed. After the problem is detected and
   resolved, the initial approximate monitoring can be used again.

A.3.2.  Dynamic Network Probe

   Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq]
   provides a programmable means to customize the data that an
   application collects from the data plane. A direct benefit of DNP
   is the reduction of the exported data. A full DNP solution covers
   several components including data source, data subscription, and
   data generation. The data subscription needs to define the complex
   data that can be composed and derived from the raw data sources.
   Data generation takes advantage of moderate in-network computing to
   produce the desired data.

   While DNP can introduce unprecedented flexibility to data plane
   telemetry, it also faces some challenges. It requires a flexible
   data plane that can be dynamically reprogrammed at run-time. The
   programming API is yet to be defined.

A.3.3.  IP Flow Information Export (IPFIX) protocol

   Traffic on a network can be seen as a set of flows passing through
   network elements. IP Flow Information Export (IPFIX) [RFC7011]
   provides a means of transmitting traffic flow information for
   administrative or other purposes.
   A typical IPFIX-enabled system includes a pool of Metering
   Processes that collect data packets at one or more Observation
   Points, optionally filter them, and aggregate information about
   these packets. An Exporter then gathers each of the Observation
   Points together into an Observation Domain and sends this
   information via the IPFIX protocol to a Collector.

A.3.4.  In-Situ OAM

   Traditional passive and active monitoring and measurement
   techniques are either inaccurate or resource-consuming. It is
   preferable to directly acquire data associated with a flow's
   packets when the packets pass through a network. In-situ OAM (IOAM)
   [I-D.brockners-inband-oam-requirements], a data generation
   technique, embeds a new instruction header in user packets, and the
   instruction directs the network nodes to add the requested data to
   the packets. Thus, at the path end, the packet's experience gained
   on the entire forwarding path can be collected. Such firsthand data
   is invaluable to many network OAM applications.

   However, IOAM also faces some challenges. The issues of performance
   impact, security, scalability and overhead limits, encapsulation
   difficulties in some protocols, and cross-domain deployment need to
   be addressed.

A.3.5.  Postcard Based Telemetry

   Postcard-Based Telemetry (PBT)
   [I-D.song-ippm-postcard-based-telemetry] is an alternative to IOAM.
   PBT directly exports data at each node through an independent
   packet. PBT solves several issues of IOAM. It can also help to
   identify the packet drop location in case a packet is dropped on
   its forwarding path.

A.4.  External Data and Event Telemetry

A.4.1.  Sources of External Events

   To ensure that the information provided by external event detectors
   and used by the network management solutions is meaningful for
   management purposes, the network telemetry framework must ensure
   that such detectors (sources) are easily connected to the
   management solutions (sinks). This requires specifying a simple
   taxonomy of detectors and matching it to the connectors and/or
   interfaces required to connect them.

   Once detectors are classified in such a taxonomy, their definitions
   are enlarged with the qualities and other aspects used to handle
   them and are represented in the ontology and information model
   (e.g., YANG). Therefore, differentiating several types of detectors
   as potential sources of external events is essential for the
   integrity of the management framework. We thus differentiate the
   following source types of external events:

   o  Smart objects and sensors. With the consolidation of the
      Internet of Things (IoT), any network system will have many
      smart objects attached to its physical surroundings and logical
      operation environments. Most of these objects will be
      essentially based on sensors of many kinds (e.g., temperature,
      humidity, presence), and the information they provide can be
      very useful for the management of the network, even when they
      are not specifically deployed for such purpose. Elements of this
      source type will usually provide a specific protocol for
      interaction, especially one of those protocols related to IoT,
      such as the Constrained Application Protocol (CoAP). It will be
      used by the telemetry framework to interact with the relevant
      objects.

   o  Online news reporters. Several online news services have the
      ability to provide an enormous quantity of information about
      different events occurring in the world.
      Some of those events can impact the network system managed by a
      specific framework, which will therefore be interested in
      getting such information. For instance, diverse security
      reports, such as the Common Vulnerabilities and Exposures (CVE),
      can be issued by the corresponding authority and used by the
      management solution to update the managed system if needed.
      Instead of a specific protocol and data format, the sources of
      this kind of information usually follow a relaxed but structured
      format. This format will be part of both the ontology and
      information model of the telemetry framework.

   o  Global event analyzers. The advance of Big Data analyzers
      provides a huge amount of information and, more interestingly,
      the identification of events detected by analyzing many data
      streams from different origins. In contrast with the other types
      of sources, which are focused on specific events, the detectors
      of this source type will detect very generic events. For
      example, when a sports event takes place and some unexpected
      turn makes it highly interesting, many people connect to the
      sites that are covering the event. The systems supporting the
      services that cover the event can be affected by such a
      situation, so their management solutions should be aware of it.
      In contrast with the other source types, a new information
      model, format, and reporting protocol are required to integrate
      the detectors of this type with the management solution.

   Additional detector types can be added to the system, but they will
   generally be the result of composing the properties offered by
   these main classes. In any case, future revisions of the network
   telemetry framework will include the required types that cover new
   circumstances and that cannot be obtained by composition.

A.4.2.  Connectors and Interfaces

   To allow external event detectors to be properly integrated with
   other management solutions, both elements must expose interfaces
   and protocols that are suited to their particular objective. Since
   external event detectors will be focused on providing their
   information to their main consumers, which generally will not be
   limited to the network management solutions, the framework must
   include the definition of the required connectors to ensure that
   the interconnection between detectors (sources) and their consumers
   within the management systems (sinks) is effective.

   In some situations, the interconnection between the external event
   detectors and the management system is via the management plane.
   For those situations, there will be a special connector that
   provides the typical interfaces found in most other elements
   connected to the management plane. For instance, the interfaces
   will comply with a specific information model (YANG) and a specific
   telemetry protocol, such as NETCONF, SNMP, or gRPC.

Authors' Addresses

   Haoyu Song (editor)
   Futurewei
   2330 Central Expressway
   Santa Clara
   USA

   Email: hsong@futurewei.com


   Fengwei Qin
   China Mobile
   No. 32 Xuanwumenxi Ave., Xicheng District
   Beijing, 100032
   P.R. China

   Email: qinfengwei@chinamobile.com


   Pedro Martinez-Julia
   NICT
   4-2-1, Nukui-Kitamachi
   Koganei, Tokyo 184-8795
   Japan

   Email: pedro@nict.go.jp


   Laurent Ciavaglia
   Nokia
   Villarceaux 91460
   France

   Email: laurent.ciavaglia@nokia.com


   Aijun Wang
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   P.R. China

   Email: wangaj.bri@chinatelecom.cn