OPSAWG                                                      H. Song, Ed.
Internet-Draft                                                    Huawei
Intended status: Informational                                     Z. Li
Expires: September 26, 2019                                 China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                                   Nokia
                                                                 A. Wang
                                                           China Telecom
                                                          March 25, 2019


                      Network Telemetry Framework
                          draft-opsawg-ntf-00

Abstract

   This document provides an architectural framework for network
   telemetry to address the current and future network operation
   challenges and requirements. As evidenced by the defining
   characteristics and industry practice, network telemetry covers
   technologies and protocols beyond the conventional network
   Operations, Administration, and Maintenance (OAM). Network telemetry
   promises better flexibility, scalability, accuracy, coverage, and
   performance and allows automated control loops to suit both today's
   and tomorrow's network operation requirements. This document
   clarifies the terminology and classifies the modules and components
   of a network telemetry system. The framework and taxonomy help to
   set a common ground for the collection of related work and provide
   guidance for future technique and standard developments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 26, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Motivation
     2.1.  Use Cases
     2.2.  Challenges
     2.3.  Glossary
     2.4.  Network Telemetry
   3.  The Necessity of a Network Telemetry Framework
   4.  Network Telemetry Framework
     4.1.  Data Acquiring Mechanisms
     4.2.  Data Objects
     4.3.  Function Components
     4.4.  Existing Works Mapped in the Framework
   5.  Evolution of Network Telemetry
   6.  Security Considerations
   7.  IANA Considerations
   8.  Contributors
   9.  Acknowledgments
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Appendix A.  A Survey on Existing Network Telemetry Techniques
     A.1.  Management Plane Telemetry
       A.1.1.  Requirements and Challenges
       A.1.2.  Push Extensions for NETCONF
       A.1.3.  gRPC Network Management Interface
     A.2.  Control Plane Telemetry
       A.2.1.  Requirements and Challenges
       A.2.2.  BGP Monitoring Protocol
     A.3.  Data Plane Telemetry
       A.3.1.  Requirements and Challenges
       A.3.2.  Technique Taxonomy
       A.3.3.  The IPFPM technology
       A.3.4.  Dynamic Network Probe
       A.3.5.  IP Flow Information Export (IPFIX) protocol
       A.3.6.  In-Situ OAM
       A.3.7.  Postcard Based Telemetry
     A.4.  External Data and Event Telemetry
       A.4.1.  Requirements and Challenges
       A.4.2.  Sources of External Events
       A.4.3.  Connectors and Interfaces
   Authors' Addresses

1.  Introduction

   Network visibility is essential for network operation. Network
   telemetry has been widely considered as an ideal means to gain
   sufficient network visibility with better flexibility, scalability,
   accuracy, coverage, and performance than conventional OAM
   technologies.
   However, confusion and misunderstandings about network telemetry
   remain (e.g., about the scope and coverage of the term). We need an
   unambiguous concept and a clear architectural framework for network
   telemetry so we can better align the related technology and standard
   work.

   First, we show some key characteristics of network telemetry which
   set a clear distinction from the conventional network OAM and show
   that some conventional OAM technologies can be considered a subset of
   the network telemetry technologies. We then provide an architectural
   framework for network telemetry to meet the current and future
   network operation requirements. Following the framework, we classify
   the components of a network telemetry system so we can easily map the
   existing and emerging techniques and protocols into the framework.
   Finally, we outline a roadmap for the evolution of the network
   telemetry system.

   The purpose of the framework and taxonomy is to set a common ground
   for the collection of related work and provide guidance for future
   technique and standard developments.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119][RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Motivation

   Thanks to the advance of the computing and storage technologies,
   today's big data analytics and machine learning-based Artificial
   Intelligence (AI) give network operators an unprecedented opportunity
   to gain network insights and move towards network autonomy. Software
   tools can use the network data to detect and react to network faults,
   anomalies, and policy violations, as well as to predict future
   events.
   In turn, network policy updates for planning, intrusion
   prevention, optimization, and self-healing may be applied.

   It is conceivable that an intent-driven autonomous network is the
   logical next step for network evolution following Software-Defined
   Networking (SDN), aiming to reduce (or even eliminate) human labor,
   make the most efficient use of network resources, and provide better
   services more aligned with customer requirements. Although it takes
   time to reach the ultimate goal, the journey has started
   nevertheless.

   However, the system bottleneck is shifting from data consumption to
   data supply. Both the number of network nodes and the traffic
   bandwidth keep increasing at a fast pace. The network configuration
   and policy change on much shorter timescales than ever before. More
   subtle events and fine-grained data through all network planes need
   to be captured and exported in real time. In a nutshell, it is a
   challenge to get enough high-quality data out of the network in an
   efficient, timely, and flexible manner. Therefore, we need to
   examine the existing network technologies and protocols, and
   identify any potential technique and standard gaps based on the real
   network and device architectures.

   In the remainder of this section, we first discuss several key use
   cases for today's and future network operations. Next, we show why
   the current network OAM techniques and protocols are insufficient for
   these use cases. The discussion underlines the need for new methods,
   techniques, and protocols, which we group under the umbrella term
   "network telemetry".

2.1.  Use Cases

   The following use cases are essential for network operations. While
   the list is by no means exhaustive, it is enough to highlight the
   requirements for data velocity, variety, and volume in networks.

   Policy and Intent Compliance:  Network policies are the rules that
      constrain the services for network access, provide service
      differentiation, or enforce specific treatment on the traffic.
      For example, a service function chain is a policy that requires
      the selected flows to pass through a set of ordered network
      functions. An intent is a high-level abstract policy which
      requires a complex translation and mapping process before being
      applied on networks. While a policy is enforced, the compliance
      needs to be verified and monitored continuously.

   SLA Compliance:  A Service-Level Agreement (SLA) defines the level of
      service a user expects from a network operator, which includes the
      metrics for the service measurement and remedy/penalty procedures
      when the service level misses the agreement. Users need to check
      if they get the service as promised and network operators need to
      evaluate how they can deliver the services that can meet the SLA.

   Root Cause Analysis:  Any network failure can be the cause or effect
      of a sequence of chained events. Troubleshooting and recovery
      require quick identification of the root cause of any observable
      issues. However, the root cause is not always straightforward to
      identify, especially when the failure is sporadic and the related
      and unrelated events are overwhelming. While machine learning
      technologies can be used for root cause analysis, it is up to the
      network to sense and provide all the relevant data.

   Network Optimization:  This covers all short-term and long-term
      network optimization techniques, including load balancing, Traffic
      Engineering (TE), and network planning. Network operators are
      motivated to optimize their network utilization and differentiate
      services for better ROI or lower CAPEX. The first step is to know
      the real-time network conditions before applying policies for
      traffic manipulation.
      In some cases, micro-bursts need to be
      detected in a very short time frame so that fine-grained traffic
      control can be applied to avoid network congestion. The long-term
      network capacity planning and topology augmentation also rely on
      the accumulated data of the network operations.

   Event Tracking and Prediction:  The visibility of user traffic path
      and performance is critical for healthy network operation.
      Numerous related network events are of interest to network
      operators. For example, network operators always want to learn
      where and why packets are dropped for an application flow. They
      also want to be warned of issues in advance so proactive actions
      can be taken to avoid catastrophic consequences.

2.2.  Challenges

   For a long time, network operators have relied upon SNMP [RFC3416],
   Command-Line Interface (CLI), or Syslog to monitor the network. Some
   other OAM techniques as described in [RFC7276] are also used to
   facilitate network troubleshooting. These conventional techniques
   are not sufficient to support the above use cases for the following
   reasons:

   o  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real time and
      interactively. Poll-based low-frequency data collection is
      ill-suited for these applications. Subscription-based streaming
      data directly pushed from the data source (e.g., the forwarding
      chip) is preferred to provide enough data quantity and precision
      at scale.

   o  Comprehensive data is needed from packet processing engine to
      traffic manager, from line cards to main control board, from user
      flows to control protocol packets, from device configurations to
      operations, and from physical layer to application layer.
      Conventional OAM only covers a narrow range of data (e.g., SNMP
      only handles data from the Management Information Base (MIB)).
      Traditional network devices cannot provide all the necessary
      probes. An open and programmable network device is therefore
      needed.

   o  Many application scenarios need to correlate data from multiple
      sources (i.e., from distributed network devices, different
      components of a network device, or different network planes). A
      piecemeal solution often lacks the capability to consolidate
      the data from multiple sources. The composition of a complete
      solution, as partly proposed by Autonomic Resource Control
      Architecture (ARCA) [I-D.pedro-nmrg-anticipated-adaptation], will
      be empowered and guided by a comprehensive framework.

   o  Some of the conventional OAM techniques (e.g., CLI and Syslog)
      lack a formal data model. Unstructured data hinders tool
      automation and application extensibility. Standardized data
      models are essential to support programmable networks.

   o  Although some conventional OAM techniques support data push (e.g.,
      SNMP Trap [RFC2981][RFC3877], Syslog, and sFlow), the pushed data
      are limited to only predefined management plane warnings (e.g.,
      SNMP Trap) or sampled user packets (e.g., sFlow). We require
      data with arbitrary source, granularity, and precision, which is
      beyond the capability of the existing techniques.

   o  The conventional passive measurement techniques can either consume
      too many network resources and render too much redundant data, or
      lead to inaccurate results; the conventional active measurement
      techniques can interfere with the user traffic and their results
      are indirect. We need techniques that can collect direct and
      on-demand data from user traffic.

2.3.  Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document. We make an intentional distinction between
   network telemetry and network OAM.

   AI:  Artificial Intelligence.
      Use of machine-learning-based
      technologies to automate network operation.

   BMP:  BGP Monitoring Protocol

   DNP:  Dynamic Network Probe

   DPI:  Deep Packet Inspection

   gNMI:  gRPC Network Management Interface

   gRPC:  gRPC Remote Procedure Call

   IDN:  Intent-Driven Network

   IPFIX:  IP Flow Information Export Protocol

   IPFPM:  IP Flow Performance Measurement

   IOAM:  In-situ OAM

   NETCONF:  Network Configuration Protocol

   Network Telemetry:  Acquiring network data remotely for network
      monitoring and operation. A general term for a large set of
      network visibility techniques and protocols, with the
      characteristics defined in this document. Network telemetry
      addresses the current network operation issues and enables smooth
      evolution toward intent-driven autonomous networks.

   NMS:  Network Management System

   OAM:  Operations, Administration, and Maintenance. A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions. Most conventional network monitoring
      techniques and protocols belong to network OAM.

   SNMP:  Simple Network Management Protocol

   YANG:  A data modeling language for NETCONF

   YANG FSM:  A YANG model to define a device-side finite state machine

   YANG PUSH:  A method to subscribe to data pushed from a remote YANG
      datastore

2.4.  Network Telemetry

   Network telemetry has emerged as a mainstream technical term to refer
   to the newer data collection and consumption techniques,
   distinguishing itself from the conventional techniques for network
   OAM. The representative techniques and protocols include IPFIX
   [RFC7011] and gRPC [I-D.kumar-rtgwg-grpc-protocol]. Network
   telemetry allows separate entities to acquire data from network
   devices so that data can be visualized and analyzed to support
   network monitoring and operation.
   Network telemetry overlaps with conventional network OAM but has a
   wider scope. It is expected that network telemetry can provide the
   necessary network insight for autonomous networks and address the
   shortcomings of conventional OAM techniques.

   One difference between network telemetry and network OAM is that
   network telemetry assumes machines as the data consumers, while
   conventional network OAM usually assumes human operators. Hence,
   network telemetry can directly trigger automated network operation,
   whereas conventional OAM tools only help human operators to monitor
   and diagnose the networks and guide manual network operations. This
   difference leads to very different techniques.

   Although the network telemetry techniques are just emerging and
   subject to continuous evolution, several characteristics of network
   telemetry have been well accepted (note that network telemetry is
   intended to be an umbrella term covering a wide spectrum of
   techniques, so the following characteristics are not expected to be
   held by every specific technique):

   o  Push and Streaming: Instead of polling data from network devices,
      the telemetry collector subscribes to the streaming data pushed
      from data sources in network devices.

   o  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by humans. Therefore, the data volume is
      huge and the processing is often in real time.

   o  Normalization and Unification: Telemetry aims to address the
      overall network automation needs. The piecemeal solutions offered
      by the conventional OAM approach are no longer suitable. Efforts
      need to be made to normalize the data representation and unify the
      protocols.

   o  Model-based: The telemetry data is modeled in advance, which
      allows applications to configure and consume data with ease.

   o  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross-domain, cross-device, and
      cross-layer) and needs to be correlated to take effect.

   o  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.

   Note that a technique does not need to have all the above
   characteristics to be qualified as telemetry. An ideal network
   telemetry solution may also have the following features or
   properties:

   o  In-Network Customization: The data can be customized in the
      network at run-time to cater to the specific needs of
      applications. This needs the support of a programmable data plane
      which allows probes to be deployed at flexible locations.

   o  Direct Data Plane Export: The data originating from the data plane
      can be directly exported to the data consumer for efficiency,
      especially when the data bandwidth is large and real-time
      processing is required.

   o  In-band Data Collection: In addition to the passive and active
      data collection approaches, the new hybrid approach allows data to
      be collected directly for any target flow on its entire forwarding
      path.

   o  Non-intrusive: The telemetry system should avoid the pitfall of
      the "observer effect". That is, it should not change the network
      behavior or affect the forwarding performance.

3.  The Necessity of a Network Telemetry Framework

   Big data analytics and machine-learning based AI technologies are
   applied for network operation automation, relying on abundant data
   from networks. The single-sourced and static data acquisition cannot
   meet the data requirements. It is desirable to have a framework that
   integrates multiple telemetry approaches from different layers.
   This allows flexible combinations for different applications. The
   framework would benefit application development for the following
   reasons:

   o  The future autonomous networks will require a holistic view of
      network visibility. All the use cases and applications need to be
      supported uniformly and coherently under a single intelligent
      agent. Therefore, the protocols and mechanisms should be
      consolidated into a minimum yet comprehensive set. A telemetry
      framework can help to normalize the technique developments.

   o  Network visibility presents multiple viewpoints. For example, the
      device viewpoint takes the network infrastructure as the
      monitoring object from which the network topology and device
      status can be acquired; the traffic viewpoint takes the flows or
      packets as the monitoring object from which the traffic quality
      and path can be acquired. An application may need to switch its
      viewpoint during operation. It may also need to correlate a
      service with its impact on network experience to acquire
      comprehensive information.

   o  Applications require network telemetry to be elastic in order to
      efficiently use the network resource and reduce the performance
      impact. Routine network monitoring covers the entire network with
      a low data sampling rate. When issues arise or trends emerge, the
      telemetry data source can be modified and the data rate can be
      boosted.

   o  Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

   So far, some telemetry related work has been done within the IETF.
   However, the work is fragmented and scattered in different working
   groups. The lack of coherence makes it difficult to assemble a
   comprehensive network telemetry system and causes repetitive and
   redundant work.

   A formal network telemetry framework is needed for constructing a
   working system.
   The framework should cover the concepts and
   components from the standardization perspective. This document
   clarifies the layers at which telemetry is applied and decomposes
   the telemetry system into a set of distinct components that the
   existing and future work can easily map to.

4.  Network Telemetry Framework

   Network telemetry techniques can be classified from multiple
   dimensions. In this document, we provide three perspectives: data
   acquiring mechanisms, data objects, and function components.

4.1.  Data Acquiring Mechanisms

   Broadly speaking, network data can be acquired through subscription
   (push) and query (poll). A subscriber asks for data to be delivered
   when it is ready. Subscription follows a Publish-Subscription
   (Pub-Sub) mode or a Subscription-Publish (Sub-Pub) mode. In the
   Pub-Sub mode, pre-defined data are published and multiple qualified
   subscribers can subscribe to the data. In the Sub-Pub mode, a
   subscriber designates what data are of interest and requests that the
   network devices deliver the data when they are available.

   In contrast, a querier expects immediate feedback from network
   devices. Query is usually used in a more interactive environment.
   The queried data may be directly extracted from some specific data
   source, or synthesized and processed from raw data.

   There are four types of data from network devices:

   Simple Data:  The data that are steadily available from some data
      store or static probes in network devices. Such data can be
      specified by a YANG model.

   Custom Data:  The data that need to be synthesized or processed from
      raw data from one or more network devices. The data processing
      function can be statically or dynamically loaded into network
      devices.

   Event-triggered Data:  The data are conditionally acquired based on
      the occurrence of some event. An event can be modeled as a Finite
      State Machine (FSM).

   Streaming Data:  The data are continuously or periodically generated.
      They can be time series or the dump of databases. The streaming
      data reflect real-time network states and metrics and require
      large bandwidth and processing power.

   The above data types are not mutually exclusive. For example, event-
   triggered data can be simple or custom, and streaming data can be
   event triggered. The relationships of these data types are
   illustrated in Figure 1.

                     +--------------------------+
                     | +----------------------+ |
                     | | +-----------------+  | |
                     | | | +-------------+ |  | |
                     | | | | Simple Data | |  | |
                     | | | +-------------+ |  | |
                     | | |   Custom Data   |  | |
                     | | +-----------------+  | |
                     | | Event-triggered Data | |
                     | +----------------------+ |
                     |      Streaming Data      |
                     +--------------------------+

                    Figure 1: Data Type Relationship

   Subscription usually deals with event-triggered data and streaming
   data, and query usually deals with simple data and custom data. It
   is easy to see that conventional OAM techniques are mostly about
   querying simple data only. While these techniques are still useful,
   advanced network telemetry techniques pay more attention to the other
   three data types, and prefer event/streaming data subscription and
   custom data query over simple data query.

4.2.  Data Objects

   Telemetry can be applied on the forwarding plane, the control plane,
   and the management plane in a network, as well as other sources out
   of the network, as shown in Figure 2. Therefore, we categorize the
   network telemetry into four distinct modules.
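
   One way to picture how the four modules combine with the data types
   of Section 4.1 is a small descriptor per telemetry item. The sketch
   below is illustrative only: the names Module, DataType, Mechanism,
   TelemetryItem, and usual_mechanism, as well as the three sample
   items, are assumptions introduced here and are not defined by this
   document. The mapping encodes only the "usually" rule stated in
   Section 4.1, so real techniques may deviate from it.

```python
from dataclasses import dataclass
from enum import Enum


class Module(Enum):
    """The four telemetry modules named in Section 4.2."""
    CONTROL_PLANE = "control plane"
    MANAGEMENT_PLANE = "management plane"
    FORWARDING_PLANE = "forwarding plane"
    EXTERNAL = "external data and event"


class DataType(Enum):
    """The four data types listed in Section 4.1."""
    SIMPLE = "simple"
    CUSTOM = "custom"
    EVENT_TRIGGERED = "event-triggered"
    STREAMING = "streaming"


class Mechanism(Enum):
    QUERY = "query (poll)"
    SUBSCRIPTION = "subscription (push)"


def usual_mechanism(data_type: DataType) -> Mechanism:
    """Subscription usually serves event-triggered and streaming data;
    query usually serves simple and custom data (Section 4.1)."""
    if data_type in (DataType.EVENT_TRIGGERED, DataType.STREAMING):
        return Mechanism.SUBSCRIPTION
    return Mechanism.QUERY


@dataclass(frozen=True)
class TelemetryItem:
    name: str            # hypothetical label, for illustration only
    module: Module
    data_type: DataType

    @property
    def mechanism(self) -> Mechanism:
        return usual_mechanism(self.data_type)


# Hypothetical examples; the objects echo the "Object" row of the
# module comparison table in this section.
items = [
    TelemetryItem("MIB counter", Module.MANAGEMENT_PLANE, DataType.SIMPLE),
    TelemetryItem("queue statistics", Module.FORWARDING_PLANE,
                  DataType.STREAMING),
    TelemetryItem("RIB change", Module.CONTROL_PLANE,
                  DataType.EVENT_TRIGGERED),
]

for item in items:
    print(f"{item.name}: {item.module.value}, {item.mechanism.value}")
```

   Keeping the module and the data type as separate fields reflects
   that the two classifications are independent perspectives on the
   same item.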

      +------------------------------+
      |                              |
      |       Network Operation      |<-------+
      |         Applications         |        |
      |                              |        |
      +------------------------------+        |
           ^      ^           ^               |
           |      |           |               |
           V      |           V               V
      +-----------|---+--------------+  +-----------+
      |           |   |              |  |           |
      | Control Pl|ane|              |  | External  |
      | Telemetry | <--->            |  | Data and  |
      |           |   |              |  | Event     |
      |      ^    V   |  Management  |  | Telemetry |
      +------|--------+  Plane       |  |           |
      |      V        |  Telemetry   |  +-----------+
      | Forwarding    |              |
      | Plane       <--->            |
      | Telemetry     |              |
      |               |              |
      +---------------+--------------+

      Figure 2: Layer Category of the Network Telemetry Framework

   The rationale for this partition lies in the different telemetry
   data objects, which result in different data sources and export
   locations.  Such differences have profound implications for
   in-network data programming and processing capability, data encoding
   and transport protocol, and data bandwidth and latency.

   We summarize the major differences of the four modules in the
   following table.  Representative techniques are shown in some table
   cells to highlight the technical diversity of these modules.

   +---------+--------------+--------------+--------------+-----------+
   | Module  | Control      | Management   | Forwarding   | External  |
   |         | Plane        | Plane        | Plane        | Data      |
   +---------+--------------+--------------+--------------+-----------+
   |Object   | control      | config. &    | flow & packet| terminal, |
   |         | protocol &   | operation    | QoS, traffic | social &  |
   |         | signaling,   | state, MIB   | stat., buffer| environ-  |
   |         | RIB, ACL     |              | & queue stat.| mental    |
   +---------+--------------+--------------+--------------+-----------+
   |Export   | main control | main control | fwding chip  | various   |
   |Location | CPU,         | CPU          | or linecard  |           |
   |         | linecard CPU |              | CPU; main    |           |
   |         | or fwding    |              | control CPU  |           |
   |         | chip         |              | unlikely     |           |
   +---------+--------------+--------------+--------------+-----------+
   |Model    | YANG,        | MIB, syslog, | template,    | YANG      |
   |         | custom       | YANG,        | YANG,        |           |
   |         |              | custom       | custom       |           |
   +---------+--------------+--------------+--------------+-----------+
   |Encoding | GPB, JSON,   | GPB, JSON,   | plain        | GPB, JSON,|
   |         | XML, plain   | XML          |              | XML, plain|
   +---------+--------------+--------------+--------------+-----------+
   |Protocol | gRPC,NETCONF,| gRPC,NETCONF,| IPFIX, mirror| gRPC      |
   |         | IPFIX,mirror |              |              |           |
   +---------+--------------+--------------+--------------+-----------+
   |Transport| HTTP, TCP,   | HTTP, TCP    | UDP          | TCP, UDP  |
   |         | UDP          |              |              |           |
   +---------+--------------+--------------+--------------+-----------+

          Figure 3: Major Differences of the Four Telemetry Modules

   Note that the interaction with the network operation applications
   can be indirect.  For example, in management plane telemetry, the
   management plane may need to acquire data from the data plane.  Some
   operational states, such as interface status and statistics, can
   only be derived from the data plane.  As another example, control
   plane telemetry may need to access the FIB in the data plane.  On
   the other hand, an application may involve more than one plane
   simultaneously.  For example, an SLA compliance application may
   require both data plane telemetry and control plane telemetry.

4.3.  Function Components

   At each plane, the telemetry can be further partitioned into five
   distinct components:

   Data Query, Analysis, and Storage:  This component works at the
      application layer.  On the one hand, it is responsible for
      issuing data queries.  The queries can be for modeled data
      through configuration or for custom data through programming.
      The queries can be one-shot or subscriptions for events or
      streaming data.  On the other hand, it receives, stores, and
      processes the data returned from network devices.  Data analysis
      can be interactive and initiate further data queries.

   Data Configuration and Subscription:  This component deploys data
      queries on devices.  It determines the protocol and channel for
      applications to acquire the desired data.  This component is also
      responsible for configuring the desired data that might not be
      directly available from data sources.  The subscription data can
      be described by models, templates, or programs.

   Data Encoding and Export:  This component determines how telemetry
      data are delivered to the data analysis and storage component.
      The data encoding and the transport protocol may vary with the
      data export location.

   Data Generation and Processing:  The requested data needs to be
      captured, processed, and formatted in network devices from raw
      data sources.  This may involve in-network computing and
      processing on either the fast path or the slow path in network
      devices.

   Data Object and Source:  This component determines the monitoring
      object and the original data source.  The data source usually
      provides only raw data, which needs further processing.  A data
      source can be considered a probe.  A probe can be statically or
      dynamically installed.
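
   As a rough illustration of the Data Query, Analysis, and Storage
   component described above, the query kinds (modeled or custom data,
   one-shot or subscription) could be represented as follows.  This is
   a minimal Python sketch under stated assumptions: the names
   TelemetryQuery, DataKind, and Delivery are hypothetical and are not
   defined by this framework or by any protocol it lists.

```python
from dataclasses import dataclass
from enum import Enum


class DataKind(Enum):
    """How the requested data is described (hypothetical names)."""
    MODELED = "modeled"   # modeled data, requested through configuration
    CUSTOM = "custom"     # custom data, requested through programming


class Delivery(Enum):
    """How the result is returned to the application."""
    ONE_SHOT = "one-shot"
    EVENT = "event subscription"
    STREAMING = "streaming subscription"


@dataclass(frozen=True)
class TelemetryQuery:
    """A single data query issued by a network operation application."""
    target: str           # free-form description of the data object
    kind: DataKind
    delivery: Delivery

    def is_subscription(self) -> bool:
        # Anything that is not a one-shot query keeps delivering data.
        return self.delivery is not Delivery.ONE_SHOT


query = TelemetryQuery("interface counters", DataKind.MODELED,
                       Delivery.STREAMING)
print(query.is_subscription())
```

   Keeping the data kind and the delivery mode as separate fields
   mirrors the text, which treats them as independent choices.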

            +----------------------------------------+
            |                                        |
            |    Data Query, Analysis, & Storage     |
            |                                        |
            +----------------------------------------+
                    |                      ^
                    |                      |
                    V                      |
            +---------------------+------------------+
            | Data Configuration  |                  |
            | & Subscription      | Data Encoding    |
            | (model, template,   | & Export         |
            | & program)          |                  |
            +---------------------+------------------+
            |                                        |
            |          Data Generation               |
            |          & Processing                  |
            |                                        |
            +----------------------------------------+
            |                                        |
            |          Data Object and Source        |
            |                                        |
            +----------------------------------------+

        Figure 4: Components in the Network Telemetry Framework

   Since most existing standards-related work belongs to the first four
   components, the remainder of the document focuses on these
   components only.

4.4.  Existing Works Mapped in the Framework

   The following two tables provide a non-exhaustive list of existing
   work (mainly published in the IETF, with an emphasis on the latest
   technologies) and show its position in the framework.  Details
   about the mentioned work can be found in Appendix A.

         +-----------------+---------------+----------------+
         |                 | Query         | Subscription   |
         |                 |               |                |
         +-----------------+---------------+----------------+
         | Simple Data     | SNMP, NETCONF,|                |
         |                 | YANG, BMP,    |                |
         |                 | IOAM, PBT     |                |
         +-----------------+---------------+----------------+
         | Custom Data     | DNP, YANG FSM |                |
         |                 | gRPC, NETCONF |                |
         +-----------------+---------------+----------------+
         | Event-triggered |               | gRPC, NETCONF, |
         | Data            |               | YANG PUSH, DNP |
         |                 |               | IOAM, PBT,     |
         |                 |               | YANG FSM       |
         +-----------------+---------------+----------------+
         | Streaming Data  |               | gRPC, NETCONF, |
         |                 |               | IOAM, PBT, DNP |
         |                 |               | IPFIX, IPFPM   |
         +-----------------+---------------+----------------+

                   Figure 5: Existing Work Mapping I

   +--------------+---------------+----------------+---------------+
   |              | Management    | Control        | Forwarding    |
   |              | Plane         | Plane          | Plane         |
   +--------------+---------------+----------------+---------------+
   | data Config. | gRPC, NETCONF,| NETCONF/YANG   | NETCONF/YANG, |
   | & subscrib.  | YANG PUSH     |                | YANG FSM      |
   +--------------+---------------+----------------+---------------+
   | data gen. &  | DNP,          | DNP,           | In-situ OAM,  |
   | processing   | YANG          | YANG           | PBT, IPFPM,   |
   |              |               |                | DNP           |
   +--------------+---------------+----------------+---------------+
   | data         | gRPC, NETCONF | BMP, NETCONF   | IPFIX         |
   | export       | YANG PUSH     |                |               |
   +--------------+---------------+----------------+---------------+

                   Figure 6: Existing Work Mapping II

5.  Evolution of Network Telemetry

   As networks evolve toward automated operation, network telemetry
   also undergoes several levels of evolution.

   Level 0 - Static Telemetry:  The telemetry data is determined at
      design time.  The network operator can only configure how to use
      it, with limited flexibility.

   Level 1 - Dynamic Telemetry:  The telemetry data can be dynamically
      programmed or configured at runtime, allowing a tradeoff among
      resources, performance, flexibility, and coverage.  DNP is an
      effort in this direction.

   Level 2 - Interactive Telemetry:  The network operator can
      continuously customize the telemetry data in real time to reflect
      the visibility requirements of network operation.  At this
      level, some tasks can be automated, although human operators
      still need to remain in the loop to make decisions.

   Level 3 - Closed-loop Telemetry:  Human operators are completely
      excluded from the control loop.  The intelligent network
      operation engine automatically issues the telemetry data
      requests, analyzes the data, and updates the network operations
      in closed control loops.

   While most of the existing technologies belong to level 0 and level
   1, a clearly defined network telemetry framework allows us to
   assemble these technologies to support level 2 and to make solid
   steps towards level 3.

6.  Security Considerations

   This document proposes a framework for network telemetry, and the
   telemetry mechanisms discussed differ from conventional network OAM
   in both message frequency and traffic amount.  New security
   considerations may therefore arise.  A number of techniques already
   exist for securing the data plane, the control plane, and the
   management plane in a network, but it is important to consider
   whether any new threat vectors are enabled by the use of network
   telemetry procedures and mechanisms.

   Security considerations for networks that use telemetry methods may
   include:

   o  Telemetry framework trust and policy model;

   o  Role management and access control for enabling and disabling
      telemetry capabilities;

   o  Protocol transport used for telemetry data and its inherent
      security capabilities;

   o  Telemetry data stores, storage encryption, and methods of access;

   o  Tracking telemetry events and any abnormalities that might
      identify malicious attacks using telemetry interfaces.

   Some of the security considerations highlighted above may be
   minimized or negated with policy management of network telemetry.
   In a network telemetry deployment, it would be advantageous to
   separate telemetry capabilities into different classes of policies,
   i.e., Role Based Access Control and Event-Condition-Action policies.
   Also, potential conflicts between network telemetry mechanisms must
   be detected accurately and resolved quickly to prevent unnecessary
   network telemetry traffic propagation from escalating into an
   unintended or intended denial-of-service attack.

   This security section and a subsequent policy section require
   further discussion and are expected to be developed further.

7.  IANA Considerations

   This document includes no request to IANA.

8.  Contributors

   The other major contributors to this document are listed as
   follows.

   o  Tianran Zhou

   o  Zhenbin Li

   o  Daniel King

9.  Acknowledgments

   We would like to thank Adrian Farrel, Randy Presuhn, Victor Liu,
   James Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz
   Yegani, Young Lee, Alexander Clemm, Joe Clarke, and many others who
   have provided helpful comments and suggestions to improve this
   document.

10.  References

10.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

10.2.  Informative References

   [I-D.brockners-inband-oam-requirements]
              Brockners, F., Bhandari, S., Dara, S., Pignataro, C.,
              Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi,
              T., Lapukhov, P., and r. remy@barefootnetworks.com,
              "Requirements for In-situ OAM",
              draft-brockners-inband-oam-requirements-03 (work in
              progress), March 2017.

   [I-D.fioccola-ippm-multipoint-alt-mark]
              Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto,
              "Multipoint Alternate Marking method for passive and
              hybrid performance monitoring",
              draft-fioccola-ippm-multipoint-alt-mark-04 (work in
              progress), June 2018.

   [I-D.ietf-grow-bmp-adj-rib-out]
              Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S.
              Zhuang, "Support for Adj-RIB-Out in BGP Monitoring
              Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-04 (work
              in progress), March 2019.

   [I-D.ietf-grow-bmp-local-rib]
              Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
              "Support for Local RIB in BGP Monitoring Protocol (BMP)",
              draft-ietf-grow-bmp-local-rib-03 (work in progress),
              March 2019.

   [I-D.ietf-netconf-udp-pub-channel]
              Zheng, G., Zhou, T., and A. Clemm, "UDP based Publication
              Channel for Streaming Telemetry",
              draft-ietf-netconf-udp-pub-channel-05 (work in progress),
              March 2019.

   [I-D.ietf-netconf-yang-push]
              Clemm, A., Voit, E., Prieto, A., Tripathy, A., Nilsen-
              Nygaard, E., Bierman, A., and B. Lengyel, "Subscription to
              YANG Datastores", draft-ietf-netconf-yang-push-22 (work in
              progress), February 2019.

   [I-D.kumar-rtgwg-grpc-protocol]
              Kumar, A., Kolhe, J., Ghemawat, S., and L.
              Ryan, "gRPC Protocol", draft-kumar-rtgwg-grpc-protocol-00
              (work in progress), July 2016.

   [I-D.openconfig-rtgwg-gnmi-spec]
              Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack,
              C., and C. Morrow, "gRPC Network Management Interface
              (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in
              progress), March 2018.

   [I-D.pedro-nmrg-anticipated-adaptation]
              Martinez-Julia, P., "Exploiting External Event Detectors
              to Anticipate Resource Requirements for the Elastic
              Adaptation of SDN/NFV Systems",
              draft-pedro-nmrg-anticipated-adaptation-02 (work in
              progress), June 2018.

   [I-D.song-ippm-postcard-based-telemetry]
              Song, H., Zhou, T., Li, Z., and J. Shin, "Postcard-based
              In-band Flow Data Telemetry",
              draft-song-ippm-postcard-based-telemetry-02 (work in
              progress), March 2019.

   [I-D.song-opsawg-dnp4iq]
              Song, H. and J. Gong, "Requirements for Interactive Query
              with Dynamic Network Probes", draft-song-opsawg-dnp4iq-01
              (work in progress), June 2017.

   [I-D.zhou-netconf-multi-stream-originators]
              Zhou, T., Zheng, G., Voit, E., Clemm, A., and A. Bierman,
              "Subscription to Multiple Stream Originators",
              draft-zhou-netconf-multi-stream-originators-04 (work in
              progress), March 2019.

   [RFC1157]  Case, J., Fedor, M., Schoffstall, M., and J. Davin,
              "Simple Network Management Protocol (SNMP)", RFC 1157,
              DOI 10.17487/RFC1157, May 1990.

   [RFC2981]  Kavasseri, R., Ed., "Event MIB", RFC 2981,
              DOI 10.17487/RFC2981, October 2000.

   [RFC3416]  Presuhn, R., Ed., "Version 2 of the Protocol Operations
              for the Simple Network Management Protocol (SNMP)",
              STD 62, RFC 3416, DOI 10.17487/RFC3416, December 2002.

   [RFC3877]  Chisholm, S. and D. Romascanu, "Alarm Management
              Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
              September 2004.

   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
              Zekauskas, "A One-way Active Measurement Protocol
              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
              RFC 5357, DOI 10.17487/RFC5357, October 2008.

   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed.,
              and A. Bierman, Ed., "Network Configuration Protocol
              (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013.

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014.

   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
              DOI 10.17487/RFC7540, May 2015.

   [RFC7799]  Morton, A., "Active and Passive Metrics and Methods (with
              Hybrid Types In-Between)", RFC 7799, DOI 10.17487/RFC7799,
              May 2016.

   [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
              Monitoring Protocol (BMP)", RFC 7854,
              DOI 10.17487/RFC7854, June 2016.

   [RFC8321]  Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli,
              L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi,
              "Alternate-Marking Method for Passive and Hybrid
              Performance Monitoring", RFC 8321, DOI 10.17487/RFC8321,
              January 2018.

Appendix A.  A Survey on Existing Network Telemetry Techniques

   We provide an overview of the challenges and existing solutions for
   each network telemetry module.

A.1.  Management Plane Telemetry

A.1.1.  Requirements and Challenges

   The management plane of the network element interacts with the
   Network Management System (NMS) and provides information such as
   performance data, network logging data, network warning and defect
   data, and network statistics and state data.  Some legacy protocols,
   such as SNMP and Syslog, are widely used for the management plane.
   However, these protocols are insufficient to meet the requirements
   of automatic network operation applications.

   New management plane telemetry protocols should consider the
   following requirements:

   Convenient Data Subscription:  An application should have the
      freedom to choose how data is exported, such as the data types
      and the export frequency.

   Structured Data:  For automatic network operation, machines will
      replace humans for network data comprehension.  Schema languages
      such as YANG can efficiently describe structured data and
      normalize data encoding and transformation.

   High Speed Data Transport:  In order to retain the information, a
      server needs to send a large amount of data at high frequency.
      Compact encoding formats are needed to compress the data and
      improve the data transport efficiency.  Replacing the poll mode
      with the push mode can also reduce the interactions between
      clients and servers, which helps to improve the server's
      efficiency.

A.1.2.  Push Extensions for NETCONF

   NETCONF [RFC6241] is a popular network management protocol, which
   is also recommended by the IETF.  Although it can be used for data
   collection, NETCONF is better suited to configuration.  YANG Push
   [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber
   applications to request a continuous, customized stream of updates
   from a YANG datastore.
   Providing such visibility into changes made to YANG configuration
   and operational objects enables new capabilities based on the
   remote mirroring of configuration and operational state.  Moreover,
   a distributed data collection mechanism
   [I-D.zhou-netconf-multi-stream-originators] via a UDP-based
   publication channel [I-D.ietf-netconf-udp-pub-channel] provides
   enhanced efficiency for NETCONF-based telemetry.

A.1.3.  gRPC Network Management Interface

   gRPC Network Management Interface (gNMI)
   [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol
   based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote
   Procedure Call) framework.  With a single gRPC service definition,
   both configuration and telemetry can be covered.  gRPC is an open
   source microservice communication framework based on HTTP/2
   [RFC7540].  It provides a number of capabilities that are well
   suited for network telemetry, including:

   o  A full-duplex streaming transport model combined with a binary
      encoding mechanism, which further improves telemetry efficiency.

   o  Consistent higher-level features across platforms, which common
      HTTP/2 libraries typically do not provide.  This characteristic
      is especially valuable because telemetry data collectors
      normally reside on a large variety of platforms.

   o  Built-in load-balancing and failover mechanisms.

A.2.  Control Plane Telemetry

A.2.1.  Requirements and Challenges

   Control plane telemetry refers to the health condition monitoring
   of different network protocols, covering Layer 2 to Layer 7.
   Keeping track of the running status of these protocols is
   beneficial for detecting, localizing, and even predicting various
   network issues, as well as for network optimization, in real time
   and at fine granularity.

   One of the most challenging problems for control plane telemetry is
   how to correlate the E2E Key Performance Indicators (KPIs) to a
   specific layer's KPIs.  For example, an IPTV user may describe the
   User Experience (UE) in terms of video fluency and definition.
   Then, in case of an unusually poor UE KPI or a service
   disconnection, it is non-trivial to delimit and localize the issue
   to the responsible protocol layer (e.g., the Transport Layer or the
   Network Layer), the responsible protocol (e.g., ISIS or BGP at the
   Network Layer), and finally the responsible device(s) with specific
   reasons.

   Traditional OAM-based approaches for control plane KPI measurement
   include Ping (L3), Traceroute (L3), Y.1731 (L2), and so on.  One
   common issue with these methods is that they only measure the KPIs
   instead of reflecting the actual running status of these protocols,
   making them less effective or efficient for control plane
   troubleshooting and network optimization.  An example of control
   plane telemetry is the BGP Monitoring Protocol (BMP).  It is
   currently used to monitor BGP routes and enables rich applications,
   such as BGP peer analysis, AS analysis, prefix analysis, security
   analysis, and so on.  However, the monitoring of other layers and
   protocols, as well as cross-layer and cross-protocol KPI
   correlation, is still in its infancy (e.g., IGP monitoring is
   missing) and requires substantial further research.

A.2.2.  BGP Monitoring Protocol

   BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP
   sessions and is intended to provide a convenient interface for
   obtaining route views.

   The BGP routing information is collected from the monitored
   device(s) to the BMP monitoring station by setting up the BMP TCP
   session.  The BGP peers are monitored by the BMP Peer Up and Peer
   Down Notifications.
   The BGP routes (including Adjacency_RIB_In [RFC7854],
   Adjacency_RIB_Out [I-D.ietf-grow-bmp-adj-rib-out], and Local_RIB
   [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
   Monitoring Message and the BMP Route Mirroring Message, in the form
   of both initial table dumps and real-time route updates.  In
   addition, BGP statistics are reported through the BMP Stats Report
   Message, which can be either timer-triggered or event-driven.  More
   BMP extensions can be explored to enrich the applications of BGP
   monitoring.

A.3.  Data Plane Telemetry

A.3.1.  Requirements and Challenges

   An effective data plane telemetry system relies on the data that
   the network device can expose.  The data's quality, quantity, and
   timeliness must meet stringent requirements.  This raises some
   challenges for the network data plane devices where the firsthand
   data originate.

   o  A data plane device's main function is user traffic processing
      and forwarding.  While supporting network visibility is
      important, telemetry is just an auxiliary function, and it
      should not impede normal traffic processing and forwarding
      (i.e., the performance is not lowered and the behavior is not
      altered due to the telemetry functions).

   o  Network operation applications require end-to-end visibility
      from various sources, which results in a huge volume of data.
      However, the sheer data quantity should not stress the network
      bandwidth, regardless of the data delivery approach (i.e.,
      through in-band or out-of-band channels).

   o  The data plane devices must provide timely data with the minimum
      possible delay.  Long processing, transport, storage, and
      analysis delays can impact the effectiveness of the control loop
      and even render the data useless.

   o  The data should be structured and labeled, and easy for
      applications to parse and consume.
      At the same time, the data types needed by applications can vary
      significantly.  The data plane devices need to provide enough
      flexibility and programmability to supply precisely the data
      that applications need.

   o  Data plane telemetry should support incremental deployment and
      work even if some devices are unaware of the system.  This
      challenge is highly relevant to standards and legacy networks.

   The industry has agreed that data plane programmability is
   essential to support network telemetry.  Newer data plane chips are
   equipped with advanced telemetry features and provide flexibility
   to support customized telemetry functions.

A.3.2.  Technique Taxonomy

   The data plane telemetry techniques can be classified along
   multiple dimensions.

   Active and Passive:  The active and passive methods (as well as the
      hybrid types) are well documented in [RFC7799].  The passive
      methods include tcpdump, IPFIX [RFC7011], sFlow, and traffic
      mirroring.  These methods usually have low data coverage, and
      improving the coverage comes at a very high bandwidth cost.  On
      the other hand, the active methods include Ping, Traceroute,
      OWAMP [RFC4656], and TWAMP [RFC5357].  These methods are
      intrusive and only provide indirect network measurement results.
      The hybrid methods, including in-situ OAM
      [I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and
      Multipoint Alternate Marking
      [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced
      and more flexible approach.  However, these methods are also
      more complex to implement.

   In-Band and Out-of-Band:  The telemetry data, before being exported
      to some collector, can be carried in user packets.  Such methods
      are considered in-band (e.g., in-situ OAM
      [I-D.brockners-inband-oam-requirements]).
      If the telemetry data is directly exported to some collector
      without modifying the user packets, such methods are considered
      out-of-band (e.g., postcard-based INT).  Hybrid methods are also
      possible.  For example, only the telemetry instruction or
      partial data is carried by user packets (e.g., IPFPM [RFC8321]).

   E2E and In-Network:  Some E2E methods start from and end at the
      network end hosts (e.g., Ping).  The other methods work in
      networks and are transparent to end hosts.  However, if needed,
      the in-network methods can be easily extended into end hosts.

   Flow, Path, and Node:  Depending on the telemetry objective, the
      methods can be flow-based (e.g., in-situ OAM
      [I-D.brockners-inband-oam-requirements]), path-based (e.g.,
      Traceroute), or node-based (e.g., IPFIX [RFC7011]).

A.3.3.  The IPFPM technology

   The Alternate Marking method efficiently performs packet loss,
   delay, and jitter measurements in both IP and overlay networks, as
   presented in IPFPM [RFC8321] and
   [I-D.fioccola-ippm-multipoint-alt-mark].

   This technique can be applied to point-to-point and
   multipoint-to-multipoint flows.  Alternate Marking creates batches
   of packets by alternating the value of 1 bit (or a label) of the
   packet header.  These batches of packets are unambiguously
   recognized over the network, and comparing the packet counters for
   each batch at different measurement points yields the packet loss.
   The same idea can be applied to delay measurement by selecting ad
   hoc packets with a marking bit dedicated to delay measurements.

   The Alternate Marking method needs two counters per marking period
   for each monitored flow.  For instance, with n measurement points
   and m monitored flows, the number of packet counters for each time
   interval is on the order of n*m*2 (one per color).
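
   The packet loss calculation and the counter estimate above can be
   sketched as follows.  This is a minimal Python illustration, assuming
   that each measurement point reports one packet counter per flow,
   marking period, and color; the names PeriodCount, packet_loss, and
   counters_needed are hypothetical and do not come from [RFC8321].

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PeriodCount:
    """Packets of one flow counted in one marking period (hypothetical)."""
    period: int    # index of the marking period
    color: int     # value of the marking bit in that period (0 or 1)
    packets: int   # packets observed carrying that color


def packet_loss(upstream: Iterable[PeriodCount],
                downstream: Iterable[PeriodCount]) -> Dict[int, int]:
    """Return lost packets per marking period between two points."""
    seen = {(c.period, c.color): c.packets for c in downstream}
    # A batch is identified by (period, color); a batch missing
    # downstream counts as fully lost.
    return {c.period: c.packets - seen.get((c.period, c.color), 0)
            for c in upstream}


def counters_needed(n_points: int, m_flows: int) -> int:
    """Counters per time interval: n * m * 2, one per color."""
    return n_points * m_flows * 2


up = [PeriodCount(0, 0, 100), PeriodCount(1, 1, 120)]
down = [PeriodCount(0, 0, 100), PeriodCount(1, 1, 117)]
print(packet_loss(up, down))     # {0: 0, 1: 3}
print(counters_needed(4, 10))    # 80
```

   Because consecutive batches carry different colors, the two
   measurement points can compare counters for a finished batch while
   the next batch is still in flight.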

   Since networks offer rich sets of network performance measurement
   data (e.g., packet counters), traditional approaches run into
   limitations.  One reason is that the bottleneck is the generation
   and export of the data and the amount of data that can reasonably
   be collected from the network.  In addition, management tasks
   related to determining and configuring which data to generate lead
   to significant deployment challenges.

   The Multipoint Alternate Marking approach, described in
   [I-D.fioccola-ippm-multipoint-alt-mark], aims to resolve this issue
   and makes performance monitoring more flexible when a detailed
   analysis is not needed.

   An application orchestrates network performance measurement tasks
   across the network to optimize monitoring, and it can calibrate how
   detailed the monitoring data obtained from the network is by
   configuring measurement points coarsely or finely.

   Using Alternate Marking, it is possible to monitor a multipoint
   network without examining it in depth by using network clustering
   (clusters are subnetworks that preserve the same property as the
   entire network).  If there is packet loss or the delay is too high,
   the filtering criteria can be made more specific in order to
   perform a detailed analysis by using a different combination of
   clusters, up to a per-flow measurement as described in IPFPM
   [RFC8321].

   In summary, an application can configure end-to-end network
   monitoring.  If the network does not experience issues, this
   approximate monitoring is good enough and is very cheap in terms of
   network resources.  However, in case of problems, the application
   becomes aware of the issues from this approximate monitoring and,
   in order to localize the portion of the network that has issues,
   configures the measurement points more exhaustively.
   A new, detailed monitoring round is then performed.  After the
   problem has been detected and resolved, the initial approximate
   monitoring can be used again.

A.3.4.  Dynamic Network Probe

   Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq]
   provides a programmable means to customize the data that an
   application collects from the data plane.  A direct benefit of DNP
   is the reduction of the exported data.  A full DNP solution covers
   several components, including data source, data subscription, and
   data generation.  The data subscription needs to define the custom
   data, which can be composed and derived from the raw data sources.
   The data generation takes advantage of moderate in-network
   computing to produce the desired data.

   While DNP can introduce unprecedented flexibility to data plane
   telemetry, it also faces some challenges.  It requires a flexible
   data plane that can be dynamically reprogrammed at runtime.  The
   programming API is yet to be defined.

A.3.5.  IP Flow Information Export (IPFIX) protocol

   Traffic on a network can be seen as a set of flows passing through
   network elements.  IP Flow Information Export (IPFIX) [RFC7011]
   provides a means of transmitting traffic flow information for
   administrative or other purposes.  A typical IPFIX-enabled system
   includes a pool of Metering Processes that collect data packets at
   one or more Observation Points, optionally filter them, and
   aggregate information about these packets.  An Exporter then
   gathers each of the Observation Points together into an Observation
   Domain and sends this information via the IPFIX protocol to a
   Collector.

A.3.6.  In-Situ OAM

   Traditional passive and active monitoring and measurement
   techniques are either inaccurate or resource-consuming.
   It is preferable to directly acquire data associated with a flow's
   packets when the packets pass through a network.  In-situ OAM
   (IOAM) [I-D.brockners-inband-oam-requirements], a data generation
   technique, embeds a new instruction header in user packets, and the
   instruction directs the network nodes to add the requested data to
   the packets.  Thus, at the path end, the packet's experience gained
   on the entire forwarding path can be collected.  Such firsthand
   data is invaluable to many network OAM applications.

   However, IOAM also faces some challenges.  Issues concerning
   performance impact, security, scalability and overhead limits,
   encapsulation difficulties in some protocols, and cross-domain
   deployment need to be addressed.

A.3.7.  Postcard Based Telemetry

   PBT [I-D.song-ippm-postcard-based-telemetry] is an alternative to
   IOAM.  PBT directly exports data at each node through an
   independent packet.  PBT solves several issues of IOAM.  It can
   also help to identify where a packet was dropped on its forwarding
   path.

A.4.  External Data and Event Telemetry

   Events that occur outside the boundaries of the network system are
   another important source of telemetry information.  Correlating
   both internal telemetry data and external events with the
   requirements of network systems, as presented in Exploiting
   External Event Detectors to Anticipate Resource Requirements for
   the Elastic Adaptation of SDN/NFV Systems
   [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and
   functional advantage to management operations.

A.4.1.  Requirements and Challenges

   As with other sources of telemetry information, the data and events
   must meet strict requirements, especially in terms of timeliness,
   which is essential to properly incorporate external event
   information into management cycles.
   Thus, the specific challenges are described as follows:

   o  The role of external event detector can be played by multiple
      elements, including hardware (e.g., physical sensors, such as
      seismometers) and software (e.g., Big Data sources that analyze
      streams of information, such as Twitter messages).  Thus, the
      transmitted data must support different shapes but, at the same
      time, follow a common but extensible ontology.

   o  Since the main function of the external event detectors is to
      perform the notifications, their timeliness is assumed.
      However, once messages have been dispatched, they must be
      quickly collected and inserted into the control plane with
      variable priority, which will be high for important sources
      and/or important events and low for secondary ones.

   o  The ontology used by external detectors must be easily adopted
      by current and future devices and applications.  Therefore, it
      must be easily mapped to current information models, for
      example, ones expressed in YANG.

   Organizing both internal and external telemetry information
   together will be key for the general exploitation of the management
   possibilities of current and future network systems, as reflected
   in the incorporation of cognitive capabilities into new hardware
   and software (virtual) elements.

A.4.2.  Sources of External Events

   To ensure that the information provided by external event detectors
   and used by the network management solutions is meaningful for
   management purposes, the network telemetry framework must ensure
   that such detectors (sources) are easily connected to the
   management solutions (sinks).  This requires specifying a simple
   taxonomy of detectors and matching it to the connectors and/or
   interfaces required to connect them.
   Once detectors are classified in such a taxonomy, their definitions
   are extended with the qualities and other aspects used to handle
   them and represented in the ontology and information model (e.g.,
   YANG). Therefore, differentiating several types of detectors as
   potential sources of external events is essential for the integrity
   of the management framework. We thus differentiate the following
   source types of external events:

   o  Smart objects and sensors. With the consolidation of the Internet
      of Things (IoT), any network system will have many smart objects
      attached to its physical surroundings and logical operation
      environments. Most of these objects will be essentially based on
      sensors of many kinds (e.g., temperature, humidity, presence),
      and the information they provide can be very useful for the
      management of the network, even when they are not specifically
      deployed for that purpose. Elements of this source type will
      usually provide a specific protocol for interaction, especially
      one of the protocols related to IoT, such as the Constrained
      Application Protocol (CoAP). That protocol will be used by the
      telemetry framework to interact with the relevant objects.

   o  Online news reporters. Several online news services can provide
      an enormous quantity of information about different events
      occurring in the world. Some of those events can impact the
      network system managed by a specific framework, which will
      therefore be interested in obtaining such information. For
      instance, diverse security reports, such as the Common
      Vulnerabilities and Exposures (CVE), can be issued by the
      corresponding authority and used by the management solution to
      update the managed system if needed. Instead of a specific
      protocol and data format, the sources of this kind of
      information usually follow a relaxed but structured format.
      This format will be part of both the ontology and information
      model of the telemetry framework.

   o  Global event analyzers. The advance of Big Data analyzers
      provides a huge amount of information and, more interestingly,
      the identification of events detected by analyzing many data
      streams from different origins. In contrast with the other types
      of sources, which are focused on specific events, the detectors
      of this source type will detect very generic events. For
      example, a sports event takes place, some unexpected movement
      makes it highly interesting, and many people connect to the
      sites that are covering the event. The systems supporting the
      services that cover the event can be affected by such a
      situation, so their management solutions should be aware of it.
      In contrast with the other source types, a new information
      model, format, and reporting protocol are required to integrate
      the detectors of this type with the management solution.

   Additional detector types can be added to the system, but they will
   generally be the result of composing the properties offered by these
   main classes. In any case, future revisions of the network telemetry
   framework will include the required types that cover new
   circumstances and that cannot be obtained by composition.

A.4.3. Connectors and Interfaces

   To allow external event detectors to be properly integrated with
   other management solutions, both elements must expose interfaces and
   protocols that are suited to their particular objectives.
   Since external event detectors will be focused on providing their
   information to their main consumers, which generally will not be
   limited to the network management solutions, the framework must
   include the definition of the connectors required to ensure that the
   interconnection between detectors (sources) and their consumers
   within the management systems (sinks) is effective.

   In some situations, the interconnection between the external event
   detectors and the management system is via the management plane. For
   those situations, there will be a special connector that provides
   the typical interfaces found in most other elements connected to the
   management plane. For instance, the interfaces will comply with a
   specific information model (YANG) and a specific telemetry protocol,
   such as NETCONF, SNMP, or gRPC.

Authors' Addresses

   Haoyu Song (editor)
   Huawei
   2330 Central Expressway
   Santa Clara
   USA

   Email: haoyu.song@huawei.com

   Zhenqiang Li
   China Mobile
   No. 32 Xuanwumenxi Ave., Xicheng District
   Beijing, 100032
   P.R. China

   Email: lizhenqiang@chinamobile.com

   Pedro Martinez-Julia
   NICT
   4-2-1, Nukui-Kitamachi
   Koganei, Tokyo 184-8795
   Japan

   Email: pedro@nict.go.jp

   Laurent Ciavaglia
   Nokia
   Villarceaux 91460
   France

   Email: laurent.ciavaglia@nokia.com

   Aijun Wang
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   P.R. China

   Email: wangaj.bri@chinatelecom.cn