OPSAWG                                                      H. Song, Ed.
Internet-Draft                                                   T. Zhou
Intended status: Informational                                    ZB. Li
Expires: February 9, 2019                                         Huawei
                                                             G. Fioccola
                                                          Telecom Italia
                                                                  ZQ. Li
                                                            China Mobile
                                                       P. Martinez-Julia
                                                                    NICT
                                                            L. Ciavaglia
                                                                   Nokia
                                                                 A. Wang
                                                           China Telecom
                                                          August 8, 2018

                      Network Telemetry Framework
                        draft-song-opsawg-ntf-00

Abstract

   This document argues that an architectural framework for network
   telemetry is necessary to meet current and future network operation
   requirements.  The defining characteristics of network telemetry
   show a clear distinction from the conventional network OAM concept;
   hence network telemetry demands new techniques and protocols.  This
   document clarifies the terminology and classifies the categories and
   components of a network telemetry framework.  The requirements,
   challenges, existing solutions, and future directions are discussed
   for each category.  The network telemetry framework and the taxonomy
   help to set a common ground for the collection of related work and
   put future technique and standard developments into perspective.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119][RFC8174] when, and only when, they appear in all
   capitals, as shown here.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 9, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Motivation
     1.1.  Use Cases
     1.2.  Challenges
     1.3.  Glossary
     1.4.  Network Telemetry
   2.  The Necessity of a Network Telemetry Framework
   3.  Network Telemetry Framework
     3.1.  Existing Works Mapped in the Framework
     3.2.  Management Plane Telemetry
       3.2.1.  Requirements and Challenges
       3.2.2.  Push Extensions for NETCONF
       3.2.3.  gRPC Network Management Interface
     3.3.  Control Plane Telemetry
       3.3.1.  Requirements and Challenges
       3.3.2.  BGP Monitoring Protocol
     3.4.  Data Plane Telemetry
       3.4.1.  Requirements and Challenges
       3.4.2.  Technique Taxonomy
       3.4.3.  The IPFPM technology
       3.4.4.  Dynamic Network Probe
       3.4.5.  IP Flow Information Export (IPFIX) protocol
       3.4.6.  In-Situ OAM
     3.5.  External Data and Event Telemetry
       3.5.1.  Requirements and Challenges
   4.  Evolution of Network Telemetry
   5.  Security Considerations
   6.  IANA Considerations
   7.  Contributors
   8.  Acknowledgments
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Motivation

   The advance of AI/ML technologies gives networks an unprecedented
   opportunity to realize network autonomy with closed control loops.
   An intent-driven autonomous network is the logical next step for
   network evolution following SDN, aiming to reduce (or even eliminate)
   human labor, make the most efficient use of network resources, and
   provide better services more aligned with customer requirements.
   Although there is still a long way to go to reach the ultimate goal,
   the journey has started nevertheless.

   Storage and computing technologies are already mature enough to
   retain and process a huge amount of data and make real-time
   inferences.  Tools based on machine learning technologies and big
   data analytics are powerful in detecting and reacting to network
   faults, anomalies, and policy violations.  In turn, network policy
   updates for planning, intrusion prevention, optimization, and
   self-healing can be applied.  Some tools can even predict future
   events based on historical data.

   However, networks fail to keep pace with such data needs.  The
   current network architecture, protocol suite, and system design are
   not yet ready to provide enough quality data.  In the remainder of
   this section, we first identify a few key network operation use cases
   that network operators need the most.  These use cases are also the
   essential functions of the future autonomous networks.  Next, we show
   why the current network OAM techniques and protocols are not
   sufficient to meet the requirements of these use cases.  The
   discussion underlines the need for a new breed of techniques and
   protocols, which we put under an umbrella term: network telemetry.

1.1.  Use Cases

   All these use cases involve data extracted from the network data
   plane and sometimes from the network control plane and management
   plane.

   Intent and Policy Compliance:  Network policies are the rules that
      constrain the services for network access, provide
      differentiation within a service, or enforce specific treatment
      on the traffic.  For example, a service function chain is a
      policy that requires the selected flows to pass through a set of
      network functions in order.  An intent is a high-level abstract
      policy which requires a complex translation and mapping process
      before being applied on networks.  While a policy is enforced,
      the compliance needs to be verified and monitored continuously.

   SLA Compliance:  A Service-Level Agreement (SLA) defines the level of
      service a user expects from a network operator, including the
      metrics for the service measurement and the remedy/penalty
      procedures when the service level misses the agreement.  Users
      need to check if they get the service as promised, and network
      operators need to evaluate how they can deliver services that
      meet the SLA.

   Root Cause Analysis:  Network failure often involves a sequence of
      chained events, and the source of the failure is not
      straightforward to identify, especially when the failure is
      sporadic.  While machine learning or other data analytics
      technologies can be used for root cause analysis, it is up to the
      network to provide all the relevant data for analysis.

   Load Balancing, Traffic Engineering, and Network Planning:  Network
      operators are motivated to optimize their network utilization for
      better ROI or lower CAPEX, as well as differentiation across
      services and/or users of a given service.  The first step is to
      know the real-time network conditions before applying policies to
      steer the user traffic or adjust the load balancing algorithm.  In
      some cases, network micro-bursts need to be detected in a very
      short time frame so that fine-grained traffic control can be
      applied to avoid possible network congestion.  Long-term network
      capacity planning and topology augmentation also rely on the
      accumulated data of the network operation.

   Event Tracking and Prediction:  Network visibility is critical for a
      healthy network operation.  Numerous network events are of
      interest to network operators.  For example, network operators
      always want to learn where and why packets are dropped for an
      application flow.  They also want to be warned by early signs
      that some component is going to fail so that the proper fix or
      replacement can be made in time.

1.2.  Challenges

   The conventional OAM techniques, as described in [RFC7276], are not
   sufficient to support the above use cases for the following reasons:

   o  Most use cases need to continuously monitor the network and
      dynamically refine the data collection in real time and
      interactively.  Poll-based low-frequency data collection is
      ill-suited for these applications.  Streaming data directly pushed
      from the data source is preferred.

   o  Various data is needed from any place ranging from the packet
      processing engine to the QoS traffic manager.  Traditional data
      plane devices cannot provide the necessary probes.  An open and
      programmable data plane is therefore needed.

   o  Many application scenarios need to correlate data from multiple
      sources (e.g., from distributed nodes or from different network
      planes).  A piecemeal solution often lacks the capability to
      consolidate the data from multiple sources.  The composition of a
      complete solution, as partly proposed by ARCA
      [I-D.pedro-nmrg-anticipated-adaptation], will be empowered and
      guided by a comprehensive framework.

   o  The passive measurement techniques can either consume too many
      network resources and produce too much redundant data, or lead to
      inaccurate results.  The active measurement techniques are
      indirect, and they can interfere with the user traffic.  We need
      techniques that can collect direct and on-demand data from user
      traffic.

1.3.  Glossary

   Before further discussion, we list some key terminology and acronyms
   used in this document.  We make an intentional distinction between
   network telemetry and network OAM.

   AI:  Artificial Intelligence.  The use of machine-learning-based
      technologies to automate network operation.

   BMP:  BGP Monitoring Protocol

   DNP:  Dynamic Network Probe

   DPI:  Deep Packet Inspection

   gNMI:  gRPC Network Management Interface

   gRPC:  gRPC Remote Procedure Call

   IDN:  Intent-Driven Network

   IPFIX:  IP Flow Information Export Protocol

   IPFPM:  IP Flow Performance Measurement

   IOAM:  In-situ OAM

   NETCONF:  Network Configuration Protocol

   Network Telemetry:  A general term for a new breed of network
      visibility techniques and protocols, with the characteristics
      defined in this document.  Network telemetry enables smooth
      evolution toward intent-driven autonomous networks.

   NMS:  Network Management System

   OAM:  Operations, Administration, and Maintenance.  A group of
      network management functions that provide network fault
      indication, fault localization, performance information, and data
      and diagnosis functions.  Most conventional network monitoring
      techniques and protocols belong to network OAM.

   SNMP:  Simple Network Management Protocol

   YANG:  A data modeling language for NETCONF

   YANG FSM:  A YANG model to define a device-side finite state machine

   YANG PUSH:  A method to subscribe to data pushed from a remote YANG
      datastore

1.4.  Network Telemetry

   For a long time, network operators have relied upon protocols such as
   SNMP [RFC1157] to monitor the network.  SNMP can only provide limited
   information about the network.  Since SNMP is poll-based, it incurs
   a low data rate and high processing overhead.  Such drawbacks make
   SNMP unsuitable for today's automatic network applications.

   Network telemetry has emerged as a mainstream technical term to refer
   to the newer techniques of data collection and consumption,
   distinguishing itself from the conventional techniques for network
   OAM.
   It is expected that network telemetry can provide the necessary
   network visibility for autonomous networks, address the shortcomings
   of conventional OAM techniques, and allow for the emergence of new
   techniques bearing certain characteristics.

   One key difference between network telemetry and network OAM is that
   network telemetry assumes an intelligent machine in the center of a
   closed control loop, while network OAM assumes human network
   operators in the middle of an open control loop.  Network telemetry
   can directly trigger automated network operation; the conventional
   OAM tools only help human operators to monitor and diagnose the
   networks and guide manual network operations.  The different
   assumptions lead to very different techniques.

   Although the network telemetry techniques are just emerging and
   subject to continuous evolution, several defining characteristics of
   network telemetry have been well accepted:

   o  Push and Streaming: Instead of polling data from network devices,
      the telemetry collector subscribes to the streaming data pushed
      from the data source in network devices.

   o  Volume and Velocity: The telemetry data is intended to be consumed
      by machines rather than by humans.  Therefore, the data volume is
      huge and the processing is often done in real time.

   o  Normalization and Unification: Telemetry aims to address the
      overall network automation needs.  The piecemeal solutions offered
      by the conventional OAM approach are no longer suitable.  Efforts
      need to be made to normalize the data representation and unify the
      protocols.

   o  Model-based: The data is model-based, which allows applications to
      configure and consume data with ease.

   o  Data Fusion: The data for a single application can come from
      multiple data sources (e.g., cross domain, cross device, and cross
      layer) and needs to be correlated to take effect.

   o  Dynamic and Interactive: Since network telemetry is meant to be
      used in a closed control loop for network automation, it needs to
      run continuously and adapt to the dynamic and interactive queries
      from the network operation controller.

   In addition, the ideal network telemetry solution should also support
   the following features:

   o  In-Network Customization: The data can be customized in the
      network at run time to cater to the specific needs of
      applications.  This needs the support of a programmable data
      plane which allows probes to be deployed at flexible locations.

   o  Direct Data Plane Export: The data originating from the data plane
      can be directly exported to the data consumer for efficiency,
      especially when the data bandwidth is large and real-time
      processing is required.

   o  In-band Data Collection: In addition to the passive and active
      data collection approaches, the new hybrid approach allows data
      to be collected directly for any target flow on its entire
      forwarding path.

   o  Non-intrusive: The telemetry system should not fall into the trap
      of the "observer effect".  That is, it should not change the
      network behavior or affect the forwarding performance.

2.  The Necessity of a Network Telemetry Framework

   Big data analytics and machine-learning-based AI technologies are
   applied for network operation automation, relying on abundant data
   from networks.  Single-sourced and static data acquisition cannot
   meet the data requirements.  It is desirable to have a framework that
   integrates multiple telemetry approaches from different layers and
   allows flexible combinations for different applications.  The
   framework will benefit application development for the following
   reasons.

   o  The future autonomous networks will require a holistic view on
      network visibility.
      All the use cases and applications need to be supported uniformly
      and coherently under a single intelligent agent.  Therefore, the
      protocols and mechanisms should be consolidated into a minimum
      yet comprehensive set.  A telemetry framework can help to
      normalize the technique developments.

   o  Network visibility presents multiple viewpoints.  For example, the
      device viewpoint takes the network infrastructure as the
      monitoring object from which the network topology and device
      status can be acquired; the traffic viewpoint takes the flows or
      packets as the monitoring object from which the traffic quality
      and path can be acquired.  An application may need to switch its
      viewpoint during operation.  It may also need to correlate a
      service and its network experience to acquire comprehensive
      information.

   o  Applications require network telemetry to be elastic in order to
      use the network resources efficiently and reduce the performance
      impact.  Routine network monitoring covers the entire network with
      a low data sampling rate.  When issues arise or trends emerge, the
      telemetry data source can be modified and the data rate can be
      boosted.

   o  Efficient data fusion is critical for applications to reduce the
      overall quantity of data and improve the accuracy of analysis.

   So far, some telemetry-related work has been done within the IETF.
   However, this work is fragmented and scattered across different
   working groups.  The lack of coherence makes it difficult to assemble
   a comprehensive network telemetry system and causes repetitive and
   redundant work.

   A formal network telemetry framework is needed for constructing a
   working system.  The framework should cover the concepts and
   components from the standardization perspective.
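
   The elastic behavior listed among the reasons above (routine
   monitoring at a low sampling rate, boosted when issues arise) can be
   sketched as follows.  This is only an illustration under stated
   assumptions, not a mechanism defined by this document: the class
   name, the threshold, the interval bounds, and the halve/double
   policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ElasticSampler:
    """Hypothetical controller for the export interval of one data source."""

    base_interval_s: float = 60.0  # assumed routine (low-rate) interval
    min_interval_s: float = 1.0    # assumed fastest interval when boosted
    threshold: float = 0.8         # assumed "abnormal" level, e.g. utilization
    interval_s: float = 60.0       # current export interval

    def update(self, metric: float) -> float:
        """Return the export interval to use after observing `metric`."""
        if metric >= self.threshold:
            # Issue suspected: halve the interval, i.e. double the data rate.
            self.interval_s = max(self.min_interval_s, self.interval_s / 2)
        else:
            # Conditions look normal: relax back toward the routine interval.
            self.interval_s = min(self.base_interval_s, self.interval_s * 2)
        return self.interval_s


sampler = ElasticSampler()
for utilization in (0.3, 0.9, 0.95, 0.4):
    print(utilization, sampler.update(utilization))
```

   In this sketch the data rate rises only while the observed metric
   stays above the threshold, so routine monitoring keeps its low
   overhead.
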
   This document clarifies the layers on which the telemetry is exerted
   and decomposes the telemetry system into a set of distinct components
   that the existing and future work can easily map to.

3.  Network Telemetry Framework

   Telemetry can be applied on the data plane, the control plane, and
   the management plane in a network, as well as on other sources
   outside the network, as shown in Figure 1.

         +------------------------------+
         |                              |
         |       Network Operation      |<-------+
         |          Applications        |        |
         |                              |        |
         +------------------------------+        |
               ^     ^          ^                |
               |     |          |                |
               V     |          V                V
         +-----------|---+--------------+  +-----------+
         |           |   |              |  |           |
         | Control Pl|ane|              |  | External  |
         | Telemetry | <--->            |  | Data and  |
         |           |   |              |  | Event     |
         |      ^    V   |  Management  |  | Telemetry |
         +------|--------+  Plane       |  |           |
         |      V        |  Telemetry   |  +-----------+
         |               |              |
         | Data Plane  <--->            |
         | Telemetry     |              |
         |               |              |
         +---------------+--------------+

       Figure 1: Layer Category of the Network Telemetry Framework

   Note that the interaction with the network operation applications can
   be indirect.  For example, in the management plane telemetry, the
   management plane may need to acquire data from the data plane.  On
   the other hand, an application may involve more than one plane
   simultaneously.  For example, an SLA compliance application may
   require both the data plane telemetry and the control plane
   telemetry.

   At each plane, the telemetry can be further partitioned into five
   distinct components:

   Data Source:  Determine where the original data is acquired.  The
      data source usually just provides raw data which needs further
      processing.  A data source can be considered a probe.  A probe can
      be statically installed or dynamically installed.

   Data Subscription:  Determine the protocol and channel for
      applications to acquire desired data.
      Data subscription is also responsible for defining the desired
      data that might not be directly available from data sources.  The
      subscription data can be described by a model.  The model can be
      statically installed or dynamically installed.

   Data Generation:  The original data needs to be processed, encoded,
      and formatted in network devices to meet application subscription
      requirements.  This may involve in-network computing and
      processing on either the fast path or the slow path in network
      devices.

   Data Export:  Determine how the ready data is delivered to
      applications.

   Data Analysis and Storage:  In this final step, data is consumed by
      applications or stored for future reference.  Data analysis can be
      interactive.  It may initiate further data subscription.

         +------------------------------+
         |                              |
         |    Data Analysis/Storage     |
         |                              |
         +------------------------------+
                 |               ^
                 |               |
                 V               |
         +---------------+--------------+
         |               |              |
         | Data          | Data         |
         | Subscription  | Export       |
         |               |              |
         +---------------+--------------+
         |                              |
         |       Data Generation        |
         |                              |
         +------------------------------+
         |                              |
         |         Data Source          |
         |                              |
         +------------------------------+

         Figure 2: Components in the Network Telemetry Framework

   Since most existing standard-related work belongs to the first four
   components, in the remainder of the document, we focus on these
   components only.

3.1.  Existing Works Mapped in the Framework

   The following table provides a non-exhaustive list of existing work
   (mainly published in the IETF and with an emphasis on the latest
   technologies) and shows its position in the framework.

      +-----------+--------------+---------------+--------------+
      |           | Management   | Control       | Data         |
      |           | Plane        | Plane         | Plane        |
      +-----------+--------------+---------------+--------------+
      |           | YANG Data    | Control Proto.| Flow/Packet  |
      | Data      | Store        | Network State | Statistics   |
      | Source    |              |               | States       |
      |           |              |               | DPI          |
      +-----------+--------------+---------------+--------------+
      |           | gRPC         | NETCONF/YANG  | NETCONF/YANG |
      | Data      | YANG PUSH    | BGP           | YANG FSM     |
      | Subscribe |              |               |              |
      |           |              |               |              |
      +-----------+--------------+---------------+--------------+
      |           | Soft DNP     | Soft DNP      | In-situ OAM  |
      | Data      |              |               | IPFPM        |
      | Generation|              |               | Hard DNP     |
      |           |              |               |              |
      +-----------+--------------+---------------+--------------+
      |           | gRPC         | BMP           | IPFIX        |
      | Data      | YANG PUSH    |               | UDP          |
      | Export    | UDP          |               |              |
      |           |              |               |              |
      +-----------+--------------+---------------+--------------+

                         Figure 3: Existing Work

3.2.  Management Plane Telemetry

3.2.1.  Requirements and Challenges

   The management plane of the network element interacts with the
   Network Management System (NMS), and provides information such as
   performance data, network logging data, network warning and defects
   data, and network statistics and state data.  Some legacy protocols
   are widely used for the management plane, such as SNMP and Syslog,
   but these protocols do not meet the requirements of the automatic
   network operation applications.

   New management plane telemetry protocols should consider the
   following requirements:

   Convenient Data Subscription:  An application should have the freedom
      to choose the data export means such as the data types and the
      export frequency.

   Structured Data:  For automatic network operation, machines will
      replace humans for network data comprehension.
      Schema languages such as YANG can efficiently describe structured
      data and normalize data encoding and transformation.

   High Speed Data Transport:  In order to retain the information, a
      server needs to send a large amount of data at high frequency.
      Compact encoding formats are needed to compress the data and
      improve the data transport efficiency.  The push mode, by
      replacing the poll mode, can also reduce the interactions between
      clients and servers, which helps to improve the server's
      efficiency.

3.2.2.  Push Extensions for NETCONF

   NETCONF [RFC6241] is one popular network management protocol, which
   is also recommended by the IETF.  Although it can be used for data
   collection, NETCONF is primarily suited to configuration.  YANG Push
   [I-D.ietf-netconf-yang-push] extends NETCONF and enables subscriber
   applications to request a continuous, customized stream of updates
   from a YANG datastore.  Providing such visibility into changes made
   upon YANG configuration and operational objects enables new
   capabilities based on the remote mirroring of configuration and
   operational state.  Moreover, the distributed data collection
   mechanism [I-D.zhou-netconf-multi-stream-originators] via a UDP-based
   publication channel [I-D.ietf-netconf-udp-pub-channel] provides
   enhanced efficiency for the NETCONF-based telemetry.

3.2.3.  gRPC Network Management Interface

   gRPC Network Management Interface (gNMI)
   [I-D.openconfig-rtgwg-gnmi-spec] is a network management protocol
   based on the gRPC [I-D.kumar-rtgwg-grpc-protocol] RPC (Remote
   Procedure Call) framework.  With a single gRPC service definition,
   both configuration and telemetry can be covered.  gRPC is an HTTP/2
   [RFC7540] based open source micro service communication framework.
   It provides a number of capabilities that make it well-suited for
   network telemetry, including:

   o  A full-duplex streaming transport model combined with a binary
      encoding mechanism, which further improves telemetry efficiency.

   o  Higher-level feature consistency across platforms that common
      HTTP/2 libraries typically do not provide.  This characteristic
      is especially valuable because telemetry data collectors normally
      reside on a large variety of platforms.

   o  A built-in load-balancing and failover mechanism.

3.3.  Control Plane Telemetry

3.3.1.  Requirements and Challenges

   The control plane telemetry refers to the health condition monitoring
   of different network protocols, which covers Layer 2 to Layer 7.
   Keeping track of the running status of these protocols is beneficial
   for detecting, localizing, and even predicting various network
   issues, as well as for network optimization, in real time and at
   fine granularity.

   One of the most challenging problems for the control plane telemetry
   is how to correlate the E2E Key Performance Indicators (KPI) to a
   specific layer's KPIs.  For example, an IPTV user may describe his
   User Experience (UE) by the video fluency and definition.  Then in
   case of an unusually poor UE KPI or a service disconnection, it is
   non-trivial work to delimit and localize the issue to the responsible
   protocol layer (e.g., the Transport Layer or the Network Layer), the
   responsible protocol (e.g., ISIS or BGP at the Network Layer), and
   finally the responsible device(s) with specific reasons.

   Traditional OAM-based approaches for control plane KPI measurement
   include Ping (L3), Traceroute (L3), Y.1731 (L2), and so on.
   One common issue behind these methods is that they only measure the
   KPIs instead of reflecting the actual running status of these
   protocols, making them less effective or efficient for control plane
   troubleshooting and network optimization.  An example of the control
   plane telemetry is the BGP Monitoring Protocol (BMP), which is
   currently used to monitor BGP routes and enables rich applications,
   such as BGP peer analysis, AS analysis, prefix analysis, security
   analysis, and so on.  However, the monitoring of other layers and
   protocols and the cross-layer, cross-protocol KPI correlations are
   still in their infancy (e.g., IGP monitoring is missing) and require
   substantial further research.

3.3.2.  BGP Monitoring Protocol

   BGP Monitoring Protocol (BMP) [RFC7854] is used to monitor BGP
   sessions and is intended to provide a convenient interface for
   obtaining route views.

   The BGP routing information is collected from the monitored device(s)
   to the BMP monitoring station by setting up the BMP TCP session.  The
   BGP peers are monitored by the BMP Peer Up and Peer Down
   Notifications.  The BGP routes (including Adjacency_RIB_In [RFC7854],
   Adjacency_RIB_Out [I-D.ietf-grow-bmp-adj-rib-out], and Local_RIB
   [I-D.ietf-grow-bmp-local-rib]) are encapsulated in the BMP Route
   Monitoring Message and the BMP Route Mirroring Message, in the form
   of both an initial table dump and real-time route updates.  In
   addition, BGP statistics are reported through the BMP Stats Report
   Message, which can be either timer triggered or event driven.  More
   BMP extensions can be explored to enrich the applications of BGP
   monitoring.

3.4.  Data Plane Telemetry

3.4.1.  Requirements and Challenges

   An effective data plane telemetry system relies on the data that the
   network device can expose.  The data's quality, quantity, and
   timeliness must meet some stringent requirements.
   This raises some challenges to the network data plane devices where
   the firsthand data originates.

   o  A data plane device's main function is user traffic processing
      and forwarding.  While supporting network visibility is
      important, the telemetry is just an auxiliary function and it
      should not impede normal traffic processing and forwarding
      (i.e., the performance is not lowered and the behavior is not
      altered due to the telemetry functions).

   o  Network operation applications require end-to-end visibility
      from various sources, which results in a huge volume of data.
      However, the sheer data quantity should not stress the network
      bandwidth, regardless of the data delivery approach (i.e.,
      through in-band or out-of-band channels).

   o  The data plane devices must provide the data in a timely manner
      with the minimum possible delay.  Long processing, transport,
      storage, and analysis delays can impact the effectiveness of the
      control loop and even render the data useless.

   o  The data should be structured and labeled, and easy for
      applications to parse and consume.  At the same time, the data
      types needed by applications can vary significantly.  The data
      plane devices need to provide enough flexibility and
      programmability to support the precise data provision for
      applications.

   o  The data plane telemetry should support incremental deployment
      and work even if some devices are unaware of the system.  This
      challenge is highly relevant to the standards and legacy
      networks.

   The industry has agreed that data plane programmability is
   essential to support network telemetry.  Newer data plane chips are
   all equipped with advanced telemetry features and provide
   flexibility to support customized telemetry functions.

3.4.2.  Technique Taxonomy

   There can be multiple possible dimensions to classify the data
   plane telemetry techniques.
   Active and Passive:  The active and passive methods (as well as the
      hybrid types) are well documented in [RFC7799].  The passive
      methods include TCPDUMP, IPFIX [RFC7011], sFlow, and traffic
      mirroring.  These methods usually have low data coverage, and
      improving the coverage comes at a very high bandwidth cost.  On
      the other hand, the active methods include Ping, Traceroute,
      OWAMP [RFC4656], and TWAMP [RFC5357].  These methods are
      intrusive and only provide indirect network measurement results.
      The hybrid methods, including in-situ OAM
      [I-D.brockners-inband-oam-requirements], IPFPM [RFC8321], and
      Multipoint Alternate Marking
      [I-D.fioccola-ippm-multipoint-alt-mark], provide a well-balanced
      and more flexible approach.  However, these methods are also
      more complex to implement.

   In-Band and Out-of-Band:  The telemetry data, before being exported
      to some collector, can be carried in user packets.  Such methods
      are considered in-band (e.g., in-situ OAM
      [I-D.brockners-inband-oam-requirements]).  If the telemetry data
      is directly exported to some collector without modifying the
      user packets, such methods are considered out-of-band (e.g.,
      postcard-based INT).  It is possible to have hybrid methods.
      For example, only the telemetry instruction or partial data is
      carried by user packets (e.g., IPFPM [RFC8321]).

   E2E and In-Network:  Some E2E methods start from and end at the
      network end hosts (e.g., Ping).  The other methods work in
      networks and are transparent to end hosts.  However, if needed,
      the in-network methods can be easily extended into end hosts.

   Flow, Path, and Node:  Depending on the telemetry objective, the
      methods can be flow-based (e.g., in-situ OAM
      [I-D.brockners-inband-oam-requirements]), path-based (e.g.,
      Traceroute), or node-based (e.g., IPFIX [RFC7011]).

3.4.3.  The IPFPM technology

   The Alternate Marking method is an efficient way to perform packet
   loss, delay, and jitter measurements in both IP and overlay
   networks, as presented in IPFPM [RFC8321] and
   [I-D.fioccola-ippm-multipoint-alt-mark].

   This technique can be applied to point-to-point and
   multipoint-to-multipoint flows.  Alternate Marking creates batches
   of packets by alternating the value of 1 bit (or a label) of the
   packet header.  These batches of packets are unambiguously
   recognized over the network, and the comparison of packet counters
   for each batch allows the packet loss calculation.  The same idea
   can be applied to delay measurement by selecting ad hoc packets
   with a marking bit dedicated for delay measurements.

   The Alternate Marking method needs two counters per marking period
   for each monitored flow.  For instance, with n measurement points
   and m monitored flows, the number of packet counters for each time
   interval is on the order of n*m*2 (one per color).

   Since networks offer rich sets of network performance measurement
   data (e.g., packet counters), traditional approaches run into
   limitations.  One reason is that the bottleneck is the generation
   and export of the data and the amount of data that can reasonably
   be collected from the network.  In addition, management tasks
   related to determining and configuring which data to generate lead
   to significant deployment challenges.

   The Multipoint Alternate Marking approach, described in
   [I-D.fioccola-ippm-multipoint-alt-mark], aims to resolve this issue
   and makes the performance monitoring more flexible in case a
   detailed analysis is not needed.
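
   The per-batch counter comparison described earlier in this section
   can be illustrated with a minimal sketch.  It is not taken from
   [RFC8321]; the function names (color_for, count_by_batch,
   batch_loss, counters_per_interval) and the toy packet lists are
   assumptions made purely for illustration, and a real device would
   only see the marking bit and keep two counters per flow.

```python
from collections import Counter


def color_for(period_index: int) -> int:
    """Marking bit alternates every marking period: 0, 1, 0, 1, ..."""
    return period_index % 2


def count_by_batch(packets) -> Counter:
    """Count packets per (flow_id, period_index) batch at one point.

    `packets` is an iterable of (flow_id, period_index) tuples; this
    representation is a hypothetical stand-in for marked packets.
    """
    return Counter(packets)


def batch_loss(upstream: Counter, downstream: Counter) -> dict:
    """Packets lost per batch between two measurement points."""
    return {batch: sent - downstream.get(batch, 0)
            for batch, sent in upstream.items()}


def counters_per_interval(n_points: int, m_flows: int) -> int:
    """Order of magnitude from the text: n * m * 2 (one per color)."""
    return n_points * m_flows * 2


# Toy example: flow "f1" sends 5 packets in period 0 and 4 packets in
# period 1; one packet of period 0 is dropped before the second point.
sent = [("f1", 0)] * 5 + [("f1", 1)] * 4
received = [("f1", 0)] * 4 + [("f1", 1)] * 4
loss = batch_loss(count_by_batch(sent), count_by_batch(received))
```

   Because each batch is delimited by the change of color, the two
   measurement points can compare counters for a completed batch
   without exchanging any per-packet state.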
   An application orchestrates network performance measurement tasks
   across the network to allow optimized monitoring, and it can
   calibrate how deeply monitoring data is obtained from the network
   by configuring measurement points coarsely or finely.

   Using Alternate Marking, it is possible to monitor a multipoint
   network without examining it in depth by using network clustering
   (clusters are subnetworks, that is, portions of the entire network
   that preserve the same property as the entire network).  If there
   is packet loss or the delay is too high, the filtering criteria can
   be made more specific in order to perform a detailed analysis by
   using a different combination of clusters, up to a per-flow
   measurement as described in IPFPM [RFC8321].

   In summary, an application can initially configure end-to-end
   monitoring between ingress points and egress points of the network.
   If the network does not experience issues, this approximate
   monitoring is good enough and is very cheap in terms of network
   resources.  In case of problems, the application becomes aware of
   the issues from this approximate monitoring and, in order to
   localize the portion of the network that has issues, configures the
   measurement points more exhaustively, so that a new detailed
   monitoring is performed.  After the detection and resolution of the
   problem, the initial approximate monitoring can be used again.

3.4.4.  Dynamic Network Probe

   Hardware-based Dynamic Network Probe (DNP) [I-D.song-opsawg-dnp4iq]
   provides a programmable means to customize the data that an
   application collects from the data plane.  A direct benefit of DNP
   is the reduction of the exported data.  A full DNP solution covers
   several components including data source, data subscription, and
   data generation.
   The data subscription needs to define the custom data which can be
   composed and derived from the raw data sources.  The data
   generation takes advantage of moderate in-network computing to
   produce the desired data.

   While DNP can introduce unforeseeable flexibility to the data plane
   telemetry, it also faces some challenges.  It requires a flexible
   data plane that can be dynamically reprogrammed at run-time.  The
   programming API is yet to be defined.

3.4.5.  IP Flow Information Export (IPFIX) protocol

   Traffic on a network can be seen as a set of flows passing through
   network elements.  IP Flow Information Export (IPFIX) [RFC7011]
   provides a means of transmitting traffic flow information for
   administrative or other purposes.  A typical IPFIX-enabled system
   includes a pool of Metering Processes that collect data packets at
   one or more Observation Points, optionally filter them, and
   aggregate information about these packets.  An Exporter then
   gathers each of the Observation Points together into an Observation
   Domain and sends this information via the IPFIX protocol to a
   Collector.

3.4.6.  In-Situ OAM

   Traditional passive and active monitoring and measurement
   techniques are either inaccurate or resource-consuming.  It is
   preferable to directly acquire data associated with a flow's
   packets when the packets pass through a network.  In-situ OAM
   (iOAM) [I-D.brockners-inband-oam-requirements], a data generation
   technique, embeds a new instruction header in user packets, and the
   instruction directs the network nodes to add the requested data to
   the packets.  Thus, at the end of the path, the packet's experience
   on the entire forwarding path can be collected.  Such firsthand
   data is invaluable to many network OAM applications.

   However, iOAM also faces some challenges.
   Issues of performance impact, security, scalability and overhead
   limits, encapsulation difficulties in some protocols, and
   cross-domain deployment need to be addressed.

3.5.  External Data and Event Telemetry

   Events that occur outside the boundaries of the network system are
   another important source of telemetry information.  Correlating
   both internal telemetry data and external events with the
   requirements of network systems, as presented in Exploiting
   External Event Detectors to Anticipate Resource Requirements for
   the Elastic Adaptation of SDN/NFV Systems
   [I-D.pedro-nmrg-anticipated-adaptation], provides a strategic and
   functional advantage to management operations.

3.5.1.  Requirements and Challenges

   As with other sources of telemetry information, the data and events
   must meet strict requirements, especially in terms of timeliness,
   which is essential to properly incorporate external event
   information into management cycles.  The specific challenges are
   described as follows:

   o  The role of an external event detector can be played by multiple
      elements, including hardware (e.g., physical sensors, such as
      seismometers) and software (e.g., Big Data sources that analyze
      streams of information, such as Twitter messages).  Thus, the
      transmitted data must support different shapes but, at the same
      time, follow a common but extensible ontology.

   o  Since the main function of the external event detectors is
      actually to perform the notifications, their timeliness is
      assumed.  However, once messages have been dispatched, they must
      be quickly collected and inserted into the control plane with
      variable priority, which will be high for important sources
      and/or important events and low for secondary ones.

   o  The ontology used by external detectors must be easily adopted
      by current and future devices and applications.
      Therefore, it must be easily mapped to current information
      models, such as YANG models.

   Organizing internal and external telemetry information together
   will be key for the general exploitation of the management
   possibilities of current and future network systems, as reflected
   in the incorporation of cognitive capabilities into new hardware
   and software (virtual) elements.

4.  Evolution of Network Telemetry

   As networks evolve towards automated operation, network telemetry
   also undergoes several levels of evolution.

   Level 0 - Static Telemetry:  The telemetry data is determined at
      design time.  The network operator can only configure how to use
      it with limited flexibility.

   Level 1 - Dynamic Telemetry:  The telemetry data can be dynamically
      programmed or configured at runtime, allowing a tradeoff among
      resource, performance, flexibility, and coverage.  DNP is an
      effort in this direction.

   Level 2 - Interactive Telemetry:  The network operator can
      continuously customize the telemetry data in real time to
      reflect the network operation's visibility requirements.  At
      this level, some tasks can be automated but human operators
      still need to sit in the middle to make decisions.

   Level 3 - Closed-loop Telemetry:  Human operators are completely
      excluded from the control loop.  The intelligent network
      operation engine automatically issues the telemetry data
      request, analyzes the data, and updates the network operations
      in closed control loops.

   While most of the existing technologies belong to level 0 and level
   1, with the help of a clearly defined network telemetry framework,
   we can assemble the technologies to support level 2 and make solid
   steps towards level 3.

5.  Security Considerations

   TBD

6.  IANA Considerations

   This document includes no request to IANA.

7.  Contributors

   The other main contributors of this document are listed as follows.

   o  James N. Guichard, Huawei

   o  Yunan Gu, Huawei

8.  Acknowledgments

   We would like to thank Victor Liu and others who have provided
   helpful comments and suggestions to improve this document.

9.  References

9.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

9.2.  Informative References

   [I-D.brockners-inband-oam-requirements]
              Brockners, F., Bhandari, S., Dara, S., Pignataro, C.,
              Gredler, H., Leddy, J., Youell, S., Mozes, D., Mizrahi,
              T., <>, P., and r. remy@barefootnetworks.com,
              "Requirements for In-situ OAM",
              draft-brockners-inband-oam-requirements-03 (work in
              progress), March 2017.

   [I-D.fioccola-ippm-multipoint-alt-mark]
              Fioccola, G., Cociglio, M., Sapio, A., and R. Sisto,
              "Multipoint Alternate Marking method for passive and
              hybrid performance monitoring",
              draft-fioccola-ippm-multipoint-alt-mark-04 (work in
              progress), June 2018.

   [I-D.ietf-grow-bmp-adj-rib-out]
              Evens, T., Bayraktar, S., Lucente, P., Mi, K., and S.
              Zhuang, "Support for Adj-RIB-Out in BGP Monitoring
              Protocol (BMP)", draft-ietf-grow-bmp-adj-rib-out-01
              (work in progress), March 2018.

   [I-D.ietf-grow-bmp-local-rib]
              Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente,
              "Support for Local RIB in BGP Monitoring Protocol
              (BMP)", draft-ietf-grow-bmp-local-rib-01 (work in
              progress), February 2018.

   [I-D.ietf-netconf-udp-pub-channel]
              Zheng, G., Zhou, T., and A. Clemm, "UDP based
              Publication Channel for Streaming Telemetry",
              draft-ietf-netconf-udp-pub-channel-03 (work in
              progress), July 2018.
   [I-D.ietf-netconf-yang-push]
              Clemm, A., Voit, E., Prieto, A., Tripathy, A.,
              Nilsen-Nygaard, E., Bierman, A., and B. Lengyel, "YANG
              Datastore Subscription", draft-ietf-netconf-yang-push-17
              (work in progress), July 2018.

   [I-D.kumar-rtgwg-grpc-protocol]
              Kumar, A., Kolhe, J., Ghemawat, S., and L. Ryan, "gRPC
              Protocol", draft-kumar-rtgwg-grpc-protocol-00 (work in
              progress), July 2016.

   [I-D.openconfig-rtgwg-gnmi-spec]
              Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack,
              C., and C. Morrow, "gRPC Network Management Interface
              (gNMI)", draft-openconfig-rtgwg-gnmi-spec-01 (work in
              progress), March 2018.

   [I-D.pedro-nmrg-anticipated-adaptation]
              Martinez-Julia, P., "Exploiting External Event Detectors
              to Anticipate Resource Requirements for the Elastic
              Adaptation of SDN/NFV Systems",
              draft-pedro-nmrg-anticipated-adaptation-02 (work in
              progress), June 2018.

   [I-D.song-opsawg-dnp4iq]
              Song, H. and J. Gong, "Requirements for Interactive
              Query with Dynamic Network Probes",
              draft-song-opsawg-dnp4iq-01 (work in progress), June
              2017.

   [I-D.zhou-netconf-multi-stream-originators]
              Zhou, T., Zheng, G., Voit, E., Clemm, A., and A.
              Bierman, "Subscription to Multiple Stream Originators",
              draft-zhou-netconf-multi-stream-originators-02 (work in
              progress), May 2018.

   [RFC1157]  Case, J., Fedor, M., Schoffstall, M., and J. Davin,
              "Simple Network Management Protocol (SNMP)", RFC 1157,
              DOI 10.17487/RFC1157, May 1990.

   [RFC4656]  Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and
              M. Zekauskas, "A One-way Active Measurement Protocol
              (OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September
              2006.

   [RFC5357]  Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
              Babiarz, "A Two-Way Active Measurement Protocol
              (TWAMP)", RFC 5357, DOI 10.17487/RFC5357, October 2008.
   [RFC6241]  Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J.,
              Ed., and A. Bierman, Ed., "Network Configuration
              Protocol (NETCONF)", RFC 6241, DOI 10.17487/RFC6241,
              June 2011.

   [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
              "Specification of the IP Flow Information Export (IPFIX)
              Protocol for the Exchange of Flow Information", STD 77,
              RFC 7011, DOI 10.17487/RFC7011, September 2013.

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014.

   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
              DOI 10.17487/RFC7540, May 2015.

   [RFC7799]  Morton, A., "Active and Passive Metrics and Methods
              (with Hybrid Types In-Between)", RFC 7799,
              DOI 10.17487/RFC7799, May 2016.

   [RFC7854]  Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP
              Monitoring Protocol (BMP)", RFC 7854,
              DOI 10.17487/RFC7854, June 2016.

   [RFC8321]  Fioccola, G., Ed., Capello, A., Cociglio, M.,
              Castaldelli, L., Chen, M., Zheng, L., Mirsky, G., and T.
              Mizrahi, "Alternate-Marking Method for Passive and
              Hybrid Performance Monitoring", RFC 8321,
              DOI 10.17487/RFC8321, January 2018.

Authors' Addresses

   Haoyu Song (editor)
   Huawei
   2330 Central Expressway
   Santa Clara
   USA

   Email: haoyu.song@huawei.com

   Tianran Zhou
   Huawei
   156 Beiqing Road
   Beijing, 100095
   P.R. China

   Email: zhoutianran@huawei.com

   Zhenbin Li
   Huawei
   156 Beiqing Road
   Beijing, 100095
   P.R. China

   Email: lizhenbin@huawei.com

   Giuseppe Fioccola
   Telecom Italia
   Via Reiss Romoli, 274
   Torino 10148
   Italy

   Email: giuseppe.fioccola@telecomitalia.it

   Zhenqiang Li
   China Mobile
   No. 32 Xuanwumenxi Ave., Xicheng District
   Beijing, 100032
   P.R. China

   Email: lizhenqiang@chinamobile.com

   Pedro Martinez-Julia
   NICT
   4-2-1, Nukui-Kitamachi
   Koganei, Tokyo 184-8795
   Japan

   Phone: +81 42 327 7293
   Email: pedro@nict.go.jp

   Laurent Ciavaglia
   Nokia
   Villarceaux 91460
   France

   Email: laurent.ciavaglia@nokia.com

   Aijun Wang
   China Telecom
   Beiqijia Town, Changping District
   Beijing, 102209
   P.R. China

   Email: wangaj.bri@chinatelecom.cn