opsawg                                                       P. Lapukhov
Internet-Draft                                                  Facebook
Intended status: Standards Track                          March 18, 2016
Expires: September 19, 2016

           Data-plane probe for in-band telemetry collection
                   draft-lapukhov-dataplane-probe-00

Abstract

   Detecting and isolating network faults in IP networks has
   traditionally been done using tools like ping and traceroute (see
   [RFC7276]) or more complex systems built on similar concepts of
   active probing and path tracing. While active synthetic probes have
   proven helpful in detecting data-plane faults, isolating the fault
   location has proven to be a much harder problem, especially in
   diverse networks with multiple active forwarding planes (e.g. IP and
   MPLS). Moreover, existing end-to-end tools do not generally support
   functionality beyond dealing with packet loss - for example, they are
   hardly useful for detecting and reporting transient (i.e. milli- or
   even micro-second) network congestion.

   Modern network forwarding hardware can enable more sophisticated
   data-plane functionality that provides substantial improvement to the
   isolation and identification capabilities of network elements. For
   example, it has become possible to encode a snapshot of a network
   element's forwarding state within the packet payload as it transits
   the device. One example of such device/network state would be queue
   depth on the egress port taken by that specific packet. When
   combined with a unique device identifier embedded in the same packet,
   this could allow for precise time and topological identification of
   the congested location within the network.

   This document proposes a standard format for embedding telemetry
   information in UDP-based probing packets, i.e. packets designated for
   testing the network while not carrying application traffic. These
   active probes could be conveyed over multiple protocols (ICMP, UDP,
   TCP, etc.)
   but this document specifically focuses on UDP, given its
   simple semantics. In addition this document provides recommendations
   on handling the active probes by devices that do not support the
   required data-plane functionality.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 19, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Data plane probe
     2.1.  Probe transport
     2.2.  Probe structure
     2.3.  Header Format
     2.4.  Telemetry Record Template
     2.5.  Telemetry Record
   3.  Telemetry Record Types
     3.1.  Device Identifier
     3.2.  Timestamp
     3.3.  Queueing Delay
     3.4.  Ingress/Egress Port IDs
     3.5.  Forwarding Information
       3.5.1.  IPv6 Route
       3.5.2.  IPv4 Route
       3.5.3.  MPLS Route
   4.  Operating in loopback mode
   5.  Processing Probe Packet
     5.1.  Detecting a probe
   6.  Non-Capable Devices
   7.  Handling data-plane probes in the MPLS domain
   8.  Multi-chip device considerations
   9.  IANA Considerations
   10. Acknowledgements
   11. References
     11.1.  Normative References
     11.2.  Informative References
   Author's Address

1.  Introduction

   Detecting and isolating faults in IP networks may involve multiple
   tools and approaches, but by far the two most popular utilities used
   by operators are ping and traceroute. The ping utility provides the
   basic end-to-end connectivity check by sending a special ICMP packet.
   There are other variants of ping that work using TCP or UDP probes,
   but they may require a special responder application (for UDP) on
   the other end of the probed connection.

   This type of active probing approach has its limitations. First, it
   operates end-to-end, and thus it is impossible to tell where in the
   path the fault has happened from simply observing the packet loss
   ratios. Second, in multipath (ECMP) scenarios it can be quite
   difficult to fully and/or deterministically exercise all the possible
   paths connecting two end-points.

   The traceroute utility has multiple variants as well - UDP, ICMP and
   TCP based, for instance, and a special variant for MPLS LSP testing.
   Practically all variants follow the same model of operation: varying
   the TTL field in outgoing probes and analyzing the returned ICMP
   time-exceeded and unreachable messages. This does allow isolating
   the fault down to the IP hop that is losing packets, but has its own
   limitations. As with the ping utility, it becomes complicated to
   explore all possible ECMP paths in the network. This is especially
   problematic in large Clos fabric topologies that are very common in
   large data-center networks. Next, many network devices limit the
   rate of outgoing ICMP messages as well as the rate of "exception"
   packets "punted" to the control plane processor. This puts a
   functional limit on the packet rate at which traceroute can probe a
   given hop, and hence impacts the resolution and time to isolate a
   fault. Lastly, the treatment of these control packets is often
   different from that of packets that take the regular forwarding
   path: the latter are normally not redirected to the control plane
   processor and are handled purely in the data-plane hardware.

   Modern network processing elements (both hardware and software based)
   are capable of packet handling beyond basic forwarding and simple
   header modifications.
   Of special interest is the ability to capture
   instantaneous state from the network element and encode
   this state directly into the transit packet. One example would be to
   record the transit device's name, ingress and egress port
   identifiers, queue depths, timestamps and so on. By collecting this
   state from each network device along the path, it becomes trivial to
   trace a probe's path through the network as well as record transit
   device characteristics. Extending this model, one could build a tool
   that combines the useful properties of ping and traceroute using a
   single packet flight through the network, without the constraints of
   control plane (aka "slow path") processing. To aid in the
   development of such tooling, this document defines a format for
   embedding telemetry information in the body of active probing
   packets.

2.  Data plane probe

   This section defines the structure of the active data-plane probe
   packets.

2.1.  Probe transport

   This document assumes the use of IP/UDP for data-plane probing
   (either IPv4 or IPv6). A receiving application may listen on a
   pre-defined UDP port to collect and possibly echo back the
   information embedded in the probe. One potential limitation to this
   methodology is the size of the probe packet, as some data-plane
   faults may only impact packets of a given size or range of sizes. In
   this case, the data-plane probe may not be able to detect such
   issues, given the requirement to pre-allocate storage in the packet
   body.

2.2.  Probe structure

   The sender is responsible for constructing a packet large enough to
   hold all records to be added by the network elements. At the same
   time, the probe must not exceed the smallest MTU along the path, so
   it is assumed that the sender either knows the needed MTU or relies
   on well-known mechanisms for path MTU discovery.
   After adding the
   mandatory protocol (IP, UDP, etc.) headers, the packet payload is
   built according to the following layout:

   +---------------------------------------------------------+
   |                         Header                          |
   +---------------------------------------------------------+
   |                Telemetry Record template                |
   +---------------------------------------------------------+
   |           Placeholder for telemetry record 1            |
   +---------------------------------------------------------+
   |           Placeholder for telemetry record 2            |
   +---------------------------------------------------------+
   .                                                         .
   .                                                         .
   .                                                         .
   +---------------------------------------------------------+
   |           Placeholder for telemetry record N            |
   +---------------------------------------------------------+

                         Figure 1: Probe layout

   Notice that all record placeholders are of equal size, as prescribed
   by the telemetry record template, and that space for them must be
   pre-allocated by the sender of the packet. Each record corresponds
   to a single network element on the path from sender to receiver of
   the packet.

2.3.  Header Format

   The probe payload starts with a fixed-size header. The header
   identifies the packet as a probe packet, and encodes basic
   information shared by all telemetry records.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Probe Marker (1)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Probe Marker (2)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Version Number | Must Be Zero |S|O|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Message Type | Hop Limit | Must Be Zero |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sender's Handle                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Write Offset                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        Figure 2: Header Format

   (1)  The "Probe Marker" fields are arbitrary 32-bit values generally
        used by the network elements to identify the packet as a probe
        packet. These fields should be interpreted as unsigned integer
        values, stored in network byte order. For example, a network
        element may be configured to recognize a UDP packet destined to
        port 31337 and having 0xDEAD 0xBEEF as the values in the "Probe
        Marker" fields as an active probe, and treat it accordingly.

   (2)  "Version Number" is currently set to 1.

   (3)  The "Global Flags" field is 8 bits, and defines the following
        flags:

        (1)  "Overflow" (O-bit) (least significant bit). This bit is
             set by the network element if there is no record
             placeholder available, i.e. the packet is already "full"
             of telemetry information.

        (2)  "Sealed" (S-bit). This bit instructs the network element
             to forward the packet WITHOUT embedding telemetry data,
             even if it matches the probe identification rules. This
             mechanism could be used to send "realistic" probes of
             arbitrary size after the network path associated with the
             combination of source/destination IP addresses and ports
             has been previously established. The network element must
             not inspect the "Telemetry Record Template" field for
             "sealed" probes.

   (4)  The "Message Type" field value could be either "1" ("Probe") or
        "2" ("Probe Reply").

   (5)  "Hop Limit" is defined only for "Message Type" of "1" ("Probe").
        For "Probe Reply" the "Hop Limit" field must be set to zero.
        This field is treated as an integer value and decremented by
        every network element in the path as the "Probe" propagates.
        See Section 4 for the intended use of the field.

   (6)  The "Sender's Handle" field is set by the sender to allow the
        receiver to identify a particular originator of probe packets.
        Along with "Sequence Number" it allows for tracking of packet
        order and loss within the network.

   (7)  The "Write Offset" field specifies the offset for the next
        telemetry record to be written in the probe packet body. It
        counts from the start of the packet body and must be initially
        set to the first octet after the "Record Template" field. It
        must be incremented by every network element that adds a
        telemetry record, without overflowing the storage. This
        simplifies the work for the subsequent network element - it just
        needs to parse the template and then add the data at the "Write
        Offset".

2.4.  Telemetry Record Template

   The following figure defines the "Record Template". This template
   uses type-length fields to describe the telemetry data records as
   added by network elements. The most significant bit in the "Type"
   field must be set to zero.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | TL record count (N) | Must Be Zero |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Type 1 | Length 1 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Type 2 | Length 2 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   .                                                               .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0| Type N | Length N |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 3: Record Template

2.5.  Telemetry Record

   This section defines the structure of a telemetry record. Every
   network element capable of reporting in-band telemetry data must add
   a record as defined in the "Record Template" to the probe packet.
   The new record must be inserted at the "Write Offset" position in
   the packet payload, with the "Write Offset" subsequently incremented
   by the size of the new record. The order of TLV elements must follow
   the order prescribed by the "Record Template" (Figure 3) carried in
   the probe packet. The most significant bit in the type field
   ("S-bit") must be set to "1" if the network element was able to
   understand and record the requested telemetry type. Otherwise, that
   bit and the contents of the "Value" field must be set to zero. The
   length field is the TLV field length including the "Type" and
   "Length" fields.

   If writing a new telemetry record to the packet body would cause it
   to exceed the packet size, no record is added and the overflow
   "O-bit" must be set to "1" in the probe header.
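
   The record insertion and overflow rules above can be illustrated
   with a short Python sketch. It is not part of the format definition
   and makes several assumptions that are not stated in this document:
   the "Probe" class and the "add_record" function are hypothetical
   names, the header fields are modelled as plain attributes rather
   than parsed from the wire, and the "Type" and "Length" fields are
   assumed to be 16 bits each, with the S-bit as the most significant
   bit of "Type".

```python
import struct
from dataclasses import dataclass

S_BIT = 0x8000  # ASSUMPTION: MSB of a 16-bit Type field


@dataclass
class Probe:
    """Hypothetical in-memory model of a probe, not a wire parser."""
    body: bytearray          # packet body, pre-allocated by the sender
    write_offset: int        # "Write Offset" header field
    template: list           # [(type, length), ...] from the Record Template
    overflow: bool = False   # "O-bit" header flag


def add_record(probe, values):
    """Append one telemetry record; `values` maps type -> value bytes.

    Returns True if a record was written, False on overflow.
    """
    record_size = sum(length for _, length in probe.template)
    if probe.write_offset + record_size > len(probe.body):
        probe.overflow = True        # no placeholder left: set the O-bit
        return False

    record = bytearray()
    for rec_type, length in probe.template:
        value_len = length - 4       # Length includes Type and Length (assumed 2+2 octets)
        value = values.get(rec_type)
        if value is not None and len(value) == value_len:
            # Supported type: set the S-bit and store the value.
            record += struct.pack("!HH", rec_type | S_BIT, length) + value
        else:
            # Unsupported type: S-bit and Value are left as zero.
            record += struct.pack("!HH", rec_type, length) + bytes(value_len)

    start = probe.write_offset
    probe.body[start:start + record_size] = record
    probe.write_offset += record_size
    return True
```

   In this model a network element that supports only the "Device ID"
   type would call add_record(probe, {1: device_id_bytes}); all other
   templated types are written with a cleared S-bit and a zeroed value,
   so every record keeps the size prescribed by the template.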
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type 1 | Length 1 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .                                                               .
   .                            Value 1                            .
   .                                                               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type 2 | Length 2 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .                                                               .
   .                            Value 2                            .
   .                                                               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   .                                                               .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type N | Length N |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .                                                               .
   .                            Value N                            .
   .                                                               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 4: Telemetry Record Format

3.  Telemetry Record Types

   This section defines some of the telemetry record types that could be
   supported by the network elements.

3.1.  Device Identifier

   This is used to identify the device reporting telemetry information.
   This document does not prescribe any specific identifier format.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type = 1 (Device ID) | Length = 12 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Device ID (1)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Device ID (2)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 5: Device Identifier

3.2.  Timestamp

   This telemetry record encodes the time that the packet enters and
   leaves the device, in UTC. The "entering" time is recorded when the
   L2 header enters the processing pipeline.
   The "exit" time is
   recorded when the network element starts serializing the L2 header
   on the egress port.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type = 10 (Timestamp) | Length = 28 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Receive Seconds                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Receive Microseconds                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Receive Nanoseconds                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Send Seconds                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Send Microseconds                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Send Nanoseconds                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                          Figure 6: Timestamp

3.3.  Queueing Delay

   This record encodes the amount of time that the frame has spent
   queued in the network element. It is only recorded if the packet has
   been queued, and defines the time spent in memory buffers. This
   could be helpful to detect queueing-related delays in the network.
   In case of cut-through switching operation this must be set to zero.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type = 11 (Queueing Delay) | Length = 16 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Seconds                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Microseconds                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Nanoseconds                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        Figure 7: Queueing Delay

3.4.  Ingress/Egress Port IDs

   This record stores the ingress and egress physical ports used to
   receive and send the packet, respectively. Here, "physical port"
   means a unit with actual MAC and PHY devices associated - not any
   logical subdivision based, for example, on protocol level tags (e.g.
   VLAN). The port identifiers are opaque, and defined as 32-bit
   entries.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type = 12 (Port IDs) | Length = 12 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Ingress Port ID                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Egress Port ID                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 8: Ingress/Egress Port IDs

3.5.  Forwarding Information

   Records defined in this section require the network element to store
   the forwarding information that was used to direct the packet to the
   next-hop. In a network that uses multiple forwarding plane
   implementations (e.g. IP and MPLS) the originator of the probe is
   required to populate the record template with all kinds of forwarding
   information it expects in the path. The network elements then
   populate the entries they know about, e.g. in an IPv4-only network
   the "IPv6 Route" record will be left unfilled, and so will the "MPLS
   Route" record.

3.5.1.  IPv6 Route

   This record stores the IPv6 route that has been used for packet
   forwarding. If not used, then the S-bit is set to zero, along with
   the value field.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type = 20 (IPv6 Route) | Length = 24 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ECMP group size | ECMP group index | Prefix Length |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       IPv6 Address (1)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       IPv6 Address (2)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       IPv6 Address (3)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       IPv6 Address (4)                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                          Figure 9: IPv6 Route

3.5.2.  IPv4 Route

   This record stores the IPv4 route that has been used for packet
   forwarding. If not used, then the S-bit is set to zero, along with
   the value field.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type = 21 (IPv4 Route) | Length = 12 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | ECMP group size | ECMP group index | Prefix Length |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         IPv4 Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 10: IPv4 Route

3.5.3.  MPLS Route

   This record stores the MPLS label mapping that has been used for
   packet forwarding. It is possible that the inbound or outbound label
   is set to zero, if it was not used (e.g. on ingress or egress of the
   domain). At the edge of an IP2MPLS or MPLS2IP domain it is expected
   that the device would fill in the "MPLS Route" telemetry record along
   with the corresponding "IPv6 Route" or "IPv4 Route" records.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S| Type = 22 (MPLS Route) | Length = 16 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Operation | ECMP group size | ECMP group index |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Must Be Zero | Incoming MPLS Label |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Must Be Zero | Outgoing MPLS Label |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 11: MPLS Route

   There are three MPLS operations defined:

      "1" - Push

      "2" - Pop

      "3" - Swap

4.  Operating in loopback mode

   In "loopback" mode the flow of probes is "turned back" at a given
   network element. The network element that "turns" packets around is
   identified using the "Hop Limit" field. The network element that
   receives a "Probe" type packet having a "Hop Limit" value of "1" is
   required to perform the following:

      Change the "Message Type" field to "Probe Reply" and set the "Hop
      Limit" to zero.

      Swap the destination/source addresses and port values in the
      IP/UDP headers of the probe packet.

      Add a telemetry record as required, using the newly built IP/UDP
      headers to determine forwarding information.

   This way, the original probe is routed back to the originator.
   Notice that the return path may be different from the path that the
   original probe has taken. This path will be recorded by the network
   elements as the reply is transported back to the sender. Using this
   technique one may progressively test a path until its breaking
   point. Unlike the traditional traceroute utility, however, the
   returning packets are the original probes, not ICMP messages.

5.  Processing Probe Packet

5.1.  Detecting a probe

   Since the probe looks like a regular UDP packet, the data-plane
   hardware needs a way to recognize it for special processing. This
   document does not prescribe a specific way to do that. For example,
   classification could be based on only the destination UDP port, or
   on more complex pattern matching techniques, e.g. matching on the
   contents of the "Probe Marker" fields.

6.  Non-Capable Devices

   Non-capable devices are those that cannot process a probe natively in
   the fast-path data plane. Further, there could be two types of such
   devices: those that can still process it via the control-plane
   software, and those that cannot. The control-plane processing
   should be triggered by use of the "Router-Alert" option for IPv4 or
   IPv6 packets (see [RFC2113] or [RFC2711]) added by the originator of
   the probe. A control-plane capable device is expected to interpret
   and fill in as much telemetry-record data as it possibly can, given
   its limited abilities.

   Network elements that are not capable of processing the data-plane
   probes are expected to perform regular packet forwarding. If a
   network element receives a packet with the router-alert option set,
   but has no special configuration to detect such probes, it should
   process it according to [RFC6398]. Absence of the router-alert
   option leaves devices that are not data-plane capable with the only
   option of processing the probe using traditional forwarding.

7.  Handling data-plane probes in the MPLS domain

   In general, the payload of an MPLS packet is opaque to the network
   element. However, in many cases the network element still performs a
   lookup beyond the MPLS label stack, e.g. to obtain information such
   as L4 ports for load balancing. It may be possible to perform
   data-plane probe classification in the same manner, additionally
   using the "Probe Marker" to distinguish the probe packets.
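
   As an illustration only, the port-plus-marker classification
   mentioned in Section 5.1 can be sketched in Python as follows. This
   document does not prescribe any classification method; the
   "is_probe" function is a hypothetical name, and the port and marker
   constants are simply the example values given in Section 2.3. The
   sketch assumes the UDP destination port and payload have already
   been located, whether directly or beyond an MPLS label stack.

```python
import struct

PROBE_PORT = 31337                 # example port from Section 2.3
PROBE_MARKERS = (0xDEAD, 0xBEEF)   # example marker values from Section 2.3


def is_probe(udp_dst_port, udp_payload):
    """Return True if the datagram matches the configured probe rule.

    Hypothetical helper: matches on the destination UDP port and on the
    two 32-bit "Probe Marker" fields that start the probe payload.
    """
    if udp_dst_port != PROBE_PORT or len(udp_payload) < 8:
        return False
    # Probe Marker (1) and (2): unsigned 32-bit integers, network byte order.
    return struct.unpack_from("!II", udp_payload, 0) == PROBE_MARKERS
```

   Matching on the marker in addition to the port reduces the chance
   that unrelated UDP traffic sent to the same port is treated as a
   probe.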
   In accordance with [RFC6178], Label Edge Routers (LERs) are required
   not to impose an MPLS router-alert label for packets carrying the
   router-alert option. It may be beneficial to enable such
   translation, so that an end-to-end validation could be performed if
   a control-plane capable MPLS network element is present on the
   probe's path.

8.  Multi-chip device considerations

   TBD

9.  IANA Considerations

   None

10.  Acknowledgements

   The author would like to thank L.J. Wobker and Changhoom Kim for
   reviewing and providing valuable comments for the initial version of
   this document.

11.  References

11.1.  Normative References

   [RFC2113]  Katz, D., "IP Router Alert Option", RFC 2113,
              DOI 10.17487/RFC2113, February 1997.

   [RFC2711]  Partridge, C. and A. Jackson, "IPv6 Router Alert Option",
              RFC 2711, DOI 10.17487/RFC2711, October 1999.

   [RFC6398]  Le Faucheur, F., Ed., "IP Router Alert Considerations and
              Usage", BCP 168, RFC 6398, DOI 10.17487/RFC6398, October
              2011.

   [RFC6178]  Smith, D., Mullooly, J., Jaeger, W., and T. Scholl, "Label
              Edge Router Forwarding of IPv4 Option Packets", RFC 6178,
              DOI 10.17487/RFC6178, March 2011.

11.2.  Informative References

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014.

Author's Address

   Petr Lapukhov
   Facebook
   1 Hacker Way
   Menlo Park, CA 94025
   US

   Email: petr@fb.com