opsawg                                                       P. Lapukhov
Internet-Draft                                                  Facebook
Intended status: Standards Track                                R. Chang
Expires: December 12, 2016                             Barefoot Networks
                                                           June 10, 2016


           Data-plane probe for in-band telemetry collection
                   draft-lapukhov-dataplane-probe-01

Abstract

   Detecting and isolating network faults in IP networks has
   traditionally been done using tools like ping and traceroute (see
   [RFC7276]) or more complex systems built on similar concepts of
   active probing and path tracing.  While active synthetic probes have
   proven helpful in detecting data-plane faults, isolating the fault
   location is a much harder problem, especially in diverse networks
   with multiple active forwarding planes (e.g. IP and MPLS).  Moreover,
   existing end-to-end tools do not generally support functionality
   beyond dealing with packet loss - for example, they are hardly useful
   for detecting and reporting transient (i.e. milli- or even
   micro-second) network congestion.

   Modern network forwarding hardware can allow for more sophisticated
   data-plane functionality that provides substantial improvement to the
   isolation and identification capabilities of network elements.  For
   example, it has become possible to encode a snapshot of a network
   element's state within the packet payload as it transits the device.
   One example of such state would be queue depth on the egress port
   taken by that specific packet.  When combined with a unique device
   identifier embedded in the same packet, this could allow for precise
   time and topological identification of the congested location within
   the network.

   This document proposes a format for requesting and embedding
   telemetry information in active probes, i.e. packets designated for
   actively testing the network while not carrying application traffic.
   These active probes could be conveyed over multiple protocols (ICMP,
   UDP, TCP, etc.) and the document does not prescribe any particular
   transport.  In addition, this document provides recommendations on
   handling the active probes by devices that do not support the
   required data-plane functionality.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 12, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Data plane probe
     2.1.  Probe transport
     2.2.  Probe structure
     2.3.  Header Format
     2.4.  Telemetry Data Frame and Telemetry Data Records
   3.  Telemetry Record Types
     3.1.  Device Identifier
     3.2.  Timestamp
     3.3.  Queueing Delay
     3.4.  Ingress/Egress Port IDs
     3.5.  Opaque State Snapshot
   4.  Operating in loopback mode
   5.  Processing Probe Packet
     5.1.  Detecting a probe
   6.  Non-Capable Devices
   7.  Handling data-plane probes in the MPLS domain
   8.  Multi-chip device considerations
   9.  IANA Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   Detecting and isolating faults in IP networks may involve multiple
   tools and approaches, but by far the two most popular utilities used
   by operators are ping and traceroute.  The ping utility provides the
   basic end-to-end connectivity check by sending a special ICMP packet.
   There are other variants of ping that work using TCP or UDP probes,
   but may require a special responder application (for UDP) on the
   other end of the probed connection.  This type of active probing
   approach has its limitations.  First, it operates end-to-end, so it
   is impossible to tell where in the path the fault has happened from
   simply observing the packet loss ratios.  Second, in multipath (ECMP)
   scenarios it can be difficult to fully and/or deterministically
   exercise all the possible paths connecting two end-points.

   The traceroute utility has multiple variants as well - UDP, ICMP and
   TCP based, for instance, and a special variant for MPLS LSP testing.
   Practically all variants follow the same model of operation: varying
   the TTL field setting in outgoing probes and analyzing the returned
   ICMP messages, such as Time Exceeded and unreachable.  This does
   allow isolating the fault down to the IP hop that is losing packets,
   but has its own limitations.  As with the ping utility, it becomes
   complicated to explore all possible ECMP paths in the network.  This
   is especially problematic in large Clos fabric topologies that are
   very common in large data-center networks.  Next, many network
   devices limit the rate of outgoing ICMP messages as well as the rate
   of "exception" packets "punted" to the control plane processor.  This
   puts a functional limit on the packet rate at which traceroute can
   probe a given hop, and hence impacts the resolution and time to
   isolate a fault.  Lastly, the treatment of these control packets is
   often different from that of packets taking the regular forwarding
   path: the latter are normally not redirected to the control plane
   processor and are handled purely in the data-plane hardware.

   Modern network processing elements (both hardware and software based)
   are capable of packet handling beyond basic forwarding and simple
   header modifications.
   Of special interest is the ability to capture and embed instantaneous
   state from the network element and encode this state directly into
   the transit packet.  One example would be to record the transit
   device's name, ingress and egress port identifiers, queueing delays,
   timestamps and so on.  By collecting this state along each network
   device in the path, it becomes straightforward to trace a probe's
   path through the network as well as record transit device
   characteristics.  Extending this model, one could build a tool that
   combines the useful properties of ping and traceroute using a single
   packet flight through the network, without the constraints of control
   plane (aka "slow path") processing.

   To aid in the development of such tooling, this document defines a
   format for requesting and embedding telemetry information in the body
   of active probing packets.

2.  Data plane probe

   This section defines the structure of the active data-plane probe.

2.1.  Probe transport

   This document does not prescribe any specific encapsulation for the
   data-plane probe.  For example, the probe could be embedded inside a
   UDP packet, or within an IPv6 extension header.

2.2.  Probe structure

   The probe consists of a fixed-size "Header" and an arbitrary number
   of variable-length "telemetry data frames" following the header.
   Each frame, in turn, consists of multiple "telemetry record" fields
   defined below in this document.
   Records are added according to the telemetry information requested in
   the header.

   +---------------------------------------------------------+
   |                         Header                          |
   +---------------------------------------------------------+
   |                 Telemetry data frame N                  |
   +---------------------------------------------------------+
   |                Telemetry data frame N-1                 |
   +---------------------------------------------------------+
   .                                                         .
   .                                                         .
   .                                                         .
   +---------------------------------------------------------+
   |                 Telemetry data frame 1                  |
   +---------------------------------------------------------+

                         Figure 1: Probe layout

   Notice that the first frame is at the end of the packet.  For
   efficient hardware implementation, new frames are pushed onto the
   stack at each hop.  This eliminates the need for the transit network
   elements to inspect the full packet and lets the packet grow as long
   as the MTU permits.

2.3.  Header Format

   The probe payload starts with a fixed-size header.  The header
   identifies the packet as a data-plane probe packet, and encodes basic
   information shared by all telemetry records.
   +---------------------------------------------------------+
   |                    Probe Marker (1)                     |
   +---------------------------------------------------------+
   |                    Probe Marker (2)                     |
   +---------------------------------------------------------+
   |     Version      |   Message Type   |       Flags       |
   +---------------------------------------------------------+
   |                Telemetry Request Vector                 |
   +---------------------------------------------------------+
   |    Hop Limit     |    Hop Count     |   Must Be Zero    |
   +---------------------------------------------------------+
   |       Maximum Length       |       Current Length       |
   +---------------------------------------------------------+
   |      Sender's Handle       |      Sequence Number       |
   +---------------------------------------------------------+

                        Figure 2: Header Format

   (1)   The "Probe Marker" fields are arbitrary 32-bit values generally
         used by the network elements to identify the packet as a probe
         packet.  These fields should be interpreted as unsigned integer
         values, stored in network byte order.  For example, a network
         element may be configured to recognize a UDP packet destined to
         port 31337 and having 0xDEAD 0xBEEF as the values in the "Probe
         Marker" fields as an active probe, and treat it accordingly.

   (2)   "Version Number" is currently set to 1.

   (3)   The "Message Type" field value is either "1" ("Probe") or "2"
         ("Probe Reply").

   (4)   The "Flags" field is 8 bits, and defines the following flag:

   (5)   "Overflow" (O-bit) (least significant bit).  This bit is set by
         the network element if the number of records in the packet is
         at the maximum limit as specified by the packet: i.e. the
         packet is already "full" of telemetry information.

   (6)   "Telemetry Request Vector" is a 32-bit long field that requests
         well-known inband telemetry information from the network
         elements on the path.  A bit set in this vector translates to a
         request for a particular type of information.  The following
         types/bits are currently defined, starting with the least
         significant bit:

         (1)  Bit 0: Device identifier.

         (2)  Bit 1: Timestamp.

         (3)  Bit 2: Queueing delay.

         (4)  Bit 3: Ingress/Egress port identifiers.

         (5)  Bit 31: Opaque state snapshot request.

   (7)   "Hop Limit" is defined only for "Message Type" of "1"
         ("Probe").  For "Probe Reply" the "Hop Limit" field must be set
         to zero.  This field is treated as an integer value
         representing the number of network elements.  See Section 4
         for the intended use of the field.

   (8)   The "Hop Count" field specifies the number of capable network
         elements the packet has transited so far.  It begins with zero
         and must be incremented by one for every network element that
         adds a telemetry record.  Combined with a push mechanism, this
         simplifies the work for the subsequent network element and the
         packet receiver.  The subsequent network element just needs to
         insert new record(s) immediately after the header.

   (9)   The "Max Length" field ("Maximum Length" in Figure 2) specifies
         the maximum length of the telemetry payload in bytes.  Given
         that the sender knows the minimum path MTU, the sender can set
         the maximum number of payload bytes allowed before exceeding
         the MTU.  Thus, a simple comparison between "Current Length"
         and "Max Length" is enough to decide whether or not data can be
         added.

   (10)  The "Current Length" field specifies the current length of data
         stored in the probe.  This field is incremented by each network
         element by the number of bytes it has added with the telemetry
         data frame.
   (11)  The "Sender's Handle" field is set by the sender to allow the
         receiver to identify a particular originator of probe packets.
         Along with "Sequence Number" it allows for tracking of packet
         order and loss within the network.

2.4.  Telemetry Data Frame and Telemetry Data Records

   Each telemetry data frame is constructed by concatenating multiple
   telemetry data records, per the request in the "Telemetry Request
   Vector" field of the data-plane probe header.  The frame starts with
   a 16-bit length field, which reflects the frame size in bytes,
   excluding the length of the field itself.  Following the "Frame
   Length" field is a "Telemetry Response Vector" field: this vector
   identifies the records the network element was capable of recording
   in the frame.  The body of the frame is constructed by appending
   records corresponding to every bit set in "Telemetry Response
   Vector".  All of the records, except the one requested by bit 31
   ("Opaque State Snapshot"), are fixed size, with their lengths defined
   in Section 3.  The order of the records in the frame follows the
   order of the bits in the "Telemetry Request Vector" (also reflected
   in "Telemetry Response Vector").  Finally, if requested, a variable-
   length field is appended at the end of the frame, with the length
   field occupying the first 8 bits.  This "length" field reflects the
   length of the opaque data excluding the length field itself.

   If inserting a new telemetry record would cause "Current Length" to
   exceed "Max Length", no record is added and the overflow "O-bit" must
   be set to "1" in the probe header.
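
   The frame construction and the overflow rule described in this
   section can be sketched in Python as follows.  This is an
   illustrative sketch, not part of the format definition.  The names
   build_frame and push_frame and the "records" mapping are
   hypothetical.  The sketch also assumes that "Frame Length" counts
   every byte after the 16-bit length field (the must-be-zero field,
   the response vector and the records), that all fields use network
   byte order, and that the 8-bit opaque length covers all bytes that
   follow it; the text above leaves these points open.

```python
import struct

OPAQUE_BIT = 31  # request vector bit 31: opaque state snapshot


def build_frame(request_vector, records, opaque=None):
    """Build one telemetry data frame (illustrative layout assumptions).

    records maps a bit position (0..30) to the already-encoded
    fixed-size record this element is able to supply.
    opaque is the optional byte string that follows the 8-bit length.
    """
    response_vector = 0
    body = b""
    # Records follow the bit order of the request vector, lowest first.
    for bit in range(OPAQUE_BIT):
        if request_vector & (1 << bit) and bit in records:
            response_vector |= 1 << bit
            body += records[bit]
    if request_vector & (1 << OPAQUE_BIT) and opaque is not None:
        if len(opaque) > 255:
            raise ValueError("opaque data does not fit an 8-bit length")
        response_vector |= 1 << OPAQUE_BIT
        # The length octet excludes itself.
        body += struct.pack("!B", len(opaque)) + opaque
    # Assumption: Frame Length = must-be-zero (2) + vector (4) + records.
    frame_length = 2 + 4 + len(body)
    return struct.pack("!HHI", frame_length, 0, response_vector) + body


def push_frame(frames, current_length, max_length, frame):
    """Push a frame in front of earlier frames, or signal overflow.

    Returns (frames, current_length, overflow_bit).
    """
    if current_length + len(frame) > max_length:
        return frames, current_length, 1  # nothing added, O-bit set
    return frame + frames, current_length + len(frame), 0
```

   In this sketch a record that was requested but cannot be supplied is
   simply absent from both the frame body and the response vector, so a
   receiver can walk the response vector to locate each record.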
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Frame Length          |         Must be Zero          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Telemetry Response Vector                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .               Fixed Size Field 0 (if requested)               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .               Fixed Size Field 1 (if requested)               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~                              ...                              ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   .              Fixed Size Field 30 (if requested)               .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Length     |                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   .             Opaque State Snapshot (if requested)              .
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Figure 3: Telemetry Frame Format

3.  Telemetry Record Types

   This section defines some of the telemetry record types that could be
   supported by the network elements.

3.1.  Device Identifier

   This record is used to identify the device reporting telemetry
   information.  This document does not prescribe any specific
   identifier format.  In general, it is expected to be configured by
   the operator.  The length of this record is 32 bits.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Device ID                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 4: Device Identifier

3.2.  Timestamp

   This telemetry record encodes the time data associated with the
   packet.  Most existing hardware supports timestamping for IEEE 1588.
   To leverage existing hardware capabilities, the packet receive time
   is stored in a similar way, as 48 bits of seconds and 32 bits of
   nanoseconds, and the residence time is stored as 48 bits of
   nanoseconds.  The length of this record is 128 bits.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Receive Seconds [47:16]                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Receive Seconds [15:0]     |  Receive Nanoseconds [31:16]  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Receive Nanoseconds [15:0]   |    Residence Time [47:32]     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Residence Time [31:0]                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                          Figure 5: Timestamp

3.3.  Queueing Delay

   This record encodes the amount of time that the frame has spent
   queued in the network element.  This is only recorded if the packet
   has been queued, and defines the time spent in memory buffers.  This
   could be helpful to detect queueing-related delays in the network.
   If the queueing delay exceeds the maximum that the 31-bit nanosecond
   counter can hold (2^31 nanoseconds, slightly more than 2 seconds),
   the network element must set the overflow "O-bit" shown in Figure 6.
   In case of cut-through switching operation the delay must be set to
   zero.  The length of this record is 32 bits.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |O|                         Nanoseconds                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        Figure 6: Queueing Delay

3.4.  Ingress/Egress Port IDs

   This record stores the ingress and egress physical ports used to
   receive and send the packet, respectively.  Here, "physical port"
   means a unit with actual MAC and PHY devices associated - not any
   logical subdivision based, for example, on protocol level tags (e.g.
   VLAN).  The port identifiers are opaque, and defined as 16-bit
   entries.  For example, those could be the corresponding SNMP ifIndex
   values.  The length of this record is 32 bits.
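
   The fixed-size records of Sections 3.1 through 3.4 can be encoded
   along the following lines.  This is a minimal Python sketch under
   stated assumptions: the function names are hypothetical, every field
   is assumed to be in network byte order (the document states this
   only for the "Probe Marker" fields), the O-bit is taken to be the
   most significant bit of the queueing delay word, and the delay value
   is clamped to its maximum on overflow, a behaviour this document
   does not specify.

```python
import struct

MAX_DELAY_NS = (1 << 31) - 1  # largest value of the 31-bit delay field


def encode_device_id(device_id):
    # Section 3.1: one 32-bit opaque identifier.
    return struct.pack("!I", device_id)


def encode_timestamp(rx_seconds, rx_nanoseconds, residence_ns):
    # Section 3.2: 48-bit seconds, 32-bit nanoseconds, 48-bit residence
    # time in nanoseconds, 128 bits in total.
    return (rx_seconds.to_bytes(6, "big")
            + rx_nanoseconds.to_bytes(4, "big")
            + residence_ns.to_bytes(6, "big"))


def encode_queueing_delay(delay_ns):
    # Section 3.3: O-bit plus 31-bit nanosecond value; zero is used for
    # cut-through switching.
    if delay_ns > MAX_DELAY_NS:
        word = (1 << 31) | MAX_DELAY_NS  # assumed: clamp and set O-bit
    else:
        word = delay_ns
    return struct.pack("!I", word)


def encode_port_ids(ingress, egress):
    # Section 3.4: two opaque 16-bit port identifiers.
    return struct.pack("!HH", ingress, egress)
```

   A receiver can reverse these encodings with int.from_bytes and
   struct.unpack using the same field widths.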
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Ingress Port ID        |        Egress Port ID         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 7: Ingress/Egress Port IDs

3.5.  Opaque State Snapshot

   This record has variable size.  It allows the network element to
   store arbitrary state in the probe, without a pre-defined schema.
   The schema needs to be made known to the analyzer by some out-of-band
   means.  The 16-bit "Schema Id" field in the record lets the analyzer
   know which particular schema to use.  This ID is expected to be
   configured on the network element by the network operator.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Length     |           Schema Id           |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                          Opaque Data                          |
   ~                                                               ~
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 8: Opaque State

4.  Operating in loopback mode

   In "loopback" mode the flow of probes is "turned back" at some
   network element.  The network element that "turns" packets around is
   identified using the "Hop Limit" field.  The network element that
   receives a "Probe" type packet having a "Hop Limit" value equal to
   "Hop Count" is required to perform the following:

      Change the "Message Type" field to "Probe Reply", and set the "Hop
      Limit" field to zero, as required for "Probe Reply" messages.

      Swap the destination/source IP addresses in the transport header
      to send the packet back to the originator.

      Add a new telemetry data frame corresponding to the new forwarding
      information.

   This way, the original probe is routed back to the originator.
   Notice that the return path may be different from the path that the
   original probe has taken.  This path will be recorded by the network
   elements as the reply is transported back to the sender.
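
   The turn-around decision described above can be sketched as follows.
   This Python sketch is illustrative only: the dictionary
   representation of a probe, its key names and the function name
   maybe_turn_around are hypothetical.  It assumes that "Hop Limit" is
   compared against "Hop Count" as received, before this element adds
   its own frame, which this document does not state explicitly.

```python
PROBE, PROBE_REPLY = 1, 2  # "Message Type" values from Section 2.3


def maybe_turn_around(probe):
    """Turn a probe into a reply if this element is the loopback point.

    probe is a dict with the illustrative keys message_type, hop_limit,
    hop_count, src and dst.  Returns True when the probe was turned
    around; the caller then adds its telemetry data frame and forwards
    the packet towards the new destination.
    """
    if probe["message_type"] != PROBE:
        return False  # replies are never turned around again
    if probe["hop_limit"] != probe["hop_count"]:
        return False  # not the designated loopback element
    probe["message_type"] = PROBE_REPLY
    probe["hop_limit"] = 0  # must be zero in a "Probe Reply"
    probe["src"], probe["dst"] = probe["dst"], probe["src"]
    return True
```

   Sending probes with increasing "Hop Limit" values then exercises the
   path one capable element at a time.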
   Using this technique one may progressively test a path until its
   breaking point.  If a network element is incapable of redirecting
   packets back to the originator, another option would be exporting
   those packets to a network analyzer device, using some sort of
   encapsulation header.

5.  Processing Probe Packet

5.1.  Detecting a probe

   As mentioned previously, a combination of techniques needs to be used
   to differentiate the active probes.  This may include, but should not
   be limited to, matching the known position of the "Probe Marker"
   fields.

6.  Non-Capable Devices

   Non-capable devices are those that cannot process a probe natively in
   the fast-path data plane.  Further, there could be two types of such
   devices: those that can still process it via the control-plane
   software, and those that cannot.

   The control-plane processing should be triggered by use of the
   "Router-Alert" option for IPv4 or IPv6 packets (see [RFC2113] or
   [RFC2711]) added by the originator of the probe.  A control-plane
   capable device is expected to interpret and fill in as much telemetry
   record data as it can, given its limited abilities.

   Network elements that are not capable of processing the data-plane
   probes are expected to perform regular packet forwarding.  If a
   network element receives a packet with the router-alert option set,
   but has no special configuration to detect such probes, it should
   process it according to [RFC6398].

   Absence of the router-alert option leaves the devices that are not
   data-plane capable with the only option of processing the probe using
   traditional forwarding.

7.  Handling data-plane probes in the MPLS domain

   In general, the payload of an MPLS packet is opaque to the network
   element.  However, in many cases the network element still performs a
   lookup beyond the MPLS label stack, e.g. to obtain information such
   as L4 ports for load balancing.  It may be possible to perform
   data-plane probe classification in the same manner, additionally
   using the "Probe Marker" to distinguish the probe packets.

   In accordance with [RFC6178], Label Edge Routers (LERs) are required
   not to impose an MPLS router-alert label for packets carrying the
   router-alert option.  It may be beneficial to enable such
   translation, so that an end-to-end validation could be performed if a
   control-plane capable MPLS network element is present on the probe's
   path.

8.  Multi-chip device considerations

   TBD

9.  IANA Considerations

   None

10.  References

10.1.  Normative References

   [RFC2113]  Katz, D., "IP Router Alert Option", RFC 2113,
              DOI 10.17487/RFC2113, February 1997,
              <http://www.rfc-editor.org/info/rfc2113>.

   [RFC2711]  Partridge, C. and A. Jackson, "IPv6 Router Alert Option",
              RFC 2711, DOI 10.17487/RFC2711, October 1999,
              <http://www.rfc-editor.org/info/rfc2711>.

   [RFC6398]  Le Faucheur, F., Ed., "IP Router Alert Considerations and
              Usage", BCP 168, RFC 6398, DOI 10.17487/RFC6398, October
              2011, <http://www.rfc-editor.org/info/rfc6398>.

   [RFC6178]  Smith, D., Mullooly, J., Jaeger, W., and T. Scholl, "Label
              Edge Router Forwarding of IPv4 Option Packets", RFC 6178,
              DOI 10.17487/RFC6178, March 2011,
              <http://www.rfc-editor.org/info/rfc6178>.

10.2.  Informative References

   [RFC7276]  Mizrahi, T., Sprecher, N., Bellagamba, E., and Y.
              Weingarten, "An Overview of Operations, Administration,
              and Maintenance (OAM) Tools", RFC 7276,
              DOI 10.17487/RFC7276, June 2014,
              <http://www.rfc-editor.org/info/rfc7276>.

Authors' Addresses

   Petr Lapukhov
   Facebook
   1 Hacker Way
   Menlo Park, CA  94025
   US

   Email: petr@fb.com


   Remy Chang
   Barefoot Networks
   2185 Park Boulevard
   Palo Alto, CA  94306
   US

   Email: remy@barefootnetworks.com