NVO3 WG                                                      E. Nordmark
Internet-Draft                                                C. Appanna
Intended status: Standards Track                                   A. Lo
Expires: January 2, 2017                                 Arista Networks
                                                              S. Boutros
                                                                A. Dubey
                                                                  VMware
                                                                Jul 2016


     Layer-Transcending Traceroute for Overlay Networks like VXLAN
             draft-nordmark-nvo3-transcending-traceroute-03

Abstract

   Tools like traceroute have been very valuable for the operation of
   the Internet. Part of that value comes from being able to display
   information about routers and paths over which the user of the tool
   has no control, but the traceroute output can be passed along to
   someone else that can further investigate or fix the problem.

   In overlay networks such as VXLAN and NVGRE the prevailing view is
   that since the overlay network has no control of the underlay there
   need to be special tools and agreements to enable extracting traces
   from the underlay. We argue that enabling visibility into the
   underlay and using existing tools like traceroute has been overlooked
   and would add value in many deployments of overlay networks.

   This document specifies an approach that can be used to make
   traceroute transcend layers of encapsulation including details for
   how to apply this to VXLAN. The technique can be applied to other
   encapsulations used for overlay networks. It can also be implemented
   using current commercial silicon.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 2, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Solution Overview
   3.  Goals and Requirements
   4.  Definition Of Terms
   5.  Example Topologies
   6.  Controlling and selecting ttl behavior
   7.  Introducing a ttl copyin flag in the encapsulation header
   8.  Encapsulation Behavior
   9.  Decapsulating Behavior
   10. Other ICMP errors
   11. Downstream Egress Paths Object
   12. Security Considerations
   13. IANA Considerations
   14. Acknowledgements
   15. References
     15.1.  Normative References
     15.2.  Informative References
   Authors' Addresses

1.  Introduction

   Tools like traceroute have been very valuable for the operation of
   the Internet. Part of that value comes from being able to display
   information about routers and paths over which the user of the tool
   has no control, but the traceroute output can be passed along to
   someone else that can further investigate or fix the problem. The
   output of traceroute can be included in an email or a trouble ticket
   to report the problem. This provides a lot more information than the
   mere indication that A can't communicate with B, in particular when
   the failures are transient. The ping tool provides some of the same
   benefits in being able to return ICMP errors such as host unreachable
   messages.

   This document shows how those tools can be used to gather information
   for both the overlay and underlay parts of an end-to-end path by
   providing the option to have some packets use a uniform time-to-live
   (ttl) model for the tunnels, and associated ICMP error handling.
   These changes are limited to the tunnel ingress and egress points.

   The desire to make traceroute provide useful information for overlay
   networks is not an argument against also using a layered approach for
   OAM as specified in e.g., [I-D.tissa-lime-yang-oam-model]. Such
   approaches are quite appropriate for continuous monitoring at
   different layers and across different domains. A layer transcending
   traceroute complements the ability to do layered and/or continuous
   monitoring.

   The traceroute tool relies on receiving ICMP errors [RFC0792] in
   combination with using different IP time-to-live values. That
   results in the packet making it further and further towards the
   destination with ICMP ttl exceeded errors being received from each
   hop. That provides the user the working path even if the packets are
   black holed eventually, and also provides any errors like ICMP host
   unreachable. The fundamental assumption is that the ttl is
   decremented for each hop and that the resulting ICMP ttl exceeded
   errors are delivered back to the host.

   When some encapsulation is used to tunnel packets there is an
   architectural question of how those tunnels should be viewed by the
   rest of the network. Different models were first described for
   diffserv in [RFC2983], then applied to MPLS in [RFC3270], and
   expanded to MPLS ttl handling in [RFC3443]; those models also apply
   to other forms of direct or indirect IP in IP tunnels.
   Those RFCs define two models for ttl that are of interest to us:

   o  A pipe model, where the tunnel is invisible to the rest of the
      network in that it looks like a direct connection between the
      tunnel ingress and egress.

   o  A uniform model, where the ttl decrements uniformly for hops
      outside and inside the tunnel.

   The tunneling mechanisms discussed in NVO3 (such as VXLAN [RFC7348],
   NVGRE [I-D.sridharan-virtualization-nvgre], GENEVE
   [I-D.gross-geneve], and GUE [I-D.herbert-gue]) have either been
   specified to provide the pipe model of a tunnel or are silent on the
   setting of the outer ttl. Those protocols can be extended to have an
   optional uniform tunnel model when the payload is IP, following the
   same model as in [RFC3443]. Note that these encapsulations carry
   Ethernet frames and hence are not even aware that the payload is IP.
   However, IP is the bulk of what is carried over such tunnels and the
   ingress NVE can inspect the IP part of the Ethernet frame.

   However, for general application traffic the pipe model is fine and
   might even be expected by some applications. In general, when the
   source and destination IP are in the same IP subnet the ttl should
   not be decremented. Thus it makes sense to have a way to selectively
   enable the uniform model, perhaps based on some method to identify
   packets associated with traceroute or some marker in the packet
   itself that the traceroute tool can set.

2.  Solution Overview

   The pieces needed to accomplish this are:

   o  One or more ways to select the uniform model packets at the tunnel
      ingress.

   o  Tunnel ingress copying out the original ttl from a selected packet
      to the outer IP header, and then doing a check and decrement of
      that ttl.

   o  If that ttl check results in ttl expiry at the tunnel ingress,
      then deliver an ICMP ttl exceeded packet back to the host.

   o  A mechanism by which the tunnel egress knows which packets should
      have uniform model, for instance a bit in the encapsulation
      header.

   o  The tunnel egress copying in the ttl (for identified packets) from
      the outer header to the inner IP header, then doing a check and
      decrement of that ttl.

   o  If ttl check results in ttl expiry at the tunnel egress, then
      deliver an ICMP error back to the original host (or, perhaps
      better, to tunnel ingress the same way as underlay routers do).

   o  IP routers in the underlay will deliver any ICMP errors to the
      source IP address of the packet. For tunneled packets that will
      be the tunnel ingress. Hence the tunnel ingress needs to be able
      to take such ICMP errors and form corresponding ICMP errors that
      are sent back to the host. The requirement in [RFC1812] ensures
      that the ICMP errors will contain enough headers to form such an
      ICMP error. It has been noted that there are routers in the
      Internet which, decades later, still fail to conform to that
      aspect of [RFC1812].

   The idea to reflect (some) ICMP errors from inside a tunnel back to
   the original source goes back to IPv6 in IPv4 encapsulation as
   specified in [RFC1933] and [RFC2473]. However, those specifications
   did not advocate using a uniform ttl model for the tunnels but did
   handle ICMP packet too big and other unreachable messages. Those
   specifications describe how to reflect ICMP errors received from
   underlay routers to ICMP errors sent to the original host. The
   addition of handling ICMP ttl exceeded errors for uniform tunnel
   model is straightforward.

   The information carried in the ICMP errors is quite limited - the
   original packet plus an ICMP type and code. However, there are
   extension mechanisms specified in [RFC4884] and used for MPLS in
   [RFC4950] which include TLVs with additional information.
   If there is additional information to include for overlay networks,
   that information could be added by defining new ICMP Extension
   Objects based on [RFC4884]. An example of such an extension for ECMP
   information is included in this document.

3.  Goals and Requirements

   The following goals and requirements apply:

   o  No changes needed in the underlay.

   o  Optional changes on the decapsulating end.

   o  ECMP friendly. If the underlay employs equal cost multipath
      routing then one should be able to use this mechanism to trace the
      same path as a given TCP or UDP flow is using. In addition, one
      should be able to explore different ECMP paths by varying the IP
      addresses and port numbers in the packets originated by traceroute
      on the host.

   o  Provide output which makes it possible to compare a regular
      overlay traceroute with the layer-transcending output.

4.  Definition Of Terms

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Terms such as NVE and TS are used as specified in [RFC7365]:

   o  Network Virtualization Edge (NVE): An NVE is the network entity
      that sits at the edge of an underlay network and implements L2
      and/or L3 network virtualization functions.

   o  Tenant System (TS): A physical or virtual system that can play the
      role of a host or a forwarding element such as a router, switch,
      firewall, etc.

   o  Virtual Access Points (VAPs): A logical connection point on the
      NVE for connecting a Tenant System to a virtual network.

   o  Virtual Network (VN): A VN is a logical abstraction of a physical
      network that provides L2 or L3 network services to a set of Tenant
      Systems.

   o  Virtual Network Context (VN Context) Identifier: Field in an
      overlay encapsulation header that identifies the specific VN the
      packet belongs to.

   We use the VTEP term in [RFC7348] as synonymous with NVE, and VNI as
   synonymous with VN Context Identifier.

5.  Example Topologies

   The following example topologies illustrate different cases where we
   want a tracing capability. The examples are for overlay technologies
   such as VXLAN which provide a layer 2 overlay on IP. The cases for
   layer 3 overlay on top of IP are simpler and not shown in this
   document.

   The VXLAN term VTEP is used synonymously with NVO3's NVE term.

   -----------               -----------
   |   H1    |               |   H2    |
   | 1.0.1.1 |               | 1.0.1.2 |
   |         |               |         |
   -----------               -----------
        |                         |
        |                         |
   -----------  -----------  -----------
   |  VtepA  |  |   R1    |  |  VtepB  |
   | 2.0.1.1 |--| 2.0.1.2 |  | 2.0.2.1 |
   |         |  | 2.0.2.2 |--|         |
   -----------  -----------  -----------

                       Simple L2 overlay

   The figure above shows two hosts connected using an underlay which
   provides a layer two service. Thus H1 and H2 are in the same subnet
   and unaware of the existence of the underlay. As a result, a normal
   ping or traceroute would not be able to provide any information about
   the nature of a failure; either packets get through or they do not.
   When the packets get through, traceroute would output something like:

   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
    1  1.0.1.2 (1.0.1.2)  1.104 ms  1.235 ms  1.729 ms

   In this case it would be desirable to be able to traceroute from H1
   to H2 (and vice versa) and observe VtepA, R1, VtepB and H2.
   Thus in the case of packets getting through, traceroute would output:

   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
    1  2.0.1.1 (2.0.1.1)  1.104 ms  1.235 ms  1.729 ms
    2  2.0.1.2 (2.0.1.2)  2.106 ms  2.007 ms  2.156 ms
    3  2.0.2.1 (2.0.2.1)  35.034 ms  24.490 ms  21.626 ms
    4  1.0.1.2 (1.0.1.2)  40.830 ms  44.694 ms  75.620 ms

   Note that the underlay and overlay might exist in completely separate
   addressing domains. Thus H1 might not be able to reach any of the
   underlay addresses. And the underlay IP addresses might overlap the
   overlay IP addresses. For example, it would be completely valid for
   VtepA to have the same IP address as H1. The user of this tool needs
   to understand that the utility of the traceroute output is to get
   information to determine whether the issue is in the underlay or
   overlay, and be able to pass the underlay information to the
   operator of the underlay.

   In overlay networks without any ARP/ND optimizations ARP/ND packets
   would be flooded between the tunnel endpoints. Thus if there is some
   communication failure between H1 and H2, then H1 above might not have
   an ARP entry for H2. This results in traceroute not being able to
   output any data. This implies that in order to use traceroute to
   troubleshoot the issue one would need some workaround, such as
   installing some temporary ARP entries on the hosts.

   -----------               -----------  -----------  -----------
   |   H1    |               |   R2    |  |   R3    |  |   H4    |
   | 1.0.1.1 |               | 1.0.2.2 |--| 1.0.2.3 |  |         |
   |         |               | 1.0.1.2 |  | 1.0.3.3 |--| 1.0.3.4 |
   -----------               -----------  -----------  -----------
        |                         |
        |                         |
   -----------  -----------  -----------
   |  VtepA  |  |   R1    |  |  VtepB  |
   | 2.0.1.1 |--| 2.0.1.2 |  | 2.0.2.1 |
   |         |  | 2.0.2.2 |--|         |
   -----------  -----------  -----------

                 L2 overlay as part of larger network

   In the figure above, the next hop as seen by H1 is an overlay router.
   In this case a normal overlay traceroute would be able to display the
   overlay path i.e.,

   traceroute to H4, 30 hops max, 60 byte packets
    1  R2
    2  R3
    3  H4

   The layer-transcending traceroute would show the combination of the
   underlay and overlay paths i.e.,

   traceroute to H4, 30 hops max, 60 byte packets
    1  VtepA
    2  R1
    3  VtepB
    4  R2
    5  R3
    6  H4

   -----------             -------------------             -----------
   |   H1    |             |       R5        |             |   H6    |
   | 1.0.1.1 |             |                 |             |         |
   |         |             | 1.0.1.2 1.0.5.5 |             | 1.0.5.6 |
   -----------             |-----------------|             -----------
        |                  |    |       |    |                  |
        |                  |    |       |    |                  |
   ----------- ----------- |-----------------| ----------- -----------
   |  VtepA  | |   R1    | |  VtepB   VtepC  | |   R6    | |  VtepD  |
   | 2.0.1.1 |-| 2.0.1.2 | | 2.0.2.1 3.0.1.1 |-| 3.0.1.2 | |         |
   |         | | 2.0.2.2 |-|                 | | 3.0.2.2 |-| 3.0.3.1 |
   ----------- ----------- ------------------- ----------- -----------

                     Multiple L2 overlays in path

   The figure above has multiple overlay network segments that are
   connected in one router, which provides the tunnel endpoints for both
   overlay segments plus routing for the overlay. A more general
   picture would be to have an overlay routed path between the two NVEs
   e.g., VtepB and VtepC connected to different routers in the overlay.
   However, such a drawing in ASCII art doesn't fit on the page.

   A normal overlay traceroute in the above topology would show the
   overlay router i.e.,

   traceroute to H6, 30 hops max, 60 byte packets
    1  R5
    2  H6

   The layer-transcending traceroute would show the combination of the
   underlay and overlay paths i.e.,

   traceroute to H6, 30 hops max, 60 byte packets
    1  VtepA
    2  R1
    3  VtepB
    4  R5
    5  VtepC
    6  R6
    7  VtepD
    8  H6

   Note that the R5 device, which includes VtepB and VtepC, appears as
   three hops in the traceroute output. That is needed to be able to
   correlate the output with the overlay output which has R5.
   That correlation would be hard if the R5 device only appeared as
   VtepB in the layer-transcending traceroute for overlay networks
   (LTTON) output. The three-hop representation also stays invariant
   whether the NVEs and overlay router are implemented by a single
   device or multiple devices.

6.  Controlling and selecting ttl behavior

   The network admin needs to be able to control who can use the layer
   transcending traceroute, since the operator might not want to
   disclose the underlay topology to all its users all the time. There
   are different approaches for this such as designating particular
   ports (Virtual Access Points in NVO3 terminology) on an NVE to have
   uniform ttl tunnel model. We have found it useful to be able to
   enable this capability on a per port and/or virtual network basis, in
   addition to having a global setting per NVE.

   When enabled on the NVEs the user on the TS needs to be able to
   control which traffic is subject to which tunnel model. The normal
   traffic would use the pipe ttl tunnel model and only explicit trace
   applications are likely to want to use the uniform ttl tunnel model.
   Hence it makes sense to use some marker in the packets sent by the TS
   to select those packets for uniform model on the NVE. Such a
   mechanism should be usable so that the user can perform both a
   regular traceroute and a LTTON.

   Potentially different fields in the packets originated by traceroute
   on the TS can be used to mark the packets for uniform ttl tunnel
   model. However, many of those fields such as source and destination
   port numbers and protocol might be used in hashing for ECMP. The
   marking that can be used without impacting ECMP is the DSCP field in
   the packet. That field can be set with an option (--tos) in at least
   some existing traceroute implementations.

   Note that when DSCP is used for such marking it is a configured
   choice subject to agreement between the operator of the TS and NVE.
   The matching on the NVE should ignore the ECN bits so as not to
   interfere with ECN.

   However, the DSCP value used in the overlay might have an impact on
   the forwarding of the packets. In such a case one can use an
   alternative selector such as the UDP source port number. That has
   the downside of affecting the hash values used for ECMP and link
   aggregation port selection.

7.  Introducing a ttl copyin flag in the encapsulation header

   When this approach is applied to VXLAN [RFC7348] the decapsulating
   NVE has to be able to identify packets that have to be processed in
   the uniform ttl tunnel model way. For that purpose we define a new
   flag which is sent by the encapsulating NVE on selected packets, and
   is used by the decapsulating NVE to perform the ttl copyin, decrement
   and check.

   In addition to the one I-flag defined in [RFC7348] we define a new
   T-flag to capture the trace behavior at the decapsulating tunnel
   endpoint.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|R|R|R|I|R|R|T|            Reserved                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                VXLAN Network Identifier (VNI) |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   New fields:

   T-flag:  When set indicates that decapsulator should take the
            outer ttl and copy it to the inner ttl, and then check
            and decrement the resulting ttl.

8.  Encapsulation Behavior

   If the uniform ttl model is enabled for the input, and the received
   unencapsulated packet matches the selector, then the ingress NVE will
   perform these additional operations as part of encapsulating an IPv4
   or IPv6 packet:

   o  Examine the IPv4 TTL (or IPv6 hopcount, respectively) on receipt
      and if 1 or less, then drop the packet and send an ICMPv4 (or
      ICMPv6) ttl exceeded back to the original host.
      Since the NVE is operating on an L2 packet, it might not have any
      layer 3 interfaces or routes for the originating host. Thus it
      sends the ICMP error to the source L2 address of the packet, out
      the ingress port, without any IP address lookup.

   o  If ttl did not expire, then decrement the above ttl/hopcount and
      place it in the outer IP header. Encapsulate and send the packet
      as normal.

   o  If some other errors prevent sending the packet (such as unknown
      VN Context Id, no flood list configured), then the NVE SHOULD send
      an ICMP host unreachable back to the host.

   The ingress NVE will receive ICMP errors from underlay routers and
   the egress NVE, whether due to ttl exceeded or underlay issues such
   as host unreachable or packet too big errors. The NVE should take
   such errors, and in addition to any local syslog etc, generate an
   ICMP error sent back to the host. The principle for this is
   specified in [RFC1933] and [RFC2473]. Just like in those
   specifications, the inner and outer IP headers could be of
   different versions. A common case of that might be an IPv6 overlay
   with an IPv4 underlay. That case requires some changes in the ICMP
   type and code values in addition to recreating the packets. The
   place where LTTON differs from those specifications is that there is
   an NVO3 header and (for L2 over L3) an L2 header in the packet.

   The figures below show an example of ICMP header re-generation at
   VtepA for the case of IPv6 overlay with IPv4 underlay. The case of
   IPv4 over IPv4 is similar and simpler since the ICMP header is the
   same for both overlay and underlay. The example uses VXLAN
   encapsulation to provide the concrete details, but the approach
   applies to other NVO3 proposals.

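   As a non-normative illustration, the ingress steps listed in this
   section can be sketched in Python. The names used here
   (vxlan_header, ingress_uniform_ttl, I_FLAG, T_FLAG and the returned
   action strings) are hypothetical and are not defined by this
   document. The sketch covers only the ttl check, the copy-out of the
   decremented ttl to the outer header, and setting the T-flag at the
   bit position drawn in Section 7; it does not build or send real
   packets.

```python
import struct

I_FLAG = 0x08  # VNI-valid flag of RFC 7348 (bit 4 of the flags octet)
T_FLAG = 0x01  # T-flag as drawn in Section 7 (bit 7 of the flags octet)


def vxlan_header(vni: int, trace: bool) -> bytes:
    """Build the 8-octet VXLAN header, setting the T-flag for trace packets."""
    flags = I_FLAG | (T_FLAG if trace else 0)
    # First word: flags octet followed by 24 reserved bits.
    # Second word: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", flags << 24, (vni & 0xFFFFFF) << 8)


def ingress_uniform_ttl(inner_ttl: int, vni: int):
    """Apply the uniform-model ttl steps to a packet that matched the selector.

    Returns (action, outer_ttl, header). The action string is an
    illustrative stand-in for what a real NVE would do.
    """
    if inner_ttl <= 1:
        # ttl expired at the ingress NVE: drop the packet and report an
        # ICMP ttl exceeded to the source L2 address on the ingress port.
        return ("icmp_ttl_exceeded", None, None)
    # Otherwise the decremented ttl is copied out to the outer IP header
    # and the T-flag tells the egress NVE to copy it back in.
    return ("forward", inner_ttl - 1, vxlan_header(vni, trace=True))
```

   For example, ingress_uniform_ttl(5, 100) yields an outer ttl of 4
   and a header whose first octet has both the I-flag and T-flag set.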
             +--------------+
             | IPv4 Header  |
             | src = R1     |
             | dst = VtepA  |
             +--------------+
             |    ICMPv4    |
             |    Header    |
             | type = X     |
             | code = Y     |
      - -    +--------------+
             | IPv4 Header  |
             | src = VtepA  |
      IPv4   | dst = VtepB  |
             +--------------+
      Packet |     UDP      |
             | dst = VXLAN  |
      in     +--------------+
             |   Ethernet   |
      Error  | DA = H2 mac  |
             | SA = H1 mac  |
             +--------------+     - -
             |     IPv6     |
             | src = H1 ipv6|
             | dst = H2 ipv6|     Original IPv6
             +--------------+     Packet.
             |  Transport   |     Used to
             |    Header    |     generate an
             +--------------+     ICMPv6
             |              |     error message
             ~     Data     ~     back to the source.
             |              |
      - -    +--------------+     - -

          ICMPv4 Error Message Returned to Encapsulating Node

   The above underlay ICMPv4 is used to form an overlay ICMPv6 packet by
   extracting the Ethernet DA from the inner Ethernet SA, and forming an
   IPv6 header where the source address is based on the source address
   of the ICMPv4 error. The ICMPv6 type and code values are set based
   on the ICMPv4 type and code values.

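   A minimal Python sketch of this field derivation follows. It is
   illustrative only: the function names and the dictionary layout are
   hypothetical, and the type/code table is an assumption, since this
   document does not list the mapping. The values shown follow the
   convention of the IP/ICMP translation algorithm (RFC 7915) for a few
   common errors. The source address is formed as 96 zero bits followed
   by the IPv4 source address of the underlay error.

```python
import ipaddress

# Assumed (ICMPv4 type, code) -> (ICMPv6 type, code) mapping. This
# document does not give the values; these follow RFC 7915 conventions.
V4_TO_V6 = {
    (11, 0): (3, 0),  # time exceeded in transit -> hop limit exceeded
    (3, 0): (1, 0),   # net unreachable -> no route to destination
    (3, 1): (1, 0),   # host unreachable -> no route to destination
    (3, 4): (2, 0),   # fragmentation needed -> packet too big
}


def embed_ipv4(addr: str) -> ipaddress.IPv6Address:
    """Return the IPv6 address made of 96 zero bits plus the IPv4 address."""
    return ipaddress.IPv6Address(int(ipaddress.IPv4Address(addr)))


def overlay_error_fields(underlay_src: str, v4_type: int, v4_code: int,
                         inner_eth_sa: str) -> dict:
    """Derive the header fields of the ICMPv6 error sent to the overlay host."""
    v6_type, v6_code = V4_TO_V6[(v4_type, v4_code)]
    return {
        "eth_dst": inner_eth_sa,                 # inner Ethernet SA becomes DA
        "ipv6_src": embed_ipv4(underlay_src),    # e.g. R1's underlay address
        "icmpv6_type": v6_type,
        "icmpv6_code": v6_code,
    }
```

   A real implementation would also copy the original IPv6 packet from
   the ICMPv4 payload and, for packet too big, adjust the reported MTU
   for the encapsulation overhead; both are omitted here.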
             +--------------+
             |   Ethernet   |
             | DA = H1 mac  |     From ICMPv4 packet
             | SA = VtepA   |     in error
             +--------------+
             | IPv6 Header  |
             | src = ::R1   |     96 zeros + IPv4 address
             | dst = H1 ipv6|
             +--------------+
             |    ICMPv6    |
             |    Header    |
             | type = X'    |     Type and code mapped
             | code = Y'    |     from v4 to v6 values
      - -    +--------------+     - -
             |     IPv6     |
      IPv6   | src = H1 ipv6|
             | dst = H2 ipv6|     Unmodified from
      Packet +--------------+     ICMPv4 error
             |  Transport   |
      in     |    Header    |
             +--------------+
      Error  |              |
             ~     Data     ~
             |              |
      - -    +--------------+     - -

           Generated ICMPv6 Error Message for Overlay Source

   In the case of IPv6 over IPv4 the above example setting of the IPv6
   source address results in this type of traceroute output:

   traceroute to 2000:0:0:40::2, 30 hops max, 80 byte packets
    1  ::2.0.1.1 (::2.0.1.1)  1.231 ms  1.004 ms  1.126 ms
    2  ::2.0.1.2 (::2.0.1.2)  1.994 ms  2.301 ms  2.016 ms
    3  ::2.0.2.1 (::2.0.2.1)  18.846 ms  30.582 ms  19.776 ms
    4  2000:0:0:40::2 (2000:0:0:40::2)  48.964 ms  60.131 ms  53.895 ms

9.  Decapsulating Behavior

   If this uniform ttl model is enabled on the decapsulating NVE, and
   the overlay header indicates that uniform ttl model applies (the
   T-bit in the case of VXLAN), then the NVE will perform these
   additional operations as part of decapsulating a packet where the
   inner packet is an IPv4 or IPv6 packet:

   o  Examine the outer IPv4 TTL (or outer IPv6 hopcount, respectively)
      on receipt and if 1 or less, then drop the packet and send an
      outer ICMPv4 (or ICMPv6) ttl exceeded back to the source of the
      outer packet i.e., the ingress NVE. This ICMP packet should look
      the same as an ICMP error generated by an underlay router, and the
      requirement in [RFC1812] on the size of the packet in error
      applies.

   o  If ttl did not expire, then decrement the above ttl/hopcount and
      place it in the inner IP header.
      If the inner IP header is IPv4 then update the IPv4 header
      checksum. Then decapsulate and send the packet as for other
      decapsulated packets.

   o  If some other errors prevent sending the packet (such as unknown
      VN Context Id), then the NVE SHOULD send an ICMP host unreachable
      instead of a ttl exceeded error.

10.  Other ICMP errors

   The technique for selecting ttl behavior specified in this draft can
   also be used to trigger other ICMPv4 and ICMPv6 errors. For example,
   [RFC1933] specifies how ICMP packet too big errors from underlay
   routers can be used to report overlay ICMP packet too big errors to
   the original source. Other errors that are more specific to the
   overlay protocol might also be useful, such as not being able to find
   a VNI for the incoming port and VLAN, or not being able to flood the
   packet if the packet is a Broadcast, Unknown unicast, or Multicast
   packet.

11.  Downstream Egress Paths Object

   The Downstream Egress Paths Object MAY be appended to the ICMP Time
   Exceeded and Destination Unreachable messages. A single instance of
   the Downstream Egress Paths Object represents the egress paths at the
   router that sends the ICMP message. The Downstream Egress Paths
   Object must be preceded by an ICMP Extension Structure Header and an
   ICMP Object Header. Both are defined in [RFC4884]. The format
   closely follows [RFC4379] with some generalizations for Multipath
   types.

   Class-Num = TBA by IANA, Downstream Egress Paths Class

   C-Type = 1.

   If the replying router is the destination of the echo request, then a
   Downstream Egress Paths Object SHOULD NOT be included in the ICMP
   Error message. Otherwise the replying router MAY append a Downstream
   Egress Paths Object for all interfaces over which the echo request
   packet could be forwarded.

   The Object Length is the sum of K + M over the N egress paths (that
   is, K*N + M*N when K and M are the same for every path), where M is
   the Multipath Length of each egress path. M may differ between
   paths.
   Values for K are found in the description of Address Type below.

   The Downstream Egress Paths Object has the following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Path-1 MTU           |  Address Type |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Downstream IP Address (4 or 16 octets)             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Downstream Interface Address (4 or 16 octets)         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | MultipathType |        Multipath Length       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   .                     (Multipath Information)                   .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   ~                                                               ~
   ~                                                               ~
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Path-N MTU           |  Address Type |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Downstream IP Address (4 or 16 octets)             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Downstream Interface Address (4 or 16 octets)         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | MultipathType |        Multipath Length       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   .                                                               .
   .                     (Multipath Information)                   .
   .                                                               .
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Downstream Egress Paths Object

   Maximum Transmission Unit (MTU):
      The MTU is the size in octets of the largest IP frame that fits on
      the downstream interface.

   Address Type:
      The Address Type indicates if the interface is numbered or
      unnumbered. It also determines the length of the Downstream IP
      Address and Downstream Interface Address fields.
      The resulting total for
      The resulting total for the initial part of one path of the
      Downstream Egress Paths Object is listed in the table below as
      "K Octets".

      The Address Type is set to one of the following values:

         Type #   Address Type      K Octets
         ------   ---------------   --------
            1     IPv4 Numbered        16
            2     IPv4 Unnumbered      16
            3     IPv6 Numbered        40
            4     IPv6 Unnumbered      28

   Downstream IP Address and Downstream Interface Address:
      IPv4 addresses and interface indices are encoded in 4 octets; IPv6
      addresses are encoded in 16 octets.

      If the interface to the downstream router has a unique IP address
      (e.g., it is numbered and not a LAG), then the Address Type MUST
      be set to IPv4 Numbered or IPv6 Numbered, the Downstream IP
      Address MUST be set to either the downstream router's Router ID or
      the interface address of the downstream router, and the Downstream
      Interface Address MUST be set to the downstream router's interface
      address.

      If the interface to the downstream router does not have a unique
      IP address (e.g., it is unnumbered or a LAG), the Address Type
      MUST be IPv4 Unnumbered or IPv6 Unnumbered, the Downstream IP
      Address MUST be the downstream router's Router ID or the interface
      address of the downstream router, and the Downstream Interface
      Address MUST be set to the index assigned by the upstream router
      to the interface.

   Multipath Type:
      The following Multipath Types are defined:

         Key   Type                Multipath Information
         ---   -----------------   -----------------------------------
          0    no multipath        Empty (Multipath Length = 0)
          1    MAC SA/DA           Inner MAC in tunnel payload
          2    IP Src/Dest         Inner IP src/dest in tunnel payload
          3    L4 src port         L4 src ports in tunnel payload
          4    L4 src port range   low/high L4 src port pairs

      Type 0 indicates that all packets will be forwarded out this one
      interface.

      Types 1 through 4 specify that the supplied Multipath Information
      will serve to exercise this path.

   Multipath Length:
      The length in octets of the Multipath Information.

   Multipath Information:
      The Multipath Information encodes the values, such as L4 source
      ports, that will exercise this path.  Its contents depend on the
      Multipath Type and are shown in the table above.  For Type 4,
      ranges indicated by L4 source port pairs MUST NOT overlap and MUST
      be in ascending sequence.

12.  Security Considerations

   The considerations in [I-D.ietf-nvo3-security-requirements] apply.

   In addition, the use of the uniform ttl tunnel model will result in
   ICMP errors being generated by underlay routers and consumed by NVEs.
   That presents an attack vector which does not exist in a pipe ttl
   tunnel model.  However, ICMP errors should be rate limited [RFC1812].
   Implementations should also take appropriate measures to rate limit
   incoming ICMP errors that are processed by limited CPU resources.

   Some implementations might handle the trace packets (with uniform ttl
   model) in software while the pipe ttl model packets can be handled in
   hardware.  In such a case the implementation should have mechanisms
   to avoid starvation of limited CPU resources due to these packets.

13.  IANA Considerations

   TBD

14.  Acknowledgements

   The authors acknowledge the helpful comments from David Black and
   Diego Garcia del Rio.

15.  References

15.1.  Normative References

   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
              RFC 792, DOI 10.17487/RFC0792, September 1981.

   [RFC1812]  Baker, F., Ed., "Requirements for IP Version 4 Routers",
              RFC 1812, DOI 10.17487/RFC1812, June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C.
              Wright, "Virtual eXtensible Local Area Network (VXLAN): A
              Framework for Overlaying Virtualized Layer 2 Networks over
              Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348,
              August 2014.

   [RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y.
              Rekhter, "Framework for Data Center (DC) Network
              Virtualization", RFC 7365, DOI 10.17487/RFC7365,
              October 2014.

15.2.  Informative References

   [I-D.gross-geneve]
              Gross, J., Sridhar, T., Garg, P., Wright, C., Ganga, I.,
              Agarwal, P., Duda, K., Dutt, D., and J. Hudson, "Geneve:
              Generic Network Virtualization Encapsulation",
              draft-gross-geneve-02 (work in progress), October 2014.

   [I-D.herbert-gue]
              Herbert, T., Yong, L., and O. Zia, "Generic UDP
              Encapsulation", draft-herbert-gue-03 (work in progress),
              March 2015.

   [I-D.ietf-nvo3-security-requirements]
              Hartman, S., Zhang, D., Wasserman, M., Qiang, Z., and M.
              Zhang, "Security Requirements of NVO3",
              draft-ietf-nvo3-security-requirements-07 (work in
              progress), June 2016.

   [I-D.sridharan-virtualization-nvgre]
              Garg, P. and Y. Wang, "NVGRE: Network Virtualization using
              Generic Routing Encapsulation",
              draft-sridharan-virtualization-nvgre-08 (work in
              progress), April 2015.

   [I-D.tissa-lime-yang-oam-model]
              Senevirathne, T., Finn, N., Kumar, D., Salam, S., Wu, Q.,
              and Z. Wang, "Generic YANG Data Model for Operations,
              Administration, and Maintenance (OAM)",
              draft-tissa-lime-yang-oam-model-06 (work in progress),
              August 2015.

   [RFC1933]  Gilligan, R. and E. Nordmark, "Transition Mechanisms for
              IPv6 Hosts and Routers", RFC 1933, DOI 10.17487/RFC1933,
              April 1996.

   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
              December 1998.

   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
              RFC 2983, DOI 10.17487/RFC2983, October 2000.

   [RFC3270]  Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen,
              P., Krishnan, R., Cheval, P., and J. Heinanen,
              "Multi-Protocol Label Switching (MPLS) Support of
              Differentiated Services", RFC 3270, DOI 10.17487/RFC3270,
              May 2002.

   [RFC3443]  Agarwal, P. and B. Akyol, "Time To Live (TTL) Processing
              in Multi-Protocol Label Switching (MPLS) Networks",
              RFC 3443, DOI 10.17487/RFC3443, January 2003.

   [RFC4379]  Kompella, K. and G. Swallow, "Detecting Multi-Protocol
              Label Switched (MPLS) Data Plane Failures", RFC 4379,
              DOI 10.17487/RFC4379, February 2006.

   [RFC4884]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro,
              "Extended ICMP to Support Multi-Part Messages", RFC 4884,
              DOI 10.17487/RFC4884, April 2007.

   [RFC4950]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro, "ICMP
              Extensions for Multiprotocol Label Switching", RFC 4950,
              DOI 10.17487/RFC4950, August 2007.

Authors' Addresses

   Erik Nordmark
   Arista Networks
   Santa Clara, CA
   USA

   Email: nordmark@arista.com


   Chandra Appanna
   Arista Networks
   Santa Clara, CA
   USA

   Email: achandra@arista.com


   Alton Lo
   Arista Networks
   Santa Clara, CA
   USA

   Email: altonlo@arista.com


   Sami Boutros
   VMware

   Email: sboutros@vmware.com


   Ankur Dubey
   VMware

   Email: adubey@vmware.com
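
   As a non-normative illustration of the Downstream Egress Paths Object
   described in Section 11, the following Python sketch encodes path
   entries and computes the resulting payload length.  Several details
   are assumptions rather than statements of the draft: the field
   widths (16-bit MTU, 8-bit Address Type, 8-bit Multipath Type, 16-bit
   Multipath Length, 8-bit Reserved fields) are inferred from the
   "K Octets" table and the packet diagram, the Type 4 port pairs are
   assumed to be 16-bit values in network byte order, and the function
   names encode_path and encode_port_ranges are hypothetical.  The
   [RFC4884] extension and object headers are not produced.

```python
import ipaddress
import struct

# "K Octets" per Address Type, taken from the table in Section 11.
K_OCTETS = {1: 16, 2: 16, 3: 40, 4: 28}


def encode_port_ranges(ranges):
    """Build Type 4 Multipath Information from (low, high) port pairs.

    Assumption: each port is a 16-bit value in network byte order.
    """
    out = b""
    prev_high = -1
    for low, high in ranges:
        # Ranges MUST NOT overlap and MUST be in ascending sequence.
        if low > high or low <= prev_high:
            raise ValueError("port ranges must be ascending and disjoint")
        out += struct.pack("!HH", low, high)
        prev_high = high
    return out


def encode_path(mtu, addr_type, downstream_ip, downstream_if,
                mp_type=0, mp_info=b""):
    """Encode one egress path entry (K fixed octets plus M octets)."""
    ip = ipaddress.ip_address(downstream_ip).packed
    if addr_type in (1, 3):
        # Numbered: the interface field carries an IP address.
        ifc = ipaddress.ip_address(downstream_if).packed
    else:
        # Unnumbered: the interface field carries a 4-octet index.
        ifc = struct.pack("!I", downstream_if)
    # Assumed widths: MTU 16 bits, Address Type 8 bits, Reserved 8 bits.
    head = struct.pack("!HBB", mtu, addr_type, 0)
    # Assumed widths: Multipath Type 8, Multipath Length 16, Reserved 8.
    mp_head = struct.pack("!BHB", mp_type, len(mp_info), 0)
    fixed = head + ip + ifc + mp_head
    if len(fixed) != K_OCTETS[addr_type]:
        raise ValueError("address family does not match Address Type")
    return fixed + mp_info


paths = [
    # IPv4 Numbered, no multipath: K = 16, M = 0.
    encode_path(1500, 1, "192.0.2.1", "192.0.2.2"),
    # IPv6 Unnumbered with two port ranges: K = 28, M = 8.
    encode_path(9000, 4, "2001:db8::1", 7, mp_type=4,
                mp_info=encode_port_ranges([(49152, 50000),
                                            (50001, 65535)])),
]
payload = b"".join(paths)
# Sum of (K + M) over all paths: (16 + 0) + (28 + 8) = 52 octets.
print(len(payload))
```

   Because K depends on the Address Type and M depends on the Multipath
   Information of each path, the sketch sums K + M per path rather than
   multiplying a single K and M by the number of paths; the two agree
   when all paths share the same Address Type and Multipath Length.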