NVO3 WG                                                      E. Nordmark
Internet-Draft                                                C. Appanna
Intended status: Standards Track                                   A. Lo
Expires: April 3, 2016                                   Arista Networks
                                                                Oct 2015


     Layer-Transcending Traceroute for Overlay Networks like VXLAN
             draft-nordmark-nvo3-transcending-traceroute-01

Abstract

   Tools like traceroute have been very valuable for the operation of
   the Internet.  Part of that value comes from being able to display
   information about routers and paths over which the user of the tool
   has no control, but the traceroute output can be passed along to
   someone else who can further investigate or fix the problem.

   In overlay networks such as VXLAN and NVGRE the prevailing view is
   that since the overlay network has no control of the underlay there
   need to be special tools and agreements to enable extracting traces
   from the underlay.  We argue that enabling visibility into the
   underlay and using existing tools like traceroute has been overlooked
   and would add value in many deployments of overlay networks.
   This document specifies an approach that can be used to make
   traceroute transcend layers of encapsulation, including details for
   how to apply this to VXLAN.  The technique can be applied to other
   encapsulations used for overlay networks.  It can also be implemented
   using current commercial silicon.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 3, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Solution Overview
   3.  Goals and Requirements
   4.  Definition Of Terms
   5.  Example Topologies
   6.  Controlling and selecting ttl behavior
   7.  Introducing a ttl copyin flag in the encapsulation header
   8.  Encapsulation Behavior
   9.  Decapsulating Behavior
   10. Other ICMP errors
   11. Security Considerations
   12. IANA Considerations
   13. Acknowledgements
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Authors' Addresses

1.  Introduction

   Tools like traceroute have been very valuable for the operation of
   the Internet.  Part of that value comes from being able to display
   information about routers and paths over which the user of the tool
   has no control, but the traceroute output can be passed along to
   someone else who can further investigate or fix the problem.  The
   output of traceroute can be included in an email or a trouble ticket
   to report the problem.  This provides a lot more information than the
   mere indication that A can't communicate with B, in particular when
   the failures are transient.  The ping tool provides some of the same
   benefits in being able to return ICMP errors such as host unreachable
   messages.

   This document shows how those tools can be used to gather information
   for both the overlay and underlay parts of an end-to-end path by
   providing the option to have some packets use a uniform time-to-live
   (ttl) model for the tunnels, and associated ICMP error handling.
   These changes are limited to the tunnel ingress and egress points.

   The desire to make traceroute provide useful information for overlay
   networks is not an argument against also using a layered approach for
   OAM as specified in e.g., [I-D.tissa-lime-yang-oam-model].  Such
   approaches are quite appropriate for continuous monitoring at
   different layers and across different domains.  A layer-transcending
   traceroute complements the ability to do layered and/or continuous
   monitoring.

   The traceroute tool relies on receiving ICMP errors [RFC0792] in
   combination with using different IP time-to-live values.  That
   results in the packet making it further and further towards the
   destination with ICMP ttl exceeded errors being received from each
   hop.  That provides the user the working path even if the packets are
   black holed eventually, and also provides any errors like ICMP host
   unreachable.  The fundamental assumption is that the ttl is
   decremented for each hop and that the resulting ICMP ttl exceeded
   errors are delivered back to the host.

   When some encapsulation is used to tunnel packets there is an
   architectural question of how those tunnels should be viewed from the
   rest of the network.  Different models were described first for
   diffserv in [RFC2983], then applied to MPLS in [RFC3270], and
   expanded to MPLS ttl handling in [RFC3443]; those models apply to
   other forms of direct or indirect IP in IP tunnels.  Those RFCs
   define two models for ttl that are of interest to us:

   o  A pipe model, where the tunnel is invisible to the rest of the
      network in that it looks like a direct connection between the
      tunnel ingress and egress.

   o  A uniform model, where the ttl decrements uniformly for hops
      outside and inside the tunnel.
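
   As an illustration only, the difference between the two models can
   be sketched in Python.  The names TtlModel and ttl_after_tunnel are
   hypothetical, and the sketch assumes that under the uniform model the
   tunnel ingress, each underlay router, and the tunnel egress each
   count as one hop; [RFC3443] remains the reference for the models
   themselves.

```python
from enum import Enum
from typing import Optional


class TtlModel(Enum):
    PIPE = "pipe"
    UNIFORM = "uniform"


def ttl_after_tunnel(inner_ttl: int, underlay_routers: int,
                     model: TtlModel) -> Optional[int]:
    """Return the inner ttl seen after the tunnel egress.

    Returns None when the ttl expires before the packet leaves the
    tunnel (assumption of this sketch: ingress, every underlay router,
    and egress each decrement the ttl once in the uniform model).
    """
    if model is TtlModel.PIPE:
        # The tunnel looks like a direct connection: a layer 2 overlay
        # leaves the inner ttl untouched.
        return inner_ttl
    hops = underlay_routers + 2  # ingress + underlay routers + egress
    remaining = inner_ttl - hops
    return remaining if remaining > 0 else None
```

   With one underlay router, a packet sent with ttl 3 expires before
   leaving the tunnel under the uniform model, while under the pipe
   model it passes through with its ttl unchanged.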

   The tunneling mechanisms discussed in NVO3 (such as VXLAN [RFC7348],
   NVGRE [I-D.sridharan-virtualization-nvgre], GENEVE
   [I-D.gross-geneve], and GUE [I-D.herbert-gue]) have either been
   specified to provide the pipe model of a tunnel or are silent on the
   setting of the outer ttl.  Those protocols can be extended to have an
   optional uniform tunnel model when the payload is IP, following the
   same model as in [RFC3443].  Note that these encapsulations carry
   Ethernet frames and hence are not even aware that the payload is IP.
   However, IP is the bulk of what is carried over such tunnels and the
   ingress NVE can inspect the IP part of the Ethernet frame.

   For general application traffic, however, the pipe model is fine and
   might even be expected by some applications.  In general, when the
   source and destination IP are in the same IP subnet the ttl should
   not be decremented.  Thus it makes sense to have a way to selectively
   enable the uniform model, perhaps based on some method to identify
   packets associated with traceroute or some marker in the packet
   itself that the traceroute tool can set.

2.  Solution Overview

   The pieces needed to accomplish this are:

   o  One or more ways to select the uniform model packets at the tunnel
      ingress.

   o  Tunnel ingress copying out the original ttl from a selected packet
      to the outer IP header, and then doing a check and decrement of
      that ttl.

   o  If that ttl check results in ttl expiry at the tunnel ingress,
      then deliver an ICMP ttl exceeded packet back to the host.

   o  A mechanism by which the tunnel egress knows which packets should
      have uniform model, for instance a bit in the encapsulation
      header.

   o  The tunnel egress copying in the ttl (for identified packets) from
      the outer header to the inner IP header, then doing a check and
      decrement of that ttl.

   o  If the ttl check results in ttl expiry at the tunnel egress, then
      deliver an ICMP error back to the original host (or, perhaps
      better, to the tunnel ingress the same way as underlay routers
      do).

   o  IP routers in the underlay will deliver any ICMP errors to the
      source IP address of the packet.  For tunneled packets that will
      be the tunnel ingress.  Hence the tunnel ingress needs to be able
      to take such ICMP errors and form corresponding ICMP errors that
      are sent back to the host.  The requirement in [RFC1812] ensures
      that the ICMP errors will contain enough headers to form such an
      ICMP error.

   The idea to reflect (some) ICMP errors from inside a tunnel back to
   the original source goes back to IPv6 in IPv4 encapsulation as
   specified in [RFC1933] and [RFC2473].  However, those specifications
   did not advocate using a uniform ttl model for the tunnels but did
   handle ICMP packet too big and other unreachable messages.  They
   specify how to reflect ICMP errors received from underlay routers to
   ICMP errors sent to the original host.  The addition of handling ICMP
   ttl exceeded errors for the uniform tunnel model is straightforward.

   The information carried in the ICMP errors is quite limited: the
   original packet plus an ICMP type and code.  However, there are
   extension mechanisms specified in [RFC4884] and used for MPLS in
   [RFC4950] which include TLVs with additional information.  If there
   is additional information to include for overlay networks, that
   information could be added by defining new ICMP Extension Objects
   based on [RFC4884].  Such extensions are for further study.

3.  Goals and Requirements

   The following goals and requirements apply:

   o  No changes needed in the underlay.

   o  Optional changes on the decapsulating end.

   o  ECMP friendly.
      If the underlay employs equal cost multipath routing then one
      should be able to use this mechanism to trace the same path as a
      given TCP or UDP flow is using.  In addition, one should be able
      to explore different ECMP paths by varying the IP addresses and
      port numbers in the packets originated by traceroute on the host.

   o  Provide output which makes it possible to compare a regular
      overlay traceroute with the layer-transcending output.

4.  Definition Of Terms

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Terms such as NVE and TS are used as specified in [RFC7365]:

   o  Network Virtualization Edge (NVE): An NVE is the network entity
      that sits at the edge of an underlay network and implements L2
      and/or L3 network virtualization functions.

   o  Tenant System (TS): A physical or virtual system that can play the
      role of a host or a forwarding element such as a router, switch,
      firewall, etc.

   o  Virtual Access Points (VAPs): A logical connection point on the
      NVE for connecting a Tenant System to a virtual network.

   o  Virtual Network (VN): A VN is a logical abstraction of a physical
      network that provides L2 or L3 network services to a set of Tenant
      Systems.

   o  Virtual Network Context (VN Context) Identifier: Field in an
      overlay encapsulation header that identifies the specific VN the
      packet belongs to.

   We use the term VTEP from [RFC7348] as synonymous with NVE, and VNI
   as synonymous with VN Context Identifier.

5.  Example Topologies

   The following example topologies illustrate different cases where we
   want a tracing capability.  The examples are for overlay technologies
   such as VXLAN which provide a layer 2 overlay on IP.
   The cases for layer 3 overlay on top of IP are simpler and not shown
   in this document.

   The VXLAN term VTEP is used as synonymous with NVO3's NVE term.

   -----------                -----------
   |   H1    |                |   H2    |
   | 1.0.1.1 |                | 1.0.1.2 |
   |         |                |         |
   -----------                -----------
        |                          |
        |                          |
   -----------   -----------  -----------
   |  VtepA  |   |   R1    |  |  VtepB  |
   | 2.0.1.1 | --| 2.0.1.2 |  | 2.0.2.1 |
   |         |   | 2.0.2.2 |--|         |
   -----------   -----------  -----------

                     Simple L2 overlay

   The figure above shows two hosts connected using an underlay which
   provides a layer two service.  Thus H1 and H2 are in the same subnet
   and unaware of the existence of the underlay, and a normal ping or
   traceroute would not be able to provide any information about the
   nature of a failure; either packets get through or they do not.  When
   the packets get through, traceroute would output something like:

   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
    1  1.0.1.2 (1.0.1.2)  1.104 ms  1.235 ms  1.729 ms

   In this case it would be desirable to be able to traceroute from H1
   to H2 (and vice versa) and observe VtepA, R1, VtepB and H2.  Thus in
   the case of packets getting through, traceroute would output:

   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
    1  2.0.1.1 (2.0.1.1)  1.104 ms  1.235 ms  1.729 ms
    2  2.0.1.2 (2.0.1.2)  2.106 ms  2.007 ms  2.156 ms
    3  2.0.2.1 (2.0.2.1)  35.034 ms  24.490 ms  21.626 ms
    4  1.0.1.2 (1.0.1.2)  40.830 ms  44.694 ms  75.620 ms

   Note that the underlay and overlay might exist in completely separate
   addressing domains.  Thus H1 might not be able to reach any of the
   underlay addresses, and the underlay IP addresses might overlap the
   overlay IP addresses.  For example, it would be completely valid for
   VtepA to have the same IP address as H1.
   The user of this tool needs to understand that the utility of the
   traceroute output is to get information to determine whether the
   issue is in the underlay or overlay, and to be able to pass the
   underlay information to the operator of the underlay.

   In overlay networks without any ARP/ND optimizations, ARP/ND packets
   would be flooded between the tunnel endpoints.  Thus if there is some
   communication failure between H1 and H2, then H1 above might not have
   an ARP entry for H2.  This results in traceroute not being able to
   output any data.  This implies that in order to use traceroute to
   troubleshoot the issue one would need some workaround, such as
   installing some temporary ARP entries on the hosts.

   -----------                -----------  -----------  -----------
   |   H1    |                |   R2    |  |   R3    |  |   H4    |
   | 1.0.1.1 |                | 1.0.2.2 |--| 1.0.2.3 |  |         |
   |         |                | 1.0.1.2 |  | 1.0.3.3 |--| 1.0.3.4 |
   -----------                -----------  -----------  -----------
        |                          |
        |                          |
   -----------   -----------  -----------
   |  VtepA  |   |   R1    |  |  VtepB  |
   | 2.0.1.1 | --| 2.0.1.2 |  | 2.0.2.1 |
   |         |   | 2.0.2.2 |--|         |
   -----------   -----------  -----------

                L2 overlay as part of larger network

   The figure above has an overlay router as the next hop seen by H1.
   In this case a normal overlay traceroute would be able to display the
   overlay path i.e.,

   traceroute to H4, 30 hops max, 60 byte packets
    1  R2
    2  R3
    3  H4

   The layer-transcending traceroute would show the combination of the
   underlay and overlay paths i.e.,

   traceroute to H4, 30 hops max, 60 byte packets
    1  VtepA
    2  R1
    3  VtepB
    4  R2
    5  R3
    6  H4

   -----------             -------------------             -----------
   |   H1    |             |       R5        |             |   H6    |
   | 1.0.1.1 |             |                 |             |         |
   |         |             | 1.0.1.2 1.0.5.5 |             | 1.0.5.6 |
   -----------             |-----------------|             -----------
        |                  |    |       |    |                  |
        |                  |    |       |    |                  |
   ----------- ----------- |-----------------| ----------- -----------
   |  VtepA  | |   R1    | |  VtepB   VtepC  | |   R6    | |  VtepD  |
   | 2.0.1.1 |-| 2.0.1.2 | | 2.0.2.1 3.0.1.1 |-| 3.0.1.2 | |         |
   |         | | 2.0.2.2 |-|                 | | 3.0.2.2 |-| 3.0.3.1 |
   ----------- ----------- ------------------- ----------- -----------

                     Multiple L2 overlays in path

   The figure above has multiple overlay network segments that are
   connected in one router, which provides the tunnel endpoints for both
   overlay segments plus routing for the overlay.  A more general
   picture would be to have an overlay routed path between the two NVEs
   e.g., VtepB and VtepC connected to different routers in the overlay.
   However, such a drawing in ASCII art doesn't fit on the page.

   A normal overlay traceroute in the above topology would show the
   overlay router i.e.,

   traceroute to H6, 30 hops max, 60 byte packets
    1  R5
    2  H6

   The layer-transcending traceroute would show the combination of the
   underlay and overlay paths i.e.,

   traceroute to H6, 30 hops max, 60 byte packets
    1  VtepA
    2  R1
    3  VtepB
    4  R5
    5  VtepC
    6  R6
    7  VtepD
    8  H6

   Note that the R5 device, which includes VtepB and VtepC, appears as
   three hops in the traceroute output.  That is needed to be able to
   correlate the output with the overlay output, which has R5.  That
   correlation would be hard if the R5 device only appeared as VtepB in
   the layer-transcending traceroute (LTTON) output.
   The three-hop representation also stays invariant whether the NVEs
   and overlay router are implemented by a single device or multiple
   devices.

6.  Controlling and selecting ttl behavior

   The network admin needs to be able to control who can use the layer-
   transcending traceroute, since the operator might not want to
   disclose the underlay topology to all its users all the time.  There
   are different approaches for this, such as designating particular
   ports (Virtual Access Points in NVO3 terminology) on an NVE to have
   the uniform ttl tunnel model.  We have found it useful to be able to
   enable this capability on a per port and/or virtual network basis, in
   addition to having a global setting per NVE.

   When enabled on the NVEs, the user on the TS needs to be able to
   control which traffic is subject to which tunnel model.  The normal
   traffic would use the pipe ttl tunnel model and only explicit trace
   applications are likely to want to use the uniform ttl tunnel model.
   Hence it makes sense to use some marker in the packets sent by the TS
   to select those packets for uniform model on the NVE.  Such a
   mechanism should be usable so that the user can perform both a
   regular traceroute and an LTTON.

   Potentially different fields in the packets originated by traceroute
   on the TS can be used to mark the packets for the uniform ttl tunnel
   model.  However, many of those fields, such as source and destination
   port numbers and protocol, might be used in hashing for ECMP.  The
   marking that can be used without impacting ECMP is the DSCP field in
   the packet.  That field can be set with an option (--tos) in at least
   some existing traceroute implementations.

   Note that when DSCP is used for such marking it is a configured
   choice subject to agreement between the operator of the TS and NVE.
   The matching on the NVE should ignore the ECN bits so as not to
   interfere with ECN.
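
   A minimal sketch of such a DSCP-based selector follows.  The DSCP
   value 43 and the function names are hypothetical and chosen only for
   illustration; the sketch relies solely on the DSCP occupying the
   upper six bits of the IPv4 TOS (or IPv6 Traffic Class) byte and the
   ECN field occupying the lower two.

```python
# Hypothetical DSCP value agreed between the TS and NVE operators.
TRACE_DSCP = 43


def tos_for_dscp(dscp: int) -> int:
    """Return the TOS / Traffic Class byte for this DSCP, ECN bits zero."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a six-bit value")
    return dscp << 2


def selects_uniform_model(tos_byte: int, trace_dscp: int = TRACE_DSCP) -> bool:
    """Match on the six DSCP bits only, ignoring the two ECN bits."""
    return ((tos_byte & 0xFF) >> 2) == trace_dscp
```

   Under these assumptions the TS would pass the value returned by
   tos_for_dscp() to the traceroute --tos option, and the NVE would
   apply selects_uniform_model() to each received packet regardless of
   how the ECN bits are set.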

   However, the DSCP value used in the overlay might have an impact on
   the forwarding of the packets.  In such a case one can use an
   alternative selector such as the UDP source port number.  That has
   the downside of affecting the hash values used for ECMP and link
   aggregation port selection.

7.  Introducing a ttl copyin flag in the encapsulation header

   When this approach is applied to VXLAN [RFC7348] the decapsulating
   NVE has to be able to identify packets that have to be processed in
   the uniform ttl tunnel model way.  For that purpose we define a new
   flag which is sent by the encapsulating NVE on selected packets, and
   is used by the decapsulating NVE to perform the ttl copyin, decrement
   and check.

   In addition to the one I-flag defined in [RFC7348] we define a new
   T-flag to capture the trace behavior at the decapsulating tunnel
   endpoint.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|R|R|R|I|R|R|T|            Reserved                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                VXLAN Network Identifier (VNI) |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   New fields:

   T-flag:  When set, indicates that the decapsulator should take the
            outer ttl and copy it to the inner ttl, and then check
            and decrement the resulting ttl.

8.  Encapsulation Behavior

   If the uniform ttl model is enabled for the input, and the received
   naked (unencapsulated) packet matches the selector, then the ingress
   NVE will perform these additional operations as part of encapsulating
   an IPv4 or IPv6 packet:

   o  Examine the IPv4 TTL (or IPv6 hopcount, respectively) on receipt
      and if 1 or less, then drop the packet and send an ICMPv4 (or
      ICMPv6) ttl exceeded back to the original host.
      Since the NVE is operating on an L2 packet, it might not have any
      layer 3 interfaces or routes for the originating host.  Thus it
      sends the ICMP error to the source L2 address of the packet, back
      out the ingress port, without any IP address lookup.

   o  If ttl did not expire, then decrement the above ttl/hopcount and
      place it in the outer IP header.  Encapsulate and send the packet
      as normal.

   o  If some other errors prevent sending the packet (such as unknown
      VN Context Id, no flood list configured), then the NVE SHOULD send
      an ICMP host unreachable back to the host.

   The ingress NVE will receive ICMP errors from underlay routers and
   the egress NVE, whether due to ttl exceeded or underlay issues such
   as host unreachable or packet too big errors.  The NVE should take
   such errors and, in addition to any local syslog etc., generate an
   ICMP error sent back to the host.  The principle for this is
   specified in [RFC1933] and [RFC2473].  Just like in those
   specifications, the inner and outer IP headers could be of different
   versions.  A common case of that might be an IPv6 overlay with an
   IPv4 underlay.  That case requires some changes in the ICMP type and
   code values in addition to recreating the packets.  The place where
   LTTON differs from those specifications is that there is an NVO3
   header and (for L2 over L3) an L2 header in the packet.

   The figures below show an example of ICMP header re-generation at
   VtepA for the case of IPv6 overlay with IPv4 underlay.  The case of
   IPv4 over IPv4 is similar and simpler since the ICMP header is the
   same for both overlay and underlay.  The example uses VXLAN
   encapsulation to provide the concrete details, but the approach
   applies to other NVO3 proposals.

                     +--------------+
                     | IPv4 Header  |
                     | src = R1     |
                     | dst = VtepA  |
                     +--------------+
                     |    ICMPv4    |
                     |    Header    |
                     | type = X     |
                     | code = Y     |
              - -    +--------------+
                     | IPv4 Header  |
                     | src = VtepA  |
             IPv4    | dst = VtepB  |
                     +--------------+
             Packet  |     UDP      |
                     | dst = VXLAN  |
              in     +--------------+
                     |   Ethernet   |
             Error   | DA = H2 mac  |
                     | SA = H1 mac  |
                     +--------------+     - -
                     |     IPv6     |
                     | src = H1 ipv6|
                     | dst = H2 ipv6|   Original IPv6
                     +--------------+   Packet.
                     |  Transport   |   Used to
                     |    Header    |   generate an
                     +--------------+   ICMPv6
                     |              |   error message
                     ~     Data     ~   back to the source.
                     |              |
              - -    +--------------+     - -

          ICMPv4 Error Message Returned to Encapsulating Node

   The above underlay ICMPv4 is used to form an overlay ICMPv6 packet by
   taking the Ethernet DA from the inner Ethernet SA, and forming an
   IPv6 header where the source address is based on the source address
   of the ICMPv4 error.  The ICMPv6 type and code values are set based
   on the ICMPv4 type and code values.
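
   The address and type/code handling can be sketched as follows.  This
   is an illustration under stated assumptions, not a normative mapping:
   the function names are hypothetical, the IPv6 source is assumed to be
   96 zero bits followed by the IPv4 source of the ICMPv4 error, and the
   three table entries are a plausible subset chosen for this sketch
   (this document does not enumerate the type and code mapping).

```python
import ipaddress
from typing import Optional, Tuple

# Assumed (type, code) mapping from ICMPv4 to ICMPv6 for this sketch.
ICMP4_TO_ICMP6 = {
    (11, 0): (3, 0),  # ttl exceeded in transit -> hop limit exceeded
    (3, 1): (1, 3),   # host unreachable -> address unreachable
    (3, 4): (2, 0),   # fragmentation needed -> packet too big
}


def overlay_icmp_source(underlay_src: str) -> ipaddress.IPv6Address:
    """Form the ICMPv6 source: 96 zero bits followed by the IPv4 address."""
    v4 = ipaddress.IPv4Address(underlay_src)
    return ipaddress.IPv6Address(int(v4))


def map_type_code(icmp4_type: int, icmp4_code: int) -> Optional[Tuple[int, int]]:
    """Return the ICMPv6 (type, code), or None if this sketch has no mapping."""
    return ICMP4_TO_ICMP6.get((icmp4_type, icmp4_code))
```

   An error with no entry in the table returns None here; a real
   implementation would need a policy for such errors (for example,
   logging them locally without generating an overlay ICMP error).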

                     +--------------+
                     |   Ethernet   |
                     | DA = H1 mac  |   From ICMPv4 packet
                     | SA = VtepA   |   in error
                     +--------------+
                     | IPv6 Header  |
                     | src = ::R1   |   96 zeros + IPv4 address
                     | dst = H1 ipv6|
                     +--------------+
                     |    ICMPv6    |
                     |    Header    |
                     | type = X'    |   Type and code mapped
                     | code = Y'    |   from v4 to v6 values
              - -    +--------------+     - -
                     |     IPv6     |
             IPv6    | src = H1 ipv6|
                     | dst = H2 ipv6|   Unmodified from
             Packet  +--------------+   ICMPv4 error
                     |  Transport   |
              in     |    Header    |
                     +--------------+
             Error   |              |
                     ~     Data     ~
                     |              |
              - -    +--------------+     - -

           Generated ICMPv6 Error Message for Overlay Source

   In the case of IPv6 over IPv4 the above example setting of the IPv6
   source address results in this type of traceroute output:

   traceroute to 2000:0:0:40::2, 30 hops max, 80 byte packets
    1  ::2.0.1.1 (::2.0.1.1)  1.231 ms  1.004 ms  1.126 ms
    2  ::2.0.1.2 (::2.0.1.2)  1.994 ms  2.301 ms  2.016 ms
    3  ::2.0.2.1 (::2.0.2.1)  18.846 ms  30.582 ms  19.776 ms
    4  2000:0:0:40::2 (2000:0:0:40::2)  48.964 ms  60.131 ms  53.895 ms

9.  Decapsulating Behavior

   If this uniform ttl model is enabled on the decapsulating NVE, and
   the overlay header indicates that the uniform ttl model applies (the
   T-flag in the case of VXLAN), then the NVE will perform these
   additional operations as part of decapsulating a packet where the
   inner packet is an IPv4 or IPv6 packet:

   o  Examine the outer IPv4 TTL (or outer IPv6 hopcount, respectively)
      on receipt and if 1 or less, then drop the packet and send an
      outer ICMPv4 (or ICMPv6) ttl exceeded back to the source of the
      outer packet i.e., the ingress NVE.  This ICMP packet should look
      the same as an ICMP error generated by an underlay router, and the
      requirement in [RFC1812] on the size of the packet in error
      applies.

   o  If ttl did not expire, then decrement the above ttl/hopcount and
      place it in the inner IP header.
      If the inner IP header is IPv4 then update the IPv4 header
      checksum.  Then decapsulate and send the packet as for other
      decapsulated packets.

   o  If some other errors prevent sending the packet (such as unknown
      VN Context Id), then the NVE SHOULD send an ICMP host unreachable
      instead of a ttl exceeded error.

10.  Other ICMP errors

   The technique for selecting ttl behavior specified in this draft can
   also be used to trigger other ICMPv4 and ICMPv6 errors.  For example,
   [RFC1933] specifies how ICMP packet too big errors from underlay
   routers can be used to report overlay ICMP packet too big errors to
   the original source.  Other errors that are more specific to the
   overlay protocol might also be useful, such as not being able to find
   a VNI for the incoming port and VLAN, or not being able to flood the
   packet if the packet is a Broadcast, Unknown unicast, or Multicast
   packet.

11.  Security Considerations

   The considerations in [I-D.ietf-nvo3-security-requirements] apply.

   In addition, the use of the uniform ttl tunnel model will result in
   ICMP errors being generated by underlay routers and consumed by NVEs.
   That presents an attack vector which does not exist in a pipe ttl
   tunnel model.  However, ICMP errors should be rate limited [RFC1812].
   Implementations should also take appropriate measures in rate
   limiting the input rate for ICMP errors that are processed by limited
   CPU resources.

   Some implementations might handle the trace packets (with uniform ttl
   model) in software while the pipe ttl model packets can be handled in
   hardware.  In such a case the implementation should have mechanisms
   to avoid starvation of limited CPU resources due to these packets.

12.  IANA Considerations

   TBD

13.  Acknowledgements

   The authors acknowledge the helpful comments from David Black and
   Diego Garcia del Rio.

14.  References

14.1.  Normative References

   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
              RFC 792, DOI 10.17487/RFC0792, September 1981.

   [RFC1812]  Baker, F., Ed., "Requirements for IP Version 4 Routers",
              RFC 1812, DOI 10.17487/RFC1812, June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

   [RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y.
              Rekhter, "Framework for Data Center (DC) Network
              Virtualization", RFC 7365, DOI 10.17487/RFC7365,
              October 2014.

14.2.  Informative References

   [I-D.gross-geneve]
              Gross, J., Sridhar, T., Garg, P., Wright, C., Ganga, I.,
              Agarwal, P., Duda, K., Dutt, D., and J. Hudson, "Geneve:
              Generic Network Virtualization Encapsulation",
              draft-gross-geneve-02 (work in progress), October 2014.

   [I-D.herbert-gue]
              Herbert, T., Yong, L., and O. Zia, "Generic UDP
              Encapsulation", draft-herbert-gue-03 (work in progress),
              March 2015.

   [I-D.ietf-nvo3-security-requirements]
              Hartman, S., Zhang, D., Wasserman, M., Qiang, Z., and M.
              Zhang, "Security Requirements of NVO3",
              draft-ietf-nvo3-security-requirements-05 (work in
              progress), June 2015.

   [I-D.sridharan-virtualization-nvgre]
              Garg, P. and Y. Wang, "NVGRE: Network Virtualization using
              Generic Routing Encapsulation",
              draft-sridharan-virtualization-nvgre-08 (work in
              progress), April 2015.

   [I-D.tissa-lime-yang-oam-model]
              Senevirathne, T., Finn, N., Kumar, D., Salam, S., Wu, Q.,
              and Z. Wang, "Generic YANG Data Model for Operations,
              Administration, and Maintenance (OAM)",
              draft-tissa-lime-yang-oam-model-06 (work in progress),
              August 2015.

   [RFC1933]  Gilligan, R. and E. Nordmark, "Transition Mechanisms for
              IPv6 Hosts and Routers", RFC 1933, DOI 10.17487/RFC1933,
              April 1996.

   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
              December 1998.

   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
              RFC 2983, DOI 10.17487/RFC2983, October 2000.

   [RFC3270]  Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen,
              P., Krishnan, R., Cheval, P., and J. Heinanen,
              "Multi-Protocol Label Switching (MPLS) Support of
              Differentiated Services", RFC 3270, DOI 10.17487/RFC3270,
              May 2002.

   [RFC3443]  Agarwal, P. and B. Akyol, "Time To Live (TTL) Processing
              in Multi-Protocol Label Switching (MPLS) Networks",
              RFC 3443, DOI 10.17487/RFC3443, January 2003.

   [RFC4884]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro,
              "Extended ICMP to Support Multi-Part Messages", RFC 4884,
              DOI 10.17487/RFC4884, April 2007.

   [RFC4950]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro, "ICMP
              Extensions for Multiprotocol Label Switching", RFC 4950,
              DOI 10.17487/RFC4950, August 2007.

Authors' Addresses

   Erik Nordmark
   Arista Networks
   Santa Clara, CA
   USA

   Email: nordmark@arista.com


   Chandra Appanna
   Arista Networks
   Santa Clara, CA
   USA

   Email: achandra@arista.com


   Alton Lo
   Arista Networks
   Santa Clara, CA
   USA

   Email: altonlo@arista.com