NVO3 WG                                                      E. Nordmark
Internet-Draft                                                C. Appanna
Intended status: Standards Track                                   A. Lo
Expires: September 2, 2016                               Arista Networks
                                                                Mar 2016


     Layer-Transcending Traceroute for Overlay Networks like VXLAN
             draft-nordmark-nvo3-transcending-traceroute-02

Abstract

   Tools like traceroute have been very valuable for the operation of
   the Internet. Part of that value comes from being able to display
   information about routers and paths over which the user of the tool
   has no control, but the traceroute output can be passed along to
   someone else who can further investigate or fix the problem.

   In overlay networks such as VXLAN and NVGRE the prevailing view is
   that since the overlay network has no control of the underlay there
   need to be special tools and agreements to enable extracting traces
   from the underlay. We argue that enabling visibility into the
   underlay and using existing tools like traceroute has been overlooked
   and would add value in many deployments of overlay networks.

   This document specifies an approach that can be used to make
   traceroute transcend layers of encapsulation, including details for
   how to apply this to VXLAN. The technique can be applied to other
   encapsulations used for overlay networks. It can also be implemented
   using current commercial silicon.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 2, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Solution Overview
   3.  Goals and Requirements
   4.  Definition Of Terms
   5.  Example Topologies
   6.  Controlling and selecting ttl behavior
   7.  Introducing a ttl copyin flag in the encapsulation header
   8.  Encapsulation Behavior
   9.  Decapsulating Behavior
   10. Other ICMP errors
   11. Security Considerations
   12. IANA Considerations
   13. Acknowledgements
   14. References
     14.1. Normative References
     14.2. Informative References
   Authors' Addresses

1.  Introduction

   Tools like traceroute have been very valuable for the operation of
   the Internet. Part of that value comes from being able to display
   information about routers and paths over which the user of the tool
   has no control, but the traceroute output can be passed along to
   someone else who can further investigate or fix the problem. The
   output of traceroute can be included in an email or a trouble ticket
   to report the problem. This provides a lot more information than the
   mere indication that A can't communicate with B, in particular when
   the failures are transient. The ping tool provides some of the same
   benefits in being able to return ICMP errors such as host unreachable
   messages.

   This document shows how those tools can be used to gather information
   for both the overlay and underlay parts of an end-to-end path by
   providing the option to have some packets use a uniform time-to-live
   (ttl) model for the tunnels, and associated ICMP error handling.
   These changes are limited to the tunnel ingress and egress points.

   The desire to make traceroute provide useful information for overlay
   networks is not an argument against also using a layered approach for
   OAM as specified in e.g., [I-D.tissa-lime-yang-oam-model]. Such
   approaches are quite appropriate for continuous monitoring at
   different layers and across different domains. A layer-transcending
   traceroute complements the ability to do layered and/or continuous
   monitoring.

   The traceroute tool relies on receiving ICMP errors [RFC0792] in
   combination with using different IP time-to-live values. That
   results in the packet making it further and further towards the
   destination with ICMP ttl exceeded errors being received from each
   hop. That provides the user the working path even if the packets are
   black holed eventually, and also provides any errors like ICMP host
   unreachable. The fundamental assumption is that the ttl is
   decremented for each hop and that the resulting ICMP ttl exceeded
   errors are delivered back to the host.

   When some encapsulation is used to tunnel packets there is an
   architectural question of how those tunnels should be viewed from
   the rest of the network. Different models were described first for
   diffserv in [RFC2983] and then applied to MPLS in [RFC3270] and
   expanded to MPLS ttl handling in [RFC3443], and those models apply
   to other forms of direct or indirect IP in IP tunnels. Those RFCs
   define two models for ttl that are of interest to us:

   o  A pipe model, where the tunnel is invisible to the rest of the
      network in that it looks like a direct connection between the
      tunnel ingress and egress.

   o  A uniform model, where the ttl decrements uniformly for hops
      outside and inside the tunnel.
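
   As a rough illustration of the two models, the sketch below reuses
   the hop names from the "Simple L2 overlay" example in Section 5.
   It is a minimal model under stated assumptions, not an
   implementation: the names first_responder and traceroute are
   hypothetical, and the only thing modeled is which node would answer
   a probe that H1 sends with a given ttl.

```python
# Hypothetical hop names borrowed from the "Simple L2 overlay" example.
TUNNEL_NODES = ["VtepA", "R1", "VtepB"]  # ingress NVE, underlay router, egress NVE
DESTINATION = "H2"


def first_responder(ttl: int, model: str) -> str:
    """Return the node that answers a probe sent by H1 with this ttl."""
    if model == "pipe":
        # The tunnel looks like a direct link, so no tunnel node
        # examines or decrements the ttl of the host's packet.
        return DESTINATION
    if model != "uniform":
        raise ValueError(f"unknown ttl model: {model}")
    for node in TUNNEL_NODES:
        # Each node checks the ttl first and reports expiry ...
        if ttl <= 1:
            return node
        # ... otherwise it decrements the ttl and forwards the packet.
        ttl -= 1
    return DESTINATION


def traceroute(model: str, max_ttl: int = 30) -> list:
    """List the responders for ttl = 1, 2, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        hop = first_responder(ttl, model)
        hops.append(hop)
        if hop == DESTINATION:
            break
    return hops


print(traceroute("uniform"))
print(traceroute("pipe"))
```

   Under these assumptions the uniform model reports VtepA, R1, VtepB
   and H2 in turn, while the pipe model reports only H2.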

   The tunneling mechanisms discussed in NVO3 (such as VXLAN [RFC7348],
   NVGRE [I-D.sridharan-virtualization-nvgre], GENEVE
   [I-D.gross-geneve], and GUE [I-D.herbert-gue]) have either been
   specified to provide the pipe model of a tunnel or are silent on the
   setting of the outer ttl. Those protocols can be extended to have an
   optional uniform tunnel model when the payload is IP, following the
   same model as in [RFC3443]. Note that these encapsulations carry
   Ethernet frames and hence are not even aware that the payload is IP.
   However, IP is the bulk of what is carried over such tunnels and the
   ingress NVE can inspect the IP part of the Ethernet frame.

   However, for general application traffic the pipe model is fine and
   might even be expected by some applications. In general, when the
   source and destination IP are in the same IP subnet the ttl should
   not be decremented. Thus it makes sense to have a way to selectively
   enable the uniform model, perhaps based on some method to identify
   packets associated with traceroute or some marker in the packet
   itself that the traceroute tool can set.

2.  Solution Overview

   The pieces needed to accomplish this are:

   o  One or more ways to select the uniform model packets at the tunnel
      ingress.

   o  Tunnel ingress copying out the original ttl from a selected packet
      to the outer IP header, and then doing a check and decrement of
      that ttl.

   o  If that ttl check results in ttl expiry at the tunnel ingress,
      then deliver an ICMP ttl exceeded packet back to the host.

   o  A mechanism by which the tunnel egress knows which packets should
      have uniform model, for instance a bit in the encapsulation
      header.

   o  The tunnel egress copying in the ttl (for identified packets) from
      the outer header to the inner IP header, then doing a check and
      decrement of that ttl.

   o  If the ttl check results in ttl expiry at the tunnel egress, then
      deliver an ICMP error back to the original host (or, perhaps
      better, to the tunnel ingress the same way as underlay routers
      do).

   o  IP routers in the underlay will deliver any ICMP errors to the
      source IP address of the packet. For tunneled packets that will
      be the tunnel ingress. Hence the tunnel ingress needs to be able
      to take such ICMP errors and form corresponding ICMP errors that
      are sent back to the host. The requirement in [RFC1812] ensures
      that the ICMP errors will contain enough headers to form such an
      ICMP error. It has been noted that there are routers in the
      Internet which decades later fail to conform to that aspect of
      [RFC1812].

   The idea to reflect (some) ICMP errors from inside a tunnel back to
   the original source goes back to IPv6 in IPv4 encapsulation as
   specified in [RFC1933] and [RFC2473]. However, those specifications
   did not advocate using a uniform ttl model for the tunnels but did
   handle ICMP packet too big and other unreachable messages. They
   specify how to reflect ICMP errors received from underlay routers to
   ICMP errors sent to the original host. The addition of handling ICMP
   ttl exceeded errors for the uniform tunnel model is straightforward.

   The information carried in the ICMP errors is quite limited: the
   original packet plus an ICMP type and code. However, there are
   extension mechanisms specified in [RFC4884] and used for MPLS in
   [RFC4950] which include TLVs with additional information. If there
   is additional information to include for overlay networks, that
   information could be added by defining new ICMP Extension Objects
   based on [RFC4884]. Such extensions are for further study.

3.  Goals and Requirements

   The following goals and requirements apply:

   o  No changes needed in the underlay.

   o  Optional changes on the decapsulating end.

   o  ECMP friendly.
      If the underlay employs equal cost multipath routing then one
      should be able to use this mechanism to trace the same path as a
      given TCP or UDP flow is using. In addition, one should be able
      to explore different ECMP paths by varying the IP addresses and
      port numbers in the packets originated by traceroute on the host.

   o  Provide output which makes it possible to compare a regular
      overlay traceroute with the layer-transcending output.

4.  Definition Of Terms

   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Terms such as NVE and TS are used as specified in [RFC7365]:

   o  Network Virtualization Edge (NVE): An NVE is the network entity
      that sits at the edge of an underlay network and implements L2
      and/or L3 network virtualization functions.

   o  Tenant System (TS): A physical or virtual system that can play the
      role of a host or a forwarding element such as a router, switch,
      firewall, etc.

   o  Virtual Access Points (VAPs): A logical connection point on the
      NVE for connecting a Tenant System to a virtual network.

   o  Virtual Network (VN): A VN is a logical abstraction of a physical
      network that provides L2 or L3 network services to a set of Tenant
      Systems.

   o  Virtual Network Context (VN Context) Identifier: Field in an
      overlay encapsulation header that identifies the specific VN the
      packet belongs to.

   We use the VTEP term in [RFC7348] as synonymous with NVE, and VNI as
   synonymous with VN Context Identifier.

5.  Example Topologies

   The following example topologies illustrate different cases where we
   want a tracing capability. The examples are for overlay technologies
   such as VXLAN which provide a layer 2 overlay on IP.
   The cases for layer 3 overlay on top of IP are simpler and not shown
   in this document.

   The VXLAN term VTEP is used as synonymous with NVO3's NVE term.

   -----------               -----------
   |   H1    |               |   H2    |
   | 1.0.1.1 |               | 1.0.1.2 |
   |         |               |         |
   -----------               -----------
        |                         |
        |                         |
   -----------  -----------  -----------
   |  VtepA  |  |   R1    |  |  VtepB  |
   | 2.0.1.1 |--| 2.0.1.2 |  | 2.0.2.1 |
   |         |  | 2.0.2.2 |--|         |
   -----------  -----------  -----------

                           Simple L2 overlay

   The figure above shows two hosts connected using an underlay which
   provides a layer two service. Thus H1 and H2 are in the same subnet
   and unaware of the existence of the underlay. Thus a normal ping or
   traceroute would not be able to provide any information about the
   nature of a failure; either packets get through or they do not. When
   the packets get through, traceroute would output something like:

   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
    1  1.0.1.2 (1.0.1.2)  1.104 ms  1.235 ms  1.729 ms

   In this case it would be desirable to be able to traceroute from H1
   to H2 (and vice versa) and observe VtepA, R1, VtepB and H2. Thus in
   the case of packets getting through, traceroute would output:

   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
    1  2.0.1.1 (2.0.1.1)  1.104 ms  1.235 ms  1.729 ms
    2  2.0.1.2 (2.0.1.2)  2.106 ms  2.007 ms  2.156 ms
    3  2.0.2.1 (2.0.2.1)  35.034 ms  24.490 ms  21.626 ms
    4  1.0.1.2 (1.0.1.2)  40.830 ms  44.694 ms  75.620 ms

   Note that the underlay and overlay might exist in completely separate
   addressing domains. Thus H1 might not be able to reach any of the
   underlay addresses. And the underlay IP addresses might overlap the
   overlay IP addresses. For example, it would be completely valid to
   see VtepA having the same IP address as H1.
   The user of this tool needs to understand that the utility of the
   traceroute output is to get information to determine whether the
   issue is in the underlay or overlay, and to be able to pass the
   underlay information to the operator of the underlay.

   In overlay networks without any ARP/ND optimizations, ARP/ND packets
   would be flooded between the tunnel endpoints. Thus if there is some
   communication failure between H1 and H2, then H1 above might not have
   an ARP entry for H2. This results in traceroute not being able to
   output any data. This implies that in order to use traceroute to
   troubleshoot the issue one would need some workaround, such as
   installing some temporary ARP entries on the hosts.

   -----------               -----------  -----------  -----------
   |   H1    |               |   R2    |  |   R3    |  |   H4    |
   | 1.0.1.1 |               | 1.0.2.2 |--| 1.0.2.3 |  |         |
   |         |               | 1.0.1.2 |  | 1.0.3.3 |--| 1.0.3.4 |
   -----------               -----------  -----------  -----------
        |                         |
        |                         |
   -----------  -----------  -----------
   |  VtepA  |  |   R1    |  |  VtepB  |
   | 2.0.1.1 |--| 2.0.1.2 |  | 2.0.2.1 |
   |         |  | 2.0.2.2 |--|         |
   -----------  -----------  -----------

                 L2 overlay as part of larger network

   The figure above has an overlay router as the next hop seen by H1.
   In this case a normal overlay traceroute would be able to display the
   overlay path i.e.

   traceroute to H4, 30 hops max, 60 byte packets
    1  R2
    2  R3
    3  H4

   The layer-transcending traceroute would show the combination of the
   underlay and overlay paths i.e.,

   traceroute to H4, 30 hops max, 60 byte packets
    1  VtepA
    2  R1
    3  VtepB
    4  R2
    5  R3
    6  H4

   -----------             -------------------             -----------
   |   H1    |             |       R5        |             |   H6    |
   | 1.0.1.1 |             |                 |             |         |
   |         |             | 1.0.1.2 1.0.5.5 |             | 1.0.5.6 |
   -----------             |-----------------|             -----------
        |                  |    |       |    |                  |
        |                  |    |       |    |                  |
   ----------- ----------- |-----------------| ----------- -----------
   |  VtepA  | |   R1    | |  VtepB   VtepC  | |   R6    | |  VtepD  |
   | 2.0.1.1 |-| 2.0.1.2 | | 2.0.2.1 3.0.1.1 |-| 3.0.1.2 | |         |
   |         | | 2.0.2.2 |-|                 | | 3.0.2.2 |-| 3.0.3.1 |
   ----------- ----------- ------------------- ----------- -----------

                      Multiple L2 overlays in path

   The figure above has multiple overlay network segments that are
   connected in one router, which provides the tunnel endpoints for both
   overlay segments plus routing for the overlay. A more general
   picture would be to have an overlay routed path between the two NVEs
   e.g., VtepB and VtepC connected to different routers in the overlay.
   However, such a drawing in ASCII art doesn't fit on the page.

   A normal overlay traceroute in the above topology would show the
   overlay router i.e.,

   traceroute to H6, 30 hops max, 60 byte packets
    1  R5
    2  H6

   The layer-transcending traceroute would show the combination of the
   underlay and overlay paths i.e.,

   traceroute to H6, 30 hops max, 60 byte packets
    1  VtepA
    2  R1
    3  VtepB
    4  R5
    5  VtepC
    6  R6
    7  VtepD
    8  H6

   Note that the R5 device, which includes VtepB and VtepC, appears as
   three hops in the traceroute output. That is needed to be able to
   correlate the output with the overlay output which has R5. That
   correlation would be hard if the R5 device only appeared as VtepB in
   the layer-transcending traceroute (LTTON) output.
   The three-hop representation also stays invariant whether the NVEs
   and overlay router are implemented by a single device or multiple
   devices.

6.  Controlling and selecting ttl behavior

   The network admin needs to be able to control who can use the
   layer-transcending traceroute, since the operator might not want to
   disclose the underlay topology to all its users all the time. There
   are different approaches for this such as designating particular
   ports (Virtual Access Points in NVO3 terminology) on an NVE to have
   the uniform ttl tunnel model. We have found it useful to be able to
   enable this capability on a per port and/or virtual network basis, in
   addition to having a global setting per NVE.

   When enabled on the NVEs the user on the TS needs to be able to
   control which traffic is subject to which tunnel model. The normal
   traffic would use the pipe ttl tunnel model and only explicit trace
   applications are likely to want to use the uniform ttl tunnel model.
   Hence it makes sense to use some marker in the packets sent by the TS
   to select those packets for uniform model on the NVE. Such a
   mechanism should be usable so that the user can perform both a
   regular traceroute and an LTTON.

   Potentially different fields in the packets originated by traceroute
   on the TS can be used to mark the packets for uniform ttl tunnel
   model. However, many of those fields such as source and destination
   port numbers and protocol might be used in hashing for ECMP. The
   marking that can be used without impacting ECMP is the DSCP field in
   the packet. That field can be set with an option (--tos) in at least
   some existing traceroute implementations.

   Note that when DSCP is used for such marking it is a configured
   choice subject to agreement between the operator of the TS and NVE.
   The matching on the NVE should ignore the ECN bits so as not to
   interfere with ECN.
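
   The DSCP-based selector can be sketched as follows. This is a
   minimal illustration, assuming the selector is a single configured
   DSCP value: the names tos_for_dscp, selects_uniform_model and
   TRACE_DSCP are hypothetical, and the value 3 is an arbitrary
   example rather than anything this document assigns. The DSCP is
   the upper six bits of the IPv4 TOS or IPv6 Traffic Class byte, and
   the two low-order bits carry ECN.

```python
ECN_BITS = 2  # the two low-order bits of the TOS / Traffic Class byte carry ECN


def tos_for_dscp(dscp: int) -> int:
    """TOS byte value that carries the given 6-bit DSCP with the ECN bits zero."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value")
    return dscp << ECN_BITS


def selects_uniform_model(tos_byte: int, configured_dscp: int) -> bool:
    """True if the packet's DSCP equals the configured marker, whatever ECN says."""
    return ((tos_byte & 0xFF) >> ECN_BITS) == configured_dscp


# Hypothetical marker value agreed between the TS and NVE operators.
TRACE_DSCP = 3

# Value the user would hand to the traceroute TOS option.
print(tos_for_dscp(TRACE_DSCP))
```

   With the example marker value 3 the TOS value is 12, and probes
   arriving with TOS 12 through 15 all match, since they differ only
   in the ECN bits.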

   However, the DSCP value used in the overlay might have an impact on
   the forwarding of the packets. In such a case one can use an
   alternative selector such as the UDP source port number. That has
   the downside of affecting the hash values used for ECMP and link
   aggregation port selection.

7.  Introducing a ttl copyin flag in the encapsulation header

   When this approach is applied to VXLAN [RFC7348] the decapsulating
   NVE has to be able to identify packets that have to be processed in
   the uniform ttl tunnel model way. For that purpose we define a new
   flag which is sent by the encapsulating NVE on selected packets, and
   is used by the decapsulating NVE to perform the ttl copyin, decrement
   and check.

   In addition to the one I-flag defined in [RFC7348] we define a new
   T-flag to capture the trace behavior at the decapsulating tunnel
   endpoint.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |R|R|R|R|I|R|R|T|            Reserved                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                VXLAN Network Identifier (VNI) |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   New fields:

   T-flag:  When set indicates that decapsulator should take the
            outer ttl and copy it to the inner ttl, and then check
            and decrement the resulting ttl.

8.  Encapsulation Behavior

   If the uniform ttl model is enabled for the input, and the received
   naked packet matches the selector, then the ingress NVE will perform
   these additional operations as part of encapsulating an IPv4 or IPv6
   packet:

   o  Examine the IPv4 TTL (or IPv6 hopcount, respectively) on receipt
      and if 1 or less, then drop the packet and send an ICMPv4 (or
      ICMPv6) ttl exceeded back to the original host.
      Since the NVE is operating on an L2 packet, it might not have any
      layer 3 interfaces or routes for the originating host. Thus it
      sends the packet back to the source L2 address of the packet,
      back out the ingress port, without any IP address lookup.

   o  If ttl did not expire, then decrement the above ttl/hopcount and
      place it in the outer IP header. Encapsulate and send the packet
      as normal.

   o  If some other errors prevent sending the packet (such as unknown
      VN Context Id, no flood list configured), then the NVE SHOULD send
      an ICMP host unreachable back to the host.

   The ingress NVE will receive ICMP errors from underlay routers and
   the egress NVE, whether due to ttl exceeded or underlay issues such
   as host unreachable or packet too big errors. The NVE should take
   such errors, and in addition to any local syslog etc., generate an
   ICMP error sent back to the host. The principle for this is
   specified in [RFC1933] and [RFC2473]. Just like in those
   specifications, the inner and outer IP headers could be of
   different versions. A common case of that might be an IPv6 overlay
   with an IPv4 underlay. That case requires some changes in the ICMP
   type and code values in addition to recreating the packets. The
   place where LTTON differs from those specifications is that there is
   an NVO3 header and (for L2 over L3) an L2 header in the packet.

   The figures below show an example of ICMP header re-generation at
   VtepA for the case of IPv6 overlay with IPv4 underlay. The case of
   IPv4 over IPv4 is similar and simpler since the ICMP header is the
   same for both overlay and underlay. The example uses VXLAN
   encapsulation to provide the concrete details, but the approach
   applies to other NVO3 proposals.

                  +--------------+
                  | IPv4 Header  |
                  | src = R1     |
                  | dst = VtepA  |
                  +--------------+
                  |    ICMPv4    |
                  |    Header    |
                  | type = X     |
                  | code = Y     |
           - -    +--------------+
                  | IPv4 Header  |
                  | src = VtepA  |
          IPv4    | dst = VtepB  |
                  +--------------+
         Packet   |     UDP      |
                  | dst = VXLAN  |
           in     +--------------+
                  |   Ethernet   |
          Error   | DA = H2 mac  |
                  | SA = H1 mac  |
                  +--------------+     - -
                  |     IPv6     |
                  | src = H1 ipv6|
                  | dst = H2 ipv6|   Original IPv6
                  +--------------+   Packet.
                  |  Transport   |   Used to
                  |    Header    |   generate an
                  +--------------+   ICMPv6
                  |              |   error message
                  ~     Data     ~   back to the source.
                  |              |
           - -    +--------------+     - -

          ICMPv4 Error Message Returned to Encapsulating Node

   The above underlay ICMPv4 is used to form an overlay ICMPv6 packet by
   taking the Ethernet DA from the inner Ethernet SA, and forming an
   IPv6 header where the source address is based on the source address
   of the ICMPv4 error. The ICMPv6 type and code values are set based
   on the ICMPv4 type and code values.
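
   The two derivations just described can be sketched as below. This
   is a minimal illustration, not the authors' implementation: the
   names overlay_source and map_type_code are hypothetical, and the
   type and code table is an assumption, since this document does not
   list the mapping. The pairs shown follow the usual correspondence
   between ICMPv4 and ICMPv6 messages.

```python
import ipaddress
from typing import Optional, Tuple

# Assumed (ICMPv4 type, code) -> (ICMPv6 type, code) pairs. The text does not
# list a mapping; these follow the usual ICMPv4/ICMPv6 correspondence.
ICMP4_TO_ICMP6 = {
    (11, 0): (3, 0),  # ttl exceeded in transit -> hop limit exceeded in transit
    (3, 1): (1, 3),   # host unreachable -> address unreachable
    (3, 4): (2, 0),   # fragmentation needed -> packet too big
}


def overlay_source(underlay_src: str) -> ipaddress.IPv6Address:
    """Build the ICMPv6 source: 96 zero bits followed by the IPv4 address."""
    return ipaddress.IPv6Address(int(ipaddress.IPv4Address(underlay_src)))


def map_type_code(icmp4_type: int, icmp4_code: int) -> Optional[Tuple[int, int]]:
    """Return the ICMPv6 (type, code), or None if this sketch has no mapping."""
    return ICMP4_TO_ICMP6.get((icmp4_type, icmp4_code))


src = overlay_source("2.0.1.2")  # R1's underlay address in the examples
print(src.packed.hex())          # 16 bytes: 12 zero bytes, then the IPv4 address
print(map_type_code(11, 0))
```

   Python prints such an address in hexadecimal groups rather than in
   the ::2.0.1.2 notation used in the traceroute output in this
   section, but the 128-bit value is the same.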

                  +--------------+
                  |   Ethernet   |
                  | DA = H1 mac  |   From ICMPv4 packet
                  | SA = VtepA   |   in error
                  +--------------+
                  | IPv6 Header  |
                  | src = ::R1   |   96 zeros + IPv4 address
                  | dst = H1 ipv6|
                  +--------------+
                  |    ICMPv6    |
                  |    Header    |
                  | type = X'    |   Type and code mapped
                  | code = Y'    |   from v4 to v6 values
           - -    +--------------+     - -
                  |     IPv6     |
          IPv6    | src = H1 ipv6|
                  | dst = H2 ipv6|   Unmodified from
         Packet   +--------------+   ICMPv4 error
                  |  Transport   |
           in     |    Header    |
                  +--------------+
          Error   |              |
                  ~     Data     ~
                  |              |
           - -    +--------------+     - -

           Generated ICMPv6 Error Message for Overlay Source

   In the case of IPv6 over IPv4 the above example setting of the IPv6
   source address results in this type of traceroute output:

   traceroute to 2000:0:0:40::2, 30 hops max, 80 byte packets
    1  ::2.0.1.1 (::2.0.1.1)  1.231 ms  1.004 ms  1.126 ms
    2  ::2.0.1.2 (::2.0.1.2)  1.994 ms  2.301 ms  2.016 ms
    3  ::2.0.2.1 (::2.0.2.1)  18.846 ms  30.582 ms  19.776 ms
    4  2000:0:0:40::2 (2000:0:0:40::2)  48.964 ms  60.131 ms  53.895 ms

9.  Decapsulating Behavior

   If this uniform ttl model is enabled on the decapsulating NVE, and
   the overlay header indicates that uniform ttl model applies (the
   T-bit in the case of VXLAN), then the NVE will perform these
   additional operations as part of decapsulating a packet where the
   inner packet is an IPv4 or IPv6 packet:

   o  Examine the outer IPv4 TTL (or outer IPv6 hopcount, respectively)
      on receipt and if 1 or less, then drop the packet and send an
      outer ICMPv4 (or ICMPv6) ttl exceeded back to the source of the
      outer packet i.e., the ingress NVE. This ICMP packet should look
      the same as an ICMP error generated by an underlay router, and the
      requirement in [RFC1812] on the size of the packet in error
      applies.

   o  If ttl did not expire, then decrement the above ttl/hopcount and
      place it in the inner IP header.
      If the inner IP header is IPv4 then update the IPv4 header
      checksum. Then decapsulate and send the packet as for other
      decapsulated packets.

   o  If some other errors prevent sending the packet (such as unknown
      VN Context Id), then the NVE SHOULD send an ICMP host unreachable
      instead of a ttl exceeded error.

10.  Other ICMP errors

   The technique for selecting ttl behavior specified in this draft can
   also be used to trigger other ICMPv4 and ICMPv6 errors. For example,
   [RFC1933] specifies how ICMP packet too big errors from underlay
   routers can be used to report ICMP packet too big errors to the
   original source. Other errors that are more specific to the overlay
   protocol might also be useful, such as not being able to find a VNI
   for the incoming port and VLAN, or not being able to flood the
   packet if the packet is a Broadcast, Unknown unicast, or Multicast
   packet.

11.  Security Considerations

   The considerations in [I-D.ietf-nvo3-security-requirements] apply.

   In addition, the use of the uniform ttl tunnel model will result in
   ICMP errors being generated by underlay routers and consumed by NVEs.
   That presents an attack vector which does not exist in a pipe ttl
   tunnel model. However, ICMP errors should be rate limited [RFC1812].
   Implementations should also take appropriate measures in rate
   limiting the input rate for ICMP errors that are processed by limited
   CPU resources.

   Some implementations might handle the trace packets (with uniform ttl
   model) in software while the pipe ttl model packets can be handled in
   hardware. In such a case the implementation should have mechanisms
   to avoid starvation of limited CPU resources due to these packets.

12.  IANA Considerations

   TBD

13.  Acknowledgements

   The authors acknowledge the helpful comments from David Black and
   Diego Garcia del Rio.

14.  References

14.1.  Normative References

   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
              RFC 792, DOI 10.17487/RFC0792, September 1981.

   [RFC1812]  Baker, F., Ed., "Requirements for IP Version 4 Routers",
              RFC 1812, DOI 10.17487/RFC1812, June 1995.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

   [RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y.
              Rekhter, "Framework for Data Center (DC) Network
              Virtualization", RFC 7365, DOI 10.17487/RFC7365,
              October 2014.

14.2.  Informative References

   [I-D.gross-geneve]
              Gross, J., Sridhar, T., Garg, P., Wright, C., Ganga, I.,
              Agarwal, P., Duda, K., Dutt, D., and J. Hudson, "Geneve:
              Generic Network Virtualization Encapsulation",
              draft-gross-geneve-02 (work in progress), October 2014.

   [I-D.herbert-gue]
              Herbert, T., Yong, L., and O. Zia, "Generic UDP
              Encapsulation", draft-herbert-gue-03 (work in progress),
              March 2015.

   [I-D.ietf-nvo3-security-requirements]
              Hartman, S., Zhang, D., Wasserman, M., Qiang, Z., and M.
              Zhang, "Security Requirements of NVO3",
              draft-ietf-nvo3-security-requirements-06 (work in
              progress), December 2015.

   [I-D.sridharan-virtualization-nvgre]
              Garg, P. and Y. Wang, "NVGRE: Network Virtualization using
              Generic Routing Encapsulation",
              draft-sridharan-virtualization-nvgre-08 (work in
              progress), April 2015.

   [I-D.tissa-lime-yang-oam-model]
              Senevirathne, T., Finn, N., Kumar, D., Salam, S., Wu, Q.,
              and Z.
              Wang, "Generic YANG Data Model for Operations,
              Administration, and Maintenance (OAM)",
              draft-tissa-lime-yang-oam-model-06 (work in progress),
              August 2015.

   [RFC1933]  Gilligan, R. and E. Nordmark, "Transition Mechanisms for
              IPv6 Hosts and Routers", RFC 1933, DOI 10.17487/RFC1933,
              April 1996.

   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
              December 1998.

   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
              RFC 2983, DOI 10.17487/RFC2983, October 2000.

   [RFC3270]  Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen,
              P., Krishnan, R., Cheval, P., and J. Heinanen,
              "Multi-Protocol Label Switching (MPLS) Support of
              Differentiated Services", RFC 3270, DOI 10.17487/RFC3270,
              May 2002.

   [RFC3443]  Agarwal, P. and B. Akyol, "Time To Live (TTL) Processing
              in Multi-Protocol Label Switching (MPLS) Networks",
              RFC 3443, DOI 10.17487/RFC3443, January 2003.

   [RFC4884]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro,
              "Extended ICMP to Support Multi-Part Messages", RFC 4884,
              DOI 10.17487/RFC4884, April 2007.

   [RFC4950]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro, "ICMP
              Extensions for Multiprotocol Label Switching", RFC 4950,
              DOI 10.17487/RFC4950, August 2007.

Authors' Addresses

   Erik Nordmark
   Arista Networks
   Santa Clara, CA
   USA

   Email: nordmark@arista.com


   Chandra Appanna
   Arista Networks
   Santa Clara, CA
   USA

   Email: achandra@arista.com


   Alton Lo
   Arista Networks
   Santa Clara, CA
   USA

   Email: altonlo@arista.com