idnits 2.17.1 

draft-nordmark-nvo3-transcending-traceroute-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  == There are 18 instances of lines with non-RFC6890-compliant IPv4
     addresses in the document.  If these are example addresses, they should
     be changed.

  == There are 1 instance of lines with non-RFC3849-compliant IPv6 addresses
     in the document.  If these are example addresses, they should be changed.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords. 

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (Mar 2016) is 2964 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Downref: Normative reference to an Informational RFC: RFC 7348

  ** Downref: Normative reference to an Informational RFC: RFC 7365

  == Outdated reference: A later version (-07) exists of
     draft-ietf-nvo3-security-requirements-06

  -- Obsolete informational reference (is this intentional?): RFC 1933
     (Obsoleted by RFC 2893)


     Summary: 2 errors (**), 0 flaws (~~), 5 warnings (==), 2 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	NVO3 WG                                                      E. Nordmark
3	Internet-Draft                                                C. Appanna
4	Intended status: Standards Track                                   A. Lo
5	Expires: September 2, 2016                               Arista Networks
6	                                                                Mar 2016

8	     Layer-Transcending Traceroute for Overlay Networks like VXLAN
9	             draft-nordmark-nvo3-transcending-traceroute-02

11	Abstract

13	   Tools like traceroute have been very valuable for the operation of
14	   the Internet.  Part of that value comes from being able to display
15	   information about routers and paths over which the user of the tool
16	   has no control, but the traceroute output can be passed along to
17	   someone else that can further investigate or fix the problem.

19	   In overlay networks such as VXLAN and NVGRE the prevailing view is
20	   that since the overlay network has no control of the underlay there
21	   needs to be special tools and agreements to enable extracting traces
22	   from the underlay.  We argue that enabling visibility into the
23	   underlay and using existing tools like traceroute has been overlooked
24	   and would add value in many deployments of overlay networks.

26	   This document specifies an approach that can be used to make
27	   traceroute transcend layers of encapsulation including details for
28	   how to apply this to VXLAN.  The technique can be applied to other
29	   encapsulations used for overlay networks.  It can also be implemented
30	   using current commercial silicon.

32	Status of this Memo

34	   This Internet-Draft is submitted in full conformance with the
35	   provisions of BCP 78 and BCP 79.

37	   Internet-Drafts are working documents of the Internet Engineering
38	   Task Force (IETF).  Note that other groups may also distribute
39	   working documents as Internet-Drafts.  The list of current Internet-
40	   Drafts is at http://datatracker.ietf.org/drafts/current/.

42	   Internet-Drafts are draft documents valid for a maximum of six months
43	   and may be updated, replaced, or obsoleted by other documents at any
44	   time.  It is inappropriate to use Internet-Drafts as reference
45	   material or to cite them other than as "work in progress."

47	   This Internet-Draft will expire on September 2, 2016.

49	Copyright Notice

51	   Copyright (c) 2016 IETF Trust and the persons identified as the
52	   document authors.  All rights reserved.

54	   This document is subject to BCP 78 and the IETF Trust's Legal
55	   Provisions Relating to IETF Documents
56	   (http://trustee.ietf.org/license-info) in effect on the date of
57	   publication of this document.  Please review these documents
58	   carefully, as they describe your rights and restrictions with respect
59	   to this document.  Code Components extracted from this document must
60	   include Simplified BSD License text as described in Section 4.e of
61	   the Trust Legal Provisions and are provided without warranty as
62	   described in the Simplified BSD License.

64	Table of Contents

66	   1.  Introduction
67	   2.  Solution Overview
68	   3.  Goals and Requirements
69	   4.  Definition Of Terms
70	   5.  Example Topologies
71	   6.  Controlling and selecting ttl behavior
72	   7.  Introducing a ttl copyin flag in the encapsulation header
73	   8.  Encapsulation Behavior
74	   9.  Decapsulating Behavior
75	   10. Other ICMP errors
76	   11. Security Considerations
77	   12. IANA Considerations
78	   13. Acknowledgements
79	   14. References
80	     14.1.  Normative References
81	     14.2.  Informative References
82	   Authors' Addresses

84	1.  Introduction

86	   Tools like traceroute have been very valuable for the operation of
87	   the Internet.  Part of that value comes from being able to display
88	   information about routers and paths over which the user of the tool
89	   has no control, but the traceroute output can be passed along to
90	   someone else that can further investigate or fix the problem.  The
91	   output of traceroute can be included in an email or a trouble ticket
92	   to report the problem.  This provides a lot more information than the
93	   mere indication that A can't communicate with B, in particular when
94	   the failures are transient.  The ping tool provides some of the same
95	   benefits in being able to return ICMP errors such as host unreachable
96	   messages.

98	   This document shows how those tools can be used to gather information
99	   for both the overlay and underlay parts of an end-to-end path by
100	   providing the option to have some packets use a uniform time-to-live
101	   (ttl) model for the tunnels, and associated ICMP error handling.
102	   These changes are limited to the tunnel ingress and egress points.

104	   The desire to make traceroute provide useful information for overlay
105	   network is not an argument against also using a layered approach for
106	   OAM as specified in e.g., [I-D.tissa-lime-yang-oam-model].  Such
107	   approaches are quite appropriate for continuous monitoring at
108	   different layers and across different domains.  A layer transcending
109	   traceroute complements the ability to do layered and/or continuous
110	   monitoring.

112	   The traceroute tool relies on receiving ICMP errors [RFC0792] in
113	   combination with using different IP time-to-live values.  That
114	   results in the packet making it further and further towards the
115	   destination with ICMP ttl exceeded errors being received from each
116	   hop.  That provides the user the working path even if the packets are
117	   black holed eventually, and also provides any errors like ICMP host
118	   unreachable.  The fundamental assumption is that the ttl is
119	   decremented for each hop and that the resulting ICMP ttl exceeded
120	   errors are delivered back to the host.

122	   When some encapsulation is used to tunnel packets there is an
123	   architectural question of how those tunnels should be viewed from the
124	   rest of the network.  Different models were described first for
125	   diffserv in [RFC2983] and then applied to MPLS in [RFC3270] and
126	   expanded to MPLS ttl handling in [RFC3443] and those models apply to
127	   other forms of direct or indirect IP in IP tunnels.  Those RFCs
128	   define two models for ttl that are of interest to us:

130	   o  A pipe model, where the tunnel is invisible to the rest of the
131	      network in that it looks like a direct connection between the
132	      tunnel ingress and egress.

134	   o  A uniform model, where the ttl decrements uniformly for hops
135	      outside and inside the tunnel.

137	   The tunneling mechanisms discussed in NVO3 (such as VXLAN [RFC7348],
138	   NVGRE [I-D.sridharan-virtualization-nvgre], GENEVE
139	   [I-D.gross-geneve], and GUE [I-D.herbert-gue]) have either been
140	   specified to provide the pipe model of a tunnel or are silent on the
141	   setting of the outer ttl.  Those protocols can be extended to have an
142	   optional uniform tunnel model when the payload is IP, following the
143	   same model as in [RFC3443].  Note that these encapsulations carry
144	   Ethernet frames hence are not even aware that the payload is IP.
145	   However, IP is the bulk of what is carried over such tunnels and the
146	   ingress NVE can inspect the IP part of the Ethernet frame.

148	   However, for general application traffic the pipe model is fine and
149	   might even be expected by some applications.  In general, when the
150	   source and destination IP are in the same IP subnet the ttl should
151	   not be decremented.  Thus it makes sense to have a way to selectively
152	   enable the uniform model perhaps based on some method to identify
153	   packets associated with traceroute or some marker in the packet
154	   itself that the traceroute tool can set.

156	2.  Solution Overview

158	   The pieces needed to accomplish this are:

160	   o  One or more ways to select the uniform model packets at the tunnel
161	      ingress.

163	   o  Tunnel ingress copying out the original ttl from a selected packet
164	      to the outer IP header, and then doing a check and decrement of
165	      that ttl.

167	   o  If that ttl check results in ttl expiry at the tunnel ingress,
168	      then deliver an ICMP ttl exceeded packet back to the host.

170	   o  A mechanism by which the tunnel egress knows which packets should
171	      have uniform model, for instance a bit in the encapsulation
172	      header.

174	   o  The tunnel egress copying in the ttl (for identified packets) from
175	      the outer header to the inner IP header, then doing a check and
176	      decrement of that ttl.

178	   o  If ttl check results in ttl expiry at the tunnel egress, then
179	      deliver an ICMP error back to the original host (or, perhaps
180	      better, to tunnel ingress the same way as underlay routers do).

182	   o  IP routers in the underlay will deliver any ICMP errors to the
183	      source IP address of the packet.  For tunneled packets that will
184	      be the tunnel ingress.  Hence the tunnel ingress needs to be able
185	      to take such ICMP errors and form corresponding ICMP errors that
186	      are sent back to the host.  The requirement in [RFC1812] ensures
187	      that the ICMP errors will contain enough headers to form such an
188	      ICMP error.  It has been noted that there are routers in the
189	      Internet which decades later fail to conform to that aspect of
190	      [RFC1812].

192	   The idea to reflect (some) ICMP errors from inside a tunnel back to
193	   the original source goes back to IPv6 in IPv4 encapsulation as
194	   specified in [RFC1933] and [RFC2473].  However, those RFCs did not
195	   advocate using a uniform ttl model for the tunnels but did handle
196	   ICMP packet too big and other unreachable messages.  Those RFCs
197	   specify how to reflect ICMP errors received from underlay routers to
198	   ICMP errors sent to the original host.  The addition of handling ICMP
199	   ttl exceeded errors for uniform tunnel model is straightforward.

201	   The information carried in the ICMP errors is quite limited - the
202	   original packet plus an ICMP type and code.  However, there are
203	   extension mechanisms specified in [RFC4884] and used for MPLS in
204	   [RFC4950] which include TLVs with additional information.  If there
205	   is additional information to include for overlay networks that
206	   information could be added by defining new ICMP Extension Objects
207	   based on [RFC4884].  Such extensions are for further study.

209	3.  Goals and Requirements

211	   The following goals and requirements apply:

213	   o  No changes needed in the underlay.

215	   o  Optional changes on the decapsulating end.

217	   o  ECMP friendly.  If the underlay employs equal cost multipath
218	      routing then one should be able to use this mechanism to trace the
219	      same path as a given TCP or UDP flow is using.  In addition, one
220	      should be able to explore different ECMP paths by varying the IP
221	      addresses and port numbers in the packets originated by traceroute
222	      on the host.

224	   o  Provide output which makes it possible to compare a regular
225	      overlay traceroute with the layer-transcending output.

227	4.  Definition Of Terms

229	   The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
230	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
231	   document are to be interpreted as described in [RFC2119].

233	   Terms such as NVE and TS are used as specified in
234	   [RFC7365]:

236	   o  Network Virtualization Edge (NVE): An NVE is the network entity
237	      that sits at the edge of an underlay network and implements L2
238	      and/or L3 network virtualization functions.

240	   o  Tenant System (TS): A physical or virtual system that can play the
241	      role of a host or a forwarding element such as a router, switch,
242	      firewall, etc.

244	   o  Virtual Access Points (VAPs): A logical connection point on the
245	      NVE for connecting a Tenant System to a virtual network.

247	   o  Virtual Network (VN): A VN is a logical abstraction of a physical
248	      network that provides L2 or L3 network services to a set of Tenant
249	      Systems.

251	   o  Virtual Network Context (VN Context) Identifier: Field in an
252	      overlay encapsulation header that identifies the specific VN the
253	      packet belongs to.

255	   We use the VTEP term in [RFC7348] as synonymous with NVE, and VNI as
256	   synonymous to VN Context Identifier.

258	5.  Example Topologies

260	   The following example topologies illustrate different cases where we
261	   want a tracing capability.  The examples are for overlay technologies
262	   such as VXLAN which provide a layer 2 overlay on IP.  The cases for
263	   layer 3 overlay on top of IP are simpler and not shown in this
264	   document.

266	   The VXLAN term VTEP is used as synonymous to NVO3's NVE term.

268	   -----------                -----------
269	   |    H1   |                |    H2   |
270	   | 1.0.1.1 |                | 1.0.1.2 |
271	   |         |                |         |
272	   -----------                -----------
273	        |                          |
274	        |                          |
275	   -----------   -----------  -----------
276	   |  VtepA  |   |    R1   |  |  VtepB  |
277	   | 2.0.1.1 | --| 2.0.1.2 |  | 2.0.2.1 |
278	   |         |   | 2.0.2.2 |--|         |
279	   -----------   -----------  -----------

281	                             Simple L2 overlay

283	   The figure above shows two hosts connected using an underlay which
284	   provides a layer two service.  Thus H1 and H2 are in the same subnet
285	   and unaware of the existence of the underlay.  Thus a normal ping or
286	   traceroute would not be able to provide any information about the
287	   nature of a failure; either packets get through or they do not.  When
288	   the packets get through traceroute would output something like:

290	   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
291	    1  1.0.1.2 (1.0.1.2)  1.104 ms  1.235 ms  1.729 ms

293	   In this case it would be desirable to be able to traceroute from H1
294	   to H2 (and vice versa) and observe VtepA, R1, VtepB and H2.  Thus in
295	   the case of packets getting through traceroute would output:

297	   traceroute to 1.0.1.2 (1.0.1.2), 30 hops max, 60 byte packets
298	    1  2.0.1.1 (2.0.1.1)  1.104 ms  1.235 ms  1.729 ms
299	    2  2.0.1.2 (2.0.1.2)  2.106 ms  2.007 ms  2.156 ms
300	    3  2.0.2.1 (2.0.2.1)  35.034 ms  24.490 ms  21.626 ms
301	    4  1.0.1.2 (1.0.1.2)  40.830 ms  44.694 ms  75.620 ms

303	   Note that the underlay and overlay might exist in completely separate
304	   addressing domains.  Thus H1 might not be able to reach any of the
305	   underlay addresses.  And the underlay IP addresses might overlap the
306	   overlay IP addresses.  For example, it would be completely valid to
307	   see VtepA having the same IP address as H1.  The user of this
308	   tool needs to understand that the utility of the traceroute output is
309	   to get information to determine whether the issue is in the underlay
310	   or overlay, and be able to pass the underlay information to the
311	   operator of the underlay.

313	   In overlay networks without any ARP/ND optimizations ARP/ND packets
314	   would be flooded between the tunnel endpoints.  Thus if there is some
315	   communication failure between H1 and H2, then H1 above might not have
316	   an ARP entry for H2.  This results in traceroute not being able to
317	   output any data.  This implies that in order to use traceroute to
318	   troubleshoot the issue one would need some workaround, such as
319	   installing some temporary ARP entries on the hosts.

321	   -----------                -----------  -----------  -----------
322	   |    H1   |                |    R2   |  |    R3   |  |    H4   |
323	   | 1.0.1.1 |                | 1.0.2.2 |--| 1.0.2.3 |  |         |
324	   |         |                | 1.0.1.2 |  | 1.0.3.3 |--| 1.0.3.4 |
325	   -----------                -----------  -----------  -----------
326	        |                          |
327	        |                          |
328	   -----------   -----------  -----------
329	   |  VtepA  |   |    R1   |  |  VtepB  |
330	   | 2.0.1.1 | --| 2.0.1.2 |  | 2.0.2.1 |
331	   |         |   | 2.0.2.2 |--|         |
332	   -----------   -----------  -----------

334	                   L2 overlay as part of larger network

336	   The figure above has an overlay router as the next hop seen by H1.  In
337	   this case a normal overlay traceroute would be able to display the
338	   overlay path, i.e.,

340	   traceroute to H4, 30 hops max, 60 byte packets
341	    1  R2
342	    2  R3
343	    3  H4

345	   The layer-transcending traceroute would show the combination of the
346	   underlay and overlay paths i.e.,

348	   traceroute to H4, 30 hops max, 60 byte packets
349	    1  VtepA
350	    2  R1
351	    3  VtepB
352	    4  R2
353	    5  R3
354	    6  H4
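
   The difference between the two outputs above can be reproduced with a
   small simulation.  The following Python sketch is illustrative only
   and not part of the specification; the PATH list, the in_tunnel
   marking and the function name are assumptions made for this example.
   It models the pipe model as tunnel hops leaving the host-visible ttl
   untouched, and the uniform model as every hop decrementing it.

```python
# Forwarding path from H1 to H4 in the figure above.
# Each entry is (name, in_tunnel); in_tunnel marks NVEs and underlay routers.
PATH = [
    ("VtepA", True),
    ("R1", True),
    ("VtepB", True),
    ("R2", False),
    ("R3", False),
    ("H4", False),
]


def traceroute(path, uniform):
    """Return the names of the hops that answer successive probes.

    With uniform=False (pipe model) the tunnel hops never touch the
    ttl seen by the host, so they never answer a probe.
    """
    answers = []
    destination = path[-1][0]
    for probe_ttl in range(1, len(path) + 1):
        ttl = probe_ttl
        for name, in_tunnel in path:
            if in_tunnel and not uniform:
                continue  # pipe model: hop is invisible to this ttl
            ttl -= 1
            if ttl == 0:
                answers.append(name)  # this hop answers the probe
                break
        if answers and answers[-1] == destination:
            break  # the destination has been reached
    return answers


if __name__ == "__main__":
    print("pipe:   ", traceroute(PATH, uniform=False))
    print("uniform:", traceroute(PATH, uniform=True))
```

   In this sketch the pipe model yields the three overlay hops and the
   uniform model yields all six hops, in the same order as the two
   traceroute listings above.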

356	   -----------             -------------------             -----------
357	   |    H1   |             |       R5        |             |    H6   |
358	   | 1.0.1.1 |             |                 |             |         |
359	   |         |             | 1.0.1.2 1.0.5.5 |             | 1.0.5.6 |
360	   -----------             |-----------------|             -----------
361	        |                  |    |       |    |                  |
362	        |                  |    |       |    |                  |
363	   ----------- ----------- |-----------------| ----------- -----------
364	   | VtepA   | |   R1    | |  VtepB    VtepC | |   R6    | |  VtepD  |
365	   | 2.0.1.1 |-| 2.0.1.2 | | 2.0.2.1 3.0.1.1 |-| 3.0.1.2 | |         |
366	   |         | | 2.0.2.2 |-|                 | | 3.0.2.2 |-| 3.0.3.1 |
367	   ----------- ----------- ------------------- ----------- -----------

369	                       Multiple L2 overlays in path

371	   The figure above has multiple overlay network segments that are
372	   connected in one router which provides the tunnel endpoints for both
373	   overlay segments plus routing for the overlay.  A more general
374	   picture would be to have an overlay routed path between the two NVEs
375	   e.g., VtepB and VtepC connected to different routers in the overlay.
376	   However, such a drawing in ASCII art doesn't fit on the page.

378	   A normal overlay traceroute in the above topology would show the
379	   overlay router i.e.,

381	   traceroute to H6, 30 hops max, 60 byte packets
382	    1  R5
383	    2  H6

385	   The layer-transcending traceroute would show the combination of the
386	   underlay and overlay paths i.e.,

388	   traceroute to H6, 30 hops max, 60 byte packets
389	    1  VtepA
390	    2  R1
391	    3  VtepB
392	    4  R5
393	    5  VtepC
394	    6  R6
395	    7  VtepD
396	    8  H6

398	   Note that the R5 device, which includes VtepB and VtepC, appears as
399	   three hops in the traceroute output.  That is needed to be able to
400	   correlate the output with the overlay output which has R5.  That
401	   correlation would be hard if the R5 device only appeared as VtepB in
402	   the layer-transcending traceroute (LTTON) output.  The three-hop
403	   representation also stays invariant whether or not the NVEs and
404	   overlay router are implemented by a single device or multiple devices.

406	6.  Controlling and selecting ttl behavior

408	   The network admin needs to be able to control who can use the layer
409	   transcending traceroute, since the operator might not want to
410	   disclose the underlay topology to all its users all the time.  There
411	   are different approaches for this such as designating particular
412	   ports (Virtual Access Points in NVO3 terminology) on a NVE to have
413	   uniform ttl tunnel model.  We have found it useful to be able to
414	   enable this capability on a per port and/or virtual network basis, in
415	   addition to having a global setting per NVE.

417	   When enabled on the NVEs the user on the TS needs to be able to
418	   control which traffic is subject to which tunnel mode.  The normal
419	   traffic would use the pipe ttl tunnel model and only explicit trace
420	   applications are likely to want to use the uniform ttl tunnel model.
421	   Hence it makes sense to use some marker in the packets sent by the TS
422	   to select those packets for uniform model on the NVE.  Such a
423	   mechanism should be usable so that the user can perform both a regular
424	   traceroute and a LTTON.

426	   Potentially different fields in the packets originated by traceroute
427	   on the TS can be used to mark the packets for uniform ttl tunnel
428	   model.  However, many of those fields such as source and destination
429	   port numbers and protocol might be used in hashing for ECMP.  The
430	   marking that can be used without impacting ECMP is the DSCP field in
431	   the packet.  That field can be set with an option (--tos) in at least
432	   some existing traceroute implementations.

434	   Note that when DSCP is used for such marking it is a configured
435	   choice subject to agreement between the operator of the TS and NVE.
436	   The matching on the NVE should ignore the ECN bits so as not to
437	   interfere with ECN.

439	   However, the DSCP value used in the overlay might have an impact on
440	   the forwarding of the packets.  In such a case one can use an
441	   alternative selector such as the UDP source port number.  That has
442	   the downside of affecting the hash values used for ECMP and link
443	   aggregation port selection.
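
   The DSCP-based selection described in this section can be sketched
   as follows.  This is a minimal Python illustration, not a normative
   algorithm: the value TRACE_DSCP is a hypothetical configured choice
   agreed between the TS and NVE operators, and the function names are
   assumptions of this example.  The match looks only at the upper six
   bits of the IPv4 TOS (or IPv6 Traffic Class) octet, so the two ECN
   bits are ignored.

```python
TRACE_DSCP = 43  # hypothetical DSCP value configured to mean "uniform model"


def tos_for_dscp(dscp):
    """Return the TOS octet carrying the given 6-bit DSCP with ECN bits zero."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value")
    return dscp << 2


def selects_uniform_model(tos_octet, trace_dscp=TRACE_DSCP):
    """True if the DSCP part of tos_octet equals trace_dscp.

    The low two bits (ECN) are shifted away before comparing, so an
    ECN-marked probe still matches.
    """
    return (tos_octet >> 2) == trace_dscp


if __name__ == "__main__":
    tos = tos_for_dscp(TRACE_DSCP)
    print("TOS octet to request on the probe:", tos)
    print("matches with ECN bits set:", selects_uniform_model(tos | 0x03))
```

   Traceroute implementations that accept a TOS value typically take the
   whole octet, so under these assumptions the DSCP value would be
   shifted left by two bits before being passed to the tool.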

445	7.  Introducing a ttl copyin flag in the encapsulation header

447	   When this approach is applied to VXLAN [RFC7348] the decapsulating
448	   NVE has to be able to identify packets that have to be processed in
449	   the uniform ttl tunnel model way.  For that purpose we define a new
450	   flag which is sent by the encapsulating NVE on selected packets, and
451	   is used by the decapsulating NVE to perform the ttl copyin, decrement
452	   and check.

454	   In addition to the one I-flag defined in [RFC7348] we define a new
455	   T-flag to capture the trace behavior at the decapsulating tunnel
456	   endpoint.

458	      0                   1                   2                   3
459	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
460	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
461	      |R|R|R|R|I|R|R|T|            Reserved                           |
462	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
463	      |                VXLAN Network Identifier (VNI) |   Reserved    |
464	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

466	   New fields:

468	   T-flag:        When set indicates that decapsulator should take the
469	                  outer ttl and copy it to the inner ttl, and then check
470	                  and decrement the resulting ttl.
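
   As an illustration of the header layout above, the following Python
   sketch builds and parses the 8-byte VXLAN header with the I-flag and
   the proposed T-flag.  The bit positions are read off the diagram (the
   I-flag is the fifth bit and the T-flag the eighth bit of the first
   octet); the function names are assumptions of this example and the
   code is not a complete VXLAN implementation.

```python
import struct

I_FLAG = 0x08  # VNI-valid flag defined in RFC 7348
T_FLAG = 0x01  # trace flag proposed in this section (last bit of first octet)


def build_vxlan_header(vni, trace=False):
    """Return the 8-byte VXLAN header for vni, optionally with the T-flag."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI is a 24-bit value")
    flags = I_FLAG | (T_FLAG if trace else 0)
    # Word 0: flags octet followed by 24 reserved bits.
    # Word 1: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", flags << 24, vni << 8)


def parse_vxlan_header(header):
    """Return (vni_valid, trace, vni) decoded from a VXLAN header."""
    word0, word1 = struct.unpack("!II", header[:8])
    flags = word0 >> 24
    return bool(flags & I_FLAG), bool(flags & T_FLAG), word1 >> 8


if __name__ == "__main__":
    hdr = build_vxlan_header(5000, trace=True)
    print(hdr.hex(), parse_vxlan_header(hdr))
```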

472	8.  Encapsulation Behavior

474	   If the uniform ttl model is enabled for the input, and the received
475	   naked packet matches the selector, then the ingress NVE will perform
476	   these additional operations as part of encapsulating an IPv4 or IPv6
477	   packet:

479	   o  Examine the IPv4 TTL (or IPv6 hopcount, respectively) on receipt
480	      and if 1 or less, then drop the packet and send an ICMPv4 (or
481	      ICMPv6) ttl exceeded back to the original host.  Since the NVE is
482	      operating on a L2 packet, it might not have any layer 3 interfaces
483	      or routes for the originating host.  Thus it sends the packet back
484	      to the source L2 address of the packet back out the ingress port -
485	      without any IP address lookup.

487	   o  If ttl did not expire, then decrement the above ttl/hopcount and
488	      place it in the outer IP header.  Encapsulate and send the packet
489	      as normal.

491	   o  If some other errors prevent sending the packet (such as unknown
492	      VN Context Id, no flood list configured), then the NVE SHOULD send
493	      an ICMP host unreachable back to the host.
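
   The three steps above can be summarized in a short decision function.
   The Python sketch below is illustrative only; the function name, the
   can_forward parameter and the action strings are assumptions of this
   example, and the actual ICMP generation and encapsulation are left
   out.

```python
def ingress_action(inner_ttl, can_forward=True):
    """Decide what the ingress NVE does with a selected IPv4/IPv6 packet.

    inner_ttl is the IPv4 TTL or IPv6 hop count of the received packet.
    can_forward is False when some other error prevents sending, such
    as an unknown VN Context Id or no flood list configured.
    """
    if inner_ttl <= 1:
        # ttl expires here: answer the host via the source L2 address
        # and ingress port, without any IP address lookup.
        return {"action": "icmp_ttl_exceeded_to_host"}
    if not can_forward:
        return {"action": "icmp_host_unreachable_to_host"}
    # Copy the decremented ttl to the outer header and mark the packet
    # so that the egress NVE applies the uniform model as well.
    return {"action": "encapsulate", "outer_ttl": inner_ttl - 1, "t_flag": True}


if __name__ == "__main__":
    for ttl in (1, 2, 64):
        print(ttl, ingress_action(ttl))
```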

495	   The ingress NVE will receive ICMP errors from underlay routers and
496	   the egress NVE; whether due to ttl exceeded or underlay issues such
497	   as host unreachable, or packet too big errors.  The NVE should take
498	   such errors, and in addition to any local syslog etc, generate an
499	   ICMP error sent back to the host.  The principle for this is
500	   specified in [RFC1933] and [RFC2473].  Just like in those
501	   specifications, the inner and outer IP headers could be of
502	   different versions.  A common case of that might be an IPv6 overlay
503	   with an IPv4 underlay.  That case requires some changes in the ICMP
504	   type and code values in addition to recreating the packets.  The
505	   place where LTTON differs from those specifications is that there is
506	   an NVO3 header and (for L2 over L3) an L2 header in the packet.

508	   The figures below show an example of ICMP header re-generation at
509	   VtepA for the case of IPv6 overlay with IPv4 underlay.  The case of
510	   IPv4 over IPv4 is similar and simpler since the ICMP header is the
511	   same for both overlay and underlay.  The example uses VXLAN
512	   encapsulation to provide the concrete details, but the approach
513	   applies to other NVO3 proposals.

515	                +--------------+
516	                | IPv4 Header  |
517	                | src = R1     |
518	                | dst = VtepA  |
519	                +--------------+
520	                |    ICMPv4    |
521	                |    Header    |
522	                |   type = X   |
523	                |   code = Y   |
524	         - -    +--------------+
525	                | IPv4 Header  |
526	                | src = VtepA  |
527	        IPv4    | dst = VtepB  |
528	                +--------------+
529	       Packet   |     UDP      |
530	                | dst = VXLAN  |
531	         in     +--------------+
532	                |   Ethernet   |
533	       Error    | DA = H2 mac  |
534	                | SA = H1 mac  |
535	                +--------------+   - -
536	                |    IPv6      |
537	                | src = H1 ipv6|
538	                | dst = H2 ipv6|   Original IPv6
539	                +--------------+   Packet.
540	                |  Transport   |   Used to
541	                |    Header    |   generate an
542	                +--------------+   ICMPv6
543	                |              |   error message
544	                ~     Data     ~   back to the source.
545	                |              |
546	         - -    +--------------+   - -

548	            ICMPv4 Error Message Returned to Encapsulating Node

550	   The above underlay ICMPv4 is used to form an overlay ICMPv6 packet by
551	   extracting the Ethernet DA from the inner Ethernet SA, and forming an
552	   IPv6 header where the source address is based on the source address
553	   of the ICMPv4 error.  The ICMPv6 type and code values are set based
554	   on the ICMPv4 type and code values.

556	                +--------------+
557	                |   Ethernet   |
558	                | DA = H1 mac  |   From ICMPv4 packet
559	                | SA = VtepA   |   in error
560	                +--------------+
561	                | IPv6 Header  |
562	                | src = ::R1   |   96 zeros + IPv4 address
563	                | dst = H1 ipv6|
564	                +--------------+
565	                |    ICMPv6    |
566	                |    Header    |
567	                |   type = X'  |   Type and code mapped
568	                |   code = Y'  |   from v4 to v6 values
569	         - -    +--------------+   - -
570	                |    IPv6      |
571	        IPv6    | src = H1 ipv6|
572	                | dst = H2 ipv6|   Unmodified from
573	       Packet   +--------------+   ICMPv4 error
574	                |  Transport   |
575	         in     |    Header    |
576	                +--------------+
577	       Error    |              |
578	                ~     Data     ~
579	                |              |
580	         - -    +--------------+   - -

582	             Generated ICMPv6 Error Message for Overlay Source

584	   In the case of IPv6 over IPv4 the above example setting of the IPv6
585	   source address results in this type of traceroute output:

587	   traceroute to 2000:0:0:40::2, 30 hops max, 80 byte packets
588	    1  ::2.0.1.1 (::2.0.1.1)  1.231 ms  1.004 ms  1.126 ms
589	    2  ::2.0.1.2 (::2.0.1.2)  1.994 ms  2.301 ms  2.016 ms
590	    3  ::2.0.2.1 (::2.0.2.1)  18.846 ms  30.582 ms  19.776 ms
591	    4  2000:0:0:40::2 (2000:0:0:40::2)  48.964 ms  60.131 ms  53.895 ms
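
   The construction of the overlay ICMPv6 error described in this
   section can be sketched as follows.  This is a minimal illustration,
   not a normative algorithm: the function names and the contents of
   the type/code table are assumptions made for the example (this
   section only states that the values are mapped), and only the
   "96 zeros + IPv4 address" source address rule is taken directly
   from the figure above.

```python
import ipaddress

# Assumed (ICMPv4 type, code) -> (ICMPv6 type, code) entries, for illustration
# only; this section does not enumerate the mapping.
ICMP4_TO_ICMP6 = {
    (11, 0): (3, 0),  # TTL exceeded in transit -> hop limit exceeded in transit
    (3, 1): (1, 3),   # host unreachable -> address unreachable
    (3, 4): (2, 0),   # fragmentation needed -> packet too big
}


def overlay_source(underlay_src: str) -> ipaddress.IPv6Address:
    """Form the overlay IPv6 source as 96 zero bits followed by the IPv4 address."""
    v4 = ipaddress.IPv4Address(underlay_src)
    return ipaddress.IPv6Address(int(v4))


def map_type_code(v4_type: int, v4_code: int):
    """Return the mapped ICMPv6 (type, code), or None if the table has no entry."""
    return ICMP4_TO_ICMP6.get((v4_type, v4_code))
```

   With the underlay router address 2.0.1.1 from the traceroute output
   above, overlay_source("2.0.1.1") yields the address written there as
   ::2.0.1.1.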

593	9.  Decapsulating Behavior

595	   If this uniform ttl model is enabled on the decapsulating NVE, and
596	   the overlay header indicates that the uniform ttl model applies (the
597	   T-bit in the case of VXLAN), then the NVE will perform these
598	   additional operations as part of decapsulating a packet where the
599	   inner packet is an IPv4 or IPv6 packet:

601	   o  Examine the outer IPv4 TTL (or outer IPv6 hop limit,
602	      respectively) on receipt and, if it is 1 or less, drop the
603	      packet and send an outer ICMPv4 (or ICMPv6) ttl exceeded error
604	      back to the source of the outer packet, i.e., the ingress NVE.
605	      This ICMP packet should look the same as an ICMP error generated
606	      by an underlay router, and the requirement in [RFC1812] on the
607	      size of the packet in error applies.

609	   o  If the ttl did not expire, then decrement the above ttl/hop limit
610	      and place it in the inner IP header.  If the inner IP header is
611	      IPv4, then update the IPv4 header checksum.  Then decapsulate and
612	      send the packet as for other decapsulated packets.

614	   o  If some other error prevents sending the packet (such as an
615	      unknown VN Context Id), then the NVE SHOULD send an ICMP host
616	      unreachable error instead of a ttl exceeded error.
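
   The first two steps above can be sketched as follows for an inner
   IPv4 packet.  This is an illustrative sketch under stated
   assumptions: the function names and the returned action strings are
   hypothetical, the input is assumed to be just the inner IPv4 header
   bytes, and the generation of the ICMP error itself is left out.  An
   inner IPv6 packet would only need its hop limit field rewritten,
   since IPv6 has no header checksum.

```python
def ipv4_header_checksum(header: bytes) -> int:
    """One's-complement checksum over the header's 16-bit words.

    The caller zeroes the checksum field first when computing a new value.
    """
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def decap_uniform_ttl(outer_ttl: int, inner_ipv4_header: bytes):
    """Return ("icmp_ttl_exceeded", None) or ("forward", updated_header)."""
    if outer_ttl <= 1:
        # ttl expired at the decapsulating NVE: drop and report to ingress NVE.
        return ("icmp_ttl_exceeded", None)
    hdr = bytearray(inner_ipv4_header)
    hdr[8] = outer_ttl - 1                  # TTL field is at byte offset 8
    hdr[10:12] = b"\x00\x00"                # zero the checksum before recomputing
    hdr[10:12] = ipv4_header_checksum(bytes(hdr)).to_bytes(2, "big")
    return ("forward", bytes(hdr))
```

   A header produced this way verifies in the usual manner: running
   ipv4_header_checksum over the updated header, checksum field
   included, gives zero.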

618	10.  Other ICMP errors

620	   The technique for selecting ttl behavior specified in this draft can
621	   also be used to trigger other ICMPv4 and ICMPv6 errors.  For example,
622	   [RFC1933] specifies how ICMP packet too big from underlay routers can
623	   be used to report overlay ICMP packet too big errors to the original
624	   source.  Other errors that are more specific to the overlay protocol
625	   might also be useful, such as not being able to find a VNI for the
626	   incoming port and VLAN, or not being able to flood the packet if the
627	   packet is a Broadcast, Unknown unicast, or Multicast packet.
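
   As an illustration of the packet too big case, the MTU reported to
   the overlay source has to account for the encapsulation overhead.
   The sketch below assumes VXLAN encapsulation with no outer IP
   options or extension headers and no inner 802.1Q tag; the function
   name and the overhead table are hypothetical and are not taken from
   this draft or from [RFC1933].

```python
# Assumed per-packet overhead in bytes: outer IP header + UDP + VXLAN header
# + inner Ethernet header (no inner VLAN tag).
ENCAP_OVERHEAD = {
    4: 20 + 8 + 8 + 14,   # IPv4 underlay
    6: 40 + 8 + 8 + 14,   # IPv6 underlay
}


def overlay_mtu(underlay_mtu: int, outer_ip_version: int) -> int:
    """Return the inner IP MTU to report in the overlay packet too big error."""
    mtu = underlay_mtu - ENCAP_OVERHEAD[outer_ip_version]
    if mtu <= 0:
        raise ValueError("underlay MTU is smaller than the encapsulation overhead")
    return mtu
```

   For example, an underlay router reporting a next-hop MTU of 1500
   over an IPv4 underlay would lead to an overlay MTU of 1450 under
   these assumptions.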

629	11.  Security Considerations

631	   The considerations in [I-D.ietf-nvo3-security-requirements] apply.

633	   In addition, the use of the uniform ttl tunnel model will result in
634	   ICMP errors being generated by underlay routers and consumed by NVEs.
635	   That presents an attack vector which does not exist in a pipe ttl
636	   tunnel model.  However, ICMP errors should be rate limited [RFC1812].
637	   Implementations should also take appropriate measures in rate
638	   limiting the input rate for ICMP errors that are processed by limited
639	   CPU resources.

641	   Some implementations might handle the trace packets (with uniform ttl
642	   model) in software while the pipe ttl model packets can be handled in
643	   hardware.  In such a case the implementation should have mechanisms
644	   to avoid starvation of limited CPU resources due to these packets.
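
   One common way to bound the rate at which ICMP errors are accepted
   for CPU processing is a token bucket.  The sketch below is only an
   illustration of that idea; the class name, parameters, and the
   choice of a token bucket are assumptions and not requirements of
   this draft.  Time is passed in explicitly so the behavior is
   deterministic.

```python
class TokenBucket:
    """Admit at most `rate` events per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum number of stored tokens
        self.tokens = burst     # start with a full bucket
        self.last = 0.0         # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if an ICMP error arriving at time `now` may be processed."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False            # over the limit: drop without CPU processing
```

   An NVE could keep one such bucket for received underlay ICMP errors
   and another for trace packets punted to software, so that neither
   can starve the CPU.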

646	12.  IANA Considerations

648	   TBD

650	13.  Acknowledgements

652	   The authors acknowledge the helpful comments from David Black and
653	   Diego Garcia del Rio.

655	14.  References

657	14.1.  Normative References

659	   [RFC0792]  Postel, J., "Internet Control Message Protocol", STD 5,
660	              RFC 792, DOI 10.17487/RFC0792, September 1981,
661	              <http://www.rfc-editor.org/info/rfc792>.

663	   [RFC1812]  Baker, F., Ed., "Requirements for IP Version 4 Routers",
664	              RFC 1812, DOI 10.17487/RFC1812, June 1995,
665	              <http://www.rfc-editor.org/info/rfc1812>.

667	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
668	              Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/
669	              RFC2119, March 1997,
670	              <http://www.rfc-editor.org/info/rfc2119>.

672	   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
673	              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
674	              eXtensible Local Area Network (VXLAN): A Framework for
675	              Overlaying Virtualized Layer 2 Networks over Layer 3
676	              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014,
677	              <http://www.rfc-editor.org/info/rfc7348>.

679	   [RFC7365]  Lasserre, M., Balus, F., Morin, T., Bitar, N., and Y.
680	              Rekhter, "Framework for Data Center (DC) Network
681	              Virtualization", RFC 7365, DOI 10.17487/RFC7365,
682	              October 2014, <http://www.rfc-editor.org/info/rfc7365>.

684	14.2.  Informative References

686	   [I-D.gross-geneve]
687	              Gross, J., Sridhar, T., Garg, P., Wright, C., Ganga, I.,
688	              Agarwal, P., Duda, K., Dutt, D., and J. Hudson, "Geneve:
689	              Generic Network Virtualization Encapsulation",
690	              draft-gross-geneve-02 (work in progress), October 2014.

692	   [I-D.herbert-gue]
693	              Herbert, T., Yong, L., and O. Zia, "Generic UDP
694	              Encapsulation", draft-herbert-gue-03 (work in progress),
695	              March 2015.

697	   [I-D.ietf-nvo3-security-requirements]
698	              Hartman, S., Zhang, D., Wasserman, M., Qiang, Z., and M.
699	              Zhang, "Security Requirements of NVO3",
700	              draft-ietf-nvo3-security-requirements-06 (work in
701	              progress), December 2015.

703	   [I-D.sridharan-virtualization-nvgre]
704	              Garg, P. and Y. Wang, "NVGRE: Network Virtualization using
705	              Generic Routing Encapsulation",
706	              draft-sridharan-virtualization-nvgre-08 (work in
707	              progress), April 2015.

709	   [I-D.tissa-lime-yang-oam-model]
710	              Senevirathne, T., Finn, N., Kumar, D., Salam, S., Wu, Q.,
711	              and Z. Wang, "Generic YANG Data Model for Operations,
712	              Administration, and Maintenance (OAM)",
713	              draft-tissa-lime-yang-oam-model-06 (work in progress),
714	              August 2015.

716	   [RFC1933]  Gilligan, R. and E. Nordmark, "Transition Mechanisms for
717	              IPv6 Hosts and Routers", RFC 1933, DOI 10.17487/RFC1933,
718	              April 1996, <http://www.rfc-editor.org/info/rfc1933>.

720	   [RFC2473]  Conta, A. and S. Deering, "Generic Packet Tunneling in
721	              IPv6 Specification", RFC 2473, DOI 10.17487/RFC2473,
722	              December 1998, <http://www.rfc-editor.org/info/rfc2473>.

724	   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
725	              RFC 2983, DOI 10.17487/RFC2983, October 2000,
726	              <http://www.rfc-editor.org/info/rfc2983>.

728	   [RFC3270]  Le Faucheur, F., Wu, L., Davie, B., Davari, S., Vaananen,
729	              P., Krishnan, R., Cheval, P., and J. Heinanen, "Multi-
730	              Protocol Label Switching (MPLS) Support of Differentiated
731	              Services", RFC 3270, DOI 10.17487/RFC3270, May 2002,
732	              <http://www.rfc-editor.org/info/rfc3270>.

734	   [RFC3443]  Agarwal, P. and B. Akyol, "Time To Live (TTL) Processing
735	              in Multi-Protocol Label Switching (MPLS) Networks",
736	              RFC 3443, DOI 10.17487/RFC3443, January 2003,
737	              <http://www.rfc-editor.org/info/rfc3443>.

739	   [RFC4884]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro,
740	              "Extended ICMP to Support Multi-Part Messages", RFC 4884,
741	              DOI 10.17487/RFC4884, April 2007,
742	              <http://www.rfc-editor.org/info/rfc4884>.

744	   [RFC4950]  Bonica, R., Gan, D., Tappan, D., and C. Pignataro, "ICMP
745	              Extensions for Multiprotocol Label Switching", RFC 4950,
746	              DOI 10.17487/RFC4950, August 2007,
747	              <http://www.rfc-editor.org/info/rfc4950>.

749	Authors' Addresses

751	   Erik Nordmark
752	   Arista Networks
753	   Santa Clara, CA
754	   USA

756	   Email: nordmark@arista.com

758	   Chandra Appanna
759	   Arista Networks
760	   Santa Clara, CA
761	   USA

763	   Email: achandra@arista.com

765	   Alton Lo
766	   Arista Networks
767	   Santa Clara, CA
768	   USA

770	   Email: altonlo@arista.com