NVO3 Working Group                                            Y. Rekhter
Internet Draft                                          Juniper Networks
Intended status: Standards track                               L. Dunbar
Expires: April 2015                                               Huawei
                                                             R. Aggarwal
                                                              Arktan Inc
                                                              R. Shekhar
                                                        Juniper Networks
                                                           W. Henderickx
                                                          Alcatel-Lucent
                                                                 L. Fang
                                                               Microsoft
                                                              A. Sajassi
                                                                   Cisco

                                                        October 24, 2014

            Overlay Network Tenant System Address Migration
              draft-merged-nvo3-ts-address-migration-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 24, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   This document describes schemes to overcome the network-related
   issues in achieving seamless Virtual Machine mobility in data
   centers.

Table of Contents

   1. Introduction
   2. Conventions used in this document
   3. Terminology
   4. Scheme to resolve VLAN-IDs usage in L2 access domains
   5. Layer 2 Extension
      5.1. Layer 2 Extension Problem
      5.2. NVA based Layer 2 Extension Solution
   6. Optimal IP Routing
      6.1. Preserving Policies
      6.2. TS Default Gateway solutions
         6.2.1. Solution with Anycast for TS Default Gateways
         6.2.2. Distributed Proxy Default Gateway Solution
      6.3. Triangular Routing
   7. L3 Address Migration
   8. Managing duplicated addresses
   9. Manageability Considerations
   10. Security Considerations
   11. IANA Considerations
   12. Acknowledgements
   13. References
      13.1. Normative References
      13.2. Informative References

1. Introduction

   An important feature of data centers identified in [nvo3-problem] is
   the support of Virtual Machine (referred to in this document as
   Tenant System, or TS) mobility within a data center and between data
   centers. This document describes schemes to overcome the network-
   related issues in achieving seamless Virtual Machine mobility within
   and between data centers, where seamless mobility is defined as the
   ability to move a TS from one server in a data center to another
   server in the same or a different data center, while retaining the
   IP and MAC addresses of the TS.
   In the context of this document, the term mobility or a reference
   to moving a TS should be taken to imply seamless mobility, unless
   otherwise stated.

   Note that in the scenario where a TS is moved between servers
   located in different data centers, there are certain issues related
   to the current state of the art of Virtual Machine technology, the
   bandwidth that may be available between the data centers, the
   distance between the data centers, the ability to manage and
   operate such TS mobility, storage-related issues (the moved TS has
   to have access to the same virtual disk), etc. Discussion of these
   issues is outside the scope of this document.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC-2119
   [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS. Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.

   DC: Data Center

   DCBR: Data Center Border Router

   LAG: Link Aggregation Group

   POD: Modular Performance Optimized Data Center. POD and Data Center
   are used interchangeably in this document.

   ToR: Top of Rack switch

   TS: Tenant System (used interchangeably with VM on servers
   supporting Virtual Machines)

   VEPA: Virtual Ethernet Port Aggregator (IEEE802.1Qbg)

   VN: Virtual Network

3. Terminology

   In this document "Mobility" refers to "address migration", meaning
   that TSs move to different locations without changing their
   addresses (IP/MAC).

   In this document the term "Top of Rack Switch (ToR)" is used to
   refer to a switch in a data center that is connected to the servers
   that host TSs. A data center may have multiple ToRs. Some servers
   may have embedded blade switches, some servers may have virtual
   switches to interconnect the TSs, and some servers may not have any
   embedded switches. When External Bridge Port Extenders (as defined
   by 802.1BR) are used to connect the servers to the data center
   network, the ToR switch is the Controlling Bridge.

   Several data centers or PODs could be connected by a network. In
   addition to providing interconnect among the data centers/PODs,
   such a network could provide connectivity between the TSs hosted in
   these data centers and the sites that contain hosts communicating
   with such TSs. Each data center has one or more Data Center Border
   Routers (DCBRs) that connect the data center to the network, and
   provide (a) connectivity between TSs hosted in the data center and
   TSs hosted in other data centers, and (b) connectivity between TSs
   hosted in the data center and hosts communicating with these TSs.
   The following figure illustrates the above:

                   __________
                  (          )
                 ( Data Center )
                 ( Interconnect )-------------------
                 (  Network  )                     |
                  (__________)                     |
                    |       |                      |
                    |       |                      |
   -----------------+-------+-------------------   ---------------
   |                |       |      Data        |   |             |
   |             ------   ------  Center       |   | Data Center |
   |            | DCBR | | DCBR |  /POD        |   |    /POD     |
   |             ------   ------               |   ---------------
   |                |       |                  |
   |               ---     ---                 |
   |            ____|_______|____              |
   |          (                   )            |
   |          (    Data Center    )            |
   |          (      Network      )            |
   |          (___________________)            |
   |             |            |                |
   |             |            |                |
   |       ------------    -----               |
   |      | ToR Switch |  | ToR |              |
   |       ------------    -----               |
   |          |               |                |
   |          |  ----------   |  ----------    |
   |          |--| Server |   |--| Server |    |
   |          |  | vSwitch|   |  ----------    |
   |          |  |  ----  |   |                |
   |          |  | | TS | |   |  ----------    |
   |          |  |  ----  |   ---| Server |    |
   |          |  | | TS | |      ----------    |
   |          |  |  ----  |                    |
   |          |  | | TS | |                    |
   |          |  |  ----  |                    |
   |          |  ----------                    |
   |          |  ----------                    |
   |          |--| Server |                    |
   |          |  ----------                    |
   |          |  ----------                    |
   |          ---| Server |                    |
   |             ----------                    |
   ---------------------------------------------

            Figure 1: A Typical Data Center Network

   The data centers/PODs and the network that interconnects them may
   be either (a) under the same administrative control, or (b)
   controlled by different administrations.

   Consider a set of TSs that (as a matter of policy) are allowed to
   communicate with each other, and a collection of devices that
   interconnect these TSs. If communication among any TSs in that set
   can be accomplished in such a way as to preserve the MAC source and
   destination addresses in the Ethernet header of the packets
   exchanged among these TSs (as these packets traverse from their
   sources to their destinations), we will refer to such a set of TSs
   as a Layer 2 based Virtual Network (VN) or Closed User Group (L2-
   based CUG). In this document, the terms Closed User Group and
   Virtual Network (VN) are used interchangeably.

   A given TS may be a member of more than one VN or L2-based VN.

   In terms of IP address assignment, this document assumes that all
   TSs of a given L2-based VN have their IP addresses assigned out of
   a single IP prefix. Thus, in the context of this document a single
   IP subnet corresponds to a single L2-based VN. If a given TS is a
   member of more than one L2-based VN, this TS would have multiple IP
   addresses and multiple logical interfaces, one IP address and one
   logical interface for each such VN.

   A TS that is a member of a given L2-based VN may (as a matter of
   policy) be allowed to communicate with TSs that belong to other L2-
   based VNs, or with other hosts. Such communication involves IP
   forwarding, and thus would result in changing the MAC source and
   destination addresses in the Ethernet header of the packets being
   exchanged.

   In this document the term "L2 physical attachment" refers to a
   collection of interconnected devices attached to an NVE that
   perform forwarding based on the information carried in the Ethernet
   header. A trivial L2 physical attachment consists of just one non-
   virtualized server. In a non-trivial L2 physical attachment (a
   domain that contains multiple forwarding entities), forwarding
   could be provided by such layer 2 technologies as Spanning Tree
   Protocol (STP), VEPA (IEEE802.1Qbg), etc. Note that a multi-chassis
   LAG cannot span more than one L2 physical attachment.
   This document assumes that a layer 2 access domain is an L2
   physical attachment.

   A physical server connected to a given L2 physical attachment may
   host TSs that belong to different L2-based VNs (while each of these
   VNs may span multiple L2 physical attachments). If an L2 physical
   attachment contains servers that host TSs belonging to different
   L2-based VNs, then enforcing L2-based VN boundaries among these TSs
   within that domain is accomplished by relying on Layer 2 mechanisms
   (e.g., VLANs).

   We say that an L2 physical attachment contains a given TS (or that
   a given TS is in a given L2 physical attachment) if the server
   presently hosting this TS is part of that domain, or the server is
   connected to a ToR that is part of that domain.

   We say that a given L2-based VN is present within a given data
   center if one or more TSs that are part of that VN are presently
   hosted by the servers located in that data center.

   In the context of this document, when we talk about the VLAN-ID
   used by a given TS, we refer to the VLAN-ID carried by the traffic
   that is within the same L2 physical attachment as the TS, and that
   is either originated by or destined to that TS - i.e., a VLAN-ID
   has only local significance within the L2 physical attachment,
   unless stated otherwise.

   Some of the TS-mobility solutions described in this document are
   E-VPN based. When E-VPN is used in an NVO3 environment, the NVE
   function is on the PE node. The term NVE-PE is used to describe an
   E-VPN PE node that supports the NVE function.

4. Scheme to resolve VLAN-IDs usage in L2 access domains

   This document assumes that within a given non-trivial L2 physical
   attachment, traffic from/to TSs belonging to different L2-based VNs
   MUST have different VLAN-IDs.

   To support tens of thousands of virtual networks, the local VLAN-ID
   associated with the client payload under each NVE has to be locally
   significant. Therefore, the same L2-based VN MAY have either the
   same or different VLAN-IDs under different NVEs. Thus, when a given
   TS moves from one non-trivial L2 physical attachment to another,
   the VLAN-ID of the traffic from/to the TS in the former may be
   different from that in the latter, and thus cannot be assumed to
   stay the same.

   To describe the solution more clearly, the following terminology is
   used:

   - Customer administered VLAN-IDs (usually hard coded in a TS's
     Guest OS; they cannot be changed when the TS moves from one NVE
     to another, and some TSs may not have any VLAN-ID attached),
   - Provider administered VLAN-IDs of local significance, and
   - Provider administered VN-IDs of global significance.

   In the scenario where there are provider administered VLAN-IDs of
   local significance (e.g., when the NVE is in a ToR), the value is
   selected by the NVA from the pool of unused VIDs when the first
   local TS of a VN is added, and returned by the NVA to the unused
   pool of VLAN-IDs when the last TS leaves. For TSs with hard coded
   VLAN-IDs, it is necessary for an entity, most likely the first
   switch (virtual or physical) to which the TS is attached, to change
   the locally administered VLAN-IDs to the TSs' hard coded VLAN-IDs.
   For untagged TSs, the first switch has to remove the locally
   administered VLAN-IDs before sending packets to the TSs.

   This section describes how:

   . the NVA manages the pool of unused VLAN-IDs in each L2 access
     domain;
   . an NVE reports to the NVA when the first local TS of a VN becomes
     reachable, or when no TS of a VN is reachable via the NVE any
     longer;
   . the NVA can push the global VN ID <-> locally administered VID
     mapping to an NVE, or the NVE can pull it upon detecting a newly
     attached VN; and
   . the NVA instructs the first switch to which a TS is attached on
     the mapping between the TS's own VLAN-ID and the locally
     administered VID.

   Here is the detailed procedure:

   . The NVE should get the specific VNID from the NVA for untagged
     data frames arriving at each Virtual Access Point [nvo3-
     framework, Section 3.1.1] of the NVE.

     Since local VLAN-IDs under each NVE are locally significant,
     there are two possible ways for an ingress NVE to assign the
     VLAN-ID in the overlay header for data frames destined to other
     NVEs:

     a) Carry what comes in at the ingress Virtual Access Point.
        Preserving the VLAN-ID can be used to provide bundled
        service/PVLAN. In this case many VLAN-IDs at the ingress could
        map to one logical VN (n-to-1 mapping).

     b) Not carry any VLAN-ID and use the logical VN identifier only.
        The egress NVE gets from the NVA the VLAN-ID to put on the
        packet before sending it to the attached TSs. This is a 1-to-1
        mapping between VLAN-ID and logical VN.

   . If data frames are to be tagged before reaching the NVE's Virtual
     Access Point, the NVA should inform the first switch port (the
     one responsible for adding VLAN-IDs to untagged data frames) of
     the specific VLAN-ID to be inserted into the data frames.

   . If data frames from a TS are already tagged, the first port
     facing the TS has to be informed by the NVA of the new local
     VLAN-ID that replaces the VLAN-ID encoded in the data frames.

     For data frames coming from the network side towards TSs (i.e.,
     inbound traffic towards TSs), the first switching port facing the
     TSs has to convert the VLAN-IDs encoded in the data frames to the
     VLAN-IDs used by the TSs.
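   The VID allocation steps above can be summarized with the following
   non-normative sketch (Python). The class and method names are
   hypothetical and only mirror the behavior described in this
   section: a locally significant VID is drawn from the per-domain
   pool when the first TS of a VN appears under an NVE, and is
   returned to the pool when the last TS leaves.

   <CODE BEGINS>
   # Non-normative sketch of NVA-side VLAN-ID pool management.
   # All names are illustrative; no NVA implementation is implied.

   class NvaVidManager:
       """Locally significant VLAN-IDs per NVE (L2 access domain)."""

       def __init__(self):
           self.pools = {}     # nve -> set of unused local VLAN-IDs
           self.vid_map = {}   # (nve, global VN-ID) -> local VLAN-ID
           self.ts_count = {}  # (nve, global VN-ID) -> attached TSs

       def ts_added(self, nve, vnid):
           """First TS of a VN under an NVE draws a VID from the pool."""
           key = (nve, vnid)
           if key not in self.vid_map:
               pool = self.pools.setdefault(nve, set(range(2, 4095)))
               self.vid_map[key] = pool.pop()
               self.ts_count[key] = 0
               # Here the NVA would push the VN-ID <-> VID mapping to
               # the NVE (or the NVE would pull it on demand).
           self.ts_count[key] += 1
           return self.vid_map[key]

       def ts_removed(self, nve, vnid):
           """When the last TS leaves, the VID returns to the pool."""
           key = (nve, vnid)
           self.ts_count[key] -= 1
           if self.ts_count[key] == 0:
               self.pools[nve].add(self.vid_map.pop(key))
               del self.ts_count[key]
   <CODE ENDS>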
5. Layer 2 Extension

5.1. Layer 2 Extension Problem

   Consider a scenario where a TS that is a member of a given L2-based
   VN moves from one server to another, and these two servers are in
   different L2 physical attachments, where these domains may be
   located in the same or different data centers (or PODs). In order
   to enable communication between this TS and other TSs of that L2-
   based VN, the new L2 physical attachment must become interconnected
   with the other L2 physical attachment(s) that presently contain the
   rest of the TSs of that VN, and the interconnect must not violate
   the L2-based VN requirement to preserve source and destination MAC
   addresses in the Ethernet header of the packets exchanged between
   this TS and other members of that VN.

   Moreover, if the previous L2 physical attachment no longer contains
   any TSs of that VN, the previous domain no longer needs to be
   interconnected with the other L2 physical attachment(s) that
   contain the rest of the TSs of that VN.

   Note that supporting TS mobility implies that the set of L2
   physical attachments that contain TSs belonging to a given L2-based
   VN may change over time (new domains added, old domains deleted).

   We will refer to this as the "layer 2 extension problem".

   Note that the layer 2 extension problem is a special case of
   maintaining connectivity in the presence of TS mobility, as the
   former restricts communicating TSs to a single/common L2-based VN,
   while the latter does not.

5.2. NVA based Layer 2 Extension Solution

   Assume NVO3's NVA has at least the following information for each
   TS:

   . Inner Address: the TS (host) address family (IPv4/IPv6, MAC,
     virtual network identifier MPLS/VLAN, etc.);

   . Outer Address: the list of locally attached edges (NVEs).
     Normally one TS is attached to one edge; a TS could also be
     attached to 2 edges for redundancy (dual homing). One TS is
     rarely attached to more than 2 edges, though it is possible;

   . VN Context (VN ID and/or VN Name);

   . a timer for how long NVEs keep the entry after it is pushed down
     to, or pulled by, an NVE; and

   . optionally, the list of interested remote edges (NVEs). This
     information allows the NVA to promptly update the relevant edges
     (NVEs) when there is any change to this TS's attachment to edges
     (NVEs). However, this information doesn't have to be kept per TS;
     it can be kept per VN.

   The NVA can offer its services in a Push mode, a Pull mode, or a
   combination of the two.

   In this solution, the NVEs are connected via the underlay IP
   network. For each VN, the NVA informs all the NVEs to which the TSs
   of the given VN are attached.

   When the last TS of a VN is moved away from an NVE, the NVE can
   either confirm with the NVA, or the NVA notifies the NVE, that it
   should remove its connectivity to the VN. When an NVE needs to
   support connectivity to a VN not currently supported (as a result
   of TS turn-up or TS migration), the NVA will push the necessary VN
   information to the NVE.

   The term "NVE being connected to a VN" means that the NVE at least
   has:

   . the inner-outer address mapping information for all the TSs in
     the VN, or the ability to pull the mapping from the NVA;

   . the mapping of the local VLAN-ID to the VNID used by the overlay
     header; and

   . the VN's default gateway IP/MAC address.
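   The per-TS state and the Push mode described above can be
   illustrated with the following non-normative sketch (Python). The
   field and function names are hypothetical assumptions, and the
   NVA-to-NVE control channel is stubbed out.

   <CODE BEGINS>
   # Non-normative sketch of NVA state and a push-mode update when a
   # TS migrates.  All names are illustrative.

   from dataclasses import dataclass, field

   @dataclass
   class TsEntry:
       inner_addrs: list   # TS IP/MAC addresses
       outer_nves: list    # attached NVEs (1, or 2 when dual-homed)
       vn_context: str     # VN ID and/or VN Name
       ttl: int = 300      # seconds an NVE may cache this entry

   @dataclass
   class Nva:
       ts_table: dict = field(default_factory=dict)    # ts -> entry
       vn_members: dict = field(default_factory=dict)  # vn -> NVEs

       def ts_moved(self, ts, new_nve):
           """TS migrated: record the new NVE, update the VN's NVEs."""
           entry = self.ts_table[ts]
           entry.outer_nves = [new_nve]
           vn = entry.vn_context
           members = self.vn_members.setdefault(vn, set())
           members.add(new_nve)  # new NVE is now connected to the VN
           for nve in members:
               self.push_update(nve, ts, entry)

       def push_update(self, nve, ts, entry):
           # Stub for the NVA-to-NVE control channel.
           print(f"to {nve}: {ts} -> {entry.outer_nves}")
   <CODE ENDS>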
6. Optimal IP Routing

   In the context of this document optimal IP routing, or just optimal
   routing, in the presence of TS mobility can be partitioned into two
   problems:

   - Optimal routing of a TS's outbound traffic. This means that as a
     given TS moves from one server to another, the TS's default
     gateway should be in close topological proximity to the ToR that
     connects the server presently hosting that TS. Note that when we
     talk about optimal routing of the TS's outbound traffic, we mean
     traffic from that TS to destinations that are outside of the TS's
     L2-based VN. This document refers to this problem as the TS
     default gateway problem.

   - Optimal routing of a TS's inbound traffic. This means that as a
     given TS moves from one server to another, the (inbound) traffic
     originated outside of the TS's L2-based VN and destined to that
     TS should be routed via the router of the TS's L2-based VN that
     is in close topological proximity to the ToR that connects the
     server presently hosting that TS, without first traversing some
     other router of that L2-based VN (the router of the TS's L2-based
     VN may be either a DCBR or the ToR itself). This is also known as
     avoiding "triangular routing". This document refers to this
     problem as the triangular routing problem.

   In order to avoid triangular routing, routers in the Wide Area
   Network have to be aware of which DCBRs can reach the designated
   TSs. When the TSs in a single VN are spread across many different
   DCBRs, all the individual TSs' addresses have to be visible to
   those routers, which can dramatically increase the number of routes
   in those routers.

   If a VN is spread across multiple DCBRs and all those DCBRs
   announce the same IP prefix for the VN, there could be many issues,
   including:

   - Traffic could go to DCBR "A" while the target is behind DCBR "B",
     with DCBR "A" connected to DCBR "B" via the WAN.

   - If the majority of one VN's members are under DCBR "A" and the
     rest are spread across a number of other DCBRs, will DCBR "A"
     have the same weight as DCBRs "B", "C", etc.?

   If all those DCBRs announce the individual IP addresses that are
   directly attached, and those IP addresses are not segmented well,
   then all the TSs' IP addresses have to be exposed to the WAN. In
   that case the overlay hides the TSs' IP addresses from the core
   switches in one DC or one POD, but exposes them to the WAN. There
   are more routers in the WAN than there are core switches in one
   DC/POD.

   The ability to deliver optimal routing (as defined above) in the
   presence of stateful devices is outside the scope of this document.

6.1. Preserving Policies

   Moving a TS from one L2 physical attachment to another means (among
   other things) that the NVE in the new domain that provides
   connectivity between this TS and TSs in other L2 physical
   attachments must be able to implement the policies that control
   connectivity between this TS and TSs in other L2 physical
   attachments. In other words, the policies that control connectivity
   between a given TS and its peers MUST NOT change as the TS moves
   from one L2 physical attachment to another. Moreover, policies, if
   any, within the L2 physical attachment that contains a given TS
   MUST NOT preclude realization of the policies that control
   connectivity between this TS and its peers. All of the above is
   irrespective of whether the L2 physical attachments are trivial or
   not.

   There could be policies guarding TSs across different VNs, with
   some being enforced by firewalls and some enforced by NAT/Anti-
   DDoS/IPS/IDS, etc. The issue is less about NVE policies being
   maintained when TSs move; it is more about dynamically changing the
   policies associated with the "middle boxes" attached to NVEs (if
   those middle boxes are distributed).

6.2. TS Default Gateway solutions

   As a TS moves to a new L2 site, the default gateway IP address of
   the TS may not change. Further, while with cold TS mobility one may
   assume that the TS's ARP/ND cache gets flushed once the TS moves to
   another server, one cannot make such an assumption with hot TS
   mobility.

   Thus the destination MAC address in the inter-VN/inter-subnet
   traffic originated by that TS would not change as the TS moves to
   the new site. Given that, how would the NVE(s) connected to the new
   L2 site be able to recognize inter-VN/inter-subnet traffic
   originated by that TS? The following describes possible solutions.

6.2.1. Solution with Anycast for TS Default Gateways

   This solution relies on the use of an anycast default gateway IP
   address and an anycast default gateway MAC address.

   If DCBRs act as the default gateway for a given L2-based VN, then
   these anycast addresses are configured on these DCBRs. Likewise, if
   ToRs act as default gateways, then these anycast addresses are
   configured on these ToRs.
   All TSs of that L2-based VN are (auto) configured with the
   (anycast) IP address of the default gateway.

   DCBRs (or ToRs) acting as default gateways use these anycast
   addresses as follows:

   - When a particular NVE receives a packet from the local L2
     attachment with the (anycast) default gateway MAC address, the
     NVE applies IP forwarding to the packet, and performs the NVE
     function if the destination of the packet is attached to another
     NVE.

   - When a particular DCBR (or ToR) acting as a default gateway
     receives an ARP/ND Request from the local L2 attachment for the
     default gateway (anycast) IP address, the DCBR (or ToR) generates
     an ARP/ND Reply.

   This ensures that a particular DCBR (or ToR), acting as a default
   gateway, can always apply IP forwarding to the packets sent by a TS
   to the (anycast) default gateway MAC address. It also ensures that
   such a DCBR (or ToR) can respond to the ARP Request generated by a
   TS for the default gateway (anycast) IP address.

   Except for gratuitous ARP/ND, DCBRs (or ToRs) acting as default
   gateways must never use the anycast default gateway MAC address as
   the source MAC address in the packets they originate, and must not
   use the anycast default gateway IP address as the source IP address
   in the overlay header.

   Note that multiple L2-based VNs may share the same MAC address for
   use as the (anycast) MAC address of the default gateway for these
   VNs.

   If the default gateway functionality is not in the NVEs (ToRs),
   then the default gateway MAC/IP addresses need to be distributed to
   all NVEs.
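   The two forwarding rules above amount to a simple classification at
   the gateway, sketched below (Python, non-normative). The addresses
   and the frame structure are hypothetical examples chosen only for
   illustration.

   <CODE BEGINS>
   # Non-normative sketch of the data-path decision at a DCBR (or
   # ToR) acting as anycast default gateway.

   from dataclasses import dataclass

   ANYCAST_GW_MAC = "00:00:5e:00:01:01"  # shared by the gateways
   ANYCAST_GW_IP = "192.0.2.1"           # (anycast) gateway IP

   @dataclass
   class Frame:
       dst_mac: str
       arp_target_ip: str = ""           # set on ARP/ND Requests

   def classify(frame):
       if frame.arp_target_ip == ANYCAST_GW_IP:
           return "reply-with-anycast-gw-mac"  # answer ARP/ND itself
       if frame.dst_mac == ANYCAST_GW_MAC:
           return "ip-forward"                 # inter-subnet traffic
       return "l2-forward"                     # intra-VN, bridge as-is

   assert classify(Frame(dst_mac=ANYCAST_GW_MAC)) == "ip-forward"
   <CODE ENDS>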
6.2.2. Distributed Proxy Default Gateway Solution

   This solution does not require configuring the anycast default
   gateway IP and MAC addresses for TSs.

   In this solution, NVEs perform the function of the default gateway
   for all the attached TSs. Those NVEs are called "Proxy Default
   Gateways" in this document because they might not be the default
   gateways explicitly configured on the attached TSs. Some of those
   proxy default gateway NVEs might not have the complete inter-subnet
   communication policies for the attached VNs.

   In order to ensure that the destination MAC address in the inter-
   VN/inter-subnet traffic originated by a TS does not change as the
   TS moves to a different NVE, a pseudo MAC address is assigned to
   all NVE-based Proxy Default Gateways.

   When a particular NVE acting as a Proxy Default Gateway receives an
   ARP/ND Request from an attached TS for the TS's default gateway IP
   address, the NVE suppresses the ARP/ND Request from being forwarded
   and generates an ARP/ND Reply with the pseudo MAC address.

   When a particular NVE acting as a Proxy Default Gateway receives a
   packet with the pseudo default gateway MAC address:

   - if the NVE has all the needed policies for the source and
     destination VNs, the NVE applies IP forwarding, i.e., forwards
     the packet from the source VN to the destination VN, and applies
     the NVE encapsulation function with the target NVE as the
     destination address and the destination VN identifier in the
     header;

   - if the NVE doesn't have the needed policies from the source VN to
     the destination VN, the NVE applies the NVE encapsulation
     function with the real host's default gateway as the destination
     address and the source VN identifier in the header.

   This solution assumes that the NVE-based proxy default gateways get
   the mapping of a host's default gateway IP <-> default gateway MAC
   either from the corresponding NVA or via ARP/ND discovery.

6.3. Triangular Routing

   The triangular routing solution can be partitioned into two
   components: an intra data center triangular routing solution, and
   an inter data center triangular routing solution. The former
   handles the situation where the communicating TSs are in the same
   data center. The latter handles all other cases. This draft only
   describes the solution for intra data center triangular routing.

   To avoid triangular routing, each NVE needs to have the egress NVEs
   for the potential destinations of packets originated from the
   attached TSs.

   One approach is for each NVE to announce its directly attached TSs'
   addresses to all other NVEs that participate in the TSs' VNs.

   Another approach is for the NVA to distribute the VN-scoped TS
   Address <-> NVE mappings to all the NVEs. See Section 7 for the
   detailed mechanism.

7. L3 Address Migration

   When the attachment to the NVE is L3 based, TS migration can cause
   one subnetwork to be scattered among many NVEs, i.e., fragmented
   addresses.

   The outbound traffic of fragmented L3 addresses doesn't have the
   same issues as L2 address migration, but the inbound traffic has
   the same issues as L2 address migration (Section 6).

   Optimal routing of a TS's inbound traffic means that as a given TS
   moves from one server to another, the (inbound) traffic originated
   outside of the TS's directly attached NVE, and destined to that TS,
   should be routed optimally to the NVE to which the server presently
   hosting that TS is attached, without first traversing some other
   NVEs. This is also known as avoiding "triangular routing".

   In theory, host routing by every NVE (including the NVEs attached
   to DCBRs) can achieve optimal inbound forwarding in a very
   fragmented network. When the TSs' IP addresses under all the NVEs
   can't be aggregated at all, an NVE needs to support host routes for
   the combined number of TSs of all the VNs enabled on the NVE. The
   following arithmetic shows that host routing on a server based NVE
   or a ToR based NVE can be supported relatively easily even under
   the worst case scenario:

   . Suppose an NVE has TSs belonging to X VNs, and suppose each VN
     has 200 hosts (spread among many NVEs); then the worst case
     scenario (i.e., the maximum number of routes that the NVE needs
     to hold) is 200*X.

   . For a server based NVE, the number of VNs enabled on the NVE has
     to be less than the number of VMs instantiated on the server. The
     industry state of the art virtualization technology allows a
     maximum of 100 VMs on one server. So the worst case scenario (the
     maximum number of routes that the NVE needs to hold) is 100*200 =
     20,000.

   . For a ToR based NVE, the number of TSs can be the number of TSs
     per server times the number of servers attached to the ToR (a
     typical ToR has 40 to 48 downstream ports to servers). Assuming
     40 servers, the worst case scenario is 40*100*200 = 800,000.
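   The worst case arithmetic above is restated in the short
   calculation below (Python, non-normative); the inputs (200 hosts
   per VN, 100 VMs per server, 40 servers per ToR) are this section's
   illustrative assumptions, not limits.

   <CODE BEGINS>
   # Worst-case host-route counts, restating the arithmetic above.

   HOSTS_PER_VN = 200

   def max_routes(num_vns):
       # Each VN can contribute up to HOSTS_PER_VN host routes.
       return num_vns * HOSTS_PER_VN

   server_nve = max_routes(100)     # 100-VM server -> 20,000 routes
   tor_nve = max_routes(40 * 100)   # 40 servers x 100 VMs -> 800,000
   print(server_nve, tor_nve)
   <CODE ENDS>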
   But host routing can be challenging on the NVEs attached to Data
   Center gateways. Those NVEs usually need to support all the VNs
   enabled in the data center. There could be hundreds of thousands of
   hosts/VMs, sometimes millions, due to business demand and highly
   advanced server virtualization technologies.

   For those data centers with millions of TSs, the following approach
   should be considered:

   . Some NVEs (e.g., ToR/server based NVEs) support host routes, and

   . some NVEs (e.g., the NVEs attached to data center gateways) that
     participate in a large number of VNs (if not all VNs) do not
     support host routes. Those NVEs are called "non-host-route" NVEs
     in this draft.

   Those non-host-route NVEs have one or two egress NVEs as the
   designated forwarders for a VN (subnet), even if the VN (subnet) is
   spread across many NVEs. For example, if a high percentage of the
   TSs of one subnet is attached to NVE "X" and the remaining small
   percentage of the subnet is spread around many NVEs, the non-host-
   route NVEs can have NVE "X" as the designated egress for the VN. By
   doing so, the "triangular routing" for the traffic destined to TSs
   in this VN (subnet) can be greatly reduced.

   To avoid loops, the designated NVEs must support host routes.

   It is worth noting that the NVEs that have host routes send traffic
   directly to the egress NVEs, because they have the detailed
   information. Only the NVEs that don't have host routes for a VN
   (most likely the NVEs attached to the gateways) send traffic to the
   VN's (subnet's) designated NVEs. The NVEs that prefer not to have
   host routes need to notify the NVA that they only want designated
   NVEs, or this can be configured in the NVA.

   ECMP is another approach that can be used by those non-host-route
   NVEs when VNs are spread across many NVEs. The ECMP approach
   basically assigns all the NVEs that have TSs of a VN attached as
   the "designated egress NVEs" for the VN. Again, to avoid loops,
   those designated egress NVEs have to support host routes. The ECMP
   approach may cause most packets from those non-host-route NVEs (if
   not all) to traverse two NVEs before reaching the packets'
   destinations.

8. Managing duplicated addresses

   This document assumes that during VM migration a given MAC address
   within a VN can exist at only one TS at a time. As TSs move around
   NVEs, it is possible that the network state may not be immediately
   synchronized. It is important for NVEs to report their directly
   attached TSs to the NVA on a periodic basis, so that the NVA can
   generate alarms and fix duplicated address issues.

9. Manageability Considerations

   Several solutions described in this document depend on the presence
   of an NVA in the data center.

10. Security Considerations

   In addition to the security considerations described in [nvo3-
   problem], it is clear that allowing TSs to migrate across data
   centers will require more stringent security enforcement. The
   traditional placement of security functions, e.g., firewalls, at
   data center gateways is no longer enough. TS mobility will require
   security functions to enforce policies on east-west traffic among
   TSs.

   When TSs move across data centers, the associated policies have to
   be updated and enforced.

11. IANA Considerations

   This document requires no IANA actions. RFC Editor: Please remove
   this section before publication.
12. Acknowledgements

   The authors would like to thank Adrian Farrel, David Black, Dave
   Allen, Tom Herbert and Larry Kreeger for their review and comments.
   The authors would also like to thank Ivan Pepelnjak for his
   contributions to this document.

13. References

13.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

13.2. Informative References

   [nvo3-problem] Narten, T., et al., "Problem Statement: Overlays for
             Network Virtualization", draft-ietf-nvo3-overlay-problem-
             statement-04, July 2013.

   [nvo3-framework] Lasserre, M., et al., "Framework for Data Center
             (DC) Network Virtualization", draft-ietf-nvo3-framework,
             work in progress.

   [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
             Networks (VPNs)", RFC 4364, February 2006.

   [RFC4684] Marques, P., et al., "Constrained Route Distribution for
             Border Gateway Protocol/MultiProtocol Label Switching
             (BGP/MPLS) Internet Protocol (IP) Virtual Private
             Networks (VPNs)", RFC 4684, November 2006.

   [E-VPN]   Aggarwal, R., et al., "BGP MPLS Based Ethernet VPN",
             draft-ietf-l2vpn-evpn, work in progress.

   [Default-Gateway] "BGP Extended Communities", IANA registry,
             http://www.iana.org/assignments/bgp-extended-communities

   [DC-mobility] Aggarwal, R., et al., "Data Center Mobility based on
             E-VPN, BGP/MPLS IP VPN, IP Routing and NHRP", draft-
             raggarwa-data-center-mobility-07, June 2014.

Authors' Addresses

   Yakov Rekhter
   Juniper Networks
   1194 North Mathilda Ave.
   Sunnyvale, CA 94089
   Email: yakov@juniper.net

   Linda Dunbar
   Huawei Technologies
   5340 Legacy Drive, Suite 175
   Plano, TX 75024, USA
   Email: ldunbar@huawei.com

   Rahul Aggarwal
   Arktan, Inc
   Email: raggarwa_1@yahoo.com

   Wim Henderickx
   Alcatel-Lucent
   Email: wim.henderickx@alcatel-lucent.com

   Ravi Shekhar
   Juniper Networks
   1194 North Mathilda Ave.
   Sunnyvale, CA 94089
   Email: rshekhar@juniper.net

   Luyuan Fang
   Microsoft
   Email: lufang@microsoft.com

   Ali Sajassi
   Cisco Systems
   Email: sajassi@cisco.com