NVO3 Working Group                                            Y. Rekhter
Internet Draft                                          Juniper Networks
Intended status: Standards track                               L. Dunbar
Expires: April 2015                                               Huawei
                                                             R. Aggarwal
                                                              Arktan Inc
                                                              R. Shekhar
                                                        Juniper Networks
                                                           W. Henderickx
                                                          Alcatel-Lucent
                                                                 L. Fang
                                                               Microsoft
                                                              A. Sajassi
                                                                   Cisco

                                                        October 24, 2014

            Overlay Network Tenant System Address Migration
              draft-merged-nvo3-ts-address-migration-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts.
   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 24, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document. Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Abstract

   This document describes schemes to overcome the network-related
   issues in achieving seamless Virtual Machine mobility in data
   centers.

Table of Contents

   1. Introduction
   2. Conventions used in this document
   3. Terminology
   4. Scheme to resolve VLAN-IDs usage in L2 access domains
   5. Layer 2 Extension
      5.1. Layer 2 Extension Problem
      5.2. NVA based Layer 2 Extension Solution
   6. Optimal IP Routing
      6.1. Preserving Policies
      6.2. TS Default Gateway solutions
         6.2.1. Solution with Anycast for TS Default Gateways
         6.2.2. Distributed Proxy Default Gateway Solution
      6.3. Triangular Routing
   7. L3 Address Migration
   8. Managing duplicated addresses
   9. Manageability Considerations
   10. Security Considerations
   11. IANA Considerations
   12. Acknowledgements
   13. References
      13.1. Normative References
      13.2. Informative References

1. Introduction

   An important feature of data centers identified in [nvo3-problem] is
   the support of Virtual Machine (referred to in this document as
   Tenant System, or TS) mobility within a data center and between data
   centers. This document describes schemes to overcome the network-
   related issues in achieving seamless Virtual Machine mobility within
   and between data centers, where seamless mobility is defined as the
   ability to move a TS from one server in a data center to another
   server in the same or a different data center, while retaining the
   IP and MAC addresses of the TS.
   In the context of this document, the term mobility or a reference
   to moving a TS should be taken to imply seamless mobility, unless
   otherwise stated.

   Note that in the scenario where a TS is moved between servers
   located in different data centers, there are certain issues related
   to the current state of the art of Virtual Machine technology, the
   bandwidth that may be available between the data centers, the
   distance between the data centers, the ability to manage and
   operate such TS mobility, storage-related issues (the moved TS has
   to have access to the same virtual disk), etc. Discussion of these
   issues is outside the scope of this document.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
   this document are to be interpreted as described in RFC-2119
   [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS. Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.

   DC: Data Center

   DCBR: Data Center Border Router

   LAG: Link Aggregation Group

   POD: Modular Performance Optimized Data Center. POD and Data Center
   are used interchangeably in this document.

   ToR: Top of Rack switch

   TS: Tenant System (used interchangeably with VM on servers
   supporting Virtual Machines)

   VEPA: Virtual Ethernet Port Aggregator (IEEE802.1Qbg)

   VN: Virtual Network

3. Terminology

   In this document "Mobility" refers to "address migration", meaning
   that TSs move to different locations without changing their
   addresses (IP/MAC).

   In this document the term "Top of Rack Switch (ToR)" is used to
   refer to a switch in a data center that is connected to the servers
   that host TSs. A data center may have multiple ToRs. Some servers
   may have embedded blade switches, some servers may have virtual
   switches to interconnect the TSs, and some servers may not have any
   embedded switches. When External Bridge Port Extenders (as defined
   by 802.1BR) are used to connect the servers to the data center
   network, the ToR switch is the Controlling Bridge.

   Several data centers or PODs could be connected by a network. In
   addition to providing interconnect among the data centers/PODs,
   such a network could provide connectivity between the TSs hosted in
   these data centers and the sites that contain hosts communicating
   with such TSs. Each data center has one or more Data Center Border
   Routers (DCBRs) that connect the data center to the network, and
   provide (a) connectivity between TSs hosted in the data center and
   TSs hosted in other data centers, and (b) connectivity between TSs
   hosted in the data center and hosts communicating with these TSs.
   The following figure illustrates the above:

                   __________
                  (          )
                 ( Data Center )
                 ( Interconnect )-------------------
                 (  Network  )                     |
                  (__________)                     |
                    |       |                      |
                    |       |                      |
   -----------------+-------+-------------------   ---------------
   |                |       |      Data        |   |             |
   |             ------   ------  Center       |   | Data Center |
   |            | DCBR | | DCBR |  /POD        |   |    /POD     |
   |             ------   ------               |   ---------------
   |                |       |                  |
   |               ---     ---                 |
   |            ____|_______|____              |
   |          (                   )            |
   |          (    Data Center    )            |
   |          (      Network      )            |
   |          (___________________)            |
   |             |            |                |
   |             |            |                |
   |       ------------    -----               |
   |      | ToR Switch |  | ToR |              |
   |       ------------    -----               |
   |          |               |                |
   |          |  ----------   |  ----------    |
   |          |--| Server |   |--| Server |    |
   |          |  | vSwitch|   |  ----------    |
   |          |  |  ----  |   |                |
   |          |  | | TS | |   |  ----------    |
   |          |  |  ----  |   ---| Server |    |
   |          |  | | TS | |      ----------    |
   |          |  |  ----  |                    |
   |          |  | | TS | |                    |
   |          |  |  ----  |                    |
   |          |  ----------                    |
   |          |  ----------                    |
   |          |--| Server |                    |
   |          |  ----------                    |
   |          |  ----------                    |
   |          ---| Server |                    |
   |             ----------                    |
   ---------------------------------------------

            Figure 1: A Typical Data Center Network

   The data centers/PODs and the network that interconnects them may
   be either (a) under the same administrative control, or (b)
   controlled by different administrations.

   Consider a set of TSs that (as a matter of policy) are allowed to
   communicate with each other, and a collection of devices that
   interconnect these TSs. If communication among any TSs in that set
   can be accomplished in such a way as to preserve the MAC source and
   destination addresses in the Ethernet header of the packets
   exchanged among these TSs (as these packets traverse from their
   sources to their destinations), we will refer to such a set of TSs
   as a Layer 2 based Virtual Network (VN) or Closed User Group (L2-
   based CUG). In this document, the terms Closed User Group and
   Virtual Network (VN) are used interchangeably.

   A given TS may be a member of more than one VN or L2-based VN.

   In terms of IP address assignment, this document assumes that all
   TSs of a given L2-based VN have their IP addresses assigned out of
   a single IP prefix. Thus, in the context of this document a single
   IP subnet corresponds to a single L2-based VN. If a given TS is a
   member of more than one L2-based VN, this TS would have multiple IP
   addresses and multiple logical interfaces, one IP address and one
   logical interface for each such VN.

   A TS that is a member of a given L2-based VN may (as a matter of
   policy) be allowed to communicate with TSs that belong to other L2-
   based VNs, or with other hosts. Such communication involves IP
   forwarding, and thus would result in changing the MAC source and
   destination addresses in the Ethernet header of the packets being
   exchanged.

   In this document the term "L2 physical attachment" refers to a
   collection of interconnected devices attached to an NVE that
   perform forwarding based on the information carried in the Ethernet
   header. A trivial L2 physical attachment consists of just one non-
   virtualized server. In a non-trivial L2 physical attachment (a
   domain that contains multiple forwarding entities), forwarding
   could be provided by such layer 2 technologies as Spanning Tree
   Protocol (STP), VEPA (IEEE802.1Qbg), etc. Note that a multi-chassis
   LAG cannot span more than one L2 physical attachment.
   This document assumes that a layer 2 access domain is an L2
   physical attachment.

   A physical server connected to a given L2 physical attachment may
   host TSs that belong to different L2-based VNs (while each of these
   VNs may span multiple L2 physical attachments). If an L2 physical
   attachment contains servers that host TSs belonging to different
   L2-based VNs, then enforcing L2-based VN boundaries among these TSs
   within that domain is accomplished by relying on Layer 2 mechanisms
   (e.g., VLANs).

   We say that an L2 physical attachment contains a given TS (or that
   a given TS is in a given L2 physical attachment) if the server
   presently hosting this TS is part of that domain, or the server is
   connected to a ToR that is part of that domain.

   We say that a given L2-based VN is present within a given data
   center if one or more TSs that are part of that VN are presently
   hosted by the servers located in that data center.

   In the context of this document, when we talk about the VLAN-ID
   used by a given TS, we refer to the VLAN-ID carried by the traffic
   that is within the same L2 physical attachment as the TS, and that
   is either originated by or destined to that TS - i.e., a VLAN-ID
   has only local significance within the L2 physical attachment,
   unless stated otherwise.

   Some of the TS-mobility solutions described in this document are
   E-VPN based. When E-VPN is used in an NVO3 environment, the NVE
   function is on the PE node. The term NVE-PE is used to describe an
   E-VPN PE node that supports the NVE function.

4. Scheme to resolve VLAN-IDs usage in L2 access domains

   This document assumes that within a given non-trivial L2 physical
   attachment, traffic from/to TSs belonging to different L2-based VNs
   MUST have different VLAN-IDs.

   To support tens of thousands of virtual networks, the local VLAN-ID
   associated with the client payload under each NVE has to be locally
   significant. Therefore, the same L2-based VN MAY have either the
   same or different VLAN-IDs under different NVEs. Thus, when a given
   TS moves from one non-trivial L2 physical attachment to another,
   the VLAN-ID of the traffic from/to the TS in the former may be
   different from that in the latter, and thus cannot be assumed to
   stay the same.

   To describe the solution more clearly, the following terminology is
   used:

   - Customer administered VLAN-IDs (usually hard coded in a TS's
     Guest OS; they cannot be changed when the TS moves from one NVE
     to another, and some TSs may not have any VLAN-ID attached),
   - Provider administered VLAN-IDs of local significance, and
   - Provider administered VN-IDs of global significance.

   In the scenario where there are provider administered VLAN-IDs of
   local significance (e.g., when the NVE is in a ToR), the value is
   selected by the NVA from the pool of unused VIDs when the first
   local TS of a VN is added, and returned by the NVA to the unused
   pool of VLAN-IDs when the last TS leaves. For TSs with hard coded
   VLAN-IDs, it is necessary for an entity, most likely the first
   switch (virtual or physical) to which the TS is attached, to change
   the locally administered VLAN-IDs to the TSs' hard coded VLAN-IDs.
   For untagged TSs, the first switch has to remove the locally
   administered VLAN-IDs before sending packets to the TSs.

   This section describes how:

   . the NVA manages the pool of unused VLAN-IDs in each L2 access
     domain;
   . an NVE reports to the NVA when the first local TS of a VN becomes
     reachable, or when no TS of a VN is reachable via the NVE any
     longer;
   . the NVA can push the global VN ID <-> locally administered VID
     mapping to an NVE, or the NVE can pull it upon detecting a newly
     attached VN; and
   . the NVA instructs the first switch to which a TS is attached on
     the mapping between the TS's own VLAN-ID and the locally
     administered VID.

   Here is the detailed procedure:

   . The NVE should get the specific VNID from the NVA for untagged
     data frames arriving at each Virtual Access Point [nvo3-
     framework, Section 3.1.1] of the NVE.

     Since local VLAN-IDs under each NVE are locally significant,
     there are two possible ways for an ingress NVE to assign the
     VLAN-ID in the overlay header for data frames destined to other
     NVEs:

     a) Carry what comes in at the ingress Virtual Access Point.
        Preserving the VLAN-ID can be used to provide bundled
        service/PVLAN. In this case many VLAN-IDs at the ingress could
        map to one logical VN (n-to-1 mapping).

     b) Not carry any VLAN-ID and use the logical VN identifier only.
        The egress NVE gets from the NVA the VLAN-ID to put on the
        packet before sending it to the attached TSs. This is a 1-to-1
        mapping between VLAN-ID and logical VN.

   . If data frames are to be tagged before reaching the NVE's Virtual
     Access Point, the NVA should inform the first switch port (the
     one responsible for adding VLAN-IDs to untagged data frames) of
     the specific VLAN-ID to be inserted into the data frames.

   . If data frames from a TS are already tagged, the first port
     facing the TS has to be informed by the NVA of the new local
     VLAN-ID that replaces the VLAN-ID encoded in the data frames.

     For data frames coming from the network side towards TSs (i.e.,
     inbound traffic towards TSs), the first switching port facing the
     TSs has to convert the VLAN-IDs encoded in the data frames to the
     VLAN-IDs used by the TSs.
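   The VID allocation steps above can be summarized with the following
   non-normative sketch (Python). The class and method names are
   hypothetical and only mirror the behavior described in this
   section: a locally significant VID is drawn from the per-domain
   pool when the first TS of a VN appears under an NVE, and is
   returned to the pool when the last TS leaves.

   <CODE BEGINS>
   # Non-normative sketch of NVA-side VLAN-ID pool management.
   # All names are illustrative; no NVA implementation is implied.

   class NvaVidManager:
       """Locally significant VLAN-IDs per NVE (L2 access domain)."""

       def __init__(self):
           self.pools = {}     # nve -> set of unused local VLAN-IDs
           self.vid_map = {}   # (nve, global VN-ID) -> local VLAN-ID
           self.ts_count = {}  # (nve, global VN-ID) -> attached TSs

       def ts_added(self, nve, vnid):
           """First TS of a VN under an NVE draws a VID from the pool."""
           key = (nve, vnid)
           if key not in self.vid_map:
               pool = self.pools.setdefault(nve, set(range(2, 4095)))
               self.vid_map[key] = pool.pop()
               self.ts_count[key] = 0
               # Here the NVA would push the VN-ID <-> VID mapping to
               # the NVE (or the NVE would pull it on demand).
           self.ts_count[key] += 1
           return self.vid_map[key]

       def ts_removed(self, nve, vnid):
           """When the last TS leaves, the VID returns to the pool."""
           key = (nve, vnid)
           self.ts_count[key] -= 1
           if self.ts_count[key] == 0:
               self.pools[nve].add(self.vid_map.pop(key))
               del self.ts_count[key]
   <CODE ENDS>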
5. Layer 2 Extension

5.1. Layer 2 Extension Problem

   Consider a scenario where a TS that is a member of a given L2-based
   VN moves from one server to another, and these two servers are in
   different L2 physical attachments, where these domains may be
   located in the same or different data centers (or PODs). In order
   to enable communication between this TS and other TSs of that L2-
   based VN, the new L2 physical attachment must become interconnected
   with the other L2 physical attachment(s) that presently contain the
   rest of the TSs of that VN, and the interconnect must not violate
   the L2-based VN requirement to preserve source and destination MAC
   addresses in the Ethernet header of the packets exchanged between
   this TS and other members of that VN.

   Moreover, if the previous L2 physical attachment no longer contains
   any TSs of that VN, the previous domain no longer needs to be
   interconnected with the other L2 physical attachment(s) that
   contain the rest of the TSs of that VN.

   Note that supporting TS mobility implies that the set of L2
   physical attachments that contain TSs belonging to a given L2-based
   VN may change over time (new domains added, old domains deleted).

   We will refer to this as the "layer 2 extension problem".

   Note that the layer 2 extension problem is a special case of
   maintaining connectivity in the presence of TS mobility, as the
   former restricts communicating TSs to a single/common L2-based VN,
   while the latter does not.

5.2. NVA based Layer 2 Extension Solution

   Assume NVO3's NVA has at least the following information for each
   TS:

   . Inner Address: the TS (host) address family (IPv4/IPv6, MAC,
     virtual network identifier MPLS/VLAN, etc.);

   . Outer Address: the list of locally attached edges (NVEs).
     Normally one TS is attached to one edge; a TS could also be
     attached to 2 edges for redundancy (dual homing). One TS is
     rarely attached to more than 2 edges, though it is possible;

   . VN Context (VN ID and/or VN Name);

   . a timer for how long NVEs keep the entry after it is pushed down
     to, or pulled by, an NVE; and

   . optionally, the list of interested remote edges (NVEs). This
     information allows the NVA to promptly update the relevant edges
     (NVEs) when there is any change to this TS's attachment to edges
     (NVEs). However, this information doesn't have to be kept per TS;
     it can be kept per VN.

   The NVA can offer its services in a Push mode, a Pull mode, or a
   combination of the two.

   In this solution, the NVEs are connected via the underlay IP
   network. For each VN, the NVA informs all the NVEs to which the TSs
   of the given VN are attached.

   When the last TS of a VN is moved away from an NVE, the NVE can
   either confirm with the NVA, or the NVA notifies the NVE, that it
   should remove its connectivity to the VN. When an NVE needs to
   support connectivity to a VN not currently supported (as a result
   of TS turn-up or TS migration), the NVA will push the necessary VN
   information to the NVE.

   The term "NVE being connected to a VN" means that the NVE at least
   has:

   . the inner-outer address mapping information for all the TSs in
     the VN, or the ability to pull the mapping from the NVA;

   . the mapping of the local VLAN-ID to the VNID used by the overlay
     header; and

   . the VN's default gateway IP/MAC address.
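   The per-TS state and the Push mode described above can be
   illustrated with the following non-normative sketch (Python). The
   field and function names are hypothetical assumptions, and the
   NVA-to-NVE control channel is stubbed out.

   <CODE BEGINS>
   # Non-normative sketch of NVA state and a push-mode update when a
   # TS migrates.  All names are illustrative.

   from dataclasses import dataclass, field

   @dataclass
   class TsEntry:
       inner_addrs: list   # TS IP/MAC addresses
       outer_nves: list    # attached NVEs (1, or 2 when dual-homed)
       vn_context: str     # VN ID and/or VN Name
       ttl: int = 300      # seconds an NVE may cache this entry

   @dataclass
   class Nva:
       ts_table: dict = field(default_factory=dict)    # ts -> entry
       vn_members: dict = field(default_factory=dict)  # vn -> NVEs

       def ts_moved(self, ts, new_nve):
           """TS migrated: record the new NVE, update the VN's NVEs."""
           entry = self.ts_table[ts]
           entry.outer_nves = [new_nve]
           vn = entry.vn_context
           members = self.vn_members.setdefault(vn, set())
           members.add(new_nve)  # new NVE is now connected to the VN
           for nve in members:
               self.push_update(nve, ts, entry)

       def push_update(self, nve, ts, entry):
           # Stub for the NVA-to-NVE control channel.
           print(f"to {nve}: {ts} -> {entry.outer_nves}")
   <CODE ENDS>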
6. Optimal IP Routing

   In the context of this document optimal IP routing, or just optimal
   routing, in the presence of TS mobility can be partitioned into two
   problems:

   - Optimal routing of a TS's outbound traffic. This means that as a
     given TS moves from one server to another, the TS's default
     gateway should be in close topological proximity to the ToR that
     connects the server presently hosting that TS. Note that when we
     talk about optimal routing of the TS's outbound traffic, we mean
     traffic from that TS to destinations that are outside of the TS's
     L2-based VN. This document refers to this problem as the TS
     default gateway problem.

   - Optimal routing of a TS's inbound traffic. This means that as a
     given TS moves from one server to another, the (inbound) traffic
     originated outside of the TS's L2-based VN and destined to that
     TS should be routed via the router of the TS's L2-based VN that
     is in close topological proximity to the ToR that connects the
     server presently hosting that TS, without first traversing some
     other router of that L2-based VN (the router of the TS's L2-based
     VN may be either a DCBR or the ToR itself). This is also known as
     avoiding "triangular routing". This document refers to this
     problem as the triangular routing problem.

   In order to avoid triangular routing, routers in the Wide Area
   Network have to be aware of which DCBRs can reach the designated
   TSs. When the TSs in a single VN are spread across many different
   DCBRs, all the individual TSs' addresses have to be visible to
   those routers, which can dramatically increase the number of routes
   in those routers.

   If a VN is spread across multiple DCBRs and all those DCBRs
   announce the same IP prefix for the VN, there could be many issues,
   including:

   - Traffic could go to DCBR "A" while the target is behind DCBR "B",
     with DCBR "A" connected to DCBR "B" via the WAN.

   - If the majority of one VN's members are under DCBR "A" and the
     rest are spread across a number of other DCBRs, will DCBR "A"
     have the same weight as DCBRs "B", "C", etc.?

   If all those DCBRs announce the individual IP addresses that are
   directly attached, and those IP addresses are not segmented well,
   then all the TSs' IP addresses have to be exposed to the WAN. In
   that case the overlay hides the TSs' IP addresses from the core
   switches in one DC or one POD, but exposes them to the WAN. There
   are more routers in the WAN than there are core switches in one
   DC/POD.

   The ability to deliver optimal routing (as defined above) in the
   presence of stateful devices is outside the scope of this document.

6.1. Preserving Policies

   Moving a TS from one L2 physical attachment to another means (among
   other things) that the NVE in the new domain that provides
   connectivity between this TS and TSs in other L2 physical
   attachments must be able to implement the policies that control
   connectivity between this TS and TSs in other L2 physical
   attachments. In other words, the policies that control connectivity
   between a given TS and its peers MUST NOT change as the TS moves
   from one L2 physical attachment to another. Moreover, policies, if
   any, within the L2 physical attachment that contains a given TS
   MUST NOT preclude realization of the policies that control
   connectivity between this TS and its peers. All of the above is
   irrespective of whether the L2 physical attachments are trivial or
   not.

   There could be policies guarding TSs across different VNs, with
   some being enforced by firewalls and some enforced by NAT/Anti-
   DDoS/IPS/IDS, etc. The issue is less about NVE policies being
   maintained when TSs move; it is more about dynamically changing the
   policies associated with the "middle boxes" attached to NVEs (if
   those middle boxes are distributed).

6.2. TS Default Gateway solutions

   As a TS moves to a new L2 site, the default gateway IP address of
   the TS may not change. Further, while with cold TS mobility one may
   assume that the TS's ARP/ND cache gets flushed once the TS moves to
   another server, one cannot make such an assumption with hot TS
   mobility.

   Thus the destination MAC address in the inter-VN/inter-subnet
   traffic originated by that TS would not change as the TS moves to
   the new site. Given that, how would the NVE(s) connected to the new
   L2 site be able to recognize inter-VN/inter-subnet traffic
   originated by that TS? The following describes possible solutions.

6.2.1. Solution with Anycast for TS Default Gateways

   This solution relies on the use of an anycast default gateway IP
   address and an anycast default gateway MAC address.

   If DCBRs act as the default gateway for a given L2-based VN, then
   these anycast addresses are configured on these DCBRs. Likewise, if
   ToRs act as default gateways, then these anycast addresses are
   configured on these ToRs.
   All TSs of that L2-based VN are (auto) configured with the
   (anycast) IP address of the default gateway.

   DCBRs (or ToRs) acting as default gateways use these anycast
   addresses as follows:

   - When a particular NVE receives a packet from the local L2
     attachment with the (anycast) default gateway MAC address, the
     NVE applies IP forwarding to the packet, and performs the NVE
     function if the destination of the packet is attached to another
     NVE.

   - When a particular DCBR (or ToR) acting as a default gateway
     receives an ARP/ND Request from the local L2 attachment for the
     default gateway (anycast) IP address, the DCBR (or ToR) generates
     an ARP/ND Reply.

   This ensures that a particular DCBR (or ToR), acting as a default
   gateway, can always apply IP forwarding to the packets sent by a TS
   to the (anycast) default gateway MAC address. It also ensures that
   such a DCBR (or ToR) can respond to the ARP Request generated by a
   TS for the default gateway (anycast) IP address.

   Except for gratuitous ARP/ND, DCBRs (or ToRs) acting as default
   gateways must never use the anycast default gateway MAC address as
   the source MAC address in the packets they originate, and must not
   use the anycast default gateway IP address as the source IP address
   in the overlay header.

   Note that multiple L2-based VNs may share the same MAC address for
   use as the (anycast) MAC address of the default gateway for these
   VNs.

   If the default gateway functionality is not in the NVEs (ToRs),
   then the default gateway MAC/IP addresses need to be distributed to
   all NVEs.
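   The two forwarding rules above amount to a simple classification at
   the gateway, sketched below (Python, non-normative). The addresses
   and the frame structure are hypothetical examples chosen only for
   illustration.

   <CODE BEGINS>
   # Non-normative sketch of the data-path decision at a DCBR (or
   # ToR) acting as anycast default gateway.

   from dataclasses import dataclass

   ANYCAST_GW_MAC = "00:00:5e:00:01:01"  # shared by the gateways
   ANYCAST_GW_IP = "192.0.2.1"           # (anycast) gateway IP

   @dataclass
   class Frame:
       dst_mac: str
       arp_target_ip: str = ""           # set on ARP/ND Requests

   def classify(frame):
       if frame.arp_target_ip == ANYCAST_GW_IP:
           return "reply-with-anycast-gw-mac"  # answer ARP/ND itself
       if frame.dst_mac == ANYCAST_GW_MAC:
           return "ip-forward"                 # inter-subnet traffic
       return "l2-forward"                     # intra-VN, bridge as-is

   assert classify(Frame(dst_mac=ANYCAST_GW_MAC)) == "ip-forward"
   <CODE ENDS>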
6.2.2. Distributed Proxy Default Gateway Solution

   This solution does not require configuring the anycast default
   gateway IP and MAC addresses for TSs.

   In this solution, NVEs perform the function of the default gateway
   for all the attached TSs. Those NVEs are called "Proxy Default
   Gateways" in this document because they might not be the default
   gateways explicitly configured on the attached TSs. Some of those
   proxy default gateway NVEs might not have the complete inter-subnet
   communication policies for the attached VNs.

   In order to ensure that the destination MAC address in the inter-
   VN/inter-subnet traffic originated by a TS does not change as the
   TS moves to a different NVE, a pseudo MAC address is assigned to
   all NVE-based Proxy Default Gateways.

   When a particular NVE acting as a Proxy Default Gateway receives an
   ARP/ND Request from an attached TS for the TS's default gateway IP
   address, the NVE suppresses the ARP/ND Request from being forwarded
   and generates an ARP/ND Reply with the pseudo MAC address.

   When a particular NVE acting as a Proxy Default Gateway receives a
   packet with the pseudo default gateway MAC address:

   - if the NVE has all the needed policies for the source and
     destination VNs, the NVE applies IP forwarding, i.e., forwards
     the packet from the source VN to the destination VN, and applies
     the NVE encapsulation function with the target NVE as the
     destination address and the destination VN identifier in the
     header;

   - if the NVE doesn't have the needed policies from the source VN to
     the destination VN, the NVE applies the NVE encapsulation
     function with the real host's default gateway as the destination
     address and the source VN identifier in the header.

   This solution assumes that the NVE-based proxy default gateways get
   the mapping of a host's default gateway IP <-> default gateway MAC
   either from the corresponding NVA or via ARP/ND discovery.

6.3. Triangular Routing

   The triangular routing solution can be partitioned into two
   components: an intra data center triangular routing solution, and
   an inter data center triangular routing solution. The former
   handles the situation where the communicating TSs are in the same
   data center. The latter handles all other cases. This draft only
   describes the solution for intra data center triangular routing.

   To avoid triangular routing, each NVE needs to have the egress NVEs
   for the potential destinations of packets originated from the
   attached TSs.

   One approach is for each NVE to announce its directly attached TSs'
   addresses to all other NVEs that participate in the TSs' VNs.

   Another approach is for the NVA to distribute the VN-scoped TS
   Address <-> NVE mappings to all the NVEs. See Section 7 for the
   detailed mechanism.

7. L3 Address Migration

   When the attachment to the NVE is L3 based, TS migration can cause
   one subnetwork to be scattered among many NVEs, i.e., fragmented
   addresses.

   The outbound traffic of fragmented L3 addresses doesn't have the
   same issues as L2 address migration, but the inbound traffic has
   the same issues as L2 address migration (Section 6).

   Optimal routing of a TS's inbound traffic means that as a given TS
   moves from one server to another, the (inbound) traffic originated
   outside of the TS's directly attached NVE, and destined to that TS,
   should be routed optimally to the NVE to which the server presently
   hosting that TS is attached, without first traversing some other
   NVEs. This is also known as avoiding "triangular routing".

   In theory, host routing by every NVE (including the NVEs attached
   to DCBRs) can achieve optimal inbound forwarding in a very
   fragmented network. When the TSs' IP addresses under all the NVEs
   can't be aggregated at all, an NVE needs to support host routes for
   the combined number of TSs of all the VNs enabled on the NVE. The
   following arithmetic shows that host routing on a server based NVE
   or a ToR based NVE can be supported relatively easily even under
   the worst case scenario:

   . Suppose an NVE has TSs belonging to X VNs, and suppose each VN
     has 200 hosts (spread among many NVEs); then the worst case
     scenario (i.e., the maximum number of routes that the NVE needs
     to hold) is 200*X.

   . For a server based NVE, the number of VNs enabled on the NVE has
     to be less than the number of VMs instantiated on the server. The
     industry state of the art virtualization technology allows a
     maximum of 100 VMs on one server. So the worst case scenario (the
     maximum number of routes that the NVE needs to hold) is 100*200 =
     20,000.

   . For a ToR based NVE, the number of TSs can be the number of TSs
     per server times the number of servers attached to the ToR (a
     typical ToR has 40 to 48 downstream ports to servers). Assuming
     40 servers, the worst case scenario is 40*100*200 = 800,000.
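   The worst case arithmetic above is restated in the short
   calculation below (Python, non-normative); the inputs (200 hosts
   per VN, 100 VMs per server, 40 servers per ToR) are this section's
   illustrative assumptions, not limits.

   <CODE BEGINS>
   # Worst-case host-route counts, restating the arithmetic above.

   HOSTS_PER_VN = 200

   def max_routes(num_vns):
       # Each VN can contribute up to HOSTS_PER_VN host routes.
       return num_vns * HOSTS_PER_VN

   server_nve = max_routes(100)     # 100-VM server -> 20,000 routes
   tor_nve = max_routes(40 * 100)   # 40 servers x 100 VMs -> 800,000
   print(server_nve, tor_nve)
   <CODE ENDS>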
   But host routing can be challenging on the NVEs attached to Data
   Center gateways. Those NVEs usually need to support all the VNs
   enabled in the data center. There could be hundreds of thousands of
   hosts/VMs, sometimes millions, due to business demand and highly
   advanced server virtualization technologies.

   For those data centers with millions of TSs, the following approach
   should be considered:

   . Some NVEs (e.g., ToR/server based NVEs) support host routes, and

   . some NVEs (e.g., the NVEs attached to data center gateways) that
     participate in a large number of VNs (if not all VNs) do not
     support host routes. Those NVEs are called "non-host-route" NVEs
     in this draft.

   Those non-host-route NVEs have one or two egress NVEs as the
   designated forwarders for a VN (subnet), even if the VN (subnet) is
   spread across many NVEs. For example, if a high percentage of the
   TSs of one subnet is attached to NVE "X" and the remaining small
   percentage of the subnet is spread around many NVEs, the non-host-
   route NVEs can have NVE "X" as the designated egress for the VN. By
   doing so, the "triangular routing" for the traffic destined to TSs
   in this VN (subnet) can be greatly reduced.

   To avoid loops, the designated NVEs must support host routes.

   It is worth noting that the NVEs that have host routes send traffic
   directly to the egress NVEs, because they have the detailed
   information. Only the NVEs that don't have host routes for a VN
   (most likely the NVEs attached to the gateways) send traffic to the
   VN's (subnet's) designated NVEs. The NVEs that prefer not to have
   host routes need to notify the NVA that they only want designated
   NVEs, or this can be configured in the NVA.

   ECMP is another approach that can be used by those non-host-route
   NVEs when VNs are spread across many NVEs. The ECMP approach
   basically assigns all the NVEs that have TSs of a VN attached as
   the "designated egress NVEs" for the VN. Again, to avoid loops,
   those designated egress NVEs have to support host routes. The ECMP
   approach may cause most packets from those non-host-route NVEs (if
   not all) to traverse two NVEs before reaching the packets'
   destinations.

8. Managing duplicated addresses

   This document assumes that during VM migration a given MAC address
   within a VN can exist at only one TS at a time. As TSs move around
   NVEs, it is possible that the network state may not be immediately
   synchronized. It is important for NVEs to report their directly
   attached TSs to the NVA on a periodic basis, so that the NVA can
   generate alarms and fix duplicated address issues.

9. Manageability Considerations

   Several solutions described in this document depend on the presence
   of an NVA in the data center.

10. Security Considerations

   In addition to the security considerations described in [nvo3-
   problem], it is clear that allowing TSs to migrate across data
   centers will require more stringent security enforcement. The
   traditional placement of security functions, e.g., firewalls, at
   data center gateways is no longer enough. TS mobility will require
   security functions to enforce policies on east-west traffic among
   TSs.

   When TSs move across data centers, the associated policies have to
   be updated and enforced.

11. IANA Considerations

   This document requires no IANA actions. RFC Editor: Please remove
   this section before publication.
12. Acknowledgements

   The authors would like to thank Adrian Farrel, David Black, Dave
   Allen, Tom Herbert and Larry Kreeger for their review and comments.
   The authors would also like to thank Ivan Pepelnjak for his
   contributions to this document.

13. References

13.1. Normative References

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

13.2. Informative References

   [nvo3-problem] Narten, T., et al., "Problem Statement: Overlays for
             Network Virtualization", draft-ietf-nvo3-overlay-problem-
             statement-04, July 2013.

   [nvo3-framework] Lasserre, M., et al., "Framework for Data Center
             (DC) Network Virtualization", draft-ietf-nvo3-framework,
             work in progress.

   [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private
             Networks (VPNs)", RFC 4364, February 2006.

   [RFC4684] Marques, P., et al., "Constrained Route Distribution for
             Border Gateway Protocol/MultiProtocol Label Switching
             (BGP/MPLS) Internet Protocol (IP) Virtual Private
             Networks (VPNs)", RFC 4684, November 2006.

   [E-VPN]   Aggarwal, R., et al., "BGP MPLS Based Ethernet VPN",
             draft-ietf-l2vpn-evpn, work in progress.

   [Default-Gateway] "BGP Extended Communities", IANA registry,
             http://www.iana.org/assignments/bgp-extended-communities

   [DC-mobility] Aggarwal, R., et al., "Data Center Mobility based on
             E-VPN, BGP/MPLS IP VPN, IP Routing and NHRP", draft-
             raggarwa-data-center-mobility-07, June 2014.

Authors' Addresses

   Yakov Rekhter
   Juniper Networks
   1194 North Mathilda Ave.
   Sunnyvale, CA 94089
   Email: yakov@juniper.net

   Linda Dunbar
   Huawei Technologies
   5340 Legacy Drive, Suite 175
   Plano, TX 75024, USA
   Email: ldunbar@huawei.com

   Rahul Aggarwal
   Arktan, Inc
   Email: raggarwa_1@yahoo.com

   Wim Henderickx
   Alcatel-Lucent
   Email: wim.henderickx@alcatel-lucent.com

   Ravi Shekhar
   Juniper Networks
   1194 North Mathilda Ave.
   Sunnyvale, CA 94089
   Email: rshekhar@juniper.net

   Luyuan Fang
   Microsoft
   Email: lufang@microsoft.com

   Ali Sajassi
   Cisco Systems
   Email: sajassi@cisco.com