Network Working Group                                             X. Xu
Internet-Draft                                      Huawei Technologies
Category: Informational                                        S. Hares

                                                                 Y. Fan
                                                          China Telecom

                                                           C. Jacquenet
                                                          France Telecom

Expires: January 2014                                     July 15, 2013


        Virtual Subnet: A L3VPN-based Subnet Extension Solution
                    draft-xu-l3vpn-virtual-subnet-00

Abstract

   This document describes a Layer 3 Virtual Private Network (L3VPN)-
   based subnet extension solution referred to as Virtual Subnet, which
   can be used as a Layer 3 network virtualization overlay approach for
   data center interconnect.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 15, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Table of Contents

   1. Introduction
   2. Terminology
   3. Solution Description
      3.1. Unicast
           3.1.1. Intra-subnet Unicast
           3.1.2. Inter-subnet Unicast
      3.2. Multicast
      3.3. CE Host Discovery
      3.4. ARP/ND Proxy
      3.5. CE Host Mobility
      3.6. Forwarding Table Scalability
           3.6.1. MAC Table Reduction on Data Center Switches
           3.6.2. PE Router FIB Reduction
           3.6.3. PE Router RIB Reduction
      3.7. ARP/ND Cache Table Scalability on Default Gateways
      3.8. ARP/ND and Unknown Unicast Flood Avoidance
      3.9. Path Optimization
   4. Considerations for Non-IP Traffic
   5. Security Considerations
   6. IANA Considerations
   7. Acknowledgements
   8. References
      8.1. Normative References
      8.2. Informative References
   Authors' Addresses

1. Introduction

   For business continuity purposes, Virtual Machine (VM) migration
   across data centers is commonly used in situations such as data
   center maintenance, migration, consolidation, expansion, and
   disaster avoidance.  IP renumbering of servers (i.e., VMs) after
   such a migration is usually complex and costly, and risks extending
   the business downtime during the migration process.  To allow a VM
   to be migrated from one data center to another without IP
   renumbering, the subnet on which the VM resides needs to be
   extended across those data centers.

   In Infrastructure-as-a-Service (IaaS) cloud data center
   environments, to achieve subnet extension across multiple data
   centers in a scalable way, the following requirements SHOULD be
   considered by any data center interconnect solution:

   1) VPN Instance Space Scalability

      In a modern cloud data center environment, thousands or even
      tens of thousands of tenants may be hosted over a shared network
      infrastructure.  For security and performance isolation
      purposes, these tenants need to be isolated from one another.
      Hence, the data center interconnect solution SHOULD be capable
      of providing a large enough Virtual Private Network (VPN)
      instance space for tenant isolation.

   2) Forwarding Table Scalability

      With the development of server virtualization technologies, it
      is not uncommon for a single cloud data center to contain
      millions of VMs.  Such numbers already pose a significant
      challenge to data center switches, especially core/aggregation
      switches, from the perspective of forwarding table scalability.
      If multiple data centers of such a scale were interconnected at
      Layer 2, this challenge would become even worse.
      Hence, an ideal data center interconnect solution SHOULD prevent
      the forwarding table size of data center switches from growing
      by multiples as the number of interconnected data centers
      increases.  Furthermore, if any L2VPN or L3VPN technology is
      used for interconnecting data centers, the scale of the
      forwarding tables on PE routers SHOULD be taken into
      consideration as well.

   3) ARP/ND Cache Table Scalability on Default Gateways

      [RFC6820] notes that the Address Resolution Protocol (ARP)/
      Neighbor Discovery (ND) cache tables maintained by default
      gateways in cloud data centers can raise both scalability and
      security issues.  Therefore, an ideal data center interconnect
      solution SHOULD prevent the ARP/ND cache table size from growing
      by multiples as the number of interconnected data centers
      increases.

   4) ARP/ND and Unknown Unicast Flood Suppression or Avoidance

      It is well known that the flooding of ARP/ND broadcast/multicast
      and unknown unicast traffic within a large Layer 2 network is
      likely to affect the performance of networks and hosts.  When
      multiple data centers, each containing millions of VMs, are
      interconnected across the Wide Area Network (WAN) at Layer 2,
      the impact of such flooding becomes even worse.  As such, it is
      increasingly desirable for data center operators to suppress or
      even avoid the flooding of ARP/ND broadcast/multicast and
      unknown unicast traffic across data centers.

   5) Path Optimization

      A subnet usually indicates a location in the network.  However,
      when a subnet has been extended across multiple geographically
      dispersed data center locations, the location semantics of that
      subnet are no longer retained.  As a result, traffic from a
      cloud user (i.e., a VPN user) destined for a server located at
      one data center location of the extended subnet may first arrive
      at another data center location according to the subnet route,
      and then be forwarded to the location where the server is
      actually located.  This suboptimal routing obviously results in
      unnecessary consumption of the bandwidth resources intended for
      data center interconnection.  Furthermore, in the case where
      traditional VPLS technology [RFC4761] [RFC4762] is used for data
      center interconnect and the default gateways of different data
      center locations are configured within the same virtual router
      redundancy group, the returning traffic from that server to the
      cloud user may be forwarded at Layer 2 to a default gateway
      located at one of the remote data center premises, rather than
      to the one placed at the local data center location.  This
      suboptimal routing also unnecessarily consumes the bandwidth
      resources intended for data center interconnect.

   This document describes an L3VPN-based subnet extension solution
   referred to as Virtual Subnet (VS), which can meet all of the cloud
   data center interconnect requirements described above.  Since VS
   mainly reuses existing technologies, including BGP/MPLS IP VPN
   [RFC4364] and ARP/ND proxy [RFC925] [RFC1027] [RFC4389], it allows
   service providers offering IaaS public cloud services to
   interconnect their geographically dispersed data centers in a
   highly scalable way.  More importantly, the data center
   interconnection design can rely upon their existing MPLS/BGP IP VPN
   infrastructures and their experience in the delivery and operation
   of MPLS/BGP IP VPN services.
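   As an informative illustration of how these reused building blocks
   fit together, the following Python sketch (illustrative only; the
   class and function names are invented for this example and are not
   part of any router implementation) models a per-tenant VRF holding
   host routes, together with the ARP-proxy decision rule the solution
   relies on: a PE answers an ARP request only when it holds a host
   route for the target whose outgoing interface differs from the
   interface on which the request arrived.

```python
# Hypothetical sketch of Virtual Subnet's two reused building blocks:
# per-VRF host routes and the ARP-proxy decision.  All names here are
# invented for illustration; this is not a router API.
import ipaddress

class Vrf:
    """A per-tenant routing table of (prefix, nexthop, iface) entries."""
    def __init__(self):
        self.routes = []

    def add(self, prefix, nexthop, iface):
        self.routes.append((ipaddress.ip_network(prefix), nexthop, iface))

    def lookup(self, dst):
        """Longest-prefix match; returns (nexthop, iface) or None."""
        dst = ipaddress.ip_address(dst)
        best = None
        for net, nexthop, iface in self.routes:
            if dst in net and (best is None or
                               net.prefixlen > best[0].prefixlen):
                best = (net, nexthop, iface)
        return (best[1], best[2]) if best else None

def should_proxy_arp(vrf, target_ip, in_iface):
    """Reply to an ARP request only if a route for the target exists and
    its outgoing interface differs from the one the request came on."""
    hit = vrf.lookup(target_ip)
    return hit is not None and hit[1] != in_iface

# A two-PE example: one local host on the attachment circuit, one
# remote host reachable through the backbone via the other PE.
vrf_a = Vrf()
vrf_a.add("1.1.1.2/32", "1.1.1.2", "ce-facing")   # local CE host
vrf_a.add("1.1.1.3/32", "PE-2",    "core-facing") # remote, via IBGP

print(should_proxy_arp(vrf_a, "1.1.1.3", "ce-facing"))  # True
print(should_proxy_arp(vrf_a, "1.1.1.2", "ce-facing"))  # False
```

   Run against this two-PE example, the PE proxies for the remote host
   but stays silent for a host on the same attachment circuit, which
   matches the proxy rule described later in Section 3.4.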
   Although Virtual Subnet is described as a data center
   interconnection solution in this document, there is no reason to
   assume that this technology could not be used within data centers.

2. Terminology

   This memo makes use of the terms defined in [RFC4364], [RFC5798],
   [MVPN], and [VA-AUTO].

3. Solution Description

3.1. Unicast

3.1.1. Intra-subnet Unicast

                        +--------------------+
   +-----------------+  |                    |  +-----------------+
   |VPN_A:1.1.1.1/24 |  |                    |  |VPN_A:1.1.1.1/24 |
   |               \ |  |                    |  | /               |
   |  +------+     \++---+-+            +-+---++/     +------+    |
   |  |Host A+----+ PE-1  |              |  PE-2 +----+Host B|    |
   |  +------+\   ++-+-+-+              +-+-+-++     /+------+    |
   |  1.1.1.2/24   | | |                  | | |      1.1.1.3/24   |
   |               | | |                  | | |                   |
   |    DC West    | | |  IP/MPLS Backbone  | | |    DC East      |
   +-----------------+  |                  |  +-----------------+
                        +--------------------+

       VRF_A on PE-1 :                    VRF_A on PE-2 :
   +------------+---------+--------+  +------------+---------+--------+
   | Prefix     | Nexthop |Protocol|  | Prefix     | Nexthop |Protocol|
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.1/32 |127.0.0.1| Direct |  | 1.1.1.1/32 |127.0.0.1| Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.2/32 | 1.1.1.2 | Direct |  | 1.1.1.2/32 |  PE-1   |  IBGP  |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.3/32 |  PE-2   |  IBGP  |  | 1.1.1.3/32 | 1.1.1.3 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.0/24 | 1.1.1.1 | Direct |  | 1.1.1.0/24 | 1.1.1.1 | Direct |
   +------------+---------+--------+  +------------+---------+--------+

               Figure 1: Intra-subnet Unicast Example

   As shown in Figure 1, two CE hosts (i.e., Hosts A and B) belonging
   to the same subnet (i.e., 1.1.1.0/24) are located in different data
   centers (i.e., DC West and DC East), respectively.
   The PE routers (i.e., PE-1 and PE-2) used for interconnecting these
   two data centers create host routes for their local CE hosts and
   advertise them via L3VPN signaling.  Meanwhile, ARP proxy is
   enabled on the VRF attachment circuits of these PE routers.

   Now assume Host A sends an ARP request for Host B before
   communicating with it.  Upon receiving the ARP request, PE-1,
   acting as an ARP proxy, returns its own MAC address as a response.
   Host A then sends IP packets for Host B to PE-1.  PE-1 tunnels
   such packets towards PE-2, which in turn forwards them to Host B.
   Thus, Hosts A and B can communicate with each other as if they were
   located within the same subnet.

3.1.2. Inter-subnet Unicast

                        +--------------------+
   +-----------------+  |                    |  +-----------------+
   |VPN_A:1.1.1.1/24 |  |                    |  |VPN_A:1.1.1.1/24 |
   |               \ |  |                    |  | /               |
   |  +------+     \++---+-+            +-+---++/      +------+   |
   |  |Host A+------+ PE-1 |            | PE-2 +-+----+Host B|    |
   |  +------+\     ++-+-+-+            +-+-+-++ |   /+------+    |
   |  1.1.1.2/24     | | |                | | |  |   1.1.1.3/24   |
   |  GW=1.1.1.4     | | |                | | |  |   GW=1.1.1.4   |
   |                 | | |                | | |  |    +------+    |
   |                 | | |                | | |  +----+ GW +--    |
   |                 | | |                | | |      /+------+    |
   |                 | | |                | | |     1.1.1.4/24    |
   |                 | | |                | | |                   |
   |    DC West      | | | IP/MPLS Backbone | | |     DC East     |
   +-----------------+  |                  |  +-----------------+
                        +--------------------+

       VRF_A on PE-1 :                    VRF_A on PE-2 :
   +------------+---------+--------+  +------------+---------+--------+
   | Prefix     | Nexthop |Protocol|  | Prefix     | Nexthop |Protocol|
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.1/32 |127.0.0.1| Direct |  | 1.1.1.1/32 |127.0.0.1| Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.2/32 | 1.1.1.2 | Direct |  | 1.1.1.2/32 |  PE-1   |  IBGP  |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.3/32 |  PE-2   |  IBGP  |  | 1.1.1.3/32 | 1.1.1.3 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.4/32 |  PE-2   |  IBGP  |  | 1.1.1.4/32 | 1.1.1.4 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.0/24 | 1.1.1.1 | Direct |  | 1.1.1.0/24 | 1.1.1.1 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 0.0.0.0/0  |  PE-2   |  IBGP  |  | 0.0.0.0/0  | 1.1.1.4 | Static |
   +------------+---------+--------+  +------------+---------+--------+

             Figure 2: Inter-subnet Unicast Example (1)

   As shown in Figure 2, only one data center (i.e., DC East) is
   deployed with a default gateway (i.e., GW).  PE-2, which is
   connected to GW, would either be configured with, or learn from GW,
   a default route with the next hop pointing to GW.  Meanwhile, this
   route is distributed to the other PE routers (i.e., PE-1) as per
   normal [RFC4364] operation.  Assume Host A sends an ARP request for
   its default gateway (i.e., 1.1.1.4) prior to communicating with a
   destination host outside its subnet.  Upon receiving this ARP
   request, PE-1, acting as an ARP proxy, returns its own MAC address
   as a response.  Host A then sends packets destined for the
   off-subnet host to PE-1.  PE-1 tunnels such packets towards PE-2
   according to the default route learnt from PE-2, and PE-2 in turn
   forwards them to GW.
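   The forwarding decision described above is plain longest-prefix
   matching over the VRF.  The following Python sketch (illustrative
   only; the `lpm` helper and the route list are invented for this
   example) replays PE-1's VRF_A from Figure 2: the /32 host route
   steers intra-subnet traffic straight to PE-2, while any off-subnet
   destination falls through to the default route learnt from PE-2.

```python
# Longest-prefix match over PE-1's VRF_A from Figure 2 (sketch only).
# "local" marks directly connected entries; real routers would hold
# richer state, but the selection logic is the same.
import ipaddress

def lpm(routes, dst):
    """Return the nexthop of the most specific route covering dst."""
    dst = ipaddress.ip_address(dst)
    matches = [(net, nh) for net, nh in routes if dst in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1] if matches else None

vrf_a = [
    (ipaddress.ip_network("1.1.1.2/32"), "local"),  # Host A, attached
    (ipaddress.ip_network("1.1.1.3/32"), "PE-2"),   # Host B, via IBGP
    (ipaddress.ip_network("1.1.1.4/32"), "PE-2"),   # GW, via IBGP
    (ipaddress.ip_network("1.1.1.0/24"), "local"),  # the virtual subnet
    (ipaddress.ip_network("0.0.0.0/0"),  "PE-2"),   # default, via IBGP
]

print(lpm(vrf_a, "1.1.1.3"))   # PE-2: the host route wins over the /24
print(lpm(vrf_a, "8.8.8.8"))   # PE-2: only the default route matches
```

   Both an intra-subnet destination and an off-subnet destination
   resolve to PE-2 here, but via different routes: the /32 host route
   in the first case, and the default route towards GW in the second.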
                        +--------------------+
   +-----------------+  |                    |  +-----------------+
   |VPN_A:1.1.1.1/24 |  |                    |  |VPN_A:1.1.1.1/24 |
   |               \ |  |                    |  | /               |
   |  +------+     \++---+-+            +-+---++/      +------+   |
   |  |Host A+----+-+ PE-1 |            | PE-2 +-+----+Host B|    |
   |  +------+\   | ++-+-+-+            +-+-+-++ |   /+------+    |
   |  1.1.1.2/24  |  | | |                | | |  |   1.1.1.3/24   |
   |  GW=1.1.1.4  |  | | |                | | |  |   GW=1.1.1.4   |
   |  +------+    |  | | |                | | |  |    +------+    |
   |--+ GW-1 +----+  | | |                | | |  +----+ GW-2 +--  |
   |  +------+\      | | |                | | |      /+------+    |
   |  1.1.1.4/24     | | |                | | |     1.1.1.4/24    |
   |                 | | |                | | |                   |
   |    DC West      | | | IP/MPLS Backbone | | |     DC East     |
   +-----------------+  |                  |  +-----------------+
                        +--------------------+

       VRF_A on PE-1 :                    VRF_A on PE-2 :
   +------------+---------+--------+  +------------+---------+--------+
   | Prefix     | Nexthop |Protocol|  | Prefix     | Nexthop |Protocol|
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.1/32 |127.0.0.1| Direct |  | 1.1.1.1/32 |127.0.0.1| Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.2/32 | 1.1.1.2 | Direct |  | 1.1.1.2/32 |  PE-1   |  IBGP  |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.3/32 |  PE-2   |  IBGP  |  | 1.1.1.3/32 | 1.1.1.3 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.4/32 | 1.1.1.4 | Direct |  | 1.1.1.4/32 | 1.1.1.4 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.0/24 | 1.1.1.1 | Direct |  | 1.1.1.0/24 | 1.1.1.1 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 0.0.0.0/0  | 1.1.1.4 | Static |  | 0.0.0.0/0  | 1.1.1.4 | Static |
   +------------+---------+--------+  +------------+---------+--------+

             Figure 3: Inter-subnet Unicast Example (2)

   As shown in Figure 3, in the case where each data center is
   deployed with its own default gateway, CE hosts sending ARP
   requests for their default gateways get ARP responses directly from
   their local default gateways, rather than from their local PE
   routers.

                               +------+
                        +------+ PE-3 +------+
   +-----------------+  |      +------+      |  +-----------------+
   |VPN_A:1.1.1.1/24 |  |                    |  |VPN_A:1.1.1.1/24 |
   |               \ |  |                    |  | /               |
   |  +------+     \++---+-+            +-+---++/      +------+   |
   |  |Host A+------+ PE-1 |            | PE-2 +------+Host B|    |
   |  +------+\     ++-+-+-+            +-+-+-++     /+------+    |
   |  1.1.1.2/24     | | |                | | |     1.1.1.3/24    |
   |  GW=1.1.1.1     | | |                | | |     GW=1.1.1.1    |
   |                 | | |                | | |                   |
   |    DC West      | | | IP/MPLS Backbone | | |     DC East     |
   +-----------------+  |                  |  +-----------------+
                        +--------------------+

       VRF_A on PE-1 :                    VRF_A on PE-2 :
   +------------+---------+--------+  +------------+---------+--------+
   | Prefix     | Nexthop |Protocol|  | Prefix     | Nexthop |Protocol|
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.1/32 |127.0.0.1| Direct |  | 1.1.1.1/32 |127.0.0.1| Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.2/32 | 1.1.1.2 | Direct |  | 1.1.1.2/32 |  PE-1   |  IBGP  |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.3/32 |  PE-2   |  IBGP  |  | 1.1.1.3/32 | 1.1.1.3 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.0/24 | 1.1.1.1 | Direct |  | 1.1.1.0/24 | 1.1.1.1 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 0.0.0.0/0  |  PE-3   |  IBGP  |  | 0.0.0.0/0  |  PE-3   |  IBGP  |
   +------------+---------+--------+  +------------+---------+--------+

             Figure 4: Inter-subnet Unicast Example (3)

   Alternatively, as shown in Figure 4, the PE routers themselves
   could be directly configured as default gateways for their locally
   connected CE hosts, as long as these PE routers have routes to
   outside networks.

3.2. Multicast

   To support IP multicast between CE hosts of the same virtual
   subnet, MVPN technology [MVPN] could be reused directly.  For
   example, PE routers attached to a given VPN join a default provider
   multicast distribution tree dedicated to that VPN.  Ingress PE
   routers, upon receiving multicast packets from their local CE
   hosts, forward them towards remote PE routers through the
   corresponding default provider multicast distribution tree.

   More details about how to support multicast and broadcast in VS
   will be explored in a later version of this document.

3.3. CE Host Discovery

   PE routers SHOULD be able to discover their local CE hosts and keep
   the list of these hosts up to date in a timely manner, so as to
   ensure the availability and accuracy of the corresponding host
   routes originated from them.  PE routers could accomplish local CE
   host discovery using traditional host discovery mechanisms based on
   the ARP or ND protocols.  Furthermore, the Link Layer Discovery
   Protocol (LLDP) described in [802.1AB], the VSI Discovery and
   Configuration Protocol (VDP) described in [802.1Qbg], or even
   interaction with the data center orchestration system could also be
   considered as a means to dynamically discover local CE hosts.

3.4. ARP/ND Proxy

   Acting as ARP or ND proxies, PE routers SHOULD respond to an ARP
   request or Neighbor Solicitation (NS) message for a target host
   only when there is a corresponding host route in the associated VRF
   and the outgoing interface of that route is different from the one
   over which the ARP request or NS message arrived.

   In the scenario where a given VPN site (i.e., a data center) is
   multi-homed to more than one PE router via an Ethernet switch or an
   Ethernet network, the Virtual Router Redundancy Protocol (VRRP)
   [RFC5798] is usually enabled on these PE routers.
   In this case, only the PE router elected as the VRRP Master is
   allowed to perform the ARP/ND proxy function.

3.5. CE Host Mobility

   During the VM migration process, the PE router to which the moving
   VM is now attached creates a host route for that CE host upon
   receiving a notification message of VM attachment, while the PE
   router to which the moving VM was previously attached withdraws the
   corresponding host route upon receiving a notification message of
   VM detachment.  Meanwhile, the latter PE router could optionally
   broadcast a gratuitous ARP/ND message on behalf of that CE host,
   with the source MAC address being one of its own.  In this way, the
   ARP/ND entry for the moved CE host that has been cached on any
   local CE host is updated accordingly.

3.6. Forwarding Table Scalability

3.6.1. MAC Table Reduction on Data Center Switches

   In a VS environment, the MAC learning domain associated with a
   given virtual subnet that has been extended across multiple data
   centers is partitioned into segments, and each segment is confined
   within a single data center.  Therefore, data center switches only
   need to learn local MAC addresses, rather than both local and
   remote MAC addresses.

3.6.2. PE Router FIB Reduction

                               +------+
                        +------+RR/APR+------+
   +-----------------+  |      +------+      |  +-----------------+
   |VPN_A:1.1.1.1/24 |  |                    |  |VPN_A:1.1.1.1/24 |
   |               \ |  |                    |  | /               |
   |  +------+     \++---+-+            +-+---++/      +------+   |
   |  |Host A+------+ PE-1 |            | PE-2 +------+Host B|    |
   |  +------+\     ++-+-+-+            +-+-+-++     /+------+    |
   |  1.1.1.2/24     | | |                | | |     1.1.1.3/24    |
   |                 | | |                | | |                   |
   |    DC West      | | | IP/MPLS Backbone | | |     DC East     |
   +-----------------+  |                  |  +-----------------+
                        +--------------------+

   VRF_A on PE-1 :
   +------------+---------+--------+------+
   | Prefix     | Nexthop |Protocol|In_FIB|
   +------------+---------+--------+------+
   | 1.1.1.1/32 |127.0.0.1| Direct | Yes  |
   +------------+---------+--------+------+
   | 1.1.1.2/32 | 1.1.1.2 | Direct | Yes  |
   +------------+---------+--------+------+
   | 1.1.1.3/32 |  PE-2   |  IBGP  |  No  |
   +------------+---------+--------+------+
   | 1.1.1.0/25 |   RR    |  IBGP  | Yes  |
   +------------+---------+--------+------+
   |1.1.1.128/25|   RR    |  IBGP  | Yes  |
   +------------+---------+--------+------+
   | 1.1.1.0/24 | 1.1.1.1 | Direct | Yes  |
   +------------+---------+--------+------+

   VRF_A on PE-2 :
   +------------+---------+--------+------+
   | Prefix     | Nexthop |Protocol|In_FIB|
   +------------+---------+--------+------+
   | 1.1.1.1/32 |127.0.0.1| Direct | Yes  |
   +------------+---------+--------+------+
   | 1.1.1.2/32 |  PE-1   |  IBGP  |  No  |
   +------------+---------+--------+------+
   | 1.1.1.3/32 | 1.1.1.3 | Direct | Yes  |
   +------------+---------+--------+------+
   | 1.1.1.0/25 |   RR    |  IBGP  | Yes  |
   +------------+---------+--------+------+
   |1.1.1.128/25|   RR    |  IBGP  | Yes  |
   +------------+---------+--------+------+
   | 1.1.1.0/24 | 1.1.1.1 | Direct | Yes  |
   +------------+---------+--------+------+

                  Figure 5: FIB Reduction Example

   To reduce the FIB size of PE routers, Virtual Aggregation (VA)
   [VA-AUTO] technology can be used.  Taking the VPN instance A shown
   in Figure 5 as an example, the FIB reduction procedures are as
   follows:

   1) Multiple more specific prefixes (e.g., 1.1.1.0/25 and
      1.1.1.128/25) corresponding to the virtual subnet prefix (i.e.,
      1.1.1.0/24) are configured as Virtual Prefixes (VPs), and a
      Route Reflector (RR) is configured as an Aggregation Point
      Router (APR) for these VPs.  PE routers, as RR clients,
      advertise host routes for their own local CE hosts to the RR,
      which in turn, as an APR, installs those host routes into its
      FIB and attaches a "can-suppress" tag to them before reflecting
      them to its clients.

   2) Host routes carrying the "can-suppress" tag are not installed
      into the FIBs of VA-aware clients, since those clients are not
      APRs for the host routes.  In addition, the RR, as an APR,
      advertises the corresponding VP routes to all of its clients,
      and the VA-aware clients in turn install these VP routes into
      their FIBs.

   3) Upon receiving a packet from a local CE host, if no matching
      host route is found in the FIB, the ingress PE router forwards
      the packet to the RR according to one of the VP routes learnt
      from the RR; the RR in turn forwards the packet to the relevant
      egress PE router according to the host route learnt from that
      egress PE router.  In a word, the FIB size of PE routers can be
      greatly reduced at the cost of path stretch.  Note that in the
      case where the RR is not available for transferring L3VPN
      traffic between PE routers for some reason (e.g., the RR is
      implemented on a server rather than a router), the APR function
      could actually be performed by a given PE router other than the
      RR, as long as that PE router has installed all host routes
      belonging to the virtual subnet into its FIB.
      In that case, the RR only needs to attach a "can-suppress" tag
      to the host routes learnt from its clients before reflecting
      them to the other clients.  Furthermore, PE routers themselves
      could directly attach the "can-suppress" tag to the host routes
      for their local CE hosts before distributing them to remote
      peers.

   4) When a given local CE host sends an ARP request for a remote CE
      host, the PE router receiving that request installs the host
      route for the remote CE host into its FIB, in case there is a
      host route for that CE host in its RIB that has not yet been
      installed into the FIB.  The subsequent packets destined for
      that remote CE host are therefore forwarded directly to the
      egress PE router.  To save FIB space, FIB entries corresponding
      to remote host routes carrying "can-suppress" tags would expire
      if they have not been used to forward packets for a certain
      period of time.

3.6.3. PE Router RIB Reduction

                               +------+
                        +------+  RR  +------+
   +-----------------+  |      +------+      |  +-----------------+
   |VPN_A:1.1.1.1/24 |  |                    |  |VPN_A:1.1.1.1/24 |
   |               \ |  |                    |  | /               |
   |  +------+     \++---+-+            +-+---++/      +------+   |
   |  |Host A+------+ PE-1 |            | PE-2 +------+Host B|    |
   |  +------+\     ++-+-+-+            +-+-+-++     /+------+    |
   |  1.1.1.2/24     | | |                | | |     1.1.1.3/24    |
   |                 | | |                | | |                   |
   |    DC West      | | | IP/MPLS Backbone | | |     DC East     |
   +-----------------+  |                  |  +-----------------+
                        +--------------------+

       VRF_A on PE-1 :                    VRF_A on PE-2 :
   +------------+---------+--------+  +------------+---------+--------+
   | Prefix     | Nexthop |Protocol|  | Prefix     | Nexthop |Protocol|
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.1/32 |127.0.0.1| Direct |  | 1.1.1.1/32 |127.0.0.1| Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.2/32 | 1.1.1.2 | Direct |  | 1.1.1.3/32 | 1.1.1.3 | Direct |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.0/25 |   RR    |  IBGP  |  | 1.1.1.0/25 |   RR    |  IBGP  |
   +------------+---------+--------+  +------------+---------+--------+
   |1.1.1.128/25|   RR    |  IBGP  |  |1.1.1.128/25|   RR    |  IBGP  |
   +------------+---------+--------+  +------------+---------+--------+
   | 1.1.1.0/24 | 1.1.1.1 | Direct |  | 1.1.1.0/24 | 1.1.1.1 | Direct |
   +------------+---------+--------+  +------------+---------+--------+

                  Figure 6: RIB Reduction Example

   To reduce the RIB size of PE routers, the BGP Outbound Route
   Filtering (ORF) mechanism is used to realize on-demand route
   announcement.  Taking the VPN instance A shown in Figure 6 as an
   example, the RIB reduction procedures are as follows:

   1) PE routers, as RR clients, advertise host routes for their local
      CE hosts to an RR, which does not reflect these host routes by
      default unless it receives explicit ORF requests for them from
      its clients.
The RR is configured with routes for more specific subnets 571 (e.g., 1.1.1.0/25 and 1.1.1.128/25) corresponding to the virtual 572 subnet (i.e., 1.1.1.0/24) with next-hop being pointed to Null0 and 573 then advertises these routes to its clients via BGP. 575 2) Upon receiving a packet from a local CE host, if no matching host 576 route found, the ingress PE router will forward the packet to the 577 RR according to one of the subnet routes learnt from the RR, which 578 in turn forwards the packet to the relevant egress PE router 579 according to the host route learnt from that egress PE router. In a 580 word, the RIB table size of PE routers can be greatly reduced at 581 the cost of path stretch. 583 3) Just as the approach mentioned in section 3.6.2, in the case where 584 the RR is not available for transferring L3VPN traffic between PE 585 routers for some reason, a PE router other than the RR could 586 advertise the more specific subnet routes as long as that PE router 587 has installed all host routes belonging to that virtual subnet into 588 its FIB. 590 4) Provided a given local CE host sends an ARP request for a remote 591 CE host, the ingress PE router that receives such request will 592 request the corresponding host route from its RR by using the ORF 593 mechanism (e.g., a group ORF containing Route-Target (RT) and 594 prefix information) in case there is no host route for that CE host 595 in its RIB yet. Once the host route for the remote CE host is 596 learnt from the RR, the subsequent packets destined for that CE 597 host would be forwarded directly to the egress PE router. Note that 598 the RIB entries of remote host routes could expire if they have not 599 been used for forwarding packets for a certain period of time. Once 600 the expiration time for a given RIB entry is approaching, the PE 601 router would notify its RR not to pass the updates for 602 corresponding host route by using the ORF mechanism. 604 3.7. 
ARP/ND Cache Table Scalability on Default Gateways 606 In the case where data center default gateway functions are implemented 607 on the PE routers of the VS as shown in Figure 4, since the ARP/ND cache 608 table on each PE router only needs to contain ARP/ND entries for local 609 CE hosts, the ARP/ND cache table size will not grow as the number of 610 data centers to be connected increases. 612 3.8. ARP/ND and Unknown Unicast Flood Avoidance 614 In VS, the flooding domain associated with a given virtual subnet 615 that has been extended across multiple data centers is 616 partitioned into segments, each of which is confined within a 617 single data center. Therefore, the performance impact on networks and 618 servers caused by the flooding of ARP/ND broadcast/multicast and 619 unknown unicast traffic is alleviated. 621 3.9. Path Optimization 623 Take the scenario shown in Figure 4 as an example. To optimize the 624 forwarding path for traffic between cloud users and cloud data 625 centers, PE routers located at cloud data centers (i.e., PE-1 and PE- 626 2), which are also data center default gateways, propagate host 627 routes for their respective local CE hosts to remote PE routers 628 that are attached to cloud user sites (i.e., PE-3). 630 As such, traffic from cloud user sites to a given server on the 631 virtual subnet that has been extended across data centers would be 632 forwarded directly to the data center location where that server 633 resides, since traffic is now forwarded according to the host route 634 for that server, rather than the subnet route. 636 Furthermore, for traffic coming from cloud data centers and forwarded 637 to cloud user sites, each PE router acting as a default gateway 638 forwards the traffic received from its local CE hosts according to the 639 best-match route in the corresponding VRF. As a result, traffic from 640 data centers to cloud user sites is forwarded along the optimal path 641 as well. 643 4.
Considerations for Non-IP Traffic 645 Although most traffic within and across data centers is IP traffic, 646 there may still be a few legacy clustering applications that rely on 647 non-IP communications (e.g., heartbeat messages between cluster 648 nodes). To support such non-IP traffic (if present) in the 649 Virtual Subnet solution, an approach following the idea of "route 650 all IP traffic, bridge non-IP traffic" could be considered as an 651 enhancement to the original Virtual Subnet solution. 653 Note that more and more cluster vendors are offering clustering 654 applications based on Layer 3 interconnection. 656 5. Security Considerations 658 This document does not introduce any additional security risk to BGP/MPLS 659 L3VPN, nor does it provide any additional security feature for 660 BGP/MPLS L3VPN. 662 6. IANA Considerations 664 There is no requirement for any IANA action. 666 7. Acknowledgements 668 Thanks to Dino Farinacci, Himanshu Shah, Nabil Bitar, Giles Heron, 669 Ronald Bonica, Monique Morrow, Rajiv Asati, and Eric Osborne for their 670 valuable comments and suggestions on this document. 672 8. References 674 8.1. Normative References 676 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 677 Requirement Levels", BCP 14, RFC 2119, March 1997. 679 8.2. Informative References 681 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 682 Networks (VPNs)", RFC 4364, February 2006. 684 [MVPN] Rosen, E. and R. Aggarwal, "Multicast in MPLS/BGP IP VPNs", 685 draft-ietf-l3vpn-2547bis-mcast-10.txt, Work in Progress, 686 January 2010. 688 [VA-AUTO] Francis, P., Xu, X., Ballani, H., Jen, D., Raszuk, R., and 689 L. Zhang, "Auto-Configuration in Virtual Aggregation", 690 draft-ietf-grow-va-auto-05.txt, Work in Progress, December 691 2011. 693 [RFC925] Postel, J., "Multi-LAN Address Resolution", RFC 925, 694 October 1984. 696 [RFC1027] Carl-Mitchell, S. and J.
Quarterman, "Using ARP to 697 Implement Transparent Subnet Gateways", RFC 1027, October 698 1987. 700 [RFC4389] Thaler, D., Talwar, M., and C. Patel, "Neighbor Discovery 701 Proxies (ND Proxy)", RFC 4389, April 2006. 703 [RFC5798] Nadas, S., "Virtual Router Redundancy Protocol (VRRP) 704 Version 3 for IPv4 and IPv6", RFC 5798, March 2010. 706 [RFC4761] Kompella, K. and Y. Rekhter, "Virtual Private LAN Service 707 (VPLS) Using BGP for Auto-Discovery and Signaling", RFC 708 4761, January 2007. 710 [RFC4762] Lasserre, M. and V. Kompella, "Virtual Private LAN Service 711 (VPLS) Using Label Distribution Protocol (LDP) Signaling", 712 RFC 4762, January 2007. 714 [802.1AB] IEEE Standard 802.1AB-2009, "Station and Media Access 715 Control Connectivity Discovery", September 17, 2009. 717 [802.1Qbg] IEEE Draft Standard P802.1Qbg/D2.0, "Virtual Bridged Local 718 Area Networks - Amendment XX: Edge Virtual Bridging", Work 719 in Progress, December 1, 2011. 721 [RFC6820] Narten, T., Karir, M., and I. Foo, "Problem Statement for 722 ARMD", RFC 6820, January 2013. 724 Authors' Addresses 726 Xiaohu Xu 727 Huawei Technologies, 728 Beijing, China. 729 Phone: +86 10 60610041 730 Email: xuxiaohu@huawei.com 732 Susan Hares 733 Email: shares@ndzh.com 734 Yongbing Fan 735 Guangzhou Institute, China Telecom 736 Guangzhou, China. 737 Phone: +86 20 38639121 738 Email: fanyb@gsta.com 740 Christian Jacquenet 741 France Telecom 742 Rennes 743 France 744 Email: christian.jacquenet@orange.com