BESS Workgroup                                           J. Rabadan, Ed.
Internet Draft                                              S. Sathappan
Intended status: Standards Track                           W. Henderickx
                                                          Alcatel-Lucent
R. Shekhar
N. Sheth                                                      A. Sajassi
W. Lin                                                             Cisco
M. Katiyar
Juniper                                                         A. Isaac
                                                               Bloomberg
                                                               M. Tufail
                                                                Citibank

Expires: April 27, 2015                                 October 24, 2014

            Optimized Ingress Replication solution for EVPN
                draft-rabadan-bess-evpn-optimized-ir-00

Abstract

   Network Virtualization Overlay (NVO) networks using EVPN as control
   plane may use ingress replication (IR) or PIM-based trees to convey
   the overlay multicast traffic. PIM provides an efficient solution to
   avoid sending multiple copies of the same packet over the same
   physical link, however it may not always be deployed in the NVO core
   network. IR avoids the dependency on PIM in the NVO network core.
   While IR provides a simple multicast transport, some NVO networks
   with demanding multicast applications require a more efficient
   solution without PIM in the core. This document describes a solution
   to optimize the efficiency of IR in NVO networks.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on April 27, 2015.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Problem Statement
   2. Solution requirements
   3. EVPN BGP Attributes for optimized-IR
   4. Non-selective Assisted-Replication (AR) Solution Description
      4.1. Non-selective AR-REPLICATOR procedures
      4.2. Non-selective AR-LEAF procedures
      4.3. RNVE procedures
      4.4. Forwarding behavior in non-selective AR EVIs
         4.4.1. Broadcast and Multicast forwarding behavior
            4.4.1.1. Non-selective AR-REPLICATOR BM forwarding
            4.4.1.2. Non-selective AR-LEAF BM forwarding
            4.4.1.3. RNVE BM forwarding
         4.4.2. Unknown unicast forwarding behavior
            4.4.2.1. Non-selective AR-REPLICATOR/LEAF Unknown unicast
                     forwarding
            4.4.2.2. RNVE Unknown unicast forwarding
   5. Selective Assisted-Replication (AR) Solution Description
      5.1. Selective AR-REPLICATOR procedures
      5.2. Selective AR-LEAF procedures
      5.3. Forwarding behavior in selective AR EVIs
         5.3.1. Selective AR-REPLICATOR BM forwarding
         5.3.2. Selective AR-LEAF BM forwarding
   6. Pruned-Flood-Lists (PFL)
      6.1. A PFL example
   7. Benefits of the optimized-IR solution
   8. Conventions used in this document
   9. Security Considerations
   10. IANA Considerations
   11. Terminology
   13. References
   14. Acknowledgments
   15. Authors' Addresses

1. Problem Statement

   EVPN may be used as the control plane for a Network Virtualization
   Overlay (NVO) network.
   Network Virtualization Edge (NVE) devices and
   PEs that are part of the same EVI use Ingress Replication (IR) or
   PIM-based trees to transport the tenant's multicast traffic. In NVO
   networks where PIM-based trees cannot be used, IR is the only
   alternative. Examples of these situations are NVO networks where the
   core nodes don't support PIM or the network operator does not want to
   run PIM in the core.

   In some use-cases, the amount of replication for BUM (Broadcast,
   Unknown unicast and Multicast traffic) is kept under control on the
   NVEs due to the following fairly common assumptions:

   a) Broadcast is greatly reduced due to the proxy-ARP and proxy-ND
      capabilities supported by EVPN on the NVEs. Some NVEs can even
      provide DHCP-server functions for the attached Tenant Systems (TS)
      reducing the broadcast even further.

   b) Unknown unicast traffic is greatly reduced in virtualized NVO
      networks where all the MAC and IP addresses are learnt in the
      control plane.

   c) Multicast applications are not used.

   If the above assumptions are true for a given NVO network, then IR
   provides a simple solution for multi-destination traffic. However,
   the statement c) above is not always true and multicast applications
   are required in many use-cases.

   When the multicast sources are attached to NVEs residing in
   hypervisors or low-performance-replication TORs, the ingress
   replication of a large amount of multicast traffic to a significant
   number of remote NVEs/PEs can seriously degrade the performance of
   the NVE and impact the application.

   This document describes a solution that makes use of two IR
   optimizations:

   i) Assisted-Replication (AR)
   ii) Pruned-Flood-Lists (PFL)

   Both optimizations may be used together or independently so that the
   performance and efficiency of the network to transport multicast can
   be improved.
   Both solutions require some extensions to [EVPN] that
   are described in section 3.

   Section 2 lists the requirements of the combined optimized-IR
   solution, sections 4 and 5 describe the non-selective and selective
   Assisted-Replication (AR) solutions, and section 6 describes the
   Pruned-Flood-Lists (PFL) solution.

2. Solution requirements

   The IR optimization solution (optimized-IR hereafter) MUST meet the
   following requirements:

   a) The solution MUST provide an IR optimization for BM (Broadcast and
      Multicast) traffic, while preserving the packet order for unicast
      applications, i.e. known and unknown unicast traffic SHALL follow
      the same path.

   b) The solution MUST be compatible with [EVPN] and [EVPN-OVERLAY] and
      not have any impact on the EVPN procedures for BM traffic. In
      particular, the solution MUST support the following EVPN
      functions:

      o All-active multi-homing, including the split-horizon and
        Designated Forwarder (DF) functions.

      o Single-active multi-homing, including the DF function.

      o Handling of multi-destination traffic and processing of
        broadcast and multicast as per [EVPN].

   c) The solution MUST be backwards compatible with existing NVEs using
      a non-optimized version of IR. A given EVI can have NVEs/PEs
      supporting regular-IR and optimized-IR.

   d) The solution MUST be independent of the NVO specific data plane
      encapsulation and the virtual identifiers being used, e.g.: VXLAN
      VNIs, NVGRE VSIDs or MPLS labels.

3. EVPN BGP Attributes for optimized-IR

   This solution proposes some changes to the [EVPN] Inclusive Multicast
   Ethernet Tag routes and attributes so that an NVE/PE can signal its
   optimized-IR capabilities.

   The Inclusive Multicast Ethernet Tag route (RT-3) and the general
   format of its PMSI Tunnel Attribute (PTA), as used in [EVPN], are
   shown below:

      +---------------------------------+
      |  RD (8 octets)                  |
      +---------------------------------+
      |  Ethernet Tag ID (4 octets)     |
      +---------------------------------+
      |  IP Address Length (1 octet)    |
      +---------------------------------+
      |  Originating Router's IP Addr   |
      |          (4 or 16 octets)       |
      +---------------------------------+

      +---------------------------------+
      |  Flags (1 octet)                |
      +---------------------------------+
      |  Tunnel Type (1 octet)          |
      +---------------------------------+
      |  MPLS Label (3 octets)          |
      +---------------------------------+
      |  Tunnel Identifier (variable)   |
      +---------------------------------+

   The Flags field is defined as follows:

       0 1 2 3 4 5  6 7
      +-+-+-+-+-+--+-+-+
      |rsved| T |BM|U|L|
      +-+-+-+-+-+--+-+-+

   Where a new type field (for AR) and two new flags (for PFL signaling)
   are defined:

   - T is the AR Type field (2 bits) that defines the AR role of the
     advertising router:

     + 00 (decimal 0) = RNVE (non-AR support)
     + 01 (decimal 1) = AR-REPLICATOR
     + 10 (decimal 2) = AR-LEAF

   - The PFL (Pruned-Flood-Lists) flags define the desired behavior of
     the advertising router for the different types of traffic:

     + BM= Broadcast and Multicast (BM) flag. BM=1 means "prune-me" from
       the BM flooding list. BM=0 means regular behavior.

     + U= Unknown flag. U=1 means "prune-me" from the Unknown flooding
       list. U=0 means regular behavior.

   - Flag L is an existing flag defined in [RFC6514] (L=Leaf Information
     Required) and it will be used only in the Selective AR Solution.

   The definition of the "Flags" octet is specific to PTAs with Tunnel
   Types IR (0x06) and AR (TBD). The use of the described flags for
   other Tunnel Types is out of the scope of this document.
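
   The layout of the Flags octet can be illustrated with a minimal
   Python sketch. It assumes the usual IETF convention that bit 0 in
   the diagram is the most significant bit, so L is the least
   significant bit. The names ARType and PtaFlags are hypothetical and
   are not defined by this document or by [EVPN].

```python
from dataclasses import dataclass
from enum import IntEnum


class ARType(IntEnum):
    """AR role carried in the 2-bit T field (hypothetical name)."""
    RNVE = 0b00
    AR_REPLICATOR = 0b01
    AR_LEAF = 0b10


@dataclass(frozen=True)
class PtaFlags:
    """Illustrative model of the PTA Flags octet: |rsved| T |BM|U|L|."""
    ar_type: ARType = ARType.RNVE
    prune_bm: bool = False            # BM=1: "prune-me" from the BM flooding list
    prune_u: bool = False             # U=1: "prune-me" from the Unknown flooding list
    leaf_info_required: bool = False  # L flag from [RFC6514]

    def encode(self) -> int:
        # Assumption: bit 0 of the diagram is the MSB, so T sits in bits
        # 0x18, BM in 0x04, U in 0x02 and L in 0x01. Reserved bits are 0.
        return (
            (int(self.ar_type) << 3)
            | (int(self.prune_bm) << 2)
            | (int(self.prune_u) << 1)
            | int(self.leaf_info_required)
        )

    @classmethod
    def decode(cls, octet: int) -> "PtaFlags":
        if not 0 <= octet <= 0xFF:
            raise ValueError("flags must fit in one octet")
        # The three reserved bits are ignored. An undefined T value (0b11)
        # raises ValueError from the ARType constructor.
        return cls(
            ar_type=ARType((octet >> 3) & 0b11),
            prune_bm=bool(octet & 0x04),
            prune_u=bool(octet & 0x02),
            leaf_info_required=bool(octet & 0x01),
        )
```

   Under these assumptions, a selective AR-REPLICATOR (T=01, L=1)
   encodes as 0x09, and an AR-LEAF asking to be pruned from the BM
   flooding list (T=10, BM=1) encodes as 0x14.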

   In this document, the above RT-3 and PTA can be used in three
   different modes for the same EVI/Ethernet Tag:

   o Regular-IR route: in this route, Originating Router's IP Address,
     Tunnel Type (0x06), MPLS Label, Tunnel Identifier and Flags MUST be
     used as described in [EVPN]. The Originating Router's IP Address
     and Tunnel Identifier are set to an IP address that we denominate
     IR-IP in this document.

   o Replicator-AR route: this route is used by the AR-REPLICATOR to
     advertise its AR capabilities, with the fields set as follows:

     + Originating Router's IP Address as well as the Tunnel Identifier
       are set to the same routable IP address that we denominate AR-IP
       and MUST be different from the IR-IP for a given PE/NVE.

     + Tunnel Type = Assisted-Replication (AR). Section 10 provides the
       allocated type value.

     + T (AR role type) = 01 (AR-REPLICATOR).

     + L (Leaf Information Required) = 0 (for non-selective AR) or 1
       (for selective AR).

   o Leaf-AR route: this route MAY be used by the AR-LEAF to advertise
     its desire to receive the multicast traffic from a specific
     AR-REPLICATOR. It is only used for selective AR and its fields are
     set as follows:

     + Originating Router's IP Address is set to the advertising IR-IP
       (same IP used by the AR-LEAF in regular-IR routes).

     + Tunnel Identifier is set to the AR-IP of the AR-REPLICATOR from
       which the multicast traffic is requested.

     + Tunnel Type = Assisted-Replication (AR). Section 10 provides the
       allocated type value.

     + T (AR role type) = 10 (AR-LEAF).

   Each AR-enabled node MUST understand and process the AR type field in
   the PTA (Flags field) of replicator-AR and leaf-AR routes, and MUST
   signal the corresponding type (1 or 2) according to its
   administrative choice for replicator-AR and leaf-AR routes.

   Each node, part of the EVI, MAY understand and process the BM/U
   flags.
   Note that these BM/U flags may be used to optimize the
   delivery of multi-destination traffic and their use SHOULD be an
   administrative choice, and independent of the AR role.

   Non-optimized-IR nodes will be unaware of the new PMSI attribute flag
   definition as well as the new Tunnel Type (AR), i.e. they will ignore
   the information contained in the flags field for any RT-3 and will
   ignore the RT-3 routes with an unknown Tunnel Type (type AR in this
   case).

4. Non-selective Assisted-Replication (AR) Solution Description

   The following figure illustrates an example NVO network where the
   non-selective AR function is enabled. Three different roles are
   defined for a given EVI: AR-REPLICATOR, AR-LEAF and RNVE (Regular
   NVE). The solution is called "non-selective" because the chosen
   AR-REPLICATOR for a given flow MUST replicate the multicast traffic
   to 'all' the NVE/PEs in the EVI except for the source NVE/PE.

                                (       )
                              (_   WAN   _)
                          +---(_         _)----+
                          |     (_     _)      |
                    PE1   |              PE2   |
                   +------+----+          +----+------+
              TS1--+  (EVI-1)  |          |  (EVI-1)  +--TS2
                   |REPLICATOR |          |REPLICATOR |
                   +--------+--+          +--+--------+
                            |                |
                         +--+----------------+--+
                         |                      |
                         |                      |
                    +----+ VXLAN/nvGRE/MPLSoGRE +----+
                    |    |      IP Fabric       |    |
                    |    |                      |    |
              NVE1  |    +-----------+----------+    |   NVE3
          Hypervisor|          TOR   |  NVE2         |Hypervisor
          +---------+-+        +-----+-----+       +-+---------+
          |  (EVI-1)  |        |  (EVI-1)  |       |  (EVI-1)  |
          |   LEAF    |        |   RNVE    |       |   LEAF    |
          +--+-----+--+        +--+-----+--+       +--+-----+--+
             |     |              |     |             |     |
            VM11  VM12           TS3   TS4           VM31  VM32

                     Figure 1 Optimized-IR scenario

4.1. Non-selective AR-REPLICATOR procedures

   An AR-REPLICATOR is defined as an NVE/PE capable of replicating
   ingress BM (Broadcast and Multicast) traffic received on an overlay
   tunnel to other overlay tunnels and local Attachment Circuits (ACs).
   The AR-REPLICATOR signals its role in the control plane and
   understands where the other roles (AR-LEAF nodes, RNVEs and other
   AR-REPLICATORs) are located. A given AR-enabled EVI service may have
   zero, one or more AR-REPLICATORs. In our example in figure 1, PE1 and
   PE2 are defined as AR-REPLICATORs. The following considerations apply
   to the AR-REPLICATOR role:

   a) The AR-REPLICATOR role SHOULD be an administrative choice in any
      NVE/PE that is part of an AR-enabled EVI. This administrative
      option to enable AR-REPLICATOR capabilities MAY be implemented as
      a system level option as opposed to as a per-EVI option.

   b) An AR-REPLICATOR MUST advertise a Replicator-AR route and MAY
      advertise a Regular-IR route. The AR-REPLICATOR MUST NOT generate
      a Regular-IR route if it does not have local attachment circuits
      (AC).

   c) The Replicator-AR and Regular-IR routes will be generated
      according to section 3. The AR-IP and IR-IP used by the
      AR-REPLICATOR will be different routable IP addresses.

   d) When a node defined as AR-REPLICATOR receives a packet on an
      overlay tunnel, it will do a tunnel destination IP lookup and
      apply the following procedures:

      o If the destination IP is the AR-REPLICATOR IR-IP Address the
        node will process the packet normally as in [EVPN].

      o If the destination IP is the AR-REPLICATOR AR-IP Address the
        node MUST replicate the packet to local ACs and overlay
        tunnels (excluding the overlay tunnel to the source of the
        packet). When replicating to remote AR-REPLICATORs the tunnel
        destination IP will be an IR-IP. That will be an indication
        for the remote AR-REPLICATOR that it MUST NOT replicate to
        overlay tunnels. The tunnel source IP will be the AR-IP of the
        AR-REPLICATOR.

4.2. Non-selective AR-LEAF procedures

   AR-LEAF is defined as an NVE/PE that - given its poor replication
   performance - sends all the BM traffic to an AR-REPLICATOR that can
   replicate the traffic further on its behalf. It MAY signal its
   AR-LEAF capability in the control plane and understands where the
   other roles are located (AR-REPLICATOR and RNVEs). A given service
   can have zero, one or more AR-LEAF nodes. Figure 1 shows NVE1 and
   NVE3 (both residing in hypervisors) acting as AR-LEAF. The following
   considerations apply to the AR-LEAF role:

   a) The AR-LEAF role SHOULD be an administrative choice in any NVE/PE
      that is part of an AR-enabled EVI. This administrative option to
      enable AR-LEAF capabilities MAY be implemented as a system level
      option as opposed to as a per-EVI option.

   b) In this non-selective AR solution, the AR-LEAF MUST advertise a
      single Regular-IR inclusive multicast route as in [EVPN].

   c) In a service where there are no AR-REPLICATORs, the AR-LEAF MUST
      use regular ingress replication. This will happen when a new
      update from the last former AR-REPLICATOR is received and contains
      a non-REPLICATOR AR type, or when the AR-LEAF detects that the
      last AR-REPLICATOR is down (next-hop tracking in the IGP or any
      other detection mechanism). Ingress replication MUST use the
      forwarding information given by the remote Regular-IR Inclusive
      Multicast Routes as described in [EVPN].

   d) In a service where there is one or more AR-REPLICATORs (based on
      the received Replicator-AR routes for the EVI), the AR-LEAF can
      locally select which AR-REPLICATOR it sends the BM traffic to:

      o A single AR-REPLICATOR MAY be selected for all the BM packets
        received on the AR-LEAF attachment circuits (ACs) for a given
        EVI. This selection is a local decision and it does not have
        to match other AR-LEAF's selection within the same EVI.

      o An AR-LEAF MAY select more than one AR-REPLICATOR and do
        either per-flow or per-EVI load balancing.

      o In case of a failure on the selected AR-REPLICATOR, another
        AR-REPLICATOR will be selected.

      o When an AR-REPLICATOR is selected, the AR-LEAF MUST send all
        the BM packets to that AR-REPLICATOR using the forwarding
        information given by the Replicator-AR route for the chosen
        AR-REPLICATOR, with tunnel type = TBD (AR tunnel). The
        underlay destination IP address MUST be the AR-IP advertised
        by the AR-REPLICATOR in the Replicator-AR route.

      o AR-LEAF nodes SHALL send service-level BM control plane
        packets following regular IR procedures. An example would be
        IGMP, MLD or PIM multicast packets. The AR-REPLICATORs MUST
        NOT replicate these control plane packets to other overlay
        tunnels since they will use the regular IR-IP Address.

4.3. RNVE procedures

   RNVE (Regular Network Virtualization Edge node) is defined as an
   NVE/PE without AR-REPLICATOR or AR-LEAF capabilities that does IR as
   described in [EVPN]. The RNVE does not signal any AR role and is
   unaware of the AR-REPLICATOR/LEAF roles in the EVI. The RNVE will
   ignore the Flags in the Regular-IR routes and will ignore the
   Replicator-AR and Leaf-AR routes entirely (due to an unknown tunnel
   type in the PTA).

   This role provides EVPN with the backwards compatibility required in
   optimized-IR EVIs. Figure 1 shows NVE2 as RNVE.

4.4. Forwarding behavior in non-selective AR EVIs

   In AR EVIs, BM (Broadcast and Multicast) traffic between two NVEs may
   follow a different path than unicast traffic. This solution proposes
   the replication of BM through the AR-REPLICATOR node, whereas
   unknown/known unicast will be delivered directly from the source node
   to the destination node without being replicated by any intermediate
   node.
   Unknown unicast SHALL follow the same path as known unicast
   traffic in order to avoid packet reordering for unicast applications
   and simplify the control and data plane procedures. Section 4.4.1
   describes the expected forwarding behavior for BM traffic in nodes
   acting as AR-REPLICATOR, AR-LEAF and RNVE. Section 4.4.2 describes
   the forwarding behavior for unknown unicast traffic.

   Note that known unicast forwarding is not impacted by this solution.

4.4.1. Broadcast and Multicast forwarding behavior

   The expected behavior per role is described in this section.

4.4.1.1. Non-selective AR-REPLICATOR BM forwarding

   The AR-REPLICATORs will build a flooding list composed of ACs and
   overlay tunnels to remote nodes in the EVI. Some of those overlay
   tunnels MAY be flagged as non-BM receivers based on the BM flag
   received from the remote nodes in the EVI.

   o When an AR-REPLICATOR receives a BM packet on an AC, it will
     forward the BM packet to its flooding list (including local ACs and
     remote NVE/PEs), skipping the non-BM overlay tunnels.

   o When an AR-REPLICATOR receives a BM packet on an overlay tunnel, it
     will check the destination IP of the underlay IP header and:

     - If the destination IP matches its AR-IP, the AR-REPLICATOR will
       forward the BM packet to its flooding list (ACs and overlay
       tunnels) excluding the non-BM overlay tunnels. The AR-REPLICATOR
       will do source squelching to ensure the traffic is not sent back
       to the originating AR-LEAF. If the overlay encapsulation is MPLS
       and the EVI label is not the bottom of the stack, the
       AR-REPLICATOR MUST copy the rest of the labels and forward them
       to the egress overlay tunnels.

     - If the destination IP matches its IR-IP, the AR-REPLICATOR will
       skip all the overlay tunnels from the flooding list, i.e. it
       will only replicate to local ACs. This is the regular IR
       behavior described in [EVPN].

4.4.1.2. Non-selective AR-LEAF BM forwarding

   The AR-LEAF nodes will build two flood-lists:

   1) Flood-list #1 - composed of ACs and an AR-REPLICATOR-set of
      overlay tunnels. The AR-REPLICATOR-set is defined as one or more
      overlay tunnels to the AR-IP Addresses of the remote
      AR-REPLICATOR(s) in the EVI. The selection of more than one
      AR-REPLICATOR is described in section 4.2 and it is a local
      AR-LEAF decision.

   2) Flood-list #2 - composed of ACs and overlay tunnels to the
      remote IR-IP Addresses.

   When an AR-LEAF receives a BM packet on an AC, it will check the
   AR-REPLICATOR-set:

   o If the AR-REPLICATOR-set is empty, the AR-LEAF will send the packet
     to flood-list #2.

   o If the AR-REPLICATOR-set is NOT empty, the AR-LEAF will send the
     packet to flood-list #1, where only one of the overlay tunnels of
     the AR-REPLICATOR-set is used.

   When an AR-LEAF receives a BM packet on an overlay tunnel, it will
   forward the BM packet to its local ACs and never to an overlay
   tunnel. This is the regular IR behavior described in [EVPN].

4.4.1.3. RNVE BM forwarding

   The RNVE is completely unaware of the AR-REPLICATORs, AR-LEAF nodes
   and BM/U flags (that information is ignored). Its forwarding behavior
   is the regular IR behavior described in [EVPN]. Any regular non-AR
   node is fully compatible with the RNVE role described in this
   document.

4.4.2. Unknown unicast forwarding behavior

   The expected behavior is described in this section.

4.4.2.1. Non-selective AR-REPLICATOR/LEAF Unknown unicast forwarding

   While the forwarding behavior in AR-REPLICATORs and AR-LEAF nodes is
   different for BM traffic, as far as Unknown unicast traffic
   forwarding is concerned, AR-LEAF nodes behave exactly in the same way
   as AR-REPLICATORs do.

   The AR-REPLICATOR/LEAF nodes will build a flood-list composed of ACs
   and overlay tunnels to the IR-IP Addresses of the remote nodes in the
   EVI.
   Some of those overlay tunnels MAY be flagged as non-U (Unknown
   unicast) receivers based on the U flag received from the remote nodes
   in the EVI.

   o When an AR-REPLICATOR/LEAF receives an unknown packet on an AC, it
     will forward the unknown packet to its flood-list, skipping the
     non-U overlay tunnels.

   o When an AR-REPLICATOR/LEAF receives an unknown packet on an overlay
     tunnel, it will forward the unknown packet to its local ACs and
     never to an overlay tunnel. This is the regular IR behavior
     described in [EVPN].

4.4.2.2. RNVE Unknown unicast forwarding

   As described for BM traffic, the RNVE is completely unaware of the
   REPLICATORs, LEAF nodes and BM/U flags (that information is ignored).
   Its forwarding behavior is the regular IR behavior described in
   [EVPN], also for Unknown unicast traffic. Any regular non-AR node is
   fully compatible with the RNVE role described in this document.

5. Selective Assisted-Replication (AR) Solution Description

   Figure 1 is also used to describe the selective AR solution, however
   in this section we consider NVE2 as one more AR-LEAF for EVI-1. The
   solution is called "selective" because a given AR-REPLICATOR MUST
   replicate the BM traffic to only the AR-LEAF that requested the
   replication (as opposed to all the AR-LEAF nodes) and MAY replicate
   the BM traffic to the RNVEs. The same AR roles defined in section 4
   are used here, however the procedures are slightly different.

   The following sub-sections describe the differences in the procedures
   of AR-REPLICATOR/LEAFs compared to the non-selective AR solution.
   There is no change on the RNVEs.

5.1. Selective AR-REPLICATOR procedures

   In our example in figure 1, PE1 and PE2 are defined as Selective
   AR-REPLICATORs.
   The following considerations apply to the Selective
   AR-REPLICATOR role:

   a) The Selective AR-REPLICATOR capability SHOULD be an administrative
      choice in any NVE/PE that is part of an AR-enabled EVI, as the AR
      role itself. This administrative option MAY be implemented as a
      system level option as opposed to as a per-EVI option.

   b) Each AR-REPLICATOR will build a list of AR-REPLICATOR, AR-LEAF and
      RNVE nodes (AR-LEAF nodes that sent only a regular-IR route are
      accounted as RNVEs by the AR-REPLICATOR). In spite of the
      'Selective' administrative option, an AR-REPLICATOR MUST NOT
      behave as a Selective AR-REPLICATOR if at least one of the
      AR-REPLICATORs has the L flag NOT set. If at least one
      AR-REPLICATOR sends a Replicator-AR route with L=0 (in the EVI
      context), the rest of the AR-REPLICATORs will fall back to
      non-selective AR mode.

   c) The Selective AR-REPLICATOR MUST follow the procedures described
      in section 4.1, except for the following differences:

      o The Replicator-AR route MUST include L=1 (Leaf Information
        Required). This flag is used by the AR-REPLICATORs to
        advertise their 'selective' AR-REPLICATOR capabilities.

      o The AR-REPLICATOR will build a 'selective' AR-LEAF-set with
        the list of nodes that requested replication to its own AR-IP.
        For instance, assuming NVE1 and NVE2 advertise a Leaf-AR route
        with PE1's AR-IP (as Tunnel Identifier) and NVE3 advertises a
        Leaf-AR route with PE2's AR-IP, PE1 MUST only add NVE1/NVE2 in
        its selective AR-LEAF-set for EVI-1, and exclude NVE3.

      o When a node defined and operating as Selective AR-REPLICATOR
        receives a packet on an overlay tunnel, it will do a tunnel
        destination IP lookup and if the destination IP is the
        AR-REPLICATOR AR-IP Address, the node MUST replicate the packet
        to:

        + local ACs
        + overlay tunnels in the Selective AR-LEAF-set (excluding the
          overlay tunnel to the source AR-LEAF).
        + overlay tunnels to the RNVEs if the tunnel source IP is the
          IR-IP of an AR-LEAF (in any other case, the AR-REPLICATOR
          MUST NOT replicate the BM traffic to remote RNVEs). In other
          words, the first-hop selective AR-REPLICATOR will replicate
          to all the RNVEs.
        + overlay tunnels to the remote Selective AR-REPLICATORs if
          the tunnel source IP is the IR-IP of an AR-LEAF (in any
          other case, the AR-REPLICATOR MUST NOT replicate the BM
          traffic to remote AR-REPLICATORs), where the tunnel
          destination IP is the AR-IP of the remote Selective
          AR-REPLICATOR. The AR-IP used as tunnel destination IP will
          be an indication for the remote Selective AR-REPLICATOR that
          the packet needs further replication to its AR-LEAFs.

5.2. Selective AR-LEAF procedures

   A Selective AR-LEAF chooses a single Selective AR-REPLICATOR per EVI
   and:

   o Sends all the EVI BM traffic to that AR-REPLICATOR and
   o Expects to receive the BM traffic for a given EVI from the same
     AR-REPLICATOR.

   In the example of Figure 1, we consider NVE1/NVE2/NVE3 as
   Selective AR-LEAFs. NVE1 selects PE1 as its Selective AR-REPLICATOR.
   If that is so, NVE1 will send all its BM traffic for EVI-1 to PE1. If
   other AR-LEAF/REPLICATORs send BM traffic, NVE1 will receive that
   traffic from PE1. These are the differences in the behavior of a
   Selective AR-LEAF compared to a non-selective AR-LEAF:

   a) The AR-LEAF role selective capability SHOULD be an administrative
      choice in any NVE/PE that is part of an AR-enabled EVI.
      This administrative option to enable AR-LEAF capabilities MAY be
      implemented as a system level option as opposed to as a per-EVI
      option.

   b) The AR-LEAF MAY advertise a Regular-IR route if there are RNVEs or
      non-selective AR-LEAFs in the EVI. The Selective AR-LEAF MUST
      advertise a Leaf-AR route after receiving a Replicator-AR route
      with L=1. It is recommended that the Selective AR-LEAF wait for a
      timer t before sending the Leaf-AR route, so that the AR-LEAF
      receives all the Replicator-AR routes for the EVI.

   c) In a service where there is more than one Selective AR-REPLICATOR,
      the Selective AR-LEAF MUST locally select a single Selective
      AR-REPLICATOR for the EVI. Once selected:

      o The Selective AR-LEAF will send a Leaf-AR route including the
        AR-IP of the selected AR-REPLICATOR.

      o The Selective AR-LEAF will send all the BM packets received on
        the attachment circuits (ACs) for a given EVI to that
        AR-REPLICATOR.

      o In case of a failure on the selected AR-REPLICATOR, another
        AR-REPLICATOR will be selected and a new Leaf-AR update will
        be issued, including the new AR-IP. This new route will update
        the selective list in the new Selective AR-REPLICATOR. In case
        of failure on the active Selective AR-REPLICATOR, it is
        recommended for the Selective AR-LEAF to revert to IR behavior
        for a timer t to speed up the convergence. When the timer
        expires, the Selective AR-LEAF will resume its AR mode with
        the new Selective AR-REPLICATOR.

5.3. Forwarding behavior in selective AR EVIs

   This section describes the differences of the selective AR forwarding
   mode compared to the non-selective mode. Compared to section 4.4,
   there are no changes for the forwarding behavior in RNVEs or for
   unknown unicast traffic.

5.3.1. Selective AR-REPLICATOR BM forwarding

   The Selective AR-REPLICATORs will build two flood-lists:

   1) Flood-list #1 - composed of ACs and overlay tunnels to the
      remote nodes in the EVI, always using the IR-IPs in the tunnel
      destination IP addresses. Some of those overlay tunnels MAY be
      flagged as non-BM receivers based on the BM flag received from
      the remote nodes in the EVI.

   2) Flood-list #2 - composed of ACs, a Selective AR-LEAF-set and a
      Selective AR-REPLICATOR-set, where:

      o The Selective AR-LEAF-set is composed of the overlay tunnels
        to the AR-LEAFs that advertise a Leaf-AR route with the AR-IP
        of the local AR-REPLICATOR. This set is updated with every
        Leaf-AR route received with a change in the AR-IP included in
        the PTA's Tunnel Identifier.

      o The Selective AR-REPLICATOR-set is composed of the overlay
        tunnels to all the AR-REPLICATORs that send a Replicator-AR
        route with L=1. The AR-IP addresses are used as tunnel
        destination IP.

   When a Selective AR-REPLICATOR receives a BM packet on an AC, it will
   forward the BM packet to its flood-list #1, skipping the non-BM
   overlay tunnels.

   When a Selective AR-REPLICATOR receives a BM packet on an overlay
   tunnel, it will check the destination and source IPs of the underlay
   IP header and:

     - If the destination IP matches its AR-IP and the source IP
       matches an IP of the Selective AR-LEAF-set, the AR-REPLICATOR
       will forward the BM packet to its flood-list #2, as long as the
       list of AR-REPLICATORs for the EVI matches the Selective
       AR-REPLICATOR-set. If the Selective AR-REPLICATOR-set does not
       match the list of AR-REPLICATORs, the node reverts to
       non-selective mode and flood-list #1 is used.

   - If the destination IP matches its AR-IP but the source IP does
     not match any IP of the Selective AR-LEAF-set, the AR-REPLICATOR
     will forward the BM packet to flood-list #2, skipping the
     Selective AR-REPLICATOR-set.

   - If the destination IP matches its IR-IP, the AR-REPLICATOR will
     use flood-list #1 but MUST skip all the overlay tunnels in the
     flood-list, i.e. it will only replicate to local ACs. This is
     the regular-IR behavior described in [EVPN].

     In any case, non-BM overlay tunnels are excluded from the
     flood-lists, and source squelching is always performed to
     ensure that traffic is not sent back to the originating source.
     If the overlay encapsulation is MPLS and the EVI label is not
     the bottom of the stack, the AR-REPLICATOR MUST copy the rest of
     the labels when forwarding the packet to the egress overlay
     tunnels.

5.3.2. Selective AR-LEAF BM forwarding

   The Selective AR-LEAF nodes will build two flood-lists:

   1) Flood-list #1 - composed of ACs and the overlay tunnel to the
      selected AR-REPLICATOR (using the AR-IP as the tunnel
      destination IP).

   2) Flood-list #2 - composed of ACs and overlay tunnels to the
      remote IR-IP Addresses.

   When an AR-LEAF receives a BM packet on an AC, it will check if there
   is any selected AR-REPLICATOR. If there is, flood-list #1 will be
   used. Otherwise, flood-list #2 will be used.

   When an AR-LEAF receives a BM packet on an overlay tunnel, it will
   forward the BM packet to its local ACs and never to an overlay
   tunnel. This is the regular IR behavior described in [EVPN].

6. Pruned-Flood-Lists (PFL)

   In addition to AR, the second optimization supported by this solution
   is the ability for all the EVI nodes to signal Pruned-Flood-Lists
   (PFL). As described in section 3, an EVPN node can signal a given
   value for the BM and U PFL flags in the IR Inclusive Multicast
   Routes, where:

   + BM= Broadcast and Multicast (BM) flag.
     BM=1 means "prune-me" from the BM flood-list. BM=0 means regular
     behavior.

   + U= Unknown unicast (U) flag. U=1 means "prune-me" from the Unknown
     unicast flood-list. U=0 means regular behavior.

   The ability to signal these PFL flags is an administrative choice.
   Upon receiving a non-zero PFL flag, a node MAY decide to honor the
   PFL flag and remove the sender from the corresponding flood-list. A
   given EVI node receiving BUM traffic on an overlay tunnel MUST
   replicate the traffic normally, regardless of the signaled PFL
   flags.

   This optimization MAY be used along with the AR solution.

6.1. A PFL example

   In order to illustrate the use of the solution described in this
   document, we will assume that EVI-1 in Figure 1 is optimized-IR
   enabled and:

   o PE1 and PE2 are administratively configured as AR-REPLICATORs, due
     to their high-performance replication capabilities. PE1 and PE2
     will send a Replicator-AR route with BM/U flags = 00.

   o NVE1 and NVE3 are administratively configured as AR-LEAF nodes, due
     to their low-performance software-based replication capabilities.
     They will advertise a Leaf-AR route. Assuming both NVEs advertise
     all the attached VMs in EVPN as soon as they come up and don't have
     any VMs interested in multicast applications, they will be
     configured to signal BM/U flags = 11 for EVI-1.

   o NVE2 is optimized-IR unaware; therefore it takes on the RNVE role
     in EVI-1.

   Based on the above assumptions the following forwarding behavior will
   take place:

   (1) Any BM packets sent from VM11 will be sent to VM12 and PE1. PE1
       will further forward the BM packets to TS1, the WAN link, PE2 and
       NVE2, but not to NVE3. PE2 and NVE2 will replicate the BM packets
       to their local ACs, and NVE3 is spared from unnecessarily
       replicating those BM packets to VM31 and VM32.

   (2) Any BM packets received on PE2 from the WAN will be sent to PE1
       and NVE2, but not to NVE1 and NVE3, sparing the two hypervisors
       from replicating unnecessarily to their local VMs. PE1 and NVE2
       will replicate to their local ACs only.

   (3) Any Unknown unicast packet sent from VM31 will be forwarded by
       NVE3 to NVE2, PE1 and PE2 but not to NVE1. The solution avoids
       the unnecessary replication to NVE1, since the destination of
       the unknown traffic cannot be at NVE1.

   (4) Any Unknown unicast packet sent from TS1 will be forwarded by PE1
       to the WAN link, PE2 and NVE2 but not to NVE1 and NVE3, since the
       target of the unknown traffic cannot be at those NVEs.

7. Benefits of the optimized-IR solution

   A solution for the optimization of Ingress Replication in EVPN is
   described in this document (optimized-IR). The solution brings the
   following benefits:

   o Optimizes the multicast forwarding in low-performance NVEs, by
     relaying the replication to high-performance NVEs (AR-REPLICATORs)
     while preserving the packet ordering for unicast applications.

   o Reduces the flooded traffic in NVO networks where some NVEs do not
     need broadcast/multicast and/or unknown unicast traffic.

   o It is fully compatible with existing EVPN implementations and EVPN
     functions for NVO overlay tunnels. Optimized-IR NVEs and regular
     NVEs can even be part of the same EVI.

   o It does not require any PIM-based tree in the NVO core of the
     network.

8. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC-2119 [RFC2119].

   In this document, these words will appear with that interpretation
   only when in ALL CAPS. Lower case uses of these words are not to be
   interpreted as carrying RFC-2119 significance.

   In this document, the characters ">>" preceding one or more indented
   lines indicate a compliance requirement statement using the key words
   listed above. This convention aids reviewers in quickly identifying
   or finding the explicit compliance requirements of this RFC.

9. Security Considerations

   This section will be added in future versions.

10. IANA Considerations

   A new Tunnel-Type (AR) must be requested and allocated by IANA for
   the PTA (PMSI Tunnel Attribute) used in this document.

11. Terminology

   Regular-IR: Refers to Regular Ingress Replication, where the source
      NVE/PE sends a copy to each remote NVE/PE that is part of the EVI.

   AR-IP: IP address owned by the AR-REPLICATOR and used to
      differentiate the ingress traffic that must follow the AR
      procedures.

   AR forwarding mode: for an AR-LEAF, it means sending an AC BM packet
      to a single AR-REPLICATOR with tunnel destination IP AR-IP.
      For an AR-REPLICATOR, it means sending a BM packet to a
      selected subset of, or all, the overlay tunnels when the packet
      was previously received from an overlay tunnel.

   IR-IP: IP address used for Ingress Replication as in [EVPN].

   IR forwarding mode: it refers to the Ingress Replication behavior
      explained in [EVPN]. It means sending an AC BM packet copy to
      each remote PE/NVE in the EVI and sending an overlay BM packet
      only to the ACs and not to other overlay tunnels.

   PTA: PMSI Tunnel Attribute

   RT-3: EVPN Route Type 3, Inclusive Multicast Ethernet Tag route

13. References

   [EVPN] Sajassi et al., "BGP MPLS Based Ethernet VPN",
             draft-ietf-l2vpn-evpn-11.txt, work in progress, October 2014.

   [EVPN-OVERLAY] Sajassi-Drake et al., "A Network Virtualization
             Overlay Solution using EVPN",
             draft-sd-l2vpn-evpn-overlay-03.txt, work in progress,
             October 2014.

   [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
             Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC6514] Aggarwal, R., Rosen, E., Morin, T., and Y.
             Rekhter, "BGP Encodings and Procedures for Multicast in
             MPLS/BGP IP VPNs", RFC 6514, February 2012.

14. Acknowledgments

   The authors would like to thank Neil Hart and David Motz for their
   valuable feedback and contributions.

15. Authors' Addresses

   Jorge Rabadan (Editor)
   Alcatel-Lucent
   777 E. Middlefield Road
   Mountain View, CA 94043 USA
   Email: jorge.rabadan@alcatel-lucent.com

   Senthil Sathappan
   Alcatel-Lucent
   Email: senthil.sathappan@alcatel-lucent.com

   Mukul Katiyar
   Juniper
   Email: mkatiyar@juniper.net

   Wim Henderickx
   Alcatel-Lucent
   Email: wim.henderickx@alcatel-lucent.com

   Ravi Shekhar
   Juniper Networks
   Email: rshekhar@juniper.net

   Nischal Sheth
   Juniper Networks
   Email: nsheth@juniper.net

   Wen Lin
   Juniper Networks
   Email: wlin@juniper.net

   Ali Sajassi
   Cisco
   Email: sajassi@cisco.com

   Aldrin Isaac
   Bloomberg
   Email: aisaac71@bloomberg.net

   Mudassir Tufail
   Citibank
   Email: mudassir.tufail@citi.com
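
   As an informal illustration only (not part of the specification),
   the decision that section 5.3.1 describes for a BM packet received
   on an overlay tunnel by a Selective AR-REPLICATOR could be sketched
   in Python as below. The class, field and function names are
   hypothetical, the per-EVI state is assumed to be already learned
   from the received routes, and tunnels are identified simply by their
   destination IP. The sketch follows only the flood-list definitions
   of section 5.3.1; treating a packet whose destination IP is neither
   the AR-IP nor the IR-IP as "no replication" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class SelectiveReplicator:
    """Hypothetical per-EVI state of a Selective AR-REPLICATOR."""
    ar_ip: str                   # local AR-IP
    ir_ip: str                   # local IR-IP
    local_acs: List[str]         # attachment circuits
    remote_ir_ips: Set[str]      # flood-list #1 tunnel destinations (IR-IPs)
    leaf_set: Set[str]           # Selective AR-LEAF-set (leaf IR-IPs)
    replicator_set: Set[str]     # Selective AR-REPLICATOR-set (remote AR-IPs)
    known_replicators: Set[str]  # AR-IPs of all AR-REPLICATORs in the EVI
    non_bm: Set[str]             # tunnel destinations that signaled BM=1


def forward_overlay_bm(node: SelectiveReplicator,
                       dst_ip: str, src_ip: str) -> List[Tuple[str, str]]:
    """Return the ("ac" | "tunnel", id) targets for a BM packet that
    arrived on an overlay tunnel with the given underlay IPs."""

    def tunnels(dests: Set[str]) -> List[Tuple[str, str]]:
        # Source squelching and exclusion of non-BM tunnels always apply.
        return [("tunnel", d) for d in sorted(dests)
                if d != src_ip and d not in node.non_bm]

    acs = [("ac", ac) for ac in node.local_acs]

    if dst_ip == node.ar_ip:
        if src_ip in node.leaf_set:
            if node.replicator_set == node.known_replicators:
                # Flood-list #2: ACs, AR-LEAF-set and AR-REPLICATOR-set.
                return (acs + tunnels(node.leaf_set)
                        + tunnels(node.replicator_set))
            # Sets do not match: revert to non-selective mode, list #1.
            return acs + tunnels(node.remote_ir_ips)
        # Source is not a local leaf: list #2 minus the replicator-set.
        return acs + tunnels(node.leaf_set)

    if dst_ip == node.ir_ip:
        # Regular-IR behavior: replicate to local ACs only.
        return acs

    # Assumption: the packet is not addressed to this node.
    return []
```

   For example, calling forward_overlay_bm(node, node.ar_ip, leaf_ip)
   with leaf_ip taken from node.leaf_set yields the local ACs, the
   other eligible AR-LEAF tunnels and the remote AR-REPLICATOR tunnels,
   while a call with node.ir_ip as destination yields the local ACs
   only.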