idnits 2.17.1 draft-malhotra-bess-evpn-pe-ce-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 289 has weird spacing: '...CE-Host to PE...' == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The exact meaning of the all-uppercase expression 'MAY NOT' is not defined in RFC 2119. If it is intended as a requirements expression, it should be rewritten using one of the combinations defined in RFC 2119; otherwise it should not be all-uppercase. == The expression 'MAY NOT', while looking like RFC 2119 requirements text, is not defined in RFC 2119, and should not be used. Consider using 'MUST NOT' instead (if that is what you mean). Found 'MAY NOT' in this paragraph: o The EVPN flag 'E' MUST NOT be set in type 8/9 PDU from a CE. 
o A MAC entry for the MAC received in a type 8/9 PDU MUST be installed in the MAC-VRF table pointing to the AC to which the session is bound. o If an IPv4/IPv6 address is set in the PDU, an IPv4/IPv6 neighbor binding MUST be established for the IPv4/IPv6 address in the PDU to the MAC address in the PDU. In other words, a next-hop re-write for these IPv4/IPv6 neighbor entries MUST be installed using the MAC address in the PDU, and if required by forwarding logic, bound to the AC associated with the L3DL session. o Note that an IPv4/IPv6 address MAY NOT be set in a type 8/9 PDU received from a CE, in which case this PDU is only used for MAC learning. This MAY be the case in a non-IRB EVPN network, wherein, an EVPN PE is not a first-hop router for the attached CEs. -- The document date (Nov 2, 2019) is 1636 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'RFC 826' is mentioned on line 226, but not defined == Missing Reference: 'RFC 4861' is mentioned on line 227, but not defined == Missing Reference: 'RFC 7432' is mentioned on line 1030, but not defined == Missing Reference: 'EVPN IRB' is mentioned on line 308, but not defined == Missing Reference: 'RFC4861' is mentioned on line 895, but not defined == Missing Reference: 'RT-2' is mentioned on line 927, but not defined == Missing Reference: 'ESI' is mentioned on line 929, but not defined == Missing Reference: 'EVI' is mentioned on line 929, but not defined == Missing Reference: 'RT-1' is mentioned on line 929, but not defined == Unused Reference: 'RFC7432' is defined on line 1194, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'L3DL' -- Possible downref: Non-RFC (?) normative reference: ref. 
'EVPN-IRB' -- Possible downref: Non-RFC (?) normative reference: ref. 'EVPN-PREFIX-ADV' -- Possible downref: Non-RFC (?) normative reference: ref. 'EVPN-IRB-MOBILITY' -- Possible downref: Non-RFC (?) normative reference: ref. 'EVPN-IP-ALIASING' Summary: 2 errors (**), 0 flaws (~~), 14 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 BESS Working Group N. Malhotra, Ed. 3 Internet Draft Individual 4 Intended Status: Proposed Standard 5 K. Patel 6 Arrcus 8 J. Rabadan 9 Nokia 11 Expires: May 5, 2020 Nov 2, 2019 13 PE-CE Control Plane for EVPN 14 draft-malhotra-bess-evpn-pe-ce-00 16 Abstract 18 In an EVPN network, EVPN PEs provide VPN bridging and routing service 19 to connected CE devices based on BGP EVPN control plane. At present, 20 there is no PE-CE control plane defined for an EVPN PE to learn CE 21 MAC, IP, and any other routes from a CE that may be distributed in 22 EVPN control plane to enable unicast flows between CE devices. As a 23 result, EVPN PEs rely on data plane based gleaning of source MACs for 24 CE MAC learning, ARP/ND snooping for CE IPv4/IPv6 learning, and in 25 some cases, local configuration for learning prefix routes behind a 26 CE. A PE-CE control plane alternative to this traditional learning 27 approach, where applicable, offers certain distinct advantages that 28 in turn result in simplified EVPN operation. 30 This document defines a PE-CE control plane as an optional 31 alternative to traditional non-control-plane based PE-CE learning in 32 an EVPN network. It defines PE-CE control plane procedures and TLVs 33 based on L3DL as the base protocol, enumerates advantages that may be 34 achieved by using this PE-CE control plane, and discusses in detail 35 EVPN use cases that are simplified as a result. 
37 Status of this Memo 39 This Internet-Draft is submitted to IETF in full conformance with the 40 provisions of BCP 78 and BCP 79. 42 Internet-Drafts are working documents of the Internet Engineering 43 Task Force (IETF), its areas, and its working groups. Note that 44 other groups may also distribute working documents as 45 Internet-Drafts. 47 Internet-Drafts are draft documents valid for a maximum of six months 48 and may be updated, replaced, or obsoleted by other documents at any 49 time. It is inappropriate to use Internet-Drafts as reference 50 material or to cite them other than as "work in progress". 52 The list of current Internet-Drafts can be accessed at 53 http://www.ietf.org/1id-abstracts.html 55 The list of Internet-Draft Shadow Directories can be accessed at 56 http://www.ietf.org/shadow.html 58 Copyright and License Notice 60 Copyright (c) 2017 IETF Trust and the persons identified as the 61 document authors. All rights reserved. 63 This document is subject to BCP 78 and the IETF Trust's Legal 64 Provisions Relating to IETF Documents 65 (http://trustee.ietf.org/license-info) in effect on the date of 66 publication of this document. Please review these documents 67 carefully, as they describe your rights and restrictions with respect 68 to this document. Code Components extracted from this document must 69 include Simplified BSD License text as described in Section 4.e of 70 the Trust Legal Provisions and are provided without warranty as 71 described in the Simplified BSD License. 73 Table of Contents 75 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 76 1.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . 5 77 2. PE <-> CE Control Plane Overview . . . . . . . . . . . . . . . 7 78 3. TLVs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 79 3.1 Overlay IPv4 Encapsulation PDU . . . . . . . . . . . . . . . 9 80 3.2 Overlay IPv6 Encapsulation PDU . . . . . . . . . . . . . . . 
11 81 3.3 Overlay IPv4 Prefix Encapsulation PDU . . . . . . . . . . . 13 82 3.4 Overlay IPv6 Prefix Encapsulation PDU . . . . . . . . . . . 14 83 4. CE MAC/IP Learning on a PE AC . . . . . . . . . . . . . . . . . 15 84 4.1 PE <-> CE L3DL Session Establishment . . . . . . . . . . . . 15 85 4.2 CE MAC/IP Learning . . . . . . . . . . . . . . . . . . . . . 15 86 5. PE Any-cast GW MAC/IP Learning on CE . . . . . . . . . . . . . 15 87 6. Remote CE MAC/IP Learning on CE . . . . . . . . . . . . . . . . 16 88 7. PE <-> CE Control Plane with EVPN All-active Multi-Homing . . . 17 89 7.1 All-active Multi-Homing Mode . . . . . . . . . . . . . . . . 17 90 7.2 Source MAC . . . . . . . . . . . . . . . . . . . . . . . . . 18 91 7.3 CE MAC/IP Learning with EVPN All-active Multi-Homing . . . . 18 92 7.4 LAG Member Link Failure . . . . . . . . . . . . . . . . . . 19 93 7.4.1 Session Re-establishment . . . . . . . . . . . . . . . . 19 94 7.4.2 TLV Retention . . . . . . . . . . . . . . . . . . . . . 19 95 7.4 LAG Failure . . . . . . . . . . . . . . . . . . . . . . . . 19 96 7.5 Example PE <-> CE Control Plane Flow with All-active 97 Multi-Homing . . . . . . . . . . . . . . . . . . . . . . . . 20 98 8. Software Neighbor Tables . . . . . . . . . . . . . . . . . . . 22 99 9. MAC/IP Learning Conflict Resolution . . . . . . . . . . . . . . 22 100 10. EVPN SLA Signaling . . . . . . . . . . . . . . . . . . . . . . 23 101 11. PE-CE Overlay Prefix Learning . . . . . . . . . . . . . . . . 23 102 12. Asymmetric EVPN-IRB . . . . . . . . . . . . . . . . . . . . . 23 103 13. Centralized Gateway EVPN-IRB . . . . . . . . . . . . . . . . . 24 104 14. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 24 105 14.1 CE Application SLA . . . . . . . . . . . . . . . . . . . . 24 106 14.2 Simplified EVPN Operations . . . . . . . . . . . . . . . . 24 107 14.2.1 EVPN All-active Multi-Homing . . . . . . . . . . . . . 25 108 14.2.2 Convergence on CE Host Moves . . . . . . . . . . . . . 
26 109 14.2.2.1 Silent Hosts . . . . . . . . . . . . . . . . . . . 26 110 14.2.2.2 Probing . . . . . . . . . . . . . . . . . . . . . 27 111 14.2.3 ARP Gleaning Latency . . . . . . . . . . . . . . . . . 28 112 14.3 Applicability to non-EVPN Use Cases . . . . . . . . . . . . 28 113 15. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 114 16. References . . . . . . . . . . . . . . . . . . . . . . . . . 30 115 16.1 Normative References . . . . . . . . . . . . . . . . . . . 30 116 15.2 Informative References . . . . . . . . . . . . . . . . . . 30 117 17. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 31 118 Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 119 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 31 121 1 Introduction 123 In an EVPN network, CE devices typically connect to an EVPN PE via 124 layer-2 interfaces that terminate in a BD on the PE. Multi-homed LAG 125 interfaces together with EVPN all-active multi-homing procedures are 126 used to achieve PE-CE link and PE node redundancy for fault-tolerance 127 and load-balancing. PEs provide overlay bridging and, optionally, 128 first-hop routing service for these CE devices based on an EVPN 129 control plane that is used to distribute CE MAC, IP, and prefix 130 reachability across PEs. 132 At present, there is no PE-CE control plane defined for an EVPN PE to 133 learn connected CE host MACs and IPs. As a result, EVPN PEs rely on: 135 o data plane based gleaning of source MAC for MAC learning, 136 o ARP snooping for IPv4 + MAC learning, and 137 o ND snooping for IPv6 + MAC learning. 139 A PE-CE control plane alternative to this traditional learning 140 approach, where applicable, can offer some distinct advantages across 141 various boot-up, mobility, and convergence scenarios: 143 o PE-CE learning is decoupled from non-deterministic hashing of 144 data, ARP, and ND packets from CEs over all-active multi-homed 145 LAG interfaces. 
146 o PE-CE learning is decoupled from non-deterministic periodicity 147 of data traffic from CEs or, in an extreme scenario, from a CE 148 device being silent for an extended period. 149 o PE-CE learning is decoupled from non-deterministic CE behavior 150 with respect to unsolicited ARPs and NAs following boot-up and 151 moves. 152 o PE-CE learning is decoupled from latencies associated with data 153 packet triggered ARP and ND gleaning. 155 This results in simplification of certain EVPN operations such as 156 aliasing, MAC and IP syncing across multi-homing PEs, and probing on 157 MAC/IP moves. It also helps achieve a deterministic convergence 158 behavior across various boot-up, mobility, and failure scenarios. 160 Besides simplification of existing EVPN procedures, the PE-CE protocol is 161 also leveraged to enable new use cases that would not be possible 162 otherwise: 164 o Signal application SLA requirements to an EVPN PE that may 165 in turn be used by the PE to influence overlay and underlay 166 routing policies for a host. 167 o Signal prefix routes behind a CE for cases where a CE does not 168 run a dynamic routing protocol on the PE-CE link. 170 This document defines a new PE-CE control plane as an alternative to 171 traditional data-plane and ARP/ND snooping based PE-CE host learning 172 and to local configuration-based PE-CE prefix learning. It defines 173 PE-CE control plane procedures and TLVs based on [L3DL] as the base 174 protocol, enumerates advantages that may be achieved by using this 175 PE-CE control plane, and discusses in detail EVPN operations that are 176 simplified as a result. Use of the PE-CE control plane defined in this 177 document is intended to be optional and backwards compatible with CEs 178 that use traditional PE-CE learning within the same BD. While the 179 protocol is discussed using L3DL as the base protocol, signaling 180 described in this document may also, in the future, be extended to use 181 LLDP as the base protocol. 
183 1.1 Terminology 185 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 186 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 187 "OPTIONAL" in this document are to be interpreted as described in 188 BCP14 [RFC2119] [RFC8174] when, and only when, they appear in all 189 capitals, as shown here. 191 The following terms are used in this document: 193 o L3DL: Layer 3 Discovery and Liveness Protocol defined in [L3DL] 194 o EVPN-IRB: A BGP-EVPN distributed control plane based integrated 195 routing and bridging fabric overlay discussed in [EVPN-IRB] 196 o Underlay: IP or MPLS fabric core network that provides IP or 197 MPLS routed reachability between EVPN PEs. 198 o Overlay: VPN or service layer network consisting of EVPN PEs 199 OR VPN provider-edge (PE) switch-router devices that runs on top 200 of an underlay routed core. 201 o EVPN PE: A PE switch-router in a data-center fabric that 202 runs overlay BGP-EVPN control plane and connects to overlay CE 203 host devices. An EVPN PE may also be the first-hop layer-3 204 gateway for CE/host devices. This document refers to EVPN PE as a 205 logical function in a data-center fabric. This EVPN PE function 206 may be physically hosted on a top-of-rack switching device (ToR) 207 OR at layer(s) above the ToR in the Clos fabric. An EVPN PE is 208 typically also an IP or MPLS tunnel end-point for overlay VPN 209 flows. 210 o CE: A tenant host device that has layer 2 connectivity to an 211 EVPN PE switch-router, either directly OR via intermediate 212 switching device(s). 213 o Symmetric EVPN-IRB: An overlay fabric first-hop routing 214 architecture as defined in [EVPN-IRB], wherein, overlay host-to- 215 host routed inter-subnet flows are routed at both ingress and 216 egress EVPN PEs. 
217 o Asymmetric EVPN-IRB: An overlay fabric first-hop routing 218 architecture as defined in [EVPN-IRB], wherein, overlay host-to- 219 host routed inter-subnet flows are routed and bridged at ingress 220 PE and bridged at egress PEs. 221 o Centralized EVPN-IRB: An overlay fabric first-hop routing 222 architecture, wherein, overlay host-to-host routed inter-subnet 223 flows are routed at a centralized gateway, typically at one 224 of the spine layers, and where EVPN PEs are pure bridging 225 devices. 226 o ARP: Address Resolution Protocol [RFC 826]. 227 o ND: IPv6 Neighbor Discovery Protocol [RFC 4861]. 228 o Ethernet-Segment: physical Ethernet or LAG port that connects an 229 access device to an EVPN PE, as defined in [RFC 7432]. 230 o ESI: Ethernet Segment Identifier as defined in [RFC 7432]. 231 o LAG: Layer-2 link-aggregation, also known as layer-2 bundle, 232 port-channel, or bond interface. 233 o EVPN all-active multi-homing: PE-CE all-active multi-homing 234 achieved via a multi-homed layer-2 LAG interface on a CE with 235 member links to multiple PEs and related EVPN procedures on the 236 PEs. 237 o EVPN Aliasing: multi-homing procedure as defined in [RFC 7432]. 238 o BD: Broadcast Domain. 239 o Bridge Table: An instantiation of a broadcast domain on a 240 MAC-VRF. 241 o AC: A PE Attachment Circuit. This may be an access (untagged) or 242 trunk (tagged) layer-2 interface that is a member of a local VLAN 243 or a BD. 244 o SLA: Service Level Agreement. 246 2. PE <-> CE Control Plane Overview 248 The Layer 3 Discovery and Liveness (L3DL) protocol is defined in [L3DL] 249 as a protocol over Ethernet links to auto-discover a connected 250 neighbor's layer 2 and layer 3 attributes and encapsulations for the 251 purpose of bringing up upper layer routing protocols. This document 252 leverages L3DL as a PE-CE protocol in an EVPN network fabric on 253 access links between an EVPN PE and CE. 
Specifically, 255 o PE-CE control plane based on L3DL protocol is proposed for CE 256 MAC learning as an alternative to data-plane based source MAC 257 learning. 258 o PE-CE control plane based on L3DL protocol is proposed for CE 259 MAC-IP adjacency learning as an alternative to MAC-IP learning 260 based on ARP/ND snooping. 261 o PE-CE control plane based on L3DL is proposed for learning of 262 IP Prefixes and associated overlay indexes, as an alternative to 263 local configuration on the PE for use case defined in section 4.1 264 of [EVPN-PREFIX-ADV]. 266 Note that any specification related to base L3DL protocol itself is 267 considered out of scope for this document and will continue to be 268 covered in the base protocol spec. This document will instead focus 269 on procedures and TLV extensions needed to achieve the above learning 270 on PE-CE links in an EVPN network. Any text that relates to the base 271 protocol included in this document is simply background information 272 in the context of use cases covered in this document. The reader 273 should refer to the base L3DL protocol document for the exact L3DL 274 protocol specification. 276 +------------------------+ 277 | Underlay Network Fabric| 278 +------------------------+ 280 BGP-EVPN Peering 281 <------------------------------> 283 +------+ +------+ +------+ 284 | PE1 | ..... | PE2 | | PE3 | 285 +------+ +------+ +------+ 286 | \ / 287 L3DL Session \ ESI / 288 | L3DL \ / L3DL 289 CE-host to PE2 CE-Host to PE3 291 Figure 1 293 An L3DL session is established on layer-2 logical interfaces between 294 the EVPN PE and each connected CE host device. A session end-point on 295 a local logical interface is identified by peer Logical Link Endpoint 296 Identifier (LLEI) as defined in [L3DL]. L3DL HELLO messages are used 297 for end-point discovery and OPEN messages are exchanged between two 298 end-points to establish an L3DL peering. 
Once L3DL peering is 299 established, encapsulation TLVs are exchanged for learning. 301 In the context of an EVPN network, CE Attachment Circuits (AC logical 302 interfaces) typically terminate in a BD on the PE, with multi-homed 303 LAG interfaces used for EVPN all-active multi-homing. CE hosts may be 304 directly connected to EVPN PEs via access ports, or may be connected 305 on trunk-ports via another switch. In a common EVPN-IRB design, EVPN 306 PEs also function as distributed first-hop gateways for hosts in a 307 BD. While symmetric and asymmetric IRB designs are possible as 308 discussed in [EVPN IRB], procedures described in subsequent sections 309 assume symmetric IRB with distributed any-cast gateways on EVPN PEs. 310 Any deviations from these procedures for asymmetric IRB design or a 311 centralized IRB design will be covered in future updates to this 312 document. 314 The next few sections will focus on additional L3DL TLVs and 315 procedures needed for PE-CE learning on EVPN PE ACs without and with 316 all-active multi-homing. 318 3. TLVs 320 This section defines new TLVs that are used by PE-CE control plane 321 defined in this document. 323 3.1 Overlay IPv4 Encapsulation PDU 325 A new encapsulation PDU type is defined for the purpose of carrying 326 overlay IPv4 and MAC bindings. Alternatively, it may also be used to 327 carry an overlay MAC with a NULL IPv4 address in a non-IRB use case. 
329 0 1 2 3 330 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 331 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 332 | Type = 11 | PDU Length | 333 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 334 | | Count | 335 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 336 | Serial Number | 337 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 338 | IPv4 Address | 339 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 340 | PrefixLen |E| RSVD | | 341 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 342 | MAC Address | 343 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 344 | SLA | 345 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 346 | more... | 347 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 348 | | Sig Type | Signature Length | 349 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 350 | Signature | 351 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 353 Figure 2 355 o A new L3DL PDU type (11) is requested for this PDU. 356 o The IPv4 Address is that of an overlay. 357 o MAC address carries the MAC binding for the particular IPv4 358 address if one is set in the PDU. If an IPv4 address is not set, 359 it simply signals an overlay MAC address. 360 o EVPN flag 'E' indicates if this encapsulation is being sent on 361 behalf of a remote host learnt via EVPN. Use of this flag is 362 covered in a later section. 363 o A 32 bit 'SLA' word is used to signal SLA requirements of a CE 364 host to the EVPN PE. An EVPN PE may use these to implement 365 routing policies needed to fulfil the CE SLA requirement. As an 366 example, if a CE indicates a minimum delay requirement for the 367 applications it runs, EVPN provider network may route or bridge 368 traffic destined to this host over traffic engineered paths that 369 implement a minimum delay routing policy. 
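As a rough illustration of the per-binding fields listed above, the following Python sketch packs a single entry (IPv4 address, PrefixLen, flags, MAC address, SLA word). The field widths (an 8-bit PrefixLen, then 'E' followed by 7 reserved bits), the position of 'E' as the most significant bit of the flags octet, and the helper name are assumptions read off Figure 2 rather than anything this document states in text. The PDU header, Count, Serial Number, and Signature fields belong to the base L3DL protocol and are deliberately left out.

```python
import ipaddress
import struct

# Assumption: 'E' is the most significant bit of the octet after PrefixLen.
E_FLAG = 0x80


def pack_ipv4_encap_entry(mac, ipv4=None, prefix_len=0, sla=0, evpn=False):
    """Pack one overlay IPv4/MAC binding into 16 octets (hypothetical helper)."""
    # A NULL (all-zero) IPv4 address signals a MAC-only entry (non-IRB use case).
    ip_bytes = ipaddress.IPv4Address(ipv4).packed if ipv4 else bytes(4)
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must be 6 octets")
    if not 0 <= prefix_len <= 32:
        raise ValueError("IPv4 prefix length must be in 0..32")
    flags = E_FLAG if evpn else 0
    return (ip_bytes
            + struct.pack("!BB", prefix_len, flags)
            + mac_bytes
            + struct.pack("!I", sla))


# A CE host binding, and a MAC-only entry with a NULL IPv4 address.
host = pack_ipv4_encap_entry("00:11:22:33:44:55", "10.0.0.5", 32)
mac_only = pack_ipv4_encap_entry("00:11:22:33:44:55")
```

Keeping the entry packer separate from PDU framing means the same helper could serve both CE-originated entries ('E' clear) and PE-relayed remote entries ('E' set).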
371 In addition to carrying CE host IP and MAC to a PE, this PDU may also 372 be used to carry PE's any-cast gateway IPv4 address and MAC bindings 373 to a CE host device. Optionally, it may also be used to relay a 374 remote CE's IPv4 address and MAC bindings to a local CE host within a 375 subnet. Procedures related to use of this PDU are discussed in 376 subsequent sections. 378 The encapsulation list in this PDU MUST follow full replace semantics 379 as in the L3DL protocol specification. 381 3.2 Overlay IPv6 Encapsulation PDU 383 A new encapsulation PDU type is defined for the purpose of carrying 384 overlay IPv6 and MAC bindings: 386 0 1 2 3 387 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 388 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 389 | Type = 12 | PDU Length | 390 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 391 | | Count | 392 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 393 | Serial Number | 394 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 395 | | 396 + + 397 | | 398 + IPv6 Address + 399 | | 400 + + 401 | | 402 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 403 | PrefixLen |E|R|O| SLA |Rsv| | 404 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 405 | MAC Address | 406 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 407 | SLA | 408 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 409 | more... | 410 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 411 | | Sig Type | Signature Length | 412 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 413 | Signature | 414 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 416 Figure 3 418 o A new L3DL PDU type (12) is requested for this PDU. 419 o The IPv6 Address is that of an overlay. 420 o MAC address carries the MAC binding for IPv6 address in the PDU. 
421 o An EVPN flag 'E' indicates if this encapsulation is being sent 422 on behalf of a remote host learnt via EVPN. Usage of this flag is 423 covered in a later section. 424 o A Router flag 'R' is used to carry "Router Flag" or "R-bit" as 425 defined in [RFC4861]. Usage of this flag for the purpose of 426 installing ND cache entries based on learning via this TLV is as 427 defined in [RFC4861] 429 o An Override flag 'O' is used to carry "Override Flag" or "O-bit" 430 as defined in [RFC4861]. Usage of this flag for the purpose of 431 installing ND cache entries based on learning via this TLV is as 432 defined in [RFC4861] 433 o A 32 bit 'SLA' word is used to signal SLA requirements of a CE 434 host to the EVPN PE. An EVPN PE may use these to implement 435 routing policies needed to fulfil the CE SLA requirement. As an 436 example, if a CE indicates a minimum delay requirement for the 437 applications it runs, EVPN provider network may route or bridge 438 traffic destined to this host over traffic engineered paths that 439 implement a minimum delay routing policy. 441 In addition to carrying CE host IP and MAC to a PE, this PDU may also 442 be used to carry PE's any-cast gateway IPv6 address and MAC bindings 443 to a CE host device. Optionally, it may also be used to relay a 444 remote CE's IPv6 address and MAC bindings to a local CE host within a 445 subnet. Procedures related to use of this PDU are discussed in 446 subsequent sections. 448 The encapsulation list contained in this PDU MUST follow full replace 449 semantics as in the L3DL protocol specification. 
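The full replace requirement above can be sketched as follows, assuming that "full replace" means each received encapsulation list supersedes the list previously received on the same L3DL session. The exact rule is defined by the base L3DL specification; the function name and the (IP, MAC) tuple representation are illustrative assumptions only.

```python
def apply_full_replace(installed, received):
    """Replace a session's bindings with the newly received list.

    Both arguments are collections of (ip, mac) tuples. Returns the new
    table plus the bindings to install and the bindings to withdraw.
    """
    new_table = set(received)
    added = new_table - set(installed)
    withdrawn = set(installed) - new_table
    return new_table, added, withdrawn


installed = {
    ("2001:db8::1", "00:11:22:33:44:55"),
    ("2001:db8::2", "00:11:22:33:44:66"),
}
received = [
    ("2001:db8::1", "00:11:22:33:44:55"),
    ("2001:db8::3", "00:11:22:33:44:77"),
]
table, added, withdrawn = apply_full_replace(installed, received)
```

Under this reading, a binding that is absent from the latest list is implicitly withdrawn, so no explicit delete message is needed.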
451 3.3 Overlay IPv4 Prefix Encapsulation PDU 453 A new encapsulation PDU type is defined for the purpose of carrying 454 overlay IPv4 prefix routes for prefixes behind a CE that does not run 455 a dynamic routing protocol for use-case as defined in section 4.1 of 456 [EVPN-PREFIX-ADV]: 458 0 1 2 3 459 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 460 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 461 | Type = 13 | PDU Length | 462 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 463 | | Count | 464 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 465 | Serial Number | 466 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 467 | Prefix Count | IPv4 Prefix | 468 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 469 | | PrefixLen | | 470 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 471 | IPv4 Prefix | PrefixLen | 472 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 473 | more... | 474 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 475 | GW IP | 476 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 477 | more... | 478 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 479 | | Sig Type | Signature Length | 480 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 481 | Signature | 482 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 484 Figure 4 486 A CE device as defined in [EVPN-PREFIX-ADV], with prefixes behind it 487 MAY use the above PDU to send these prefixes to an EVPN PE with 488 itself as the GW. An EVPN PE MAY then advertise prefixes received via 489 this PDU as RT-5, with TS as the GW, as defined in [EVPN-PREFIX-ADV]. 491 o A new L3DL PDU type (13) is requested for this PDU. 492 o IPv4 Prefix is set to a prefix behind a CE. 493 o PrefixLen is set to IPv4 prefix length for the advertised prefix. 
494 o GW-IP is set to the CE IPv4 address (advertised via Type 8 PDU). 496 Multiple prefixes may be set for a single GW IP. The encapsulation 497 list contained in this PDU MUST follow full replace semantics as in 498 the L3DL protocol specification. 500 3.4 Overlay IPv6 Prefix Encapsulation PDU 502 A new encapsulation PDU type is defined for the purpose of carrying 503 overlay IPv6 prefix routes for prefixes behind a CE that does not run 504 a dynamic routing protocol for use-case as defined in section 4.1 of 505 [EVPN-PREFIX-ADV]: 507 0 1 2 3 508 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 509 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 510 | Type = 14 | PDU Length | 511 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 512 | | Count | 513 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 514 | Serial Number | 515 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 516 | Prefix Count | | 517 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + 518 | | 519 + + 520 | | 521 + + 522 | IPv6 Prefix | 523 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 524 | | PrefixLen | more... | 525 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 526 | | 527 + + 528 | | 529 + GW IP + 530 | | 531 + + 532 | | 533 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 534 | more... | 535 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 536 | | Sig Type | Signature Length | 537 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 538 | Signature | 539 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 541 Figure 5 543 A CE device as defined in [EVPN-PREFIX-ADV], with prefixes behind it 544 MAY use the above PDU to send these prefixes to an EVPN PE with 545 itself as the GW. An EVPN PE MAY then advertise prefixes received via 546 this PDU as RT-5, with TS as the GW, as defined in [EVPN-PREFIX-ADV]. 548 o A new L3DL PDU type (14) is requested for this PDU. 
549 o IPv6 Prefix is set to an IPv6 prefix behind a CE. 550 o PrefixLen is set to the IPv6 prefix length for the advertised prefix. 551 o GW-IP is set to the CE IPv6 address (advertised via Type 9 PDU). 553 Multiple prefixes may be set for a single GW IP. The encapsulation 554 list contained in this PDU MUST follow full replace semantics as in 555 the L3DL protocol specification. 557 4. CE MAC/IP Learning on a PE AC 559 This section defines procedures for learning a connected CE MAC and 560 IP on a PE local attachment circuit (AC). 562 4.1 PE <-> CE L3DL Session Establishment 564 On an EVPN PE, 566 o A HELLO and/or OPEN PDU sent from a CE host source MAC is 567 received on a tagged or untagged interface that is a member of a 568 local BD, referred to here as an AC. 569 o OPEN messages are exchanged with the host on the AC. 570 o An L3DL session is established to the host source MAC and bound to a 571 local AC. 573 4.2 CE MAC/IP Learning 575 Overlay IPv4 and IPv6 encapsulation PDU types 8/9 from a CE are used 576 for the purpose of CE MAC/IP learning on a PE: 578 o The EVPN flag 'E' MUST NOT be set in type 8/9 PDU from a CE. 579 o A MAC entry for the MAC received in a type 8/9 PDU MUST be 580 installed in the MAC-VRF table pointing to the AC to which the 581 session is bound. 582 o If an IPv4/IPv6 address is set in the PDU, an IPv4/IPv6 neighbor 583 binding MUST be established for the IPv4/IPv6 address in the PDU to 584 the MAC address in the PDU. In other words, a next-hop re-write for 585 these IPv4/IPv6 neighbor entries MUST be installed using the MAC 586 address in the PDU, and if required by forwarding logic, bound to 587 the AC associated with the L3DL session. 588 o Note that an IPv4/IPv6 address MAY NOT be set in a type 8/9 PDU 589 received from a CE, in which case this PDU is only used for MAC 590 learning. This MAY be the case in a non-IRB EVPN network, wherein, 591 an EVPN PE is not a first-hop router for the attached CEs. 593 5. 
PE Any-cast GW MAC/IP Learning on CE

595    If L3DL based host learning is enabled on a PE with a distributed
596    any-cast gateway on the EVPN PE,

598      o The EVPN PE MUST send type 8/9 Overlay Encapsulation PDUs on
599        associated ACs with L3DL sessions toward CE hosts.
600      o Type 8/9 PDUs from an EVPN PE MUST be encoded with the any-cast
601        gateway IPv4/IPv6 address and any-cast gateway MAC address.
602      o The EVPN flag 'E' MUST NOT be set in this PDU.
603      o A CE MAY process type 8/9 PDUs to establish GW IP to MAC
604        bindings and learn gateway MAC to LAG AC bindings, similar to
605        handling of type 8/9 PDUs on the PE described above.

607    Handling of type 8/9 PDUs for the purpose of gateway learning on the
608    host is desirable but optional. A CE MAY continue to use ARP and ND
609    for this purpose.

611 6. Remote CE MAC/IP Learning on CE

613    For CE to CE intra-subnet flows across the overlay, a CE needs to learn
614    and install a neighbor IP to MAC binding for remote CEs. This is
615    handled today by flooding ARP/ND requests across the overlay
616    bridge and, optionally, by implementing an ARP/ND suppression cache on
617    the PE that is populated via MAC+IP EVPN route-type 2. In that case,
618    ARP/ND request frames are trapped on the PE, which sends a local ARP/ND
619    reply on behalf of the remote CE. If L3DL based learning is enabled in
620    the fabric, L3DL may be used for this purpose to avoid overlay ARP/ND
621    flooding and data frame triggered ARP learning, and to avoid
622    maintaining an ARP suppression cache on the PE.

624      o Remote MAC-IP routes learned via BGP EVPN route-type 2 that are
625        imported to a local MAC-VRF MAY also be sent as type 8/9 PDUs on
626        L3DL sessions to CEs over local ACs in that BD.
627      o The EVPN flag 'E' MUST be set in this encapsulation in the PDU.
628      o A CE MAY install IPv4/IPv6 neighbor MAC bindings for remote
629        CEs within a subnet based on 'E' flagged type 8/9 PDUs received
630        from the PE.
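The receive-side handling of type 8/9 PDUs described in Sections 4.2, 5, and 6 can be sketched as follows. This is a minimal illustration in Python, not an implementation of the L3DL wire format. The names `Role`, `OverlayEncap`, `Tables`, and `handle_overlay_encap` are hypothetical and introduced only for this sketch. Rejecting an 'E' flagged PDU that arrives at a PE is an assumption: the text states that a CE MUST NOT set the flag but does not say how a PE reacts to a violation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    PE = "pe"
    CE = "ce"


@dataclass(frozen=True)
class OverlayEncap:
    """Hypothetical decoded view of one entry in a type 8/9 PDU."""
    mac: str
    ip: Optional[str]   # None models a PDU that carries only a MAC
    evpn_flag: bool     # the EVPN 'E' flag


@dataclass
class Tables:
    """Hypothetical local state on the receiving device."""
    mac_to_port: Dict[str, str] = field(default_factory=dict)
    ip_to_mac: Dict[str, str] = field(default_factory=dict)


def handle_overlay_encap(role: Role, tables: Tables,
                         encap: OverlayEncap, session_port: str) -> str:
    """Apply one type 8/9 entry and return the kind of peer learned."""
    if role is Role.PE:
        # Section 4.2: a CE must not set 'E'. Raising here is an
        # assumption made for this sketch, not behavior from the draft.
        if encap.evpn_flag:
            raise ValueError("'E' flag set in type 8/9 PDU from a CE")
        # MAC entry points at the AC the session is bound to.
        tables.mac_to_port[encap.mac] = session_port
        # Neighbor binding only when an address is present.
        if encap.ip is not None:
            tables.ip_to_mac[encap.ip] = encap.mac
        return "local-ce"

    # Receiver is a CE; this processing is optional (Sections 5 and 6).
    if encap.ip is not None:
        tables.ip_to_mac[encap.ip] = encap.mac
    if encap.evpn_flag:
        # 'E' set: binding for a remote CE reachable across the overlay.
        return "remote-ce"
    # 'E' clear: any-cast gateway; also learn gateway MAC to port.
    tables.mac_to_port[encap.mac] = session_port
    return "gateway"
```

On a CE, the same neighbor table ends up holding both gateway and remote-CE bindings; only the gateway MAC is additionally bound to the port on which the session runs.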
632    Handling of type 8/9 PDUs for this purpose is optional but desirable
633    to get full benefit of a fabric that is completely set up on boot-up,
634    avoids overlay flooding, and is decoupled from latencies associated
635    with data plane driven ARP and ND learning.

637 7. PE <-> CE Control Plane with EVPN All-active Multi-Homing

639                 +------------------------+
640                 | Underlay Network Fabric|
641                 +------------------------+

643                      BGP-EVPN Peering
644    <--------------------------------------------------->
645    +------+      +------+         +------+      +------+
646    | PE1  |      | PE2  |  .....  | PEx  |      | PEy  |
647    +------+      +------+         +------+      +------+
648       \              /               \              /
649        \            /                 \            /
650         \          /                   \          /
651          \ ESI-a  /                     \ ESI-b  /
652      L3DL \      / L3DL             L3DL \      / L3DL
653      to PE1\    / to PE2            to PEx\    / to PEy
654           CE-Host                        CE-Host

656                          Figure 6

658    In an EVPN all-active multi-homing setup, a LAG interface on the CE
659    includes member physical ports that connect to multiple PE devices. A
660    subset of these member ports that terminate at a PE are configured as
661    members of a local LAG interface at that PE. A LAG AC at the PE is a
662    logical interface in a BD, identified by this LAG interface and,
663    optionally, an Ethernet Tag in case of trunk ports.

665    In order for L3DL based learning to work with EVPN all-active multi-
666    homing, a separate L3DL peering MUST be established between the CE
667    host and each PE device. For this reason, while an EVPN PE MAY form
668    an L3DL peering to a CE host on its local LAG AC, the CE host MUST
669    form an L3DL peering to a PE on a local LAG "member physical port".

671    A configurable All-active Multi-Homing mode is defined below in order
672    to be able to bind an L3DL peering to a LAG member-port as opposed to
673    a LAG interface.

675 7.1 All-active Multi-Homing Mode

677    When configured to run on a local LAG port in this mode,

679      o L3DL HELLO messages MUST be replicated on ALL LAG member ports.
680      o An L3DL OPEN message sent in response to a HELLO MUST be sent on
681        the LAG member port on which the HELLO was received.
682      o An L3DL session MUST be bound to the local LAG member port on
683        which the OPEN message was received.
684      o L3DL encapsulation PDUs MUST be sent on the local LAG member
685        port on which the session was bound.
686      o L3DL Keep-Alives MUST be sent on the local LAG member port on
687        which the session was bound.

689    Note that this may result in a PE receiving multiple HELLO PDUs from
690    a CE end-point. This, however, is harmless, as per the [L3DL]
691    specification. A PE simply drops redundant HELLOs from a MAC that it
692    has already replied to with an OPEN, within a retry time window.

694 7.2 Source MAC

696    L3DL relies on the source MAC address in the Ethernet frame to
697    establish a peering. When running L3DL on a LAG port (in all-active
698    multi-homing mode or regular mode), L3DL frames MUST use the LAG
699    interface MAC as the source MAC address in the Ethernet frame.

701 7.3 CE MAC/IP Learning with EVPN All-active Multi-Homing

703    In order to accomplish MAC/IP learning of CE host devices multi-homed
704    to EVPN fabric PEs via EVPN All-active Multi-Homing:

706      o A multi-homed CE device MUST be configured to run L3DL on
707        local LAG interfaces in the All-active Multi-Homing mode defined
708        above.
709      o An EVPN PE MAY run L3DL on local LAG interfaces to multi-homed CE
710        devices in regular mode.
711      o EVPN PEs that share the same Ethernet Segment MUST use unique
712        source MACs (that of the local LAG) in HELLO/OPEN messages to
713        establish separate L3DL sessions to a CE.

715    With the above rules in place,

717      o An L3DL session on the CE is bound to a local LAG member-port.
718      o An L3DL session on the PE is bound to a local LAG AC port.
719      o A single L3DL session is established at the PE to a CE on the
720        local LAG AC.
721      o 'N' L3DL sessions are established at the CE, one to each PE on a
722        local LAG member interface, where N = number of multi-homing PEs
723        in an Ethernet Segment.

725    Once an L3DL session is established as above, all other host learning
726    procedures defined earlier for CE MAC/IP learning on a PE's AC port
727    apply as is to a LAG AC in an EVPN all-active multi-homing setup.

729 7.4 LAG Member Link Failure

731    On a CE that is running in all-active multi-homing mode, an L3DL
732    session to a PE is bound to a LAG member interface. If the link that
733    the L3DL session is bound to fails, the L3DL session is torn down
734    at the CE by virtue of the session interface going down. If the CE
735    has additional active member link(s) to this PE, a new L3DL session
736    must be established on one of the active member links via HELLO PDUs
737    sent by the CE on its remaining active member links to the PE.

739 7.4.1 Session Re-establishment

741    The L3DL session at the CE is torn down immediately following the
742    session interface failure. While the LAG interface at the PE is still
743    operationally UP, the L3DL session at the PE is subject to Keep Alive
744    PDUs received from the CE. Once the session expires at the PE because
745    of missed Keep Alive PDUs from the CE, the PE will respond to a HELLO
746    on one of the active member links with an OPEN to establish a new
747    session. Note that the new session is still bound to the LAG AC at
748    the PE and to a new member link at the CE.

750 7.4.2 TLV Retention

752    TLVs learnt from a CE over a failed session MUST be retained at the
753    PE if the PE LAG AC is still operationally up following a member link
754    failure because of active member link(s) in the LAG. TLV retention
755    logic at the PE MAY be based on an age-out time, which is a local
756    matter at the PE. The TLV age-out time MUST be higher than the missed
757    Keep Alive duration, after which the session is considered closed.
758    Once a new L3DL session is established, the PE MUST implement a mark
759    and sweep logic to reconcile retained TLVs from the CE peer with the
760    new set of TLVs received from this CE.

762 7.4 LAG Failure

764    When a LAG member link failure results in the LAG interface being
765    operationally down, the TLV age-out logic discussed above MUST NOT be
766    in effect. The L3DL session MAY be considered as DOWN immediately on
767    the LAG being down at the PE. This is so that, in the event of a total
768    connectivity loss between a PE and CE, CE learnt routes can be
769    withdrawn immediately.

771 7.5 Example PE <-> CE Control Plane Flow with All-active Multi-Homing

773    An example L3DL over all-active multi-homing session flow is
774    discussed below for clarity.

776    +-------------+      +-------------+
777    |             |      |             |
778    |     PE2     |      |     PE3     |
779    |             |      |             |
780    +-+-----------+      +-+-----------+
781      |    LAG    |        |    LAG    |
782      ++--+---+--++        ++--+---+--++
783       |  |   |  |          |  |   |  |
784       |  |   |  |          |  |   |  |
785       |  |   |  |          |  |   |  |
786       |  |   |  |          |  |   |  |
787       |  |   |  |          |  |   |  |
788    +--+--+---+--+----------+--+---+--++
789    |               LAG                |
790    +----------------------------------+-+
791    |                                    |
792    |                 H1                 |
793    |                                    |
794    +------------------------------------+

796                   Figure 7

798    Example topology with CE H1 multi-homed to PE2 and PE3 via EVPN all-
799    active multi-homing LAG with four member ports to each PE:

801    H1 member ports to PE2:  i121, i122, i123, i124
802                               |     |     |     |
803    PE2 member ports to H1:  i211, i212, i213, i214

805    H1 member ports to PE3:  i131, i132, i133, i134
806                               |     |     |     |
807    PE3 member ports to H1:  i311, i312, i313, i314

809    H1 LAG port to PE2/PE3:  MLAG1
810    PE2 LAG port to H1:      LAG2
811    PE3 LAG port to H1:      LAG3
812    H1 LAG MAC:              LMAC1
813    PE2 LAG MAC:             LMAC2
814    PE3 LAG MAC:             LMAC3

816    H1 running L3DL on MLAG1 in All-active Multi-Homing mode
817    PE2 running L3DL on LAG2 in regular mode
818    PE3 running L3DL on LAG3 in regular mode
819             PE2                  H1                   PE3

821              |       HELLOs       |       HELLOs       |
822          LAG2|<-------------------|------------------->|LAG3
823
             LAG2|<-------------------|------------------->|LAG3
824          LAG2|<-------------------|------------------->|LAG3
825          LAG2|<-------------------|------------------->|LAG3
826              |                    |                    |
827              |        OPEN        |        OPEN        |
828              |------------------->|<-------------------|
829          LAG2|                i122|i132                |LAG3
830              |                    |                    |
831              |        OPEN        |        OPEN        |
832              |<-------------------|------------------->|
833          LAG2|                i122|i132                |LAG3
834              |                    |                    |
835    Session to|          Session to|Session to          |Session to
836 LMAC1 on LAG2|       LMAC2 on i122|LMAC3 on i132       |LMAC1 on LAG3
837              |                    |                    |
838              |     Encap-PDU      |     Encap-PDU      |
839              |<-------------------|------------------->|
840          LAG2|                i122|i132                |LAG3
841              |        ACK         |        ACK         |
842              |------------------->|<-------------------|
843          LAG2|                    |                    |LAG3
844              |                    |                    |
845              |    Overlay-PDU     |    Overlay-PDU     |
846              |------------------->|<-------------------|
847          LAG2|                    |                    |LAG3
848              |        ACK         |        ACK         |
849              |<-------------------|------------------->|
850          LAG2|                i122|i132                |LAG3
851              |                    |                    |

853                             Figure 8

855    In the example flow shown above:

857      o H1: originates HELLO(SMAC=LMAC1) on all MLAG member ports
858      o PE2: Multiple HELLO(SMAC=LMAC1) copies received on port LAG2
859      o PE3: Multiple HELLO(SMAC=LMAC1) copies received on port LAG3
860      o PE2: A single OPEN(SMAC=LMAC2, DMAC=LMAC1) sent on port LAG2
861      o PE3: A single OPEN(SMAC=LMAC3, DMAC=LMAC1) sent on port LAG3
862      o PE2/PE3: duplicate HELLOs from same source LMAC1 are ignored
863      o H1: OPEN(SMAC=LMAC2, DMAC=LMAC1) received on member port i122
864      o H1: OPEN(SMAC=LMAC1, DMAC=LMAC2) sent on member port i122
865      o H1: Session established to LMAC2 on MLAG1 member port i122
866      o PE2: Session established to LMAC1 on LAG AC LAG2
867      o H1: OPEN(SMAC=LMAC3, DMAC=LMAC1) received on member port i132
868      o H1: OPEN(SMAC=LMAC1, DMAC=LMAC3) sent on member port i132
869      o H1: Session established to LMAC3 on MLAG1 member port i132
870      o PE3: Session established to LMAC1 on LAG AC LAG3
871      o H1: IP encapsulation PDUs (type 4/5) sent to LMAC2 and LMAC3
872      o PE2/PE3: H1 MAC and IP are learned
873      o PE2/PE3: overlay IP
encapsulation PDUs (type 8/9) sent to LMAC1
874      o H1: Any-cast GW MAC and IP are learned
875      o H1: Remote host MAC and IP are learned

877 8. Software Neighbor Tables

879    Some networking stack implementations rely on ARP and ND populated
880    neighbor tables for software forwarding. In order to inter-work with
881    such an implementation, an L3DL learned IPv4/IPv6 neighbor entry MAY
882    also be installed in the ARP and ND neighbor tables as a static /
883    permanent entry.

885    In addition,

887      o Pre-installing L3DL learned neighbor entries may help reduce
888        potential conflict with ARP or ND learned neighbor entries.
889      o Pre-installing L3DL learned neighbor entries may help reduce
890        reliance on data traffic triggered ARP requests / ND
891        solicitations and associated learning latency.

893    With respect to installing IPv6 entries learnt via L3DL in the IPv6 ND
894    cache, the Router flag (R-bit) and Override flag (O-bit) received in
895    an L3DL PDU should be handled as defined in [RFC4861].

897 9. MAC/IP Learning Conflict Resolution

899    If L3DL learned neighbor entries are not already installed as static
900    entries in the ARP/ND neighbor table, it is possible that a neighbor
901    IPv4/IPv6 adjacency may be learned both via L3DL and ARP/ND. Even if
902    L3DL learned entries were pre-installed in the neighbor table, a race
903    condition is still possible, leading to a potential conflict between
904    ARP/ND learned and L3DL learned neighbor IP adjacencies. In such
905    scenarios, the L3DL learned entry should be preferred for the purpose
906    of programming neighbor IP adjacencies in forwarding.

908    With respect to MAC-VRF entries, it is recommended that data plane
909    learning be turned off when L3DL based learning is enabled. However,
910    if it is not, data plane learned entries MUST be reconciled with L3DL
911    learned entries in software and, in case of a conflict, L3DL learned
912    entries preferred if L3DL based learning is enabled.

914 10.
EVPN SLA Signaling

916    Application SLA requirements received from a CE need to be signaled
917    by the local PE to remote PEs in order for remote PEs to route or
918    bridge overlay traffic destined to this CE via traffic engineered
919    paths that meet the SLA. As an example, if the SLA requirement for a
920    CE is specified to be "minimum delay", remote PEs need to direct overlay
921    bridged and routed traffic to this CE over traffic engineered
922    underlay paths that implement a "minimum delay" routing policy.

924    Overlay SLA may also be required to be implemented at different
925    levels of granularity:

927      o per-host: [RT-2]
928      o per-EVI
929      o per-[ESI, EVI]: [RT-1]

931    Exact signaling specification and handling procedures for the above
932    will be detailed either in future revisions of this document or in a
933    separate document.

935 11. PE-CE Overlay Prefix Learning

937    [EVPN-PREFIX-ADV] section 4.1 defines a use case wherein a PE may
938    advertise IP prefixes and subnets behind a CE. In this use case, the CE
939    device does not run a dynamic routing protocol. Instead, these
940    prefixes are learnt on the PE via local policy or configuration.
941    Prefixes are then advertised by the PE as RT-5 with the CE as the GW.

943    The PE-CE control plane defined in this document MAY be used to learn
944    these prefixes from a CE as an alternative to local configuration on
945    the PE. Once an L3DL session is established between a CE and a PE, as
946    discussed earlier,

948      o A CE MAY send type 10/11 PDUs with these IPv4/IPv6 prefixes over
949        an L3DL session to a PE with the CE IP as the GW IP.
950      o A PE MAY advertise prefixes learnt via type 10/11 PDUs as RT-5
951        with the CE IP as the GW IP.

953    To summarize, a PE would advertise:

955      o RT-2 for the CE MAC-IP learnt via type 8/9 PDU
956      o RT-5 for Prefixes learnt via type 10/11 PDU with GW IP = CE IP

958 12.
Asymmetric EVPN-IRB

960    Any deviations from the above procedures proposed in this document
961    for asymmetric IRB design will be covered in subsequent updates to
962    this document.

964 13. Centralized Gateway EVPN-IRB

966    Any deviations from the above procedures proposed in this document
967    for centralized GW based IRB design will be covered in subsequent
968    updates to this document.

970 14. Use Cases

972 14.1 CE Application SLA

974    Application SLA requirements signaled by a CE to an EVPN PE provide a
975    mechanism for the EVPN provider network to provide overlay routing and
976    bridging services in accordance with customer application
977    requirements. As an example, a CE may specify an SLA requirement to
978    tunnel overlay application traffic destined to this CE over the
979    lowest delay path. An EVPN PE may signal this SLA requirement to
980    remote PEs along with the CE MAC-IP route, which in turn results in the
981    remote PEs bridging and routing traffic destined to this CE over
982    traffic engineered underlay paths that are set up using the lowest
983    delay metric.

985    Future revisions of this document will specify the exact encoding of
986    SLA bits to achieve different SLA requirements.

988 14.2 Simplified EVPN Operations

990    This section discusses in detail the benefits and simplifications
991    that may be achieved in the context of an EVPN network if one
992    chooses to implement the PE-CE control plane defined in this document,
993    as opposed to using traditional data-plane and ARP/ND snooping based PE-
994    CE learning.

996 14.2.1 EVPN All-active Multi-Homing

998                  +------------------------+
999                  | Underlay Network Fabric|
1000                 +------------------------+

1002                      BGP-EVPN Peering
1003    <--------------------------------------------------->
1004    +------+      +------+         +------+      +------+
1005    | PE1  |      | PE2  |  .....
                                       | PEx  |      | PEy  |
1006    +------+      +------+         +------+      +------+
1007       \              /               \              /
1008        \            /                 \            /
1009         \          /                   \          /
1010          \ ESI-a  /                     \ ESI-b  /

1012          LAG Bundle                     LAG Bundle
1013          to CE Host                     to CE Host

1015                          Figure 9

1017    Data plane and ARP/ND snooping based MAC/IP learning on PE-CE all-
1018    active multi-homed LAG ports is subject to unpredictable hashing of
1019    ARP, ND, and data frames from host to PE. As an example, an ARP
1020    request for a connected host might originate at PE1 but the resulting
1021    ARP response from the host might be received at PE2. Redundant EVPN
1022    PEs in all-active multi-homing mode typically handle this
1023    unpredictability via a combination of the methods below:

1025      o PEs can handle unsolicited ARP and ND response frames.
1026      o PEs can implement additional mechanisms to SYNC ARP, ND, and
1027        MAC tables across all PEs in a redundancy group for optimal
1028        forwarding to locally connected hosts.
1029      o PEs can implement EVPN aliasing procedures discussed in
1030        [RFC7432] OR re-originate SYNCed MAC-IP adjacencies as local RT-
1031        2 to achieve MAC ECMP across the overlay.
1032      o PEs can also re-originate SYNCed MAC-IP adjacencies as local
1033        RT-2 to achieve IP ECMP across the overlay OR implement IP
1034        aliasing procedures discussed in [EVPN-IP-ALIASING].
1035      o PEs can also ensure EVPN sequence number SYNC for local MAC
1036        entries for EVPN mobility procedures to work correctly, as
1037        discussed in [EVPN-IRB-MOBILITY].

1039    The PE-CE control plane learning alternative defined in this document
1040    fully decouples MAC and IP learning over MLAG ports from
1041    unpredictable hashing of data, ARP, and ND frames on all-active multi-
1042    homed LAG member links. As a result, the above procedures, which
1043    essentially result from data-plane PE-CE learning on all-active
1044    multi-homed LAGs, can be simplified via the PE-CE control plane
1045    alternative defined in this document.
1047 14.2.2 Convergence on CE Host Moves

1049       +------------------------+
1050       | Underlay Network Fabric|
1051       +------------------------+

1053            BGP-EVPN Peering
1054    <------------------------------>

1056    +------+  +------+       +------+
1057    | PE1  |  | PE2  | ..... | PEx  |
1058    +------+  +------+       +------+
1059        |         |              |
1060      Hosts     Hosts          Hosts

1062               Figure 10

1064    Host mobility across EVPN PE switches is a common occurrence in a
1065    data center fabric for flexibility in work load placement across a
1066    DC. Further, a host move must result in minimal, if any, disruption
1067    to traffic flows / services to / from the device.

1069    Data plane and ARP/ND snooping based PE-CE learning may result in
1070    unpredictable convergence times following host moves in the
1071    following cases:

1073      o A host may or may not send any data packet immediately following
1074        a move.
1075      o A host may or may not send an unsolicited ARP following a move.

1077    While probing procedures, discussed in the next sub-sections, are
1078    typically used to minimize convergence time, certain scenarios
1079    discussed below may still result in extended convergence times and
1080    flooding.

1082 14.2.2.1 Silent Hosts

1084    If a host is silent for an extended period following a move from PE1
1085    to PE2, any bridged traffic flow destined to this host will continue
1086    to be black-holed by PE1 until the MAC ages out at PE1. Once the
1087    MAC ages out at PE1, any bridged traffic flow destined to the host is
1088    flooded across the overlay bridge. Flooding of unknown unicast
1089    traffic on the overlay is enabled for this purpose. In summary, PE-CE
1090    learning that is based on data-plane and ARP/ND snooping may be
1091    subject to non-deterministic convergence time and flooding following
1092    host moves because of being heavily dependent on unpredictable CE
1093    behavior.
1095    PE-CE control plane based learning defined in this document fully
1096    decouples convergence in such scenarios from non-deterministic data
1097    flows and unsolicited ARP/ND behavior on a CE.

1099 14.2.2.2 Probing

1101    ARP and ND probing procedures are typically used to achieve host re-
1102    learning and convergence following host moves across the overlay:

1104      o Following a host move from PE1 to PE2, the host's MAC is
1105        discovered at PE2 as a local MAC via data frames received from
1106        the host. If PE2 has a prior REMOTE MAC-IP host route for this
1107        MAC from PE1, an ARP probe is typically triggered at PE2 to learn
1108        the MAC-IP as a local IP adjacency and to trigger an EVPN RT-2
1109        advertisement for this MAC-IP across the overlay with new
1110        reachability via PE2.

1112      o Following a host move from PE1 to PE2, once PE1 receives a MAC
1113        or MAC-IP route from PE2 with a higher sequence number, an ARP
1114        probe is triggered at PE1 to clear the stale local MAC-IP
1115        neighbor adjacency OR re-learn the local MAC-IP in case the host
1116        has moved back or is a duplicate.

1118      o Following a local MAC age-out, if there is a local IP adjacency
1119        with this MAC, an ARP probe is triggered for this IP to either
1120        re-learn the local MAC and maintain local L3 and L2 reachability
1121        to this host OR to clear the ARP entry in case the host is indeed
1122        no longer local. Note that clearing of stale ARP entries
1123        following a move is required for traffic to converge in the event
1124        that the host was silent and not discovered at its new location.
1125        Once the stale ARP entry for the host is cleared, a routed traffic
1126        flow destined for the host can re-trigger ARP discovery for this
1127        host at the new location. ARP flooding on the overlay MUST also be
1128        done to enable ARP discovery via routed flows.

1130      o Alternatively, the ARP probing timer may be tuned to be smaller
1131        than the MAC aging timer to avoid MAC age-out.
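The probe triggers listed above can be summarized as a small decision sketch. The Python below is illustrative only and models the data-plane-driven behavior that the PE-CE control plane is meant to avoid. `LocalHost` and the three function names are hypothetical and introduced only for this sketch, and the sequence number comparison follows the second bullet rather than any particular implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalHost:
    """Hypothetical view of a locally learned host on a PE."""
    mac: str
    ip: Optional[str]   # None when only the MAC is known locally
    seq: int            # EVPN mobility sequence number of the local route


def probe_on_remote_route(local: LocalHost, remote_seq: int) -> bool:
    """Second bullet: a remote MAC or MAC-IP route with a higher
    sequence number triggers a probe for the local MAC-IP adjacency."""
    return local.ip is not None and remote_seq > local.seq


def probe_on_mac_age_out(local: LocalHost) -> bool:
    """Third bullet: a local MAC age-out triggers a probe only if a
    local IP adjacency still refers to that MAC."""
    return local.ip is not None


def probe_timer_avoids_age_out(probe_interval_s: float,
                               mac_aging_s: float) -> bool:
    """Fourth bullet: the probing timer is tuned below the MAC aging
    timer so that the MAC is refreshed before it ages out."""
    return probe_interval_s < mac_aging_s
```

Each function returns only whether a probe is warranted; what happens after the probe (re-learning or clearing the adjacency) depends on whether the host answers.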
1133    The PE-CE control plane learning alternative defined in this document
1134    decouples host learning following moves from unpredictable host
1135    behavior with respect to sending data traffic and unsolicited ARPs,
1136    and as a result from ARP probing and MAC aging timer settings. Host
1137    move handling is hence greatly simplified to a very predictable and
1138    deterministic behavior.

1140 14.2.3 ARP Gleaning Latency

1142    If a CE's ARP binding is not already learned on a PE via an
1143    unsolicited ARP sent by the CE following events such as boot-up,
1144    flaps, and moves, a data frame that needs to be routed to the CE
1145    triggers the ARP or ND discovery process on the PE. On a typical
1146    hardware switching platform, an IP packet that does not resolve to a
1147    link layer re-write would be punted to the host stack, which delivers
1148    packets with incomplete link-layer resolution to ARP or ND for
1149    resolution. An ARP request / ND Solicitation is generated for the CE
1150    IP, and an ARP response or NA results in installing a link-layer
1151    re-write for the CE IP. In an EVPN multi-homing environment, this
1152    procedure is further complicated as the response is only received by
1153    one of the PEs, which may or may not be the one that generated the ARP
1154    or ND request. The learned neighbor binding is SYNCed to other PEs that
1155    share the multi-homed Ethernet Segment. Routed flows can now be
1156    forwarded to the host via all PEs. Latency associated with such data
1157    frame driven ARP discovery may result in a significant initial
1158    convergence hit following triggers that warrant re-gleaning of the CE
        IP to MAC binding.

1160    The PE-CE control plane learning alternative defined in this document
1161    results in proactive host learning following these scenarios,
1162    potentially avoiding a convergence hit on initial data packets.
1164 14.3 Applicability to non-EVPN Use Cases

1166    While the L3DL based host learning procedure described in this
1167    document focuses on EVPN-IRB overlay fabric use case, it may also
1168    have benefits and applicability in non-EVPN use cases. Applicability
1169    of procedures described in this document to non-EVPN use cases is a
1170    topic for further study.

1172 15. Summary

1174    PE-CE control plane is proposed as an alternative to data plane and
1175    ARP/ND snooping based PE-CE host MAC/IP learning and for PE-CE prefix
1176    learning. With a PE-CE control plane, CE host MAC and IP are
1177    deterministically learned on host boot-up, on host configuration,
1178    across host moves, on convergence triggers such as link failures,
1179    flaps, and PE re-boots and on all-active multi-homing LAG links. A
1180    PE-CE control plane decouples CE MAC and IP learning from traffic
1181    flows sourced by a CE, from varying CE behavior with respect to
1182    sending unsolicited ARP/ND frames, and from hashing of CE sourced
1183    frames over all-active multi-homed LAG links. As a result, it helps
1184    achieve a predictable and reliable convergence behavior across these
1185    triggers and helps simplify certain EVPN procedures that are
1186    otherwise needed with a data-plane and ARP/ND snooping based PE-CE
1187    learning. In addition, it may also be used for non-host learning use
1188    cases such as prefix learning.

1190 16. References

1192 16.1 Normative References

1194    [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A.,
1195              Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based
1196              Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February
1197              2015, .

1199    [L3DL]    Bush, R., Austein R., Patel, K., "Layer 3 Discovery and
1200              Liveness", Feb 2019, .

1203    [EVPN-IRB] Sajassi, A., Salem, S., Thoria S., Drake J., Rabadan J.,
1204              "Integrated Routing and Bridging in EVPN", July 2018,
1205              .
1208    [EVPN-PREFIX-ADV] Rabadan J., Henderickx W., Drake J., Lin W.,
1209              Sajassi, A., "IP Prefix Advertisement in EVPN", May 2018,
1210              .

1213    [EVPN-IRB-MOBILITY] Malhotra, N., Sajassi, A., Rabadan, J., Drake
1214              J., Lingala A., Patekar A., "Extended Mobility Procedures
1215              for EVPN-IRB", June 2019,
1216              .

1219    [EVPN-IP-ALIASING] Sajassi, A., Badoni, G., "L3 Aliasing and Mass
1220              Withdrawal Support for EVPN", July 2017,
1221              .

1224    [RFC2119] S. Bradner, "Key words for use in RFCs to Indicate
1225              Requirement Levels", March 1997,
1226              .

1228    [RFC8174] B. Leiba, "Ambiguity of Uppercase vs Lowercase in RFC 2119
1229              Key Words", May 2017,
1230              .

1232 16.2 Informative References
1233 17. Acknowledgements

1235    Authors would like to thank Randy Bush and Rob Austein for detailed
1236    review and feedback to ensure consistency with base L3DL protocol
1237    specification, as well as for helping build detailed L3DL flows
1238    included in this document.

1240    Authors would like to thank Ali Sajassi and John Drake for detailed
1241    review and very valuable input on PE-CE protocol design for EVPN use
1242    cases as well as structuring this document for EVPN use cases.

1244 Contributors

1246    Randy Bush
1247    Arrcus & IIJ
1248    5147 Crystal Springs
1249    Bainbridge Island, WA 98110
1250    United States of America
1251    Email: randy@psg.com

1253 Authors' Addresses

1255    Neeraj Malhotra (Editor)
1256    Individual
1257    Email: neeraj.ietf@gmail.com

1259    Keyur Patel
1260    Arrcus
1261    2077 Gateway Place, Suite #400
1262    San Jose, CA 95119, USA
1263    Email: keyur@arrcus.com

1265    Jorge Rabadan
1266    Nokia
1267    777 E. Middlefield Road
1268    Mountain View, CA 94043, USA
1269    Email: jorge.rabadan@nokia.com