2 RTG Working Group C. Bookham, Ed.
3 Internet-Draft A. Stone
4 Intended status: Informational Nokia
5 Expires: December 26, 2020 J. Tantsura
6 Apstra
7 M. Durrani
8 Equinix Inc
9 B. Decraene
10 Orange
11 June 24, 2020

13 An Architecture for Network Function Interconnect
14 draft-bookham-rtgwg-nfix-arch-01

16 Abstract

18 The emergence of technologies such as 5G, the Internet of Things 19 (IoT), and Industry 4.0, coupled with the move towards network 20 function virtualization, means that the service requirements demanded 21 from networks are changing. This document describes an architecture 22 for a Network Function Interconnect (NFIX) that allows for 23 interworking of physical and virtual network functions in a unified 24 and scalable manner across wide-area network and data center domains 25 while maintaining the ability to deliver against SLAs.
27 Requirements Language 29 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 30 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 31 document are to be interpreted as described in BCP 14 32 [RFC2119][RFC8174] when, and only when, they appear in all capitals, 33 as shown here. 35 Status of This Memo 37 This Internet-Draft is submitted in full conformance with the 38 provisions of BCP 78 and BCP 79. 40 Internet-Drafts are working documents of the Internet Engineering 41 Task Force (IETF). Note that other groups may also distribute 42 working documents as Internet-Drafts. The list of current Internet- 43 Drafts is at https://datatracker.ietf.org/drafts/current/. 45 Internet-Drafts are draft documents valid for a maximum of six months 46 and may be updated, replaced, or obsoleted by other documents at any 47 time. It is inappropriate to use Internet-Drafts as reference 48 material or to cite them other than as "work in progress." 49 This Internet-Draft will expire on December 26, 2020. 51 Copyright Notice 53 Copyright (c) 2020 IETF Trust and the persons identified as the 54 document authors. All rights reserved. 56 This document is subject to BCP 78 and the IETF Trust's Legal 57 Provisions Relating to IETF Documents 58 (https://trustee.ietf.org/license-info) in effect on the date of 59 publication of this document. Please review these documents 60 carefully, as they describe your rights and restrictions with respect 61 to this document. Code Components extracted from this document must 62 include Simplified BSD License text as described in Section 4.e of 63 the Trust Legal Provisions and are provided without warranty as 64 described in the Simplified BSD License. 66 Table of Contents 68 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 69 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 70 3. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 4 71 4. Requirements . . . . . . . . . . . . . . . . . . . . . . . . 6 72 5. Theory of Operation . . . . . . . . . . . . . . . . . . . . . 7 73 5.1. VNF Assumptions . . . . . . . . . . . . . . . . . . . . . 7 74 5.2. Overview . . . . . . . . . . . . . . . . . . . . . . . . 8 75 5.3. Use of a Centralized Controller . . . . . . . . . . . . . 9 76 5.4. Routing and LSP Underlay . . . . . . . . . . . . . . . . 11 77 5.4.1. Intra-Domain Routing . . . . . . . . . . . . . . . . 11 78 5.4.2. Inter-Domain Routing . . . . . . . . . . . . . . . . 13 79 5.4.3. Intra-Domain and Inter-Domain Traffic-Engineering . . 14 80 5.5. Service Layer . . . . . . . . . . . . . . . . . . . . . . 17 81 5.6. Service Differentiation . . . . . . . . . . . . . . . . . 19 82 5.7. Automated Service Activation . . . . . . . . . . . . . . 20 83 5.8. Service Function Chaining . . . . . . . . . . . . . . . . 21 84 5.9. Stability and Availability . . . . . . . . . . . . . . . 23 85 5.9.1. IGP Reconvergence . . . . . . . . . . . . . . . . . . 23 86 5.9.2. Data Center Reconvergence . . . . . . . . . . . . . . 23 87 5.9.3. Exchange of Inter-Domain Routes . . . . . . . . . . . 24 88 5.9.4. Controller Redundancy . . . . . . . . . . . . . . . . 24 89 5.9.5. Path and Segment Liveliness . . . . . . . . . . . . . 26 90 5.10. Scalability . . . . . . . . . . . . . . . . . . . . . . . 28 91 5.10.1. Asymmetric Model B for VPN Families . . . . . . . . 30 92 6. Illustration of Use . . . . . . . . . . . . . . . . . . . . . 32 93 6.1. Reference Topology . . . . . . . . . . . . . . . . . . . 32 94 6.2. PNF to PNF Connectivity . . . 
. . . . . . . . . . . . . . 34 95 6.3. VNF to PNF Connectivity . . . . . . . . . . . . . . . . . 35 96 6.4. VNF to VNF Connectivity . . . . . . . . . . . . . . . . . 36 98 7. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 37 99 8. Security Considerations . . . . . . . . . . . . . . . . . . . 38 100 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 38 101 10. Contributors . . . . . . . . . . . . . . . . . . . . . . . . 38 102 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39 103 12. References . . . . . . . . . . . . . . . . . . . . . . . . . 39 104 12.1. Normative References . . . . . . . . . . . . . . . . . . 39 105 12.2. Informative References . . . . . . . . . . . . . . . . . 40 106 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 45

108 1. Introduction

110 With the introduction of technologies such as 5G, the Internet of 111 Things (IoT), and Industry 4.0, service requirements are changing. 112 In addition to the ever-increasing demand for more capacity, these 113 services have other stringent service requirements that need to be 114 met, such as ultra-reliable and/or low-latency communication.

116 Parallel to this, there is a continued trend to move towards network 117 function virtualization. Operators are building digitalized 118 infrastructure capable of hosting numerous virtualized network 119 functions (VNFs): infrastructure that can scale in and scale out 120 depending on application demand and can deliver flexibility and 121 service velocity. Much of this virtualization activity is driven by 122 the aforementioned emerging technologies as new infrastructure is 123 deployed in support of them. To try to meet the new service 124 requirements, some of these VNFs are becoming more dispersed, so it is 125 common for networks to have a mix of centralized medium- or large- 126 sized data centers together with more distributed smaller 127 'edge-clouds'. VNFs hosted within these data centers require 128 seamless connectivity to each other, and to their existing physical 129 network function (PNF) counterparts. This connectivity also needs to 130 deliver against agreed SLAs.

132 Coupled with the deployment of virtualization is automation. Many of 133 these VNFs are deployed within SDN-enabled data centers where 134 automation is simply a must-have capability to improve service 135 activation lead-times. The expectation is that services will be 136 instantiated in an abstract point-and-click manner and be 137 automatically created by the underlying network, dynamically adapting 138 to service connectivity changes as virtual entities move between 139 hosts.

141 This document describes an architecture for a Network Function 142 Interconnect (NFIX) that allows for interworking of physical and 143 virtual network functions in a unified and scalable manner. It 144 describes a mechanism for establishing connectivity across multiple 145 discrete domains in both the wide-area network (WAN) and the data 146 center (DC) while maintaining the ability to deliver against SLAs. 147 To achieve this, NFIX works with the underlying topology to build a 148 unified over-the-top topology.

150 The NFIX architecture described in this document does not define any 151 new protocols but rather outlines an architecture utilizing a 152 collaboration of existing standards-based protocols.

154 2.
Terminology

156 o A physical network function (PNF) refers to a network device such 157 as a Provider Edge (PE) router that connects physically to the 158 wide-area network.

160 o A virtualized network function (VNF) refers to a network device 161 such as a Provider Edge (PE) router that is hosted on an 162 application server. The VNF may be bare-metal in that it consumes 163 the entire resources of the server, or it may be one of numerous 164 virtual functions instantiated as a VM or a number of containers on 165 a given server that is controlled by a hypervisor or container 166 management platform.

168 o A Data Center Border (DCB) router refers to the network function 169 that spans the border between the wide-area and the data center 170 networks, typically interworking the different encapsulation 171 techniques employed within each domain.

173 o An Interconnect controller is the controller responsible for 174 managing the NFIX fabric and services.

176 o A DC controller is the term used for a controller that resides 177 within an SDN-enabled data center and is responsible for the DC 178 network(s).

180 3. Motivation

182 Industrial automation and business-critical environments use 183 applications that are demanding on the network. These applications 184 present different requirements, from low-latency to high-throughput 185 to application-specific traffic conditioning, or a combination. The 186 evolution to 5G equally presents challenges for mobile back-, front- 187 and mid-haul networks. The requirement for ultra-reliable low- 188 latency communication means that operators need to re-evaluate their 189 network architecture to meet these requirements.

191 At the same time, the service edge is evolving. Where the service 192 edge device was historically a PNF, the adoption of virtualization 193 means VNFs are becoming more commonplace. Typically, these VNFs are 194 hosted in some form of data center environment but require end-to-end 195 connectivity to other VNFs and/or other PNFs. This represents a 196 challenge because generally transport layer connectivity differs 197 between the WAN and the data center environment. The WAN includes 198 all levels of hierarchy (core, aggregation, access) that form the 199 network's footprint, where transport layer connectivity using IP/MPLS 200 is commonplace. In the data center, native IP is commonplace, 201 utilizing network virtualization overlay (NVO) technologies such as 202 virtual extensible LAN (VXLAN) [RFC7348], network virtualization 203 using generic routing encapsulation (NVGRE) [RFC7637], or generic 204 network virtualization encapsulation (GENEVE) [I-D.ietf-nvo3-geneve]. 205 There is a requirement to seamlessly integrate these islands and 206 avoid heavy-lifting at interconnects as well as providing a means to 207 provision end-to-end services with a single touch point at the edge.

209 The service edge boundary is also changing. Some functions that were 210 previously reasonably centralized are now becoming more distributed. 211 One reason for this is to attempt to deal with low latency 212 requirements. Another reason is that operators seek to reduce costs 213 by deploying low/medium-capacity VNFs closer to the edge. Equally, 214 virtualization also sees some of the access network moving towards 215 the core. Examples of this include cloud-RAN or Software-Defined 216 Access Networks.

218 Historically, service providers have architected data centers 219 independently from the wide-area network, creating two independent 220 domains or islands.
As VNFs become part of the service landscape, the 221 service data-path must be extended across the WAN into the data 222 center infrastructure, but in a manner that still allows operators to 223 meet deterministic performance requirements. Methods for stitching 224 WAN and DC infrastructures together with some form of service- 225 interworking at the data center border have been implemented and 226 deployed, but this service-interworking approach has several 227 limitations:

229 o The data center environment typically uses encapsulation 230 techniques such as VXLAN or NVGRE while the WAN typically uses 231 encapsulation techniques such as MPLS [RFC3031]. Underlying 232 optical infrastructure might also need to be programmed. These 233 are incompatible and require interworking at the service layer.

235 o It typically requires heavy-touch service provisioning on the data 236 center border. In an end-to-end service, midpoint provisioning is 237 undesirable and should be avoided.

239 o Automation is difficult, largely due to the first two points but 240 with additional contributing factors. In the virtualization world 241 automation is a must-have capability.

243 o When a service is operating at Layer 3 in a data center with 244 redundant interconnects, the risk of routing loops exists. There 245 is no inherent loop avoidance mechanism when redistributing routes 246 between address families, so extreme care must be taken. Proposals 247 such as the Domain Path (D-PATH) attribute 248 [I-D.ietf-bess-evpn-ipvpn-interworking] attempt to address this 249 issue but as yet are not widely implemented or deployed.

251 o Some or all of the above make the service-interworking gateway 252 cumbersome with questionable scaling attributes.

254 Hence there is a requirement to create an open, scalable, and unified 255 network architecture that brings together the wide-area network and 256 data center domains. It is not an architecture exclusively targeted 257 at greenfield deployments, nor does it require a flag-day upgrade to 258 deploy in a brownfield network. It is an evolutionary step to a 259 consolidated network that uses the constructs of seamless MPLS 260 [I-D.ietf-mpls-seamless-mpls] as a baseline and extends upon that to 261 include topologies that may not be link-state based and to provide 262 end-to-end path control. Overall, the NFIX architecture does the 263 following:

265 o Allows for an evolving service edge boundary without having to 266 constantly restructure the architecture.

268 o Provides a mechanism for seamless VNF-to-VNF, VNF-to-PNF, and 269 PNF-to-PNF connectivity, with deterministic SLAs, 270 and with the ability to provide differentiated SLAs to suit 271 different service requirements.

273 o Delivers a unified transport fabric using Segment Routing (SR) 274 [RFC8402] where service delivery mandates touching only the 275 service edge without imposing additional encapsulation 276 requirements in the DC.

278 o Embraces automation by providing an environment where any end-to- 279 end connectivity can be instantiated in a single-request manner 280 while maintaining SLAs.

282 4. Requirements

284 The following section outlines the requirements that the proposed 285 solution must meet. From an overall perspective, the proposed 286 generic architecture must:

288 o Deliver end-to-end transport LSPs using traffic-engineering (TE) 289 as required to meet appropriate SLAs for the service(s) 290 using those LSPs.
End-to-end refers to VNF and/or PNF 291 connectivity or a combination of both. 293 o Provide a solution that allows for optimal end-to-end path 294 placement; where optimal not only meets the requirements of the 295 path in question but also meets the global network objectives. 297 o Support varying types of VNF physical network attachment and 298 logical (underlay/overlay) connectivity. 300 o Facilitate automation of service provision. As such the solution 301 should avoid heavy-touch service provisioning and decapsulation/ 302 encapsulation at data center border routers. 304 o Provide a framework for delivering logical end-to-end networks 305 using differentiated logical topologies and/or constraints. 307 o Provide a high level of stability; faults in one domain should not 308 propagate to another domain. 310 o Provide a mechanism for homogeneous end-to-end OAM. 312 o Hide/localize instabilities in the different domains that 313 participate in the end-to-end service. 315 o Provide a mechanism to minimize the label-stack depth required at 316 path head-ends for SR-TE LSPs. 318 o Offer a high level of scalability. 320 o Although not considered in-scope of the current version of this 321 document, the solution should not preclude the deployment of 322 multicast. This subject may be covered in later versions of this 323 document. 325 5. Theory of Operation 327 This section describes the NFIX architecture including the building 328 blocks and protocol machinery that is used to form the fabric. Where 329 considered appropriate rationale is given for selection of an 330 architectural component where other seemingly applicable choices 331 could have been made. 333 5.1. VNF Assumptions 335 For the sake of simplicity, references to VNF are made in a broad 336 sense. Equally, the differences between VNF and Container Network 337 Function (CNF) are largely immaterial for the purposes of this 338 document, therefore VNF is used to represent both. The way in which 339 a VNF is instantiated and provided network connectivity will differ 340 based on environment and VNF capability, but for conciseness this is 341 not explicitly detailed with every reference to a VNF. Common 342 examples of VNF variants include but are not limited to: 344 o A VNF that functions as a routing device and has full IP routing 345 and MPLS capabilities. It can be connected simultaneously to the 346 data center fabric underlay and overlay and serves as the NVO 347 tunnel endpoint [RFC8014]. Examples of this might be a 348 virtualized PE router, or a virtualized Broadband Network Gateway 349 (BNG). 351 o A VNF that functions as a device (host or router) with limited IP 352 routing capability. It does not connect directly to the data 353 center fabric underlay but rather connects to one or more external 354 physical or virtual devices that serve as the NVO tunnel 355 endpoint(s). It may however have single or multiple connections 356 to the overlay. Examples of this might be a mobile network 357 control or management plane function. 359 o A VNF that has no routing capability. It is a virtualized 360 function hosted within an application server and is managed by a 361 hypervisor or container host. The hypervisor/container host acts 362 as the NVO endpoint and interfaces to some form of SDN controller 363 responsible for programming the forwarding plane of the 364 virtualization host using, for example, OpenFlow. 
Examples of 365 this might be an Enterprise application server or a web server 366 running as a virtual machine and front-ended by a virtual routing 367 function such as OVS/xVRS/VTF.

369 Where considered necessary, exceptions to the examples provided above, 370 or a focus on a particular scenario, will be highlighted.

372 5.2. Overview

374 The NFIX architecture makes no assumptions about how the network is 375 physically composed, nor does it impose any dependencies upon it. It 376 also makes no assumptions about IGP hierarchies; the use of areas/ 377 levels or discrete IGP instances within the WAN is fully endorsed to 378 enhance scalability and constrain fault propagation. This could 379 apply, for instance, to a hierarchical WAN from core to edge or from 380 WAN to LAN connections. The overall architecture uses the constructs 381 of seamless MPLS as a baseline and extends upon that. The concept of 382 decomposing the network into multiple domains is one that has been 383 widely deployed and has been proven to scale in networks with large 384 numbers of nodes.

386 The proposed architecture uses segment routing (SR) as its preferred 387 choice of transport. Segment routing is chosen for construction of 388 end-to-end LSPs given its ability to traffic-engineer through source- 389 routing while concurrently scaling exceptionally well due to its lack 390 of network state other than at the ingress node. This document uses SR 391 instantiated on an MPLS forwarding plane (SR-MPLS), although it does 392 not preclude the use of SRv6 either now or at some point in the 393 future. The rationale for selecting SR-MPLS is simply maturity and 394 more widespread applicability across a potentially broad range of 395 network devices. This document may be updated in future versions to 396 include more description of SRv6 applicability.

398 5.3. Use of a Centralized Controller

400 It is recognized that for most operators the move towards the use of 401 a controller within the wide-area network is a significant change in 402 operating model. In the NFIX architecture it is a necessary 403 component. Its use is not simply to offload inter-domain path 404 calculation from network elements; it provides many more benefits:

406 o It offers the ability to enforce constraints on paths that 407 originate/terminate on different network elements, thereby 408 providing path diversity, and/or bidirectionality/co-routing, and/ 409 or disjointness.

411 o It avoids collisions, re-tries, and packing problems that have been 412 observed in networks using distributed TE path calculation, where 413 head-ends make autonomous decisions.

415 o A controller can take a global view of path placement strategies, 416 including the ability to make path placement decisions over a high 417 number of LSPs concurrently as opposed to considering each LSP 418 independently. In turn, this allows for 'global' optimization of 419 network resources such as available capacity.

421 o A controller can make decisions based on near-real-time network 422 state and optimize paths accordingly. For example, if a network 423 link becomes congested, it may recompute some of the paths 424 transiting that link to other links that may not be quite as 425 optimal but do have available capacity. Or if a link latency 426 crosses a certain threshold, it may choose to reoptimize some 427 latency-sensitive paths away from that link.

429 o The logic of a controller can be extended beyond pure path 430 computation and placement.
If the controller is aware of 431 services, service requirements, and available paths within the 432 network, it can cross-correlate between them and ensure that the 433 appropriate paths are used for the appropriate services.

435 o The controller can provide assurance and verification of the 436 underlying SLA provided to a given service.

438 As the main objective of the NFIX architecture is to unify the data 439 center and wide-area network domains, using the term controller is 440 not sufficiently precise. The centralized controller may need to 441 interface to other controllers that potentially reside within an SDN- 442 enabled data center. Therefore, to avoid interchangeably using the 443 term controller for both functions, we distinguish between them 444 simply by using the terms 'DC controller', which, as the name suggests, 445 is responsible for the DC, and 'Interconnect controller', which is responsible 446 for managing the extended SR fabric and services.

448 The Interconnect controller learns wide-area network topology 449 information and allocation of segment routing SIDs within that domain 450 using BGP link-state [RFC7752] with appropriate SR extensions. 451 Equally, it learns data center topology information and Prefix-SID 452 allocation using BGP labeled unicast [RFC8277] with appropriate SR 453 extensions, or BGP link-state if a link-state IGP is used within the 454 data center. If Route-Reflection is used for exchange of BGP link- 455 state or labeled unicast NLRI within one or more domains, then the 456 Interconnect controller need only peer as a client with those Route- 457 Reflectors in order to learn topology information.

459 Where BGP link-state is used to learn the topology of a data center 460 (or any IGP routing domain), the BGP-LS Instance Identifier (Instance- 461 ID) is carried within Node/Link/Prefix NLRI and is used to identify a 462 given IGP routing domain. Where labeled unicast BGP is used to 463 discover the topology of one or more data center domains, there is no 464 equivalent way for the Interconnect controller to achieve a level of 465 routing domain correlation. The controller may learn a splintered 466 connectivity map consisting of 10 leaf switches, four spine switches, 467 and four DCBs, but it needs some form of key to inform it that leaf 468 switches 1-5, spine switches 1 and 2, and DCBs 1 and 2 belong to 469 data center 1, while leaf switches 6-10, spine switches 3 and 4, and 470 DCBs 3 and 4 belong to data center 2. What is needed is a form of 471 'data center membership identification' to provide this correlation. 472 Optionally, this could be achieved at the BGP level using a standard 473 community to represent each data center, or it could be done at a 474 more abstract level where, for example, the DC controller provides the 475 membership identification to the Interconnect controller through an 476 application programming interface (API).

478 Understanding real-time network state is an important part of the 479 Interconnect controller's role, and only with this information is the 480 controller able to make informed decisions and take preventive or 481 corrective actions as necessary. There are numerous methods 482 implemented and deployed that allow for harvesting of network state, 483 including (but not limited to) IPFIX [RFC7011], Netconf/YANG 484 [RFC6241][RFC6020], streaming telemetry, BGP link-state [RFC7752] 485 [I-D.ietf-idr-te-lsp-distribution], and the BGP Monitoring Protocol 486 (BMP) [RFC7854].

488 5.4.
Routing and LSP Underlay

490 This section describes the mechanisms and protocols that are used to 491 establish end-to-end LSPs, where end-to-end refers to VNF-to-VNF, 492 PNF-to-PNF, or VNF-to-PNF.

494 5.4.1. Intra-Domain Routing

496 In a seamless MPLS architecture, domains are based on geographic 497 dispersion (core, aggregation, access). Within this document, a 498 domain is considered to be any entity with a captive topology, be it a 499 link-state topology or otherwise. Where reference is made to the 500 wide-area network domain, it refers to the one or more domains that 501 together constitute the wide-area network.

503 This section discusses the basic building blocks required within the 504 wide-area network and the data center, noting from above that the 505 wide-area network may itself consist of multiple domains.

507 5.4.1.1. Wide-Area Network Domains

509 The wide-area network includes all levels of hierarchy (core, 510 aggregation, access) that constitute the network's MPLS footprint as 511 well as the data center border routers. Each domain that constitutes 512 part of the wide-area network runs a link-state interior gateway 513 protocol (IGP) such as IS-IS or OSPF, and each domain may use IGP- 514 inherent hierarchy (OSPF areas, IS-IS levels) with an assumption that 515 visibility is domain-wide using, for example, L2 to L1 516 redistribution. Alternatively, or additionally, there may be 517 multiple domains that are split by using separate and distinct 518 instances of IGP. There is no requirement for IGP redistribution of 519 any link or loopback addresses between domains.

521 Each IGP should be enabled with the relevant extensions for segment 522 routing [RFC8667][RFC8665], and each SR-capable router should 523 advertise a Node-SID for its loopback address, and an Adjacency-SID 524 (Adj-SID) for every connected interface (unidirectional adjacency) 525 belonging to the SR domain. SR Global Blocks (SRGB) can be allocated 526 to each domain as deemed appropriate to specific network 527 requirements. Border routers belonging to multiple domains have an 528 SRGB for each domain.

530 The default forwarding path for intra-domain LSPs that do not require 531 TE is simply an SR LSP containing a single label advertised by the 532 destination as a Node-SID and representing the ECMP-aware shortest 533 path to that destination. Intra-domain TE LSPs are constructed as 534 required by the Interconnect controller. Once a path is calculated, 535 it is advertised as an explicit SR Policy 536 [I-D.ietf-spring-segment-routing-policy] containing one or more paths 537 expressed as one or more segment-lists, which may optionally contain 538 binding SIDs if requirements dictate. An SR Policy is identified 539 through the tuple [headend, color, endpoint] and this tuple is used 540 extensively by the Interconnect controller to associate services with 541 an underlying SR Policy that meets its objectives.

543 To provide support for ECMP, the Entropy Label [RFC6790][RFC8662] 544 should be utilized. Entropy Label Capability (ELC) should be 545 advertised into the IGP using the IS-IS Prefix Attributes TLV 546 [I-D.ietf-isis-mpls-elc] or the OSPF Extended Prefix TLV 547 [I-D.ietf-ospf-mpls-elc] coupled with the Node MSD Capability sub-TLV 548 to advertise Entropy Readable Label Depth (ERLD) [RFC8491][RFC8476] 549 and the base MPLS Imposition (BMI). Equally, support for ELC 550 together with the supported ERLD should be signaled in BGP using the 551 BGP Next-Hop Capability [I-D.ietf-idr-next-hop-capability].
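As a non-normative illustration of the ELC/ERLD behaviour described above, the following sketch (Python, with hypothetical names and values) shows the kind of check an ingress might perform before imposing an <ELI, EL> pair. It is a deliberate simplification; real placements may require multiple pairs, as discussed later in this document.

   ELI = 7  # Entropy Label Indicator, a special-purpose label value (RFC 6790)

   def impose_entropy(label_stack, egress_elc, min_erld, entropy_label):
       """Return the label stack to impose, optionally with an <ELI, EL> pair.

       label_stack   -- labels (topmost first) for the SR path
       egress_elc    -- True if the egress signalled Entropy Label Capability
       min_erld      -- smallest Entropy Readable Label Depth along the path
       entropy_label -- EL value derived from a hash of the flow
       """
       if not egress_elc:
           # Without knowledge of egress ELC the entropy label must not be used.
           return label_stack
       # Keep the <ELI, EL> pair within min_erld labels of the top of the
       # stack so that load-balancing nodes can read it; real placements
       # may need multiple pairs (e.g. above and below a Binding-SID).
       depth = min(len(label_stack), max(min_erld - 2, 0))
       return label_stack[:depth] + [ELI, entropy_label] + label_stack[depth:]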
Ingress 552 nodes and/or DCBs should ensure sufficient entropy is applied to 553 packets to exercise available ECMP links.

555 5.4.1.2. Data Center Domain

557 The data center domain includes all fabric switches, network 558 virtualization edge (NVE) devices, and the data center border routers. The 559 data center routing design may align with the framework of [RFC7938] 560 running eBGP single-hop sessions established over direct point-to- 561 point links, or it may use an IGP for dissemination of topology 562 information. This document focuses on the former, simply because the 563 use of an IGP largely makes the data center's behaviour analogous to 564 that of a wide-area network domain.

566 The chosen method of transport or encapsulation within the data 567 center for NFIX is SR-MPLS over IP/UDP [RFC8663] or, where possible, 568 native SR-MPLS. The choice of SR-MPLS over IP/UDP or native SR-MPLS 569 allows for good entropy to maximize the use of equal-cost Clos fabric 570 links. Native SR-MPLS encapsulation provides entropy through use of 571 the Entropy Label, and, like the wide-area network, support for ELC 572 together with the supported ERLD should be signaled using the BGP Next- 573 Hop Capability attribute. As described in [RFC6790], the ELC is an 574 indication from the egress node of an MPLS tunnel to the ingress node 575 of the MPLS tunnel that it is capable of processing an Entropy Label. 576 The BGP Next-Hop Capability is a non-transitive attribute which is 577 modified or deleted when the next-hop is changed to reflect the 578 capabilities of the new next-hop. If we assume that the path of a 579 BGP-signaled LSP transits through multiple ASNs, and/or a single ASN 580 with multiple next-hops, then it is not possible for the ingress node 581 to determine the ELC of the egress node. Without this end-to-end 582 signaling capability, the entropy label must only be used when it is 583 explicitly known, through configuration or other means, that the 584 egress node has support for it. Entropy for SR-MPLS over IP/UDP 585 encapsulation uses the source UDP port for IPv4 and the Flow Label 586 for IPv6. Again, the ingress network function should ensure 587 sufficient entropy is applied to exercise available ECMP links.

589 Another significant advantage of the use of native SR-MPLS or SR-MPLS 590 over IP/UDP is that it allows for a lightweight interworking function 591 at the DCB without the requirement for midpoint provisioning; 592 interworking between the data center and the wide-area network 593 domains becomes an MPLS label swap/continue action.

595 Loopback addresses of network elements within the data center are 596 advertised using labeled unicast BGP with the addition of SR Prefix 597 SID extensions [RFC8669] containing a globally unique and persistent 598 Prefix-SID. The data-plane encapsulation of SR-MPLS over IP/UDP or 599 native SR-MPLS allows network elements within the data center to 600 consume BGP Prefix-SIDs and legitimately use those in the 601 encapsulation.

603 5.4.2. Inter-Domain Routing

605 Inter-domain routing is responsible for establishing connectivity 606 between any domains that form the wide-area network, and between the 607 wide-area network and data center domains. It is considered unlikely 608 that every end-to-end LSP will require a TE path, hence there is a 609 requirement for a default end-to-end forwarding path. This default 610 forwarding path may also become the path of last resort in the event 611 of a non-recoverable failure of a TE path.
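As a non-normative illustration of this last point, the following sketch (Python, with purely illustrative structures and attribute names) shows one way a service edge might prefer a traffic-engineered SR Policy and fall back to the labeled BGP path as the path of last resort; the resolution behaviour itself is described in the following sections.

   def resolve_transport(service_route, sr_policies, bgp_lu_labels):
       """Choose the transport used to resolve a service route's next-hop.

       Prefer a matching SR Policy (TE path); otherwise fall back to the
       labeled BGP route, i.e. the default path of last resort described
       above.  All structures and attribute names are illustrative.
       """
       key = (service_route.headend, service_route.color, service_route.next_hop)
       policy = sr_policies.get(key)
       if policy is not None and policy.is_up:
           return ("sr-policy", policy.binding_sid)
       label = bgp_lu_labels.get(service_route.next_hop)
       if label is not None:
           return ("bgp-lu", label)
       return ("unresolved", None)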
Similar to the seamless 612 MPLS architecture, this inter-domain MPLS connectivity is realized 613 using labeled unicast BGP [RFC8277] with the addition of SR Prefix 614 SID extensions.

616 Within each wide-area network domain all service edge routers, DCBs, 617 and ABRs/ASBRs form part of the labeled BGP mesh, which can be either 618 full-mesh, or more likely based on the use of route-reflection. Each 619 of these routers advertises its respective loopback addresses into 620 labeled BGP together with an MPLS label and a globally unique Prefix- 621 SID. Routes are advertised between wide-area network domains by 622 ABRs/ASBRs that impose next-hop-self on advertised routes. The 623 function of imposing next-hop-self for labeled routes means that the 624 ABR/ASBR allocates a new label for advertised routes and programs a 625 label-swap entry in the forwarding plane for received and advertised 626 routes. In short, it becomes part of the forwarding path.

628 DCB routers have labeled BGP sessions towards the wide-area network 629 and labeled BGP sessions towards the data center. Routes are 630 bidirectionally advertised between the domains subject to policy, 631 with the DCB imposing itself as next-hop on advertised routes. As 632 above, the function of imposing next-hop-self for labeled routes 633 implies allocation of a new label for advertised routes and a label- 634 swap entry being programmed in the forwarding plane for received and 635 advertised labels. The DCB thereafter becomes the anchor point 636 between the wide-area network domain and the data center domain.

638 Within the wide-area network, next-hops for labeled unicast routes 639 containing Prefix-SIDs are resolved to SR LSPs, and within the data 640 center domain next-hops for labeled unicast routes containing Prefix- 641 SIDs are resolved to SR LSPs or IP/UDP tunnels. This provides end- 642 to-end connectivity without a traffic-engineering capability.

644 +---------------+ +----------------+ +---------------+
645 | Data Center | | Wide-Area | | Wide-Area |
646 | +-----+ Domain 1 +-----+ Domain 'n' |
647 | | DCB | | ABR | |
648 | +-----+ +-----+ |
649 | | | | | |
650 +---------------+ +----------------+ +---------------+
651 <-- SR/SRoUDP --> <---- IGP/SR ----> <--- IGP/SR ---->
652 <--- BGP-LU ---> NHS <--- BGP-LU ---> NHS <--- BGP-LU --->

654 Default Inter-Domain Forwarding Path

656 Figure 1

658 5.4.3. Intra-Domain and Inter-Domain Traffic-Engineering

660 The capability to traffic-engineer intra- and inter-domain end-to-end 661 paths is considered a key requirement in order to meet the service 662 objectives previously outlined. To achieve optimal end-to-end path 663 placement, the key components to be considered are path calculation, 664 path activation, and FEC-to-path binding procedures.

666 In the NFIX architecture, end-to-end path calculation is performed by 667 the Interconnect controller. The mechanics of how the objectives of 668 each path are calculated are beyond the scope of this document. Once a 669 path is calculated based upon its objectives and constraints, the 670 path is advertised from the controller to the LSP headend as an 671 explicit SR Policy containing one or more paths expressed as one or 672 more segment-lists. An SR Policy is identified through the tuple 673 [headend, color, endpoint] and this tuple is used extensively by the 674 Interconnect controller to associate services with an underlying SR 675 Policy that meets its objectives.
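The following is a minimal, non-normative sketch (Python, with hypothetical names and label values) of how a controller might model an SR Policy keyed by the [headend, color, endpoint] tuple, with one or more weighted segment-lists and an associated Binding-SID:

   from dataclasses import dataclass, field
   from typing import List

   @dataclass(frozen=True)
   class PolicyKey:
       headend: str    # e.g. loopback of the LSP headend
       color: int      # identifies the SLA/logical topology
       endpoint: str   # e.g. loopback of the far-end VNF/PNF

   @dataclass
   class SegmentList:
       segments: List[int]   # label values: Node-, Adj- or Binding-SIDs
       weight: int = 1       # permits ECMP/UCMP across segment-lists

   @dataclass
   class SRPolicy:
       key: PolicyKey
       binding_sid: int                 # BSID anchoring this policy
       segment_lists: List[SegmentList] = field(default_factory=list)

   # Hypothetical policy pushed by the controller to headend "VNF1":
   # steer via a DCB Node-SID, then a Binding-SID, then the endpoint.
   policy = SRPolicy(PolicyKey("VNF1", 10, "VNF2"), binding_sid=24002,
                     segment_lists=[SegmentList([16001, 24001, 17002])])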
677 The segment-list of an SR Policy encodes a source-routed path towards 678 the endpoint. When calculating the segment-list, the Interconnect 679 controller makes comprehensive use of the Binding-SID (BSID), 680 instantiating BSID anchors as necessary at path midpoints when 681 calculating and activating a path. The use of BSID is considered 682 fundamental to segment routing as described in 683 [I-D.filsfils-spring-sr-policy-considerations]. It provides opacity 684 between domains, ensuring that any segment churn is constrained to a 685 single domain. It also reduces the number of segments/labels that 686 the headend needs to impose, which is particularly important given 687 that network elements within a data center generally have limited 688 label imposition capabilities. In the context of the NFIX 689 architecture, it is also the vehicle that allows for removal of heavy 690 midpoint provisioning at the DCB.

692 For example, assume that VNF1 is situated in data center 1, which is 693 interconnected to the wide-area network via DCB1. VNF1 requires 694 connectivity to VNF2, situated in data center 2, which is 695 interconnected to the wide-area network via DCB2. Assuming there is 696 no existing TE path that meets VNF1's requirements, the Interconnect 697 controller will:

699 o Instantiate an SR Policy on DCB1 with BSID n and a segment-list 700 containing the relevant segments of a TE path to DCB2. DCB1 701 therefore becomes a BSID anchor.

703 o Instantiate an SR Policy on VNF1 with BSID m and a segment-list 704 containing segments {DCB1, n, VNF2}.

706 +---------------+ +----------------+ +---------------+
707 | Data Center 1 | | Wide-Area | | Data Center 2 |
708 | +----+ +----+ 3 +----+ +----+ |
709 | |VNF1| |DCB1|-1 / \ 5--|DCB2| |VNF2| |
710 | +----+ +----+ \ / \ / +----+ +----+ |
711 | | | 2 4 | | |
712 +---------------+ +----------------+ +---------------+
713 SR Policy SR Policy
714 BSID m BSID n
715 {DCB1,n,VNF2} {1,2,3,4,5,DCB2}

717 Traffic-Engineered Path using BSID

719 Figure 2

721 In the above figure, a single DCB is used to interconnect two domains. 722 Similarly, in the case of two wide-area domains, the DCB would be 723 represented as an ABR or ASBR. In some single-operator environments, 724 domains may be interconnected using adjacent ASBRs connected via a 725 distinct physical link. In this scenario, the procedures outlined 726 above may be extended to incorporate the mechanisms used in Egress 727 Peer Engineering (EPE) [I-D.ietf-spring-segment-routing-central-epe] 728 to form a traffic-engineered path spanning distinct domains.

730 5.4.3.1. Traffic-Engineering and ECMP

732 Where the Interconnect controller is used to place SR policies, 733 providing support for ECMP requires some consideration. An SR Policy 734 is described with one or more segment-lists, and each of those 735 segment-lists, taken as a whole set of instructions, may or may not provide ECMP, and 736 each SID itself may or may not support ECMP forwarding. When an 737 individual SID is a BSID, an ECMP path may or may not also be nested 738 within. The Interconnect controller may choose to place a path 739 consisting entirely of non-ECMP-aware Adj-SIDs (each SID representing 740 a single adjacency) such that the controller has explicit hop-by-hop 741 knowledge of where that SR-TE LSP is routed. This is beneficial to 742 allow the controller to take corrective action if the criteria that 743 were used to initially select a particular link in a particular path 744 subsequently change.
For example, if the latency of a link 745 increases or a link becomes congested, a path may need to be rerouted. 746 If ECMP-aware SIDs are used in the SR policy segment-list (including 747 Node-SIDs, Adj-SIDs representing parallel links, and Anycast SIDs), SR 748 routers are able to make autonomous decisions about where traffic is 749 forwarded. As a result, it is not possible for the controller to 750 fully understand the impact of a change in network state and react to 751 it. With this in mind, there are a number of approaches that could be 752 adopted:

754 o If there is no requirement for the Interconnect controller to 755 explicitly track paths on a hop-by-hop basis, ECMP-aware SIDs may 756 be used in the SR policy segment-list. This approach may require 757 multiple [ELI, EL] pairs to be inserted at the ingress node; for 758 example, above and below a BSID to provide entropy in multiple 759 domains.

761 o If there is a requirement for the Interconnect controller to 762 explicitly track paths on a hop-by-hop basis to provide the capability 763 to reroute them based on changes in network state, SR policy 764 segment-lists should be constructed of non-ECMP-aware Adj-SIDs.

766 o A hybrid approach that allows for a level of ECMP (at the headend) 767 together with the ability for the Interconnect controller to 768 explicitly track paths is to instantiate an SR policy consisting 769 of a set of segment-lists, each containing non-ECMP-aware Adj- 770 SIDs. Each segment-list will be assigned a weight to allow for 771 ECMP or UCMP. This approach does, however, imply computation and 772 programming of two paths instead of one.

774 o Another hybrid approach might work as follows. Redundant DCBs 775 advertise an Anycast-SID 'A' into the data center, and also 776 instantiate an SR policy with a segment-list consisting of non- 777 ECMP-aware Adj-SIDs meeting the required connectivity and SLA. 778 The BSID value of this SR policy 'B' must be common to both 779 redundant DCBs, but the calculated paths are diverse. Indeed, 780 multiple segment-lists could be used in this SR policy. A VNF 781 could then instantiate an SR policy with a segment-list of {A, B} 782 to achieve ECMP in the data center and TE in the wide-area network 783 with the option of ECMP at the BSID anchor.

785 5.5. Service Layer

787 The service layer is intended to deliver Layer 2 and/or Layer 3 VPN 788 connectivity between network functions to create an overlay utilizing 789 the routing and LSP underlay described in section 5.4. To do this, 790 the solution employs the EVPN and/or VPN-IPv4/IPv6 address families 791 to exchange Layer 2 and Layer 3 Network Layer Reachability 792 Information (NLRI). When these NLRI are exchanged between domains, it 793 is typical for the border router to set next-hop-self on advertised 794 routes. With the proposed routing and LSP underlay, however, this is 795 not required, and EVPN/VPN-IPv4/IPv6 routes should be passed end-to- 796 end without transit routers modifying the next-hop attribute.

798 Section 5.4.2 describes the use of labeled unicast BGP to exchange 799 inter-domain routes to establish a default forwarding path. Labeled- 800 unicast BGP is used to exchange prefix reachability between service 801 edge routers, with domain border routers imposing next-hop-self on 802 routes advertised between domains.
This provides a default inter- 803 domain forwarding path and provides the required connectivity to 804 establish inter-domain BGP sessions between service edges for the 805 exchange of EVPN and/or VPN-IPv4/IPv6 NLRI. If route-reflection is 806 used for the EVPN and/or VPN-IPv4/IPv6 address families within one or 807 more domains, it may be desirable to create inter-domain BGP sessions 808 between route-reflectors. In this case, the peering addresses of the 809 route-reflectors should also be exchanged between domains using 810 labeled unicast BGP. This creates a connectivity model analogous to 811 BGP/MPLS IP-VPN Inter-AS option C [RFC4364].

813 +----------------+ +----------------+ +----------------+
814 | +----+ | | +----+ | | +----+ |
815 +----+ | RR | +----+ | RR | +----+ | RR | +----+
816 | NF | +----+ | DCI| +----+ | DCI| +----+ | NF |
817 +----+ +----+ +----+ +----+
818 | Domain | | Domain | | Domain |
819 +----------------+ +----------------+ +----------------+
820 <-------> <-----> NHS <-- BGP-LU ---> NHS <-----> <------>
821 <-------> <--------- EVPN/VPN-IPv4/v6 ----------> <------>

823 Inter-Domain Service Layer

825 Figure 3

827 EVPN and/or VPN-IPv4/v6 routes received from a peer in a different 828 domain will contain a next-hop equivalent to the router that sourced 829 the route. The next-hop of these routes can be resolved to a labeled- 830 unicast route (default forwarding path) or to an SR policy (traffic- 831 engineered forwarding path) as appropriate to the service 832 requirements. The exchange of EVPN and/or VPN-IPv4/IPv6 routes in 833 this manner implies that Route-Distinguisher and Route-Target values 834 remain intact end-to-end.

836 The use of end-to-end EVPN and/or VPN-IPv4/IPv6 address families 837 without the imposition of next-hop-self at border routers complements 838 the gateway-less transport layer architecture. It negates the 839 requirement for midpoint service provisioning and as such provides 840 the following benefits:

842 o Avoids the translation of MAC/IP EVPN routes to IP-VPN routes (and 843 vice versa) that is typically associated with service 844 interworking.

846 o Avoids instantiation of MAC-VRFs and IP-VPNs for each tenant 847 resident in the DCB.

849 o Avoids provisioning of demarcation functions between the data 850 center and wide-area network such as QoS, access control, 851 aggregation, and isolation.

853 5.6. Service Differentiation

855 As discussed in section 5.4.3, the use of TE paths is a key 856 capability of the NFIX solution framework described in this document. 857 The Interconnect controller computes end-to-end TE paths between NFs 858 and programs DC nodes, DCBs, and ABR/ASBRs, via SR Policy, with the 859 necessary label forwarding entries for each [headend, color, 860 endpoint]. The collection of [headend, endpoint] pairs for the same 861 color constitutes a logical network topology, where each topology 862 satisfies a given SLA requirement.

864 The Interconnect controller discovers the endpoints associated with a 865 given topology (color) upon the reception of EVPN or IPVPN routes 866 advertised by the endpoint. The EVPN and IPVPN NLRIs are advertised 867 by the endpoint nodes along with a color extended community which 868 identifies the topology to which the owner of the NLRI belongs. At a 869 coarse level, all the EVPN/IPVPN routes of the same VPN can be 870 advertised with the same color, and therefore a TE topology would be 871 established on a per-VPN basis.
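Before turning to more granular coloring options, the following minimal sketch (Python, with illustrative structures only) shows how a controller might derive per-color logical topologies from received routes carrying a color extended community:

   from collections import defaultdict

   # color value -> set of (headend, endpoint) pairs forming that topology
   topologies = defaultdict(set)

   def on_vpn_route(route):
       """Process a received EVPN or VPN-IPv4/IPv6 route.

       route.color     -- value of the color extended community
       route.next_hop  -- the advertising endpoint
       route.receivers -- headends importing the route (e.g. derived from
                          Route Targets); all attribute names illustrative.
       """
       for headend in route.receivers:
           topologies[route.color].add((headend, route.next_hop))

   def endpoints_for(color):
       # Pairs for which the controller must ensure suitable SR Policies.
       return sorted(topologies[color])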
At a finer level, IPVPN and 872 especially EVPN provide a more granular way of coloring routes that 873 allows the Interconnect controller to associate multiple 874 topologies with the same VPN. For example:

876 o All the EVPN MAC/IP routes for a given VNF may be advertised with 877 the same color. This would allow the Interconnect controller to 878 associate topologies per VNF within the same VPN; that is, VNF1 879 could be blue (e.g., low-latency topology) and VNF2 could be green 880 (e.g., high-throughput).

882 o The EVPN MAC/IP routes and Inclusive Multicast Ethernet Tag (IMET) 883 route for VNF1 may be advertised with different colors, e.g., red 884 and brown, respectively. This would allow the association of, 885 e.g., a low-latency topology for unicast traffic to VNF1 and a best- 886 effort topology for BUM traffic to VNF1.

888 o Each EVPN MAC/IP route or IP-Prefix route from a given VNF may be 889 advertised with a different color. This would allow the association 890 of topologies at the host level or at host-route granularity.

892 5.7. Automated Service Activation

894 The automation of network and service connectivity for instantiation 895 and mobility of virtual machines is a highly desirable attribute 896 within data centers. Since this concerns service connectivity, it 897 should be clear that this automation is relevant to virtual functions 898 that belong to a service as opposed to a virtual network function 899 that delivers services, such as a virtual PE router.

901 Within an SDN-enabled data center, a typical hierarchy from top to 902 bottom would include a policy engine (or policy repository), one or 903 more DC controllers, numerous hypervisors/container hosts that 904 function as NVO endpoints, and finally the virtual 905 machines (VMs)/containers, which we'll refer to generically as 906 virtualization hosts.

908 The mechanisms used to communicate between the policy engine and DC 909 controller, and between the DC controller and hypervisor/container 910 are not relevant here and as such they are not discussed further. 911 What is important is the interface and information exchange between 912 the Interconnect controller and the data center SDN functions:

914 o The Interconnect controller interfaces with the data center policy 915 engine and publishes the available colors, where each color 916 represents a topological service connectivity map that meets a set 917 of constraints and SLA objectives. This interface is a 918 straightforward API.

920 o The Interconnect controller interfaces with the DC controller to 921 learn overlay routes. This interface is BGP and uses the EVPN 922 Address Family.

924 With the above framework in place, automation of network and service 925 connectivity can be implemented as follows:

927 o The virtualization host is turned up. The NVO endpoint notifies 928 the DC controller of the startup.

930 o The DC controller retrieves service information, IP addressing 931 information, and service 'color' for the virtualization host from 932 the policy engine. The DC controller subsequently programs the 933 associated forwarding information on the virtualization host. 934 Since the DC controller is now aware of MAC and IP address 935 information for the virtualization host, it advertises that 936 information as an EVPN MAC Advertisement Route into the overlay.
938 o The Interconnect controller receives the EVPN MAC Advertisement 939 Route (potentially via a Route-Reflector) and correlates it with 940 locally held service information and SLA requirements using Route 941 Target and Color communities. If the relevant SR policies are not 942 already in place to support the service requirements and logical 943 connectivity, including any binding-SIDs, they are calculated and 944 advertised to the relevant headends.

946 The same automated service activation principles can also be used to 947 support the scenario where virtualization hosts are moved between 948 hypervisors/container hosts for resourcing or other reasons. We 949 refer to this simply as mobility. If a virtualization host is turned 950 down, the parent NVO endpoint notifies the DC controller, which in 951 turn notifies the policy engine and withdraws any EVPN MAC 952 Advertisement Routes. Thereafter, all associated state is removed. 953 When the virtualization host is turned up on a different hypervisor/ 954 container host, the automated service connectivity process outlined 955 above is simply repeated.

957 5.8. Service Function Chaining

959 Service Function Chaining (SFC) defines an ordered set of abstract 960 service functions and the subsequent steering of traffic through 961 them. Packets are classified at ingress for processing by the 962 required set of service functions (SFs) in an SFC-capable domain and 963 are then forwarded through each SF in turn for processing. The 964 ability to dynamically construct SFCs containing the relevant SFs in 965 the right sequence is a key requirement for operators.

967 To enable flexible service function deployment models that support 968 agile service insertion, the NFIX architecture adopts the use of BGP 969 as the control plane to distribute SFC information. The BGP control 970 plane for Network Service Header (NSH) SFC 971 [I-D.ietf-bess-nsh-bgp-control-plane] is used for this purpose and 972 defines two route types: the Service Function Instance Route (SFIR) 973 and the Service Function Path Route (SFPR).

975 The SFIR is used to advertise the presence of a service function 976 instance (SFI) as a function type (e.g., firewall, TCP optimizer) and 977 is advertised by the node hosting that SFI. The SFIR is advertised 978 together with a BGP Tunnel Encapsulation attribute containing details 979 of how to reach that particular service function through the underlay 980 network (i.e., IP address and encapsulation information).

982 The SFPRs contain service function path (SFP) information and one 983 SFPR is originated for each SFP. Each SFPR contains the service path 984 identifier (SPI) of the path, the sequence of service function types 985 that make up the path (each of which has at least one instance 986 advertised in an SFIR), and the service index (SI) for each listed 987 service function to identify its position in the path.

989 Once a Classifier has determined which flows should be mapped to a 990 given SFP, it imposes an NSH [RFC8300] on those packets, setting the 991 SPI to that of the selected service path (advertised in an SFPR), and 992 the SI to the first hop in the path. As NSH is encapsulation 993 agnostic, the NSH encapsulated packet is then forwarded through the 994 appropriate tunnel to reach the service function forwarder (SFF) 995 supporting that service function instance (advertised in an SFIR).
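A minimal, non-normative sketch of the classifier step just described is shown below (Python); the flow-to-SFP mapping and the NSH field handling are simplified assumptions.

   def classify_and_impose_nsh(packet, flow_to_spi, sfp_hops):
       """Classifier sketch (not an implementation of RFC 8300).

       flow_to_spi -- mapping from a flow key to the selected SPI
       sfp_hops    -- SPI -> ordered list of (service_index, function_type)
                      learnt from Service Function Path Routes (SFPRs)
       """
       spi = flow_to_spi.get(packet.flow_key)
       if spi is None:
           return packet                  # flow is not subject to a chain
       first_si = sfp_hops[spi][0][0]     # SI of the first hop in the path
       packet.nsh = {"spi": spi, "si": first_si}
       return packet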
996 The SFF removes the tunnel encapsulation and forwards the packet with 997 the NSH to the relevant SF based upon a lookup of the SPI/SI. When 998 it is returned from the SF with a decremented SI value, the SFF 999 forwards the packet to the next hop in the SFP using the tunnel 1000 information advertised by that SFI. This procedure is repeated until 1001 the last hop of the SFP is reached.

1003 The use of the NSH in this manner allows for service chaining with 1004 topological and transport independence. It also allows for the 1005 deployment of SFIs in a condensed or dispersed fashion depending on 1006 operator preference or resource availability. Service function 1007 chains are built in their own overlay network and share a common 1008 underlay network, where that common underlay network is the NFIX 1009 fabric described in section 5.4. BGP updates containing an SFIR or 1010 SFPR are advertised in conjunction with one or more Route Targets 1011 (RTs), and each node in a service function overlay network is 1012 configured with one or more import RTs. As a result, nodes will only 1013 import routes that are applicable and that local policy dictates. 1014 This provides the ability to support multiple service function 1015 overlay networks or the construction of service function chains 1016 within L3VPN or EVPN services.

1018 Although SFCs are constructed in a unidirectional manner, the BGP 1019 control plane for NSH SFC allows for the optional association of 1020 multiple paths (SFPRs). This provides the ability to construct a 1021 bidirectional service function chain in the presence of multiple 1022 equal-cost paths between source and destination to avoid problems 1023 that SFs may suffer with traffic asymmetry.

1025 The proposed SFC model can be considered decoupled in that the use of 1026 SR as a transport between SFFs is completely independent of the use 1027 of NSH to define the SFC. That is, it uses an NSH-based SFC and SR 1028 is just one of many encapsulations that could be used between SFFs. 1029 A similar, more integrated approach proposes encoding a service 1030 function as a segment so that an SFC can be constructed as a segment- 1031 list. In this case, it can be considered an SR-based SFC with an NSH- 1032 based service plane since the SF is unaware of the presence of the 1033 SR. Functionally, both approaches are very similar, and as such both 1034 could be adopted and could work in parallel. Construction of SFCs 1035 based purely on SR (SF is SR-aware) is not considered at this time.

1037 5.9. Stability and Availability

1039 Any network architecture should have the capability to self-restore 1040 following the failure of a network element. The time to reconverge 1041 following the failure needs to be minimal to avoid noticeable 1042 disruptions in service. This section discusses protection mechanisms 1043 that are available for use and their applicability to the proposed 1044 architecture.

1046 5.9.1. IGP Reconvergence

1048 Within the construct of an IGP topology, the Topology Independent Loop 1049 Free Alternate (TI-LFA) [I-D.ietf-rtgwg-segment-routing-ti-lfa] can 1050 be used to provide a local repair mechanism that offers both link and 1051 node protection.

1053 TI-LFA is a repair mechanism, and as such it is reactive and 1054 initially needs to detect a given failure. To provide fast failure 1055 detection, Bidirectional Forwarding Detection (BFD) is used.
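In asynchronous mode, the BFD detection time is approximately the agreed transmit interval multiplied by the detect multiplier; the following trivial sketch (Python, with hypothetical timer values) illustrates the arithmetic.

   def bfd_detection_time_ms(tx_interval_ms, detect_multiplier):
       # Asynchronous-mode approximation: the peer declares the session
       # down after 'detect_multiplier' consecutive intervals with no
       # received BFD control packet.
       return tx_interval_ms * detect_multiplier

   # Hypothetical timers: roughly 17 ms x 3 gives detection on the order
   # of the 50 milliseconds discussed in the following text.
   assert bfd_detection_time_ms(17, 3) == 51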
1066 5.9.2. Data Center Reconvergence 1068 Clos fabrics are extremely common within data centers, and 1069 fundamental to a Clos fabric is the ability to load-balance using 1070 Equal Cost Multipath (ECMP). The number of ECMP paths will vary 1071 depending on the number of devices in the parent tier, but will never 1072 be less than two for redundancy purposes, with traffic hashed over the 1073 available paths. In this scenario the availability of a backup path 1074 in the event of failure is implicit. Within the DC, rather 1075 than computing protection paths (like LFA), techniques such as 'fast 1076 rehash' are commonly utilized. In this particular case, the failed 1077 next-hop is removed from the multi-path forwarding data structure and 1078 traffic is then rehashed over the remaining active paths. 1080 In BGP-only data centers this relies on the implementation of BGP 1081 multipath. As network elements in the lower tier of a Clos fabric 1082 will frequently belong to different ASNs, this includes the ability 1083 to load-balance to a prefix with different AS_PATH attribute values 1084 while having the same AS_PATH length; sometimes referred to as 1085 'multipath relax' or 'multipath multiple-AS' [RFC7938]. 1087 Failure detection relies upon declaring a BGP session down and 1088 removing any prefixes learnt over that session as soon as the link is 1089 declared down. As links between network elements predominantly use 1090 direct point-to-point fiber, a link failure should be detected within 1091 milliseconds. BFD is also commonly used to detect IP layer failures. 1093 5.9.3. Exchange of Inter-Domain Routes 1095 Labeled unicast BGP together with SR Prefix-SID extensions is used 1096 to exchange PNF and/or VNF endpoints between domains to create end- 1097 to-end connectivity without TE. When advertising between domains we 1098 assume that a given BGP prefix is advertised by at least two border 1099 routers (DCBs, ABRs, ASBRs), making prefixes reachable via at least 1100 two next-hops. 1102 BGP Prefix Independent Convergence (PIC) [I-D.ietf-rtgwg-bgp-pic] 1103 allows failover to a pre-computed and pre-installed secondary next- 1104 hop when the primary next-hop fails and is independent of the number 1105 of destination prefixes that are affected by the failure. It should 1106 be clear that BGP PIC depends on the availability of a secondary 1107 next-hop in the pathlist when the primary BGP next-hop fails. To 1108 ensure that multiple paths to the same destination are visible, 1109 BGP ADD-PATH [RFC7911] can be used to allow for advertisement of 1110 multiple paths for the same address prefix. Dual-homed EVPN/IP-VPN 1111 prefixes also have the alternative option of allocating different 1112 Route-Distinguishers (RDs). To trigger the switch from primary to 1113 secondary next-hop, PIC needs to detect the failure, and many 1114 implementations support 'next-hop tracking' for this purpose. Next- 1115 hop tracking monitors the routing table and, if the next-hop prefix is 1116 removed, will immediately invalidate all BGP prefixes learnt through 1117 that next-hop. In the absence of next-hop tracking, multihop BFD 1118 [RFC5883] could optionally be used as a fast failure detection 1119 mechanism.
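The value of PIC in this design can be illustrated with a short, non-normative sketch of a hierarchical FIB. The structure and names below are illustrative assumptions: service prefixes resolve through a shared pathlist object rather than directly to a next-hop, so the failure of a border router is repaired with a single pathlist update, irrespective of the number of dependent prefixes.

   # Non-normative sketch of BGP PIC using a shared pathlist
   # (hierarchical FIB).  Names, prefix counts and structures are
   # illustrative only.

   class Pathlist:
       """Primary and pre-installed secondary BGP next-hops, e.g. two
       border routers learnt via ADD-PATH or distinct RDs."""
       def __init__(self, primary, secondary):
           self.paths = [primary, secondary]
           self.active = 0                  # index of the path in use

       def next_hop_down(self, failed):
           # next-hop tracking (or multihop BFD) reports the failure;
           # the shared object is flipped once, independent of the
           # number of prefixes that resolve through it
           if self.paths[self.active] == failed:
               self.active = 1 - self.active

   # one pathlist shared by all prefixes reachable via the same border pair
   pl = Pathlist(primary="ABR1", secondary="ABR2")
   fib = {"10.%d.%d.0/24" % (i // 256, i % 256): pl for i in range(10000)}

   pl.next_hop_down("ABR1")                 # one update repairs all 10000
   entry = fib["10.0.42.0/24"]
   print(entry.paths[entry.active])         # -> ABR2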
1121 5.9.4. Controller Redundancy 1123 With the Interconnect controller forming an integral part of the 1124 network's capabilities, a redundant controller design is clearly 1125 prudent. To this end we can consider both availability and 1126 redundancy. Availability refers to the survivability of a single 1127 controller system in a failure scenario. A common strategy for 1128 increasing the availability of a single controller system is to build 1129 the system in a high-availability cluster such that it becomes a 1130 confederation of redundant constituent parts as opposed to a single 1131 monolithic system. Should a single part fail, the system can still 1132 survive without the requirement to failover to a standby controller 1133 system. Methods for detection of a failure of one or more member 1134 parts of the cluster are implementation specific. 1136 To provide contingency for a complete system failure, a geo-redundant 1137 standby controller system is required. When redundant controllers 1138 are deployed, a coherent strategy is needed that provides a master/ 1139 standby election mechanism, the ability to propagate the outcome of 1140 that election to network elements as required, an inter-system 1141 failure detection mechanism, and the ability to synchronize state 1142 across both systems such that the standby controller is fully aware 1143 of current state should it need to transition to master controller. 1145 Master/standby election, state synchronization, and failure detection 1146 between geo-redundant sites can largely be considered a local 1147 implementation matter. The requirement to propagate the outcome of 1148 the master/standby election to network elements depends on a) the 1149 mechanism that is used to instantiate SR policies, and b) whether the 1150 SR policies are controller-initiated or headend-initiated, and these 1151 are discussed in the following sub-sections. In either scenario, 1152 state of SR policies should be advertised northbound to both master/ 1153 standby controllers using either PCEP LSP State Report messages or SR 1154 policy extensions to BGP link-state 1155 [I-D.ietf-idr-te-lsp-distribution]. 1157 5.9.4.1. SR Policy Initiator 1159 Controller-initiated SR policies are suited for auto-creation of 1160 tunnels based on service route discovery and policy-driven route/flow 1161 programming, and are ephemeral. Headend-initiated tunnels allow for 1162 permanent configuration state to be held on the headend and are 1163 suitable for static services that are not subject to dynamic changes. 1164 If all SR policies are controller-initiated, it negates the 1165 requirement to propagate the outcome of the master/standby election 1166 to network elements. This is because headends have no requirement 1167 for unsolicited requests to a controller, and therefore have no 1168 requirement to know which controller is master and which one is 1169 standby. A headend may respond to a message from a controller, but 1170 such a response is solicited rather than unsolicited. 1172 If some or all SR policies are headend-initiated, then the 1173 requirement to propagate the outcome of the master/standby election 1174 exists. This is further discussed in the following sub-section. 1176 5.9.4.2.
SR Policy Instantiation Mechanism 1178 While candidate paths of SR policies may be provided using BGP, PCEP, 1179 Netconf, or local policy/configuration, this document primarily 1180 considers the use of PCEP or BGP. 1182 When PCEP [RFC5440][RFC8231][RFC8281] is used for instantiation of 1183 candidate paths of SR policies 1184 [I-D.barth-pce-segment-routing-policy-cp], every headend/PCC should 1185 establish a PCEP session with the master and standby controllers. To 1186 signal standby state to the PCC, the standby controller may use a PCEP 1187 Notification message to set the PCEP session into overload state. 1188 While in this overload state, the standby controller will accept path 1189 computation LSP state report (PCRpt) messages without delegation but 1190 will reject path computation requests (PCReq) and any path 1191 computation reports (PCRpt) with the delegation bit set. Further, 1192 the standby controller will not originate path computation initiate 1193 messages (PCInit) or path computation update request messages 1194 (PCUpd). In the event of the failure of the master controller, the 1195 standby controller will transition to active and remove the PCEP 1196 overload state. Following expiration of the PCEP redelegation 1197 timeout at the PCC, any LSPs will be redelegated to the newly 1198 transitioned active controller. LSP state is not impacted unless 1199 redelegation is not possible before the state timeout interval 1200 expires. 1202 When BGP is used for instantiation of SR policies, every headend 1203 should establish BGP sessions capable of exchanging the SR TE Policy 1204 SAFI with both the master and standby controllers. Candidate paths of SR 1205 policies are advertised only by the active controller. If the master 1206 controller should experience a failure, then SR policies learnt from 1207 that controller may be removed before they are re-advertised by the 1208 standby (or newly-active) controller. To minimize this possibility, 1209 BGP speakers that advertise and instantiate SR policies can implement 1210 Long Lived Graceful Restart (LLGR) [I-D.ietf-idr-long-lived-gr], also 1211 known as BGP persistence, to retain existing routes, treated as least- 1212 preferred, until a new route arrives. In the absence of LLGR, two 1213 other alternatives are possible: 1215 o Provide a static backup SR policy. 1217 o Fall back to the default forwarding path.
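The standby-controller behavior described in this sub-section can be sketched, for the PCEP case, as a simple message filter. The class and message representations below are illustrative assumptions only and are not an implementation of [RFC5440] or [RFC8231].

   # Non-normative sketch of a standby controller's PCEP handling while
   # the session is in overload state.  Message and field names are
   # illustrative only.

   class StandbyPce:
       def __init__(self):
           self.overloaded = True       # signalled to PCCs via Notification

       def on_receive(self, msg):
           """Handling of messages received while acting as standby."""
           if not self.overloaded:
               return "process"         # active controller: normal handling
           if msg["type"] == "PCRpt" and not msg.get("delegate", False):
               return "accept"          # keep the LSP state database in sync
           return "reject"              # PCReq and delegated PCRpt refused

       def may_send(self, msg_type):
           """A standby controller does not originate PCInit or PCUpd."""
           return not (self.overloaded and msg_type in ("PCInit", "PCUpd"))

       def promote_to_active(self):
           # master failure detected: clear overload and accept any LSPs
           # redelegated by PCCs after their redelegation timeout expires
           self.overloaded = False

   pce = StandbyPce()
   print(pce.on_receive({"type": "PCRpt", "delegate": False}))  # accept
   print(pce.on_receive({"type": "PCRpt", "delegate": True}))   # reject
   print(pce.may_send("PCUpd"))                                 # False
   pce.promote_to_active()
   print(pce.on_receive({"type": "PCReq"}))                     # process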
1219 5.9.5. Path and Segment Liveliness 1221 When using traffic-engineered SR paths, only the ingress router holds 1222 any state. The exception here is where BSIDs are used, which also 1223 implies some state is maintained at the BSID anchor. As there is no 1224 control plane set-up, it follows that there is no feedback loop from 1225 transit nodes of the path to notify the headend when a non-adjacent 1226 point of the SR path fails. The Interconnect controller, however, is 1227 aware of all paths that are impacted by a given network failure and 1228 should take the appropriate action. This action could include 1229 withdrawing an SR policy if a suitable candidate path is already in 1230 place, or simply sending a new SR policy with a different segment- 1231 list and a higher preference value assigned to it. 1233 Verification of data plane liveliness is the responsibility of the 1234 path headend. A given SR policy may be associated with multiple 1235 candidate paths and, for the sake of clarity, we assume two here for 1236 redundancy purposes (which can be diversely routed). Verification of 1237 the liveliness of these paths can be achieved using seamless BFD 1238 (S-BFD) [RFC7880], which provides an in-band failure detection 1239 mechanism capable of detecting failure in the order of tens of 1240 milliseconds. Upon failure of the active path, failover to a 1241 secondary candidate path can be activated at the path headend. 1242 Details of the actual failover and revert mechanisms are a local 1243 implementation matter. 1245 S-BFD provides a fast and scalable failure detection mechanism but is 1246 unlikely to be implemented in many VNFs given their inability to 1247 offload the process to purpose-built hardware. In the absence of an 1248 active failure detection mechanism such as S-BFD, the failover from 1249 active path to secondary candidate path can be triggered using 1250 continuous path validity checks. One of the criteria that a 1251 candidate path uses to determine its validity is the ability to 1252 perform path resolution for the first SID to one or more outgoing 1253 interface(s) and next-hop(s). From the perspective of the VNF 1254 headend, the first SID in the segment-list will very likely be the DCB 1255 (as BSID anchor) but could equally be another Prefix-SID hop within 1256 the data center. Should this segment experience a non-recoverable 1257 failure, the headend will be unable to resolve the first SID and the 1258 path will be considered invalid. This will trigger a failover action 1259 to a secondary candidate path. 1261 Injection of S-BFD packets is not constrained to the source of 1262 an end-to-end LSP. When an S-BFD packet is injected into an SR 1263 policy path, it is encapsulated with the label stack of the associated 1264 segment-list. It is possible therefore to run S-BFD from a BSID 1265 anchor for just that section of the end-to-end path (for example, 1266 from DCB to DCB). This allows a BSID anchor to detect failure of a 1267 path and take corrective action, while maintaining opacity between 1268 domains. 1270 5.10. Scalability 1272 There are many aspects to consider regarding scalability of the NFIX 1273 architecture. The building blocks of NFIX are standards-based 1274 technologies individually designed to scale for internet provider 1275 networks. When combined they provide a flexible and scalable 1276 solution: 1278 o BGP has been proven to scale and operate with millions of routes 1279 being exchanged. Specifically, BGP labeled unicast has been 1280 deployed and proven to scale in existing seamless-MPLS networks. 1282 o By placing forwarding instructions in the header of a packet, 1283 segment routing reduces the amount of state required in the 1284 network, allowing it to scale to a greater number of transport tunnels. 1285 This makes it feasible for the NFIX architecture to automate 1286 SR policy creation without impacting 1287 state in the core of the network. 1289 o The choice of utilizing native SR-MPLS or SR over IP in the data 1290 center continues to permit horizontal scaling without introducing 1291 new state inside the data center fabric while still permitting 1292 seamless end-to-end path forwarding integration. 1294 o BSIDs play a key role in the NFIX architecture as their use 1295 provides the ability to traffic-engineer across large network 1296 topologies consisting of many hops regardless of hardware 1297 capability at the headend.
From a scalability perspective, the use 1298 of BSIDs facilitates better scale because detailed 1299 information about the SR paths in a domain is abstracted and 1300 localized to the BSID anchor point only. When BSIDs are re-used 1301 amongst one or many headends, they reduce the amount of path 1302 calculation and updates required at network edges while still 1303 providing seamless end-to-end path forwarding. 1305 o The architecture of NFIX continues to use an independent DC 1306 controller. This allows continued independent scaling of data 1307 center management in both policy and local forwarding functions, 1308 while off-loading the end-to-end optimal path placement and 1309 automation to the Interconnect controller. The optimal path 1310 placement is already a scalable function provided in a PCE 1311 architecture. The Interconnect controller must compute paths, but 1312 it is not burdened by the management of virtual entity lifecycle 1313 and associated forwarding policies. 1315 It must be acknowledged that with the amalgamation of the technology 1316 building blocks and the automation required by NFIX, there is an 1317 additional burden on the Interconnect controller. The scaling 1318 considerations are dependent on many variables, but an implementation 1319 of an Interconnect controller shares many traits and 1320 scaling concerns with a PCE; the controller and PCE both must: 1322 o Discover and listen to state changes of the IP/MPLS 1323 topology. 1325 o Compute traffic-engineered intra- and inter-domain paths across 1326 large service provider topologies. 1328 o Synchronize, track, and update thousands of LSPs to network devices 1329 upon network state changes. 1331 Both entail topologies that contain tens of thousands of nodes and 1332 links. The Interconnect controller in an NFIX architecture takes on 1333 the additional role of becoming end-to-end service aware and 1334 discovering data center entities that were traditionally excluded 1335 from a controller's scope. Although not exhaustive, an NFIX 1336 Interconnect controller is impacted by some of the following: 1338 o The number of individual services, the number of endpoints that 1339 may exist in each service, the distribution of endpoints in a 1340 virtualized environment, and how many data centers may exist. 1341 Medium or large data centers may be capable of hosting more 1342 virtual endpoints per host, but with the move to smaller edge- 1343 clouds the number of headends that require inter-connectivity 1344 increases compared to the density of localized routing in a 1345 centralized data center model. The outcome has an impact on the 1346 number of headend devices that may require tunnel management by 1347 the Interconnect controller. 1349 o Assuming a given BSID satisfies the SLA, the ability to re-use BSIDs 1350 across multiple services reduces the number of paths to track and 1351 manage. However, the number of colors or unique SLA definitions, 1352 and criteria such as bandwidth constraints, impact WAN traffic 1353 distribution requirements. As BSIDs play a key role for VNF 1354 connectivity, this potentially increases the number of BSID paths 1355 required to permit appropriate traffic distribution. This also 1356 impacts the number of tunnels which may be re-used on a given 1357 headend for different services. 1359 o The frequency of virtualized hosts being created and destroyed and 1360 the general activity within a given service.
The controller must 1361 analyze and correlate the activity of relevant BGP routes 1362 to track the addition and removal of service hosts or host subnets, and 1363 determine whether new SR policies should be instantiated, or stale 1364 unused SR policies should be removed from the network. 1366 o The choice of SR instantiation mechanism impacts the number of 1367 communication sessions the controller may require. For example, 1368 the BGP-based mechanism may only require a small number of 1369 sessions to route reflectors, whereas PCEP may require a 1370 connection to every possible leaf in the network and any BSID 1371 anchors. 1373 o The number of hops within one or many WAN domains may affect the 1374 number of BSIDs required to provide transit for VNF/PNF, PNF/PNF, 1375 or VNF/VNF inter-connectivity. 1377 o Relative to traditional WAN topologies, data centers 1378 are generally topologically denser in node and link connectivity, 1379 all of which must be discovered by the Interconnect controller, 1380 resulting in a much larger and denser link-state database on the 1381 Interconnect controller. 1383 5.10.1. Asymmetric Model B for VPN Families 1385 With the instantiation of multiple TE paths between any two VNFs in 1386 the NFIX network, the number of SR Policy (remote endpoint, color) 1387 routes, BSIDs, and labels to support on VNFs becomes a choke point in 1388 the architecture. The fact that some VNFs are limited in terms of 1389 forwarding resources makes this aspect an important scale issue. 1391 As an example, if VNF1 and VNF2 in Figure 1 are associated with 1392 multiple topologies 1..n, the Interconnect controller will 1393 instantiate n TE paths in VNF1 to reach VNF2: 1395 [VNF1,color-1,VNF2] --> BSID 1 1397 [VNF1,color-2,VNF2] --> BSID 2 1399 ... 1401 [VNF1,color-n,VNF2] --> BSID n 1403 Similarly, m TE paths may be instantiated on VNF1 to reach VNF3, 1404 another p TE paths to reach VNF4, and so on for all the VNFs that 1405 VNF1 needs to communicate with in DC2. As can be observed, the 1406 number of forwarding resources to be instantiated on VNF1 may 1407 significantly grow with the number of remote [endpoint, color] pairs, 1408 compared with a best-effort architecture in which the number of 1409 forwarding resources in VNF1 grows with the number of endpoints only. 1411 This scale issue on the VNFs can be relieved by the use of an 1412 asymmetric model B service layer. The concept is illustrated in 1413 Figure 4. 1415 +------------+ 1416 <-------------------------------------| WAN | 1417 | SR Policy +-------------------| Controller | 1418 | BSID m | SR Policy +------------+ 1419 v {DCI1,n,DCI2} v BSID n 1420 {1,2,3,4,5,DCI2} 1421 +----------------+ +----------------+ +----------------+ 1422 | +----+ | | | | +----+ | 1423 +----+ | RR | +----+ +----+ | RR | +----+ 1424 |VNF1| +----+ |DCI1| |DCI2| +----+ |VNF2| 1425 +----+ +----+ +----+ +----+ 1426 | DC1 | | WAN | | DC2 | 1427 +----------------+ +----------------+ +----------------+ 1429 <-------- <-------------------------- NHS <------ <------ 1430 EVPN/VPN-IPv4/v6(colored) 1432 +-----------------------------------> +-------------> 1433 TE path to DCI2 ECMP path to VNF2 1434 (BSID to segment-list 1435 expansion on DCI1) 1437 Asymmetric Model B Service Layer 1439 Figure 4 1441 Consider that the n different topologies needed between VNF1 and VNF2 are 1442 really only relevant to the different TE paths that exist in the WAN.
1443 The WAN is the domain in the network where there can be significant 1444 differences in latency, throughput, or packet loss depending on the 1445 sequence of nodes and links the traffic goes through. Based on that 1446 assumption, traffic from VNF1 in Figure 4 only needs to be traffic-engineered as far as 1447 DCB2, and traffic from DCB2 to VNF2 can simply take an ECMP path. In this case an 1448 asymmetric model B service layer can significantly relieve the scale 1449 pressure on VNF1. 1451 From a service layer perspective, the NFIX architecture described up 1452 to now can be considered 'symmetric', meaning that the EVPN/IPVPN 1453 advertisements from, e.g., VNF2 in Figure 2, are received on VNF1 with 1454 the next-hop of VNF2, and vice versa for VNF1's routes on VNF2. SR 1455 policies to each VNF2 [endpoint, color] are then required on 1456 VNF1. 1458 In the 'asymmetric' service design illustrated in Figure 4, VNF2's 1459 EVPN/IPVPN routes are received on VNF1 with the next-hop of DCB2, and 1460 VNF1's routes are received on VNF2 with the next-hop of DCB1. Now SR 1461 policies instantiated on VNFs can be reduced to only the number of TE 1462 paths required to reach the remote DCB. For example, considering n 1463 topologies, in a symmetric model VNF1 has to be instantiated with n 1464 SR policy paths per remote VNF in DC2, whereas in the asymmetric 1465 model of Figure 4, VNF1 only requires n SR policy paths per DC, i.e., 1466 to DCB2. 1468 Asymmetric model B is a simple design choice that only requires the 1469 ability (on the DCB nodes) to set next-hop-self on the EVPN/IPVPN 1470 routes advertised to the WAN neighbors and not to set next-hop-self for 1471 routes advertised to the DC neighbors. With this option, the 1472 Interconnect controller only needs to establish TE paths from VNFs to 1473 remote DCBs, as opposed to VNFs to remote VNFs. 1475 6. Illustration of Use 1477 For the purpose of illustration, this section provides some examples 1478 of how different end-to-end tunnels are instantiated (including the 1479 relevant protocols, SID values/label stacks, etc.) and how services 1480 are then overlaid onto those LSPs. 1482 6.1. Reference Topology 1484 The following network diagram shows the reference network 1485 topology used in this section. 1486 Within the data centers leaf and spine network elements may be 1487 present but are not shown for the purpose of clarity. 1489 +----------+ 1490 |Controller| 1491 +----------+ 1492 / | \ 1493 +----+ +----+ +----+ +----+ 1494 ~ ~ ~ ~ | R1 |----------| R2 |----------| R3 |-----|AGN1| ~ ~ ~ ~ 1495 ~ +----+ +----+ +----+ +----+ ~ 1496 ~ DC1 | / | | DC2 ~ 1497 +----+ | L=5 +----+ L=5 / | +----+ +----+ 1498 | Sn | | +-------| R4 |--------+ | |AGN2| | Dn | 1499 +----+ | / M=20 +----+ M=20 | +----+ +----+ 1500 ~ | / | | ~ 1501 ~ +----+ +----+ +----+ +----+ +----+ ~ 1502 ~ ~ ~ ~ | R5 |-----| R6 |----| R7 |-----| R8 |-----|AGN3| ~ ~ ~ ~ 1503 +----+ +----+ +----+ +----+ +----+ 1505 Reference Topology 1507 Figure 5 1509 The following applies to the reference topology in Figure 5: 1511 o Data center 1 and data center 2 both run BGP/SR. Both data 1512 centers run leaf/spine topologies, which are not shown for the 1513 purpose of clarity. 1515 o R1 and R5 function as data center border routers for DC 1. AGN1 1516 and AGN3 function as data center border routers for DC 2. 1518 o Routers R1 through R8 form an independent ISIS-OSPF/SR instance. 1520 o Routers R3, R8, AGN1, AGN2, and AGN3 form an independent ISIS- 1521 OSPF/SR instance.
1523 o All IGP link metrics within the wide area network are metric 10, 1524 except for links R5-R4 and R4-R3, which are both metric 20. 1526 o All links have a unidirectional latency of 10 milliseconds, except 1527 for links R5-R4 and R4-R3, which both have a unidirectional latency 1528 of 5 milliseconds. 1530 o Source 'Sn' and destination 'Dn' represent one or more network 1531 functions. 1533 6.2. PNF to PNF Connectivity 1535 The first example demonstrates the simplest form of connectivity: PNF 1536 to PNF. The example illustrates the instantiation of a 1537 unidirectional TE path from R1 to AGN2 and its consumption by an EVPN 1538 service. The service has a requirement for high throughput with no 1539 strict latency requirements. These service requirements are 1540 cataloged and represented using the color blue. 1542 o An EVPN service is provisioned at R1 and AGN2. 1544 o The Interconnect controller computes the path from R1 to AGN2 and 1545 calculates that the optimal path based on the service requirements 1546 and overall network optimization is R1-R5-R6-R7-R8-AGN3-AGN2. The 1547 segment-list to represent the calculated path could be constructed 1548 in numerous ways. It could be strict hops represented by a series 1549 of Adj-SIDs. It could be loose hops using ECMP-aware Node-SIDs, 1550 for example {R7, AGN2}, or it could be a combination of both Node- 1551 SIDs and Adj-SIDs. Equally, BSIDs could be used to reduce the 1552 number of labels that need to be imposed at the headend. In this 1553 example, strict Adj-SID hops are used with a BSID at the area 1554 border router R8, but this should not be interpreted as the only 1555 way a path and segment-list can be represented. 1557 o The Interconnect controller advertises a BGP SR Policy to R8 with 1558 BSID 1000, and a segment-list containing segments {AGN3, AGN2}. 1560 o The Interconnect controller advertises a BGP SR Policy to R1 with 1561 BSID 1001, and a segment-list containing segments {R5, R6, R7, R8, 1562 1000}. The policy is identified using the tuple [headend = R1, 1563 color = blue, endpoint = AGN2]. 1565 o AGN2 advertises an EVPN MAC Advertisement Route for MAC M1, which 1566 is learned by R1. The route has a next-hop of AGN2, an MPLS label 1567 of L1, and it carries a color extended community with the value 1568 blue. 1570 o R1 has a valid SR policy [color = blue, endpoint = AGN2] with 1571 segment-list {R5, R6, R7, R8, 1000}. R1 therefore associates the 1572 MAC address M1 with that policy and programs the relevant 1573 information into the forwarding path. 1575 o The Interconnect controller also learns the EVPN MAC Route 1576 advertised by AGN2. The purpose of this is two-fold. It allows 1577 the controller to correlate the service overlay with the 1578 underlying transport LSPs, thus creating a service connectivity 1579 map. It also allows the controller to dynamically create LSPs 1580 based upon service requirements if they do not already exist, or 1581 to optimize them if network conditions change.
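The associations made in this example can be summarized with a short, non-normative sketch showing how the headend ties a colored service route to an SR policy and how the BSID at R8 stands in for the remainder of the path. The values are taken from the walk-through above; the data structures and helper functions themselves are illustrative assumptions only.

   # Non-normative sketch of the PNF-to-PNF example: a colored EVPN route
   # is matched to an SR policy keyed on (headend, color, endpoint), and
   # the BSID is expanded at its anchor (R8).  Structures are illustrative.

   policies = {
       ("R1", "blue", "AGN2"): {"bsid": 1001,
                                "segments": ["R5", "R6", "R7", "R8", 1000]},
       # R8's policy is shown without color/endpoint for simplicity
       ("R8", None, None):     {"bsid": 1000, "segments": ["AGN3", "AGN2"]},
   }

   def steer(headend, route):
       """Headend: associate a service route with an SR policy using the
       route's color community and BGP next-hop, then append the service
       label at the bottom of the stack."""
       policy = policies.get((headend, route["color"], route["next_hop"]))
       return policy["segments"] + [route["label"]] if policy else None

   def expand(segments):
       """Show the effective end-to-end path by replacing any BSID with
       the segment-list held at its anchor (done in the data plane)."""
       out = []
       for seg in segments:
           anchor = next((p for p in policies.values()
                          if p["bsid"] == seg), None)
           out += anchor["segments"] if anchor else [seg]
       return out

   # EVPN MAC route for M1: next-hop AGN2, MPLS label L1, color blue
   stack = steer("R1", {"color": "blue", "next_hop": "AGN2", "label": "L1"})
   print(stack)          # ['R5', 'R6', 'R7', 'R8', 1000, 'L1']
   print(expand(stack))  # ['R5', 'R6', 'R7', 'R8', 'AGN3', 'AGN2', 'L1']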
1583 6.3. VNF to PNF Connectivity 1585 The next example demonstrates VNF to PNF connectivity and illustrates 1586 the instantiation of a unidirectional TE path from S1 to AGN2. The 1587 path is consumed by an IP-VPN service that has a basic set of service 1588 requirements and as such simply uses IGP metric as a path computation 1589 objective. These basic service requirements are cataloged and 1590 represented using the color red. 1592 In this example S1 is a VNF with full IP routing and MPLS capability 1593 that interfaces to the data center underlay/overlay and serves as the 1594 NVO tunnel endpoint. 1596 o An IP-VPN service is provisioned at S1 and AGN2. 1598 o The Interconnect controller computes the path from S1 to AGN2 and 1599 calculates that the optimal path based on IGP metric is 1600 R1-R2-R3-AGN1-AGN2. 1602 o The Interconnect controller advertises a BGP SR Policy to R1 with 1603 BSID 1002, and a segment-list containing segments {R2, R3, AGN1, 1604 AGN2}. 1606 o The Interconnect controller advertises a BGP SR Policy to S1 with 1607 BSID 1003, and a segment-list containing segments {R1, 1002}. The 1608 policy is identified using the tuple [headend = S1, color = red, 1609 endpoint = AGN2]. 1611 o Source S1 learns a VPN-IPv4 route for prefix P1, next-hop AGN2. 1612 The route has a VPN label of L1, and it carries a color extended 1613 community with the value red. 1615 o S1 has a valid SR policy [color = red, endpoint = AGN2] with 1616 segment-list {R1, 1002} and BSID 1003. S1 therefore associates 1617 the VPN-IPv4 prefix P1 with that policy and programs the relevant 1618 information into the forwarding path. 1620 o As in the previous example, the Interconnect controller also learns 1621 the VPN-IPv4 route advertised by AGN2 in order to correlate the 1622 service overlay with the underlying transport LSPs, creating or 1623 optimizing them as required. 1625 6.4. VNF to VNF Connectivity 1627 The last example demonstrates VNF to VNF connectivity and illustrates 1628 the instantiation of a unidirectional TE path from S2 to D2. The 1629 path is consumed by an EVPN service that requires low latency 1630 and as such uses latency as a path computation 1631 objective. This service requirement is cataloged and represented 1632 using the color green. 1634 In this example S2 is a VNF that has no routing capability. It is 1635 hosted by hypervisor H1, which in turn has an interface to a DC 1636 controller through which forwarding instructions are programmed. H1 1637 serves as the NVO tunnel endpoint and overlay next-hop. 1639 D2 is a VNF with partial routing capability that is connected to a 1640 leaf switch L1. L1 connects to underlay/overlay in data center 2 and 1641 serves as the NVO tunnel endpoint for D2. L1 advertises BGP Prefix- 1642 SID 9001 into the underlay. 1644 o The relevant details of the EVPN service are entered in the data 1645 center policy engines within data centers 1 and 2. 1647 o Source S2 is turned up. Hypervisor H1 notifies its parent DC 1648 controller, which in turn retrieves the service (EVPN) 1649 information, color, and IP and MAC information from the policy engine 1650 and subsequently programs the associated forwarding entries onto 1651 S2. The DC controller also dynamically advertises an EVPN MAC 1652 Advertisement Route for S2's IP and MAC into the overlay with 1653 next-hop H1. (This would trigger the return path set-up between 1654 L1 and H1, which is not covered in this example.) 1656 o The DC controller in data center 1 learns an EVPN MAC 1657 Advertisement Route for D2, MAC M, next-hop L1. The route has an 1658 MPLS label of L2, and it carries a color extended community with 1659 the value green. 1661 o The Interconnect controller computes the path between H1 and L1 1662 and calculates that the optimal path based on latency is 1663 R5-R4-R3-AGN1.
1665 o The Interconnect controller advertises a BGP SR Policy to R5 with 1666 BSID 1004, and a segment-list containing segments {R4, R3, AGN1}. 1668 o The Interconnect controller advertises a BGP SR Policy to the DC 1669 controller in data center 1 with BSID 1005 and a segment-list 1670 containing segments {R5, 1004, 9001}. The policy is identified 1671 using the tuple [headend = H1, color = green, endpoint = L1]. 1673 o The DC controller in data center 1 has a valid SR policy [color = 1674 green, endpoint = L1] with segment-list {R5, 1004, 9001} and BSID 1675 1005. The controller therefore associates the MAC Advertisement 1676 Route with that policy, and programs the associated forwarding 1677 rules into S2. 1679 o As in the previous example, the Interconnect controller also learns 1680 the MAC Advertisement Route advertised for D2 in order to correlate 1681 the service overlay with the underlying transport LSPs, creating 1682 or optimizing them as required. 1684 7. Conclusions 1686 The NFIX architecture provides an evolutionary path to a unified 1687 network fabric. It uses the base constructs of seamless-MPLS and 1688 adds end-to-end LSPs capable of delivering against SLAs, seamless 1689 data center interconnect, service differentiation, service function 1690 chaining, and a Layer-2/Layer-3 infrastructure capable of 1691 interconnecting PNF-to-PNF, PNF-to-VNF, and VNF-to-VNF. 1693 NFIX establishes a dynamic, seamless, and automated connectivity 1694 model that overcomes the operational barriers and interworking issues 1695 between data centers and the wide-area network and delivers the 1696 following using standards-based protocols: 1698 o A unified routing control plane: Multiprotocol BGP (MP-BGP) to 1699 acquire inter-domain NLRI from the IP/MPLS underlay and the 1700 virtualized IP-VPN/EVPN service overlay. 1702 o A unified forwarding control plane: SR provides dynamic service 1703 tunnels with fast restoration options to meet deterministic 1704 bandwidth, latency, and path diversity constraints. SR utilizes 1705 the appropriate data path encapsulation for seamless, end-to-end 1706 connectivity between distributed edge and core data centers across 1707 the wide-area network. 1709 o Service Function Chaining: Leverages SFC extensions for BGP and 1710 segment routing to interconnect network and service functions into 1711 SFPs, with support for various data path implementations. 1713 o Service Differentiation: Provides a framework that allows for 1714 construction of logical end-to-end networks with differentiated 1715 logical topologies and/or constraints through use of SR policies 1716 and coloring. 1718 o Automation: Facilitates automation of service provisioning and 1719 avoids heavy service interworking at DCBs. 1721 NFIX is deployable on existing data center and wide-area network 1722 infrastructures and allows the underlying data forwarding plane to 1723 evolve with minimal impact on the services plane. 1725 8. Security Considerations 1727 The NFIX architecture, being based on SR-MPLS, is subject to the same 1728 security concerns as any MPLS network. No new protocols are 1729 introduced; hence, security issues of the protocols encompassed by 1730 this architecture are addressed within the relevant individual 1731 standards documents. It is recommended that the security framework 1732 for MPLS and GMPLS networks defined in [RFC5920] is adhered to.
1733 Although [RFC5920] focuses on the use of RSVP-TE and LDP control 1734 plane, the practices and procedures are extendable to an SR-MPLS 1735 domain. 1737 The NFIX architecture makes extensive use of Multiprotocol BGP, and 1738 it is recommended that the TCP Authentication Option (TCP-AO) 1739 [RFC5925] is used to protect the integrity of long-lived BGP sessions 1740 and any other TCP-based protocols. 1742 Where PCEP is used between controller and path headend the use of 1743 PCEPS [RFC8253] is recommended to provide confidentiality to PCEP 1744 communication using Transport Layer Security (TLS). 1746 9. Acknowledgements 1748 The authors would like to acknowledge Mustapha Aissaoui, Wim 1749 Henderickx, and Gunter Van de Velde. 1751 10. Contributors 1753 The following people contributed to the content of this document and 1754 should be considered co-authors. 1756 Juan Rodriguez 1757 Nokia 1758 United States of America 1760 Email: juan.rodriguez@nokia.com 1762 Jorge Rabadan 1763 Nokia 1764 United States of America 1766 Email: jorge.rabadan@nokia.com 1768 Nick Morris 1769 Verizon 1770 United States of America 1772 Email: nicklous.morris@verizonwireless.com 1774 Eddie Leyton 1775 Verizon 1776 United States of America 1778 Email: edward.leyton@verizonwireless.com 1780 Figure 6 1782 11. IANA Considerations 1784 This memo does not include any requests to IANA for allocation. 1786 12. References 1788 12.1. Normative References 1790 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 1791 Requirement Levels", BCP 14, RFC 2119, March 1997, 1792 . 1794 [RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 1795 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, 1796 May 2017, . 1798 12.2. Informative References 1800 [I-D.ietf-nvo3-geneve] 1801 Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic 1802 Network Virtualization Encapsulation", draft-ietf- 1803 nvo3-geneve-16 (work in progress), March 2020. 1805 [I-D.ietf-mpls-seamless-mpls] 1806 Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz, 1807 M., and D. Steinberg, "Seamless MPLS Architecture", draft- 1808 ietf-mpls-seamless-mpls-07 (work in progress), June 2014. 1810 [I-D.ietf-bess-evpn-ipvpn-interworking] 1811 Rabadan, J., Sajassi, A., Rosen, E., Drake, J., Lin, W., 1812 Uttaro, J., and A. Simpson, "EVPN Interworking with 1813 IPVPN", draft-ietf-bess-evpn-ipvpn-interworking-03 (work 1814 in progress), May 2020. 1816 [I-D.ietf-spring-segment-routing-policy] 1817 Filsfils, C., Sivabalan, S., Voyer, D., Bogdanov, A., and 1818 P. Mattes, "Segment Routing Policy Architecture", draft- 1819 ietf-spring-segment-routing-policy-07 (work in progress), 1820 May 2020. 1822 [I-D.ietf-rtgwg-segment-routing-ti-lfa] 1823 Litkowski, S., Bashandy, A., Filsfils, C., Decraene, B., 1824 Francois, P., Voyer, D., Clad, F., and P. Camarillo, 1825 "Topology Independent Fast Reroute using Segment Routing", 1826 draft-ietf-rtgwg-segment-routing-ti-lfa-03 (work in 1827 progress), March 2020. 1829 [I-D.ietf-bess-nsh-bgp-control-plane] 1830 Farrel, A., Drake, J., Rosen, E., Uttaro, J., and L. 1831 Jalil, "BGP Control Plane for the Network Service Header 1832 in Service Function Chaining", draft-ietf-bess-nsh-bgp- 1833 control-plane-15 (work in progress), June 2020. 1835 [I-D.ietf-idr-te-lsp-distribution] 1836 Previdi, S., Talaulikar, K., Dong, J., Chen, M., Gredler, 1837 H., and J. 
Tantsura, "Distribution of Traffic Engineering 1838 (TE) Policies and State using BGP-LS", draft-ietf-idr-te- 1839 lsp-distribution-13 (work in progress), April 2020. 1841 [I-D.barth-pce-segment-routing-policy-cp] 1842 Koldychev, M., Sivabalan, S., Barth, C., Peng, S., and H. 1843 Bidgoli, "PCEP extension to support Segment Routing Policy 1844 Candidate Paths", draft-barth-pce-segment-routing-policy- 1845 cp-06 (work in progress), June 2020. 1847 [I-D.filsfils-spring-sr-policy-considerations] 1848 Filsfils, C., Talaulikar, K., Krol, P., Horneffer, M., and 1849 P. Mattes, "SR Policy Implementation and Deployment 1850 Considerations", draft-filsfils-spring-sr-policy- 1851 considerations-05 (work in progress), April 2020. 1853 [I-D.ietf-rtgwg-bgp-pic] 1854 Bashandy, A., Filsfils, C., and P. Mohapatra, "BGP Prefix 1855 Independent Convergence", draft-ietf-rtgwg-bgp-pic-11 1856 (work in progress), February 2020. 1858 [I-D.ietf-isis-mpls-elc] 1859 Xu, X., Kini, S., Psenak, P., Filsfils, C., Litkowski, S., 1860 and M. Bocci, "Signaling Entropy Label Capability and 1861 Entropy Readable Label Depth Using IS-IS", draft-ietf- 1862 isis-mpls-elc-13 (work in progress), May 2020. 1864 [I-D.ietf-ospf-mpls-elc] 1865 Xu, X., Kini, S., Psenak, P., Filsfils, C., Litkowski, S., 1866 and M. Bocci, "Signaling Entropy Label Capability and 1867 Entropy Readable Label Depth Using OSPF", draft-ietf-ospf- 1868 mpls-elc-15 (work in progress), June 2020. 1870 [I-D.ietf-idr-next-hop-capability] 1871 Decraene, B., Kompella, K., and W. Henderickx, "BGP Next- 1872 Hop dependent capabilities", draft-ietf-idr-next-hop- 1873 capability-05 (work in progress), June 2019. 1875 [I-D.ietf-spring-segment-routing-central-epe] 1876 Filsfils, C., Previdi, S., Dawra, G., Aries, E., and D. 1877 Afanasiev, "Segment Routing Centralized BGP Egress Peer 1878 Engineering", draft-ietf-spring-segment-routing-central- 1879 epe-10 (work in progress), December 2017. 1881 [I-D.ietf-idr-long-lived-gr] 1882 Uttaro, J., Chen, E., Decraene, B., and J. Scudder, 1883 "Support for Long-lived BGP Graceful Restart", draft-ietf- 1884 idr-long-lived-gr-00 (work in progress), September 2019. 1886 [RFC7938] Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of 1887 BGP for Routing in Large-Scale Data Centers", RFC 7938, 1888 DOI 10.17487/RFC7938, August 2016, 1889 . 1891 [RFC7752] Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and 1892 S. Ray, "North-Bound Distribution of Link-State and 1893 Traffic Engineering (TE) Information Using BGP", RFC 7752, 1894 DOI 10.17487/RFC7752, March 2016, 1895 . 1897 [RFC8277] Rosen, E., "Using BGP to Bind MPLS Labels to Address 1898 Prefixes", RFC 8277, DOI 10.17487/RFC8277, October 2017, 1899 . 1901 [RFC8667] Previdi, S., Ed., Ginsberg, L., Ed., Filsfils, C., 1902 Bashandy, A., Gredler, H., and B. Decraene, "IS-IS 1903 Extensions for Segment Routing", RFC 8667, 1904 DOI 10.17487/RFC8667, December 2019, 1905 . 1907 [RFC8665] Psenak, P., Ed., Previdi, S., Ed., Filsfils, C., Gredler, 1908 H., Shakir, R., Henderickx, W., and J. Tantsura, "OSPF 1909 Extensions for Segment Routing", RFC 8665, 1910 DOI 10.17487/RFC8665, December 2019, 1911 . 1913 [RFC8669] Previdi, S., Filsfils, C., Lindem, A., Ed., Sreekantiah, 1914 A., and H. Gredler, "Segment Routing Prefix Segment 1915 Identifier Extensions for BGP", RFC 8669, 1916 DOI 10.17487/RFC8669, December 2019, 1917 . 1919 [RFC8663] Xu, X., Bryant, S., Farrel, A., Hassan, S., Henderickx, 1920 W., and Z. 
Li, "MPLS Segment Routing over IP", RFC 8663, 1921 DOI 10.17487/RFC8663, December 2019, 1922 . 1924 [RFC7911] Walton, D., Retana, A., Chen, E., and J. Scudder, 1925 "Advertisement of Multiple Paths in BGP", RFC 7911, 1926 DOI 10.17487/RFC7911, July 2016, 1927 . 1929 [RFC7880] Pignataro, C., Ward, D., Akiya, N., Bhatia, M., and S. 1930 Pallagatti, "Seamless Bidirectional Forwarding Detection 1931 (S-BFD)", RFC 7880, DOI 10.17487/RFC7880, July 2016, 1932 . 1934 [RFC4364] Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private 1935 Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, February 1936 2006, . 1938 [RFC5920] Fang, L., Ed., "Security Framework for MPLS and GMPLS 1939 Networks", RFC 5920, DOI 10.17487/RFC5920, July 2010, 1940 . 1942 [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken, 1943 "Specification of the IP Flow Information Export (IPFIX) 1944 Protocol for the Exchange of Flow Information", STD 77, 1945 RFC 7011, DOI 10.17487/RFC7011, September 2013, 1946 . 1948 [RFC6241] Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., 1949 and A. Bierman, Ed., "Network Configuration Protocol 1950 (NETCONF)", RFC 6241, DOI 10.17487/RFC6241, June 2011, 1951 . 1953 [RFC6020] Bjorklund, M., Ed., "YANG - A Data Modeling Language for 1954 the Network Configuration Protocol (NETCONF)", RFC 6020, 1955 DOI 10.17487/RFC6020, October 2010, 1956 . 1958 [RFC7854] Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP 1959 Monitoring Protocol (BMP)", RFC 7854, 1960 DOI 10.17487/RFC7854, June 2016, 1961 . 1963 [RFC8300] Quinn, P., Ed., Elzur, U., Ed., and C. Pignataro, Ed., 1964 "Network Service Header (NSH)", RFC 8300, 1965 DOI 10.17487/RFC8300, January 2018, 1966 . 1968 [RFC5440] Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation 1969 Element (PCE) Communication Protocol (PCEP)", RFC 5440, 1970 DOI 10.17487/RFC5440, March 2009, 1971 . 1973 [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, 1974 L., Sridhar, T., Bursell, M., and C. Wright, "Virtual 1975 eXtensible Local Area Network (VXLAN): A Framework for 1976 Overlaying Virtualized Layer 2 Networks over Layer 3 1977 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014, 1978 . 1980 [RFC7637] Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network 1981 Virtualization Using Generic Routing Encapsulation", 1982 RFC 7637, DOI 10.17487/RFC7637, September 2015, 1983 . 1985 [RFC3031] Rosen, E., Viswanathan, A., and R. Callon, "Multiprotocol 1986 Label Switching Architecture", RFC 3031, 1987 DOI 10.17487/RFC3031, January 2001, 1988 . 1990 [RFC8014] Black, D., Hudson, J., Kreeger, L., Lasserre, M., and T. 1991 Narten, "An Architecture for Data-Center Network 1992 Virtualization over Layer 3 (NVO3)", RFC 8014, 1993 DOI 10.17487/RFC8014, December 2016, 1994 . 1996 [RFC8402] Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., 1997 Decraene, B., Litkowski, S., and R. Shakir, "Segment 1998 Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, 1999 July 2018, . 2001 [RFC5883] Katz, D. and D. Ward, "Bidirectional Forwarding Detection 2002 (BFD) for Multihop Paths", RFC 5883, DOI 10.17487/RFC5883, 2003 June 2010, . 2005 [RFC8231] Crabbe, E., Minei, I., Medved, J., and R. Varga, "Path 2006 Computation Element Communication Protocol (PCEP) 2007 Extensions for Stateful PCE", RFC 8231, 2008 DOI 10.17487/RFC8231, September 2017, 2009 . 2011 [RFC8281] Crabbe, E., Minei, I., Sivabalan, S., and R. 
Varga, "Path 2012 Computation Element Communication Protocol (PCEP) 2013 Extensions for PCE-Initiated LSP Setup in a Stateful PCE 2014 Model", RFC 8281, DOI 10.17487/RFC8281, December 2017, 2015 . 2017 [RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP 2018 Authentication Option", RFC 5925, DOI 10.17487/RFC5925, 2019 June 2010, . 2021 [RFC8253] Lopez, D., Gonzalez de Dios, O., Wu, Q., and D. Dhody, 2022 "PCEPS: Usage of TLS to Provide a Secure Transport for the 2023 Path Computation Element Communication Protocol (PCEP)", 2024 RFC 8253, DOI 10.17487/RFC8253, October 2017, 2025 . 2027 [RFC6790] Kompella, K., Drake, J., Amante, S., Henderickx, W., and 2028 L. Yong, "The Use of Entropy Labels in MPLS Forwarding", 2029 RFC 6790, DOI 10.17487/RFC6790, November 2012, 2030 . 2032 [RFC8662] Kini, S., Kompella, K., Sivabalan, S., Litkowski, S., 2033 Shakir, R., and J. Tantsura, "Entropy Label for Source 2034 Packet Routing in Networking (SPRING) Tunnels", RFC 8662, 2035 DOI 10.17487/RFC8662, December 2019, 2036 . 2038 [RFC8491] Tantsura, J., Chunduri, U., Aldrin, S., and L. Ginsberg, 2039 "Signaling Maximum SID Depth (MSD) Using IS-IS", RFC 8491, 2040 DOI 10.17487/RFC8491, November 2018, 2041 . 2043 [RFC8476] Tantsura, J., Chunduri, U., Aldrin, S., and P. Psenak, 2044 "Signaling Maximum SID Depth (MSD) Using OSPF", RFC 8476, 2045 DOI 10.17487/RFC8476, December 2018, 2046 . 2048 Authors' Addresses 2050 Colin Bookham (editor) 2051 Nokia 2052 740 Waterside Drive 2053 Almondsbury, Bristol 2054 UK 2056 Email: colin.bookham@nokia.com 2058 Andrew Stone 2059 Nokia 2060 600 March Road 2061 Kanata, Ontario 2062 Canada 2064 Email: andrew.stone@nokia.com 2066 Jeff Tantsura 2067 Apstra 2068 333 Middlefield Road #200 2069 Menlo Park, CA 94025 2070 USA 2072 Email: jefftant.ietf@gmail.com 2073 Muhammad Durrani 2074 Equinix Inc 2075 1188 Arques Ave 2076 Sunnyvale CA 2077 USA 2079 Email: mdurrani@equinix.com 2081 Bruno Decraene 2082 Orange 2083 38-40 Rue de General Leclerc 2084 92794 Issey Moulineaux cedex 9 2085 France 2087 Email: bruno.decraene@orange.com