Network Working Group                                   C. Filsfils, Ed.
Internet-Draft                                                S. Previdi
Intended status: Informational                       Cisco Systems, Inc.
Expires: February 3, 2019                                  G. Dawra, Ed.
                                                                LinkedIn
                                                           W. Henderickx
                                                                   Nokia
                                                               D. Cooper
                                                                 Level 3
                                                          August 2, 2018

       Interconnecting Millions Of Endpoints With Segment Routing
           draft-filsfils-spring-large-scale-interconnect-11

Abstract

   This document describes an application of Segment Routing to scale
   the network to support hundreds of thousands of network nodes, and
   tens of millions of physical underlay endpoints. This use-case can
   be applied to the interconnection of massive-scale DCs and/or large
   aggregation networks. Forwarding tables of midpoint and leaf nodes
   only require a few tens of thousands of entries.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on February 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1. Introduction
   2. Terminology
   3. Reference Design
   4. Control Plane
   5. Illustration of the scale
   6. Design Options
     6.1. Segment Routing Global Block (SRGB) Size
     6.2. Redistribution of Agg nodes routes
     6.3. Sizing and hierarchy
     6.4. Local Segments to Hosts/Servers
     6.5. Compressed SRTE policies
   7. Deployment Model
   8. Benefits
     8.1. Simplified operations
     8.2. Inter-domain SLA
     8.3. Scale
     8.4. ECMP
   9. IANA Considerations
   10. Manageability Considerations
   11. Security Considerations
   12. Acknowledgements
   13. Contributors
   14. Informative References
   Authors' Addresses

1. Introduction

   This document describes how SR can be used to interconnect millions
   of endpoints.

2. Terminology

   The following terms and abbreviations are used in this document:

   Term          Definition
   ---------------------------------------------------------
   Agg           Aggregation
   BGP           Border Gateway Protocol
   DC            Data Center
   DCI           Data Center Interconnect
   ECMP          Equal-Cost Multipath
   FIB           Forwarding Information Base
   LDP           Label Distribution Protocol
   LFIB          Label Forwarding Information Base
   MPLS          Multiprotocol Label Switching
   PCE           Path Computation Element
   PCEP          Path Computation Element Protocol
   PW            Pseudowire
   SLA           Service Level Agreement
   SR            Segment Routing
   SRTE Policy   Segment Routing Traffic Engineering Policy
   TE            Traffic Engineering
   TI-LFA        Topology Independent Loop-Free Alternate

3. Reference Design

   The network diagram below describes the reference network topology
   used in this document:

      +-------+ +--------+ +--------+ +-------+ +-------+
      A       DCI1       Agg1       Agg3      DCI3      Z
      |  DC1  | |   M1   | |   C    | |  M2   | |  DC2  |
      |       DCI2       Agg2       Agg4      DCI4      |
      +-------+ +--------+ +--------+ +-------+ +-------+

                    Figure 1: Reference Topology

   The following applies to the reference topology above:

      Independent ISIS-OSPF/SR instance in core (C) region.

      Independent ISIS-OSPF/SR instance in Metro1 (M1) region.

      Independent ISIS-OSPF/SR instance in Metro2 (M2) region.

      BGP/SR in DC1.

      BGP/SR in DC2.

      Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to M
      (M1 and M2) and from M to DC domains.

      No other route is advertised or redistributed between regions.

      The same homogeneous SRGB is used throughout the domains (e.g.
      16000-23999).

      Unique SRGB sub-ranges are allocated to each metro (M) and core
      (C) domain:

         16000-16999 range is allocated to the core (C) domain/region.

         17000-17999 range is allocated to the M1 domain/region.

         18000-18999 range is allocated to the M2 domain/region.

         Specifically, Agg1 router has SID 16001 allocated and Agg2
         router has SID 16002 allocated.

         Specifically, Agg3 router has SID 16003 allocated and the
         anycast SID for Agg3 and Agg4 is 16006.

         Specifically, DCI3 router has SID 18003 allocated and the
         anycast SID for DCI3 and DCI4 is 18006.

         Specifically, at the Agg1 router, Binding SID 4001 leads to the
         DCI pair (DCI3, DCI4) via a specific low-latency path {16002,
         16003, 18006}.

      The same SRGB sub-range is re-used within each DC (DC1 and DC2)
      region, e.g. 20000-23999. Specifically, range 20000-23999 is used
      in both DC1 and DC2 regions and nodes A and Z both have SID 20001
      allocated to them.

4. Control Plane

   This section provides a high-level description of how a control
   plane could be implemented using protocol components already defined
   in other RFCs.

   The mechanisms through which SRTE Policies are defined, computed and
   programmed in the source nodes are outside the scope of this
   document.

   Typically, a controller or a service orchestration system programs
   node A with a pseudowire (PW) to a remote next-hop Z with a given SLA
   contract (e.g. low-latency path, disjointness from a specific core
   plane, disjointness from a different PW service, etc.).

   Node A automatically detects that it does not have reachability to Z.
   It then automatically sends a PCEP request to an SR PCE for an SRTE
   policy that provides reachability to Z with the requested SLA.

   The SR PCE [RFC4655] is made of two components: a multi-domain
   topology and a computation engine. The multi-domain topology is
   continuously refreshed through BGP-LS [RFC7752] feeds from each
   domain. The computation engine implements Traffic Engineering (TE)
   algorithms designed specifically for SR path expression. Upon
   receiving the PCEP [RFC5440] request, the SR PCE computes the
   requested path.
   The path is expressed through a list of segments (e.g. {16003,
   18006, 20001}) and provided to node A.

   The SR PCE logs the request as a stateful query and hence is capable
   of recomputing the path at each network topology change.

   Node A receives the PCEP reply with the path (expressed as a segment
   list). Node A installs the received SRTE policy in the dataplane.
   Node A then automatically steers the PW into that SRTE policy.

5. Illustration of the scale

   According to the reference topology described in Figure 1, the
   following assumptions are made:

      There is 1 core domain and 100 leaf (metro) domains.

      The core domain includes 200 nodes.

      Two nodes connect each leaf (metro) domain. Each node connecting
      a leaf domain has a SID allocated. Each pair of nodes connecting
      a leaf domain also has a common anycast SID. This brings the
      total to 300 prefix segments (200 node SIDs plus 100 anycast
      SIDs).

      A core node connects only one leaf domain.

      Each leaf domain has 6,000 leaf node segments. Each leaf node has
      500 endpoints attached, thus 500 adjacency segments. In total,
      this gives 3 million endpoints for a leaf domain.

   Based on the above, the network scaling numbers are as follows:

      6,000 leaf node segments multiplied by 100 leaf domains: 600,000
      nodes.

      600,000 nodes multiplied by 500 endpoints: 300 million endpoints.

   The node scaling numbers are as follows:

      Leaf node segment scale: 6,000 leaf node segments + 300 core node
      segments + 500 adjacency segments = 6,800 segments

      Core node segment scale: 6,000 leaf domain segments + 300 core
      domain segments = 6,300 segments

   In the above calculations, the link adjacency segments are not taken
   into account. These are local segments and, typically, fewer than
   100 per node.

   It has to be noted that, depending on leaf node FIB capabilities,
   leaf domains could be split into multiple smaller domains.
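
   The scaling arithmetic of this section can be reproduced with a
   short Python sketch. Only the numbers come from the text above; the
   constant and function names are illustrative assumptions and are not
   defined anywhere in this design.

```python
"""Reproduce the scale arithmetic of the illustration.

All names are illustrative assumptions; only the numbers come from the
text of this section.
"""

LEAF_DOMAINS = 100               # leaf (metro) domains
LEAF_NODES_PER_DOMAIN = 6_000    # leaf node segments per leaf domain
ENDPOINTS_PER_LEAF_NODE = 500    # one adjacency segment per endpoint
CORE_NODES_PER_LEAF_DOMAIN = 2   # each connecting node has its own SID


def core_prefix_segments(leaf_domains: int) -> int:
    """Node SIDs of the connecting nodes plus one anycast SID per pair."""
    node_sids = CORE_NODES_PER_LEAF_DOMAIN * leaf_domains
    anycast_sids = leaf_domains
    return node_sids + anycast_sids


def leaf_node_fib(leaf_node_segments: int, core_segments: int,
                  adjacency_segments: int) -> int:
    """Number of segments a single leaf node has to hold."""
    return leaf_node_segments + core_segments + adjacency_segments


core_segments = core_prefix_segments(LEAF_DOMAINS)
total_leaf_nodes = LEAF_NODES_PER_DOMAIN * LEAF_DOMAINS
total_endpoints = total_leaf_nodes * ENDPOINTS_PER_LEAF_NODE
leaf_scale = leaf_node_fib(LEAF_NODES_PER_DOMAIN, core_segments,
                           ENDPOINTS_PER_LEAF_NODE)
# The text counts leaf domain segments plus core domain segments here.
core_scale = LEAF_NODES_PER_DOMAIN + core_segments

if __name__ == "__main__":
    print(f"core prefix segments: {core_segments:,}")
    print(f"leaf nodes in total:  {total_leaf_nodes:,}")
    print(f"endpoints in total:   {total_endpoints:,}")
    print(f"leaf node scale:      {leaf_scale:,}")
    print(f"core node scale:      {core_scale:,}")
```

   Passing a smaller leaf node count to leaf_node_fib() shows how the
   size of a leaf domain drives the per-node segment count.
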
   In the above example, the leaf domains could be split into 6 smaller
   domains so that each leaf node only needs to learn 1,000 leaf node
   segments + 300 core node segments + 500 adjacency segments, which
   gives a total of 1,800 segments.

6. Design Options

   This section describes several design options for the illustration
   in the previous section.

6.1. Segment Routing Global Block (SRGB) Size

   In the simplified illustrations of this document, we picked a small
   homogeneous SRGB range of 16000-23999. In practice, a large-scale
   design would use a bigger range such as 16000-80000, or even larger.
   A larger range provides allocations for various Traffic Engineering
   applications within a given domain.

6.2. Redistribution of Agg nodes routes

   The operator might choose to not redistribute the Agg nodes routes
   into the Metro/DC domains. In that case, more segments are required
   in order to express an inter-domain path.

   For example, node A would use an SRTE Policy {DCI1, Agg1, Agg3, DCI3,
   Z} in order to reach Z instead of {Agg3, DCI3, Z} in the reference
   design.

6.3. Sizing and hierarchy

   The operator is free to choose among a small number of larger leaf
   domains, a large number of small leaf domains or a mix of small and
   large core/leaf domains.

   The operator is free to use a 2-tier design (Core/Metro) or a 3-tier
   design (Core/Metro/DC).

6.4. Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g. node Z) in
   order to identify locally-attached hosts (or VMs). For example, if
   node Z has bound a local segment 40001 to a local host ZH1, then node
   A uses the following SRTE Policy in order to reach that host: {16006,
   18006, 20001, 40001}. Such a local segment could represent the NID
   (Network Interface Device) in the context of the SP access network,
   or a VM in the context of the DC network.

6.5. Compressed SRTE policies

   As an example and according to Section 3, we assume node A can reach
   node Z (e.g., with a low-latency SLA contract) via the SRTE policy
   consisting of the path: Agg1, Agg2, Agg3, DCI3/4 (anycast), Z. The
   path is represented by the segment list: {16001, 16002, 16003, 18006,
   20001}.

   The control-plane solution can install an SRTE Policy {16002, 16003,
   18006} at Agg1, collect the Binding SID allocated by Agg1 to that
   policy (e.g. 4001) and hence program node A with the compressed SRTE
   Policy {16001, 4001, 20001}.

   From node A, 16001 leads to Agg1. Once at Agg1, 4001 leads to the
   DCI pair (DCI3, DCI4) via a specific low-latency path {16002, 16003,
   18006}. Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE Policies allow
   end-to-end SRTE Policies to be compressed.

   The segment list {16001, 4001, 20001} expresses the same path as
   {16001, 16002, 16003, 18006, 20001} but with two fewer segments.

   Binding SIDs also provide inherent churn protection.

   When the core topology changes, the control-plane can update the
   low-latency SRTE Policy from Agg1 to the DCI pair leading to DC2
   without updating the SRTE Policy from A to Z.

7. Deployment Model

   This design is expected to be deployed both as a green field and in
   interworking (brown field) with existing MPLS designs across
   multiple domains.

8. Benefits

   The design options illustrated in this document allow
   interconnection on a very large scale. Millions of endpoints across
   different domains can be interconnected.

8.1. Simplified operations

   Two protocols have been removed from the network: LDP and RSVP-TE.
   No new protocol has been introduced. The design leverages the core
   IP protocols: ISIS, OSPF, BGP, PCEP with straightforward SR
   extensions.

8.2. Inter-domain SLA

   Fast reroute and resiliency are provided by TI-LFA with sub-50msec
   FRR upon Link/Node/SRLG failure. TI-LFA is described in
   [I-D.bashandy-rtgwg-segment-routing-ti-lfa].

   The use of anycast SIDs also provides an improvement in availability
   and resiliency.

   Inter-domain SLAs can be delivered, e.g., latency vs. cost optimized
   path, disjointness from backbone planes, disjointness from other
   services, disjointness between primary and backup paths.

   Existing inter-domain solutions (Seamless MPLS) do not provide any
   support for SLA contracts. They only provide best-effort
   reachability across domains.

8.3. Scale

   In addition to having eliminated two control plane protocols,
   per-service midpoint states have also been removed from the network.

8.4. ECMP

   Each policy (intra or inter-domain, with or without TE) is expressed
   as a list of segments. Since each segment is optimized for ECMP,
   the entire policy is optimized for ECMP. The ECMP gain of anycast
   prefix segments should also be considered (e.g. 16001 load-shares
   across any gateway from M1 leaf domain to Core and 16002 load-shares
   across any gateway from Core to M1 leaf domain).

9. IANA Considerations

   This document does not make any IANA request.

10. Manageability Considerations

   This document describes an application of Segment Routing over the
   MPLS data plane. Segment Routing does not introduce any change in
   the MPLS data plane. Manageability considerations described in
   [I-D.ietf-spring-segment-routing] apply to the MPLS data plane when
   used with Segment Routing.

11. Security Considerations

   This document does not introduce additional security requirements and
   mechanisms other than the ones described in
   [I-D.ietf-spring-segment-routing].

12. Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche, Steve Braaten
   and Francis Ferguson for their contribution to the content of this
   document.

13. Contributors

   The following people have substantially contributed to the editing of
   this document:

   Dennis Cai
   Individual

   Tim Laberge
   Individual

   Steven Lin
   Google Inc.

   Bruno Decraene
   Orange

   Luay Jalil
   Verizon

   Jeff Tantsura
   Individual

   Rob Shakir
   Google

14. Informative References

   [I-D.bashandy-rtgwg-segment-routing-ti-lfa]
              Bashandy, A., Filsfils, C., Decraene, B., Litkowski, S.,
              Francois, P., and D. Voyer, "Topology Independent Fast
              Reroute using Segment Routing",
              draft-bashandy-rtgwg-segment-routing-ti-lfa-04 (work in
              progress), April 2018.

   [I-D.ietf-mpls-seamless-mpls]
              Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz,
              M., and D. Steinberg, "Seamless MPLS Architecture",
              draft-ietf-mpls-seamless-mpls-07 (work in progress),
              June 2014.

   [I-D.ietf-spring-segment-routing]
              Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B.,
              Litkowski, S., and R. Shakir, "Segment Routing
              Architecture", draft-ietf-spring-segment-routing-15 (work
              in progress), January 2018.

   [RFC4655]  Farrel, A., Vasseur, J., and J. Ash, "A Path Computation
              Element (PCE)-Based Architecture", RFC 4655,
              DOI 10.17487/RFC4655, August 2006.

   [RFC5440]  Vasseur, JP., Ed. and JL. Le Roux, Ed., "Path Computation
              Element (PCE) Communication Protocol (PCEP)", RFC 5440,
              DOI 10.17487/RFC5440, March 2009.

   [RFC7752]  Gredler, H., Ed., Medved, J., Previdi, S., Farrel, A., and
              S. Ray, "North-Bound Distribution of Link-State and
              Traffic Engineering (TE) Information Using BGP", RFC 7752,
              DOI 10.17487/RFC7752, March 2016.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   Belgium

   Email: cfilsfil@cisco.com

   Stefano Previdi
   Cisco Systems, Inc.
   Via Del Serafico, 200
   Rome 00142
   Italy

   Email: stefano@previdi.net

   Gaurav Dawra (editor)
   LinkedIn
   USA

   Email: gdawra.ietf@gmail.com

   Wim Henderickx
   Nokia
   Copernicuslaan 50
   Antwerp 2018
   Belgium

   Email: wim.henderickx@nokia.com

   Dave Cooper
   Level 3

   Email: Dave.Cooper@Level3.com