Network Working Group                                   C. Filsfils, Ed.
Internet-Draft                                                S. Previdi
Intended status: Informational                       Cisco Systems, Inc.
Expires: December 14, 2018                                 G. Dawra, Ed.
                                                                LinkedIn
                                                           W. Henderickx
                                                                   Nokia
                                                               D. Cooper
                                                                 Level 3
                                                           June 12, 2018


       Interconnecting Millions of Endpoints with Segment Routing
            draft-filsfils-spring-large-scale-interconnect-10

Abstract

   This document describes an application of Segment Routing to scale
   the network to support hundreds of thousands of network nodes and
   tens of millions of physical underlay endpoints.
   This use case can be applied to the interconnection of massive-scale
   DCs and/or large aggregation networks.  Forwarding tables of midpoint
   and leaf nodes only require a few tens of thousands of entries.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 14, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Reference Design
   4.  Control Plane
   5.  Illustration of the Scale
   6.  Design Options
     6.1.  SRGB Size
     6.2.  Redistribution of Agg Nodes Routes
     6.3.  Sizing and Hierarchy
     6.4.  Local Segments to Hosts/Servers
     6.5.  Compressed SRTE Policies
   7.  Deployment Model
   8.  Benefits
     8.1.  Simplified Operations
     8.2.  Inter-domain SLA
     8.3.  Scale
     8.4.  ECMP
   9.  IANA Considerations
   10. Manageability Considerations
   11. Security Considerations
   12. Acknowledgements
   13. Contributors
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Authors' Addresses

1.  Introduction

   This document describes how Segment Routing (SR) can be used to
   interconnect millions of endpoints.

2.  Terminology

   The following terms and abbreviations are used in this document:

      Term         Definition
      ---------------------------------------------------------
      Agg          Aggregation
      BGP          Border Gateway Protocol
      DC           Data Center
      DCI          Data Center Interconnect
      ECMP         Equal-Cost Multipath
      FIB          Forwarding Information Base
      LDP          Label Distribution Protocol
      LFIB         Label Forwarding Information Base
      MPLS         Multiprotocol Label Switching
      PCE          Path Computation Element
      PCEP         Path Computation Element Protocol
      PW           Pseudowire
      SLA          Service Level Agreement
      SR           Segment Routing
      SRTE Policy  Segment Routing Traffic Engineering Policy
      TE           Traffic Engineering
      TI-LFA       Topology Independent - Loop Free Alternative

3.  Reference Design

   The network diagram below describes the reference network topology
   used in this document:

      +-------+ +--------+ +--------+ +-------+ +-------+
      A       DCI1      Agg1      Agg3      DCI3       Z
      |  DC1  | |   M1   | |   C    | |  M2   | |  DC2  |
      |       DCI2      Agg2      Agg4      DCI4       |
      +-------+ +--------+ +--------+ +-------+ +-------+

                     Figure 1: Reference Topology

   The following applies to the reference topology above:

      Independent ISIS-OSPF/SR instance in the core (C) region.

      Independent ISIS-OSPF/SR instance in the Metro1 (M1) region.

      Independent ISIS-OSPF/SR instance in the Metro2 (M2) region.

      BGP/SR in DC1.

      BGP/SR in DC2.

      Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to M
      (M1 and M2) and from M to DC domains.

      No other route is advertised or redistributed between regions.

      The same homogeneous SRGB is used throughout the domains (e.g.,
      16000-23999).

      Unique SRGB sub-ranges are allocated to each metro (M) and core
      (C) domain:

         The 16000-16999 range is allocated to the core (C)
         domain/region.

         The 17000-17999 range is allocated to the M1 domain/region.

         The 18000-18999 range is allocated to the M2 domain/region.
         Specifically, the Agg1 router has SID 16001 allocated, and the
         Agg2 router has SID 16002 allocated.

         Specifically, the Agg3 router has SID 16003 allocated, and the
         anycast SID for Agg3 and Agg4 is 16006.

         Specifically, the DCI3 router has SID 18003 allocated, and the
         anycast SID for DCI3 and DCI4 is 18006.

         Specifically, at the Agg1 router, Binding SID 4001 leads to
         the DCI pair DCI3/DCI4 via a specific low-latency path {16002,
         16003, 18006}.

      The same SRGB sub-range is reused within each DC (DC1 and DC2)
      region (e.g., 20000-23999).  Specifically, the 20000-23999 range
      is used in both the DC1 and DC2 regions, and nodes A and Z both
      have SID 20001 allocated to them.

4.  Control Plane

   This section provides a high-level description of an implemented
   control plane.

   The mechanisms through which SRTE Policies are defined, computed,
   and programmed in the source nodes are outside the scope of this
   document.

   Typically, a controller or a service orchestration system programs
   node A with a pseudowire (PW) to a remote next hop Z with a given
   SLA contract (e.g., a low-latency path, disjointness from a specific
   core plane, or disjointness from a different PW service).

   Node A automatically detects that it does not have reachability to
   Z.  It then automatically sends a PCEP request to an SR PCE for an
   SRTE Policy that provides reachability to Z with the requested SLA.

   The SR PCE consists of two components: a multi-domain topology and a
   computation engine.  The multi-domain topology is continuously
   refreshed through BGP-LS feeds from each domain.  The computation
   engine implements Traffic Engineering (TE) algorithms designed
   specifically for SR path expression.  Upon receiving the PCEP
   request, the SR PCE computes the requested path.  The path is
   expressed as a list of segments (e.g., {16003, 18006, 20001}) and
   provided to node A.
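   As a purely illustrative sketch (not part of the draft's control
   plane), the translation of a computed inter-domain path into a
   segment list can be modeled as a simple lookup over the SID
   allocations of Section 3.  The function and dictionary names below
   are hypothetical.

   ```python
   # SID allocations from the reference design (Section 3).
   SIDS = {
       "Agg1": 16001,                 # core domain sub-range 16000-16999
       "Agg2": 16002,
       "Agg3": 16003,
       "Agg3/Agg4-anycast": 16006,
       "DCI3": 18003,                 # M2 domain sub-range 18000-18999
       "DCI3/DCI4-anycast": 18006,
       "Z": 20001,                    # DC sub-range 20000-23999, reused per DC
   }

   def segment_list(waypoints):
       """Translate symbolic waypoints into an MPLS segment list."""
       return [SIDS[w] for w in waypoints]

   # The path from A to Z via Agg3 and the DCI3/DCI4 anycast pair,
   # as returned to node A in the PCEP reply.
   path = segment_list(["Agg3", "DCI3/DCI4-anycast", "Z"])
   print(path)  # [16003, 18006, 20001]
   ```

   Because all domains share the same homogeneous SRGB, the resulting
   label values are meaningful end to end even though each domain runs
   an independent IGP/SR instance.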
   The SR PCE logs the request as a stateful query and hence can
   recompute the path at each network topology change.

   Node A receives the PCEP reply with the path (expressed as a segment
   list).  Node A installs the received SRTE Policy in the data plane.
   Node A then automatically steers the PW into that SRTE Policy.

5.  Illustration of the Scale

   According to the reference topology described in Figure 1, the
   following assumptions are made:

      There is one core domain, and there are 100 leaf (metro) domains.

      The core domain includes 200 nodes.

      Two nodes connect each leaf (metro) domain.  Each node connecting
      a leaf domain has a SID allocated.  Each pair of nodes connecting
      a leaf domain also has a common anycast SID.  This yields up to
      300 prefix segments in total.

      A core node connects only one leaf domain.

      Each leaf domain has 6,000 leaf node segments.  Each leaf node
      has 500 endpoints attached and thus 500 adjacency segments.  In
      total, there are 3 million endpoints per leaf domain.

   Based on the above, the network scaling numbers are as follows:

      6,000 leaf node segments multiplied by 100 leaf domains: 600,000
      nodes.

      600,000 nodes multiplied by 500 endpoints: 300 million endpoints.

   The node scaling numbers are as follows:

      Leaf node segment scale: 6,000 leaf node segments + 300 core node
      segments + 500 adjacency segments = 6,800 segments

      Core node segment scale: 6,000 leaf domain segments + 300 core
      domain segments = 6,300 segments

   In the above calculations, the link adjacency segments are not taken
   into account.  These are local segments and are typically fewer than
   100 per node.

   It has to be noted that, depending on leaf node FIB capabilities,
   leaf domains could be split into multiple smaller domains.
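   The scaling arithmetic above can be checked with a short script
   (illustrative only; the variable names are not from the draft):

   ```python
   # Assumptions from Section 5 of the draft.
   leaf_domains = 100
   leaf_nodes_per_domain = 6_000
   endpoints_per_leaf_node = 500
   core_node_segments = 300      # up to 300 prefix segments in the core

   # Network-wide scale.
   total_nodes = leaf_domains * leaf_nodes_per_domain
   total_endpoints = total_nodes * endpoints_per_leaf_node

   # Per-node segment scale (link adjacency segments excluded).
   leaf_node_segment_scale = (leaf_nodes_per_domain
                              + core_node_segments
                              + endpoints_per_leaf_node)
   core_node_segment_scale = leaf_nodes_per_domain + core_node_segments

   print(total_nodes)               # 600000
   print(total_endpoints)           # 300000000
   print(leaf_node_segment_scale)   # 6800
   print(core_node_segment_scale)   # 6300
   ```

   The key observation is that per-node FIB requirements stay in the
   thousands even though the interconnected network holds hundreds of
   millions of endpoints.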
   In the above example, the leaf domains could be split into 6 smaller
   domains so that each leaf node only needs to learn 1,000 leaf node
   segments + 300 core node segments + 500 adjacency segments, which
   gives a total of 1,800 segments.

6.  Design Options

   This section describes several design options related to the
   illustration in the previous section.

6.1.  SRGB Size

   In the simplified illustrations of this document, we picked a small
   homogeneous SRGB range of 16000-23999.  In practice, a large-scale
   design would use a bigger range, such as 16000-80000 or even larger.
   A larger range provides allocations for various Traffic Engineering
   applications within a given domain.

6.2.  Redistribution of Agg Nodes Routes

   The operator might choose not to redistribute the Agg nodes routes
   into the Metro/DC domains.  In that case, more segments are required
   in order to express an inter-domain path.

   For example, node A would use an SRTE Policy {DCI1, Agg1, Agg3,
   DCI3, Z} in order to reach Z, instead of {Agg3, DCI3, Z} in the
   reference design.

6.3.  Sizing and Hierarchy

   The operator is free to choose among a small number of large leaf
   domains, a large number of small leaf domains, or a mix of small and
   large core/leaf domains.

   The operator is free to use a 2-tier (Core/Metro) or 3-tier
   (Core/Metro/DC) design.

6.4.  Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g., node Z) in
   order to identify locally attached hosts (or VMs).  For example, if
   node Z has bound a local segment 40001 to a local host ZH1, then
   node A uses the following SRTE Policy in order to reach that host:
   {16006, 18006, 20001, 40001}.  Such a local segment could represent
   the NID (Network Interface Device) in the context of the SP access
   network, or a VM in the context of the DC network.

6.5.  Compressed SRTE Policies

   As an example, and according to Section 3, we assume that node A can
   reach node Z (e.g., with a low-latency SLA contract) via the SRTE
   Policy consisting of the path Agg1, Agg2, Agg3, DCI3/DCI4 (anycast),
   Z.  The path is represented by the segment list {16001, 16002,
   16003, 18006, 20001}.

   It is clear that the control-plane solution can install an SRTE
   Policy {16002, 16003, 18006} at Agg1, collect the Binding SID
   allocated by Agg1 to that policy (e.g., 4001), and hence program
   node A with the compressed SRTE Policy {16001, 4001, 20001}.

   From node A, 16001 leads to Agg1.  Once at Agg1, 4001 leads to the
   DCI pair (DCI3, DCI4) via a specific low-latency path {16002, 16003,
   18006}.  Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE Policies allow the
   compression of end-to-end SRTE Policies.

   The segment list {16001, 4001, 20001} expresses the same path as
   {16001, 16002, 16003, 18006, 20001} but with two fewer segments.

   Binding SIDs also provide an inherent churn protection.

   When the core topology changes, the control plane can update the
   low-latency SRTE Policy from Agg1 to the DCI pair of DC2 without
   updating the SRTE Policy from A to Z.

7.  Deployment Model

   It is expected that this design will be deployed both as a
   greenfield design and in interworking (brownfield) with the Seamless
   MPLS design described in [I-D.ietf-mpls-seamless-mpls].

8.  Benefits

   The design options illustrated in this document allow
   interconnection on a very large scale.  Millions of endpoints across
   different domains can be interconnected.

8.1.  Simplified Operations

   Two protocols have been removed from the network: LDP and RSVP-TE.
   No new protocol has been introduced.  The design leverages the core
   IP protocols IS-IS, OSPF, BGP, and PCEP with straightforward SR
   extensions.

8.2.  Inter-domain SLA

   Fast reroute and resiliency are provided by TI-LFA with sub-50-msec
   fast reroute upon link/node/SRLG failure.  TI-LFA is described in
   [I-D.francois-rtgwg-segment-routing-ti-lfa].

   The use of anycast SIDs also provides improved availability and
   resiliency.

   Inter-domain SLAs can be delivered, e.g., latency-optimized vs.
   cost-optimized paths, disjointness from backbone planes,
   disjointness from other services, and disjointness between primary
   and backup paths.

   Existing inter-domain solutions (e.g., Seamless MPLS) do not provide
   any support for SLA contracts.  They only provide best-effort
   reachability across domains.

8.3.  Scale

   In addition to the elimination of two control-plane protocols, per-
   service midpoint states have also been removed from the network.

8.4.  ECMP

   Each policy (intra-domain or inter-domain, with or without TE) is
   expressed as a list of segments.  Since each segment is optimized
   for ECMP, the entire policy is optimized for ECMP.  The ECMP gain of
   anycast prefix segments should also be considered (e.g., 16001 load-
   shares across any gateway from the M1 leaf domain to the Core, and
   16002 load-shares across any gateway from the Core to the M1 leaf
   domain).

9.  IANA Considerations

   This document does not make any IANA request.

10.  Manageability Considerations

   This document describes an application of Segment Routing over the
   MPLS data plane.  Segment Routing does not introduce any change in
   the MPLS data plane.  The manageability considerations described in
   [I-D.ietf-spring-segment-routing] apply to the MPLS data plane when
   used with Segment Routing.

11.  Security Considerations

   This document does not introduce additional security requirements
   and mechanisms other than the ones described in
   [I-D.ietf-spring-segment-routing].

12.  Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche, Steve
   Braaten, and Francis Ferguson for their contributions to the content
   of this document.

13.  Contributors

   The following people have substantially contributed to the editing
   of this document:

      Dennis Cai
      Individual

      Tim Laberge
      Individual

      Steven Lin
      Google Inc.

      Bruno Decraene
      Orange

      Luay Jalil
      Verizon

      Jeff Tantsura
      Individual

      Rob Shakir
      Google

14.  References

14.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

14.2.  Informative References

   [I-D.francois-rtgwg-segment-routing-ti-lfa]
              Francois, P., Bashandy, A., Filsfils, C., Decraene, B.,
              and S. Litkowski, "Topology Independent Fast Reroute
              using Segment Routing", draft-francois-rtgwg-segment-
              routing-ti-lfa-04 (work in progress), December 2016.

   [I-D.ietf-mpls-seamless-mpls]
              Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz,
              M., and D. Steinberg, "Seamless MPLS Architecture",
              draft-ietf-mpls-seamless-mpls-07 (work in progress),
              June 2014.

   [I-D.ietf-spring-segment-routing]
              Filsfils, C., Previdi, S., Ginsberg, L., Decraene, B.,
              Litkowski, S., and R. Shakir, "Segment Routing
              Architecture", draft-ietf-spring-segment-routing-15
              (work in progress), January 2018.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   Belgium

   Email: cfilsfil@cisco.com

   Stefano Previdi
   Cisco Systems, Inc.
   Via Del Serafico, 200
   Rome  00142
   Italy

   Email: stefano@previdi.net

   Gaurav Dawra (editor)
   LinkedIn
   USA

   Email: gdawra.ietf@gmail.com

   Wim Henderickx
   Nokia
   Copernicuslaan 50
   Antwerp  2018
   Belgium

   Email: wim.henderickx@nokia.com

   Dave Cooper
   Level 3

   Email: Dave.Cooper@Level3.com