Network Working Group                                  C. Filsfils, Ed.
Internet-Draft                                                S. Previdi
Intended status: Informational                       Cisco Systems, Inc.
Expires: September 30, 2018                                G. Dawra, Ed.
                                                                LinkedIn
                                                           W. Henderickx
                                                                   Nokia
                                                               D. Cooper
                                                                 Level 3
                                                          March 29, 2018

        Interconnecting Millions of Endpoints with Segment Routing
            draft-filsfils-spring-large-scale-interconnect-09

Abstract

   This document describes an application of Segment Routing (SR) to
   scale the network to support hundreds of thousands of network nodes
   and tens of millions of physical underlay endpoints.  This use case
   can be applied to the interconnection of massive-scale Data Centers
   (DCs) and/or large aggregation networks.  Forwarding tables of
   midpoint and leaf nodes only require a few tens of thousands of
   entries.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 30, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Reference Design
   4.  Control Plane
   5.  Illustration of the scale
   6.  Design Options
     6.1.  SRGB Size
     6.2.  Redistribution of Agg nodes routes
     6.3.  Sizing and hierarchy
     6.4.  Local Segments to Hosts/Servers
     6.5.  Compressed SRTE policies
   7.  Deployment Model
   8.  Benefits
     8.1.  Simplified operations
     8.2.  Inter-domain SLA
     8.3.  Scale
     8.4.  ECMP
   9.  IANA Considerations
   10. Acknowledgements
   11. Contributors
   12. Informative References
   Authors' Addresses

1.  Introduction

   This document describes how Segment Routing (SR) can be used to
   interconnect millions of endpoints.

2.  Terminology

   The following terms and abbreviations are used in this document:

      Term          Definition
      ---------------------------------------------------------
      Agg           Aggregation
      BGP           Border Gateway Protocol
      DC            Data Center
      DCI           Data Center Interconnect
      ECMP          Equal-Cost Multipath
      FIB           Forwarding Information Base
      LDP           Label Distribution Protocol
      LFIB          Label Forwarding Information Base
      MPLS          Multiprotocol Label Switching
      PCE           Path Computation Element
      PCEP          Path Computation Element Protocol
      PW            Pseudowire
      SLA           Service Level Agreement
      SR            Segment Routing
      SRTE Policy   Segment Routing Traffic Engineering Policy
      TE            Traffic Engineering
      TI-LFA        Topology Independent Loop-Free Alternate

3.  Reference Design

   The network diagram below describes the reference network topology
   used in this document:

       +-------+ +--------+ +--------+ +-------+ +-------+
       A       DCI1      Agg1       Agg3       DCI3      Z
       |  DC1  | |   M1   | |    C   | |  M2   | |  DC2  |
       |       DCI2      Agg2       Agg4       DCI4      |
       +-------+ +--------+ +--------+ +-------+ +-------+

                   Figure 1: Reference Topology

   The following applies to the reference topology above:

      Independent ISIS-OSPF/SR instance in core (C) region.

      Independent ISIS-OSPF/SR instance in Metro1 (M1) region.

      Independent ISIS-OSPF/SR instance in Metro2 (M2) region.

      BGP/SR in DC1.

      BGP/SR in DC2.
      Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to M
      (M1 and M2) and from M to DC domains.

      No other route is advertised or redistributed between regions.

      The same homogeneous SRGB is used throughout the domains (e.g.,
      16000-23999).

      Unique SRGB sub-ranges are allocated to the metro (M) and core
      (C) domains:

         The 16000-16999 range is allocated to the core (C)
         domain/region.

         The 17000-17999 range is allocated to the M1 domain/region.

         The 18000-18999 range is allocated to the M2 domain/region.

         Specifically, router Agg3 has SID 16003 allocated, and the
         anycast SID for Agg3 and Agg4 is 16006.

         Specifically, router DCI3 has SID 18003 allocated, and the
         anycast SID for DCI3 and DCI4 is 18006.

      The same SRGB sub-range is reused within each DC (DC1 and DC2)
      region: e.g., 20000-23999.  Specifically, the range 20000-23999
      is used in both the DC1 and DC2 regions, and nodes A and Z both
      have SID 20001 allocated to them.

4.  Control Plane

   This section provides a high-level description of an implemented
   control plane.

   The mechanisms through which SRTE policies are defined, computed,
   and programmed in the source nodes are outside the scope of this
   document.

   Typically, a controller or a service orchestration system programs
   node A with a pseudowire (PW) to a remote next hop Z with a given
   SLA contract (e.g., a low-latency path, disjointness from a specific
   core plane, or disjointness from a different PW service).

   Node A automatically detects that it does not have reachability to
   Z.  It then automatically sends a PCEP request to an SR PCE for an
   SRTE policy that provides reachability to Z with the requested SLA.

   The SR PCE is made of two components: a multi-domain topology and a
   computation engine.  The multi-domain topology is continuously
   refreshed through BGP-LS feeds from each domain.  The computation
   engine implements Traffic Engineering (TE) algorithms designed
   specifically for SR path expression.  Upon receiving the PCEP
   request, the SR PCE computes the requested path.  The path is
   expressed through a list of segments (e.g., {16003, 18006, 20001})
   and provided to node A.

   The SR PCE logs the request as a stateful query and hence is capable
   of recomputing the path at each network topology change.

   Node A receives the PCEP reply with the path (expressed as a segment
   list).  Node A installs the received SRTE policy in the data plane.
   Node A then automatically steers the PW into that SRTE policy.

5.  Illustration of the scale

   According to the reference topology described in Figure 1, the
   following assumptions are made:

      There is 1 core domain, and there are 100 leaf (metro) domains.

      The core domain includes 200 nodes.

      Two nodes connect each leaf (metro) domain.  Each node connecting
      a leaf domain has a SID allocated.  Each pair of nodes connecting
      a leaf domain also has a common anycast SID.  This gives up to
      300 prefix segments in total.

      A core node connects only one leaf domain.

      Each leaf domain has 6,000 leaf-node segments.  Each leaf node
      has 500 endpoints attached, thus 500 adjacency segments.  In
      total, this gives 3 million endpoints for a leaf domain.

   Based on the above, the network scaling numbers are as follows:

      6,000 leaf-node segments multiplied by 100 leaf domains: 600,000
      nodes.

      600,000 nodes multiplied by 500 endpoints: 300 million endpoints.
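   The scaling arithmetic above can be reproduced with a few lines of
   code.  The following Python sketch is purely illustrative (the
   variable names are ours, not part of any specification); it simply
   restates the assumptions of this section and derives the totals:

      # Assumptions taken from Section 5 (illustrative values).
      leaf_domains = 100           # number of leaf (metro) domains
      nodes_per_leaf_pair = 2      # nodes connecting each leaf domain
      leaf_nodes_per_domain = 6000
      endpoints_per_leaf_node = 500

      # Prefix segments advertised by the domain-connecting nodes:
      # one node SID per node, plus one anycast SID per pair.
      core_prefix_segments = leaf_domains * (nodes_per_leaf_pair + 1)

      total_nodes = leaf_domains * leaf_nodes_per_domain
      total_endpoints = total_nodes * endpoints_per_leaf_node

      print(core_prefix_segments)  # 300
      print(total_nodes)           # 600000
      print(total_endpoints)       # 300000000 (300 million)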
   The node scaling numbers are as follows:

      Leaf-node segment scale: 6,000 leaf-node segments + 300 core-node
      segments + 500 adjacency segments = 6,800 segments.

      Core-node segment scale: 6,000 leaf-domain segments + 300 core-
      domain segments = 6,300 segments.

   In the above calculations, the link adjacency segments are not taken
   into account.  These are local segments, and there are typically
   fewer than 100 per node.

   Note that, depending on leaf-node FIB capabilities, leaf domains
   could be split into multiple smaller domains.  In the above example,
   each leaf domain could be split into 6 smaller domains, so that each
   leaf node only needs to learn 1,000 leaf-node segments + 300
   core-node segments + 500 adjacency segments, which gives a total of
   1,800 segments.

6.  Design Options

   This section describes a number of design options for the
   illustration in the previous section.

6.1.  SRGB Size

   In the simplified illustrations of this document, we picked a small
   homogeneous SRGB range of 16000-23999.  In practice, a large-scale
   design would use a bigger range, such as 16000-80000 or even larger.

6.2.  Redistribution of Agg nodes routes

   The operator might choose not to redistribute the Agg nodes routes
   into the Metro/DC domains.  In that case, more segments are required
   in order to express an inter-domain path.

   For example, node A would use an SRTE policy {DCI1, Agg1, Agg3,
   DCI3, Z} in order to reach Z, instead of {Agg3, DCI3, Z} in the
   reference design.

6.3.  Sizing and hierarchy

   The operator is free to choose among a small number of large leaf
   domains, a large number of small leaf domains, or a mix of small and
   large core/leaf domains.

   The operator is free to use a 2-tier design (Core/Metro) or a 3-tier
   design (Core/Metro/DC).

6.4.  Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g., node Z) in
   order to identify locally attached hosts (or VMs).  For example, if
   node Z has bound a local segment 40001 to a local host ZH1, then
   node A uses the following SRTE policy in order to reach that host:
   {16006, 18006, 20001, 40001}.  Such a local segment could represent
   the NID (Network Interface Device) in the context of the SP access
   network, or a VM in the context of the DC network.

6.5.  Compressed SRTE policies

   As an example, and according to Section 3, we assume that node A can
   reach node Z (e.g., with a low-latency SLA contract) via the SRTE
   policy consisting of the path Agg1, Agg2, Agg3, DCI3/4 (anycast), Z.
   The path is represented by the segment list {16001, 16002, 16003,
   18006, 20001}.

   The control-plane solution can install an SRTE policy {16002, 16003,
   18006} at Agg1, collect the Binding SID allocated by Agg1 to that
   policy (e.g., 4001), and hence program node A with the compressed
   SRTE policy {16001, 4001, 20001}.

   From node A, 16001 leads to Agg1.  Once at Agg1, 4001 leads to the
   DCI pair (DCI3, DCI4) via a specific low-latency path {16002, 16003,
   18006}.  Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE policies allow
   end-to-end SRTE policies to be compressed.

   The segment list {16001, 4001, 20001} expresses the same path as
   {16001, 16002, 16003, 18006, 20001}, but with two fewer segments.
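   The compression step can be sketched in a few lines of code.  The
   following Python fragment is purely illustrative (the compress()
   helper is hypothetical and not part of any protocol or API); it
   substitutes the Binding SID for the contiguous sub-list of segments
   covered by the intermediate policy:

      # Illustrative only: substitute a Binding SID for the sub-path
      # covered by an intermediate SRTE policy.
      def compress(segment_list, intermediate_policy, binding_sid):
          n = len(intermediate_policy)
          for i in range(len(segment_list) - n + 1):
              if segment_list[i:i + n] == intermediate_policy:
                  return (segment_list[:i] + [binding_sid]
                          + segment_list[i + n:])
          return segment_list  # no match; nothing to compress

      # End-to-end path from A to Z, and the policy installed at Agg1
      # (Binding SID 4001), using the values from the example above.
      full_path = [16001, 16002, 16003, 18006, 20001]
      agg1_policy = [16002, 16003, 18006]

      print(compress(full_path, agg1_policy, 4001))
      # [16001, 4001, 20001]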
   The Binding SID also provides inherent churn protection: when the
   core topology changes, the control plane can update the low-latency
   SRTE policy from Agg1 to the DCI pair of DC2 without updating the
   SRTE policy from A to Z.

7.  Deployment Model

   It is expected that this design will be deployed both as a
   greenfield design and in interworking (brownfield) scenarios with
   the Seamless MPLS design described in [I-D.ietf-mpls-seamless-mpls].

8.  Benefits

   The design options illustrated in this document allow
   interconnection on a very large scale.  Millions of endpoints across
   different domains can be interconnected.

8.1.  Simplified operations

   Two protocols have been removed from the network: LDP and RSVP-TE.
   No new protocol has been introduced.  The design leverages the core
   IP protocols (ISIS, OSPF, BGP, and PCEP) with straightforward SR
   extensions.

8.2.  Inter-domain SLA

   Fast reroute and resiliency are provided by TI-LFA, with sub-50-msec
   FRR upon link, node, or SRLG failure.  TI-LFA is described in
   [I-D.francois-rtgwg-segment-routing-ti-lfa].

   The use of anycast SIDs also provides improved availability and
   resiliency.

   Inter-domain SLAs can be delivered: e.g., latency- vs.
   cost-optimized paths, disjointness from backbone planes,
   disjointness from other services, and disjointness between primary
   and backup paths.

   Existing inter-domain solutions (e.g., Seamless MPLS) do not provide
   any support for SLA contracts; they only provide best-effort
   reachability across domains.

8.3.  Scale

   In addition to the elimination of two control-plane protocols,
   per-service midpoint states have also been removed from the network.

8.4.  ECMP

   Each policy (intra-domain or inter-domain, with or without TE) is
   expressed as a list of segments.  Since each segment is optimized
   for ECMP, the entire policy is optimized for ECMP.  The ECMP gain of
   anycast prefix segments should also be considered (e.g., 16001
   load-shares across any gateway from the M1 leaf domain to the core,
   and 16002 load-shares across any gateway from the core to the M1
   leaf domain).

9.  IANA Considerations

   This document does not make any IANA request.

10.  Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche, Steve
   Braaten, and Francis Ferguson for their contributions to the content
   of this document.

11.  Contributors

   The following people have substantially contributed to the editing
   of this document:

      Dennis Cai
      Individual

      Tim Laberge
      Individual

      Steven Lin
      Google Inc.

      Bruno Decraene
      Orange

      Luay Jalil
      Verizon

      Jeff Tantsura
      Individual

      Rob Shakir
      Google

12.  Informative References

   [I-D.francois-rtgwg-segment-routing-ti-lfa]
              Francois, P., Bashandy, A., Filsfils, C., Decraene, B.,
              and S. Litkowski, "Topology Independent Fast Reroute
              using Segment Routing", draft-francois-rtgwg-segment-
              routing-ti-lfa-04 (work in progress), December 2016.

   [I-D.ietf-mpls-seamless-mpls]
              Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz,
              M., and D. Steinberg, "Seamless MPLS Architecture",
              draft-ietf-mpls-seamless-mpls-07 (work in progress),
              June 2014.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   Belgium

   Email: cfilsfil@cisco.com

   Stefano Previdi
   Cisco Systems, Inc.
   Via Del Serafico, 200
   Rome  00142
   Italy

   Email: stefano@previdi.net

   Gaurav Dawra (editor)
   LinkedIn
   USA

   Email: gdawra.ietf@gmail.com

   Wim Henderickx
   Nokia
   Copernicuslaan 50
   Antwerp  2018
   Belgium

   Email: wim.henderickx@nokia.com

   Dave Cooper
   Level 3

   Email: Dave.Cooper@Level3.com