Network Working Group                                  C. Filsfils, Ed.
Internet-Draft                                               S. Previdi
Intended status: Informational                            G. Dawra, Ed.
Expires: June 24, 2018                              Cisco Systems, Inc.
                                                                 D. Cai
                                                             Individual
                                                          W. Henderickx
                                                                  Nokia
                                                              D. Cooper
                                                                Level 3
                                                             T. Laberge
                                                                 S. Lin
                                                             Individual
                                                            B. Decraene
                                                                 Orange
                                                               L. Jalil
                                                                Verizon
                                                             J. Tantsura
                                                             Individual
                                                              R. Shakir
                                                                 Google
                                                      December 21, 2017

       Interconnecting Millions Of Endpoints With Segment Routing
             draft-filsfils-spring-large-scale-interconnect-08

Abstract

   This document describes an application of Segment Routing (SR) to
   scale the network to support hundreds of thousands of network nodes
   and tens of millions of physical underlay endpoints.  This use case
   can be applied to the interconnection of massive-scale Data Centers
   (DCs) and/or large aggregation networks.  Forwarding tables of
   midpoint and leaf nodes only require a few tens of thousands of
   entries.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 24, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Reference Design
   4.  Control Plane
   5.  Illustration of the scale
   6.  Design Options
     6.1.  SRGB Size
     6.2.  Redistribution of Agg node routes
     6.3.  Sizing and hierarchy
     6.4.  Local Segments to Hosts/Servers
     6.5.  Compressed SRTE policies
   7.  Deployment Model
   8.  Benefits
     8.1.  Simplified operations
     8.2.  Inter-domain SLA
     8.3.  Scale
     8.4.  ECMP
   9.  IANA Considerations
   10. Manageability Considerations
   11. Security Considerations
   12. Acknowledgements
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Authors' Addresses

1.  Introduction

   This document describes how Segment Routing (SR) can be used to
   interconnect millions of endpoints.

2.  Terminology

   The following terms and abbreviations are used in this document:

      Term          Definition
      ---------------------------------------------------------
      Agg           Aggregation
      BGP           Border Gateway Protocol
      DC            Data Center
      DCI           Data Center Interconnect
      ECMP          Equal-Cost Multipathing
      FIB           Forwarding Information Base
      LDP           Label Distribution Protocol
      LFIB          Label Forwarding Information Base
      MPLS          Multiprotocol Label Switching
      PCE           Path Computation Element
      PCEP          Path Computation Element Protocol
      PW            Pseudowire
      SLA           Service Level Agreement
      SR            Segment Routing
      SRTE Policy   Segment Routing Traffic Engineering Policy
      TE            Traffic Engineering
      TI-LFA        Topology Independent Loop-Free Alternate

3.  Reference Design

   The network diagram below describes the reference network topology
   used in this document:

   +-------+ +--------+ +--------+ +-------+ +-------+
      A   DCI1       Agg1       Agg3      DCI3   Z
   |  DC1  | |   M1   | |    C   | |  M2   | |  DC2  |
   |      DCI2       Agg2       Agg4      DCI4       |
   +-------+ +--------+ +--------+ +-------+ +-------+

                  Figure 1: Reference Topology

   The following applies to the reference topology above:

   Independent ISIS-OSPF/SR instance in core (C) region.

   Independent ISIS-OSPF/SR instance in Metro1 (M1) region.

   Independent ISIS-OSPF/SR instance in Metro2 (M2) region.

   BGP/SR in DC1.

   BGP/SR in DC2.

   Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to M
   (M1 and M2) and from M to DC domains.

   No other route is advertised or redistributed between regions.

   The same homogeneous SRGB is used throughout the domains (e.g.,
   16000-23999).

   Unique SRGB sub-ranges are allocated to each metro (M) and core (C)
   domain:

      The 16000-16999 range is allocated to the core (C)
      domain/region.

      The 17000-17999 range is allocated to the M1 domain/region.

      The 18000-18999 range is allocated to the M2 domain/region.

      Specifically, the Agg3 router has SID 16003 allocated, and the
      anycast SID for Agg3 and Agg4 is 16006.

      Specifically, the DCI3 router has SID 18003 allocated, and the
      anycast SID for DCI3 and DCI4 is 18006.

   The same SRGB sub-range is reused within each DC (DC1 and DC2)
   region: e.g., 20000-23999.  Specifically, the 20000-23999 range is
   used in both DC1 and DC2 regions, and both nodes A and Z have SID
   20001 allocated to them.
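   As a worked example, the following Python sketch (illustrative
   only; it is not part of the design, and the dictionary layout is
   simply one possible representation) restates the SRGB sub-range
   allocation above and checks that the sub-ranges are disjoint and
   contained in the homogeneous SRGB:

      # Illustrative restatement of the SRGB sub-range allocation.
      import itertools

      SRGB = range(16000, 24000)      # homogeneous SRGB (16000-23999)

      SUBRANGES = {
          "C":  range(16000, 17000),  # core domain
          "M1": range(17000, 18000),  # metro domain M1
          "M2": range(18000, 19000),  # metro domain M2
          "DC": range(20000, 24000),  # reused in both DC1 and DC2
      }

      # Every sub-range falls within the homogeneous SRGB.
      assert all(r.start >= SRGB.start and r.stop <= SRGB.stop
                 for r in SUBRANGES.values())

      # Sub-ranges do not overlap (the DC range is a single entry
      # because it is deliberately reused in DC1 and DC2).
      assert all(set(a).isdisjoint(b) for a, b in
                 itertools.combinations(SUBRANGES.values(), 2))

      def domains_of(sid):
          """Return the domain(s) whose sub-range contains a SID."""
          return [d for d, r in SUBRANGES.items() if sid in r]

      assert domains_of(16003) == ["C"]   # Agg3 node SID
      assert domains_of(18006) == ["M2"]  # DCI3/DCI4 anycast SID
      assert domains_of(20001) == ["DC"]  # node A (DC1), node Z (DC2)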
4.  Control Plane

   This section provides a high-level description of an implemented
   control plane.

   The mechanisms through which SRTE Policies are defined, computed,
   and programmed in the source nodes are outside the scope of this
   document.

   Typically, a controller or a service orchestration system programs
   node A with a pseudowire (PW) to a remote next hop Z with a given
   SLA contract (e.g., a low-latency path, disjointness from a
   specific core plane, or disjointness from a different PW service).

   Node A automatically detects that it does not have reachability to
   Z.  It then automatically sends a PCEP request to an SR PCE for an
   SRTE policy that provides reachability to Z with the requested SLA.

   The SR PCE is made of two components: a multi-domain topology and a
   computation engine.  The multi-domain topology is continuously
   refreshed through BGP-LS feeds from each domain.  The computation
   engine implements Traffic Engineering (TE) algorithms designed
   specifically for SR path expression.  Upon receiving the PCEP
   request, the SR PCE computes the requested path.  The path is
   expressed as a list of segments (e.g., {16003, 16005, 18001}) and
   provided to node A.

   The SR PCE logs the request as a stateful query and hence is
   capable of recomputing the path at each network topology change.

   Node A receives the PCEP reply with the path (expressed as a
   segment list).  Node A installs the received SRTE policy in the
   data plane.  Node A then automatically steers the PW into that SRTE
   policy.
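   The stateful behavior described above can be modeled with a short
   sketch.  This is a minimal illustration only: the class and method
   names are invented for this sketch, the canned segment list mirrors
   the example above, and no actual PCEP or BGP-LS implementation is
   implied.

      # Minimal model of a stateful SR PCE: compute on request, log
      # the query, and recompute every logged query on topology change.

      class StatefulSRPCE:
          def __init__(self, topology):
              self.topology = topology   # refreshed via BGP-LS feeds
              self.queries = []          # logged stateful queries

          def request_path(self, src, dst, sla):
              """Handle a PCEP-like request from a source node."""
              self.queries.append((src, dst, sla))
              return self.compute(src, dst, sla)

          def compute(self, src, dst, sla):
              # The TE algorithms for SR path expression are out of
              # scope here; return the canned example segment list.
              return [16003, 16005, 18001]

          def on_topology_change(self, new_topology):
              """Recompute each logged query at a topology change."""
              self.topology = new_topology
              return [(q, self.compute(*q)) for q in self.queries]

      pce = StatefulSRPCE(topology={})
      assert pce.request_path("A", "Z", "low-latency") == \
             [16003, 16005, 18001]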
5.  Illustration of the scale

   According to the reference topology described in Figure 1, the
   following assumptions are made:

   There is 1 core domain, and there are 100 leaf (metro) domains.

   The core domain includes 200 nodes.

   Two nodes connect each leaf (metro) domain.  Each node connecting a
   leaf domain has a SID allocated.  Each pair of nodes connecting a
   leaf domain also has a common anycast SID.  This brings the total
   number of prefix segments to 300.

   A core node connects only one leaf domain.

   Each leaf domain has 6,000 leaf node segments.  Each leaf node has
   500 endpoints attached, thus 500 adjacency segments.  In total,
   this represents 3 million endpoints per leaf domain.

   Based on the above, the network scaling numbers are as follows:

   6,000 leaf node segments multiplied by 100 leaf domains: 600,000
   nodes.

   600,000 nodes multiplied by 500 endpoints: 300 million endpoints.

   The node scaling numbers are as follows:

   Leaf node segment scale: 6,000 leaf node segments + 300 core node
   segments + 500 adjacency segments = 6,800 segments.

   Core node segment scale: 6,000 leaf domain segments + 300 core
   domain segments = 6,300 segments.

   In the above calculations, the link adjacency segments are not
   taken into account.  These are local segments and are typically
   fewer than 100 per node.

   It has to be noted that, depending on leaf node FIB capabilities,
   leaf domains could be split into multiple smaller domains.  In the
   above example, the leaf domains could be split into 6 smaller
   domains so that each leaf node only needs to learn 1,000 leaf node
   segments + 300 core node segments + 500 adjacency segments, which
   gives a total of 1,800 segments.
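   The scaling numbers above can be verified with a few lines of
   Python (illustrative only; the constants simply restate the
   assumptions of this section):

      LEAF_DOMAINS       = 100
      LEAF_NODE_SEGMENTS = 6000  # leaf node segments per leaf domain
      ENDPOINTS_PER_NODE = 500   # also 500 adjacency segments per node
      CORE_NODE_SEGMENTS = 300   # 200 node SIDs + 100 anycast SIDs

      nodes     = LEAF_NODE_SEGMENTS * LEAF_DOMAINS
      endpoints = nodes * ENDPOINTS_PER_NODE
      assert nodes == 600_000
      assert endpoints == 300_000_000

      leaf_scale = (LEAF_NODE_SEGMENTS + CORE_NODE_SEGMENTS
                    + ENDPOINTS_PER_NODE)
      core_scale = LEAF_NODE_SEGMENTS + CORE_NODE_SEGMENTS
      assert leaf_scale == 6_800   # segments per leaf node
      assert core_scale == 6_300   # segments per core node

      # Splitting each leaf domain into 6 smaller domains:
      split_scale = (LEAF_NODE_SEGMENTS // 6 + CORE_NODE_SEGMENTS
                     + ENDPOINTS_PER_NODE)
      assert split_scale == 1_800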
6.  Design Options

   This section describes a number of design options for the
   illustration in the previous section.

6.1.  SRGB Size

   In the simplified illustrations of this document, we picked a small
   homogeneous SRGB range of 16000-23999.  In practice, a large-scale
   design would use a bigger range, such as 16000-80000 or even
   larger.

6.2.  Redistribution of Agg node routes

   The operator might choose not to redistribute the Agg node routes
   into the Metro/DC domains.  In that case, more segments are
   required in order to express an inter-domain path.

   For example, node A would use an SRTE Policy {DCI1, Agg1, Agg3,
   DCI3, Z} in order to reach Z, instead of {Agg3, DCI3, Z} in the
   reference design.

6.3.  Sizing and hierarchy

   The operator is free to choose among a small number of large leaf
   domains, a large number of small leaf domains, or a mix of small
   and large core/leaf domains.

   The operator is free to use a 2-tier design (Core/Metro) or a
   3-tier design (Core/Metro/DC).

6.4.  Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g., node Z) in
   order to identify locally attached hosts (or VMs).  For example, if
   node Z has bound a local segment 40001 to a local host ZH1, then
   node A uses the following SRTE Policy in order to reach that host:
   {16006, 18006, 20001, 40001}.  Such a local segment could represent
   the Network Interface Device (NID) in the context of the SP access
   network, or a VM in the context of the DC network.

6.5.  Compressed SRTE policies

   As an example, and according to Section 3, we assume that node A
   can reach node Z (e.g., with a low-latency SLA contract) via the
   SRTE policy consisting of the path Agg1, Agg2, Agg3, DCI3/DCI4
   (anycast), Z.  The path is represented by the segment list {16001,
   16002, 16003, 18006, 20001}.

   It is clear that the control-plane solution can install an SRTE
   Policy {16002, 16003, 18006} at Agg1, collect the Binding SID
   allocated by Agg1 to that policy (e.g., 4001), and hence program
   node A with the compressed SRTE Policy {16001, 4001, 20001}.

   From node A, 16001 leads to Agg1.  Once at Agg1, 4001 leads to the
   DCI pair (DCI3, DCI4) via a specific low-latency path {16002,
   16003, 18006}.  Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE Policies enable the
   compression of end-to-end SRTE Policies.

   The segment list {16001, 4001, 20001} expresses the same path as
   {16001, 16002, 16003, 18006, 20001} but with two fewer segments.

   The Binding SID also provides inherent churn protection.

   When the core topology changes, the control plane can update the
   low-latency SRTE Policy from Agg1 to the DCI pair serving DC2
   without updating the SRTE Policy from A to Z.
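   The compression step above can be sketched as follows (illustrative
   only; the helper name is invented, and the values restate the
   example of this section):

      # End-to-end path from A to Z (low-latency example).
      full_path = [16001, 16002, 16003, 18006, 20001]

      def install_policy(node, segments, binding_sid):
          """Install an SRTE policy at 'node'; traffic arriving with
          the Binding SID on top is steered into 'segments'."""
          return {"node": node, "segments": segments,
                  "bsid": binding_sid}

      # Install {16002, 16003, 18006} at Agg1; Agg1 allocates
      # Binding SID 4001 for that policy.
      mid = install_policy("Agg1", full_path[1:4], 4001)

      # Node A is then programmed with the compressed policy.
      compressed = [full_path[0], mid["bsid"], full_path[-1]]
      assert compressed == [16001, 4001, 20001]
      assert len(full_path) - len(compressed) == 2  # two fewer segments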
7.  Deployment Model

   It is expected that this design will be deployed both as a
   greenfield design and in interworking (brownfield) with the
   Seamless MPLS design as described in [I-D.ietf-mpls-seamless-mpls].

8.  Benefits

   The design options illustrated in this document allow
   interconnection on a very large scale.  Millions of endpoints
   across different domains can be interconnected.

8.1.  Simplified operations

   Two protocols have been removed from the network: LDP and RSVP-TE.
   No new protocol has been introduced.  The design leverages the core
   IP protocols ISIS, OSPF, BGP, and PCEP with straightforward SR
   extensions.

8.2.  Inter-domain SLA

   Fast reroute and resiliency are provided by TI-LFA with sub-50 msec
   FRR upon link, node, or SRLG failure.  TI-LFA is described in
   [I-D.francois-rtgwg-segment-routing-ti-lfa].

   The use of anycast SIDs also provides improved availability and
   resiliency.

   Inter-domain SLAs can be delivered, e.g., latency- vs.
   cost-optimized paths, disjointness from backbone planes,
   disjointness from other services, and disjointness between primary
   and backup paths.

   Existing inter-domain solutions (Seamless MPLS) do not provide any
   support for SLA contracts.  They only provide best-effort
   reachability across domains.

8.3.  Scale

   In addition to the elimination of two control-plane protocols,
   per-service midpoint states have also been removed from the
   network.

8.4.  ECMP

   Each policy (intra-domain or inter-domain, with or without TE) is
   expressed as a list of segments.  Since each segment is optimized
   for ECMP, the entire policy is optimized for ECMP.  The ECMP gain
   of anycast prefix segments should also be considered (e.g., 16001
   load-shares across any gateway from the M1 leaf domain to the Core
   domain, and 16002 load-shares across any gateway from the Core
   domain to the M1 leaf domain).

9.  IANA Considerations

   TBD

10.  Manageability Considerations

   TBD

11.  Security Considerations

   TBD

12.  Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche, Steve
   Braaten, and Francis Ferguson for their contributions to the
   content of this document.

13.  References

13.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

13.2.  Informative References

   [I-D.francois-rtgwg-segment-routing-ti-lfa]
              Francois, P., Bashandy, A., Filsfils, C., Decraene, B.,
              and S. Litkowski, "Topology Independent Fast Reroute
              using Segment Routing", draft-francois-rtgwg-segment-
              routing-ti-lfa-04 (work in progress), December 2016.

   [I-D.ietf-mpls-seamless-mpls]
              Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz,
              M., and D. Steinberg, "Seamless MPLS Architecture",
              draft-ietf-mpls-seamless-mpls-07 (work in progress),
              June 2014.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   Belgium

   Email: cfilsfil@cisco.com

   Stefano Previdi
   Cisco Systems, Inc.
   Via Del Serafico, 200
   Rome  00142
   Italy

   Email: stefano@previdi.net

   Gaurav Dawra (editor)
   Cisco Systems, Inc.
   USA

   Email: gdawra.ietf@gmail.com

   Dennis Cai
   Individual

   Wim Henderickx
   Nokia
   Copernicuslaan 50
   Antwerp  2018
   Belgium

   Email: wim.henderickx@nokia.com

   Dave Cooper
   Level 3

   Email: Dave.Cooper@Level3.com

   Tim Laberge
   Individual

   Steven Lin
   Individual

   Email: slin100@yahoo.com

   Bruno Decraene
   Orange
   FR

   Email: bruno.decraene@orange.com

   Luay Jalil
   Verizon
   400 International Pkwy
   Richardson, TX  75081
   United States

   Email: luay.jalil@verizon.com

   Jeff Tantsura
   Individual

   Email: jefftant@gmail.com

   Rob Shakir
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, CA  94043

   Email: robjs@google.com