Network Working Group                                  C. Filsfils, Ed.
Internet-Draft                                               S. Previdi
Intended status: Informational                            G. Dawra, Ed.
Expires: May 25, 2018                               Cisco Systems, Inc.
                                                                 D. Cai
                                                             Individual
                                                          W. Henderickx
                                                                  Nokia
                                                              D. Cooper
                                                                Level 3
                                                             T. Laberge
                                                    Cisco Systems, Inc.
                                                                 S. Lin
                                                             Individual
                                                            B. Decraene
                                                                 Orange
                                                               L. Jalil
                                                                Verizon
                                                             J. Tantsura
                                                             Individual
                                                              R. Shakir
                                                                 Google
                                                      November 21, 2017


      Interconnecting Millions Of Endpoints With Segment Routing
           draft-filsfils-spring-large-scale-interconnect-07

Abstract

   This document describes an application of Segment Routing to scale
   the network to support hundreds of thousands of network nodes, and
   tens of millions of physical underlay endpoints.  This use-case can
   be applied to the interconnection of massive-scale DC's and/or large
   aggregation networks.  Forwarding tables of midpoint and leaf nodes
   only require a few tens of thousands of entries.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 25, 2018.

Copyright Notice

   Copyright (c) 2017 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Reference Design
   4.  Control Plane
   5.  Illustration of the Scale
   6.  Design Options
     6.1.  SRGB Size
     6.2.  Redistribution of Agg Nodes Routes
     6.3.  Sizing and Hierarchy
     6.4.  Local Segments to Hosts/Servers
     6.5.  Compressed SRTE Policies
   7.  Deployment Model
   8.  Benefits
     8.1.  Simplified Operations
     8.2.  Inter-Domain SLA
     8.3.  Scale
     8.4.  ECMP
   9.  IANA Considerations
   10. Manageability Considerations
   11. Security Considerations
   12. Acknowledgements
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Authors' Addresses

1.  Introduction

   This document describes how Segment Routing (SR) can be used to
   interconnect millions of endpoints.

2.  Terminology

   The following terms and abbreviations are used in this document:

      Term          Definition
      ---------------------------------------------------------
      Agg           Aggregation
      BGP           Border Gateway Protocol
      DC            Data Center
      DCI           Data Center Interconnect
      ECMP          Equal-Cost Multipath
      FIB           Forwarding Information Base
      LDP           Label Distribution Protocol
      LFIB          Label Forwarding Information Base
      MPLS          Multiprotocol Label Switching
      PCE           Path Computation Element
      PCEP          Path Computation Element Protocol
      PW            Pseudowire
      SLA           Service Level Agreement
      SR            Segment Routing
      SRTE Policy   Segment Routing Traffic Engineering Policy
      TE            Traffic Engineering
      TI-LFA        Topology Independent Loop-Free Alternate

3.  Reference Design

   The network diagram below describes the reference network topology
   used in this document:

      +-------+ +--------+ +--------+ +-------+ +-------+
      A       DCI1      Agg1       Agg3      DCI3       Z
      |  DC1  | |   M1   | |    C   | |  M2   | |  DC2  |
      |       DCI2      Agg2       Agg4      DCI4       |
      +-------+ +--------+ +--------+ +-------+ +-------+

                   Figure 1: Reference Topology

   The following applies to the reference topology above:

      Independent ISIS-OSPF/SR instance in core (C) region.

      Independent ISIS-OSPF/SR instance in Metro1 (M1) region.

      Independent ISIS-OSPF/SR instance in Metro2 (M2) region.

      BGP/SR in DC1.

      BGP/SR in DC2.

      Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to M
      (M1 and M2) and from M to DC domains.

      No other route is advertised or redistributed between regions.

      The same homogeneous SRGB is used throughout the domains (e.g.,
      16000-23999).

      Unique SRGB sub-ranges are allocated to each metro (M) and core
      (C) domain:

         The 16000-16999 range is allocated to the core (C)
         domain/region.

         The 17000-17999 range is allocated to the M1 domain/region.

         The 18000-18999 range is allocated to the M2 domain/region.

         Specifically, the Agg3 router has SID 16003 allocated, and the
         anycast SID for Agg3 and Agg4 is 16006.

         Specifically, the DCI3 router has SID 18003 allocated, and the
         anycast SID for DCI3 and DCI4 is 18006.

      The same SRGB sub-range is re-used within each DC (DC1 and DC2)
      region: e.g., 20000-23999.  Specifically, the range 20000-23999
      is used in both the DC1 and DC2 regions, and nodes A and Z both
      have SID 20001 allocated to them.
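   The SRGB sub-range allocation above lends itself to a very simple
   model.  The following Python sketch is purely illustrative: the
   SUB_RANGES dictionary and the region_of() function are assumptions
   of this illustration, not part of any implementation.  It captures
   the sub-ranges and the example SIDs listed in this section and
   resolves the region that owns a given SID value.

      # Illustrative sketch of the homogeneous SRGB (16000-23999)
      # and the per-region sub-ranges of the reference design.
      SRGB = range(16000, 24000)         # same SRGB in every domain

      SUB_RANGES = {
          "Core": range(16000, 17000),   # C region
          "M1":   range(17000, 18000),   # Metro1 region
          "M2":   range(18000, 19000),   # Metro2 region
          "DC":   range(20000, 24000),   # re-used in DC1 and DC2
      }

      EXAMPLE_SIDS = {
          "Agg3":       16003,           # node SID in the core
          "Agg3/Agg4":  16006,           # anycast SID in the core
          "DCI3":       18003,           # node SID in M2
          "DCI3/DCI4":  18006,           # anycast SID in M2
          "A and Z":    20001,           # same SID re-used in DC1/DC2
      }

      def region_of(sid):
          """Return the region whose sub-range contains the SID."""
          for region, sub_range in SUB_RANGES.items():
              if sid in sub_range:
                  return region
          raise ValueError("SID %d is outside every sub-range" % sid)

      for name, sid in EXAMPLE_SIDS.items():
          print("%-10s SID %5d -> %s sub-range"
                % (name, sid, region_of(sid)))

   Because the same 20000-23999 sub-range is re-used in DC1 and DC2, a
   SID such as 20001 is only unambiguous within the context of the
   domain through which it is reached.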
4.  Control Plane

   This section provides a high-level description of an implemented
   control plane.

   The mechanisms through which SRTE Policies are defined, computed,
   and programmed in the source nodes are outside the scope of this
   document.

   Typically, a controller or a service orchestration system programs
   node A with a pseudowire (PW) to a remote next-hop Z with a given
   SLA contract (e.g., a low-latency path, disjointness from a specific
   core plane, disjointness from a different PW service).

   Node A automatically detects that it does not have reachability to
   Z.  It then automatically sends a PCEP request to an SR PCE for an
   SRTE policy that provides reachability to Z with the requested SLA.

   The SR PCE is made of two components: a multi-domain topology and a
   computation engine.  The multi-domain topology is continuously
   refreshed through BGP-LS feeds from each domain.  The computation
   engine implements Traffic Engineering (TE) algorithms designed
   specifically for SR path expression.  Upon receiving the PCEP
   request, the SR PCE computes the requested path.  The path is
   expressed through a list of segments (e.g., {16003, 16005, 18001})
   and provided to node A.

   The SR PCE logs the request as a stateful query and hence is capable
   of recomputing the path at each network topology change.

   Node A receives the PCEP reply with the path (expressed as a segment
   list).  Node A installs the received SRTE policy in the dataplane.
   Node A then automatically steers the PW into that SRTE policy.
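   The workflow above can be summarized with a short sketch.  The
   Python snippet below is purely illustrative: the class and method
   names (SrPce, HeadEnd, request, on_topology_change) are assumptions
   of this illustration, PCEP and BGP-LS message handling are omitted,
   and the computation simply returns the example segment list
   {16003, 16005, 18001} used above.

      class SrPce:
          """Toy SR PCE: multi-domain topology plus compute engine."""

          def __init__(self):
              self.topology = {}          # refreshed via BGP-LS (not shown)
              self.stateful_queries = []  # logged requests

          def request(self, src, dst, sla):
              """Handle a request: log it and return a segment list."""
              self.stateful_queries.append((src, dst, sla))
              return self._compute(src, dst, sla)

          def _compute(self, src, dst, sla):
              # A real engine runs SR-specific TE algorithms over the
              # multi-domain topology; this sketch returns the example.
              return [16003, 16005, 18001]

          def on_topology_change(self):
              """Recompute every logged query (updates not shown)."""
              return {q: self._compute(*q) for q in self.stateful_queries}

      class HeadEnd:
          """Toy head-end node (node A in the example)."""

          def __init__(self, name):
              self.name = name
              self.srte_policies = {}     # installed in the dataplane

          def reach(self, pce, dst, sla):
              segment_list = pce.request(self.name, dst, sla)
              self.srte_policies[(dst, sla)] = segment_list
              return segment_list         # the PW is steered into it

      pce = SrPce()
      node_a = HeadEnd("A")
      print(node_a.reach(pce, "Z", "low-latency"))  # [16003, 16005, 18001]

   The essential properties captured here are that the PCE keeps the
   request as a stateful query (so paths can be recomputed on topology
   change) and that node A installs the returned segment list as an
   SRTE policy and steers the PW into it.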
5.  Illustration of the Scale

   According to the reference topology shown in Figure 1, the following
   assumptions are made:

      There is 1 core domain and 100 leaf (metro) domains.

      The core domain includes 200 nodes.

      Two nodes connect each leaf (metro) domain.  Each node connecting
      a leaf domain has a SID allocated.  Each pair of nodes connecting
      a leaf domain also has a common anycast SID.  This brings the
      total to 300 prefix segments.

      A core node connects only one leaf domain.

      Each leaf domain has 6,000 leaf-node segments.  Each leaf node
      has 500 endpoints attached, thus 500 adjacency segments.  In
      total, this represents 3 million endpoints for a leaf domain.

   Based on the above, the network scaling numbers are as follows:

      6,000 leaf-node segments multiplied by 100 leaf domains: 600,000
      nodes.

      600,000 nodes multiplied by 500 endpoints: 300 million endpoints.

   The node scaling numbers are as follows:

      Leaf-node segment scale: 6,000 leaf-node segments + 300 core-node
      segments + 500 adjacency segments = 6,800 segments.

      Core-node segment scale: 6,000 leaf-domain segments + 300 core-
      domain segments = 6,300 segments.

   In the above calculations, the link adjacency segments are not taken
   into account.  These are local segments and, typically, fewer than
   100 per node.

   It has to be noted that, depending on leaf-node FIB capabilities,
   leaf domains could be split into multiple smaller domains.  In the
   above example, the leaf domains could be split into 6 smaller
   domains so that each leaf node only needs to learn 1,000 leaf-node
   segments + 300 core-node segments + 500 adjacency segments, which
   gives a total of 1,800 segments.
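   The scaling numbers above follow directly from the stated
   assumptions.  The short Python computation below reproduces them;
   the variable names are chosen for this illustration only.

      leaf_domains            = 100
      leaf_nodes_per_domain   = 6000
      endpoints_per_leaf_node = 500
      core_prefix_segments    = 300  # 100 x (2 node SIDs + 1 anycast SID)

      # Network-wide scale
      total_leaf_nodes = leaf_nodes_per_domain * leaf_domains       # 600,000
      total_endpoints  = total_leaf_nodes * endpoints_per_leaf_node # 300,000,000

      # Per-node segment scale (link adjacency segments not counted)
      leaf_node_scale = (leaf_nodes_per_domain + core_prefix_segments
                         + endpoints_per_leaf_node)                 # 6,800
      core_node_scale = leaf_nodes_per_domain + core_prefix_segments  # 6,300

      # Variant: each leaf domain split into 6 smaller domains
      split_leaf_node_scale = (1000 + core_prefix_segments
                               + endpoints_per_leaf_node)           # 1,800

      print(total_leaf_nodes, total_endpoints,
            leaf_node_scale, core_node_scale, split_leaf_node_scale)

   Running the computation yields 600,000 leaf nodes, 300,000,000
   endpoints, 6,800 segments per leaf node, 6,300 segments per core
   node, and 1,800 segments per leaf node in the split-domain variant.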
6.  Design Options

   This section describes multiple design options related to the
   illustration in the previous section.

6.1.  SRGB Size

   In the simplified illustrations of this document, we picked a small
   homogeneous SRGB range of 16000-23999.  In practice, a large-scale
   design would use a bigger range, such as 16000-80000 or even larger.

6.2.  Redistribution of Agg Nodes Routes

   The operator might choose not to redistribute the Agg node routes
   into the Metro/DC domains.  In that case, more segments are required
   in order to express an inter-domain path.

   For example, node A would use an SRTE Policy {DCI1, Agg1, Agg3,
   DCI3, Z} in order to reach Z instead of {Agg3, DCI3, Z} in the
   reference design.

6.3.  Sizing and Hierarchy

   The operator is free to choose among a small number of larger leaf
   domains, a large number of small leaf domains, or a mix of small and
   large core/leaf domains.

   The operator is free to use a 2-tier design (Core/Metro) or a 3-tier
   design (Core/Metro/DC).

6.4.  Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g., node Z) in
   order to identify locally attached hosts (or VMs).  For example, if
   node Z has bound a local segment 40001 to a local host ZH1, then
   node A uses the following SRTE Policy in order to reach that host:
   {16006, 17006, 20001, 40001}.  Such a local segment could represent
   the NID (Network Interface Device) in the context of the SP access
   network, or a VM in the context of the DC network.

6.5.  Compressed SRTE Policies

   As an example and according to Section 3, we assume that node A can
   reach node Z (e.g., with a low-latency SLA contract) via the SRTE
   policy consisting of the path Agg1, Agg2, Agg3, DCI3/4 (anycast), Z.
   The path is represented by the segment list {16001, 16002, 16003,
   18006, 20001}.

   It is clear that the control-plane solution can install an SRTE
   Policy {16002, 16003, 18006} at Agg1, collect the Binding SID
   allocated by Agg1 to that policy (e.g., 4001), and hence program
   node A with the compressed SRTE Policy {16001, 4001, 20001}.

   From node A, 16001 leads to Agg1.  Once at Agg1, 4001 leads to the
   DCI pair (DCI3, DCI4) via a specific low-latency path {16002, 16003,
   18006}.  Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE Policies allow
   end-to-end SRTE Policies to be compressed.

   The segment list {16001, 4001, 20001} expresses the same path as
   {16001, 16002, 16003, 18006, 20001} but with two fewer segments.

   The Binding SID also provides inherent churn protection.

   When the core topology changes, the control plane can update the
   low-latency SRTE Policy from Agg1 to the DCI pair of DC2 without
   updating the SRTE Policy from A to Z.
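   The compression described in this section amounts to simple list
   bookkeeping, sketched below.  The binding SID value 4001 is the
   example value above; the variable names are assumptions of this
   illustration.

      # Sketch of the binding-SID compression of Section 6.5.
      full_path   = [16001, 16002, 16003, 18006, 20001]  # A -> Z

      agg1_policy = [16002, 16003, 18006]  # intermediate policy at Agg1
      binding_sid = 4001                   # binding SID allocated by Agg1

      # Compressed policy programmed at node A.
      compressed  = [16001, binding_sid, 20001]

      # Expanding the binding SID at Agg1 yields the original path,
      # with two fewer segments carried by node A.
      expanded = compressed[:1] + agg1_policy + compressed[2:]
      assert expanded == full_path
      print(compressed, "->", expanded)

   The same structure illustrates the churn protection: when the core
   topology changes, only the intermediate policy at Agg1 needs to be
   updated, while the compressed list programmed at node A stays
   unchanged.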
7.  Deployment Model

   It is expected that this design will be deployed both in greenfield
   scenarios and in interworking (brownfield) scenarios with the
   Seamless MPLS design described in [I-D.ietf-mpls-seamless-mpls].

8.  Benefits

   The design options illustrated in this document allow
   interconnection on a very large scale.  Millions of endpoints across
   different domains can be interconnected.

8.1.  Simplified Operations

   Two protocols have been removed from the network: LDP and RSVP-TE.
   No new protocol has been introduced.  The design leverages the core
   IP protocols ISIS, OSPF, BGP, and PCEP with straightforward SR
   extensions.

8.2.  Inter-Domain SLA

   Fast reroute and resiliency are provided by TI-LFA with sub-50 msec
   FRR upon link, node, or SRLG failure.  TI-LFA is described in
   [I-D.francois-rtgwg-segment-routing-ti-lfa].

   The use of anycast SIDs also provides an improvement in availability
   and resiliency.

   Inter-domain SLAs can be delivered, e.g., latency vs. cost-optimized
   paths, disjointness from backbone planes, disjointness from other
   services, and disjointness between primary and backup paths.

   Existing inter-domain solutions (Seamless MPLS) do not provide any
   support for SLA contracts.  They just provide best-effort
   reachability across domains.

8.3.  Scale

   In addition to having eliminated two control-plane protocols,
   per-service midpoint states have also been removed from the network.

8.4.  ECMP

   Each policy (intra-domain or inter-domain, with or without TE) is
   expressed as a list of segments.  Since each segment is optimized
   for ECMP, the entire policy is optimized for ECMP.  The ECMP gain of
   anycast prefix segments should also be considered (e.g., 16001
   load-shares across any gateway from the M1 leaf domain to the Core,
   and 16002 load-shares across any gateway from the Core to the M1
   leaf domain).

9.  IANA Considerations

   TBD

10.  Manageability Considerations

   TBD

11.  Security Considerations

   TBD

12.  Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche, Steve
   Braaten, and Francis Ferguson for their contribution to the content
   of this document.

13.  References

13.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

13.2.  Informative References

   [I-D.francois-rtgwg-segment-routing-ti-lfa]
              Francois, P., Bashandy, A., Filsfils, C., Decraene, B.,
              and S. Litkowski, "Topology Independent Fast Reroute
              using Segment Routing", draft-francois-rtgwg-segment-
              routing-ti-lfa-04 (work in progress), December 2016.

   [I-D.ietf-mpls-seamless-mpls]
              Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz,
              M., and D. Steinberg, "Seamless MPLS Architecture",
              draft-ietf-mpls-seamless-mpls-07 (work in progress),
              June 2014.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   Belgium

   Email: cfilsfil@cisco.com


   Stefano Previdi
   Cisco Systems, Inc.
   Via Del Serafico, 200
   Rome  00142
   Italy

   Email: stefano@previdi.net


   Gaurav Dawra (editor)
   Cisco Systems, Inc.
   USA

   Email: gdawra@cisco.com


   Dennis Cai
   Individual


   Wim Henderickx
   Nokia
   Copernicuslaan 50
   Antwerp  2018
   Belgium

   Email: wim.henderickx@nokia.com


   Dave Cooper
   Level 3

   Email: Dave.Cooper@Level3.com


   Tim Laberge
   Cisco Systems, Inc.

   Email: tlaberge@cisco.com


   Steven Lin
   Individual

   Email: slin100@yahoo.com


   Bruno Decraene
   Orange
   FR

   Email: bruno.decraene@orange.com


   Luay Jalil
   Verizon
   400 International Pkwy
   Richardson, TX  75081
   United States

   Email: luay.jalil@verizon.com


   Jeff Tantsura
   Individual

   Email: jefftant@gmail.com


   Rob Shakir
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, CA  94043

   Email: robjs@google.com