Network Working Group                                  C. Filsfils, Ed.
Internet-Draft                                                    D. Cai
Intended status: Informational                          S. Previdi, Ed.
Expires: June 10, 2017                              Cisco Systems, Inc.
                                                           W. Henderickx
                                                                   Nokia
                                                               D. Cooper
                                                                 Level 3
                                                              T. Laberge
                                                     Cisco Systems, Inc.
                                                                  S. Lin
                                                              Individual
                                                             B. Decraene
                                                                  Orange
                                                                L. Jalil
                                                                 Verizon
                                                             J. Tantsura
                                                              Individual
                                                               R. Shakir
                                                                  Google
                                                        December 7, 2016


      Interconnecting Millions of Endpoints with Segment Routing
           draft-filsfils-spring-large-scale-interconnect-05

Abstract

   This document describes an application of Segment Routing to scale
   the network to support hundreds of thousands of network nodes and
   tens of millions of physical underlay endpoints.  This use case can
   be applied to the interconnection of massive-scale Data Centers
   (DCs) and/or large aggregation networks.  Forwarding tables of
   midpoint and leaf nodes only require a few tens of thousands of
   entries.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 10, 2017.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.
   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Reference Design
   4.  Control Plane
   5.  Illustration of the scale
   6.  Design Options
       6.1.  SRGB Size
       6.2.  Redistribution of Agg nodes routes
       6.3.  Sizing and hierarchy
       6.4.  Local Segments to Hosts/Servers
       6.5.  Compressed SRTE policies
   7.  Deployment Model
   8.  Benefits
       8.1.  Simplified operations
       8.2.  Inter-domain SLA
       8.3.  Scale
       8.4.  ECMP
   9.  IANA Considerations
   10. Manageability Considerations
   11. Security Considerations
   12. Acknowledgements
   13. References
       13.1.  Normative References
       13.2.  Informative References
   Authors' Addresses

1.  Introduction

   This document describes how Segment Routing (SR) can be used to
   interconnect millions of endpoints.

2.  Terminology

   The following terms and abbreviations are used in this document:

      Term          Definition
      ---------------------------------------------------------
      Agg           Aggregation
      BGP           Border Gateway Protocol
      DC            Data Center
      DCI           Data Center Interconnect
      ECMP          Equal-Cost Multipath
      FIB           Forwarding Information Base
      LDP           Label Distribution Protocol
      LFIB          Label Forwarding Information Base
      MPLS          Multiprotocol Label Switching
      PCE           Path Computation Element
      PCEP          Path Computation Element Protocol
      PW            Pseudowire
      SLA           Service Level Agreement
      SR            Segment Routing
      SRTE Policy   Segment Routing Traffic Engineering Policy
      TE            Traffic Engineering
      TI-LFA        Topology Independent Loop-Free Alternate
3.  Reference Design

   The network diagram below describes the reference network topology
   used in this document:

      +-------+ +--------+ +--------+ +-------+ +-------+
      A      DCI1       Agg1       Agg3      DCI3      Z
      |  DC1  | |   M1   | |    C   | |  M2   | |  DC2  |
      |      DCI2       Agg2       Agg4      DCI4      |
      +-------+ +--------+ +--------+ +-------+ +-------+

                    Figure 1: Reference Topology

   The following applies to the reference topology above:

      Independent ISIS-OSPF/SR instance in the core (C) region.

      Independent ISIS-OSPF/SR instance in the Metro1 (M1) region.

      Independent ISIS-OSPF/SR instance in the Metro2 (M2) region.

      BGP/SR in DC1.

      BGP/SR in DC2.

      Agg routes (Agg1, Agg2, Agg3, Agg4) are redistributed from C to
      M (M1 and M2) and from M to DC domains.

      No other route is advertised or redistributed between regions.

      The same homogeneous SRGB is used throughout the domains (e.g.,
      16000-23999).

      Unique SRGB sub-ranges are allocated to the metro (M) and core
      (C) domains:

         The 16000-16999 range is allocated to the core (C)
         domain/region.

         The 17000-17999 range is allocated to the M1 domain/region.

         The 18000-18999 range is allocated to the M2 domain/region.

         Specifically, the Agg3 router has SID 16003 allocated, and
         the anycast SID for Agg3 and Agg4 is 16006.

         Specifically, the DCI3 router has SID 18003 allocated, and
         the anycast SID for DCI3 and DCI4 is 18006.

      The same SRGB sub-range is reused within each DC (DC1 and DC2)
      region: e.g., 20000-23999.  Specifically, the range 20000-23999
      is used in both the DC1 and DC2 regions, and both node A and
      node Z have SID 20001 allocated to them.

4.  Control Plane

   This section provides a high-level description of an implemented
   control plane.

   The mechanisms through which SRTE policies are defined, computed,
   and programmed in the source nodes are outside the scope of this
   document.

   Typically, a controller or a service orchestration system programs
   node A with a pseudowire (PW) to a remote next hop Z with a given
   SLA contract (e.g., a low-latency path, disjointness from a
   specific core plane, disjointness from a different PW service).

   Node A automatically detects that it does not have reachability to
   Z.  It then automatically sends a PCEP request to an SR PCE for an
   SRTE policy that provides reachability to Z with the requested SLA.

   The SR PCE is made of two components: a multi-domain topology and a
   computation engine.  The multi-domain topology is continuously
   refreshed through BGP-LS feeds from each domain.  The computation
   engine implements Traffic Engineering (TE) algorithms designed
   specifically for SR path expression.  Upon receiving the PCEP
   request, the SR PCE computes the requested path.  The path is
   expressed through a list of segments (e.g., {16003, 16005, 18001})
   and provided to node A.

   The SR PCE logs the request as a stateful query and hence is able
   to recompute the path at each network topology change.

   Node A receives the PCEP reply with the path (expressed as a
   segment list).  Node A installs the received SRTE policy in the
   data plane.  Node A then automatically steers the PW into that SRTE
   policy.
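
   The sequence above can be sketched in a few lines of Python.  This
   is only an illustration of the workflow, not an implementation: the
   class and method names (ToySRPCE, HeadEnd, provision_pw, etc.) are
   hypothetical, and no actual PCEP or BGP-LS API is implied.

      from dataclasses import dataclass, field

      class ToySRPCE:
          """Stands in for the stateful SR PCE; the BGP-LS-fed multi-
          domain topology and the TE computation are out of scope."""
          def compute(self, src, dst, sla):
              # Pretend the TE algorithm returns this path for the SLA.
              return [16003, 16005, 18001]

      @dataclass
      class SRTEPolicy:
          endpoint: str
          segment_list: list        # e.g., [16003, 16005, 18001]

      @dataclass
      class HeadEnd:
          name: str
          fib: set = field(default_factory=set)         # native routes
          policies: dict = field(default_factory=dict)  # endpoint->policy
          pw_steering: dict = field(default_factory=dict)

          def provision_pw(self, pw_id, endpoint, pce, sla):
              if endpoint not in self.fib:
                  # No native reachability: request a path from the SR
                  # PCE and install the returned segment list.
                  segments = pce.compute(self.name, endpoint, sla)
                  self.policies[endpoint] = SRTEPolicy(endpoint, segments)
              # Steer the PW into the SRTE policy toward the endpoint.
              self.pw_steering[pw_id] = endpoint

      # Node A knows only the redistributed Agg routes, not Z.
      node_a = HeadEnd(name="A", fib={"Agg1", "Agg2", "Agg3", "Agg4"})
      node_a.provision_pw("PW-to-Z", "Z", ToySRPCE(), sla="low-latency")
      print(node_a.policies["Z"].segment_list)   # [16003, 16005, 18001]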
5.  Illustration of the scale

   According to the reference topology shown in Figure 1, the
   following assumptions are made:

      There is 1 core domain and there are 100 leaf (metro) domains.

      The core domain includes 200 nodes.

      Two nodes connect each leaf (metro) domain.  Each node
      connecting a leaf domain has a SID allocated.  Each pair of
      nodes connecting a leaf domain also has a common anycast SID.
      This brings the total up to 300 prefix segments.

      A core node connects only one leaf domain.

      Each leaf domain has 6,000 leaf-node segments.  Each leaf node
      has 500 endpoints attached and thus 500 adjacency segments.  In
      total, this gives 3 million endpoints for a leaf domain.

   Based on the above, the network scaling numbers are as follows:

      6,000 leaf-node segments multiplied by 100 leaf domains:
      600,000 nodes.

      600,000 nodes multiplied by 500 endpoints: 300 million
      endpoints.

   The node scaling numbers are as follows:

      Leaf-node segment scale: 6,000 leaf-node segments + 300 core-
      node segments + 500 adjacency segments = 6,800 segments.

      Core-node segment scale: 6,000 leaf-domain segments + 300 core-
      domain segments = 6,300 segments.

   In the above calculations, the link adjacency segments are not
   taken into account.  These are local segments and, typically, there
   are fewer than 100 per node.

   It should be noted that, depending on leaf-node FIB capabilities,
   leaf domains could be split into multiple smaller domains.  In the
   above example, the leaf domains could be split into 6 smaller
   domains so that each leaf node only needs to learn 1,000 leaf-node
   segments + 300 core-node segments + 500 adjacency segments, which
   gives a total of 1,800 segments.
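
   The arithmetic above can be restated as a short Python sketch.  The
   variable names are illustrative only; all values are taken directly
   from the assumptions listed in this section.

      leaf_domains       = 100
      leaf_nodes_per_dom = 6000
      endpoints_per_leaf = 500
      core_prefix_sids   = 300   # (2 node SIDs + 1 anycast SID) x 100

      total_leaf_nodes = leaf_domains * leaf_nodes_per_dom      # 600,000
      total_endpoints  = total_leaf_nodes * endpoints_per_leaf  # 300,000,000

      # Per-node segment scale (link adjacency segments not counted):
      leaf_node_scale = (leaf_nodes_per_dom + core_prefix_sids
                         + endpoints_per_leaf)                  # 6,800
      core_node_scale = leaf_nodes_per_dom + core_prefix_sids   # 6,300

      # If each leaf domain is split into 6 smaller domains:
      split_leaf_scale = (leaf_nodes_per_dom // 6 + core_prefix_sids
                          + endpoints_per_leaf)                 # 1,800

      print(total_leaf_nodes, total_endpoints,
            leaf_node_scale, core_node_scale, split_leaf_scale)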
6.  Design Options

   This section describes multiple design options that vary the
   illustration of the previous section.

6.1.  SRGB Size

   In the simplified illustrations of this document, we picked a small
   homogeneous SRGB range of 16000-23999.  In practice, a large-scale
   design would use a bigger range, such as 16000-80000 or even
   larger.

6.2.  Redistribution of Agg nodes routes

   The operator might choose not to redistribute the Agg nodes'
   routes into the Metro/DC domains.  In that case, more segments are
   required in order to express an inter-domain path.

   For example, node A would use an SRTE policy {DCI1, Agg1, Agg3,
   DCI3, Z} in order to reach Z, instead of {Agg3, DCI3, Z} in the
   reference design.

6.3.  Sizing and hierarchy

   The operator is free to choose among a small number of larger leaf
   domains, a large number of small leaf domains, or a mix of small
   and large core/leaf domains.

   The operator is free to use a 2-tier design (Core/Metro) or a
   3-tier design (Core/Metro/DC).

6.4.  Local Segments to Hosts/Servers

   Local segments can be programmed at any leaf node (e.g., node Z) in
   order to identify locally attached hosts (or VMs).  For example, if
   node Z has bound a local segment 40001 to a local host ZH1, then
   node A uses the following SRTE policy in order to reach that host:
   {16006, 17006, 20001, 40001}.  Such a local segment could represent
   the NID (Network Interface Device) in the context of the service
   provider access network, or a VM in the context of the DC network.

6.5.  Compressed SRTE policies

   As an example, and according to Section 3, we assume that node A
   can reach node Z (e.g., with a low-latency SLA contract) via the
   SRTE policy consisting of the path Agg1, Agg2, Agg3, DCI3/DCI4
   (anycast), Z.  The path is represented by the segment list {16001,
   16002, 16003, 18006, 20001}.

   It is clear that the control-plane solution can install an SRTE
   policy {16002, 16003, 18006} at Agg1, collect the Binding SID
   allocated by Agg1 to that policy (e.g., 4001), and hence program
   node A with the compressed SRTE policy {16001, 4001, 20001}.

   From node A, 16001 leads to Agg1.  Once at Agg1, 4001 leads to the
   DCI pair (DCI3, DCI4) via a specific low-latency path {16002,
   16003, 18006}.  Once at that DCI pair, 20001 leads to Z.

   Binding SIDs allocated to "intermediate" SRTE policies allow the
   compression of end-to-end SRTE policies.

   The segment list {16001, 4001, 20001} expresses the same path as
   {16001, 16002, 16003, 18006, 20001} but with two fewer segments.

   The Binding SID also provides an inherent churn protection.

   When the core topology changes, the control plane can update the
   low-latency SRTE policy from Agg1 to the DCI pair toward DC2
   without updating the SRTE policy from A to Z.
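
   The compression and the churn protection described above can be
   illustrated with the following Python sketch.  It is only an
   illustration: the expand() helper and the data structures are
   hypothetical, and 4001 is simply the example Binding SID used in
   this section.

      # Intermediate SRTE policy installed at Agg1, keyed by its
      # Binding SID.
      binding_sids = {4001: [16002, 16003, 18006]}

      end_to_end = [16001, 16002, 16003, 18006, 20001]  # full path
      compressed = [16001, 4001, 20001]                 # programmed at A

      def expand(segment_list, bindings):
          """Replace each Binding SID with the segment list of the
          intermediate policy it identifies."""
          expanded = []
          for sid in segment_list:
              expanded.extend(bindings.get(sid, [sid]))
          return expanded

      # The compressed policy expresses the same path with two fewer
      # segments.
      assert expand(compressed, binding_sids) == end_to_end

      # Churn protection: a core-topology change only requires Agg1 to
      # update binding_sids[4001]; the compressed policy at node A is
      # left untouched.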
7.  Deployment Model

   It is expected that this design will be deployed as a greenfield
   design but also in interworking (brownfield) with the Seamless MPLS
   design as described in [I-D.ietf-mpls-seamless-mpls].

8.  Benefits

   The design options illustrated in this document allow
   interconnection on a very large scale.  Millions of endpoints
   across different domains can be interconnected.

8.1.  Simplified operations

   Two protocols have been removed from the network: LDP and RSVP-TE.
   No new protocol has been introduced.  The design leverages the core
   IP protocols ISIS, OSPF, BGP, and PCEP with straightforward SR
   extensions.

8.2.  Inter-domain SLA

   Fast reroute and resiliency are provided by TI-LFA with sub-50 msec
   FRR upon link, node, or SRLG failure.  TI-LFA is described in
   [I-D.francois-rtgwg-segment-routing-ti-lfa].

   The use of anycast SIDs also provides improved availability and
   resiliency.

   Inter-domain SLAs can be delivered, e.g., latency- vs. cost-
   optimized paths, disjointness from backbone planes, disjointness
   from other services, and disjointness between primary and backup
   paths.

   Existing inter-domain solutions (Seamless MPLS) do not provide any
   support for SLA contracts.  They just provide best-effort
   reachability across domains.

8.3.  Scale

   In addition to having eliminated two control-plane protocols, per-
   service midpoint states have also been removed from the network.

8.4.  ECMP

   Each policy (intra-domain or inter-domain, with or without TE) is
   expressed as a list of segments.  Since each segment is optimized
   for ECMP, the entire policy is optimized for ECMP.  The ECMP gain
   of anycast prefix segments should also be considered (e.g., 16001
   load-shares across any gateway from the M1 leaf domain to the Core,
   and 16002 load-shares across any gateway from the Core to the M1
   leaf domain).

9.  IANA Considerations

   TBD

10.  Manageability Considerations

   TBD

11.  Security Considerations

   TBD

12.  Acknowledgements

   We would like to thank Giles Heron, Alexander Preusche, Steve
   Braaten, and Francis Ferguson for their contributions to the
   content of this document.

13.  References

13.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997,
              <https://www.rfc-editor.org/info/rfc2119>.

13.2.  Informative References

   [I-D.francois-rtgwg-segment-routing-ti-lfa]
              Francois, P., Bashandy, A., and C. Filsfils, "Topology
              Independent Fast Reroute using Segment Routing",
              draft-francois-rtgwg-segment-routing-ti-lfa-02 (work in
              progress), November 2016.

   [I-D.ietf-mpls-seamless-mpls]
              Leymann, N., Decraene, B., Filsfils, C., Konstantynowicz,
              M., and D. Steinberg, "Seamless MPLS Architecture",
              draft-ietf-mpls-seamless-mpls-07 (work in progress),
              June 2014.

Authors' Addresses

   Clarence Filsfils (editor)
   Cisco Systems, Inc.
   Brussels
   Belgium

   Email: cfilsfil@cisco.com

   Dennis Cai
   Cisco Systems, Inc.

   Email: dcai@cisco.com

   Stefano Previdi (editor)
   Cisco Systems, Inc.
   Via Del Serafico, 200
   Rome  00142
   Italy

   Email: sprevidi@cisco.com

   Wim Henderickx
   Nokia
   Copernicuslaan 50
   Antwerp  2018
   Belgium

   Email: wim.henderickx@nokia.com

   Dave Cooper
   Level 3

   Email: Dave.Cooper@Level3.com

   Tim Laberge
   Cisco Systems, Inc.

   Email: tlaberge@cisco.com

   Steven Lin
   Individual

   Email: slin100@yahoo.com

   Bruno Decraene
   Orange
   FR

   Email: bruno.decraene@orange.com

   Luay Jalil
   Verizon
   400 International Pkwy
   Richardson, TX  75081
   United States

   Email: luay.jalil@verizon.com

   Jeff Tantsura
   Individual

   Email: jefftant@gmail.com

   Rob Shakir
   Google, Inc.
   1600 Amphitheatre Parkway
   Mountain View, CA  94043

   Email: robjs@google.com