Network Working Group                                             J. Xie
Internet-Draft                                       Huawei Technologies
Intended status: Informational                                     X. Xu
Expires: November 7, 2020                                   Alibaba Inc.
                                                                  G. Yan
                                                     Huawei Technologies
                                                              M. McBride
                                                               Futurewei
                                                             May 6, 2020


           Use of BIER Entropy for Data Center Clos Networks
               draft-ietf-bier-entropy-staged-dc-clos-03

Abstract

   Bit Index Explicit Replication (BIER) introduces a new
   multicast-specific BIER header.  BIER can be applied to the
   Multiprotocol Label Switching (MPLS) data plane or to a non-MPLS
   data plane.
   Entropy is a technique used in BIER to support load-balancing.  This
   document examines and describes how BIER Entropy is to be applied to
   Data Center Clos networks for path selection.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 7, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Problem Statement and Considerations
     3.1.  Problem Statement
     3.2.  Considerations
   4.  Use of BIER Entropy for DC Clos Network
     4.1.  Use of BIER Entropy for DC Clos Network
     4.2.  Steering for elephant flows
     4.3.  Path Division for Tenant flows to different SIs
     4.4.  Link Failure and Convergence
   5.  Data-Plane Processing
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Bit Index Explicit Replication (BIER) [RFC8279] is an architecture
   that provides optimal multicast forwarding without requiring
   intermediate routers to maintain any per-flow state by using a
   multicast-specific BIER header.  [RFC8296] defines two types of BIER
   encapsulation formats: one is MPLS encapsulation, the other is
   non-MPLS encapsulation.  Entropy is a technique used in BIER to
   support load-balancing.  This document examines and describes how
   BIER Entropy is to be applied to Data Center Clos networks for path
   selection.

2.  Terminology

   Readers of this document are assumed to be familiar with the
   terminology and concepts of the documents listed as Normative
   References.

3.  Problem Statement and Considerations

3.1.  Problem Statement

   A common choice for a horizontally scalable topology in data centers
   is a Clos topology.  This topology features an odd number of stages;
   an example is the 5-stage Clos topology described in [RFC7938].

   ECMP is the fundamental load-sharing mechanism used by a Clos
   topology.  Effectively, every lower-tier device will use all of its
   directly attached upper-tier devices to load-share traffic destined
   to the same IP prefix.  The number of ECMP paths between any two Tier
   3 devices in a Clos topology is equal to the number of devices in
   the middle stage (Tier 1).  For example, Figure 1 illustrates a
   topology where Tier 3 device L1 has four paths to reach servers X and
   Y, via Tier 2 devices S1 and S2 and then Tier 1 devices S11, S12, S21,
   and S22, respectively.

                                      Tier 1
                                     +-----+
          Cluster                    |SUPER|
 +----------------------------+   +--| S11 |--+
 |                            |   |  +-----+  |
 |                    Tier 2  |   |           |   Tier 2
 |                   +-----+  |   |  +-----+  |  +-----+
 |     +-------------|SPINE|------+--|SUPER|--+--|SPINE|-------------+
 |     |       +-----| S1  |------+  | S12 |  +--| S3  |-----+       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     |       |              |                              |       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     | +-----------|SPINE|------+  |SUPER|  +--|SPINE|-----------+ |
 |     | |     | +---| S2  |------+--| S21 |--+--| S4  |---+ |     | |
 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
 |     | |     | |            |   |           |            | |     | |
 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
 |   | LEAF| | LEAF|          |   +--|SUPER|--+          | LEAF| | LEAF|
 |   | L1  | | L2  | Tier 3   |      | S22 |      Tier 3 | L3  | | L4  |
 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
 |     | |     | |            |                            | |     | |
 |     O O     O O            |                            X Y     O O
 |       Servers              |                              Servers
 +----------------------------+

                    Figure 1: 5-Stage Clos Topology

   When BIER is deployed in a multi-tenant data center network
   environment for efficient delivery of Broadcast, Unknown-unicast, and
   Multicast (BUM) traffic, a network operator may want a
   deterministic path for every packet.  For example, when L1 needs to
   send a BUM packet to L3 and L4, which are in different Set
   Identifiers (SIs), L1 has to send the packet twice and expects the
   two copies to follow the deterministic paths L1->S1->S11-->L3 and
   L1->S2->S21-->L4, respectively.  Another example of using a
   deterministic path in a DC is the per-flow steering of "elephant"
   flows defined in [I-D.ietf-spring-segment-routing-msdc].

   A deterministic path for a multicast packet across multiple stages
   of equal-cost paths is comparable to the traffic-engineering path
   defined in [RFC8662] for a unicast packet across multiple hops of
   equal-cost paths.

3.2.  Considerations

   The idea behind entropy is that the ingress router computes a hash
   over several fields of a given packet and places the result in an
   additional label, named the "entropy label".  A transit router can
   then use this entropy label as part of its hash keys.  When the
   entropy label is used, the keys fed to the hashing function are
   still a local configuration matter: a router may use the entropy
   label alone or combine it with other fields of the incoming packet.
   The purpose of the hashing function is to spread a large number of
   flows pseudo-randomly across a small number of equal-cost paths.

   If, however, one wants to select a deterministic path among the
   equal-cost paths, one can use part of the 20-bit entropy field.  For
   example, bit 0 to bit 2 of the entropy label can represent a value of
   0 to 7 and can thus be used to select a deterministic path from 8
   equal-cost paths.  Routers in different tiers can therefore each
   select a deterministic path independently by using different parts
   of the 20-bit entropy label, and together these selections form an
   end-to-end deterministic path.
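
   The bit-field selection described above can be sketched in a few
   lines of Python.  This is an illustrative sketch only and is not
   part of the BIER specifications: the function names select_path and
   build_entropy are hypothetical, and the sketch assumes that bit 0 is
   the least significant bit of the 20-bit entropy value and that the
   fields used by different tiers do not overlap.

```python
ENTROPY_BITS = 20  # width of the BIER entropy field


def _check_field(first_bit: int, num_bits: int) -> None:
    """Reject bit fields that do not lie inside the 20-bit entropy value."""
    if first_bit < 0 or num_bits <= 0 or first_bit + num_bits > ENTROPY_BITS:
        raise ValueError("bit field must lie inside the 20-bit entropy value")


def select_path(entropy: int, first_bit: int, num_bits: int) -> int:
    """Return the path index a router reads from its part of the entropy.

    Assumption: bit 0 is the least significant bit of the entropy value.
    """
    if not 0 <= entropy < (1 << ENTROPY_BITS):
        raise ValueError("entropy must fit in 20 bits")
    _check_field(first_bit, num_bits)
    return (entropy >> first_bit) & ((1 << num_bits) - 1)


def build_entropy(choices) -> int:
    """Pack one (path_index, first_bit, num_bits) tuple per tier.

    Assumption: the caller supplies non-overlapping bit fields.
    """
    entropy = 0
    for path_index, first_bit, num_bits in choices:
        _check_field(first_bit, num_bits)
        if not 0 <= path_index < (1 << num_bits):
            raise ValueError("path index does not fit in its bit field")
        entropy |= path_index << first_bit
    return entropy


# One tier reads bits 0-2 (8 paths); the next tier reads bits 3-4 (4 paths).
entropy = build_entropy([(5, 0, 3), (2, 3, 2)])
first_tier_path = select_path(entropy, 0, 3)
second_tier_path = select_path(entropy, 3, 2)
```

   In this usage, the ingress packs path 5 for the first tier and path
   2 for the second tier into one entropy value, and each tier recovers
   its own choice with a shift and a mask, without any hashing.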

   This approach is simple and especially applicable to DC Clos
   networks, because data delivery for tenants in a DC Clos network is
   always multi-staged, and the stages in the upstream direction have
   equal-cost paths.

4.  Use of BIER Entropy for DC Clos Network

4.1.  Use of BIER Entropy for DC Clos Network

   Take the 5-stage Clos network in Figure 1 as an example.

   Tier 2 in every cluster has N nodes, and Tier 1 has M nodes.  M is
   equal to N multiplied by P.

   Tier 3 switches, in the upstream direction, act as stage 1 of data
   delivery and have N equal-cost paths to every BFER in other
   clusters.  Tier 2 switches, in the upstream direction, act as stage 2
   of data delivery and have P equal-cost paths to every BFER in other
   clusters.

   Example 1: When N is equal to 2 and P is equal to 2, one can
   configure each Tier 3 switch to use bit 0 for path selection and
   each Tier 2 switch to use bit 1 for path selection.

   Example 2: When N is equal to 4 and P is equal to 48, one can
   configure each Tier 3 switch to use bit 0 to bit 1 for path
   selection and each Tier 2 switch to use bit 2 to bit 7 for path
   selection.

   Assume that each Tier 3 and Tier 2 switch in the example has two
   locally configured parameters, X and Y, that identify the part of
   the entropy label it uses for path selection.  In the values below,
   X is the weight of the lowest bit of that part (1 for a field
   starting at bit 0, 4 for a field starting at bit 2), and Y is the
   number of values that part can hold.  Then, in Example 2:

   o  Each Tier 3 (Stage 1) switch has the pair of parameters (X1=1,
      Y1=4).

   o  Each Tier 2 (Stage 2) switch has the pair of parameters
      (X2=X1*Y1=4, Y2=64).

   o  Each Tier 3 (Stage 1) switch populates its BIFTs for ECMP, for
      example, BIFT-0 to BIFT-3.

   o  Each Tier 2 (Stage 2) switch populates its BIFTs for ECMP, for
      example, BIFT-0 to BIFT-47.

   For each Tier 3 (Stage 1) switch, each BIFT will have a preferred
   neighboring BFR.
   For example, LEAF L1 will have preferred neighbors S1 and S2 for
   BIFT-0 and BIFT-1, respectively.  When the BIFT-0 table is built from
   the underlay routes to every BFER, the preferred neighboring BFR has
   the highest priority among all locally available ECMP paths.

   An end-to-end deterministic path for a BIER packet can then be
   obtained by calculating the entropy label value as follows:

   o  Entropy = (P1-1)*X1 + (P2-1)*X2

   where P1 represents one of the Stage 1 equal-cost paths, with a value
   between 1 and N, and P2 represents one of the Stage 2 equal-cost
   paths, with a value between 1 and P.

4.2.  Steering for elephant flows

   One can steer an "elephant" flow onto a single end-to-end
   deterministic path, or onto several end-to-end deterministic paths
   divided across different SIs.

4.3.  Path Division for Tenant flows to different SIs

   When the VNEs for a tenant span multiple SIs, it is useful to divide
   the paths of the BUM packets across the different SIs.

   When BIER is used as the BUM tunnel, one can configure a policy, on
   each VNE and for each VNI, to use different paths for different BIER
   SIs.

4.4.  Link Failure and Convergence

   As stated above, each BIFT on a BFR will have a preferred
   neighboring BFR.  When the link to the preferred neighbor of some
   BIFT (say BIFT-X) fails, BIFT-X will converge normally, but the path
   given by BIFT-X will then probably no longer be the 'best optimized'
   path.  For example, if the link between S1 and L2 fails, then S1,
   the preferred neighbor of BIFT-0 on LEAF L1, is no longer a
   neighboring BFR of LEAF L2, and a flow whose Entropy selects LEAF
   L1's BIFT-0 has to be replicated on L1: one packet to S1 for BFERs L3
   and L4, and one packet to S2 for BFER L2.  If the flow changes to an
   Entropy that selects LEAF L1's BIFT-1, it will again follow the 'best
   optimized' path, because the flow does not have to be replicated on
   L1, which needs to forward only one copy to S2 for BFERs L2, L3, and
   L4.
   Such a change to a flow's entropy is the responsibility of the
   ingress switch, possibly with the assistance of a controller.

5.  Data-Plane Processing

   The use of the BIER entropy label to select one of several
   equal-cost paths is a local configuration matter.  This document
   describes a method in which each router uses part of the 20-bit
   entropy label, which requires the data plane to perform simple bit
   operations.  This is expected to be easier to implement than a
   hashing function.

6.  Security Considerations

   This document introduces no new security considerations beyond those
   already specified in [RFC8279] and [RFC8296].

7.  IANA Considerations

   This document contains no actions for IANA.

8.  Acknowledgements

   The authors wish to thank Tony Przygienda, Greg Shepherd, Alia Atlas,
   Jeffery Zhang, Andrew Dolganow, and Toerless Eckert for their
   reviews, comments, and suggestions.

9.  References

9.1.  Normative References

   [I-D.ietf-spring-segment-routing-msdc]
              Filsfils, C., Previdi, S., Dawra, G., Aries, E., and P.
              Lapukhov, "BGP-Prefix Segment in large-scale data
              centers", draft-ietf-spring-segment-routing-msdc-11 (work
              in progress), November 2018.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017.

   [RFC8296]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Tantsura, J., Aldrin, S., and I. Meilik, "Encapsulation
              for Bit Index Explicit Replication (BIER) in MPLS and
              Non-MPLS Networks", RFC 8296, DOI 10.17487/RFC8296,
              January 2018.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W.
              Henderickx, "A Network Virtualization Overlay Solution
              Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018.

   [RFC8662]  Kini, S., Kompella, K., Sivabalan, S., Litkowski, S.,
              Shakir, R., and J. Tantsura, "Entropy Label for Source
              Packet Routing in Networking (SPRING) Tunnels", RFC 8662,
              DOI 10.17487/RFC8662, December 2019.

9.2.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

Authors' Addresses

   Jingrong Xie
   Huawei Technologies

   Email: xiejingrong@huawei.com


   Xiaohu Xu
   Alibaba Inc.

   Email: xiaohu.xxh@alibaba-inc.com


   Gang Yan
   Huawei Technologies

   Email: yangang@huawei.com


   Mike McBride
   Futurewei

   Email: mmcbride7@gmail.com