Network Working Group                                             J. Xie
Internet-Draft                                       Huawei Technologies
Intended status: Informational                                     X. Xu
Expires: April 22, 2019                                     Alibaba Inc.
                                                                  G. Yan
                                                              M. McBride
                                                     Huawei Technologies
                                                        October 19, 2018


           Use of BIER Entropy for Data Center CLOS Networks
                draft-xie-bier-entropy-staged-dc-clos-02

Abstract

   Bit Index Explicit Replication (BIER) introduces a new
   multicast-specific BIER Header.
   BIER can be applied to the Multiprotocol Label Switching (MPLS)
   data plane or to a non-MPLS data plane.  Entropy is a technique used
   in BIER to support load-balancing.  This document examines and
   describes how BIER Entropy is to be applied to Data Center CLOS
   networks for path selection.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 22, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Problem Statement and Considerations
     3.1.  Problem Statement
     3.2.  Considerations
   4.  Use of BIER Entropy for DC CLOS Network
     4.1.  Use of BIER Entropy for DC CLOS Network
     4.2.  Steering for elephant flows
     4.3.  Path Division for Tenant flows to different SIs
     4.4.  Link Failure and Convergence
   5.  Data-Plane Processing
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Bit Index Explicit Replication (BIER) [RFC8279] is an architecture
   that provides optimal multicast forwarding without requiring
   intermediate routers to maintain any per-flow state by using a
   multicast-specific BIER header.  [RFC8296] defines two types of BIER
   encapsulation formats: MPLS encapsulation and non-MPLS
   encapsulation.  Entropy is a technique used in BIER to support
   load-balancing.  This document examines and describes how BIER
   Entropy is to be applied to Data Center CLOS networks for path
   selection.

2.  Terminology

   Readers of this document are assumed to be familiar with the
   terminology and concepts of the documents listed as Normative
   References.

3.  Problem Statement and Considerations

3.1.  Problem Statement

   A common choice for a horizontally scalable topology used in data
   centers is a CLOS topology.  This topology features an odd number of
   stages; [RFC7938] gives a 5-stage CLOS topology as an example.

   ECMP is the fundamental load-sharing mechanism used by a CLOS
   topology.  Effectively, every lower-tier device will use all of its
   directly attached upper-tier devices to load-share traffic destined
   to the same IP prefix.  The number of ECMP paths between any two Tier
   3 devices in a CLOS topology is equal to the number of devices in
   the middle stage (Tier 1).  For example, Figure 1 illustrates a
   topology where Tier 3 device L1 has four paths to reach servers X and
   Y, via Tier 2 devices S1 and S2 and then Tier 1 devices S11, S12, S21
   and S22, respectively.

                                     Tier 1
                                     +-----+
           Cluster                   |SUPER|
 +----------------------------+   +--| S11 |--+
 |                            |   |  +-----+  |
 |                   Tier 2   |   |           |   Tier 2
 |                   +-----+  |   |  +-----+  |  +-----+
 |     +-------------|SPINE|------+--|SUPER|--+--|SPINE|-------------+
 |     |       +-----| S1  |------+  | S12 |  +--| S3  |-----+       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     |       |              |                              |       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     | +-----------|SPINE|------+  |SUPER|  +--|SPINE|-----------+ |
 |     | |     | +---| S2  |------+--| S21 |--+--| S4  |---+ |     | |
 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
 |     | |     | |            |   |           |            | |     | |
 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
 |   | LEAF| | LEAF|          |   +--|SUPER|--+          | LEAF| | LEAF|
 |   | L1  | | L2  |  Tier 3  |      | S22 |     Tier 3  | L3  | | L4  |
 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
 |     | |     | |            |                            | |     | |
 |     O O     O O            |                            X Y     O O
 |       Servers              |                              Servers
 +----------------------------+

                    Figure 1: 5-Stage CLOS Topology

   When BIER is deployed in a multi-tenant data center network
   environment for efficient delivery of Broadcast, Unknown-unicast and
   Multicast (BUM) traffic, a network operator may want a
   deterministic path for every packet.  For example, when L1 needs to
   send a BUM packet to L3 and L4, which are in different Set
   Identifiers (SIs), L1 has to send the packet twice, and expects the
   two copies to follow the deterministic paths L1->S1->S11-->L3 and
   L1->S2->S21-->L4, respectively.  Another example of using a
   deterministic path in a DC is the per-flow steering of "elephant"
   flows described in [I-D.ietf-spring-segment-routing-msdc].

   A deterministic path for a multicast packet across multiple stages
   of equal-cost paths is comparable to the traffic-engineering path
   defined in [I-D.ietf-mpls-spring-entropy-label] for a unicast packet
   across multiple hops of equal-cost paths.

3.2.  Considerations

   The idea behind entropy is that the ingress router computes a hash
   based on several fields from a given packet and places the result in
   an additional label, named the "entropy label".  This entropy label
   can then be used as part of the hash keys used by a transit router.
   When the entropy label is used, the keys used in the hashing
   functions are still a local configuration matter.  A router may
   solely use the entropy label or use a combination of multiple fields
   from the incoming packet.  The purpose of the hashing function is to
   randomly load-balance a large number of flows across a small number
   of equal-cost paths.

   If one wants, however, to get a deterministic path from the
   equal-cost paths, one can use part of the 20-bit entropy field.  For
   example, bit 0 to bit 2 of the entropy label can represent a value
   of 0 to 7, and thus can be used to select a deterministic path from
   8 equal-cost paths.  Routers in different tiers can therefore each
   select a deterministic path independently by using different parts
   of the 20-bit entropy label, and together they form an end-to-end
   deterministic path.
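
   As a non-normative sketch of the bit-slicing idea above, the
   following Python fragment extracts a contiguous bit range from a
   20-bit entropy value and uses it as a path index.  The function name
   select_path, its parameter names, the example bit ranges, and the
   choice of bit 0 as the least significant bit are assumptions made
   for illustration; this document does not define them.

```python
ENTROPY_BITS = 20  # width of the BIER entropy field


def select_path(entropy: int, first_bit: int, num_bits: int) -> int:
    """Return the path index held in bits first_bit..first_bit+num_bits-1.

    Assumption: bit 0 is the least significant bit of the entropy value.
    """
    if not 0 <= entropy < (1 << ENTROPY_BITS):
        raise ValueError("entropy must fit in 20 bits")
    if first_bit < 0 or num_bits <= 0 or first_bit + num_bits > ENTROPY_BITS:
        raise ValueError("bit range must lie within the entropy field")
    mask = (1 << num_bits) - 1
    return (entropy >> first_bit) & mask


# Hypothetical configuration: one tier reads bits 0-2 (8 paths),
# the next tier reads bits 3-4 (4 paths).
entropy = 0b10101
lower_tier_path = select_path(entropy, first_bit=0, num_bits=3)  # 0b101 -> 5
upper_tier_path = select_path(entropy, first_bit=3, num_bits=2)  # 0b10 -> 2
```

   Because the two tiers read non-overlapping bit ranges, the choice
   made at one tier does not constrain the choice made at the other.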

   This approach is simple and especially applicable to DC CLOS
   networks, because data delivery for tenants in DC CLOS networks is
   always multi-staged, with the stages in the upstream direction
   having equal-cost paths.

4.  Use of BIER Entropy for DC CLOS Network

4.1.  Use of BIER Entropy for DC CLOS Network

   Take the 5-stage CLOS network in Figure 1 as an example.

   Tier 2 in every cluster has N nodes, and Tier 1 has M nodes.  M is
   equal to N multiplied by P.

   Tier 3 switches, in the upstream direction, act as stage 1 of data
   delivery and have N equal-cost paths to every BFER in other
   clusters.  Tier 2 switches, in the upstream direction, act as stage
   2 of data delivery and have P equal-cost paths to every BFER in
   other clusters.

   Example 1: When N is equal to 2, one can configure each Tier 3
   switch to use bit 0 for path selection; when P is equal to 2, one
   can configure each Tier 2 switch to use bit 1 for path selection.

   Example 2: When N is equal to 4, one can configure each Tier 3
   switch to use bit 0 to bit 1 for path selection; when P is equal to
   48, one can configure each Tier 2 switch to use bit 2 to bit 7 for
   path selection.

   Assume that each of the Tier 3 and Tier 2 switches in the example
   has two locally configured parameters, X and Y, that determine which
   part of the entropy label it uses for path selection.  Then, in
   Example 2:

   o  Each Tier 3 (Stage 1) switch has the pair of parameters (X1=1,
      Y1=4).

   o  Each Tier 2 (Stage 2) switch has the pair of parameters
      (X2=X1*Y1=4, Y2=64).

   o  Each Tier 3 (Stage 1) switch populates its BIFTs for ECMP, for
      example, BIFT-0 to BIFT-3.

   o  Each Tier 2 (Stage 2) switch populates its BIFTs for ECMP, for
      example, BIFT-0 to BIFT-47.

   On each Tier 3 (Stage 1) switch, each BIFT has a preferred
   neighboring BFR.
   For example, LEAF L1 has preferred neighbors S1 and S2 for BIFT-0
   and BIFT-1, respectively; when the BIFT-0 table is built from the
   underlay routes to every BFER, the preferred neighboring BFR has the
   highest priority among all the locally available ECMP paths.

   An end-to-end deterministic path for a BIER packet can then be
   obtained by calculating the entropy label value as follows:

   o  Entropy = (P1-1)*X1 + (P2-1)*X2

   where P1 identifies one of the Stage 1 equal-cost paths, with a
   value between 1 and N, and P2 identifies one of the Stage 2
   equal-cost paths, with a value between 1 and P.

4.2.  Steering for elephant flows

   One can steer an "elephant" flow onto an end-to-end deterministic
   path, or split it over several end-to-end deterministic paths across
   different SIs.

4.3.  Path Division for Tenant flows to different SIs

   When the VNEs for a tenant span multiple SIs, it is useful to divide
   the paths of the BUM packets across the different SIs.

   When BIER is used as the BUM tunnel, one can configure a policy, on
   each VNE for each VNI, to use different paths for different BIER
   SIs.

4.4.  Link Failure and Convergence

   As stated above, each BIFT on a BFR has a preferred neighboring BFR.
   When the link to the preferred neighbor of some BIFT (say BIFT-X)
   fails, BIFT-X converges normally, but the path selected through
   BIFT-X will then probably no longer be the 'best optimized' path.
   For example, if the link between S1 and L2 fails, then S1, the
   preferred neighbor of BIFT-0 on LEAF L1, is no longer a neighboring
   BFR of LEAF L2, and a flow whose Entropy selects LEAF L1's BIFT-0
   has to be replicated on L1: one copy is sent to S1 for BFERs L3 and
   L4, and one copy is sent to S2 for BFER L2.  If the flow changes to
   an Entropy that selects LEAF L1's BIFT-1, it follows the 'best
   optimized' path again, because the packet no longer has to be
   replicated on L1; L1 forwards a single copy to S2 for BFERs L2, L3
   and L4.
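
   The entropy calculation of Section 4.1 and the change of entropy
   described above can be sketched as follows.  This is an illustration
   only: the names StageParams, encode_entropy and bift_index are
   hypothetical, and recovering the BIFT index as (entropy / X) mod Y
   is an assumed reading of the X and Y parameters rather than
   something this document specifies.  The parameter values are those
   of Example 2; the Stage 2 choice of path 10 is arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageParams:
    x: int      # weight of this stage's field within the entropy value
    y: int      # number of values the stage's field can hold
    paths: int  # number of equal-cost paths (BIFTs) actually populated


def encode_entropy(choices, stages):
    """Entropy = sum of (P_i - 1) * X_i, with P_i in 1..paths_i."""
    if len(choices) != len(stages):
        raise ValueError("one path choice is needed per stage")
    entropy = 0
    for p, stage in zip(choices, stages):
        if not 1 <= p <= stage.paths:
            raise ValueError("path choice out of range for stage")
        entropy += (p - 1) * stage.x
    return entropy


def bift_index(entropy, stage):
    """Assumed inverse of the formula: (entropy // X) mod Y."""
    index = (entropy // stage.x) % stage.y
    if index >= stage.paths:
        raise ValueError("entropy selects a BIFT that is not populated")
    return index


# Example 2 parameters: N = 4 paths at Stage 1, P = 48 paths at Stage 2.
TIER3 = StageParams(x=1, y=4, paths=4)
TIER2 = StageParams(x=4, y=64, paths=48)

# A flow initially uses Stage 1 path 1 (BIFT-0), then moves to
# Stage 1 path 2 (BIFT-1) while keeping Stage 2 path 10.
before = encode_entropy([1, 10], [TIER3, TIER2])  # 0*1 + 9*4 = 36
after = encode_entropy([2, 10], [TIER3, TIER2])   # 1*1 + 9*4 = 37
```

   Only the Stage 1 field changes between the two values, so the Tier 2
   switch still selects the same BIFT after the flow is moved.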

   Such a change to a flow's entropy is the responsibility of the
   ingress switch, possibly with the assistance of a controller.

5.  Data-Plane Processing

   The use of the BIER entropy label to select one of several
   equal-cost paths is a local configuration matter.  This draft
   defines a method in which each router uses part of the 20-bit
   entropy label, which requires the data plane to perform some bit
   operations.  This is expected to be easier than implementing a
   hashing function.

6.  Security Considerations

   This document introduces no new security considerations beyond those
   already specified in [RFC8279] and [RFC8296].

7.  IANA Considerations

   This document contains no actions for IANA.

8.  Acknowledgements

   The authors wish to thank Tony Przygienda, Greg Shepherd, Alia Atlas,
   Jeffery Zhang, Andrew Dolganow, and Toerless Eckert for their
   reviews, comments and suggestions.

9.  References

9.1.  Normative References

   [I-D.ietf-mpls-spring-entropy-label]
              Kini, S., Kompella, K., Sivabalan, S., Litkowski, S.,
              Shakir, R., and J. Tantsura, "Entropy label for SPRING
              tunnels", draft-ietf-mpls-spring-entropy-label-12 (work in
              progress), July 2018.

   [I-D.ietf-spring-segment-routing-msdc]
              Filsfils, C., Previdi, S., Dawra, G., Aries, E., and P.
              Lapukhov, "BGP-Prefix Segment in large-scale data
              centers", draft-ietf-spring-segment-routing-msdc-10 (work
              in progress), October 2018.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017.

   [RFC8296]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Tantsura, J., Aldrin, S., and I.
              Meilik, "Encapsulation for Bit Index Explicit Replication
              (BIER) in MPLS and Non-MPLS Networks", RFC 8296,
              DOI 10.17487/RFC8296, January 2018.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018.

9.2.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

Authors' Addresses

   Jingrong Xie
   Huawei Technologies

   Email: xiejingrong@huawei.com


   Xiaohu Xu
   Alibaba Inc.

   Email: xiaohu.xxh@alibaba-inc.com


   Gang Yan
   Huawei Technologies

   Email: yangang@huawei.com


   Mike McBride
   Huawei Technologies

   Email: mmcbride7@gmail.com