Network Working Group                                             J. Xie
Internet-Draft                                       Huawei Technologies
Intended status: Standards Track                                   X. Xu
Expires: January 3, 2019                                    Alibaba Inc.
                                                                  G. Yan
                                                              M. McBride
                                                     Huawei Technologies
                                                            July 2, 2018


           Use of BIER Entropy for Data Center CLOS Networks
            draft-xie-mboned-bier-entropy-staged-dc-clos-00

Abstract

   Bit Index Explicit Replication (BIER) introduces a new multicast-
   specific BIER Header. BIER can be applied to the Multi Protocol
   Label Switching (MPLS) data plane or Non-MPLS data plane. Entropy is
   a technique used in BIER to support load-balancing. This document
   examines and describes how BIER Entropy is to be applied to Data
   Center CLOS networks for path selection.

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 3, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Problem Statement and Considerations
     3.1.  Problem Statement
     3.2.  Considerations
   4.  Use of BIER Entropy for DC CLOS Network
     4.1.  Use of BIER Entropy for DC CLOS Network
     4.2.  Steering for elephant flows
     4.3.  Path Division for Tenant flows to different SIs
     4.4.  Link Failure and Convergence
   5.  Data-Plane Processing
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Bit Index Explicit Replication (BIER) [RFC8279] is an architecture
   that uses a multicast-specific BIER header to provide optimal
   multicast forwarding without requiring intermediate routers to
   maintain any per-flow state. [RFC8296] defines two BIER
   encapsulation formats: MPLS encapsulation and non-MPLS
   encapsulation. Entropy is a technique used in BIER to support
   load-balancing. This document examines and describes how BIER
   Entropy is to be applied to Data Center CLOS networks for path
   selection.

2.  Terminology

   Readers of this document are assumed to be familiar with the
   terminology and concepts of the documents listed as Normative
   References.

3.  Problem Statement and Considerations

3.1.  Problem Statement

   A Clos topology is a common choice for a horizontally scalable
   topology in a data center. This topology features an odd number of
   stages; [RFC7938] describes a 5-stage Clos topology as one example.

   Equal-Cost Multipath (ECMP) is the fundamental load-sharing
   mechanism used by a Clos topology. Effectively, every lower-tier
   device will use all of its directly attached upper-tier devices to
   load-share traffic destined to the same IP prefix. The number of
   ECMP paths between any two Tier 3 devices in a Clos topology is
   equal to the number of devices in the middle stage (Tier 1). For
   example, Figure 1 illustrates a topology where Tier 3 device L1 has
   four paths to reach servers X and Y, via Tier 2 devices S1 and S2
   and then Tier 1 devices S11, S12, S21, and S22, respectively.
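
   As an illustrative sketch only, the path count in this example can
   be checked by enumerating the upstream paths from L1. The adjacency
   table is transcribed from Figure 1, and the names UPLINKS and
   upstream_paths are hypothetical helpers that are not defined by this
   document.

```python
# Upstream adjacency for the left-hand cluster, transcribed from Figure 1.
# Assumption: S1 attaches to S11/S12 and S2 attaches to S21/S22.
UPLINKS = {
    "L1": ["S1", "S2"],
    "S1": ["S11", "S12"],
    "S2": ["S21", "S22"],
}


def upstream_paths(node, uplinks):
    """Return every path from `node` up to a device that has no uplinks."""
    next_hops = uplinks.get(node, [])
    if not next_hops:
        # A device without uplinks is a top-tier (Tier 1) device.
        return [[node]]
    paths = []
    for hop in next_hops:
        for tail in upstream_paths(hop, uplinks):
            paths.append([node] + tail)
    return paths


paths = upstream_paths("L1", UPLINKS)
for path in paths:
    print("->".join(path))
print(len(paths), "upstream paths")
```

   Each upstream path ends at a distinct Tier 1 device. Assuming that
   every Tier 1 device has a single downstream path toward the
   destination leaf, the four upstream choices correspond to the four
   ECMP paths mentioned above.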

                                      Tier 1
                                     +-----+
          Cluster                    |SUPER|
 +----------------------------+   +--| S11 |--+
 |                            |   |  +-----+  |
 |                    Tier 2  |   |           |   Tier 2
 |                   +-----+  |   |  +-----+  |  +-----+
 |     +-------------|SPINE|------+--|SUPER|--+--|SPINE|-------------+
 |     |       +-----| S1  |------+  | S12 |  +--| S3  |-----+       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     |       |              |                              |       |
 |     |       |     +-----+  |      +-----+     +-----+     |       |
 |     | +-----------|SPINE|------+  |SUPER|  +--|SPINE|-----------+ |
 |     | |     | +---| S2  |------+--| S21 |--+--| S4  |---+ |     | |
 |     | |     | |   +-----+  |   |  +-----+  |  +-----+   | |     | |
 |     | |     | |            |   |           |            | |     | |
 |   +-----+ +-----+          |   |  +-----+  |          +-----+ +-----+
 |   | LEAF| | LEAF|          |   +--|SUPER|--+          | LEAF| | LEAF|
 |   | L1  | | L2  | Tier 3   |      | S22 |      Tier 3 | L3  | | L4  |
 |   +-----+ +-----+          |      +-----+             +-----+ +-----+
 |     | |     | |            |                            | |     | |
 |     O O     O O            |                            X Y     O O
 |       Servers              |                              Servers
 +----------------------------+

                    Figure 1: 5-Stage Clos Topology

   When BIER is deployed in a multi-tenant data center network
   environment for efficient delivery of Broadcast, Unknown-unicast and
   Multicast (BUM) traffic, a network operator may want a deterministic
   path for every packet. For example, when L1 needs to send a BUM
   packet to L3 and L4, which are in different Set Identifiers (SIs),
   L1 has to send the packet twice and expects the two copies to follow
   the deterministic paths L1->S1->S11-->L3 and L1->S2->S21-->L4,
   respectively. Another use of a deterministic path in a DC is
   per-flow steering of the "elephant" flows described in
   [I-D.ietf-spring-segment-routing-msdc].

   A deterministic multicast path across multiple stages of equal-cost
   paths is comparable to the traffic-engineering path that
   [I-D.ietf-mpls-spring-entropy-label] defines for a unicast path
   across multiple hops of equal-cost paths.

3.2.  Considerations

   The idea behind entropy is that the ingress router computes a hash
   based on several fields from a given packet and places the result in
   an additional label, named the "entropy label". A transit router can
   then use this entropy label as part of its hash keys. When the
   entropy label is used, the keys used in the hashing functions are
   still a local configuration matter. A router may solely use the
   entropy label or use a combination of multiple fields from the
   incoming packet. The purpose of the hashing function is to
   load-balance a large number of flows randomly across a small number
   of equal-cost paths.

   To obtain a deterministic path among the equal-cost paths instead,
   one can use part of the 20-bit entropy field directly. For example,
   bit 0 to bit 2 of the entropy label represent a value from 0 to 7
   and can thus be used to select a deterministic path from 8
   equal-cost paths. Routers in different tiers can each use a
   different part of the 20-bit entropy label to select a path
   independently, and together these selections form an end-to-end
   deterministic path.

   This approach is simple and especially applicable to DC CLOS
   networks, because data delivery for tenants in DC CLOS networks is
   always multi-staged, with the stages in the upstream direction
   having equal-cost paths.

4.  Use of BIER Entropy for DC CLOS Network

4.1.  Use of BIER Entropy for DC CLOS Network

   Take the 5-stage CLOS network in Figure 1 as an example.

   Tier 2 in every cluster has N nodes, and Tier 1 has M nodes. M is
   equal to N multiplied by P.

   Tier 3 switches, in the upstream direction, act as stage 1 of data
   delivery and have N equal-cost paths to every BFER in other
   clusters. Tier 2 switches, in the upstream direction, act as stage 2
   of data delivery and have P equal-cost paths to every BFER in other
   clusters.
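
   The per-stage selection described in Section 3.2 can be sketched as
   follows. This is a minimal illustration under stated assumptions,
   not a definitive implementation: the function names select_path and
   pack_entropy are hypothetical, the bit ranges chosen for stage 2 are
   arbitrary, and rejecting a slice value that exceeds the number of
   paths is an assumption, since this document does not specify that
   case.

```python
ENTROPY_BITS = 20  # width of the BIER entropy field


def select_path(entropy, first_bit, num_bits, num_paths):
    """Return the 0-based equal-cost path index held in one entropy slice."""
    if not 0 <= entropy < (1 << ENTROPY_BITS):
        raise ValueError("entropy must fit in 20 bits")
    if first_bit < 0 or num_bits < 1 or first_bit + num_bits > ENTROPY_BITS:
        raise ValueError("slice must lie inside the 20-bit entropy field")
    index = (entropy >> first_bit) & ((1 << num_bits) - 1)
    if index >= num_paths:
        # Assumption: out-of-range slice values are treated as an error.
        raise ValueError("slice value exceeds the number of equal-cost paths")
    return index


def pack_entropy(stage1_index, stage2_index, stage1_bits):
    """Inverse operation: stage 1 in the low bits, stage 2 directly above."""
    if not 0 <= stage1_index < (1 << stage1_bits):
        raise ValueError("stage 1 index does not fit in its slice")
    return stage1_index | (stage2_index << stage1_bits)


# Stage 1 (Tier 3) reads bits 0-2 to choose among 8 paths, as in
# Section 3.2. Stage 2 (Tier 2) reads bits 3-4 to choose among 4 paths
# (a hypothetical width for this sketch).
entropy = pack_entropy(stage1_index=5, stage2_index=3, stage1_bits=3)
stage1 = select_path(entropy, first_bit=0, num_bits=3, num_paths=8)
stage2 = select_path(entropy, first_bit=3, num_bits=2, num_paths=4)
print(entropy, stage1, stage2)
```

   Because each stage reads a disjoint slice, the ingress can choose
   the stage 1 and stage 2 paths independently and encode both choices
   in a single entropy value.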

   Example 1: One can configure each Tier 3 switch to use bit 0 for
   path selection when N is equal to 2, and configure each Tier 2
   switch to use bit 1 for path selection when P is equal to 2.

   Example 2: One can configure each Tier 3 switch to use bit 0 to
   bit 1 for path selection when N is equal to 4, and configure each
   Tier 2 switch to use bit 2 to bit 7 for path selection when P is
   equal to 48.

   Assume that each Tier 3 and Tier 2 switch in the example has two
   parameters, X and Y, that define which part of the entropy label it
   uses for path selection. In the values below, X is the weight of the
   lowest bit of the slice and Y is the number of values the slice can
   encode. Then, in Example 2:

   o  Each Tier 3 (Stage 1) switch has the pair of parameters (X1=1,
      Y1=4).

   o  Each Tier 2 (Stage 2) switch has the pair of parameters
      (X2=X1*Y1=4, Y2=64).

   o  Each Tier 3 (Stage 1) switch populates its BIFTs for ECMP, for
      example, BIFT-0 to BIFT-3.

   o  Each Tier 2 (Stage 2) switch populates its BIFTs for ECMP, for
      example, BIFT-0 to BIFT-47.

   On each Tier 3 (Stage 1) switch, each BIFT has a preferred
   neighboring BFR. For example, LEAF L1 has S1 and S2 as the preferred
   neighbors for BIFT-0 and BIFT-1, respectively. When BIFT-0 is built
   from the underlay routes to every BFER, its preferred neighboring
   BFR has the highest priority among all the locally available ECMP
   paths.

   An end-to-end deterministic path for a BIER packet can then be
   obtained by calculating the entropy label value as follows:

   o  Entropy = (P1-1)*X1 + (P2-1)*X2

   where P1 identifies one of the Stage 1 equal-cost paths, with a
   value between 1 and N, and P2 identifies one of the Stage 2
   equal-cost paths, with a value between 1 and P.

4.2.  Steering for elephant flows

   One can steer an "elephant" flow onto a single end-to-end
   deterministic path, or onto several end-to-end deterministic paths
   divided across different SIs.

4.3.  Path Division for Tenant flows to different SIs

   When the VNEs for a tenant span multiple SIs, it is useful to divide
   the paths of the BUM packets across the different SIs.

   When BIER is used as the BUM tunnel, one can configure a policy, on
   each VNE and for each VNI, that uses different paths for different
   BIER SIs.

4.4.  Link Failure and Convergence

   As stated above, each BIFT on a BFR has a preferred neighboring BFR.
   When the link to the preferred neighbor of some BIFT (say BIFT-X)
   fails, BIFT-X converges normally, but it may then no longer provide
   the 'best' path. For example, if the link between S1 and L2 fails,
   then S1, the preferred neighbor of LEAF L1's BIFT-0, is no longer a
   neighboring BFR of LEAF L2. A flow whose entropy selects LEAF L1's
   BIFT-0 then has to be replicated on L1: one packet to S1 for BFERs
   L3 and L4, and one packet to S2 for BFER L2. If the flow changes to
   an entropy that selects LEAF L1's BIFT-1, it is on the 'best' path
   again, because it does not have to be replicated on L1: a single
   packet goes to S2 for BFERs L2, L3, and L4. Such a change to a
   flow's entropy is the ingress switch's responsibility, possibly with
   the assistance of a controller.

5.  Data-Plane Processing

   The use of the BIER entropy label to select a path among equal-cost
   paths is a local configuration matter. This draft defines a method
   in which each router uses part of the 20-bit entropy label, which
   requires the data plane to perform a bit operation. This is expected
   to be simpler than a hashing function.

6.  Security Considerations

   This document introduces no new security considerations beyond those
   already specified in [RFC8279] and [RFC8296].

7.  IANA Considerations

   This document contains no actions for IANA.

8.  Acknowledgements

   TBD.

9.  References

9.1.  Normative References

   [I-D.ietf-mpls-spring-entropy-label]
              Kini, S., Kompella, K., Sivabalan, S., Litkowski, S.,
              Shakir, R., and J. Tantsura, "Entropy label for SPRING
              tunnels", draft-ietf-mpls-spring-entropy-label-11 (work in
              progress), May 2018.

   [I-D.ietf-spring-segment-routing-msdc]
              Filsfils, C., Previdi, S., Dawra, G., Aries, E., and P.
              Lapukhov, "BGP-Prefix Segment in large-scale data
              centers", draft-ietf-spring-segment-routing-msdc-09 (work
              in progress), May 2018.

   [RFC7938]  Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of
              BGP for Routing in Large-Scale Data Centers", RFC 7938,
              DOI 10.17487/RFC7938, August 2016.

   [RFC8279]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Przygienda, T., and S. Aldrin, "Multicast Using Bit Index
              Explicit Replication (BIER)", RFC 8279,
              DOI 10.17487/RFC8279, November 2017.

   [RFC8296]  Wijnands, IJ., Ed., Rosen, E., Ed., Dolganow, A.,
              Tantsura, J., Aldrin, S., and I. Meilik, "Encapsulation
              for Bit Index Explicit Replication (BIER) in MPLS and Non-
              MPLS Networks", RFC 8296, DOI 10.17487/RFC8296, January
              2018.

   [RFC8365]  Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R.,
              Uttaro, J., and W. Henderickx, "A Network Virtualization
              Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365,
              DOI 10.17487/RFC8365, March 2018.

9.2.  Informative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

Authors' Addresses

   Jingrong Xie
   Huawei Technologies

   Email: xiejingrong@huawei.com


   Xiaohu Xu
   Alibaba Inc.

   Email: xiaohu.xxh@alibaba-inc.com


   Gang Yan
   Huawei Technologies

   Email: yangang@huawei.com


   Mike McBride
   Huawei Technologies

   Email: mmcbride7@gmail.com