idnits 2.17.1 draft-mohanty-bess-weighted-hrw-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document date (September 11, 2019) is 1687 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'I-D.ietf-idr-extcomm-iana' is defined on line 365, but no explicit reference was found in the text == Unused Reference: 'RFC4271' is defined on line 380, but no explicit reference was found in the text == Unused Reference: 'CLRS2009' is defined on line 405, but no explicit reference was found in the text == Unused Reference: 'I-D.mankamana-pim-bdr' is defined on line 416, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'HRW1999' == Outdated reference: A later version (-21) exists of draft-ietf-bess-evpn-unequal-lb-00 -- Possible downref: Non-RFC (?) normative reference: ref. 'WHRW' == Outdated reference: A later version (-05) exists of draft-mankamana-pim-bdr-00 == Outdated reference: A later version (-03) exists of draft-mohanty-bess-ebgp-dmz-00 Summary: 1 error (**), 0 flaws (~~), 9 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 BESS Working Group S. Mohanty 3 Internet-Draft M. Misra 4 Intended status: Standards Track A. Lindem 5 Expires: March 14, 2020 A. Sajassi 6 Cisco Systems, Inc. 7 J. Drake 8 Juniper Networks, Inc. 9 September 11, 2019 11 Weighted HRW and its applications 12 draft-mohanty-bess-weighted-hrw-01 14 Abstract 16 Rendezvous Hashing also known as Highest Random Weight (HRW) has been 17 used in many load balancing applications where the central problem is 18 how to map an object to as server such that the mapping is uniform 19 and also minimally affected by the change in the server set. 20 Recently, it has found use in DF election algorithms in the EVPN 21 context and load balancing using DMZ. This draft deals with the 22 problem of achieving load balancing with minimal disruption when the 23 servers have different weights. It provides an algorithm to do so 24 and also describes a few use-case scenarios where this algorithmic 25 technique can apply. 27 Status of This Memo 29 This Internet-Draft is submitted in full conformance with the 30 provisions of BCP 78 and BCP 79. 32 Internet-Drafts are working documents of the Internet Engineering 33 Task Force (IETF). Note that other groups may also distribute 34 working documents as Internet-Drafts. The list of current Internet- 35 Drafts is at https://datatracker.ietf.org/drafts/current/. 37 Internet-Drafts are draft documents valid for a maximum of six months 38 and may be updated, replaced, or obsoleted by other documents at any 39 time. It is inappropriate to use Internet-Drafts as reference 40 material or to cite them other than as "work in progress." 42 This Internet-Draft will expire on March 14, 2020. 44 Copyright Notice 46 Copyright (c) 2019 IETF Trust and the persons identified as the 47 document authors. All rights reserved. 49 This document is subject to BCP 78 and the IETF Trust's Legal 50 Provisions Relating to IETF Documents 51 (https://trustee.ietf.org/license-info) in effect on the date of 52 publication of this document. Please review these documents 53 carefully, as they describe your rights and restrictions with respect 54 to this document. Code Components extracted from this document must 55 include Simplified BSD License text as described in Section 4.e of 56 the Trust Legal Provisions and are provided without warranty as 57 described in the Simplified BSD License. 59 Table of Contents 61 1. Requirements Language . . . . . . . . . . . . . . . . . . . . 2 62 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 63 3. HRW Introduction . . . . . . . . . . . . . . . . . . . . . . 3 64 4. HRW with weights . . . . . . . . . . . . . . . . . . . . . . 4 65 5. HRW and Consistent Hashing . . . . . . . . . . . . . . . . . 5 66 6. Weighted HRW and its application to the EVPN DF Election . . 5 67 7. Weighted HRW and its application to Resilient Hashing . . . . 7 68 8. Weighted HRW and its application to Multicast DR Election . . 7 69 9. Protocol Considerations . . . . . . . . . . . . . . . . . . . 8 70 10. Operational Considerations . . . . . . . . . . . . . . . . . 8 71 11. Security Considerations . . . . . . . . . . . . . . . . . . . 8 72 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 8 73 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 8 74 13.1. Normative References . . . . . . . . . . . . . . . . . . 8 75 13.2. Informative References . . . . . . . . . . . . . . . . . 9 76 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10 78 1. Requirements Language 80 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 81 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 82 document are to be interpreted as described in [RFC2119]. 84 2. Introduction 86 Given an object O, a set of servers and a set of clients, a 87 fundamental problem is how do the set of clients, independently and 88 unanimously agree in a distributed framework, which server to assign 89 O? This is the distributed hash table problem. The assignment 90 should be "minimally disruptive" which means that there should be a 91 minimal remapping of objects whenever a server is down or a new 92 server comes up or the object set changes. This is a very common 93 problem in practice in the Internet load balancing and web caching as 94 described in the 'Akamai' paper [CHASH], database [DYNAMODB] and 95 networking context. 97 +----+ +----+ +----+ +-----+ 98 | | | | | | | | 99 | S0 | | S1 | | S2 | | Sn | 100 | | | | | | | | 101 +----+ +----+ +----+ +-----+ 102 | | | | 103 | | | | 104 | | | | 105 | | | | 106 | | | | 107 +----------+----------+-------------| 109 O0, O1, O2 ... ON 110 Set of Objects need to be assigned to the set of servers. 111 All the servers are of same capacities 113 Figure 1 The object to server assignment problem 115 Figure 1 117 In the Fig 1, we show a set of servers, S0,..,Sn and object pool 118 O0,..,On and the requirement is to assign Oi to Sj such that the 119 servers are uniformly loaded. In addition, when any server goes down 120 or a new one is introduced, there should be minimal reassignments. 122 There are two standard techniques to address this problem. 124 1. Consistent Hashing 126 2. Rendezvous Hashing 128 3. HRW Introduction 130 Highest Random Weight (HRW) as defined in [HRW1999]is originally 131 proposed in the context of Internet Caching and proxy Server load 132 balancing. Given an object name and a set of servers, HRW maps a 133 request to a server using the object-id (Oi) and server-id(Sj) rather 134 than the state of the server states. HRW computes a hash, Hash(Oi, 135 Sj) from the server-id and the object-id; this hash value can be 136 considered as a score, and forms an ordered list of the servers based 137 on the hash value (i.e. score) in decreasing order. The server for 138 which the score is the highest, serves as the primary responsible for 139 that particular object, and the server with the next highest score 140 serves as the backup server. HRW always maps a given object object 141 name to the same server within a given cluster; consequently it can 142 be used at client sites to achieve global consensus on object-server 143 mappings. When that server goes down, the backup server becomes the 144 responsible designate. 146 Choosing an appropriate hash function that is statistically oblivious 147 to the key distribution and imparts a good uniform distribution of 148 the hash output is an important aspect of the algorithm. The 149 original HRW [HRW1999] provides pseudorandom functions based on Unix 150 utilities rand and srand and easily constructed XOR functions that 151 perform considerably well. Any good uniform hash function like the 152 Jenkins hash for instance will also work. HRW already finds use in 153 multicast and ECMP [RFC2991],[RFC2992]. 155 4. HRW with weights 157 The issue when the servers are not of the same capacity is also quite 158 a common problem. However this problem has not gained as much 159 attention as it should. In such a case, an obvious approach is to 160 take the normalized weight factor into account, fi=wi/Sum(wi)and 161 multiply the Hash(Oi, Sj) with that value i.e. the value fi*Hash(Oi, 162 Sj). The Cache Array Routing Protocol [CARP] used this method. 163 However there is a problem with this approach, since any change in 164 weight of any of the servers, will result in a change in the 165 normalized weights for everyone. This will necessitate re-computing 166 all the weighted hash values all over again. Therefore this approach 167 does not have the minimal disruption property of the HRW. We address 168 this issue of the weighted HRW with minimal disruption in this draft. 170 Instead of re-normalizing the weights, or, in other words relatively 171 scaling them, the approach taken by [WHRW] is to adjust the score 172 before weighing them. When a server is added, removed or modified 173 (its weight changes), only the score for that server changes. That 174 server may win or lose some objects. Other servers remain affected. 175 There is no needless transfer of objects between servers whose weight 176 did not change. [WHRW] uses a clever way to accomplish this by 177 defining the score function as: 179 1. Score(Oi, Sj) = -wi/log(Hash(Oi, Sj)/Hmax); where Hmax is the 180 maximum hash value. 182 The author provides a mathematical proof as to why this choice of the 183 Score function works with very mild assumptions on the probability 184 distribution of the hash function. 186 +----+ +----+ +----+ +----+ 187 | | | | | | | | 188 | S0 | | S1 | | S2 |-------| Sn | 189 | w0 | | w1 | | w2 | | wn | 190 +----+ +----+ +----+ +----+ 191 | | | | 192 | | | | 193 | | | | 194 | | | | 195 | | | | 196 +----------+----------+-------------| 198 O1, O2 ... ON 199 Set of Objects need to be assigned to the set of servers. 200 Each server is now associated with a weight 202 Figure 1 The object to server assignment problem 204 Figure 2 206 5. HRW and Consistent Hashing 208 HRW is not the only algorithm that addresses the object to server 209 mapping problem with goals of fair load distribution, redundancy and 210 fast access. There is another family of algorithms that also 211 addresses this problem; these fall under the umbrella of the 212 Consistent Hashing Algorithms [CHASH]. These will not be considered 213 here. 215 6. Weighted HRW and its application to the EVPN DF Election 217 The notion and need for the Designated Forwarder is described in 218 [RFC7432]. Consider a CE that is a host or a router that is multi- 219 homed directly to more than one PE in an EVPN instance on a given 220 Ethernet segment. One or more Ethernet Tags may be configured on the 221 Ethernet segment. In this scenario only one of the PEs, referred to 222 as the Designated Forwarder (DF), is responsible for certain actions: 224 a. Sending multicast and broadcast traffic, on a given Ethernet Tag 225 on a particular Ethernet segment, to the CE. 227 b. Flooding unknown unicast traffic (i.e. traffic for which an PE 228 does not know the destination MAC address), on a given Ethernet 229 Tag on a particular Ethernet segment to the CE, if the 230 environment requires flooding of unknown unicast traffic. 232 +---------------+ 233 | IP/MPLS | 234 | CORE | 235 +----+ ES1 +----+ +----+ 236 | CE1|-----| |-----------| |____ES2 237 +----+ | PE1| | PE2| \ 238 | |-------- +----+ \+----+ 239 +----+ | | | CE2| 240 | | +----+ /+----+ 241 | |__| |____/ | 242 | | PE3| ES2 / 243 | +----+ / 244 | | / 245 +-------------+----+ / 246 | PE4|____/ES2 247 | | 248 +----+ 250 Figure 3 Multi-homing Network of EVPN 252 Figure 3 254 Figure 3 illustrates a case where there are two Ethernet Segments, 255 ES1 and ES2. PE1 is attached to CE1 via Ethernet Segment ES1 whereas 256 PE2, PE3 and PE4 are attached to CE2 via ES2 i.e. PE2, PE3 and PE4 257 form a redundancy group. Since CE2 is multi-homed to different PEs 258 on the same Ethernet Segment, it is necessary for PE2, PE3 and PE4 to 259 agree on a DF to satisfy the above mentioned requirements. 261 The use of HRW in the EVPN DF Election is described in 262 [I-D.ietf-bess-evpn-df-election-framework]. In that draft it is 263 explained how the HRW DF Election performs better than the modulo DF 264 Election algorithm in [RFC7432]. However, it is implicitly assumed 265 there that all the PEs are of the same capacity (weights equal). 267 DMZ link bandwidth for load balancing flows across multiple EBGP 268 egress points is described in [I-D.ietf-idr-link-bandwidth]. It has 269 been extended to the case of cumulative DMZ load balancing 270 [I-D.mohanty-bess-ebgp-dmz] in the case of an all EBGP network in the 271 data center. [I-D.ietf-bess-evpn-unequal-lb] describes the use of 272 the DMZ in the EVPN DF Election. The argument is made that ideally 273 one should be able to change the link bandwidth in one or more of the 274 multi-homed PEs rather than have to change in all of the multi-homed 275 PEs simultaneously. The draft describes the bandwidth increments to 276 be taken into consideration and proposes an iterative way to assign 277 the score function. The description in Section 4.3.2 of 278 [I-D.ietf-bess-evpn-unequal-lb] is an non-optimal solution and 279 somewhat empirical. It does not obey the minimal disruption property 280 of the HRW. 282 In contrast to the procedures for weighted HRW in 4.3.2 of 283 [I-D.ietf-bess-evpn-unequal-lb], we can achieve an optimal solution 284 for weighted HRW in [I-D.ietf-bess-evpn-unequal-lb] using the score 285 function as described in Section 4 above and obviating the need to 286 take bandwidth increments. It is an order of magnitude faster and 287 efficient and minimally disruptive. 289 7. Weighted HRW and its application to Resilient Hashing 291 With the exponential increase in the number of physical links used in 292 data centers, there is also the potential for an increase in the 293 number of failed physical links. In systems that employ static 294 hashing for load balancing flows across members of port channels or 295 Equal Cost Multipath (ECMP) groups, each flow is hashed to a link. 296 When a link fails, all flows including those that were previously 297 mapped to the non-failed links are rehashed across the remaining 298 working links. This causes packet reordering of flows that were in 299 fact not mapped to the link that failed. A similar rehashing with 300 packet re-ordering also happens when a link is added to the port 301 channel or Equal Cost Multipath (ECMP) group. With the ever 302 increasing number of physical links used in the data centers there 303 the possibility for increasing number of failed links only increases. 304 Hence the resilient hashing is very important. 306 However when the links are not of the same speed, Resilient hashing 307 for ECMP does not apply per-se. However, one can use the method 308 explained in Section 4 to achieve resilient hashing even in the 309 Unequal Cost Multipath (UCMP)case or when member links are of 310 different bandwidths. 312 8. Weighted HRW and its application to Multicast DR Election 314 [I-D.mankamana-pim-bdr]propose a mechanism to elect backup DR on a 315 shared LAN. A backup DR on LAN would be useful for faster 316 convergence. When the access bandwidth is different for the PIM 317 routers and we want to do a load balancing among the PIM routers for 318 DR/backup DR functionality with regards to the various (S,G) flow, 319 technique similar to Section 4 can be applied. The details of the 320 problem is out of the scope of the current draft and is being worked 321 on separately at this time. 323 9. Protocol Considerations 325 A request needs to registered with IANA registry for the weighted HRW 326 EVPN DF Election Algorithm in the DF Alg field in the DF Election 327 Extended Community in draft 328 [I-D.ietf-bess-evpn-df-election-framework]. 330 10. Operational Considerations 332 TBD. 334 11. Security Considerations 336 This document raises no new security issues for EVPN. 338 12. Acknowledgements 340 The authors would like to thank Shyam Sethuram and Peter Psenak for 341 useful discussions related to this draft. 343 13. References 345 13.1. Normative References 347 [HRW1999] Thaler, D. and C. Ravishankar, "Using Name-Based Mappings 348 to Increase Hit Rates", IEEE/ACM Transactions in 349 networking Volume 6 Issue 1, February 1998. 351 [I-D.ietf-bess-evpn-df-election-framework] 352 Rabadan, J., satyamoh@cisco.com, s., Sajassi, A., Drake, 353 J., Nagaraj, K., and S. Sathappan, "Framework for EVPN 354 Designated Forwarder Election Extensibility", draft-ietf- 355 bess-evpn-df-election-framework-09 (work in progress), 356 January 2019. 358 [I-D.ietf-bess-evpn-unequal-lb] 359 Malhotra, N., Sajassi, A., Rabadan, J., Drake, J., 360 Lingala, A., and S. Thoria, "Weighted Multi-Path 361 Procedures for EVPN All-Active Multi-Homing", draft-ietf- 362 bess-evpn-unequal-lb-00 (work in progress), September 363 2018. 365 [I-D.ietf-idr-extcomm-iana] 366 Rosen, E. and Y. Rekhter, "IANA Registries for BGP 367 Extended Communities", draft-ietf-idr-extcomm-iana-02 368 (work in progress), December 2013. 370 [I-D.ietf-idr-link-bandwidth] 371 Mohapatra, P. and R. Fernando, "BGP Link Bandwidth 372 Extended Community", draft-ietf-idr-link-bandwidth-07 373 (work in progress), March 2018. 375 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 376 Requirement Levels", BCP 14, RFC 2119, 377 DOI 10.17487/RFC2119, March 1997, 378 . 380 [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A 381 Border Gateway Protocol 4 (BGP-4)", RFC 4271, 382 DOI 10.17487/RFC4271, January 2006, 383 . 385 [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., 386 Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based 387 Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 388 2015, . 390 [WHRW] Resch, J., "New hashing Algorithms for Data Storage", 391 Storage Developer Conference 18, November 2015. 393 13.2. Informative References 395 [CARP] Valloppillil, V. and K. Ross, "Cache Array Routing 396 Protocol v1.1", IEEE/ACM Transactions in networking Volume 397 6 Issue 1, February 1998. 399 [CHASH] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., 400 Levine, M., and D. Lewin, "Consistent Hashing and Random 401 Trees: Distributed Caching Protocols for Relieving Hot 402 Spots on the World Wide Web", ACM Symposium on Theory of 403 Computing ACM Press New York, May 1997. 405 [CLRS2009] 406 Cormen, T., Leiserson, C., Rivest, R., and C. Stein, 407 "Introduction to Algorithms (3rd ed.)", MIT Press and 408 McGraw-Hill ISBN 0-262-03384-4., February 2009. 410 [DYNAMODB] 411 Decennia, G., Hastorun, D., Jampani, M., Kakulapati, G., 412 Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, 413 P., and W. Vogels, "Dynamo: Amazon's Highly Available Key- 414 value Store", SOSP 07, October 2007. 416 [I-D.mankamana-pim-bdr] 417 mishra, m., "PIM Backup Designated Router Procedure", 418 draft-mankamana-pim-bdr-00 (work in progress), June 2018. 420 [I-D.mohanty-bess-ebgp-dmz] 421 satyamoh@cisco.com, s., Millisor, A., and A. Vayner, 422 "Cumulative DMZ Link Bandwidth and load-balancing", draft- 423 mohanty-bess-ebgp-dmz-00 (work in progress), March 2018. 425 [RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and 426 Multicast Next-Hop Selection", RFC 2991, 427 DOI 10.17487/RFC2991, November 2000, 428 . 430 [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path 431 Algorithm", RFC 2992, DOI 10.17487/RFC2992, November 2000, 432 . 434 Authors' Addresses 436 Satya Ranjan Mohanty 437 Cisco Systems, Inc. 438 225 West Tasman Drive 439 San Jose, CA 95134 440 USA 442 Email: satyamoh@cisco.com 444 Mankamana Misra 445 Cisco Systems, Inc. 446 170 W. Tasman Drive 447 San Jose, CA 95134 448 USA 450 Email: mankamis@cisco.com 452 Acee Lindem 453 Cisco Systems, Inc. 454 170 West Tasman Drive 455 San Jose, CA 95134 456 USA 458 Email: acee@cisco.com 459 Ali Sajassi 460 Cisco Systems, Inc. 461 170 West Tasman Drive 462 San Jose, CA 95134 463 USA 465 Email: sajassi@cisco.com 467 John Drake 468 Juniper Networks, Inc. 469 1194 N. Mathilda Drive 470 Sunnyvale, CA 94089 471 USA 473 Email: jdrake@juniper.net