idnits 2.17.1 draft-mohanty-bess-weighted-hrw-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document doesn't use any RFC 2119 keywords, yet seems to have RFC 2119 boilerplate text. -- The document date (March 11, 2019) is 1871 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Unused Reference: 'I-D.ietf-idr-extcomm-iana' is defined on line 363, but no explicit reference was found in the text == Unused Reference: 'RFC4271' is defined on line 378, but no explicit reference was found in the text == Unused Reference: 'CLRS2009' is defined on line 403, but no explicit reference was found in the text == Unused Reference: 'I-D.mankamana-pim-bdr' is defined on line 414, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'HRW1999' == Outdated reference: A later version (-21) exists of draft-ietf-bess-evpn-unequal-lb-00 -- Possible downref: Non-RFC (?) normative reference: ref. 'WHRW' == Outdated reference: A later version (-05) exists of draft-mankamana-pim-bdr-00 == Outdated reference: A later version (-03) exists of draft-mohanty-bess-ebgp-dmz-00 Summary: 1 error (**), 0 flaws (~~), 9 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 BESS Working Group S. Mohanty 3 Internet-Draft M. Misra 4 Intended status: Standards Track A. Lindem 5 Expires: September 12, 2019 A. Sajassi 6 Cisco Systems, Inc. 7 March 11, 2019 9 Weighted HRW and its applications 10 draft-mohanty-bess-weighted-hrw-00 12 Abstract 14 Rendezvous Hashing also known as Highest Random Weight (HRW) has been 15 used in many load balancing applications where the central problem is 16 how to map an object to as server such that the mapping is uniform 17 and also minimally affected by the change in the server set. 18 Recently, it has found use in DF election algorithms in the EVPN 19 context and load balancing using DMZ. This draft deals with the 20 problem of achieving load balancing with minimal disruption when the 21 servers have different weights. It provides an algorithm to do so 22 and also describes a few use-case scenarios where this algorithmic 23 technique can apply. 25 Status of This Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at https://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on September 12, 2019. 42 Copyright Notice 44 Copyright (c) 2019 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (https://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with respect 52 to this document. Code Components extracted from this document must 53 include Simplified BSD License text as described in Section 4.e of 54 the Trust Legal Provisions and are provided without warranty as 55 described in the Simplified BSD License. 57 Table of Contents 59 1. Requirements Language . . . . . . . . . . . . . . . . . . . . 2 60 2. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 61 3. HRW Introduction . . . . . . . . . . . . . . . . . . . . . . 3 62 4. HRW with weights . . . . . . . . . . . . . . . . . . . . . . 4 63 5. HRW and Consistent Hashing . . . . . . . . . . . . . . . . . 5 64 6. Weighted HRW and its application to the EVPN DF Election . . 5 65 7. Weighted HRW and its application to Resilient Hashing . . . . 7 66 8. Weighted HRW and its application to Multicast DR Election . . 7 67 9. Protocol Considerations . . . . . . . . . . . . . . . . . . . 8 68 10. Operational Considerations . . . . . . . . . . . . . . . . . 8 69 11. Security Considerations . . . . . . . . . . . . . . . . . . . 8 70 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 8 71 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 8 72 13.1. Normative References . . . . . . . . . . . . . . . . . . 8 73 13.2. Informative References . . . . . . . . . . . . . . . . . 9 74 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 10 76 1. Requirements Language 78 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 79 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 80 document are to be interpreted as described in [RFC2119]. 82 2. Introduction 84 Given an object O, a set of servers and a set of clients, a 85 fundamental problem is how do the set of clients, independently and 86 unanimously agree in a distributed framework, which server to assign 87 O? This is the distributed hash table problem. The assignment 88 should be "minimally disruptive" which means that there should be a 89 minimal remapping of objects whenever a server is down or a new 90 server comes up or the object set changes. This is a very common 91 problem in practice in the Internet load balancing and web caching as 92 described in the 'Akamai' paper [CHASH], database [DYNAMODB] and 93 networking context. 95 +----+ +----+ +----+ +-----+ 96 | | | | | | | | 97 | S0 | | S1 | | S2 | | Sn | 98 | | | | | | | | 99 +----+ +----+ +----+ +-----+ 100 | | | | 101 | | | | 102 | | | | 103 | | | | 104 | | | | 105 +----------+----------+-------------| 107 O0, O1, O2 ... ON 108 Set of Objects need to be assigned to the set of servers. 109 All the servers are of same capacities 111 Figure 1 The object to server assignment problem 113 Figure 1 115 In the Fig 1, we show a set of servers, S0,..,Sn and object pool 116 O0,..,On and the requirement is to assign Oi to Sj such that the 117 servers are uniformly loaded. In addition, when any server goes down 118 or a new one is introduced, there should be minimal reassignments. 120 There are two standard techniques to address this problem. 122 1. Consistent Hashing 124 2. Rendezvous Hashing 126 3. HRW Introduction 128 Highest Random Weight (HRW) as defined in [HRW1999]is originally 129 proposed in the context of Internet Caching and proxy Server load 130 balancing. Given an object name and a set of servers, HRW maps a 131 request to a server using the object-id (Oi) and server-id(Sj) rather 132 than the state of the server states. HRW computes a hash, Hash(Oi, 133 Sj) from the server-id and the object-id; this hash value can be 134 considered as a score, and forms an ordered list of the servers based 135 on the hash value (i.e. score) in decreasing order. The server for 136 which the score is the highest, serves as the primary responsible for 137 that particular object, and the server with the next highest score 138 serves as the backup server. HRW always maps a given object object 139 name to the same server within a given cluster; consequently it can 140 be used at client sites to achieve global consensus on object-server 141 mappings. When that server goes down, the backup server becomes the 142 responsible designate. 144 Choosing an appropriate hash function that is statistically oblivious 145 to the key distribution and imparts a good uniform distribution of 146 the hash output is an important aspect of the algorithm. The 147 original HRW [HRW1999] provides pseudorandom functions based on Unix 148 utilities rand and srand and easily constructed XOR functions that 149 perform considerably well. Any good uniform hash function like the 150 Jenkins hash for instance will also work. HRW already finds use in 151 multicast and ECMP [RFC2991],[RFC2992]. 153 4. HRW with weights 155 The issue when the servers are not of the same capacity is also quite 156 a common problem. However this problem has not gained as much 157 attention as it should. In such a case, an obvious approach is to 158 take the normalized weight factor into account, fi=wi/Sum(wi)and 159 multiply the Hash(Oi, Sj) with that value i.e. the value fi*Hash(Oi, 160 Sj). The Cache Array Routing Protocol [CARP] used this method. 161 However there is a problem with this approach, since any change in 162 weight of any of the servers, will result in a change in the 163 normalized weights for everyone. This will necessitate re-computing 164 all the weighted hash values all over again. Therefore this approach 165 does not have the minimal disruption property of the HRW. We address 166 this issue of the weighted HRW with minimal disruption in this draft. 168 Instead of re-normalizing the weights, or, in other words relatively 169 scaling them, the approach taken by [WHRW] is to adjust the score 170 before weighing them. When a server is added, removed or modified 171 (its weight changes), only the score for that server changes. That 172 server may win or lose some objects. Other servers remain affected. 173 There is no needless transfer of objects between servers whose weight 174 did not change. [WHRW] uses a clever way to accomplish this by 175 defining the score function as: 177 1. Score(Oi, Sj) = -wi/log(Hash(Oi, Sj)/Hmax); where Hmax is the 178 maximum hash value. 180 The author provides a mathematical proof as to why this choice of the 181 Score function works with very mild assumptions on the probability 182 distribution of the hash function. 184 +----+ +----+ +----+ +----+ 185 | | | | | | | | 186 | S0 | | S1 | | S2 |-------| Sn | 187 | w0 | | w1 | | w2 | | wn | 188 +----+ +----+ +----+ +----+ 189 | | | | 190 | | | | 191 | | | | 192 | | | | 193 | | | | 194 +----------+----------+-------------| 196 O1, O2 ... ON 197 Set of Objects need to be assigned to the set of servers. 198 Each server is now associated with a weight 200 Figure 1 The object to server assignment problem 202 Figure 2 204 5. HRW and Consistent Hashing 206 HRW is not the only algorithm that addresses the object to server 207 mapping problem with goals of fair load distribution, redundancy and 208 fast access. There is another family of algorithms that also 209 addresses this problem; these fall under the umbrella of the 210 Consistent Hashing Algorithms [CHASH]. These will not be considered 211 here. 213 6. Weighted HRW and its application to the EVPN DF Election 215 The notion and need for the Designated Forwarder is described in 216 [RFC7432]. Consider a CE that is a host or a router that is multi- 217 homed directly to more than one PE in an EVPN instance on a given 218 Ethernet segment. One or more Ethernet Tags may be configured on the 219 Ethernet segment. In this scenario only one of the PEs, referred to 220 as the Designated Forwarder (DF), is responsible for certain actions: 222 a. Sending multicast and broadcast traffic, on a given Ethernet Tag 223 on a particular Ethernet segment, to the CE. 225 b. Flooding unknown unicast traffic (i.e. traffic for which an PE 226 does not know the destination MAC address), on a given Ethernet 227 Tag on a particular Ethernet segment to the CE, if the 228 environment requires flooding of unknown unicast traffic. 230 +---------------+ 231 | IP/MPLS | 232 | CORE | 233 +----+ ES1 +----+ +----+ 234 | CE1|-----| |-----------| |____ES2 235 +----+ | PE1| | PE2| \ 236 | |-------- +----+ \+----+ 237 +----+ | | | CE2| 238 | | +----+ /+----+ 239 | |__| |____/ | 240 | | PE3| ES2 / 241 | +----+ / 242 | | / 243 +-------------+----+ / 244 | PE4|____/ES2 245 | | 246 +----+ 248 Figure 3 Multi-homing Network of EVPN 250 Figure 3 252 Figure 3 illustrates a case where there are two Ethernet Segments, 253 ES1 and ES2. PE1 is attached to CE1 via Ethernet Segment ES1 whereas 254 PE2, PE3 and PE4 are attached to CE2 via ES2 i.e. PE2, PE3 and PE4 255 form a redundancy group. Since CE2 is multi-homed to different PEs 256 on the same Ethernet Segment, it is necessary for PE2, PE3 and PE4 to 257 agree on a DF to satisfy the above mentioned requirements. 259 The use of HRW in the EVPN DF Election is described in 260 [I-D.ietf-bess-evpn-df-election-framework]. In that draft it is 261 explained how the HRW DF Election performs better than the modulo DF 262 Election algorithm in [RFC7432]. However, it is implicitly assumed 263 there that all the PEs are of the same capacity (weights equal). 265 DMZ link bandwidth for load balancing flows across multiple EBGP 266 egress points is described in [I-D.ietf-idr-link-bandwidth]. It has 267 been extended to the case of cumulative DMZ load balancing 268 [I-D.mohanty-bess-ebgp-dmz] in the case of an all EBGP network in the 269 data center. [I-D.ietf-bess-evpn-unequal-lb] describes the use of 270 the DMZ in the EVPN DF Election. The argument is made that ideally 271 one should be able to change the link bandwidth in one or more of the 272 multi-homed PEs rather than have to change in all of the multi-homed 273 PEs simultaneously. The draft describes the bandwidth increments to 274 be taken into consideration and proposes an iterative way to assign 275 the score function. The description in Section 4.3.2 of 276 [I-D.ietf-bess-evpn-unequal-lb] is an non-optimal solution and 277 somewhat empirical. It does not obey the minimal disruption property 278 of the HRW. 280 In contrast to the procedures for weighted HRW in 4.3.2 of 281 [I-D.ietf-bess-evpn-unequal-lb], we can achieve an optimal solution 282 for weighted HRW in [I-D.ietf-bess-evpn-unequal-lb] using the score 283 function as described in Section 4 above and obviating the need to 284 take bandwidth increments. It is an order of magnitude faster and 285 efficient and minimally disruptive. 287 7. Weighted HRW and its application to Resilient Hashing 289 With the exponential increase in the number of physical links used in 290 data centers, there is also the potential for an increase in the 291 number of failed physical links. In systems that employ static 292 hashing for load balancing flows across members of port channels or 293 Equal Cost Multipath (ECMP) groups, each flow is hashed to a link. 294 When a link fails, all flows including those that were previously 295 mapped to the non-failed links are rehashed across the remaining 296 working links. This causes packet reordering of flows that were in 297 fact not mapped to the link that failed. A similar rehashing with 298 packet re-ordering also happens when a link is added to the port 299 channel or Equal Cost Multipath (ECMP) group. With the ever 300 increasing number of physical links used in the data centers there 301 the possibility for increasing number of failed links only increases. 302 Hence the resilient hashing is very important. 304 However when the links are not of the same speed, Resilient hashing 305 for ECMP does not apply per-se. However, one can use the method 306 explained in Section 4 to achieve resilient hashing even in the 307 Unequal Cost Multipath (UCMP)case or when member links are of 308 different bandwidths. 310 8. Weighted HRW and its application to Multicast DR Election 312 [I-D.mankamana-pim-bdr]propose a mechanism to elect backup DR on a 313 shared LAN. A backup DR on LAN would be useful for faster 314 convergence. When the access bandwidth is different for the PIM 315 routers and we want to do a load balancing among the PIM routers for 316 DR/backup DR functionality with regards to the various (S,G) flow, 317 technique similar to Section 4 can be applied. The details of the 318 problem is out of the scope of the current draft and is being worked 319 on separately at this time. 321 9. Protocol Considerations 323 A request needs to registered with IANA registry for the weighted HRW 324 EVPN DF Election Algorithm in the DF Alg field in the DF Election 325 Extended Community in draft 326 [I-D.ietf-bess-evpn-df-election-framework]. 328 10. Operational Considerations 330 TBD. 332 11. Security Considerations 334 This document raises no new security issues for EVPN. 336 12. Acknowledgements 338 The authors would like to thank Shyam Sethuram and Peter Psenak for 339 useful discussions related to this draft. 341 13. References 343 13.1. Normative References 345 [HRW1999] Thaler, D. and C. Ravishankar, "Using Name-Based Mappings 346 to Increase Hit Rates", IEEE/ACM Transactions in 347 networking Volume 6 Issue 1, February 1998. 349 [I-D.ietf-bess-evpn-df-election-framework] 350 Rabadan, J., satyamoh@cisco.com, s., Sajassi, A., Drake, 351 J., Nagaraj, K., and S. Sathappan, "Framework for EVPN 352 Designated Forwarder Election Extensibility", draft-ietf- 353 bess-evpn-df-election-framework-09 (work in progress), 354 January 2019. 356 [I-D.ietf-bess-evpn-unequal-lb] 357 Malhotra, N., Sajassi, A., Rabadan, J., Drake, J., 358 Lingala, A., and S. Thoria, "Weighted Multi-Path 359 Procedures for EVPN All-Active Multi-Homing", draft-ietf- 360 bess-evpn-unequal-lb-00 (work in progress), September 361 2018. 363 [I-D.ietf-idr-extcomm-iana] 364 Rosen, E. and Y. Rekhter, "IANA Registries for BGP 365 Extended Communities", draft-ietf-idr-extcomm-iana-02 366 (work in progress), December 2013. 368 [I-D.ietf-idr-link-bandwidth] 369 Mohapatra, P. and R. Fernando, "BGP Link Bandwidth 370 Extended Community", draft-ietf-idr-link-bandwidth-07 371 (work in progress), March 2018. 373 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 374 Requirement Levels", BCP 14, RFC 2119, 375 DOI 10.17487/RFC2119, March 1997, 376 . 378 [RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A 379 Border Gateway Protocol 4 (BGP-4)", RFC 4271, 380 DOI 10.17487/RFC4271, January 2006, 381 . 383 [RFC7432] Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., 384 Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based 385 Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, February 386 2015, . 388 [WHRW] Resch, J., "New hashing Algorithms for Data Storage", 389 Storage Developer Conference 18, November 2015. 391 13.2. Informative References 393 [CARP] Valloppillil, V. and K. Ross, "Cache Array Routing 394 Protocol v1.1", IEEE/ACM Transactions in networking Volume 395 6 Issue 1, February 1998. 397 [CHASH] Karger, D., Lehman, E., Leighton, T., Panigrahy, R., 398 Levine, M., and D. Lewin, "Consistent Hashing and Random 399 Trees: Distributed Caching Protocols for Relieving Hot 400 Spots on the World Wide Web", ACM Symposium on Theory of 401 Computing ACM Press New York, May 1997. 403 [CLRS2009] 404 Cormen, T., Leiserson, C., Rivest, R., and C. Stein, 405 "Introduction to Algorithms (3rd ed.)", MIT Press and 406 McGraw-Hill ISBN 0-262-03384-4., February 2009. 408 [DYNAMODB] 409 Decennia, G., Hastorun, D., Jampani, M., Kakulapati, G., 410 Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, 411 P., and W. Vogels, "Dynamo: Amazon's Highly Available Key- 412 value Store", SOSP 07, October 2007. 414 [I-D.mankamana-pim-bdr] 415 mishra, m., "PIM Backup Designated Router Procedure", 416 draft-mankamana-pim-bdr-00 (work in progress), June 2018. 418 [I-D.mohanty-bess-ebgp-dmz] 419 satyamoh@cisco.com, s., Millisor, A., and A. Vayner, 420 "Cumulative DMZ Link Bandwidth and load-balancing", draft- 421 mohanty-bess-ebgp-dmz-00 (work in progress), March 2018. 423 [RFC2991] Thaler, D. and C. Hopps, "Multipath Issues in Unicast and 424 Multicast Next-Hop Selection", RFC 2991, 425 DOI 10.17487/RFC2991, November 2000, 426 . 428 [RFC2992] Hopps, C., "Analysis of an Equal-Cost Multi-Path 429 Algorithm", RFC 2992, DOI 10.17487/RFC2992, November 2000, 430 . 432 Authors' Addresses 434 Satya Ranjan Mohanty 435 Cisco Systems, Inc. 436 225 West Tasman Drive 437 San Jose, CA 95134 438 USA 440 Email: satyamoh@cisco.com 442 Mankamana Misra 443 Cisco Systems, Inc. 444 170 W. Tasman Drive 445 San Jose, CA 95134 446 USA 448 Email: mankamis@cisco.com 450 Acee Lindem 451 Cisco Systems, Inc. 452 170 West Tasman Drive 453 San Jose, CA 95134 454 USA 456 Email: acee@cisco.com 457 Ali Sajassi 458 Cisco Systems, Inc. 459 170 West Tasman Drive 460 San Jose, CA 95134 461 USA 463 Email: sajassi@cisco.com