2 P2PSIP Working Group J. Maenpaa 3 Internet-Draft Ericsson 4 Intended status: Standards Track A. Swaminathan 5 Expires: January 7, 2010 S. Das 6 Qualcomm, Inc. 7 G. Camarillo 8 J. Hautakorpi 9 Ericsson 10 July 6, 2009 12 A Topology Plug-in for REsource LOcation And Discovery 13 draft-maenpaa-p2psip-topologyplugin-00 15 Status of this Memo 17 This Internet-Draft is submitted to IETF in full conformance with the 18 provisions of BCP 78 and BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF), its areas, and its working groups. Note that 22 other groups may also distribute working documents as Internet- 23 Drafts. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. It is inappropriate to use Internet-Drafts as reference 28 material or to cite them other than as "work in progress." 30 The list of current Internet-Drafts can be accessed at 31 http://www.ietf.org/ietf/1id-abstracts.txt. 
33 The list of Internet-Draft Shadow Directories can be accessed at 34 http://www.ietf.org/shadow.html. 36 This Internet-Draft will expire on January 7, 2010. 38 Copyright Notice 40 Copyright (c) 2009 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents in effect on the date of 45 publication of this document (http://trustee.ietf.org/license-info). 46 Please review these documents carefully, as they describe your rights 47 and restrictions with respect to this document. 49 Abstract 51 REsource LOcation And Discovery (RELOAD) is a peer-to-peer signaling 52 protocol that can be used to maintain an overlay network, and to 53 store data in and retrieve data from the overlay. This document 54 defines a new topology plug-in for RELOAD that is more appropriate 55 for real world large scale overlays. This topology plug-in 56 implements three important functionalities that allow RELOAD to 57 operate under real world constraints. First, it includes a load 58 balancing algorithm that specifies efficient allocation of load to 59 different nodes in the network. Second, the document describes 60 robust techniques for stabilization of fingers and successors and 61 specifies self tuning mechanisms that allow dynamic and automatic 62 adjustment of parameters needed for these advanced techniques in the 63 topology plug-in. Finally, it specifies a locality aware finger 64 selection algorithm that reduces average lookup latency. 66 Table of Contents 68 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 69 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 6 70 3. Need for Advanced Topology Plug-in . . . . . . . . . . . . . . 8 71 3.1. Need for Load Balancing . . . . . . . . . . . . . . . . . 8 72 3.2. Need for Robust Stabilization . . . . . . . . . . . . . . 9 73 3.3. Need for Locality Awareness . . . . . . . . . . . 
. . . . 9 74 3.4. Need for Self-Tuning of System Parameters . . . . . . . . 10 75 4. Chord . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 76 5. Load Balancing in the Proposed Topology Plug-in for RELOAD . . 11 77 5.1. Basic Load Balancing scheme . . . . . . . . . . . . . . . 11 78 5.2. Recommendations on Virtual Servers . . . . . . . . . . . . 12 79 5.3. Routing . . . . . . . . . . . . . . . . . . . . . . . . . 13 80 5.4. Obtaining Virtual Server Identities . . . . . . . . . . . 14 81 5.4.1. Without Enrollment Server . . . . . . . . . . . . . . 14 82 5.4.2. With Enrollment Server . . . . . . . . . . . . . . . . 15 83 5.5. Extensions to Overlays with Heterogeneous Nodes . . . . . 16 84 6. Stabilizing Fingers, Successors, and Predecessors in the 85 Topology Plug-in . . . . . . . . . . . . . . . . . . . . . . . 16 86 6.1. Choice of Approach to Stabilization . . . . . . . . . . . 16 87 6.2. Update Messages for Stabilization . . . . . . . . . . . . 18 88 6.3. Finger Stabilization . . . . . . . . . . . . . . . . . . . 21 89 6.3.1. Locality-aware Finger Selection . . . . . . . . . . . 22 90 6.4. Successor Stabilization . . . . . . . . . . . . . . . . . 23 91 6.5. Predecessor Stabilization . . . . . . . . . . . . . . . . 24 92 6.6. Joining the Overlay . . . . . . . . . . . . . . . . . . . 24 93 6.6.1. Contents of the Join Message . . . . . . . . . . . . . 25 94 6.7. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 26 95 6.7.1. Contents of the Leave Message . . . . . . . . . . . . 26 96 6.8. Self Tuning System Parameters . . . . . . . . . . . . . . 27 97 6.8.1. Self Tuning Load Balancing Algorithm Parameters . . . 28 98 6.8.2. Self Tuning the Stabilization Interval . . . . . . . . 32 99 7. Security Considerations . . . . . . . . . . . . . . . . . . . 37 100 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 37 101 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 37 102 10. Appendix . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 37 103 10.1. Comparison of the Load Balancing Algorithm with Chord . . 38 104 10.2. Performance of the Load Balancing Algorithm as Network 105 Grows . . . . . . . . . . . . . . . . . . . . . . . . . . 38 106 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40 107 11.1. Normative References . . . . . . . . . . . . . . . . . . . 40 108 11.2. Informative References . . . . . . . . . . . . . . . . . . 40 109 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 42 111 1. Introduction 113 REsource LOcation And Discovery (RELOAD) [1] is a peer-to-peer 114 signaling protocol that can be used to maintain an overlay network, 115 and to store data in and retrieve data from the overlay. For 116 interoperability reasons, RELOAD specifies one overlay algorithm that 117 is mandatory to implement. Additionally, RELOAD supports a variety 118 of other overlay algorithms through the use of topology plug-ins. A 119 topology plug-in implements the topology defined by a specific 120 overlay algorithm. 122 This document defines a new topology plug-in for RELOAD that is more 123 appropriate for real world large scale overlays that have to deal 124 with object storage in a fair manner, provide good lookup performance 125 to a variety of applications, and self-organize to deal with churn. 126 This topology plug-in implements three important functionalities that 127 allow RELOAD to operate under these real world constraints. First, 128 it includes a load balancing algorithm that specifies efficient 129 allocation of load to different nodes in the network. Second, the 130 document describes robust techniques for stabilization of fingers and 131 successors and specifies self tuning mechanisms that allow dynamic 132 and automatic adjustment of parameters needed for these advanced 133 techniques in the topology plug-in; this avoids the need for 134 constants that may only work in specific scenarios. 
Finally, it 135 specifies a locality aware finger selection algorithm that reduces 136 average lookup latency. 138 Load balancing is essential to effectively manage data and provide 139 services on overlays. Load balancing, as an integral part of the 140 overlay, encourages participation by imposing collective fate sharing 141 on the nodes. Without such a scheme, overlay adoption may be 142 significantly affected. However, the mandatory-to-implement RELOAD 143 DHT protocol based on Chord does not support operating the overlay in 144 a load balanced manner. For instance, for a system with N nodes, it 145 can be shown that the imbalance factor, defined as the maximum load 146 on any node on the network divided by the average load, is of the 147 order of O(log2(N)) in the number of items even if all objects are 148 assumed to be homogeneous. In the case of heterogeneous networks, 149 where the capabilities of nodes (storage and bandwidth) can differ by 150 multiple orders of magnitude, the problem of load balancing becomes 151 more important because the imbalance in load distribution could 152 potentially create a bottleneck in the system. Thus, to enable 153 scalable, real world deployments of RELOAD, the topology plug-in in 154 this document specifies a scheme for load balancing. 156 Peer-to-peer overlay networks face a fundamental challenge with 157 churn: joining and leaving of nodes. This can affect the integrity 158 of the routing structure, cause a node to become disconnected from 159 the overlay, and cause overhead in maintaining consistency. Several 160 research studies have been performed on the type of stabilization 161 that can be used in DHT based peer-to-peer overlays such as RELOAD. 162 This document specifies a method for stabilization to deal with churn 163 to allow RELOAD to be scalable and reliable in real world conditions. 
164 We specifically prescribe the use of a periodic stabilization routine 165 to counter the undesirable effects of churn on routing. 167 To enable load balancing and stabilization to deal with churn, some 168 parameters need to be set (described later). The use of specific 169 constants for such parameters leads to deployments that may only work 170 in specific scenarios (where the parameters are evaluated). Thus, 171 this document specifies self tuning mechanisms for system parameters 172 to allow RELOAD to scale naturally as network dynamics dictate. For 173 example, as churn increases, the topology plug-in specified in this 174 document adapts by increasing the frequency of stabilization. 175 Similarly, as network size increases, load balancing parameters and 176 DHT parameters are modified to ensure quick finger and successor 177 updates and to keep the load balancing property over time. To enable 178 self tuning of system parameters, some characteristics such as churn 179 rate and network size are estimated. These characteristics are then 180 used to dynamically adjust the topology plug-in parameters such as 181 the size of the successor set, size of the routing table, and rate of 182 maintenance messages. The benefit of this approach is that it avoids 183 the problem with static parameters. With static parameters, it is 184 not possible to achieve both a low failure rate and a low communication 185 overhead. The topology plug-in specified in this document allows the 186 system to take into account the evolution of network conditions and 187 adapt to them. 189 Even with up-to-date fingers and successors, making progress in the 190 identifier space can be expensive in terms of network latency because 191 small progress in identifier space could result in a significant leap 192 in physical distance. The successor and predecessor lists can be 193 used to optimize network latency by relaxing the requirement for 194 finger selection. 
Specifically, at each overlay hop, as progress is 195 made in the identifier space, small physical distance hops are used 196 so as to avoid high latencies in overall lookup. More details on the 197 proposed locality aware finger selection are described later in this 198 document. 200 In summary, the topology plug-in proposed in this document has the 201 following advantages: 203 o First, building on the RELOAD framework, this document introduces 204 a load balancing algorithm that provides an imbalance factor of 205 the order of O(1). The solution proposed in the document can also 206 be extended to the case of heterogeneous nodes in such a way that 207 the number of objects assigned to a node is proportional to the 208 amount of load it can handle; 210 o Second, this topology plug-in specifies stabilization techniques 211 that deal effectively with churn; 213 o Third, this document proposes self-tuning of parameters required 214 in these algorithms such that users no longer need to tune every 215 DHT parameter correctly for a given operating environment. By 216 periodically computing the network parameters, the system 217 automatically adapts to changing operating conditions; and 219 o Finally, the locality aware finger selection algorithm 220 incorporated with the DHT algorithm further optimizes the 221 selection of successors and predecessors to reduce network 222 latency. 224 This document is organized as follows. We first describe the 225 algorithms for load balancing, stabilization, and locality aware 226 finger selection. Then we describe the major components required to 227 operate the topology plug-in: joining, leaving, routing, dealing with 228 failures etc. Finally, we describe self tuning of the system 229 parameters to make the topology plug-in adjust itself to changing 230 network conditions. 232 2. 
Terminology 234 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 235 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 236 document are to be interpreted as described in RFC 2119 [2]. 238 This document uses the terminology and definitions from the Concepts 239 and Terminology for Peer-to-Peer SIP [1] draft. 241 Chord ring: The Chord DHT orders identifiers on an identifier circle 242 of size 2^m (m is the number of bits in peer identifiers). This 243 identifier circle is called the Chord ring. 245 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 246 distributed systems that provide a lookup service similar to a 247 hash table. Given a key, any participating peer can retrieve the 248 value associated with that key. The responsibility for 249 maintaining the mapping from keys to values is distributed among 250 the peers. 252 Finger table: A data structure with up to m entries maintained by 253 each peer in a Chord-based overlay. The ith entry in the finger 254 table of peer n contains the identity of the first primary virtual 255 server that succeeds n by at least 2^(m-i) on the Chord ring. 256 This peer is called the ith finger of peer n. As an example, the 257 first entry in the finger table of peer n contains a peer half-way 258 around the Chord ring from peer n. The purpose of the finger 259 table is to accelerate lookups. 261 log2(N): Logarithm of N with base 2. 263 N: The number of nodes in the overlay unless otherwise specified. 265 Neighborhood set: Consists of successor and predecessor lists. 267 O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means 268 that f(n) is less than some constant multiple of g(n). 270 Omega(g(n)): Informally, saying that some equation f(n) = 271 Omega(g(n)) means that f(n) is more than some constant multiple of 272 g(n). 274 Predecessor list: A data structure containing the predecessors of a 275 peer. 
277 Successor list: A data structure containing the successors of a 278 peer. 280 Virtual server: Each physical peer instantiates one or more virtual 281 servers with random IDs that act as peers in the DHT. 283 Primary virtual server: The primary ID assigned to the peer when it 284 joins the network. This location is chosen uniformly at random 285 among the available IDs in [0, 2^m-1]. 287 Secondary virtual server: List of all identities owned by the 288 physical peer excluding the primary virtual server. The locations 289 of the secondary virtual servers are selected so that the ith 290 virtual server is distributed in [vp-i*delta*2^m, 291 vp-(i+1)*delta*2^m] where vp is the primary virtual server. These 292 locations may change to adapt to network dynamics. 294 alpha: The number of virtual servers per physical peer. This includes one 295 primary virtual server and (alpha-1) secondary virtual servers. 297 delta: The value of delta defines the spacing between the virtual 298 servers. The locations of the secondary virtual servers are 299 selected so that the ith virtual server is distributed in 300 [vp-i*delta*2^m, vp-(i+1)*delta*2^m] where vp is the location of 301 the primary virtual server. 303 Routing table: The set of peers that a node can use to route overlay 304 messages. The routing table consists of the finger table, 305 successor list and predecessor list, all populated with primary 306 virtual server IDs. 308 3. Need for Advanced Topology Plug-in 310 This section details the need for specifying load balancing 311 mechanisms, robust stabilization and locality awareness for a RELOAD 312 topology plug-in to make it efficient and scalable. 314 3.1. Need for Load Balancing 316 Most DHTs, such as Chord, Pastry, CAN, Tapestry, and the RELOAD base 317 topology plug-in, employ uniform hashing to generate object IDs so 318 that the object IDs are uniformly distributed over the ID space 319 (e.g., 0 to D = 2^128 - 1). 
Objects are then assigned to the nodes 320 based on their IDs, and the exact way that this is done differs in 321 different DHTs. Let us assume that there are N nodes in the network. 322 If the node IDs are generated uniformly at random, then the 323 probability that the ith largest nodeID is equal to k is given by: 325 P(nodeID(i) == k) = (N-1)C(i-1) * (k/D)^(i-1) * (N-i)C(1) * (1/D) * 326 (1 - (k+1)/D)^(N-i+1) 328 For large N and D, this distribution can be approximated as follows: 330 P(nodeID(i) == k) = e^(-k*N/D) * (k*N/D)^i * (1/i!) 332 Therefore, the ith largest node ID follows a Poisson distribution. The 333 inter-node distance then follows an exponential distribution with 334 parameter N/D. As a consequence of this property, the number of data 335 items assigned to a node is proportional to the inter-node distance 336 and, thus, follows an exponential distribution (c denotes an 337 arbitrary constant): 339 Pr(#items in node == x) = c * (N/D) * e^(-c*(N/D)*x) 341 The goodness of Chord for load balancing can be measured in terms of 342 the imbalance metric defined as: 344-346 imbalance factor = (Maximum # of data items) / (Average # of data items) 348 Here, the maximum value is calculated as the largest value attained 349 with probability O(1 - (1/N^l)), where l is any constant. The 350 imbalance factor measures the storage load of a server; the longer 351 the interval is, the more data has to be stored in the server. 353 For the case of Chord, this imbalance factor is of the order of 354 O(log2(N)) and there are nodes in the overlay that manage log2(N) 355 times the average node's load. Further, this result suggests that 356 the imbalance factor for Chord increases with the number of nodes in 357 the network; as the number of nodes increases, the maximum number of 358 data items handled per node grows on the order of 359 O(T*(log2(N)/N)), where T denotes the total number of objects on the 
Ideally, we would want the maximum value to be close to the 361 average value of (T/N) giving an imbalance factor close to 1 for a 362 good load balancing algorithm. Therefore, RELOAD base DHT is not 363 very efficient for load balancing and there is a need for load 364 balancing mechanisms in a topology plug-in that is widely usable. 366 3.2. Need for Robust Stabilization 368 To ensure correct lookups in the presence of churn and to ensure 369 optimal routing of queries, a node needs to ensure that its fingers, 370 successor list, and predecessor list are up to date. However, 371 stabilization of these features comes at a cost: peers in the overlay 372 network need to consume network bandwidth to maintain routing state. 373 DHTs use stabilization routines to counter the undesirable effects of 374 churn on routing. The purpose of stabilization is to keep the 375 routing information of each peer in the overlay consistent with the 376 constantly changing overlay topology. 378 The current RELOAD base topology plug-in proposes reactive 379 stabilization. Research studies have shown that this approach may 380 not be the most robust to a wide variety of network conditions. In 381 this document, we revisit the choice of stabilization and recommend 382 techniques likely to work efficiently across deployment types. 384 3.3. Need for Locality Awareness 386 A major performance issue with structured peer-to-peer based topology 387 plug-ins is the need for multi-overlay-hop routing to lookup a piece 388 of data such as the SIP Address of Record (AoR )to IP mapping. Since 389 the identifier space is completely random, a lookup for a data item 390 stored 10ms RTT away can potentially take multiple intercontinental 391 hops before getting answered significantly affecting lookup latency. 392 Additionally, the fact that routers in such peer-to-peer networks are 393 expected to be constrained non-infrastructure nodes adds on 394 additional delays. 
Locality aware routing aims to mitigate the 395 routing stretch of the topology plug-in so that a lookup is answered 396 by making progress in the identifier space while minimizing the 397 amount of distance traveled in physical space at each hop. 399 3.4. Need for Self-Tuning of System Parameters 401 Two main advantages of a self-tuning DHT are that users no longer 402 need to tune every DHT parameter correctly for a given operating 403 environment and that the system adapts to changing operating 404 conditions. 406 4. Chord 408 This document proposes a new topology plug-in for RELOAD that is based 409 on the Chord DHT algorithm. Topology plug-ins allow RELOAD to support 410 a variety of overlay algorithms. The proposed topology plug-in uses a 411 load balanced, self-tuning version of Chord. It can be used as an 412 alternative to the default DHT specified by RELOAD. 414 Chord [4] is a structured peer-to-peer algorithm that uses consistent 415 hashing to build a DHT out of several independent peers. Consistent 416 hashing assigns each peer and resource an m-bit identifier. Peers 417 MUST use SHA-1 as the base hash function to generate the identifiers. 418 The length of the identifiers MUST be m=128 bits. The identifiers 419 are ordered on an identifier circle of size 2^m. On the identifier 420 circle, key k MUST be assigned to the first peer whose identifier 421 equals or follows the identifier of k in the identifier space. The 422 identifier circle is called the Chord ring. 424 Different DHTs differ significantly in performance when bandwidth is 425 limited. It has been shown that when compared to other DHTs, the 426 advantages of Chord include that it uses bandwidth efficiently and 427 can achieve low lookup latencies at little cost [5]. 429 A simple lookup mechanism could be implemented on a Chord ring by 430 requiring each peer to only know how to contact its current successor 431 on the identifier circle. 
Queries for a given identifier could then 432 be passed around the circle via the successor pointers until they 433 encounter the first peer whose identifier is equal to or larger than 434 the desired identifier. Such a lookup scheme uses a number of 435 messages that grows linearly with the number of peers. To reduce the 436 cost of lookups, Chord also maintains additional routing information; 437 each peer n MUST maintain a data structure with up to m entries, 438 called the finger table. The first entry in the finger table of peer 439 n contains the peer half-way around the ring from peer n. The second 440 entry contains the peer that is 1/4th of the way around, the third 441 entry the peer that is 1/8th of the way around, etc. In other words, 442 the ith entry in the finger table at peer n SHOULD contain the 443 identity of the first peer s that succeeds n by at least 2^(m-i) on 444 the Chord ring. This peer is called the ith finger of peer n. The 445 interval between two consecutive fingers is called a finger interval. 446 The ith finger interval of peer n covers the range [n.id + 2^(m-i), 447 n.id + 2^(m-i+1)) on the Chord ring. In an N-peer network, each peer 448 maintains information about O(log2(N)) other peers in its finger 449 table. As an example, if N=1000, it is sufficient to maintain 10 450 fingers. 452 To increase robustness in the event of peer failures, each Chord peer 453 MUST maintain a successor list containing the peer's immediate 454 successors on the Chord ring. The successor list will be further 455 described in Section 5.3. Peers MUST also maintain a predecessor 456 list containing the peer's immediate predecessors on the Chord ring. 457 The recommended value for the size of the predecessor list is 3, as is 458 also specified in [1]. 460 5. Load Balancing in the Proposed Topology Plug-in for RELOAD 462 5.1. Basic Load Balancing scheme 464 The basic Chord DHT suffers from drawbacks. 
The uniform hashing used 465 in Chord to generate object IDs results in an imbalance factor close 466 to O(log2(N)); this implies that there are peers in the Chord ring 467 that would have to manage O(log2(N)) times the load that is to be 468 handled by an average peer. In this section, we build upon the basic 469 Chord DHT and present the virtual servers algorithm that results in 470 an imbalance factor close to 1. 472 Several research papers have proposed schemes to provide load 473 balanced DHTs. Some of the schemes are general techniques while 474 others are targeted towards optimizing a specific DHT. The solution 475 proposed in this document takes into account results and ideas from 476 previous work on load balancing in [6], [7], [8], [4], [9], [10], 477 [11], [12], [13], [14], [15], [16]. 479 The basic scheme for load balancing proposed in this document is an 480 extension of the virtual servers approach [4] over Chord with the 481 benefit of reducing the routing state. At a high level, the virtual 482 servers approach improves load balancing as follows. The 483 work in [4] suggested that load balancing can be performed 484 efficiently if each peer simulates a logarithmic number of "virtual 485 servers". The method works by associating keys with virtual servers, 486 and mapping multiple virtual servers (with unrelated identifiers) to 487 each real node. The authors show that this will provide a more 488 uniform coverage of the identifier space. For example, if we 489 allocate O(log2(N)) randomly chosen virtual servers to each real 490 node, with high probability each of the N bins will contain 491 O(log2(N)) nodes. The downside of this approach is the additional 492 routing state and its maintenance cost. 494 This document proposes load balancing based on a proposal in [14], 495 and is an improvement over the basic virtual servers approach 496 introduced in [4]. 
In our approach, each physical node instantiates 497 one or more virtual servers that act as peers in the DHT. The 498 locations of these virtual servers are chosen close to each other in 499 the nodeID space, allowing the node to share a single set of overlay 500 links among the virtual servers. 502 The main system parameters of the solution in this document are: 503 exactly how many virtual servers should be chosen and what the 504 spacing between the node identifiers of those virtual servers should be. The 505 next sections describe this document's recommendations on virtual 506 nodes and routing for a load balanced DHT. 508 Since the choices of the system parameters depend on network 509 dynamics, this document further discusses dynamically adjusting the 510 topology plug-in with network dynamics and recommends a periodic 511 stabilization model to keep the system parameters up-to-date. 513 5.2. Recommendations on Virtual Servers 515 Selection of virtual servers requires two decisions to be made: (1) 516 how many virtual servers (denoted by alpha) should we choose and (2) 517 what the namespace spacing (represented by delta*2^m) 518 between these virtual servers should be. This section provides the present 519 document's recommendation on choosing alpha and delta. 521 Each physical peer maintains two sets of virtual server identities, 522 namely, primary virtual server identities and secondary virtual 523 server identities. The location of the primary virtual server 524 (denoted as vp) is chosen using existing techniques in RELOAD. This 525 location remains fixed throughout the entire lifetime of the node. 527 In addition to the primary virtual server, each node maintains 528 secondary virtual servers. The locations of these secondary virtual 529 servers are chosen such that the ith virtual server, vs_i, is 530 uniformly distributed in [vp-i*delta*2^m, vp-(i+1)*delta*2^m]. 
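The placement rule above can be sketched in Python. This is an illustrative, non-normative sketch: the function name virtual_server_ids and the use of integer arithmetic on the identifier circle are assumptions of this sketch, not part of the protocol.

```python
import random

M = 128          # identifier length in bits, as mandated for RELOAD
RING = 2 ** M    # size of the identifier circle

def virtual_server_ids(vp, alpha, delta):
    # Returns the primary ID vp plus (alpha - 1) secondary virtual
    # server IDs.  The ith secondary ID is drawn uniformly from the
    # interval between vp - (i+1)*delta*2^m and vp - i*delta*2^m,
    # wrapping around the identifier circle.
    ids = [vp]
    for i in range(1, alpha):
        lo = vp - int((i + 1) * delta * RING)
        hi = vp - int(i * delta * RING)
        ids.append(random.randrange(lo, hi) % RING)
    return ids
```

Because each secondary ID lies at most alpha*delta*2^m counterclockwise of vp, all of a node's virtual servers stay within a small, contiguous fraction of the ring, which is what allows them to share one set of overlay links.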
532 The value of delta * 2^m defines the spacing between the virtual 533 nodes and this document requires that nodes MUST choose the value of 534 delta as 1/N, where N is the number of nodes in the network. This 535 value was found to perform well in simulations and is consistent with 536 the value reported in [14]. 538 Each node MUST maintain a total of alpha = 2*log2(N) virtual server 539 identities for load balancing, where N is the number of nodes in the 540 overlay. This value of alpha includes one primary virtual server and 541 (alpha-1) secondary virtual server identities. The choice of 542 2*log2(N) for alpha has been found to work well in simulations [14]. 544 While both these parameters, alpha and delta, depend on N (the size 545 of the network), Section 6.8.1 describes an algorithm used to 546 determine alpha and delta without performing explicit network size 547 estimation. Note that explicit network size estimation only needs to 548 be performed when self-tuning the DHT parameters, since in that case 549 a more accurate estimate of the network size is needed. 551 5.3. Routing 553 Assume an identifier space [0, 2^m-1]. As in the basic Chord 554 approach, the recommended topology plug-in maintains two types of 555 routing information, namely, finger tables and successor tables. The 556 finger tables are constructed based on the location of the primary 557 virtual server, i.e., the ith finger at peer v contains the identity 558 of the first peer s that succeeds its primary virtual server, vp, by 559 at least 2^(m-i) on the Chord ring. 561 The neighbor table of peer v contains the entries of all nodes which 562 have ownership of any part of the log2(N)/N space that node v owns. 563 As in the case of Chord, each virtual server belonging to the node 564 obtains the successor list of its immediate successor. If this total 565 length is greater than log2(N), the clockwise-farthest entry is 566 dropped to restrict the number of neighbor table entries to log2(N). 
Further, it can be shown that the maximum number of node IDs that fall within this space is O(log2(N)), resulting in as many neighbor table entries. The added benefit of having log2(N) successors is that if each peer fails independently with probability p, the probability that all log2(N) successors fail simultaneously is only p^log2(N); therefore, the system is much more resilient to failures.

To obtain the entries in the peer's neighbor table, the peer starts with the alpha-th secondary virtual ID, denoted as vs_alpha. The peer then proceeds clockwise from vs_alpha on the Chord ring and identifies the virtual IDs of peers that fall in this region. When it encounters a virtual server vs', it stores both the location of the virtual server, vs', and the location of the corresponding primary virtual server, vp'. When it encounters one of its own virtual servers, on the other hand, the entry is skipped and no information is stored. This process is followed until log2(N) distinct primary virtual servers have been obtained to populate the neighbor table. Note that while the size of the successor list may be greater than log2(N), because it includes both the primary and the secondary virtual servers, the number of successor connections is limited to log2(N).

To locate any object in the overlay, a node MUST first employ the finger tables to reach the virtual server ID that is closest to the queried object ID. At the end of this step, the chosen physical node may or may not have the object. If the physical node does not have the object, it forwards the object request to its neighbors using the neighbor table entries. Therefore, the total number of message hops required to locate an object is O(log2(N)) + 1.
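The neighbor table population described above can be sketched as follows. This is a non-normative illustration; the flat-list ring representation and helper names are purely illustrative:

```python
import math

def populate_neighbor_table(vs_alpha, ring, own_ids, n_estimate):
    """Walk clockwise from vs_alpha, skipping our own virtual IDs, until
    log2(N) distinct primary virtual servers have been collected.

    ring: list of (virtual_id, primary_id) pairs sorted by virtual_id.
    own_ids: the walking peer's own virtual server IDs."""
    limit = max(1, round(math.log2(n_estimate)))
    # start the walk at the first ring position clockwise of vs_alpha
    start = next((k for k, (v, _) in enumerate(ring) if v > vs_alpha), 0)
    table, primaries = [], set()
    for k in range(len(ring)):
        v, p = ring[(start + k) % len(ring)]
        if v in own_ids:
            continue                  # own virtual server: skip, store nothing
        table.append((v, p))          # store both vs' and its primary vp'
        primaries.add(p)
        if len(primaries) >= limit:   # log2(N) distinct primaries collected
            break
    return table
```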
Thus, the solution proposed in this document, based on [14], requires maintaining fewer connections than a pure virtual-servers-based approach to load balancing. This is because, while the basic Chord scheme with O(log2(N)) virtual servers per physical node requires maintaining O(log2(N)) sets of fingers (and O(log2(N)) routing tables) per physical node for efficient routing, this document requires maintaining just one set of fingers (and one routing table) per physical node.

In Section 6, we discuss approaches to keeping the fingers and successors up-to-date, which is required to ensure correct look-ups.

5.4. Obtaining Virtual Server Identities

This section describes the procedure for obtaining the primary and secondary virtual server identities.

5.4.1. Without Enrollment Server

When a new node, v, joins the system, the primary node identity vp MUST be computed by applying the digest specified in the self-signed-permitted element of the overlay configuration document to the DER representation of the node's public key. The subsequent (alpha - 1) virtual node IDs MUST be computed as follows: the ith virtual ID is chosen as random(vp - i*delta*2^m, vp - (i+1)*delta*2^m).

This ensures that the chosen virtual servers are close to each other and a maximum of log2(N)/N apart.

A node's public key relates to vp and can be verified by other nodes. However, the other virtual server IDs cannot be related to this public key, so they need to be verified based on vp. There are two options for this verification:

o a. Use a constant fixed offset for the ith virtual server, i.e., exactly (vp - i*delta*2^m). This removes the random component of the virtual server ID selection. Since all virtual server IDs are at fixed intervals from vp, a node can weakly verify the node ID. However, node ID collisions may make this hard.

o b.
Use the technique described in Section 5.2, with random selection in an interval. In this case, a node can verify that the virtual server ID is in a small range near vp.

OPEN ISSUE: At the moment we recommend option (b), but this issue needs further analysis.

5.4.2. With Enrollment Server

When an enrollment server is present, the added security benefit of enrollment-server-certified node identities extends to the virtual server identities. The enrollment server is provided with the values of the system parameters alpha and delta, and based on them, the enrollment server gives out a set of virtual server identities to a node. Similar to what a node itself does in a deployment without an enrollment server, the enrollment server MUST use a uniform hash function to randomly choose the primary virtual server ID from the space (0, 2^128 - 1); call this virtual server ID vp. The remaining (alpha - 1) node IDs MUST then be chosen by the enrollment server around vp in such a way that the ith virtual server ID of v is uniformly distributed in (vp - i*delta*2^m, vp - (i+1)*delta*2^m). These node IDs are then passed on to the node.

When a node initially joins, it does not have an estimate of alpha and delta, since that estimate is based on the number of fingers in the routing table. Thus, the enrollment server MUST determine the number of virtual servers, the spacing between them, and consequently the node IDs handed out to the joining node, either from the values of alpha and delta in requests it receives from nodes in the current overlay or from its own diagnostic messages sent into the overlay. If the overlay has formed recently, the enrollment server MAY bootstrap the values of alpha and delta as 20 and 1/1000, respectively.
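The enrollment server's bootstrap selection could be sketched as follows. This is a non-normative sketch with hypothetical function and parameter names; it also covers the fallback to a retained past size estimate discussed in the surrounding text:

```python
import math

def bootstrap_parameters(observed=None, past_estimate=None):
    """Sketch of the enrollment server's choice of alpha and delta.

    observed:      (alpha, delta) pairs reported by nodes in the current overlay
                   (or learned via the server's own diagnostic messages);
    past_estimate: an overlay size estimate retained from a previous run."""
    if observed:                          # prefer values from live overlay nodes
        alphas, deltas = zip(*observed)
        return round(sum(alphas) / len(alphas)), sum(deltas) / len(deltas)
    if past_estimate:                     # fall back to a stored size estimate
        return round(2 * math.log2(past_estimate)), 1.0 / past_estimate
    return 20, 1.0 / 1000                 # defaults for a newly formed overlay
```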
The enrollment server may also choose to use past overlay size estimates it possesses to bootstrap alpha and delta as 2*log2(N_estimated) and 1/N_estimated. For example, an enrollment server for a venue-based overlay that has been torn down can use past size estimates.

Note that the enrollment procedure in RELOAD already defines providing multiple node identities to the enrolling node. The only change needed is to include the values of alpha and delta in the simple enrollment request.

5.5. Extensions to Overlays with Heterogeneous Nodes

This load balancing solution can be extended to overlays whose nodes have heterogeneous capacities. Let the capacity of node v in the overlay be represented by Cv.

If an overlay has nodes with heterogeneous capacities, nodes in the overlay MUST use alpha = 2*Cv*log2(N) virtual servers per physical node. The locations of these (2*Cv*log2(N)) virtual nodes MUST be chosen near the primary node location vp such that the location of the ith virtual server ID is uniformly distributed in (vp - i*delta*2^m, vp - (i+1)*delta*2^m). Each physical node then maintains a total of O(Cv*log2(N)) neighbor entries and O(Cv*log2(N)) routing fingers. The imbalance factor for this scenario, defined as

                         Maximum # of data items
    imbalance factor = ---------------------------------
                        Cv * Average # of data items

can be shown to be close to 1.

Note that in addition to alpha and delta, the node capacity of the node MUST be passed to the enrollment server in the simple enrollment request.
This capacity value is used by the enrollment server to 704 calculate Cv as 706 node_capacity(node n) 707 Cv(node n) = ---------------------------------------- 708 sum_{k=1}^N node_capacity(node k) 710 In deployments with no enrollment servers, the node MUST estimate 711 Cv(node n) by obtaining its neighbors' node capacities and building 712 an estimate of average node capacity. 714 6. Stabilizing Fingers, Successors, and Predecessors in the Topology 715 Plug-in 717 6.1. Choice of Approach to Stabilization 719 There are two alternative approaches to stabilization: periodic and 720 reactive. Periodic stabilization can either use a fixed 721 stabilization rate or calculate the stabilization rate in an adaptive 722 fashion. 724 In reactive stabilization, a peer reacts to the loss of a peer in its 725 neighborhood set or to the appearance of a new peer that should be 726 added to its neighborhood set by sending a copy of its neighbor table 727 to all peers in the neighborhood set. Periodic recovery, in 728 contrast, takes place independently of changes in the neighborhood 729 set. In periodic recovery, a peer periodically shares its 730 neighborhood set with each of the members of that set. 732 The mandatory-to-implement Chord DHT algorithm in RELOAD [1] uses 733 reactive stabilization for the neighborhood set, unlike the original 734 Chord algorithm, which uses periodic stabilization. It has been 735 shown in [17] that reactive stabilization works well for small 736 neighborhood sets (i.e., small overlays) and moderate churn. 737 However, in large-scale (e.g., 1000 peers or more [17]) or high-churn 738 overlays, reactive stabilization runs the risk of creating a positive 739 feedback cycle, which can eventually result in congestion collapse. 740 In [17], it is shown that a 1000-peer overlay under churn uses 741 significantly less bandwidth and has lower latencies when periodic 742 stabilization is used than when reactive stabilization is used. 
Although in the experiments carried out in [17] reactive recovery performed well when there was no churn, its bandwidth use was observed to jump dramatically under churn. At higher churn rates and larger overlay scales, periodic stabilization uses less bandwidth, and the resulting lower contention for the network leads to lower latencies. For this reason, most DHTs, such as CAN, Chord, Pastry, and Bamboo, use periodic stabilization. As an example, the first version of Bamboo used reactive recovery, which caused Bamboo to suffer from degraded performance under churn. To fix this problem, Bamboo was modified to use periodic stabilization.

In Chord, periodic stabilization is typically done both for successors and fingers. An alternative strategy is analyzed in [18]. In this strategy, called the correction-on-change maintenance strategy, a peer periodically stabilizes its successors but does not do so for its fingers. Instead, finger pointers are stabilized in a reactive fashion. The results obtained in [18] imply that although the correction-on-change strategy works well when churn is low, periodic stabilization outperforms the correction-on-change strategy when churn is high.

Based on these insights, this document proposes to use periodic stabilization for fingers, successors, and predecessors. Each peer MUST maintain a stabilization timer. When the stabilization timer fires, the peer MUST restart the timer and carry out the stabilization operations. The stabilization routine is described next; more details on computing the stabilization interval are elaborated in Section 6.8.2.

6.2. Update Messages for Stabilization

The stabilization procedures are implemented using Update requests and answers. To describe the contents of these messages, the syntax defined in [1] is used.
A Chord Update request is defined as: 778 enum { reserved (0), notify(1), succ_stab(2), pred_stab(3), 779 full(4), virtualserver_stab_join(5), 780 virtualserver_stab_leave(6), (255) } 781 ChordUpdateType; 783 struct { 784 ChordUpdateType type; 785 NodeId sender_id; 787 select(type) { 788 case notify: 789 uint32 uptime; 790 NodeId sender_virtual_ids <0..2^16-1>; 791 case pred_stab: /* Empty */ 792 ; 793 case succ_stab: /* Empty */ 794 ; 795 case virtualserver_stab_join: 796 NodeId sender_virtual_ids <0..2^16-1>; 797 case virtualserver_stab_leave: 798 NodeId sender_virtual_ids <0..2^16-1>; 799 case full: 800 uint32 uptime; 801 uint32 alpha; 802 uint32 delta; 803 NodeId sender_virtual_ids <0..2^16-1>; 804 NodeId predecessors <0..2^16-1>; 805 NodeId successors <0..2^16-1>; 806 NodeId fingers <0..2^16-1>; 807 }; 808 } UpdateReq; 810 The "type" field MUST indicate the reason why the Update was sent: 812 notify: the sender of the Update wishes to notify the recipient of 813 the sender's existence. Upon receiving the Update, the recipient 814 SHOULD insert the sender into its routing table, if appropriate. 815 The 'notify' message MUST include the virtual server IDs of the 816 sender in the 'sender_virtual_ids' field. 818 succ_stab: the Update request is related to the successor 819 stabilization routine. 821 pred_stab: the Update request is related to the predecessor 822 stabilization routine. 824 virtualserver_stab_join and virtualserver_stab_leave: the Update 825 request is related to the DHT parameter stabilization routine. 827 full: the Update request contains the entire routing and neighbor 828 table of the sender. 830 The sender_id field contains the sender's primary virtual ID. 832 The sender_virtual_ids contains the list of all secondary virtual 833 server IDs belonging to the sender. 835 If the type of the Update request is 'pred_stab' or 'succ_stab', the 836 request MUST NOT carry any additional information. 
If the type of the Update request is 'notify', the request MUST contain the sender's current uptime in seconds and the locations of the sender's current virtual server IDs.

If the type of the request is 'virtualserver_stab_join' or 'virtualserver_stab_leave', the contents of the message MUST include the list of primary and secondary virtual server IDs of the sender.

If the type of the request is 'full', the contents of the message MUST be:

o uptime: The sender's current uptime in seconds;

o alpha: The sender's current value of the DHT parameter alpha;

o delta: The sender's current value of the DHT parameter delta;

o sender_virtual_ids: The sender's list of current virtual server IDs;

o predecessors: The sender's predecessor list;

o successors: The sender's successor list;

o fingers: The sender's finger table.

In the introduced topology plug-in, each peer independently decides the appropriate size for the successor list, predecessor list, finger table, and system parameters (alpha and delta). Thus, the 'predecessors', 'successors', and 'fingers' fields are of variable length. The number of virtual IDs assigned to a node, along with the spacing between the virtual IDs, is chosen based on the system parameters and may change as the network changes. Accordingly, the 'sender_virtual_ids' field is also of variable length and includes the list of virtual server IDs assigned to the sender. As specified in RELOAD [1], variable-length fields are preceded on the wire by length bytes. In the case of the successor list, predecessor list, sender_virtual_ids, and finger table, there are two length bytes (allowing lengths up to 2^16-1). The number of NodeId structures included in each field can be calculated from the length bytes, since the size of a single NodeId structure is 16 bytes.
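As an illustration, decoding one such variable-length NodeId field could look like this. The sketch is non-normative; it assumes only what the text above states (a two-byte length prefix in network byte order followed by 16-byte NodeId structures):

```python
import struct

NODE_ID_LEN = 16   # a single NodeId structure is 16 bytes

def decode_node_id_list(buf, offset=0):
    """Decode one variable-length NodeId field: a 2-byte length prefix
    followed by length/16 NodeId structures."""
    (length,) = struct.unpack_from("!H", buf, offset)   # network byte order
    if length % NODE_ID_LEN != 0:
        raise ValueError("length is not a multiple of the NodeId size")
    offset += 2
    ids = [buf[offset + k * NODE_ID_LEN : offset + (k + 1) * NODE_ID_LEN]
           for k in range(length // NODE_ID_LEN)]
    return ids, offset + length          # decoded IDs and next field's offset
```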
If a peer receives more entries than 881 fit into its successor list, predecessor list or finger table, the 882 peer SHOULD ignore the extra entries. If the peer is assigned more 883 virtual IDs than fit into its ID list, it SHOULD reject the 884 assignment. If a peer receives fewer entries than it currently has 885 in its own data structure, the peer SHOULD NOT drop the extra entries 886 from its data structure. 888 If the Update request succeeds, the responding peer sends an 889 UpdateAns message, which is defined as: 891 enum { reserved (0), notify(1), succ_stab(2), pred_stab(3), 892 full(4), virtualserver_stab_join(5), 893 virtualserver_stab_leave(6), (255) } 894 ChordUpdateType; 896 struct { 897 ChordUpdateType type; 899 select(type) { 900 case full: /* Empty */ 901 ; 902 case virtualserver_stab_join: /* Empty */ 903 ; 904 case virtualserver_stab_leave: /* Empty */ 905 ; 906 case notify: 907 uint32 uptime; 908 case pred_stab: 909 NodeId predecessors <0..2^16-1>; 910 case succ_stab: 911 NodeId predecessors <0..2^16-1>; 912 NodeId successors <0..2^16-1>; 913 }; 915 } UpdateAns; 917 If the type of the Update answer is 'full', 918 'virtualserver_stab_join', or 'virtualserver_stab_leave', the answer 919 MUST NOT carry any additional information. If the type is 'notify', 920 the answer MUST contain the sender's current uptime in seconds. If 921 the type is 'pred_stab', the answer SHOULD carry the predecessor list 922 of the responding peer. If the type is 'succ_stab', the answer 923 SHOULD include the predecessor and successor lists of the responding 924 peer. 926 6.3. Finger Stabilization 928 The purpose of the finger stabilization procedure is to incorporate 929 new peers into the finger table. In the procedure, peer v MUST 930 maintain a counter 'next', which stores the index of the next finger 931 that should be stabilized. The counter MUST be initialized to value 932 one and it MUST be incremented by one after each finger stabilization 933 procedure. 
When the stabilization timer fires, peer v MUST choose one finger interval i from the set of finger_table_size finger intervals it maintains:

   i = next % (finger_table_size + 1),

and send a Probe request addressed to the first identifier belonging to the chosen finger interval i. The peer f responding to the Probe request SHOULD become the ith finger of v. Peer v SHOULD send an Attach request to peer f to initiate a new connection to it.

This document defines a new ProbeInformationType value, 'uptime'. When this value is present in the requested_info field of a Probe request, it indicates that the receiver MUST include its current uptime, in a ProbeInformation structure, in the Probe response. A Probe request that is sent as part of the finger stabilization procedure MUST contain the 'uptime' ProbeInformationType in its requested_info field. The extended ProbeInformation structure that is returned in the Probe response is defined as:

   enum { responsible_set(1), num_resources(2), uptime(3),
          (255) }
   ProbeInformationType;

   struct {
     ProbeInformationType type;

     select (type) {
       case responsible_set:
         uint32 responsible_ppb;

       case num_resources:
         uint32 num_resources;

       case uptime:
         uint32 uptime;
     };
   } ProbeInformation;

The types "responsible_set" and "num_resources" have been specified in RELOAD [1]. The "uptime" type is new and contains the sender's current uptime in seconds.

6.3.1. Locality-aware Finger Selection

Making progress in the identifier space can be expensive in terms of network latency. The successor and predecessor lists can be used to optimize network latency by relaxing the requirement for finger selection. Specifically, for each finger table entry, a node (say v) first determines a node n that matches the identifier of the finger. It then retrieves the successors and predecessors of n from n.
Node v then PINGs the successors and predecessors of node n and chooses the topologically closest node among these for the finger table entry. The sizes of the successor and predecessor lists have an impact on network latency; the greater the number of successors and predecessors, the higher the probability of finding a topologically close finger table entry. Our simulations of the basic Chord protocol with just three successors and three predecessors show a reduction of close to 41-46% in DHT lookup delay. An independent simulation study reported in [19] confirms our results, showing 31% to 40% lookup stretch reductions using 2^16 nodes and Euclidean and transit-stub models. We expect lookup delay to improve further in the proposed topology plug-in with its log2(N) successors.

6.4. Successor Stabilization

Both the primary virtual IDs and the secondary virtual IDs of the nodes are stored as part of the successor and predecessor tables.

In the successor stabilization routine, a peer v asks the peer s that is the first entry in its successor table for the virtual ID of s's first predecessor p. If the successor's first predecessor pointer does not point to v's virtual ID but instead to p (for instance, because p has joined the overlay between v and s), the peer with virtual ID p should become v's first successor instead of s. Thus, v adds p to the front of its successor list and notifies p of v's existence, so that p can change its predecessor to v.

Successor lists are also stabilized as part of the successor stabilization routine. To do this, peer v copies the successor list of its successor s, removing the last entry and prepending s to it.
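The successor-list copy just described can be sketched as follows (non-normative; names are illustrative):

```python
def refresh_successor_list(succ_id, succ_successors, max_len):
    """Copy the successor's successor list, drop its clockwise-farthest
    (last) entry, and prepend the successor itself."""
    return ([succ_id] + succ_successors[:-1])[:max_len]
```

For example, if v's successor is 5 and 5's successor list is [6, 7, 8], v's refreshed list of length 3 becomes [5, 6, 7].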
If peer v, as a result of running the successor stabilization routine, notices that its successor has failed, it does the following:

o Using the virtual ID, s, of its successor, it looks up its virtual ID to physical ID mapping to identify which physical node, say S, has failed. Here, the term 'physical ID' refers to the primary virtual ID of the peer.

o The peer then uses the same virtual ID to physical ID mapping table to identify the locations of other virtual IDs in its successor list that correspond to the physical node S. These IDs are marked for replacement.

o The peer then replaces the successor with the first live entry in its successor list, say n, and contacts this node for its successor list. The peer synchronizes n's successor list with its own, removing the IDs that are marked for replacement. This step is repeated by contacting the other live entries, one after the other, until all the IDs marked for replacement have been updated.

The successor stabilization routine is executed when the stabilization timer fires. To begin the successor stabilization routine, peer v MUST send an Update request to its first successor s. The type of the Update request MUST be 'succ_stab'. Upon receiving the Update request, peer s MUST send an Update answer to peer v. The Update answer SHOULD include the successor and predecessor lists of peer s. If v learns from the predecessor and successor lists included in the answer that new peers should be included in its neighborhood set, v MUST send Attach requests to the new peers. Once a direct connection has been established with each new peer as a result of the Attach procedure, peer v MUST send an Update request of type 'notify' to each new peer. This allows the new peers to insert v into their neighborhood sets.

6.5.
Predecessor Stabilization

The predecessor stabilization routine is executed when the stabilization timer fires. To begin the predecessor stabilization routine, a peer v MUST send an Update request to its predecessor p. The type of the Update request MUST be 'pred_stab'. Upon receiving the Update request, peer p MUST send an Update answer to peer v. The Update answer SHOULD include the predecessor list of peer p. Peer v SHOULD use the predecessor list carried in the answer to update its own predecessor list. If new peers are inserted into the predecessor list, peer v MUST send Attach requests and Update requests of type 'notify' to the new peers in the same way as during the successor stabilization routine.

6.6. Joining the Overlay

The process of joining an overlay is as follows:

1. The Joining Peer (JP) SHOULD connect to a bootstrap peer.

2. The JP SHOULD send an Attach request to the bootstrap peer, which SHOULD route the request towards the Admitting Peer (AP). Here, the AP is the node that is the successor of the JP's primary virtual server. Once the Attach procedure is finished, there is a direct connection between the JP and the AP.

3. The JP SHOULD send a Join request to the AP. The AP returns a Join answer.

4. The AP MUST send an Update request of type 'full' to the JP. The Update request SHOULD contain the contents of the AP's routing table. The JP SHOULD use the contents of the Update request to initialize its finger table and DHT parameters, i.e., alpha and delta. The JP SHOULD set the sizes of its successor list, predecessor list, and finger table to the same values that the AP uses. The values of alpha and delta are also set to the ones that the AP uses. If an enrollment server is present in the overlay, the information about the choice of DHT parameters, alpha and delta, can be obtained from it directly.

5.
The JP then chooses the virtual server locations, vs_i, for i = 2, 3, ..., alpha. This step is not performed if there is an enrollment server in the overlay; in that scenario, the locations of the virtual servers are obtained from the enrollment server a priori.

6. The JP SHOULD send an Attach request to the successor of vs_alpha; call it n_alpha. The node returns an Attach answer.

7. The peer n_alpha MUST send an Update request of type 'full' to the JP. The Update request SHOULD contain the contents of the node's successor list. The JP SHOULD use the contents of the Update request to initialize its successor and predecessor lists.

8. The JP MUST send Attach requests to initiate connections to each of the peers in its predecessor list, successor list, and finger table. Since the JP is already connected to the AP and n_alpha, there is no need to send new Attach requests to these nodes.

9. The JP MUST send an Update request of type 'notify' to each of the peers in its predecessor and successor lists (except for the AP and n_alpha, which are already aware of the JP).

10. The JP MUST send a Probe request carrying the 'uptime' ProbeInformationType value in the requested_info field to each of its fingers. This way, the JP learns the uptimes of its fingers (the uptimes of predecessors and successors are learned from the Update responses in the previous step). The uptimes are needed when estimating the join rate of peers in the overlay. It should be noted that these Probe requests are not routed via the overlay but are sent over a direct connection.

11. Now, the JP takes ownership of regions in the overlay based on its virtual server locations, vs_i. In this step, the JP MUST send one Update message to the successor of each vs_i for all i. The type of the Update message MUST be set to 'virtualserver_stab_join'.
The peer, say n_i, receiving this message makes a note of this change and locally updates its ownership locations. Peer n_i then sends an UpdateAns message with type 'virtualserver_stab_join' to the JP acknowledging the change. Peer n_i SHOULD then issue a series of Store requests to the JP to transfer ownership of the resources. This step is repeated for all i = alpha, alpha-1, ..., 1. At the end of this step, all the new virtual servers corresponding to the JP have joined the overlay, and the JP stores the data for these virtual locations.

6.6.1. Contents of the Join Message

This topology plug-in extends the Join request defined in RELOAD [1]. In addition to the joining_peer_id, a node MAY choose to send its virtual server locations as part of the Join message if they were obtained a priori from the enrollment server as part of the enrollment process. The JoinReq message is defined as:

   struct {
     NodeId joining_peer_id;
     NodeId joining_peer_virtual_server_ids <0..2^16-1>;
     opaque overlay_specific_data <0..2^16-1>;
   } JoinReq;

The JoinReq contains the Node-ID that the sending peer wishes to assume and MAY include other virtual server locations that the joining peer wishes to occupy.

If the request succeeds, the responding peer responds with a JoinAns message; this is defined as in RELOAD.

6.7. Leaving the Overlay

The process of leaving the overlay is as follows:

1. If no replication is being performed in the overlay, the leaving peer SHOULD issue a series of Store requests to the successor of each of its virtual servers, vs_i, to transfer the ownership of the resource records it is storing. Note that if replication is being used, the successor of the peer is already storing replicas and the amount of data transferred can be minimized.

2.
The leaving peer MUST send a Leave request to the predecessor and successor of each virtual server, vs_i. The Leave request that is sent to vs_i's successor SHOULD contain the predecessor list of the leaving peer. The Leave request that is sent to vs_i's predecessor SHOULD contain the successor list of the leaving peer. The first successor SHOULD use the predecessor list carried in the Leave request to update its own predecessor list. The first predecessor SHOULD use the successor list carried in the Leave request to update its own successor list.

6.7.1. Contents of the Leave Message

This topology plug-in extends the Leave request defined in RELOAD [1]. In addition to the leaving_peer_id, a node MUST send its virtual server locations as part of the Leave message, as defined below:

   struct {
     NodeId leaving_peer_id;
     NodeId leaving_peer_virtual_server_ids <0..2^16-1>;
     opaque overlay_specific_data <0..2^16-1>;
   } LeaveReq;

The overlay_specific_data field of the Leave request MUST contain a ChordLeaveData structure:

   enum { reserved (0),
          from_succ(1), from_pred(2), (255) }
   ChordLeaveType;

   struct {
     ChordLeaveType type;

     select(type) {
       case from_succ:
         NodeId successors <0..2^16-1>;
       case from_pred:
         NodeId predecessors <0..2^16-1>;
     };
   } ChordLeaveData;

The 'type' field indicates whether the Leave request was sent by a predecessor or a successor of the recipient:

from_succ

   The Leave request was sent by a successor.

from_pred

   The Leave request was sent by a predecessor.

If the type of the request is 'from_succ', the contents will include the sender's successor list.

If the type of the request is 'from_pred', the contents will include the sender's predecessor list.

6.8.
Self Tuning System Parameters

This section describes how the system parameters for load balancing and stabilization adapt to changing network conditions.

6.8.1. Self Tuning Load Balancing Algorithm Parameters

All DHTs require constant updating as the size of the network grows. For example, in the case of Chord, each node starts out with m entries in the finger table (if node IDs take values from 0 to 2^m-1). When the number of nodes in the network is small, some of these m entries collapse to the same node. As the number of nodes in the network increases, the finger table entries come to point to distinct nodes, and the number of fingers grows as O(log2(N)). Thus, the number of fingers that each node has to maintain grows as O(log2(N)) as N increases.

A similar update step is required by the solution in this document as well, because the system parameters, alpha and delta, also depend on the value of N. This section describes how nodes MUST update the values of alpha, delta, and the locations of their virtual servers as the network changes in size. This document proposes to update the system values each time the network size doubles or halves.

In addition to finger, successor, and predecessor stabilization, each peer also needs to perform DHT parameter stabilization. Unlike the other stabilization routines, which are performed periodically when the stabilization timer fires, the parameter stabilization routine is performed whenever the number of fingers is observed to change. Performing stabilization in this way is sufficient for updating the load balancing parameters (i.e., alpha and delta) because this is not the highest priority update and the overlay would function effectively even without it. In the proposed topology plug-in, we employ lazy updates for stabilizing the parameters alpha and delta.
More specifically, we use a change in the number of fingers as the
trigger to perform a DHT parameter update.

The parameter stabilization routine is executed when the number of
fingers in a node changes.  We consider two possible scenarios:

1.  The number of connections decreases: Let f be the ID of the lost
    connection.  The peer v MUST first send a request towards the ID
    of the lost connection to see whether any node responds.  Let n
    be the ID of the node responding to the request.  If n is the
    same as f, the peer MUST re-connect to f.  If n is not the same
    as f and does not correspond to any node that is already in the
    finger table of v, then v MUST send an Attach request to n.  In
    these two cases, the parameter stabilization routine is not
    performed, since the number of fingers does not change.  On the
    other hand, if n is not the same as f and corresponds to some
    other node already in the finger table of v, then v concludes
    that the network size has decreased and invokes the parameter
    stabilization routine.

2.  The number of connections increases: In this case, v invokes the
    parameter stabilization routine directly.

When the parameter stabilization routine is invoked, the following
steps are performed:

o  Step 1: The (alpha-1) virtual servers corresponding to v leave the
   overlay.

o  Step 2: The values of alpha and delta are updated and the new
   values are obtained.  The corresponding virtual server locations
   are also modified to obtain the new vs_i.

o  Step 3: The node v joins the overlay at the new virtual server
   locations, vs_i.

o  Step 4: The successors and predecessors of v are updated.

During the parameter stabilization routine, the portion of the data
space owned by the corresponding physical node changes, and so data
needs to be moved to match this change.
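The scenario-1 decision logic above can be sketched as follows.  This is an illustrative Python rendition, not part of the protocol; `probe` is a hypothetical helper, and the no-response case, which the draft leaves unspecified, is assumed here to fall through to parameter stabilization:

```python
def on_connection_lost(finger_ids, f, probe):
    """Sketch of the scenario-1 decision logic.

    finger_ids: IDs currently in v's finger table.
    f:          ID of the lost finger connection.
    probe(f):   hypothetical helper returning the ID of the node that
                now answers for f's position, or None if none responds.
    """
    n = probe(f)
    if n == f:
        # f was only temporarily unreachable: re-connect to it.
        return "reconnect"
    if n is not None and n not in finger_ids:
        # A previously unknown node answers: send it an Attach request.
        return "attach"
    # The responder is already a finger: the network has shrunk, so
    # the parameter stabilization routine is invoked.
    return "param_stabilize"

# Hypothetical finger table of v and probe outcomes:
fingers = {10, 20, 40}
assert on_connection_lost(fingers, 80, lambda f: 80) == "reconnect"
assert on_connection_lost(fingers, 80, lambda f: 95) == "attach"
assert on_connection_lost(fingers, 80, lambda f: 40) == "param_stabilize"
```

In the first two outcomes the finger count is unchanged, which is why parameter stabilization is skipped there.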
The locations of successors and predecessors also change to adjust
to the updates in the virtual server locations.  However, the
fingers do not change, because these are associated with the primary
virtual ID and the location of the primary node-ID is fixed.

In Step 1 of the stabilization routine (see above), the peer v MUST
send one Update message to the successor of each virtual server,
vs_i.  The type of the Update message MUST be set to
'virtualserver_stab_leave'.  The peer n_i receiving this message
makes a note of this change and locally updates its ownership
locations.  Peer n_i then sends an UpdateAns message with type
'virtualserver_stab_leave' to vs_i acknowledging the change.  On
receiving the UpdateAns message, peer v SHOULD issue a series of
Store requests to n_i to transfer ownership of the resources.  This
step is repeated for all i = 2, ..., alpha.  At the end of this
step, all the virtual servers corresponding to v have left the
overlay and peer v no longer stores the data corresponding to these
virtual locations.

In Step 2, the peer v obtains its new virtual server locations,
vs_i.  In the presence of an enrollment server, the peer v MUST
request a new set of identities from the enrollment server and stop
participating in the overlay network with the old node identities.
The enrollment server MAY choose to implement diagnostics using the
mechanisms in [3] to ascertain that the node requesting identities
is not requesting more or fewer identities than it should, based on
the finger table sizes of a random sampling of other nodes in the
overlay.

In the absence of an enrollment server, the following procedure is
used to obtain the new values of alpha, delta, and vs_i.

   UpdateAlphaDelta()
   {
     If (number_of_fingers increases by 1) {
       // Can be detected, as in the case of Chord, when finger
       // tables are updated.
       // This implies that the network size has about doubled.

       delta = delta/2;   // update delta

       // Update the existing virtual server locations.
       // vs_1 = vp, as the location of the primary virtual server
       // does not change.
       for i = 2 to alpha
         vs_i = vp - (vp - vs_i)/2;

       // Choose two new virtual server locations, one in each of
       // the ranges (vp - (alpha+1)*delta*2^m, vp - alpha*delta*2^m)
       // and (vp - (alpha+2)*delta*2^m, vp - (alpha+1)*delta*2^m).
       vs_(alpha+1) = random (vp - alpha*delta*2^m,
                              vp - (alpha+1)*delta*2^m);
       vs_(alpha+2) = random (vp - (alpha+1)*delta*2^m,
                              vp - (alpha+2)*delta*2^m);
       alpha = alpha + 2;
     }

     If (number_of_fingers decreases by 1) {
       // Can be detected, as in the case of Chord, when finger
       // tables are updated.  This implies that the network size
       // has about halved.

       // Leave the network with the two outermost virtual IDs.
       Remove vs_(alpha-1);
       Remove vs_(alpha);

       delta = delta * 2;   // update delta
       alpha = alpha - 2;   // update alpha

       // Update the existing virtual server locations.
       // vs_1 = vp, as the location of the primary virtual server
       // does not change.
       for i = 2 to alpha
         vs_i = vp - (vp - vs_i) * 2;
     }
   }

Once the locations of the virtual servers have been obtained, the
peer re-joins the overlay at these positions.  In Step 3, the peer v
MUST send one Update message to the successor of each vs_i.  The
type of the Update message MUST be set to 'virtualserver_stab_join'.
The peer n_i receiving this message makes a note of this change and
locally updates its ownership locations.  Peer n_i then sends an
UpdateAns message with type 'virtualserver_stab_join' to v
acknowledging the change.  Peer n_i SHOULD then issue a series of
Store requests to v to transfer ownership of the resources.  In this
process, the nodes n_i are added to the successor list of v.  This
step is repeated for all i = alpha, alpha-1, ..., 2.
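The UpdateAlphaDelta() routine can be rendered in Python as a sketch.  The function names and the unwrapped arithmetic (no mod 2^m) are illustrative assumptions made for readability:

```python
import random

def update_alpha_delta(vp, vs, alpha, delta, m, grew, rng):
    """Sketch of UpdateAlphaDelta() from the pseudocode above.

    vp:    primary virtual server location (vs[0] == vp).
    vs:    the list [vs_1, ..., vs_alpha].
    grew:  True if the finger count increased by one (network roughly
           doubled), False if it decreased (roughly halved).
    Identifier wraparound (mod 2^m) is omitted for clarity."""
    if grew:
        delta = delta / 2  # update delta
        # Pull the existing secondary servers halfway towards vp;
        # vs_1 = vp does not change.
        vs = [vp] + [vp - (vp - v) / 2 for v in vs[1:]]
        # Choose two new virtual server locations, one per newly
        # opened range below the existing servers.
        vs.append(rng.uniform(vp - (alpha + 1) * delta * 2**m,
                              vp - alpha * delta * 2**m))
        vs.append(rng.uniform(vp - (alpha + 2) * delta * 2**m,
                              vp - (alpha + 1) * delta * 2**m))
        alpha = alpha + 2
    else:
        # Leave the network with the two outermost virtual IDs.
        vs = vs[:-2]
        delta = delta * 2  # update delta
        alpha = alpha - 2  # update alpha
        # Spread the remaining secondary servers back out.
        vs = [vp] + [vp - (vp - v) * 2 for v in vs[1:]]
    return vs, alpha, delta
```

Note that a grow step followed by a shrink step restores the original secondary server locations, since halving and then doubling the distances to vp is exact.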
At the end of this step, all the new virtual servers corresponding
to v have joined the overlay, and peer v stores the data for these
virtual locations.

If replication is performed in the overlay, some of the data may
already be present at node v, which would reduce the number of Store
requests.

Successor and predecessor stabilization routines are invoked in Step
4 as described in Section 6.4 and Section 6.5, respectively, and
this completes the parameter stabilization routine.

6.8.2.  Self Tuning the Stabilization Interval

To ensure that lookups produce correct results as the set of
participating peers changes, and to ensure that all peers'
connections remain up to date, each peer MUST run a stabilization
protocol periodically in the background.  The stabilization protocol
uses three operations: finger stabilization, successor
stabilization, and predecessor stabilization, as defined earlier.
Each peer MUST maintain a stabilization timer.  When the
stabilization timer fires, the peer MUST restart the timer and carry
out the stabilization operations.  In this section, we present
methods to compute the stabilization timer.

When periodic stabilization is used, one faces the problem of
selecting an appropriate execution rate for the stabilization
procedure.  If the execution rate of periodic stabilization is high,
changes in the system can be quickly detected, but at the cost of
increased communication overhead.  On the other hand, if the
stabilization rate is low and the churn rate is high, routing tables
become inaccurate and DHT performance deteriorates.  Thus, the
problem is to set the parameters so that the overlay achieves the
desired reliability and performance even in challenging conditions,
such as under heavy churn.
Provisioning statically for challenging conditions naturally results
in high cost during periods when the churn level is lower than
expected, or alternatively, in poor performance or even network
partitioning in worse-than-expected conditions.

The current approach is to configure overlays statically.  This
works only in situations where perfect information about the future
is available.  In situations where the operating conditions of the
network are known in advance and remain static throughout the
lifetime of the system, it is possible to choose fixed optimal
values for parameters such as the stabilization rate, neighborhood
set size, and routing table size.  However, if the operating
conditions (e.g., the size of the overlay and its churn rate) do not
remain static but evolve with time, it is not possible to achieve
both a low lookup failure rate and a low communication overhead by
using fixed parameters [20].

As an example, to configure the Chord DHT algorithm, one needs to
select appropriate values for the size of the successor list and the
stabilization interval.  To select an appropriate value for the
stabilization interval, one needs to know the expected churn rate
and overlay size.  According to [21], a Chord network in a ring-like
state remains in a ring-like state as long as peers send
Omega(log2^2(N)) messages before N new peers join or N/2 peers fail.
Thus, in a 500-peer overlay churning at a rate such that one peer
joins and one peer leaves the network every 30 seconds, an
appropriate stabilization interval would be on the order of 93s.
According to [4], the size of the successor list should be on the
order of log2(N).  Having a successor list of size O(log2(N)) makes
it unlikely that a peer will lose all of its successors, which would
cause the Chord ring to become disconnected.  Thus, in a 500-peer
network, each peer should maintain on the order of nine successors.
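The 500-peer example can be checked numerically.  This is a sketch that applies the Omega(log2^2(N)) message bound from [21] and the O(log2(N)) successor list size from [4]:

```python
import math

# 500-peer overlay, one join and one leave every 30 s.  With one peer
# leaving every 30 s, N/2 = 250 peers fail in 250 * 30 = 7500 s; per
# [21], each peer should send Omega(log2^2(N)) stabilization messages
# within that time.
N = 500
leave_interval_s = 30.0

t_half_fail = (N / 2) * leave_interval_s      # 7500 s
interval = t_half_fail / math.log2(N) ** 2    # stabilization interval
successors = math.log2(N)                     # successor list size [4]

print(round(interval), round(successors))     # prints: 93 9
```

This reproduces the 93-second interval and the nine-successor list quoted above.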
However, if the churn rate doubles while the network size remains
unchanged, the stabilization rate should double as well.  That is,
the appropriate maintenance interval would now be on the order of
46s.  On the other hand, if the churn rate increases, e.g.,
six-fold, and the size of the network grows to 2000 peers, on the
order of eleven successors should be maintained and the
stabilization interval should be on the order of 42s.  If one
continued using the old values, this could result in inaccurate
routing tables, network partitioning, and deteriorating performance.

The proposed topology plug-in takes the continuous evolution of
network conditions into consideration and adapts to it.  Each peer
collects statistical data about the network and adaptively adjusts
its stabilization rate and successor list size based on an analysis
of the data [20].  Reference [22] shows that by using a self-tuning
mechanism, it is possible to achieve high reliability and
performance even in adverse conditions with a low maintenance cost.
Adaptive stabilization has been shown to outperform periodic
stabilization in terms of both lookup failures and communication
overhead [20].

The following sub-sections specify methods to determine the
appropriate stabilization rate in an adaptive fashion.  The proposed
mechanism is based on [20], [21], and [22].  To calculate an
appropriate stabilization rate, the values of three parameters MUST
be estimated: overlay size N, failure rate U, and join rate L.
Peers in the overlay MUST re-calculate the values of these
parameters to self-tune the algorithm at the end of each
stabilization period, before re-starting the stabilization timer.

6.8.2.1.  Estimating Overlay Size

Techniques for estimating the size of an overlay network have been
proposed, for instance, in [20], [22], [23], [24], and [25].
In Chord, the density of peer identifiers in the successor set can
be used to produce an estimate of the size of the overlay, N [22].
Since peer identifiers are picked randomly with uniform probability
from the m-bit identifier space, the average distance between peer
identifiers in the successor set is (2^m)/N.

To estimate the overlay network size, a peer MUST compute the
average inter-peer distance d between the successive peers, starting
from the most distant predecessor and ending at the most distant
successor in the successor list.  The estimated network size MUST be
calculated as:

            2^m
   N  =  ---------
             d

This estimate has been found to be accurate within 15% of the real
network size [20].  Naturally, the size of the neighborhood set
affects the accuracy of the estimate.

When a peer joins the network, the admitting peer sends the joining
peer a copy of its neighborhood set.  Thus, a joining peer
immediately has enough information at its disposal to calculate an
estimate of the network size.

6.8.2.2.  Estimating Failure Rate

A typical approach is to assume that peers join the overlay
according to a Poisson process with rate L and leave according to a
Poisson process with rate parameter U [22].  The value of U can be
estimated using peer failures in the finger table and neighborhood
set [22].  If peers fail with rate U, a peer with M unique peer
identifiers in its routing table should observe K failures in time
K/(M*U).  Every peer in the overlay MUST maintain a history of the
last K failures.  The current time MUST be inserted into the history
when the peer joins the overlay.
The estimate of U MUST be calculated as:

              k
   U  =  ----------,
           M * Tk

where M is the number of unique peer identifiers in the routing
table, Tk is the time between the first and the last failure in the
history, and k is the number of failures in the history.  If k is
smaller than K, the estimate is computed as if there were a failure
at the current time.  It has been shown that an estimate calculated
in a similar manner is accurate within 17% of the real value of U
[20].

The size of the failure history, K, affects the accuracy of the
estimate of U.  One can increase the accuracy by increasing K.
However, this has the side effect of decreasing responsiveness to
changes in the failure rate.  On the other hand, a small history
size may cause a peer to overreact each time a new failure occurs.
In [20], K is set to 25% of the routing table size.

6.8.2.2.1.  Estimating Join Rate

Reference [20] proposes that a peer can estimate the peer join rate
based on the uptimes of the peers in its routing table.  An increase
in the peer join rate will be reflected by a decrease in the average
age of the peers in the routing table.  Thus, each peer MUST
maintain an array of the ages of the peers in its routing table,
sorted in increasing order.  Using this information, an estimate of
the global peer join rate, L, MUST be calculated as:

           N          1
   L  =  ---  *  ---------------,
           4      Ages[rsize/4]

where Ages is an array containing the ages of the peers in the
routing table sorted in increasing order and rsize is the size of
the routing table.  Only the ages of the 25% youngest peers in the
routing table SHOULD be used, to reduce the bias that a small number
of peers with very old ages can cause [20].  It has been shown that
the estimate obtained by using this method is accurate within 22% of
the real join rate [20].
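The three estimators can be sketched as follows.  This is illustrative Python; identifier wraparound in the size estimate and the k &lt; K correction in the failure-rate estimate are omitted for brevity:

```python
def estimate_network_size(neighbor_ids, m):
    """N = 2^m / d, where d is the average distance between successive
    peer IDs in the neighborhood set (Section 6.8.2.1).  Identifier
    wraparound is ignored here for brevity."""
    ids = sorted(neighbor_ids)
    gaps = [b - a for a, b in zip(ids, ids[1:])]
    d = sum(gaps) / len(gaps)
    return 2**m / d

def estimate_failure_rate(failure_times, M):
    """U = k / (M * Tk): k observed failures spread over Tk seconds,
    with M unique peer IDs in the routing table (Section 6.8.2.2).
    The k < K correction is omitted."""
    k = len(failure_times)
    Tk = max(failure_times) - min(failure_times)
    return k / (M * Tk)

def estimate_join_rate(ages, N):
    """L = (N/4) * 1/Ages[rsize/4], using only the youngest 25% of the
    routing-table entries (Section 6.8.2.2.1)."""
    Ages = sorted(ages)
    rsize = len(Ages)
    return (N / 4) / Ages[rsize // 4]
```

For example, five neighborhood IDs spaced 100 apart in a 2^10 identifier space yield an estimate of about 10 peers, matching d = (2^m)/N.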
Naturally, the size of the routing table affects the accuracy.

In order for this mechanism to work, peers need to exchange
information about the time they have been present in the overlay.
Peers learn the uptimes of their successors and predecessors when
adding the successors and predecessors to their routing tables,
since Update requests and answers of type 'notify' carry uptime
values.  Peers learn the uptimes of their fingers because the Probe
responses sent as part of the finger stabilization routine carry
uptime values.  A joining peer learns the admitting peer's uptime
since an Update request of type 'full' contains uptime information.

6.8.2.2.2.  Calculating the Stabilization Interval

According to [21], a Chord network in a ring-like state remains in a
ring-like state as long as peers send Omega(log2^2(N)) messages
before N new peers join or N/2 peers fail.  We can use the estimate
of the peer failure rate, U, to calculate the time Tf in which N/2
peers fail:

             1
   Tf  =  --------
            2*U

Based on this estimate, a stabilization interval Tstab-1 is
calculated as:

                   Tf
   Tstab-1  =  -----------
                log2^2(N)

Further, the estimated join rate L can be used to calculate the time
in which N new peers join the overlay.  Based on the estimate of L,
a stabilization interval Tstab-2 is calculated as:

                      N
   Tstab-2  =  ---------------
                L * log2^2(N)

Finally, the actual stabilization interval Tstab that SHOULD be used
is obtained by taking the minimum of Tstab-1 and Tstab-2.

The results obtained in [26] indicate that making the stabilization
interval too small has the effect of making the overlay less stable
(e.g., in terms of detected loops and path failures).  Thus, a lower
limit should be used for the stabilization period.
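Putting the estimates together, the interval calculation can be sketched as follows (the 15-second floor applied here is the lower limit this document proposes based on [26]):

```python
import math

def stabilization_interval(N, U, L, floor_s=15.0):
    """Tstab = min(Tstab-1, Tstab-2), clamped to a lower limit.

    N: estimated overlay size
    U: per-peer failure rate (failures per second per peer)
    L: global join rate (joins per second)"""
    log2_sq = math.log2(N) ** 2
    t_fail = 1.0 / (2.0 * U)              # Tf: time for N/2 peers to fail
    tstab1 = t_fail / log2_sq             # Tstab-1
    tstab2 = N / (L * log2_sq)            # Tstab-2
    return max(min(tstab1, tstab2), floor_s)

# 500-peer overlay, one join and one leave every 30 s:
# U = (1/30)/500 per peer per second, L = 1/30 joins per second.
print(round(stabilization_interval(500, 1 / 15000, 1 / 30)))  # prints: 93
```

With these inputs, Tstab-1 (about 93 s) is the smaller of the two intervals, matching the 500-peer example in this section.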
Based on the results in [26], a lower limit of 15s is proposed,
since using a stabilization period smaller than this would, with
high probability, cause too much traffic in the overlay.

7.  Security Considerations

One important concern for the virtual servers approach is that it is
not easily enforceable.  Given a set of virtual server IDs for a
physical node, it must be ensured that the physical node actually
takes ownership of all the virtual server IDs assigned to it.  In
the presence of a centralized enrollment server, this server can
ensure that each physical node gets its fair share of virtual
servers.  In the absence of an enrollment server, the verification
can be performed during the join process.  For instance, in the case
of Chord, when a new node joins the overlay, it contacts its
neighbors to obtain neighbor relations and finger table entries.  A
similar approach can be adopted in the case of the virtual servers
approach.  In this scenario, each physical node provides its
neighbors with a list of 2*log2(N) virtual server IDs.  The
neighbors first verify that the number of virtual server locations
received is close to the number of virtual server locations that
they themselves own before sending the finger table entries to the
joining node.  In this way, it can be ensured that each node has
O(log2(N)) virtual locations in the overlay.

8.  IANA Considerations

(a)  A new overlay algorithm type should be defined for the proposed
     new topology plug-in.

(b)  This document defines one new Probe Information Type value:

     +-----------------+------+---------------+
     | Probe Option    | Code | Specification |
     +-----------------+------+---------------+
     | uptime          |  3   | RFC-AAAA      |
     +-----------------+------+---------------+

(c)  Other IANA considerations are TBD.

9.
Acknowledgments

This document benefited from design discussions with Vidya Narayanan
from Qualcomm Inc.

10.  Appendix

This appendix lists a few performance results for the load balancing
solution proposed in this document.

10.1.  Comparison of the Load Balancing Algorithm with Chord

Compared to the virtual markers approach in [7], the solution
proposed in this document does not require O(log2(N)) nodes to
change their location upon each node arrival.  For instance, in the
case of the virtual markers approach, if O(log2(N)) nodes change
location, then O(log2(N)/N) data objects need to be reassigned and
at least O(2 log2(N)) nodes are involved in this step.  This is in
addition to the messages required for updating the routing tables,
which is O(log2^3(N)).  In contrast to the virtual markers approach,
the proposed solution requires O(1/N) data objects to be reassigned,
and around O(log2(N)) nodes send data to the joining node.  The
number of routing messages required is still O(log2^2(N)), similar
to the case of Chord with no virtual servers; this is because the
number of connections is still O(log2(N)) links per node.

We performed simulation studies to compare the performance of the
solution in this document with Chord.  In order to study the
percentage of nodes with a significant load imbalance, we look at
the q-percentile load imbalance factor, defined as the ratio of the
q-percentile load to the average load.  For example, a 99-percentile
imbalance factor of 5 implies that less than 1 percent of the nodes
in the system have a load more than 5 times the average load.

We tested the solution in this document with N = 2^15 and compared
the results with Chord.  We found that the 99-percentile imbalance
factor for the solution in this document is around 2.
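The imbalance metric used here can be written down directly.  The nearest-rank percentile definition below is an assumption, since this document does not pin down the percentile method:

```python
import math

def imbalance_factor(loads, q):
    """q-percentile load imbalance factor: the q-percentile load
    divided by the average load (nearest-rank percentile assumed)."""
    ranked = sorted(loads)
    idx = math.ceil(q / 100 * len(ranked)) - 1
    return ranked[idx] / (sum(loads) / len(loads))

# 99 nodes with load 1 and one hot node with load 10: the
# 100-percentile factor captures the single hot node.
loads = [1] * 99 + [10]
print(round(imbalance_factor(loads, 100), 2))
```

A 99-percentile factor of 2, as measured for this document's solution, means that fewer than 1% of nodes carry more than twice the average load.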
In comparison, our results with Chord suggest that the
100-percentile imbalance factor is around 9, the 99-percentile
imbalance factor is around 4.5, and so on.  This result suggests
that in Chord around 1% of the nodes have an imbalance between 4.5
and 9, 2% have an imbalance greater than 4, and so on, compared to
the solution in this document, where less than 1% of the nodes have
an imbalance greater than 2.

With regard to routing, our simulations compared the average route
length for Chord (with one virtual server) with that of this
document's solution (with 2 log2(N) virtual servers); the results
showed that both of these DHT implementations require around the
same number of hops.

10.2.  Performance of the Load Balancing Algorithm as the Network
       Grows

In this section, we take a closer look at the performance of the
solution in this document as the network size grows, in terms of the
load imbalance factor and routing state maintenance.

(1)  Load Imbalance: In our simulations, we fix the value of the
     estimated network size to Nest = 2^9, and choose alpha = 18 and
     delta = 2^(-9).  The actual network size is then increased from
     2^2 to 2^14.  When N < Nest, the spacing between the virtual
     servers (chosen as 1/Nest) is very small, and the solution in
     this document behaves similarly to Chord.  In this case, the
     load imbalance factor is of the order of O(log2(N)).  However,
     since the numerical value of N is small, the imbalance factor
     is still low, around 3 for most nodes, as demonstrated by our
     simulations.  As the network size N increases, the spacing
     delta comes into effect, and the virtual servers help balance
     the load.  Our simulations indicate that the imbalance is
     around its lowest value of 2 when N = Nest, and increases
     slowly as N becomes larger than Nest.
(2)  Routing State: Here, we examine the performance of this
     solution with regard to the amount of routing state that it
     needs to maintain as the network grows.  We analyze the
     solution in this document analytically to determine the number
     of hops required to reach a destination as a function of N and
     Nest.  We perform the analysis in two steps:

     *  Step 1, routing to within a distance of 1/N of the
        destination: If nodes in this solution employ only their
        primary virtual servers (and the corresponding O(log2(N))
        connections) for routing, as in the case of Chord, it can be
        shown that any realization built upon the solution in this
        document would be able to get within a distance of 1/N of
        the destination node in O(log2(N)) hops with O(log2(N))
        fingers per node.  This result is unaffected by the relative
        values of N and Nest.

     *  Step 2, last mile: The number of hops required to route
        messages from within a distance of 1/N to the exact
        destination (referred to as the last-mile problem) depends
        upon the exact realization of the DHT and its construction
        of neighbor tables.  In this document, each node has alpha
        virtual servers, chosen uniformly at random, separated by a
        distance of O(delta*2^m).  Therefore, the number of nodes
        within a distance of 1/N from a chosen location would be of
        the order of O(log2(N) + N*log2(1 + s*alpha*delta)), where
        's' is a constant independent of N.  If alpha and delta are
        chosen using Nest, then for large Nest the number of nodes
        within a distance of 1/N can be approximated as being of the
        order of O(log2(N) + N*log2(Nest)/Nest).  Therefore, a
        random walk would lead to the final destination within
        O(log2(N) + N*log2(Nest)/Nest) hops.
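The hop bound above can be evaluated numerically to see the two regimes.  This sketch drops the hidden constant factors, so only the relative behavior is meaningful:

```python
import math

def last_mile_hops(N, Nest):
    """Evaluate the O(log2(N) + N*log2(Nest)/Nest) hop bound from the
    Step 2 analysis, with constant factors dropped."""
    return math.log2(N) + N * math.log2(Nest) / Nest

# With Nest = 2^9: for N <= Nest the second term stays small and the
# bound remains Chord-like (~log2(N)); for N >> Nest the linear term
# dominates, which is why alpha and delta must be updated as the
# network grows.
for N in (2**6, 2**9, 2**14):
    print(N, round(last_mile_hops(N, 2**9), 1))
```

At N = Nest = 2^9 the two terms are equal (both log2(Nest)), while at N = 2^14 the linear term is an order of magnitude larger, matching the O(N)-hops caveat discussed next.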
This result indicates that when N < Nest, the solution in this
document performs similarly to Chord, and messages can be routed to
any node in O(log2(N)) hops by maintaining O(log2(N)) fingers and
O(log2(N)) successors.  On the other hand, when N is much larger
than Nest, O(N) hops might be required to reach the destination if
the number of fingers per node is O(log2(N)).  However, this
situation arises only when the network size estimate is very much
lower than N and the values of alpha and delta are not updated as
the network grows.

11.  References

11.1.  Normative References

[1]   Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and H.
      Schulzrinne, "REsource LOcation And Discovery (RELOAD) Base
      Protocol", draft-ietf-p2psip-base-02 (work in progress),
      March 2009.

[2]   Bradner, S., "Key words for use in RFCs to Indicate
      Requirement Levels", BCP 14, RFC 2119, March 1997.

[3]   Yongchao, S., Jiang, X., Even, R., and D. Bryan, "P2PSIP
      Overlay Diagnostics", draft-ietf-p2psip-diagnostics-01 (work
      in progress), June 2009.

11.2.  Informative References

[4]   Stoica, I., Morris, R., Karger, D., Kaashoek, M., and H.
      Balakrishnan, "Chord: A scalable peer-to-peer lookup service
      for internet applications", In Proc. of ACM SIGCOMM, 2001.

[5]   Li, J., Stribling, J., Gil, T., and M. Kaashoek, "Comparing
      the performance of distributed hash tables under churn", In
      Proc. of the 3rd International Workshop on Peer-to-Peer
      Systems (IPTPS), 2004.

[6]   Bienkowski, M., Korzeniowski, M., and F. auf der Heide,
      "Dynamic load balancing in distributed hash tables", In Proc.
      of IPTPS, 2005.

[7]   Karger, D. and M. Ruhl, "Simple efficient load balancing
      algorithms for peer-to-peer systems", In Proc. of the 3rd
      International Workshop on Peer-to-Peer Systems (IPTPS), 2004.

[8]   Awerbuch, B. and C.
Scheideler, "Group spreading: A protocol
      for provably secure distributed name service", In Proc. of the
      31st Int. Colloquium on Automata, Languages, and Programming
      (ICALP), 2004.

[9]   Fraigniaud, P. and C. Gavoille, "The content-addressable
      network d2b", Technical Report 1349, LRI, Univ. Paris-Sud,
      France, 2003.

[10]  Kaashoek, F. and D. Karger, "Koorde: A simple degree-optimal
      hash table", In Proc. of the 2nd International Workshop on
      Peer-to-Peer Systems (IPTPS), 2003.

[11]  Naor, M. and U. Wieder, "Novel architectures for peer to peer
      applications: the continuous-discrete approach", In Proc. of
      the 15th ACM Symp. on Parallel Algorithms and Architectures
      (SPAA), 2003.

[12]  Byers, J., Considine, J., and M. Mitzenmacher, "Simple load
      balancing for distributed hash tables", In Proc. of the 2nd
      International Workshop on Peer-to-Peer Systems (IPTPS), 2003.

[13]  Manku, G., "Routing Networks for Distributed Hash Tables", In
      Proc. of the Principles of Distributed Computing (PODC), 2003.

[14]  Godfrey, B. and I. Stoica, "Heterogeneity and load balance in
      Distributed Hash Tables", IEEE INFOCOM, 2005.

[15]  Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., and
      I. Stoica, "Load balancing in dynamic structured peer to peer
      systems", In Proc. of the 23rd Conference of the IEEE
      Communications Society (INFOCOM), 2004.

[16]  Rao, A., Lakshminarayanan, K., Surana, S., Karp, R., and I.
      Stoica, "Load balancing in structured peer to peer systems",
      In Proc. of the 2nd International Workshop on Peer-to-Peer
      Systems (IPTPS), 2004.

[17]  Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, "Handling
      churn in a DHT", In Proc. of the USENIX Annual Technical
      Conference, 2004.

[18]  Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. Haridi,
      "Comparing maintenance strategies for overlays", In Proc.
of the 16th Euromicro Conference on Parallel, Distributed and
      Network-Based Processing, 2008.

[19]  Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.,
      Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A
      scalable peer-to-peer lookup protocol for internet
      applications", IEEE/ACM Transactions on Networking, 2003.

[20]  Ghinita, G. and Y. Teo, "An adaptive stabilization framework
      for distributed hash tables", 20th International Parallel and
      Distributed Processing Symposium, 2006.

[21]  Liben-Nowell, D., Balakrishnan, H., and D. Karger,
      "Observations on the dynamic evolution of peer-to-peer
      networks", In Proc. of the First International Workshop on
      Peer-to-Peer Systems, 2002.

[22]  Mahajan, R., Castro, M., and A. Rowstron, "Controlling the
      cost of reliability in peer-to-peer overlays", In Proc. of the
      2nd International Workshop on Peer-to-Peer Systems, 2003.

[23]  Horowitz, K. and D. Malkhi, "Estimating Network Size from
      Local Information", Information Processing Letters, Volume 88,
      Issue 5, pp. 237-243, 2003.

[24]  Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and A.
      Demers, "Decentralized schemes for size estimation in large
      and dynamic groups", Fourth IEEE International Symposium on
      Network Computing and Applications, pp. 41-48, 2005.

[25]  Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable
      algorithm to monitor chord-based peer to peer systems at
      runtime", 20th International Parallel and Distributed
      Processing Symposium, 2006.

[26]  Maenpaa, J. and G. Camarillo, "A study on maintenance
      operations in a Chord-based Peer-to-Peer Session Initiation
      Protocol overlay network", Accepted to the Sixth International
      Workshop on Hot Topics in P2P Systems (HotP2P 2009), 2009.

[27]  Ktari, S., Zoubert, M., Hecker, A., and H.
Labiod,
      "Performance evaluation of replication strategies in DHTs
      under churn", In Proc. of the 6th International Conference on
      Mobile and Ubiquitous Multimedia, 2007.

Authors' Addresses

Jouni Maenpaa
Ericsson
Hirsalantie 11
Jorvas  02420
Finland

Email: Jouni.Maenpaa@ericsson.com

Ashwin Swaminathan
Qualcomm, Inc.
5775 Morehouse Dr
San Diego, CA
USA

Phone: +1 858-845-8775
Email: sashwin@qualcomm.com

Saumitra M. Das
Qualcomm, Inc.
3195 Kifer Road
Santa Clara, CA
USA

Phone: +1 408-533-9529
Email: saumitra@qualcomm.com

Gonzalo Camarillo
Ericsson
Hirsalantie 11
Jorvas  02420
Finland

Email: Gonzalo.Camarillo@ericsson.com

Jani Hautakorpi
Ericsson
Hirsalantie 11
Jorvas  02420
Finland

Email: Jani.Hautakorpi@ericsson.com