2 P2PSIP Working Group J. Maenpaa 3 Internet-Draft Ericsson 4 Intended status: Standards Track A. Swaminathan 5 Expires: January 7, 2010 S. Das 6 Qualcomm, Inc. 7 G. Camarillo 8 J. Hautakorpi 9 Ericsson 10 July 6, 2009 12 A Topology Plug-in for REsource LOcation And Discovery 13 draft-maenpaa-p2psip-topologyplugin-00 15 Status of this Memo 17 This Internet-Draft is submitted to IETF in full conformance with the 18 provisions of BCP 78 and BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF), its areas, and its working groups. Note that 22 other groups may also distribute working documents as Internet- 23 Drafts. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. It is inappropriate to use Internet-Drafts as reference 28 material or to cite them other than as "work in progress." 30 The list of current Internet-Drafts can be accessed at 31 http://www.ietf.org/ietf/1id-abstracts.txt. 
33 The list of Internet-Draft Shadow Directories can be accessed at 34 http://www.ietf.org/shadow.html. 36 This Internet-Draft will expire on January 7, 2010. 38 Copyright Notice 40 Copyright (c) 2009 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents in effect on the date of 45 publication of this document (http://trustee.ietf.org/license-info). 46 Please review these documents carefully, as they describe your rights 47 and restrictions with respect to this document. 49 Abstract 51 REsource LOcation And Discovery (RELOAD) is a peer-to-peer signaling 52 protocol that can be used to maintain an overlay network, and to 53 store data in and retrieve data from the overlay. This document 54 defines a new topology plug-in for RELOAD that is more appropriate 55 for real world large scale overlays. This topology plug-in 56 implements three important functionalities that allow RELOAD to 57 operate under real world constraints. First, it includes a load 58 balancing algorithm that specifies efficient allocation of load to 59 different nodes in the network. Second, the document describes 60 robust techniques for stabilization of fingers and successors and 61 specifies self tuning mechanisms that allow dynamic and automatic 62 adjustment of parameters needed for these advanced techniques in the 63 topology plug-in. Finally, it specifies a locality aware finger 64 selection algorithm that reduces average lookup latency. 66 Table of Contents 68 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 69 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 6 70 3. Need for Advanced Topology Plug-in . . . . . . . . . . . . . . 8 71 3.1. Need for Load Balancing . . . . . . . . . . . . . . . . . 8 72 3.2. Need for Robust Stabilization . . . . . . . . . . . . . . 9 73 3.3. Need for Locality Awareness . . . . . . . . . . . 
. . . . 9 74 3.4. Need for Self-Tuning of System Parameters . . . . . . . . 10 75 4. Chord . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 76 5. Load Balancing in the Proposed Topology Plug-in for RELOAD . . 11 77 5.1. Basic Load Balancing scheme . . . . . . . . . . . . . . . 11 78 5.2. Recommendations on Virtual Servers . . . . . . . . . . . . 12 79 5.3. Routing . . . . . . . . . . . . . . . . . . . . . . . . . 13 80 5.4. Obtaining Virtual Server Identities . . . . . . . . . . . 14 81 5.4.1. Without Enrollment Server . . . . . . . . . . . . . . 14 82 5.4.2. With Enrollment Server . . . . . . . . . . . . . . . . 15 83 5.5. Extensions to Overlays with Heterogeneous Nodes . . . . . 16 84 6. Stabilizing Fingers, Successors, and Predecessors in the 85 Topology Plug-in . . . . . . . . . . . . . . . . . . . . . . . 16 86 6.1. Choice of Approach to Stabilization . . . . . . . . . . . 16 87 6.2. Update Messages for Stabilization . . . . . . . . . . . . 18 88 6.3. Finger Stabilization . . . . . . . . . . . . . . . . . . . 21 89 6.3.1. Locality-aware Finger Selection . . . . . . . . . . . 22 90 6.4. Successor Stabilization . . . . . . . . . . . . . . . . . 23 91 6.5. Predecessor Stabilization . . . . . . . . . . . . . . . . 24 92 6.6. Joining the Overlay . . . . . . . . . . . . . . . . . . . 24 93 6.6.1. Contents of the Join Message . . . . . . . . . . . . . 25 94 6.7. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 26 95 6.7.1. Contents of the Leave Message . . . . . . . . . . . . 26 96 6.8. Self Tuning System Parameters . . . . . . . . . . . . . . 27 97 6.8.1. Self Tuning Load Balancing Algorithm Parameters . . . 28 98 6.8.2. Self Tuning the Stabilization Interval . . . . . . . . 32 99 7. Security Considerations . . . . . . . . . . . . . . . . . . . 37 100 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 37 101 9. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 37 102 10. Appendix . . . . . . . . . . . . . . . . . . . . . 
. . . . . . 37 103 10.1. Comparison of the Load Balancing Algorithm with Chord . . 38 104 10.2. Performance of the Load Balancing Algorithm as Network 105 Grows . . . . . . . . . . . . . . . . . . . . . . . . . . 38 106 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40 107 11.1. Normative References . . . . . . . . . . . . . . . . . . . 40 108 11.2. Informative References . . . . . . . . . . . . . . . . . . 40 109 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 42 111 1. Introduction 113 REsource LOcation And Discovery (RELOAD) [1] is a peer-to-peer 114 signaling protocol that can be used to maintain an overlay network, 115 and to store data in and retrieve data from the overlay. For 116 interoperability reasons, RELOAD specifies one overlay algorithm that 117 is mandatory to implement. Additionally, RELOAD supports a variety 118 of other overlay algorithms through the use of topology plug-ins. A 119 topology plug-in implements the topology defined by a specific 120 overlay algorithm. 122 This document defines a new topology plug-in for RELOAD that is more 123 appropriate for real world large scale overlays that have to deal 124 with object storage in a fair manner, provide good lookup performance 125 to a variety of applications, and self-organize to deal with churn. 126 This topology plug-in implements three important functionalities that 127 allow RELOAD to operate under these real world constraints. First, 128 it includes a load balancing algorithm that specifies efficient 129 allocation of load to different nodes in the network. Second, the 130 document describes robust techniques for stabilization of fingers and 131 successors and specifies self tuning mechanisms that allow dynamic 132 and automatic adjustment of parameters needed for these advanced 133 techniques in the topology plug-in; this avoids the need for 134 constants that may only work in specific scenarios. 
Finally, it 135 specifies a locality aware finger selection algorithm that reduces 136 average lookup latency. 138 Load balancing is essential to effectively manage data and provide 139 services on overlays. Load balancing, as an integral part of the 140 overlay, encourages participation by imposing collective fate sharing 141 on the nodes. Without such a scheme, overlay adoption may be 142 significantly affected. However, the mandatory-to-implement RELOAD 143 DHT protocol based on Chord does not support operating the overlay in 144 a load balanced manner. For instance, for a system with N nodes, it 145 can be shown that the imbalance factor, defined as the maximum load 146 on any node on the network divided by the average load, is of the 147 order of O(log2(N)) in the number of items even if all objects are 148 assumed to be homogeneous. In the case of heterogeneous networks, 149 where the capabilities of nodes (storage and bandwidth) can differ by 150 multiple orders of magnitude, the problem of load balancing becomes 151 more important because the imbalance in load distribution could 152 potentially create a bottleneck in the system. Thus, to enable 153 scalable, real world deployments of RELOAD, the topology plug-in in 154 this document specifies a scheme for load balancing. 156 Peer-to-peer overlay networks face a fundamental challenge with 157 churn: joining and leaving of nodes. This can affect the integrity 158 of the routing structure, cause a node to become disconnected from 159 the overlay, and cause overhead in maintaining consistency. Several 160 research studies have been performed on the type of stabilization 161 that can be used in DHT based peer-to-peer overlays such as RELOAD. 162 This document specifies a method for stabilization to deal with churn 163 to allow RELOAD to be scalable and reliable in real world conditions. 
164 We specifically prescribe the use of a periodic stabilization routine 165 to counter the undesirable effects of churn on routing. 167 To enable load balancing and stabilization to deal with churn, some 168 parameters need to be set (described later). The use of specific 169 constants for such parameters leads to deployments that may only work 170 in specific scenarios (where the parameters are evaluated). Thus, 171 this document specifies self tuning mechanisms for system parameters 172 to allow RELOAD to scale naturally as network dynamics dictate. For 173 example, as churn increases, the topology plug-in specified in this 174 document adapts by increasing the frequency of stabilization. 175 Similarly, as network size increases, load balancing parameters and 176 DHT parameters are modified to ensure quick finger and successor 177 updates and to keep the load balancing property over time. To enable 178 self tuning of system parameters, some characteristics such as churn 179 rate and network size are estimated. These characteristics are then 180 used to dynamically adjust the topology plug-in parameters such as 181 the size of the successor set, size of the routing table, and rate of 182 maintenance messages. The benefit of this approach is that it avoids 183 the problem with static parameters. With static parameters, it is 184 not possible to achieve both a low failure rate and a low communication 185 overhead. The topology plug-in specified in this document allows the 186 system to take into account the evolution of network conditions and 187 adapt to them. 189 Even with up-to-date fingers and successors, making progress in the 190 identifier space can be expensive in terms of network latency because 191 small progress in identifier space could result in a significant leap 192 in physical distance. The successor and predecessor lists can be 193 used to optimize network latency by relaxing the requirement for 194 finger selection. 
Specifically, at each overlay hop, as progress is 195 made in the identifier space, small physical distance hops are used 196 so as to avoid high latencies in overall lookup. More details on the 197 proposed locality aware finger selection are described later in this 198 document. 200 In summary, the topology plug-in proposed in this document has the 201 following advantages: 203 o First, building on the RELOAD framework, this document introduces 204 a load balancing algorithm that provides an imbalance factor of 205 the order of O(1). The solution proposed in the document can also 206 be extended to the case of heterogeneous nodes in such a way that 207 the number of objects assigned to a node is proportional to the 208 amount of load it can handle; 210 o Second, this topology plug-in specifies stabilization techniques 211 that deal effectively with churn; 213 o Third, this document proposes self-tuning of parameters required 214 in these algorithms such that users no longer need to tune every 215 DHT parameter correctly for a given operating environment. By 216 periodically computing the network parameters, the system 217 automatically adapts to changing operating conditions; and 219 o Finally, the locality aware finger selection algorithm 220 incorporated with the DHT algorithm further optimizes the 221 selection of successors and predecessors to reduce network 222 latency. 224 This document is organized as follows. We first describe the 225 algorithms for load balancing, stabilization, and locality aware 226 finger selection. Then we describe the major components required to 227 operate the topology plug-in: joining, leaving, routing, dealing with 228 failures etc. Finally, we describe self tuning of the system 229 parameters to make the topology plug-in adjust itself to changing 230 network conditions. 232 2. 
Terminology 234 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 235 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 236 document are to be interpreted as described in RFC 2119 [2]. 238 This document uses the terminology and definitions from the Concepts 239 and Terminology for Peer-to-Peer SIP [1] draft. 241 Chord ring: The Chord DHT orders identifiers on an identifier circle 242 of size 2^m (m is the number of bits in peer identifiers). This 243 identifier circle is called the Chord ring. 245 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 246 distributed systems that provide a lookup service similar to a 247 hash table. Given a key, any participating peer can retrieve the 248 value associated with that key. The responsibility for 249 maintaining the mapping from keys to values is distributed among 250 the peers. 252 Finger table: A data structure with up to m entries maintained by 253 each peer in a Chord-based overlay. The ith entry in the finger 254 table of peer n contains the identity of the first primary virtual 255 server that succeeds n by at least 2^(m-i) on the Chord ring. 256 This peer is called the ith finger of peer n. As an example, the 257 first entry in the finger table of peer n contains a peer half-way 258 around the Chord ring from peer n. The purpose of the finger 259 table is to accelerate lookups. 261 log2(N): Logarithm of N with base 2. 263 N: The number of nodes in the overlay unless otherwise specified. 265 Neighborhood set: Consists of successor and predecessor lists. 267 O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means 268 that f(n) is less than some constant multiple of g(n). 270 Omega(g(n)): Informally, saying that some equation f(n) = 271 Omega(g(n)) means that f(n) is more than some constant multiple of 272 g(n). 274 Predecessor list: A data structure containing the predecessors of a 275 peer. 
277 Successor list: A data structure containing the successors of a 278 peer. 280 Virtual server: Each physical peer instantiates one or more virtual 281 servers with random IDs that act as peers in the DHT. 283 Primary virtual server: The primary ID assigned to the peer when it 284 joins the network. This location is chosen uniformly at random 285 among the available IDs in [0, 2^m-1]. 287 Secondary virtual server: List of all identities owned by the 288 physical peer excluding the primary virtual server. The locations 289 of the secondary virtual servers are selected so that the ith 290 virtual server is distributed in [vp-i*delta*2^m, 291 vp-(i+1)*delta*2^m] where vp is the primary virtual server. These 292 locations may change to adapt to network dynamics. 294 alpha: The number of virtual servers per physical peer. This includes one 295 primary virtual server and (alpha-1) secondary virtual servers. 297 delta: The value of delta defines the spacing between the virtual 298 servers. The locations of the secondary virtual servers are 299 selected so that the ith virtual server is distributed in 300 [vp-i*delta*2^m, vp-(i+1)*delta*2^m] where vp is the location of 301 the primary virtual server. 303 Routing table: The set of peers that a node can use to route overlay 304 messages. The routing table consists of the finger table, 305 successor list and predecessor list, all populated with primary 306 virtual server IDs. 308 3. Need for Advanced Topology Plug-in 310 This section details the need for specifying load balancing 311 mechanisms, robust stabilization and locality awareness for a RELOAD 312 topology plug-in to make it efficient and scalable. 314 3.1. Need for Load Balancing 316 Most DHTs, such as Chord, Pastry, CAN, Tapestry, and the RELOAD base 317 topology plug-in, employ uniform hashing to generate object IDs so 318 that the object IDs are uniformly distributed over the ID space 319 (e.g., 0 to D = 2^128 - 1). 
Objects are then assigned to the nodes 320 based on their IDs, and the exact way that this is done differs in 321 different DHTs. Let us assume that there are N nodes in the network. 322 If the node IDs are generated uniformly at random, then the 323 probability that the ith largest nodeID is equal to k is given by: 325 P(nodeID(i) == k) = (N-1)C(i-1) * (k/D)^(i-1) * (N-i)C(1) * (1/D) * 326 (1 - (k+1)/D)^(N-i+1) 328 For large N and D, this distribution can be approximated as follows: 330 P(nodeID(i) == k) = e^(-k*N/D) * (k*N/D)^i * (1/i!) 332 Therefore, the ith largest node ID follows a Poisson distribution. The 333 inter-node distance then follows an exponential distribution with 334 parameter N/D. As a consequence of this property, the number of data 335 items assigned to a node is proportional to the inter-node distance 336 and, thus, follows an exponential distribution (c denotes an 337 arbitrary constant): 339 Pr(#items in node == x) = c * (N/D) * e^(-c*(N/D)*x) 341 The goodness of Chord for load balancing can be measured in terms of 342 the imbalance metric defined as: 344-346 imbalance factor = (Maximum # of data items) / (Average # of data items) 348 Here, the maximum value is calculated as the largest value attained 349 with probability O(1 - (1/N^l)), where l is any constant. The 350 imbalance factor measures the storage load of a server; the longer 351 the interval is, the more data has to be stored in the server. 353 For the case of Chord, this imbalance factor is of the order of 354 O(log2(N)) and there are nodes in the overlay that manage log2(N) 355 times the average node's load. Further, this result suggests that 356 the imbalance factor for Chord increases with the number of nodes in 357 the network; as the number of nodes increases, the maximum number of 358 data items handled per node grows on the order of 359 O(T*(log2(N)/N)), where T denotes the total number of objects on the 
Ideally, we would want the maximum value to be close to the 361 average value of (T/N) giving an imbalance factor close to 1 for a 362 good load balancing algorithm. Therefore, RELOAD base DHT is not 363 very efficient for load balancing and there is a need for load 364 balancing mechanisms in a topology plug-in that is widely usable. 366 3.2. Need for Robust Stabilization 368 To ensure correct lookups in the presence of churn and to ensure 369 optimal routing of queries, a node needs to ensure that its fingers, 370 successor list, and predecessor list are up to date. However, 371 stabilization of these features comes at a cost: peers in the overlay 372 network need to consume network bandwidth to maintain routing state. 373 DHTs use stabilization routines to counter the undesirable effects of 374 churn on routing. The purpose of stabilization is to keep the 375 routing information of each peer in the overlay consistent with the 376 constantly changing overlay topology. 378 The current RELOAD base topology plug-in proposes reactive 379 stabilization. Research studies have shown that this approach may 380 not be the most robust to a wide variety of network conditions. In 381 this document, we revisit the choice of stabilization and recommend 382 techniques likely to work efficiently across deployment types. 384 3.3. Need for Locality Awareness 386 A major performance issue with structured peer-to-peer based topology 387 plug-ins is the need for multi-overlay-hop routing to lookup a piece 388 of data such as the SIP Address of Record (AoR )to IP mapping. Since 389 the identifier space is completely random, a lookup for a data item 390 stored 10ms RTT away can potentially take multiple intercontinental 391 hops before getting answered significantly affecting lookup latency. 392 Additionally, the fact that routers in such peer-to-peer networks are 393 expected to be constrained non-infrastructure nodes adds on 394 additional delays. 
Locality aware routing aims to mitigate the 395 routing stretch of the topology plug-in so that a lookup is answered 396 by making progress in the identifier space while minimizing the 397 amount of distance traveled in physical space at each hop. 399 3.4. Need for Self-Tuning of System Parameters 401 Two main advantages of a self-tuning DHT are that users no longer 402 need to tune every DHT parameter correctly for a given operating 403 environment and that the system adapts to changing operating 404 conditions. 406 4. Chord 408 This document proposes a new topology plug-in for RELOAD that is based 409 on the Chord DHT algorithm. Topology plug-ins allow RELOAD to support 410 a variety of overlay algorithms. The proposed topology plug-in uses a 411 load balanced, self-tuning version of Chord. It can be used as an 412 alternative to the default DHT specified by RELOAD. 414 Chord [4] is a structured peer-to-peer algorithm that uses consistent 415 hashing to build a DHT out of several independent peers. Consistent 416 hashing assigns each peer and resource an m-bit identifier. Peers 417 MUST use SHA-1 as the base hash function to generate the identifiers. 418 The length of the identifiers MUST be m=128 bits. The identifiers 419 are ordered on an identifier circle of size 2^m. On the identifier 420 circle, key k MUST be assigned to the first peer whose identifier 421 equals or follows the identifier of k in the identifier space. The 422 identifier circle is called the Chord ring. 424 Different DHTs differ significantly in performance when bandwidth is 425 limited. It has been shown that when compared to other DHTs, the 426 advantages of Chord include that it uses bandwidth efficiently and 427 can achieve low lookup latencies at little cost [5]. 429 A simple lookup mechanism could be implemented on a Chord ring by 430 requiring each peer to only know how to contact its current successor 431 on the identifier circle. 
Queries for a given identifier could then 432 be passed around the circle via the successor pointers until they 433 encounter the first peer whose identifier is equal to or larger than 434 the desired identifier. Such a lookup scheme uses a number of 435 messages that grows linearly with the number of peers. To reduce the 436 cost of lookups, Chord also maintains additional routing information; 437 each peer n MUST maintain a data structure with up to m entries, 438 called the finger table. The first entry in the finger table of peer 439 n contains the peer half-way around the ring from peer n. The second 440 entry contains the peer that is 1/4th of the way around, the third 441 entry the peer that is 1/8th of the way around, etc. In other words, 442 the ith entry in the finger table at peer n SHOULD contain the 443 identity of the first peer s that succeeds n by at least 2^(m-i) on 444 the Chord ring. This peer is called the ith finger of peer n. The 445 interval between two consecutive fingers is called a finger interval. 446 The ith finger interval of peer n covers the range [n.id + 2^(m-i), 447 n.id + 2^(m-i+1)) on the Chord ring. In an N-peer network, each peer 448 maintains information about O(log2(N)) other peers in its finger 449 table. As an example, if N=1000, it is sufficient to maintain 10 450 fingers. 452 To increase robustness in the event of peer failures, each Chord peer 453 MUST maintain a successor list containing the peer's immediate 454 successors on the Chord ring. The successor list will be further 455 described in Section 5.3. Peers MUST also maintain a predecessor 456 list containing the peer's immediate predecessors on the Chord ring. 457 The recommended value for the size of the predecessor list is 3, as is 458 also specified in [1]. 460 5. Load Balancing in the Proposed Topology Plug-in for RELOAD 462 5.1. Basic Load Balancing scheme 464 The basic Chord DHT suffers from drawbacks. 
The uniform hashing used 465 in Chord to generate object IDs results in an imbalance factor close 466 to O(log2(N)); this implies that there are peers in the Chord ring 467 that would have to manage O(log2(N)) times the load that is to be 468 handled by an average peer. In this section, we build upon the basic 469 Chord DHT and present the virtual servers algorithm that results in 470 an imbalance factor close to 1. 472 Several research papers have proposed schemes to provide load 473 balanced DHTs. Some of the schemes are general techniques while 474 others are targeted towards optimizing a specific DHT. The solution 475 proposed in this document takes into account results and ideas from 476 previous work on load balancing in [6], [7], [8], [4], [9], [10], 477 [11], [12], [13], [14], [15], [16]. 479 The basic scheme for load balancing proposed in this document is an 480 extension of the virtual servers approach [4] over Chord with the 481 benefit of reducing the routing state. At a high level, the virtual 482 servers approach improves load balancing as follows. The 483 work in [4] suggested that load balancing can be performed 484 efficiently if each peer simulates a logarithmic number of "virtual 485 servers". The method works by associating keys with virtual servers, 486 and mapping multiple virtual servers (with unrelated identifiers) to 487 each real node. The authors show that this will provide a more 488 uniform coverage of the identifier space. For example, if we 489 allocate O(log2(N)) randomly chosen virtual servers to each real 490 node, with high probability each of the N bins will contain 491 O(log2(N)) nodes. The downside of this approach is the additional 492 routing state and its maintenance cost. 494 This document proposes load balancing based on a proposal in [14], 495 and is an improvement over the basic virtual servers approach 496 introduced in [4]. 
In our approach, each physical node instantiates 497 one or more virtual servers that act as peers in the DHT. The 498 locations of these virtual servers are chosen close to each other in 499 the nodeID space, allowing the node to share a single set of overlay 500 links among the virtual servers. 502 The main system parameters of the solution in this document are: 503 exactly how many virtual servers should be chosen and what the 504 spacing between the node identifiers of those virtual servers should be. The 505 next sections describe this document's recommendations on virtual 506 nodes and routing for a load balanced DHT. 508 Since the choices of the system parameters depend on network 509 dynamics, this document further discusses dynamically adjusting the 510 topology plug-in with network dynamics and recommends a periodic 511 stabilization model to keep the system parameters up-to-date. 513 5.2. Recommendations on Virtual Servers 515 Selection of virtual servers requires two decisions to be made: (1) 516 how many virtual servers (denoted by alpha) should we choose and (2) 517 what the namespace spacing (represented by delta*2^m) 518 between these virtual servers should be. This section provides the present 519 document's recommendation on choosing alpha and delta. 521 Each physical peer maintains two sets of virtual server identities, 522 namely, primary virtual server identities and secondary virtual 523 server identities. The location of the primary virtual server 524 (denoted as vp) is chosen using existing techniques in RELOAD. This 525 location remains fixed throughout the entire lifetime of the node. 527 In addition to the primary virtual server, each node maintains 528 secondary virtual servers. The locations of these secondary virtual 529 servers are chosen such that the ith virtual server, vs_i, is 530 uniformly distributed in [vp-i*delta*2^m, vp-(i+1)*delta*2^m]. 
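The placement rule above can be sketched in Python. This is an illustrative, non-normative sketch: the function name virtual_server_ids and the use of integer arithmetic on the identifier circle are assumptions of this sketch, not part of the protocol.

```python
import random

M = 128          # identifier length in bits, as mandated for RELOAD
RING = 2 ** M    # size of the identifier circle

def virtual_server_ids(vp, alpha, delta):
    # Returns the primary ID vp plus (alpha - 1) secondary virtual
    # server IDs.  The ith secondary ID is drawn uniformly from the
    # interval between vp - (i+1)*delta*2^m and vp - i*delta*2^m,
    # wrapping around the identifier circle.
    ids = [vp]
    for i in range(1, alpha):
        lo = vp - int((i + 1) * delta * RING)
        hi = vp - int(i * delta * RING)
        ids.append(random.randrange(lo, hi) % RING)
    return ids
```

Because each secondary ID lies at most alpha*delta*2^m counterclockwise of vp, all of a node's virtual servers stay within a small, contiguous fraction of the ring, which is what allows them to share one set of overlay links.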
532 The value of delta * 2^m defines the spacing between the virtual 533 nodes and this document requires that nodes MUST choose the value of 534 delta as 1/N, where N is the number of nodes in the network. This 535 value was found to perform well in simulations and is consistent with 536 the value reported in [14]. 538 Each node MUST maintain a total of alpha = 2*log2(N) virtual server 539 identities for load balancing, where N is the number of nodes in the 540 overlay. This value of alpha includes one primary virtual server and 541 (alpha-1) secondary virtual server identities. The choice of 542 2*log2(N) for alpha has been found to work well in simulations [14]. 544 While both these parameters, alpha and delta, depend on N (the size 545 of the network), Section 6.8.1 describes an algorithm used to 546 determine alpha and delta without performing explicit network size 547 estimation. Note that explicit network size estimation only needs to 548 be performed when self-tuning the DHT parameters, since in that case 549 a more accurate estimate of the network size is needed. 551 5.3. Routing 553 Assume an identifier space [0, 2^m-1]. As in the basic Chord 554 approach, the recommended topology plug-in maintains two types of 555 routing information, namely, finger tables and successor tables. The 556 finger tables are constructed based on the location of the primary 557 virtual server, i.e., the ith finger at peer v contains the identity 558 of the first peer s that succeeds its primary virtual server, vp, by 559 at least 2^(m-i) on the Chord ring. 561 The neighbor table of peer v contains the entries of all nodes which 562 have ownership of any part of the log2(N)/N space that node v owns. 563 As in the case of Chord, each virtual server belonging to the node 564 obtains the successor list of its immediate successor. If this total 565 length is greater than log2(N), the clockwise-farthest entry is 566 dropped to restrict the number of neighbor table entries to log2(N). 
Further, it can be shown that the maximum number of node IDs that fall within this space is O(log2(N)), resulting in as many neighbor table entries. The added benefit of having log2(N) successors is that if each peer fails independently with probability p, the probability that all log2(N) successors fail simultaneously is only p^log2(N); therefore, the system is much more resilient to failures.

To obtain the entries in the peer's neighbor table, the peer starts with the alpha-th secondary virtual ID, denoted as vs_alpha. The peer then proceeds clockwise from vs_alpha on the Chord ring and identifies the virtual IDs of peers that fall in this region. When it encounters a virtual server vs', it stores both the location of the virtual server, vs', and the location of the corresponding primary virtual server, vp'. When it encounters one of its own virtual servers, on the other hand, the entry is skipped and no information is stored. This process is followed until log2(N) distinct primary virtual servers have been obtained to populate the neighbor table. Note that while the size of the successor list may be greater than log2(N), because it includes both the primary and the secondary virtual servers, the number of successor connections is limited to log2(N).

To locate any object in the overlay, a node MUST first employ the finger tables to reach the virtual server ID that is closest to the queried object ID. At the end of this step, the chosen physical node may or may not have the object. If the physical node does not have the object, it forwards the object request to its neighbors using the neighbor table entries. Therefore, the total number of message hops required to locate an object is O(log2(N)) + 1.
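The neighbor table population described above can be sketched as follows. This is a non-normative illustration; the flat-list ring representation and helper names are purely illustrative:

```python
import math

def populate_neighbor_table(vs_alpha, ring, own_ids, n_estimate):
    """Walk clockwise from vs_alpha, skipping our own virtual IDs, until
    log2(N) distinct primary virtual servers have been collected.

    ring: list of (virtual_id, primary_id) pairs sorted by virtual_id.
    own_ids: the walking peer's own virtual server IDs."""
    limit = max(1, round(math.log2(n_estimate)))
    # start the walk at the first ring position clockwise of vs_alpha
    start = next((k for k, (v, _) in enumerate(ring) if v > vs_alpha), 0)
    table, primaries = [], set()
    for k in range(len(ring)):
        v, p = ring[(start + k) % len(ring)]
        if v in own_ids:
            continue                  # own virtual server: skip, store nothing
        table.append((v, p))          # store both vs' and its primary vp'
        primaries.add(p)
        if len(primaries) >= limit:   # log2(N) distinct primaries collected
            break
    return table
```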
Thus, the solution proposed in this document, based on [14], requires maintaining fewer connections than a pure virtual-servers-based approach to load balancing. This is because, while the basic Chord scheme with O(log2(N)) virtual servers per physical node requires maintaining O(log2(N)) sets of fingers (and O(log2(N)) routing tables) per physical node for efficient routing, this document requires maintaining just one set of fingers (and one routing table) per physical node.

In Section 6, we discuss approaches to keeping the fingers and successors up-to-date, which is required to ensure correct look-ups.

5.4. Obtaining Virtual Server Identities

This section describes the procedure for obtaining the primary and secondary virtual server identities.

5.4.1. Without Enrollment Server

When a new node, v, joins the system, the primary node identity vp MUST be computed by applying the digest specified in the self-signed-permitted element of the overlay configuration document to the DER representation of the node's public key. The subsequent (alpha - 1) virtual node IDs MUST be computed as follows: the ith virtual ID is chosen as random(vp - i*delta*2^m, vp - (i+1)*delta*2^m).

This ensures that the chosen virtual servers are close to each other and a maximum of log2(N)/N apart.

A node's public key relates to vp and can be verified by other nodes. However, the other virtual server IDs cannot be related to this public key, so they need to be verified based on vp. There are two options for this verification:

o a. Use a constant fixed offset for the ith virtual server, i.e., exactly (vp - i*delta*2^m). This removes the random component of the virtual server ID selection. Since all virtual server IDs are at fixed intervals from vp, a node can weakly verify the node ID. However, node ID collisions may make this hard.

o b.
Use the technique described in Section 5.2, with random selection in an interval. In this case, a node can verify that the virtual server ID is in a small range near vp.

OPEN ISSUE: At the moment we recommend option (b), but this issue needs further analysis.

5.4.2. With Enrollment Server

When an enrollment server is present, the added security benefit of enrollment-server-certified node identities extends to the virtual server identities. The enrollment server is provided with the values of the system parameters alpha and delta, and based on them, the enrollment server gives out a set of virtual server identities to a node. Similar to what a node itself does in a deployment without an enrollment server, the enrollment server MUST use a uniform hash function to randomly choose the primary virtual server ID from the space (0, 2^128 - 1); call this virtual server ID vp. The remaining (alpha - 1) node IDs MUST then be chosen by the enrollment server around vp in such a way that the ith virtual server ID of v is uniformly distributed in (vp - i*delta*2^m, vp - (i+1)*delta*2^m). These node IDs are then passed on to the node.

When a node initially joins, it does not have an estimate of alpha and delta, since that estimate is based on the number of fingers in the routing table. Thus, the enrollment server MUST determine the number of virtual servers, the spacing between them, and consequently the node IDs handed out to the joining node, either from the values of alpha and delta in requests it receives from nodes in the current overlay or from its own diagnostic messages sent into the overlay. If the overlay has formed recently, the enrollment server MAY bootstrap the values of alpha and delta as 20 and 1/1000, respectively.
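The enrollment server's bootstrap selection could be sketched as follows. This is a non-normative sketch with hypothetical function and parameter names; it also covers the fallback to a retained past size estimate discussed in the surrounding text:

```python
import math

def bootstrap_parameters(observed=None, past_estimate=None):
    """Sketch of the enrollment server's choice of alpha and delta.

    observed:      (alpha, delta) pairs reported by nodes in the current overlay
                   (or learned via the server's own diagnostic messages);
    past_estimate: an overlay size estimate retained from a previous run."""
    if observed:                          # prefer values from live overlay nodes
        alphas, deltas = zip(*observed)
        return round(sum(alphas) / len(alphas)), sum(deltas) / len(deltas)
    if past_estimate:                     # fall back to a stored size estimate
        return round(2 * math.log2(past_estimate)), 1.0 / past_estimate
    return 20, 1.0 / 1000                 # defaults for a newly formed overlay
```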
The enrollment server may also choose to use past overlay size estimates it possesses to bootstrap alpha and delta as 2*log2(N_estimated) and 1/N_estimated. For example, an enrollment server for a venue-based overlay that has been torn down can use past size estimates.

Note that the enrollment procedure in RELOAD already defines providing multiple node identities to the enrolling node. The only change needed is to include the values of alpha and delta in the simple enrollment request.

5.5. Extensions to Overlays with Heterogeneous Nodes

This load balancing solution can be extended to overlays whose nodes have heterogeneous capacities. Let the capacity of node v in the overlay be represented by Cv.

If an overlay has nodes with heterogeneous capacities, nodes in the overlay MUST use alpha = 2*Cv*log2(N) virtual servers per physical node. The locations of these (2*Cv*log2(N)) virtual nodes MUST be chosen near the primary node location vp such that the location of the ith virtual server ID is uniformly distributed in (vp - i*delta*2^m, vp - (i+1)*delta*2^m). Each physical node then maintains a total of O(Cv*log2(N)) neighbor entries and O(Cv*log2(N)) routing fingers. The imbalance factor for this scenario, defined as

                         Maximum # of data items
    imbalance factor = ---------------------------------
                        Cv * Average # of data items

can be shown to be close to 1.

Note that in addition to alpha and delta, the node capacity of the node MUST be passed to the enrollment server in the simple enrollment request.
This capacity value is used by the enrollment server to 704 calculate Cv as 706 node_capacity(node n) 707 Cv(node n) = ---------------------------------------- 708 sum_{k=1}^N node_capacity(node k) 710 In deployments with no enrollment servers, the node MUST estimate 711 Cv(node n) by obtaining its neighbors' node capacities and building 712 an estimate of average node capacity. 714 6. Stabilizing Fingers, Successors, and Predecessors in the Topology 715 Plug-in 717 6.1. Choice of Approach to Stabilization 719 There are two alternative approaches to stabilization: periodic and 720 reactive. Periodic stabilization can either use a fixed 721 stabilization rate or calculate the stabilization rate in an adaptive 722 fashion. 724 In reactive stabilization, a peer reacts to the loss of a peer in its 725 neighborhood set or to the appearance of a new peer that should be 726 added to its neighborhood set by sending a copy of its neighbor table 727 to all peers in the neighborhood set. Periodic recovery, in 728 contrast, takes place independently of changes in the neighborhood 729 set. In periodic recovery, a peer periodically shares its 730 neighborhood set with each of the members of that set. 732 The mandatory-to-implement Chord DHT algorithm in RELOAD [1] uses 733 reactive stabilization for the neighborhood set, unlike the original 734 Chord algorithm, which uses periodic stabilization. It has been 735 shown in [17] that reactive stabilization works well for small 736 neighborhood sets (i.e., small overlays) and moderate churn. 737 However, in large-scale (e.g., 1000 peers or more [17]) or high-churn 738 overlays, reactive stabilization runs the risk of creating a positive 739 feedback cycle, which can eventually result in congestion collapse. 740 In [17], it is shown that a 1000-peer overlay under churn uses 741 significantly less bandwidth and has lower latencies when periodic 742 stabilization is used than when reactive stabilization is used. 
Although in the experiments carried out in [17] reactive recovery performed well when there was no churn, its bandwidth use was observed to jump dramatically under churn. At higher churn rates and larger overlay scales, periodic stabilization uses less bandwidth, and the resulting lower contention for the network leads to lower latencies. For this reason, most DHTs, such as CAN, Chord, Pastry, and Bamboo, use periodic stabilization. As an example, the first version of Bamboo used reactive recovery, which caused Bamboo to suffer from degraded performance under churn. To fix this problem, Bamboo was modified to use periodic stabilization.

In Chord, periodic stabilization is typically done both for successors and fingers. An alternative strategy is analyzed in [18]. In this strategy, called the correction-on-change maintenance strategy, a peer periodically stabilizes its successors but does not do so for its fingers. Instead, finger pointers are stabilized in a reactive fashion. The results obtained in [18] imply that although the correction-on-change strategy works well when churn is low, periodic stabilization outperforms the correction-on-change strategy when churn is high.

Based on these insights, this document proposes to use periodic stabilization for fingers, successors, and predecessors. Each peer MUST maintain a stabilization timer. When the stabilization timer fires, the peer MUST restart the timer and carry out the stabilization operations. The stabilization routine is described next; more details on computing the stabilization interval are elaborated in Section 6.8.2.

6.2. Update Messages for Stabilization

The stabilization procedures are implemented using Update requests and answers. To describe the contents of these messages, the syntax defined in [1] is used.
A Chord Update request is defined as: 778 enum { reserved (0), notify(1), succ_stab(2), pred_stab(3), 779 full(4), virtualserver_stab_join(5), 780 virtualserver_stab_leave(6), (255) } 781 ChordUpdateType; 783 struct { 784 ChordUpdateType type; 785 NodeId sender_id; 787 select(type) { 788 case notify: 789 uint32 uptime; 790 NodeId sender_virtual_ids <0..2^16-1>; 791 case pred_stab: /* Empty */ 792 ; 793 case succ_stab: /* Empty */ 794 ; 795 case virtualserver_stab_join: 796 NodeId sender_virtual_ids <0..2^16-1>; 797 case virtualserver_stab_leave: 798 NodeId sender_virtual_ids <0..2^16-1>; 799 case full: 800 uint32 uptime; 801 uint32 alpha; 802 uint32 delta; 803 NodeId sender_virtual_ids <0..2^16-1>; 804 NodeId predecessors <0..2^16-1>; 805 NodeId successors <0..2^16-1>; 806 NodeId fingers <0..2^16-1>; 807 }; 808 } UpdateReq; 810 The "type" field MUST indicate the reason why the Update was sent: 812 notify: the sender of the Update wishes to notify the recipient of 813 the sender's existence. Upon receiving the Update, the recipient 814 SHOULD insert the sender into its routing table, if appropriate. 815 The 'notify' message MUST include the virtual server IDs of the 816 sender in the 'sender_virtual_ids' field. 818 succ_stab: the Update request is related to the successor 819 stabilization routine. 821 pred_stab: the Update request is related to the predecessor 822 stabilization routine. 824 virtualserver_stab_join and virtualserver_stab_leave: the Update 825 request is related to the DHT parameter stabilization routine. 827 full: the Update request contains the entire routing and neighbor 828 table of the sender. 830 The sender_id field contains the sender's primary virtual ID. 832 The sender_virtual_ids contains the list of all secondary virtual 833 server IDs belonging to the sender. 835 If the type of the Update request is 'pred_stab' or 'succ_stab', the 836 request MUST NOT carry any additional information. 
If the type of the Update request is 'notify', the request MUST contain the sender's current uptime in seconds and the locations of the sender's current virtual server IDs.

If the type of the request is 'virtualserver_stab_join' or 'virtualserver_stab_leave', the contents of the message MUST include the list of primary and secondary virtual server IDs of the sender.

If the type of the request is 'full', the contents of the message MUST be:

o uptime: The sender's current uptime in seconds;

o alpha: The sender's current value of the DHT parameter alpha;

o delta: The sender's current value of the DHT parameter delta;

o sender_virtual_ids: The sender's list of current virtual server IDs;

o predecessors: The sender's predecessor list;

o successors: The sender's successor list;

o fingers: The sender's finger table.

In the introduced topology plug-in, each peer independently decides the appropriate size for the successor list, predecessor list, finger table, and system parameters (alpha and delta). Thus, the 'predecessors', 'successors', and 'fingers' fields are of variable length. The number of virtual IDs assigned to a node, along with the spacing between the virtual IDs, is chosen based on the system parameters and may change as the network changes. Accordingly, the 'sender_virtual_ids' field is also of variable length and includes the list of virtual server IDs assigned to the sender. As specified in RELOAD [1], variable-length fields are preceded on the wire by length bytes. In the case of the successor list, predecessor list, sender_virtual_ids, and finger table, there are two length bytes (allowing lengths up to 2^16-1). The number of NodeId structures included in each field can be calculated from the length bytes, since the size of a single NodeId structure is 16 bytes.
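As an illustration, decoding one such variable-length NodeId field could look like this. The sketch is non-normative; it assumes only what the text above states (a two-byte length prefix in network byte order followed by 16-byte NodeId structures):

```python
import struct

NODE_ID_LEN = 16   # a single NodeId structure is 16 bytes

def decode_node_id_list(buf, offset=0):
    """Decode one variable-length NodeId field: a 2-byte length prefix
    followed by length/16 NodeId structures."""
    (length,) = struct.unpack_from("!H", buf, offset)   # network byte order
    if length % NODE_ID_LEN != 0:
        raise ValueError("length is not a multiple of the NodeId size")
    offset += 2
    ids = [buf[offset + k * NODE_ID_LEN : offset + (k + 1) * NODE_ID_LEN]
           for k in range(length // NODE_ID_LEN)]
    return ids, offset + length          # decoded IDs and next field's offset
```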
If a peer receives more entries than 881 fit into its successor list, predecessor list or finger table, the 882 peer SHOULD ignore the extra entries. If the peer is assigned more 883 virtual IDs than fit into its ID list, it SHOULD reject the 884 assignment. If a peer receives fewer entries than it currently has 885 in its own data structure, the peer SHOULD NOT drop the extra entries 886 from its data structure. 888 If the Update request succeeds, the responding peer sends an 889 UpdateAns message, which is defined as: 891 enum { reserved (0), notify(1), succ_stab(2), pred_stab(3), 892 full(4), virtualserver_stab_join(5), 893 virtualserver_stab_leave(6), (255) } 894 ChordUpdateType; 896 struct { 897 ChordUpdateType type; 899 select(type) { 900 case full: /* Empty */ 901 ; 902 case virtualserver_stab_join: /* Empty */ 903 ; 904 case virtualserver_stab_leave: /* Empty */ 905 ; 906 case notify: 907 uint32 uptime; 908 case pred_stab: 909 NodeId predecessors <0..2^16-1>; 910 case succ_stab: 911 NodeId predecessors <0..2^16-1>; 912 NodeId successors <0..2^16-1>; 913 }; 915 } UpdateAns; 917 If the type of the Update answer is 'full', 918 'virtualserver_stab_join', or 'virtualserver_stab_leave', the answer 919 MUST NOT carry any additional information. If the type is 'notify', 920 the answer MUST contain the sender's current uptime in seconds. If 921 the type is 'pred_stab', the answer SHOULD carry the predecessor list 922 of the responding peer. If the type is 'succ_stab', the answer 923 SHOULD include the predecessor and successor lists of the responding 924 peer. 926 6.3. Finger Stabilization 928 The purpose of the finger stabilization procedure is to incorporate 929 new peers into the finger table. In the procedure, peer v MUST 930 maintain a counter 'next', which stores the index of the next finger 931 that should be stabilized. The counter MUST be initialized to value 932 one and it MUST be incremented by one after each finger stabilization 933 procedure. 
When the stabilization timer fires, peer v MUST choose one finger interval i from the set of finger_table_size finger intervals it maintains:

   i = next % (finger_table_size + 1),

and send a Probe request addressed to the first identifier belonging to the chosen finger interval i. The peer f responding to the Probe request SHOULD become the ith finger of v. Peer v SHOULD send an Attach request to peer f to initiate a new connection to it.

This document defines a new ProbeInformationType value, 'uptime'. When this value is present in the requested_info field of a Probe request, it indicates that the receiver MUST include its current uptime, in a ProbeInformation structure, in the Probe response. A Probe request that is sent as part of the finger stabilization procedure MUST contain the 'uptime' ProbeInformationType in its requested_info field. The extended ProbeInformation structure that is returned in the Probe response is defined as:

   enum { responsible_set(1), num_resources(2), uptime(3),
          (255) }
   ProbeInformationType;

   struct {
     ProbeInformationType type;

     select (type) {
       case responsible_set:
         uint32 responsible_ppb;

       case num_resources:
         uint32 num_resources;

       case uptime:
         uint32 uptime;
     };
   } ProbeInformation;

The types "responsible_set" and "num_resources" have been specified in RELOAD [1]. The "uptime" type is new and contains the sender's current uptime in seconds.

6.3.1. Locality-aware Finger Selection

Making progress in the identifier space can be expensive in terms of network latency. The successor and predecessor lists can be used to optimize network latency by relaxing the requirement for finger selection. Specifically, for each finger table entry, a node (say v) first determines a node n that matches the identifier of the finger. It then retrieves the successors and predecessors of n from n.
Node v then PINGs the successors and predecessors of node n and chooses the topologically closest node among these for the finger table entry. The sizes of the successor and predecessor lists have an impact on network latency; the greater the number of successors and predecessors, the higher the probability of finding a topologically close finger table entry. Our simulations of the basic Chord protocol with just three successors and three predecessors show a reduction of close to 41-46% in DHT lookup delay. An independent simulation study reported in [19] confirms our results, showing 31% to 40% lookup stretch reductions using 2^16 nodes and Euclidean and transit-stub models. We expect lookup delay to improve further in the proposed topology plug-in with its log2(N) successors.

6.4. Successor Stabilization

Both the primary virtual IDs and the secondary virtual IDs of the nodes are stored as part of the successor and predecessor tables.

In the successor stabilization routine, a peer v asks the peer s that is the first entry in its successor table for the virtual ID of s's first predecessor p. If the successor's first predecessor pointer does not point to v's virtual ID but instead to p (for instance, because p has joined the overlay between v and s), the peer with virtual ID p should become v's first successor instead of s. Thus, v adds p to the front of its successor list and notifies p of v's existence, so that p can change its predecessor to v.

Successor lists are also stabilized as part of the successor stabilization routine. To do this, peer v copies the successor list of its successor s, removing the last entry and prepending s to it.
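The successor-list copy just described can be sketched as follows (non-normative; names are illustrative):

```python
def refresh_successor_list(succ_id, succ_successors, max_len):
    """Copy the successor's successor list, drop its clockwise-farthest
    (last) entry, and prepend the successor itself."""
    return ([succ_id] + succ_successors[:-1])[:max_len]
```

For example, if v's successor is 5 and 5's successor list is [6, 7, 8], v's refreshed list of length 3 becomes [5, 6, 7].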
If peer v, as a result of running the successor stabilization routine, notices that its successor has failed, it does the following:

o Using the virtual ID, s, of its successor, it looks up its virtual ID to physical ID mapping to identify which physical node, say S, has failed. Here, the term 'physical ID' refers to the primary virtual ID of the peer.

o The peer then uses the same virtual ID to physical ID mapping table to identify the locations of other virtual IDs in its successor list that correspond to the physical node S. These IDs are marked for replacement.

o The peer then replaces the successor with the first live entry in its successor list, say n, and contacts this node for its successor list. The peer synchronizes n's successor list with its own, removing the IDs that are marked for replacement. This step is repeated by contacting the other live entries, one after the other, until all the IDs marked for replacement have been updated.

The successor stabilization routine is executed when the stabilization timer fires. To begin the successor stabilization routine, peer v MUST send an Update request to its first successor s. The type of the Update request MUST be 'succ_stab'. Upon receiving the Update request, peer s MUST send an Update answer to peer v. The Update answer SHOULD include the successor and predecessor lists of peer s. If v learns from the predecessor and successor lists included in the answer that new peers should be included in its neighborhood set, v MUST send Attach requests to the new peers. Once a direct connection has been established with each new peer as a result of the Attach procedure, peer v MUST send an Update request of type 'notify' to each new peer. This allows the new peers to insert v into their neighborhood sets.

6.5.
Predecessor Stabilization

The predecessor stabilization routine is executed when the stabilization timer fires. To begin the predecessor stabilization routine, a peer v MUST send an Update request to its predecessor p. The type of the Update request MUST be 'pred_stab'. Upon receiving the Update request, peer p MUST send an Update answer to peer v. The Update answer SHOULD include the predecessor list of peer p. Peer v SHOULD use the predecessor list carried in the answer to update its own predecessor list. If new peers are inserted into the predecessor list, peer v MUST send Attach requests and Update requests of type 'notify' to the new peers in the same way as during the successor stabilization routine.

6.6. Joining the Overlay

The process of joining an overlay is as follows:

1. The Joining Peer (JP) SHOULD connect to a bootstrap peer.

2. The JP SHOULD send an Attach request to the bootstrap peer, which SHOULD route the request towards the Admitting Peer (AP). Here, the AP is the node that is the successor of the JP's primary virtual server. Once the Attach procedure is finished, there is a direct connection between the JP and the AP.

3. The JP SHOULD send a Join request to the AP. The AP returns a Join answer.

4. The AP MUST send an Update request of type 'full' to the JP. The Update request SHOULD contain the contents of the AP's routing table. The JP SHOULD use the contents of the Update request to initialize its finger table and DHT parameters, i.e., alpha and delta. The JP SHOULD set the sizes of its successor list, predecessor list, and finger table to the same values that the AP uses. The values of alpha and delta are also set to the ones that the AP uses. If an enrollment server is present in the overlay, the information about the choice of DHT parameters, alpha and delta, can be obtained from it directly.

5.
The JP then chooses the virtual server locations, vs_i, for i = 2, 3, ..., alpha. This step is not performed if there is an enrollment server in the overlay; in that scenario, the locations of the virtual servers are obtained from the enrollment server a priori.

6. The JP SHOULD send an Attach request to the successor of vs_alpha; call it n_alpha. The node returns an Attach answer.

7. The peer n_alpha MUST send an Update request of type 'full' to the JP. The Update request SHOULD contain the contents of the node's successor list. The JP SHOULD use the contents of the Update request to initialize its successor and predecessor lists.

8. The JP MUST send Attach requests to initiate connections to each of the peers in its predecessor list, successor list, and finger table. Since the JP is already connected to the AP and n_alpha, there is no need to send new Attach requests to these nodes.

9. The JP MUST send an Update request of type 'notify' to each of the peers in its predecessor and successor lists (except for the AP and n_alpha, which are already aware of the JP).

10. The JP MUST send a Probe request carrying the 'uptime' ProbeInformationType value in the requested_info field to each of its fingers. This way, the JP learns the uptimes of its fingers (the uptimes of predecessors and successors are learned from the Update responses in the previous step). The uptimes are needed when estimating the join rate of peers in the overlay. It should be noted that these Probe requests are not routed via the overlay but are sent over a direct connection.

11. Now, the JP takes ownership of regions in the overlay based on its virtual server locations, vs_i. In this step, the JP MUST send one Update message to the successor of each vs_i for all i. The type of the Update message MUST be set to 'virtualserver_stab_join'.
The peer, say n_i, receiving this message makes a note of this change and locally updates its ownership locations. Peer n_i then sends an UpdateAns message with type 'virtualserver_stab_join' to the JP acknowledging the change. Peer n_i SHOULD then issue a series of Store requests to the JP to transfer ownership of the resources. This step is repeated for all i = alpha, alpha-1, ..., 1. At the end of this step, all the new virtual servers corresponding to the JP have joined the overlay, and the JP stores the data for these virtual locations.

6.6.1. Contents of the Join Message

This topology plug-in extends the Join request defined in RELOAD [1]. In addition to the joining_peer_id, a node MAY choose to send its virtual server locations as part of the Join message if they were obtained a priori from the enrollment server as part of the enrollment process. The JoinReq message is defined as:

   struct {
     NodeId joining_peer_id;
     NodeId joining_peer_virtual_server_ids <0..2^16-1>;
     opaque overlay_specific_data <0..2^16-1>;
   } JoinReq;

The JoinReq contains the Node-ID that the sending peer wishes to assume and MAY include other virtual server locations that the joining peer wishes to occupy.

If the request succeeds, the responding peer responds with a JoinAns message; this is defined as in RELOAD.

6.7. Leaving the Overlay

The process of leaving the overlay is as follows:

1. If no replication is being performed in the overlay, the leaving peer SHOULD issue a series of Store requests to the successor of each of its virtual servers, vs_i, to transfer the ownership of the resource records it is storing. Note that if replication is being used, the successor of the peer is already storing replicas and the amount of data transferred can be minimized.

2.
The leaving peer MUST send a Leave request to the predecessor and successor of each virtual server, vs_i. The Leave request that is sent to vs_i's successor SHOULD contain the predecessor list of the leaving peer. The Leave request that is sent to vs_i's predecessor SHOULD contain the successor list of the leaving peer. The first successor SHOULD use the predecessor list carried in the Leave request to update its own predecessor list. The first predecessor SHOULD use the successor list carried in the Leave request to update its own successor list.

6.7.1. Contents of the Leave Message

This topology plug-in extends the Leave request defined in RELOAD [1]. In addition to the leaving_peer_id, a node MUST send its virtual server locations as part of the Leave message, as defined below:

   struct {
     NodeId leaving_peer_id;
     NodeId leaving_peer_virtual_server_ids <0..2^16-1>;
     opaque overlay_specific_data <0..2^16-1>;
   } LeaveReq;

The overlay_specific_data field of the Leave request MUST contain a ChordLeaveData structure:

   enum { reserved (0),
          from_succ(1), from_pred(2), (255) }
   ChordLeaveType;

   struct {
     ChordLeaveType type;

     select(type) {
       case from_succ:
         NodeId successors <0..2^16-1>;
       case from_pred:
         NodeId predecessors <0..2^16-1>;
     };
   } ChordLeaveData;

The 'type' field indicates whether the Leave request was sent by a predecessor or a successor of the recipient:

from_succ

   The Leave request was sent by a successor.

from_pred

   The Leave request was sent by a predecessor.

If the type of the request is 'from_succ', the contents will include the sender's successor list.

If the type of the request is 'from_pred', the contents will include the sender's predecessor list.

6.8.
Self Tuning System Parameters

This section describes how the system parameters for load balancing and stabilization adapt to changing network conditions.

6.8.1. Self Tuning Load Balancing Algorithm Parameters

All DHTs require constant updating as the size of the network grows. For example, in the case of Chord, each node starts out with m entries in the finger table (if node IDs take values from 0 to 2^m-1). When the number of nodes in the network is small, some of these m entries collapse to the same node. As the number of nodes in the network increases, the finger table entries come to point to distinct nodes, and the number of fingers grows as O(log2(N)). Thus, the number of fingers that each node has to maintain grows as O(log2(N)) as N increases.

A similar update step is required by the solution in this document as well, because the system parameters, alpha and delta, also depend on the value of N. This section describes how nodes MUST update the values of alpha, delta, and the locations of their virtual servers as the network changes in size. This document proposes to update the system values each time the network size doubles or halves.

In addition to finger, successor, and predecessor stabilization, each peer also needs to perform DHT parameter stabilization. Unlike the other stabilization routines, which are performed periodically when the stabilization timer fires, the parameter stabilization routine is performed whenever the number of fingers is observed to change. Performing stabilization in this way is sufficient for updating the load balancing parameters (i.e., alpha and delta) because this is not the highest priority update and the overlay would function effectively even without it. In the proposed topology plug-in, we employ lazy updates for stabilizing the parameters alpha and delta.
More specifically, we use a change in the number of fingers as the
trigger to perform a DHT parameter update.

The parameter stabilization routine is executed when the number of
fingers in a node changes.  We consider two possible scenarios:

1.  The number of connections decreases: Let f be the ID of the lost
    connection.  The peer v MUST first send a request towards the ID
    of the lost connection to see whether any node responds.  Let n
    be the ID of the node responding to the request.  If n is the
    same as f, the peer MUST re-connect to f.  If n is not the same
    as f and does not correspond to any node that is already in the
    finger table of v, then v MUST send an Attach request to n.  In
    these two cases, the parameter stabilization routine is not
    performed, since the number of fingers does not change.  On the
    other hand, if n is not the same as f and corresponds to some
    other node already in the finger table of v, then v concludes
    that the network size has decreased and invokes the parameter
    stabilization routine.

2.  The number of connections increases: In this case, v invokes the
    parameter stabilization routine directly.

When the parameter stabilization routine is invoked, the following
steps are performed:

o  Step 1: The (alpha-1) virtual servers corresponding to v leave the
   overlay.

o  Step 2: The values of alpha and delta are updated and the new
   values are obtained.  The corresponding virtual server locations
   are also modified to obtain the new vs_i.

o  Step 3: The node v joins the overlay at the new virtual server
   locations, vs_i.

o  Step 4: The successors and predecessors of v are updated.

During the parameter stabilization routine, the portion of the data
space owned by the corresponding physical node changes, and so data
needs to be moved to match this change.
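The scenario-1 decision logic above can be sketched as follows.  This is an illustrative Python rendition, not part of the protocol; `probe` is a hypothetical helper, and the no-response case, which the draft leaves unspecified, is assumed here to fall through to parameter stabilization:

```python
def on_connection_lost(finger_ids, f, probe):
    """Sketch of the scenario-1 decision logic.

    finger_ids: IDs currently in v's finger table.
    f:          ID of the lost finger connection.
    probe(f):   hypothetical helper returning the ID of the node that
                now answers for f's position, or None if none responds.
    """
    n = probe(f)
    if n == f:
        # f was only temporarily unreachable: re-connect to it.
        return "reconnect"
    if n is not None and n not in finger_ids:
        # A previously unknown node answers: send it an Attach request.
        return "attach"
    # The responder is already a finger: the network has shrunk, so
    # the parameter stabilization routine is invoked.
    return "param_stabilize"

# Hypothetical finger table of v and probe outcomes:
fingers = {10, 20, 40}
assert on_connection_lost(fingers, 80, lambda f: 80) == "reconnect"
assert on_connection_lost(fingers, 80, lambda f: 95) == "attach"
assert on_connection_lost(fingers, 80, lambda f: 40) == "param_stabilize"
```

In the first two outcomes the finger count is unchanged, which is why parameter stabilization is skipped there.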
The locations of successors and predecessors also change to adjust
to the updates in the virtual server locations.  However, the
fingers do not change, because these are associated with the primary
virtual ID and the location of the primary node-ID is fixed.

In Step 1 of the stabilization routine (see above), the peer v MUST
send one Update message to the successor of each virtual server,
vs_i.  The type of the Update message MUST be set to
'virtualserver_stab_leave'.  The peer n_i receiving this message
makes a note of this change and locally updates its ownership
locations.  Peer n_i then sends an UpdateAns message with type
'virtualserver_stab_leave' to vs_i acknowledging the change.  On
receiving the UpdateAns message, peer v SHOULD issue a series of
Store requests to n_i to transfer ownership of the resources.  This
step is repeated for all i = 2, ..., alpha.  At the end of this
step, all the virtual servers corresponding to v have left the
overlay and peer v no longer stores the data corresponding to these
virtual locations.

In Step 2, the peer v obtains its new virtual server locations,
vs_i.  In the presence of an enrollment server, the peer v MUST
request a new set of identities from the enrollment server and stop
participating in the overlay network with the old node identities.
The enrollment server MAY choose to implement diagnostics using the
mechanisms in [3] to ascertain that the node requesting identities
is not requesting more or fewer identities than it should, based on
the finger table sizes of a random sampling of other nodes in the
overlay.

In the absence of an enrollment server, the following procedure is
used to obtain the new values of alpha, delta, and vs_i.

   UpdateAlphaDelta()
   {
     If (number_of_fingers increases by 1) {
       // Can be detected, as in the case of Chord, when finger
       // tables are updated.
       // This implies that the network size has about doubled.

       delta = delta/2;   // update delta

       // Update the existing virtual server locations.
       // vs_1 = vp, as the location of the primary virtual server
       // does not change.
       for i = 2 to alpha
         vs_i = vp - (vp - vs_i)/2;

       // Choose two new virtual server locations, one in each of
       // the ranges (vp - (alpha+1)*delta*2^m, vp - alpha*delta*2^m)
       // and (vp - (alpha+2)*delta*2^m, vp - (alpha+1)*delta*2^m).
       vs_(alpha+1) = random (vp - alpha*delta*2^m,
                              vp - (alpha+1)*delta*2^m);
       vs_(alpha+2) = random (vp - (alpha+1)*delta*2^m,
                              vp - (alpha+2)*delta*2^m);
       alpha = alpha + 2;
     }

     If (number_of_fingers decreases by 1) {
       // Can be detected, as in the case of Chord, when finger
       // tables are updated.  This implies that the network size
       // has about halved.

       // Leave the network with the two outermost virtual IDs.
       Remove vs_(alpha-1);
       Remove vs_(alpha);

       delta = delta * 2;   // update delta
       alpha = alpha - 2;   // update alpha

       // Update the existing virtual server locations.
       // vs_1 = vp, as the location of the primary virtual server
       // does not change.
       for i = 2 to alpha
         vs_i = vp - (vp - vs_i) * 2;
     }
   }

Once the locations of the virtual servers have been obtained, the
peer re-joins the overlay at these positions.  In Step 3, the peer v
MUST send one Update message to the successor of each vs_i.  The
type of the Update message MUST be set to 'virtualserver_stab_join'.
The peer n_i receiving this message makes a note of this change and
locally updates its ownership locations.  Peer n_i then sends an
UpdateAns message with type 'virtualserver_stab_join' to v
acknowledging the change.  Peer n_i SHOULD then issue a series of
Store requests to v to transfer ownership of the resources.  In this
process, the nodes n_i are added to the successor list of v.  This
step is repeated for all i = alpha, alpha-1, ..., 2.
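The UpdateAlphaDelta() routine can be rendered in Python as a sketch.  The function names and the unwrapped arithmetic (no mod 2^m) are illustrative assumptions made for readability:

```python
import random

def update_alpha_delta(vp, vs, alpha, delta, m, grew, rng):
    """Sketch of UpdateAlphaDelta() from the pseudocode above.

    vp:    primary virtual server location (vs[0] == vp).
    vs:    the list [vs_1, ..., vs_alpha].
    grew:  True if the finger count increased by one (network roughly
           doubled), False if it decreased (roughly halved).
    Identifier wraparound (mod 2^m) is omitted for clarity."""
    if grew:
        delta = delta / 2  # update delta
        # Pull the existing secondary servers halfway towards vp;
        # vs_1 = vp does not change.
        vs = [vp] + [vp - (vp - v) / 2 for v in vs[1:]]
        # Choose two new virtual server locations, one per newly
        # opened range below the existing servers.
        vs.append(rng.uniform(vp - (alpha + 1) * delta * 2**m,
                              vp - alpha * delta * 2**m))
        vs.append(rng.uniform(vp - (alpha + 2) * delta * 2**m,
                              vp - (alpha + 1) * delta * 2**m))
        alpha = alpha + 2
    else:
        # Leave the network with the two outermost virtual IDs.
        vs = vs[:-2]
        delta = delta * 2  # update delta
        alpha = alpha - 2  # update alpha
        # Spread the remaining secondary servers back out.
        vs = [vp] + [vp - (vp - v) * 2 for v in vs[1:]]
    return vs, alpha, delta
```

Note that a grow step followed by a shrink step restores the original secondary server locations, since halving and then doubling the distances to vp is exact.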
At the end of this step, all the new virtual servers corresponding
to v have joined the overlay, and peer v stores the data for these
virtual locations.

If replication is performed in the overlay, some of the data may
already be present at node v, which would reduce the number of Store
requests.

Successor and predecessor stabilization routines are invoked in Step
4 as described in Section 6.4 and Section 6.5, respectively, and
this completes the parameter stabilization routine.

6.8.2.  Self Tuning the Stabilization Interval

To ensure that lookups produce correct results as the set of
participating peers changes, and to ensure that all peers'
connections remain up to date, each peer MUST run a stabilization
protocol periodically in the background.  The stabilization protocol
uses three operations: finger stabilization, successor
stabilization, and predecessor stabilization, as defined earlier.
Each peer MUST maintain a stabilization timer.  When the
stabilization timer fires, the peer MUST restart the timer and carry
out the stabilization operations.  In this section, we present
methods to compute the stabilization timer.

When periodic stabilization is used, one faces the problem of
selecting an appropriate execution rate for the stabilization
procedure.  If the execution rate of periodic stabilization is high,
changes in the system can be quickly detected, but at the cost of
increased communication overhead.  On the other hand, if the
stabilization rate is low and the churn rate is high, routing tables
become inaccurate and DHT performance deteriorates.  Thus, the
problem is to set the parameters so that the overlay achieves the
desired reliability and performance even in challenging conditions,
such as under heavy churn.
Provisioning statically for challenging conditions naturally results
in high cost during periods when the churn level is lower than
expected, or alternatively, in poor performance or even network
partitioning in worse-than-expected conditions.

The current approach is to configure overlays statically.  This
works only in situations where perfect information about the future
is available.  In situations where the operating conditions of the
network are known in advance and remain static throughout the
lifetime of the system, it is possible to choose fixed optimal
values for parameters such as the stabilization rate, neighborhood
set size, and routing table size.  However, if the operating
conditions (e.g., the size of the overlay and its churn rate) do not
remain static but evolve with time, it is not possible to achieve
both a low lookup failure rate and a low communication overhead by
using fixed parameters [20].

As an example, to configure the Chord DHT algorithm, one needs to
select appropriate values for the size of the successor list and the
stabilization interval.  To select an appropriate value for the
stabilization interval, one needs to know the expected churn rate
and overlay size.  According to [21], a Chord network in a ring-like
state remains in a ring-like state as long as peers send
Omega(log2^2(N)) messages before N new peers join or N/2 peers fail.
Thus, in a 500-peer overlay churning at a rate such that one peer
joins and one peer leaves the network every 30 seconds, an
appropriate stabilization interval would be on the order of 93s.
According to [4], the size of the successor list should be on the
order of log2(N).  Having a successor list of size O(log2(N)) makes
it unlikely that a peer will lose all of its successors, which would
cause the Chord ring to become disconnected.  Thus, in a 500-peer
network, each peer should maintain on the order of nine successors.
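The 500-peer example can be checked numerically.  This is a sketch that applies the Omega(log2^2(N)) message bound from [21] and the O(log2(N)) successor list size from [4]:

```python
import math

# 500-peer overlay, one join and one leave every 30 s.  With one peer
# leaving every 30 s, N/2 = 250 peers fail in 250 * 30 = 7500 s; per
# [21], each peer should send Omega(log2^2(N)) stabilization messages
# within that time.
N = 500
leave_interval_s = 30.0

t_half_fail = (N / 2) * leave_interval_s      # 7500 s
interval = t_half_fail / math.log2(N) ** 2    # stabilization interval
successors = math.log2(N)                     # successor list size [4]

print(round(interval), round(successors))     # prints: 93 9
```

This reproduces the 93-second interval and the nine-successor list quoted above.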
However, if the churn rate doubles while the network size remains
unchanged, the stabilization rate should double as well.  That is,
the appropriate maintenance interval would now be on the order of
46s.  On the other hand, if the churn rate increases, e.g.,
six-fold, and the size of the network grows to 2000 peers, on the
order of eleven successors should be maintained and the
stabilization interval should be on the order of 42s.  If one
continued using the old values, this could result in inaccurate
routing tables, network partitioning, and deteriorating performance.

The proposed topology plug-in takes the continuous evolution of
network conditions into consideration and adapts to it.  Each peer
collects statistical data about the network and adaptively adjusts
its stabilization rate and successor list size based on an analysis
of the data [20].  Reference [22] shows that by using a self-tuning
mechanism, it is possible to achieve high reliability and
performance even in adverse conditions with a low maintenance cost.
Adaptive stabilization has been shown to outperform periodic
stabilization in terms of both lookup failures and communication
overhead [20].

The following sub-sections specify methods to determine the
appropriate stabilization rate in an adaptive fashion.  The proposed
mechanism is based on [20], [21], and [22].  To calculate an
appropriate stabilization rate, the values of three parameters MUST
be estimated: overlay size N, failure rate U, and join rate L.
Peers in the overlay MUST re-calculate the values of these
parameters to self-tune the algorithm at the end of each
stabilization period, before re-starting the stabilization timer.

6.8.2.1.  Estimating Overlay Size

Techniques for estimating the size of an overlay network have been
proposed, for instance, in [20], [22], [23], [24], and [25].
In Chord, the density of peer identifiers in the successor set can
be used to produce an estimate of the size of the overlay, N [22].
Since peer identifiers are picked randomly with uniform probability
from the m-bit identifier space, the average distance between peer
identifiers in the successor set is (2^m)/N.

To estimate the overlay network size, a peer MUST compute the
average inter-peer distance d between the successive peers, starting
from the most distant predecessor and ending at the most distant
successor in the successor list.  The estimated network size MUST be
calculated as:

            2^m
   N  =  ---------
             d

This estimate has been found to be accurate within 15% of the real
network size [20].  Naturally, the size of the neighborhood set
affects the accuracy of the estimate.

When a peer joins the network, the admitting peer sends the joining
peer a copy of its neighborhood set.  Thus, a joining peer
immediately has enough information at its disposal to calculate an
estimate of the network size.

6.8.2.2.  Estimating Failure Rate

A typical approach is to assume that peers join the overlay
according to a Poisson process with rate L and leave according to a
Poisson process with rate parameter U [22].  The value of U can be
estimated using peer failures in the finger table and neighborhood
set [22].  If peers fail with rate U, a peer with M unique peer
identifiers in its routing table should observe K failures in time
K/(M*U).  Every peer in the overlay MUST maintain a history of the
last K failures.  The current time MUST be inserted into the history
when the peer joins the overlay.
The estimate of U MUST be calculated as:

              k
   U  =  ----------,
           M * Tk

where M is the number of unique peer identifiers in the routing
table, Tk is the time between the first and the last failure in the
history, and k is the number of failures in the history.  If k is
smaller than K, the estimate is computed as if there were a failure
at the current time.  It has been shown that an estimate calculated
in a similar manner is accurate within 17% of the real value of U
[20].

The size of the failure history, K, affects the accuracy of the
estimate of U.  One can increase the accuracy by increasing K.
However, this has the side effect of decreasing responsiveness to
changes in the failure rate.  On the other hand, a small history
size may cause a peer to overreact each time a new failure occurs.
In [20], K is set to 25% of the routing table size.

6.8.2.2.1.  Estimating Join Rate

Reference [20] proposes that a peer can estimate the peer join rate
based on the uptimes of the peers in its routing table.  An increase
in the peer join rate will be reflected by a decrease in the average
age of the peers in the routing table.  Thus, each peer MUST
maintain an array of the ages of the peers in its routing table,
sorted in increasing order.  Using this information, an estimate of
the global peer join rate, L, MUST be calculated as:

           N          1
   L  =  ---  *  ---------------,
           4      Ages[rsize/4]

where Ages is an array containing the ages of the peers in the
routing table sorted in increasing order and rsize is the size of
the routing table.  Only the ages of the 25% youngest peers in the
routing table SHOULD be used, to reduce the bias that a small number
of peers with very old ages can cause [20].  It has been shown that
the estimate obtained by using this method is accurate within 22% of
the real join rate [20].
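The three estimators can be sketched as follows.  This is illustrative Python; identifier wraparound in the size estimate and the k &lt; K correction in the failure-rate estimate are omitted for brevity:

```python
def estimate_network_size(neighbor_ids, m):
    """N = 2^m / d, where d is the average distance between successive
    peer IDs in the neighborhood set (Section 6.8.2.1).  Identifier
    wraparound is ignored here for brevity."""
    ids = sorted(neighbor_ids)
    gaps = [b - a for a, b in zip(ids, ids[1:])]
    d = sum(gaps) / len(gaps)
    return 2**m / d

def estimate_failure_rate(failure_times, M):
    """U = k / (M * Tk): k observed failures spread over Tk seconds,
    with M unique peer IDs in the routing table (Section 6.8.2.2).
    The k < K correction is omitted."""
    k = len(failure_times)
    Tk = max(failure_times) - min(failure_times)
    return k / (M * Tk)

def estimate_join_rate(ages, N):
    """L = (N/4) * 1/Ages[rsize/4], using only the youngest 25% of the
    routing-table entries (Section 6.8.2.2.1)."""
    Ages = sorted(ages)
    rsize = len(Ages)
    return (N / 4) / Ages[rsize // 4]
```

For example, five neighborhood IDs spaced 100 apart in a 2^10 identifier space yield an estimate of about 10 peers, matching d = (2^m)/N.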
Naturally, the size of the routing table affects the accuracy.

In order for this mechanism to work, peers need to exchange
information about the time they have been present in the overlay.
Peers learn the uptimes of their successors and predecessors when
adding the successors and predecessors to their routing tables,
since Update requests and answers of type 'notify' carry uptime
values.  Peers learn the uptimes of their fingers because the Probe
responses sent as part of the finger stabilization routine carry
uptime values.  A joining peer learns the admitting peer's uptime
since an Update request of type 'full' contains uptime information.

6.8.2.2.2.  Calculating the Stabilization Interval

According to [21], a Chord network in a ring-like state remains in a
ring-like state as long as peers send Omega(log2^2(N)) messages
before N new peers join or N/2 peers fail.  We can use the estimate
of the peer failure rate, U, to calculate the time Tf in which N/2
peers fail:

             1
   Tf  =  --------
            2*U

Based on this estimate, a stabilization interval Tstab-1 is
calculated as:

                   Tf
   Tstab-1  =  -----------
                log2^2(N)

Further, the estimated join rate L can be used to calculate the time
in which N new peers join the overlay.  Based on the estimate of L,
a stabilization interval Tstab-2 is calculated as:

                      N
   Tstab-2  =  ---------------
                L * log2^2(N)

Finally, the actual stabilization interval Tstab that SHOULD be used
is obtained by taking the minimum of Tstab-1 and Tstab-2.

The results obtained in [26] indicate that making the stabilization
interval too small has the effect of making the overlay less stable
(e.g., in terms of detected loops and path failures).  Thus, a lower
limit should be used for the stabilization period.
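Putting the estimates together, the interval calculation can be sketched as follows (the 15-second floor applied here is the lower limit this document proposes based on [26]):

```python
import math

def stabilization_interval(N, U, L, floor_s=15.0):
    """Tstab = min(Tstab-1, Tstab-2), clamped to a lower limit.

    N: estimated overlay size
    U: per-peer failure rate (failures per second per peer)
    L: global join rate (joins per second)"""
    log2_sq = math.log2(N) ** 2
    t_fail = 1.0 / (2.0 * U)              # Tf: time for N/2 peers to fail
    tstab1 = t_fail / log2_sq             # Tstab-1
    tstab2 = N / (L * log2_sq)            # Tstab-2
    return max(min(tstab1, tstab2), floor_s)

# 500-peer overlay, one join and one leave every 30 s:
# U = (1/30)/500 per peer per second, L = 1/30 joins per second.
print(round(stabilization_interval(500, 1 / 15000, 1 / 30)))  # prints: 93
```

With these inputs, Tstab-1 (about 93 s) is the smaller of the two intervals, matching the 500-peer example in this section.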
Based on the results in [26], a lower limit of 15s is proposed,
since using a stabilization period smaller than this would, with
high probability, cause too much traffic in the overlay.

7.  Security Considerations

One important concern for the virtual servers approach is that it is
not easily enforceable.  Given a set of virtual server IDs for a
physical node, it must be ensured that the physical node actually
takes ownership of all the virtual server IDs assigned to it.  In
the presence of a centralized enrollment server, this server can
ensure that each physical node gets its fair share of virtual
servers.  In the absence of an enrollment server, the verification
can be performed during the join process.  For instance, in the case
of Chord, when a new node joins the overlay, it contacts its
neighbors to obtain neighbor relations and finger table entries.  A
similar approach can be adopted in the case of the virtual servers
approach.  In this scenario, each physical node provides its
neighbors with a list of 2*log2(N) virtual server IDs.  The
neighbors first verify that the number of virtual server locations
received is close to the number of virtual server locations that
they themselves own before sending the finger table entries to the
joining node.  In this way, it can be ensured that each node has
O(log2(N)) virtual locations in the overlay.

8.  IANA Considerations

(a)  A new overlay algorithm type should be defined for the proposed
     new topology plug-in.

(b)  This document defines one new Probe Information Type value:

     +-----------------+------+---------------+
     | Probe Option    | Code | Specification |
     +-----------------+------+---------------+
     | uptime          |  3   | RFC-AAAA      |
     +-----------------+------+---------------+

(c)  Other IANA considerations are TBD.

9.
Acknowledgments

This document benefited from design discussions with Vidya Narayanan
from Qualcomm Inc.

10.  Appendix

This appendix lists a few performance results for the load balancing
solution proposed in this document.

10.1.  Comparison of the Load Balancing Algorithm with Chord

Compared to the virtual markers approach in [7], the solution
proposed in this document does not require O(log2(N)) nodes to
change their location upon each node arrival.  For instance, in the
case of the virtual markers approach, if O(log2(N)) nodes change
location, then O(log2(N)/N) data objects need to be reassigned and
at least O(2 log2(N)) nodes are involved in this step.  This is in
addition to the messages required for updating the routing tables,
which is O(log2^3(N)).  In contrast to the virtual markers approach,
the proposed solution requires O(1/N) data objects to be reassigned,
and around O(log2(N)) nodes send data to the joining node.  The
number of routing messages required is still O(log2^2(N)), similar
to the case of Chord with no virtual servers; this is because the
number of connections is still O(log2(N)) links per node.

We performed simulation studies to compare the performance of the
solution in this document with Chord.  In order to study the
percentage of nodes with a significant load imbalance, we look at
the q-percentile load imbalance factor, defined as the ratio of the
q-percentile load to the average load.  For example, a 99-percentile
imbalance factor of 5 implies that less than 1 percent of the nodes
in the system have a load more than 5 times the average load.

We tested the solution in this document with N = 2^15 and compared
the results with Chord.  We found that the 99-percentile imbalance
factor for the solution in this document is around 2.
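The imbalance metric used here can be written down directly.  The nearest-rank percentile definition below is an assumption, since this document does not pin down the percentile method:

```python
import math

def imbalance_factor(loads, q):
    """q-percentile load imbalance factor: the q-percentile load
    divided by the average load (nearest-rank percentile assumed)."""
    ranked = sorted(loads)
    idx = math.ceil(q / 100 * len(ranked)) - 1
    return ranked[idx] / (sum(loads) / len(loads))

# 99 nodes with load 1 and one hot node with load 10: the
# 100-percentile factor captures the single hot node.
loads = [1] * 99 + [10]
print(round(imbalance_factor(loads, 100), 2))
```

A 99-percentile factor of 2, as measured for this document's solution, means that fewer than 1% of nodes carry more than twice the average load.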
In comparison, our results with Chord suggest that the
100-percentile imbalance factor is around 9, the 99-percentile
imbalance factor is around 4.5, and so on.  This result suggests
that in Chord around 1% of the nodes have an imbalance between 4.5
and 9, 2% have an imbalance greater than 4, and so on, compared to
the solution in this document, where less than 1% of the nodes have
an imbalance greater than 2.

With regard to routing, our simulations compared the average route
length for Chord (with one virtual server) with that of this
document's solution (with 2 log2(N) virtual servers); the results
showed that both of these DHT implementations require around the
same number of hops.

10.2.  Performance of the Load Balancing Algorithm as the Network
       Grows

In this section, we take a closer look at the performance of the
solution in this document as the network size grows, in terms of the
load imbalance factor and routing state maintenance.

(1)  Load Imbalance: In our simulations, we fix the value of the
     estimated network size to Nest = 2^9, and choose alpha = 18 and
     delta = 2^(-9).  The actual network size is then increased from
     2^2 to 2^14.  When N < Nest, the spacing between the virtual
     servers (chosen as 1/Nest) is very small, and the solution in
     this document behaves similarly to Chord.  In this case, the
     load imbalance factor is of the order of O(log2(N)).  However,
     since the numerical value of N is small, the imbalance factor
     is still low, around 3 for most nodes, as demonstrated by our
     simulations.  As the network size N increases, the spacing
     delta comes into effect, and the virtual servers help balance
     the load.  Our simulations indicate that the imbalance is
     around its lowest value of 2 when N = Nest, and increases
     slowly as N becomes larger than Nest.
(2)  Routing State: Here, we examine the performance of this
     solution with regard to the amount of routing state that it
     needs to maintain as the network grows.  We analyze the
     solution in this document analytically to determine the number
     of hops required to reach a destination as a function of N and
     Nest.  We perform the analysis in two steps:

     *  Step 1, routing to within a distance of 1/N of the
        destination: If nodes in this solution employ only their
        primary virtual servers (and the corresponding O(log2(N))
        connections) for routing, as in the case of Chord, it can be
        shown that any realization built upon the solution in this
        document would be able to get within a distance of 1/N of
        the destination node in O(log2(N)) hops with O(log2(N))
        fingers per node.  This result is unaffected by the relative
        values of N and Nest.

     *  Step 2, last mile: The number of hops required to route
        messages from within a distance of 1/N to the exact
        destination (referred to as the last-mile problem) depends
        upon the exact realization of the DHT and its construction
        of neighbor tables.  In this document, each node has alpha
        virtual servers, chosen uniformly at random, separated by a
        distance of O(delta*2^m).  Therefore, the number of nodes
        within a distance of 1/N from a chosen location would be of
        the order of O(log2(N) + N*log2(1 + s*alpha*delta)), where
        's' is a constant independent of N.  If alpha and delta are
        chosen using Nest, then for large Nest the number of nodes
        within a distance of 1/N can be approximated as being of the
        order of O(log2(N) + N*log2(Nest)/Nest).  Therefore, a
        random walk would lead to the final destination within
        O(log2(N) + N*log2(Nest)/Nest) hops.
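The hop bound above can be evaluated numerically to see the two regimes.  This sketch drops the hidden constant factors, so only the relative behavior is meaningful:

```python
import math

def last_mile_hops(N, Nest):
    """Evaluate the O(log2(N) + N*log2(Nest)/Nest) hop bound from the
    Step 2 analysis, with constant factors dropped."""
    return math.log2(N) + N * math.log2(Nest) / Nest

# With Nest = 2^9: for N <= Nest the second term stays small and the
# bound remains Chord-like (~log2(N)); for N >> Nest the linear term
# dominates, which is why alpha and delta must be updated as the
# network grows.
for N in (2**6, 2**9, 2**14):
    print(N, round(last_mile_hops(N, 2**9), 1))
```

At N = Nest = 2^9 the two terms are equal (both log2(Nest)), while at N = 2^14 the linear term is an order of magnitude larger, matching the O(N)-hops caveat discussed next.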
This result indicates that when N < Nest, the solution in this
document performs similarly to Chord, and messages can be routed to
any node in O(log2(N)) hops by maintaining O(log2(N)) fingers and
O(log2(N)) successors.  On the other hand, when N is much larger
than Nest, O(N) hops might be required to reach the destination if
the number of fingers per node is O(log2(N)).  However, this
situation arises only when the network size estimate is very much
lower than N and the values of alpha and delta are not updated as
the network grows.

11.  References

11.1.  Normative References

[1]   Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and H.
      Schulzrinne, "REsource LOcation And Discovery (RELOAD) Base
      Protocol", draft-ietf-p2psip-base-02 (work in progress),
      March 2009.

[2]   Bradner, S., "Key words for use in RFCs to Indicate
      Requirement Levels", BCP 14, RFC 2119, March 1997.

[3]   Yongchao, S., Jiang, X., Even, R., and D. Bryan, "P2PSIP
      Overlay Diagnostics", draft-ietf-p2psip-diagnostics-01 (work
      in progress), June 2009.

11.2.  Informative References

[4]   Stoica, I., Morris, R., Karger, D., Kaashoek, M., and H.
      Balakrishnan, "Chord: A scalable peer-to-peer lookup service
      for internet applications", In Proc. of ACM SIGCOMM, 2001.

[5]   Li, J., Stribling, J., Gil, T., and M. Kaashoek, "Comparing
      the performance of distributed hash tables under churn", In
      Proc. of the 3rd International Workshop on Peer-to-Peer
      Systems (IPTPS), 2004.

[6]   Bienkowski, M., Korzeniowski, M., and F. auf der Heide,
      "Dynamic load balancing in distributed hash tables", In Proc.
      of IPTPS, 2005.

[7]   Karger, D. and M. Ruhl, "Simple efficient load balancing
      algorithms for peer-to-peer systems", In Proc. of the 3rd
      International Workshop on Peer-to-Peer Systems (IPTPS), 2004.

[8]   Awerbuch, B. and C.
Scheideler, "Group spreading: A protocol
      for provably secure distributed name service", In Proc. of the
      31st Int. Colloquium on Automata, Languages, and Programming
      (ICALP), 2004.

[9]   Fraigniaud, P. and C. Gavoille, "The content-addressable
      network d2b", Technical Report 1349, LRI, Univ. Paris-Sud,
      France, 2003.

[10]  Kaashoek, F. and D. Karger, "Koorde: A simple degree-optimal
      hash table", In Proc. of the 2nd International Workshop on
      Peer-to-Peer Systems (IPTPS), 2003.

[11]  Naor, M. and U. Wieder, "Novel architectures for peer to peer
      applications: the continuous-discrete approach", In Proc. of
      the 15th ACM Symp. on Parallel Algorithms and Architectures
      (SPAA), 2003.

[12]  Byers, J., Considine, J., and M. Mitzenmacher, "Simple load
      balancing for distributed hash tables", In Proc. of the 2nd
      International Workshop on Peer-to-Peer Systems (IPTPS), 2003.

[13]  Manku, G., "Routing Networks for Distributed Hash Tables", In
      Proc. of the Principles of Distributed Computing (PODC), 2003.

[14]  Godfrey, B. and I. Stoica, "Heterogeneity and load balance in
      Distributed Hash Tables", IEEE INFOCOM, 2005.

[15]  Godfrey, B., Lakshminarayanan, K., Surana, S., Karp, R., and
      I. Stoica, "Load balancing in dynamic structured peer to peer
      systems", In Proc. of the 23rd Conference of the IEEE
      Communications Society (INFOCOM), 2004.

[16]  Rao, A., Lakshminarayanan, K., Surana, S., Karp, R., and I.
      Stoica, "Load balancing in structured peer to peer systems",
      In Proc. of the 2nd International Workshop on Peer-to-Peer
      Systems (IPTPS), 2004.

[17]  Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, "Handling
      churn in a DHT", In Proc. of the USENIX Annual Technical
      Conference, 2004.

[18]  Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. Haridi,
      "Comparing maintenance strategies for overlays", In Proc.
of the 16th Euromicro Conference on Parallel, Distributed and
      Network-Based Processing, 2008.

[19]  Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.,
      Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A
      scalable peer-to-peer lookup protocol for internet
      applications", IEEE/ACM Transactions on Networking, 2003.

[20]  Ghinita, G. and Y. Teo, "An adaptive stabilization framework
      for distributed hash tables", 20th International Parallel and
      Distributed Processing Symposium, 2006.

[21]  Liben-Nowell, D., Balakrishnan, H., and D. Karger,
      "Observations on the dynamic evolution of peer-to-peer
      networks", In Proc. of the First International Workshop on
      Peer-to-Peer Systems, 2002.

[22]  Mahajan, R., Castro, M., and A. Rowstron, "Controlling the
      cost of reliability in peer-to-peer overlays", In Proc. of the
      2nd International Workshop on Peer-to-Peer Systems, 2003.

[23]  Horowitz, K. and D. Malkhi, "Estimating Network Size from
      Local Information", Information Processing Letters, Volume 88,
      Issue 5, pp. 237-243, 2003.

[24]  Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and A.
      Demers, "Decentralized schemes for size estimation in large
      and dynamic groups", Fourth IEEE International Symposium on
      Network Computing and Applications, pp. 41-48, 2005.

[25]  Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable
      algorithm to monitor chord-based peer to peer systems at
      runtime", 20th International Parallel and Distributed
      Processing Symposium, 2006.

[26]  Maenpaa, J. and G. Camarillo, "A study on maintenance
      operations in a Chord-based Peer-to-Peer Session Initiation
      Protocol overlay network", Accepted to the Sixth International
      Workshop on Hot Topics in P2P Systems (HotP2P 2009), 2009.

[27]  Ktari, S., Zoubert, M., Hecker, A., and H.
Labiod,
      "Performance evaluation of replication strategies in DHTs
      under churn", In Proc. of the 6th International Conference on
      Mobile and Ubiquitous Multimedia, 2007.

Authors' Addresses

Jouni Maenpaa
Ericsson
Hirsalantie 11
Jorvas  02420
Finland

Email: Jouni.Maenpaa@ericsson.com

Ashwin Swaminathan
Qualcomm, Inc.
5775 Morehouse Dr
San Diego, CA
USA

Phone: +1 858-845-8775
Email: sashwin@qualcomm.com

Saumitra M. Das
Qualcomm, Inc.
3195 Kifer Road
Santa Clara, CA
USA

Phone: +1 408-533-9529
Email: saumitra@qualcomm.com

Gonzalo Camarillo
Ericsson
Hirsalantie 11
Jorvas  02420
Finland

Email: Gonzalo.Camarillo@ericsson.com

Jani Hautakorpi
Ericsson
Hirsalantie 11
Jorvas  02420
Finland

Email: Jani.Hautakorpi@ericsson.com