idnits 2.17.1 

draft-ietf-p2psip-self-tuning-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Sep 2009 rather than the newer Notice from 28 Dec 2009.  (See
     https://trustee.ietf.org/license-info/)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to use 'NOT RECOMMENDED' as an RFC 2119 keyword, but
     does not include the phrase in its RFC 2119 key words list.

  -- The document date (March 1, 2010) is 5170 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-26) exists of
     draft-ietf-p2psip-base-07

  ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489)

  == Outdated reference: A later version (-09) exists of
     draft-ietf-p2psip-concepts-02


     Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	P2PSIP Working Group                                          J. Maenpaa
3	Internet-Draft                                              G. Camarillo
4	Intended status: Standards Track                           J. Hautakorpi
5	Expires: September 2, 2010                                      Ericsson
6	                                                           March 1, 2010

8	  A Self-tuning Distributed Hash Table (DHT) for REsource LOcation And
9	                           Discovery (RELOAD)
10	                  draft-ietf-p2psip-self-tuning-01.txt

12	Abstract

14	   REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P)
15	   signaling protocol that provides an overlay network service.  Peers
16	   in a RELOAD overlay network collectively run an overlay algorithm to
17	   organize the overlay, and to store and retrieve data.  This document
18	   describes how the default topology plugin of RELOAD can be extended
19	   to support self-tuning, that is, to adapt to changing operating
20	   conditions such as churn and network size.

22	Status of this Memo

24	   This Internet-Draft is submitted to IETF in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF), its areas, and its working groups.  Note that
29	   other groups may also distribute working documents as Internet-
30	   Drafts.

32	   Internet-Drafts are draft documents valid for a maximum of six months
33	   and may be updated, replaced, or obsoleted by other documents at any
34	   time.  It is inappropriate to use Internet-Drafts as reference
35	   material or to cite them other than as "work in progress."

37	   The list of current Internet-Drafts can be accessed at
38	   http://www.ietf.org/ietf/1id-abstracts.txt.

40	   The list of Internet-Draft Shadow Directories can be accessed at
41	   http://www.ietf.org/shadow.html.

43	   This Internet-Draft will expire on September 2, 2010.

45	Copyright Notice

47	   Copyright (c) 2010 IETF Trust and the persons identified as the
48	   document authors.  All rights reserved.

50	   This document is subject to BCP 78 and the IETF Trust's Legal
51	   Provisions Relating to IETF Documents
52	   (http://trustee.ietf.org/license-info) in effect on the date of
53	   publication of this document.  Please review these documents
54	   carefully, as they describe your rights and restrictions with respect
55	   to this document.  Code Components extracted from this document must
56	   include Simplified BSD License text as described in Section 4.e of
57	   the Trust Legal Provisions and are provided without warranty as
58	   described in the BSD License.

60	Table of Contents

62	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
63	   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  3
64	   3.  Introduction to Stabilization in DHTs  . . . . . . . . . . . .  5
65	     3.1.  Reactive vs. Periodic Stabilization  . . . . . . . . . . .  5
66	     3.2.  Configuring Periodic Stabilization . . . . . . . . . . . .  6
67	     3.3.  Adaptive Stabilization . . . . . . . . . . . . . . . . . .  7
68	   4.  Introduction to Chord  . . . . . . . . . . . . . . . . . . . .  7
69	   5.  Extending Chord-reload to Support Self-tuning  . . . . . . . .  9
70	     5.1.  Update Requests  . . . . . . . . . . . . . . . . . . . . .  9
71	     5.2.  Neighbor Stabilization . . . . . . . . . . . . . . . . . . 10
72	     5.3.  Finger Stabilization . . . . . . . . . . . . . . . . . . . 11
73	     5.4.  Adjusting Finger Table Size  . . . . . . . . . . . . . . . 11
74	     5.5.  Detecting Partitioning . . . . . . . . . . . . . . . . . . 11
75	     5.6.  Leaving the Overlay  . . . . . . . . . . . . . . . . . . . 12
76	   6.  Self-tuning Chord Parameters . . . . . . . . . . . . . . . . . 12
77	     6.1.  Estimating Overlay Size  . . . . . . . . . . . . . . . . . 12
78	     6.2.  Determining Routing Table Size . . . . . . . . . . . . . . 13
79	     6.3.  Estimating Failure Rate  . . . . . . . . . . . . . . . . . 13
80	       6.3.1.  Detecting Failures . . . . . . . . . . . . . . . . . . 14
81	     6.4.  Estimating Join Rate . . . . . . . . . . . . . . . . . . . 15
82	     6.5.  Estimate Sharing . . . . . . . . . . . . . . . . . . . . . 15
83	     6.6.  Calculating the Stabilization Interval . . . . . . . . . . 16
84	   7.  Overlay Configuration Document Extension . . . . . . . . . . . 17
85	   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 17
86	   9.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 17
87	     9.1.  Message Extensions . . . . . . . . . . . . . . . . . . . . 18
88	   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18
89	     10.1. Normative References . . . . . . . . . . . . . . . . . . . 18
90	     10.2. Informative References . . . . . . . . . . . . . . . . . . 18
91	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20

93	1.  Introduction

95	   REsource LOcation And Discovery (RELOAD) [I-D.ietf-p2psip-base] is a
96	   peer-to-peer signaling protocol that can be used to maintain an
97	   overlay network, and to store data in and retrieve data from the
98	   overlay.  For interoperability reasons, RELOAD specifies one overlay
99	   algorithm, called chord-reload, that is mandatory to implement.  This
100	   document extends the chord-reload algorithm by introducing self-
101	   tuning behavior.

103	   DHT-based overlay networks are self-organizing, scalable and
104	   reliable.  However, these features come at a cost: peers in the
105	   overlay network need to consume network bandwidth to maintain routing
106	   state.  Most DHTs use a periodic stabilization routine to counter the
107	   undesirable effects of churn on routing.  To configure the parameters
108	   of a DHT, some characteristics such as churn rate and network size
109	   need to be known in advance.  These characteristics are then used to
110	   configure the DHT in a static fashion by using fixed values for
111	   parameters such as the size of the successor set, size of the routing
112	   table, and rate of maintenance messages.  The problem with this
113	   approach is that it is not possible to achieve a low failure rate and
114	   a low communication overhead by using fixed parameters.  Instead, a
115	   better approach is to allow the system to take into account the
116	   evolution of network conditions and adapt to them.  This document
117	   extends the mandatory-to-implement chord-reload algorithm by making
118	   it self-tuning.  Two main advantages of self-tuning are that users no
119	   longer need to tune every DHT parameter correctly for a given
120	   operating environment and that the system adapts to changing
121	   operating conditions.

123	   The remainder of this document is structured as follows: Section 2
124	   provides definitions of terms used in this document.  Section 3
125	   discusses alternative approaches to stabilization operations in DHTs,
126	   including reactive stabilization, periodic stabilization, and
127	   adaptive stabilization.  Section 4 gives an introduction to the Chord
128	   DHT algorithm.  Section 5 describes how this document extends the
129	   stabilization routine of the chord-reload algorithm.  Section 6
130	   describes how the stabilization rate and routing table size are
131	   calculated in an adaptive fashion.

133	2.  Terminology

135	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
136	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
137	   document are to be interpreted as described in RFC 2119 [RFC2119].

139	   This document uses the terminology and definitions from the Concepts
140	   and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts]
141	   draft.

143	   Chord Ring:  The Chord DHT orders identifiers on an identifier circle
144	      of size 2^numBitsInNodeId (numBitsInNodeId is the number of bits
145	      in node identifiers).  This identifier circle is called the Chord
146	      ring.

148	   DHT:  Distributed Hash Tables (DHTs) are a class of decentralized
149	      distributed systems that provide a lookup service similar to a
150	      hash table.  Given a key, any participating peer can retrieve the
151	      value associated with that key.  The responsibility for
152	      maintaining the mapping from keys to values is distributed among
153	      the peers.

155	   Finger Table:  A data structure with up to numBitsInNodeId entries
156	      maintained by each peer in a Chord-based overlay.  The ith entry
157	      in the finger table of peer n contains the identity of the first
158	      peer that succeeds n by at least 2^(numBitsInNodeId-i) on the
159	      Chord ring.  This peer is called the ith finger of peer n.  As an
160	      example, the first entry in the finger table of peer n contains a
161	      peer half-way around the Chord ring from peer n.  The purpose of
162	      the finger table is to accelerate lookups.

164	   log2(N):  Logarithm of N with base 2.

166	   n.id:  Peer-ID of peer n.

168	   Neighborhood Set:  Consists of successor and predecessor lists.

170	   numBitsInNodeId:  Number of bits in a Node-ID.

172	   O(g(n)):  Informally, saying that some equation f(n) = O(g(n)) means
173	      that f(n) is less than some constant multiple of g(n).

175	   Omega(g(n)):  Informally, saying that some equation f(n) =
176	      Omega(g(n)) means that f(n) is more than some constant multiple of
177	      g(n).

179	   Predecessor List:  A data structure containing the predecessors of a
180	      peer on the Chord ring.

182	   Routing Table:  The set of peers that a node can use to route overlay
183	      messages.  The routing table consists of the finger table,
184	      successor list and predecessor list.

186	   Successor List:  A data structure containing the first r successors
187	      of a peer on the Chord ring.

189	3.  Introduction to Stabilization in DHTs

191	   DHTs use stabilization routines to counter the undesirable effects of
192	   churn on routing.  The purpose of stabilization is to keep the
193	   routing information of each peer in the overlay consistent with the
194	   constantly changing overlay topology.  There are two alternative
195	   approaches to stabilization: periodic and reactive [rhea2004].
196	   Periodic stabilization can either use a fixed stabilization rate or
197	   calculate the stabilization rate in an adaptive fashion.

199	3.1.  Reactive vs. Periodic Stabilization

201	   In reactive stabilization, a peer reacts to the loss of a peer in its
202	   neighborhood set or to the appearance of a new peer that should be
203	   added to its neighborhood set by sending a copy of its neighbor table
204	   to all peers in the neighborhood set.  Periodic recovery, in
205	   contrast, takes place independently of changes in the neighborhood
206	   set.  In periodic recovery, a peer periodically shares its
207	   neighborhood set with each or a subset of the members of that set.

209	   The chord-reload algorithm [I-D.ietf-p2psip-base] supports both
210	   reactive and periodic stabilization.  It has been shown in [rhea2004]
211	   that reactive stabilization works well for small neighborhood sets
212	   (i.e., small overlays) and moderate churn.  However, in large-scale
213	   (e.g., 1000 peers or more [rhea2004]) or high-churn overlays,
214	   reactive stabilization runs the risk of creating a positive feedback
215	   cycle, which can eventually result in congestion collapse.  In
216	   [rhea2004], it is shown that a 1000-peer overlay under churn uses
217	   significantly less bandwidth and has lower latencies when periodic
218	   stabilization is used than when reactive stabilization is used.
219	   Although in the experiments carried out in [rhea2004], reactive
220	   stabilization performed well when there was no churn, its bandwidth
221	   use was observed to jump dramatically under churn.  At higher churn
222	   rates and larger scale overlays, periodic stabilization uses less
223	   bandwidth and the resulting lower contention for the network leads to
224	   lower latencies.  For this reason, most DHTs such as CAN [CAN], Chord
225	   [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic
226	   stabilization [ghinita2006].  As an example, the first version of
227	   Bamboo used reactive stabilization, which caused Bamboo to suffer
228	   from degradation in performance under churn.  To fix this problem,
229	   Bamboo was modified to use periodic stabilization.

231	   In Chord, periodic stabilization is typically done both for
232	   successors and fingers.  An alternative strategy is analyzed in
233	   [krishnamurthy2008].  In this strategy, called the correction-on-
234	   change maintenance strategy, a peer periodically stabilizes its
235	   successors but does not do so for its fingers.  Instead, finger
236	   pointers are stabilized in a reactive fashion.  The results obtained
237	   in [krishnamurthy2008] imply that although the correction-on-change
238	   strategy works well when churn is low, periodic stabilization
239	   outperforms the correction-on-change strategy when churn is high.

241	3.2.  Configuring Periodic Stabilization

243	   When periodic stabilization is used, one faces the problem of
244	   selecting an appropriate execution rate for the stabilization
245	   procedure.  If the execution rate of periodic stabilization is high,
246	   changes in the system can be quickly detected, but at the
247	   disadvantage of increased communication overhead.  Alternatively, if
248	   the stabilization rate is low and the churn rate is high, routing
249	   tables become inaccurate and DHT performance deteriorates.  Thus, the
250	   problem is setting the parameters so that the overlay achieves the
251	   desired reliability and performance even in challenging conditions,
252	   such as under heavy churn.  This naturally results in high cost
253	   during periods when the churn level is lower than expected, or
254	   alternatively, poor performance or even network partitioning in worse
255	   than expected conditions.

257	   In addition to selecting an appropriate stabilization interval,
258	   regardless of whether periodic stabilization is used or not, an
259	   appropriate size needs to be selected for the neighborhood set and
260	   for the finger table.

262	   The current approach is to configure overlays statically.  This works
263	   in situations where perfect information about the future is
264	   available.  In situations where the operating conditions of the
265	   network are known in advance and remain static throughout the
266	   lifetime of the system, it is possible to choose fixed optimal values
267	   for parameters such as stabilization rate, neighborhood set size and
268	   routing table size.  However, if the operating conditions (e.g., the
269	   size of the overlay and its churn rate) do not remain static but
270	   evolve with time, it is not possible to achieve both a low lookup
271	   failure rate and a low communication overhead by using fixed
272	   parameters [ghinita2006].

274	   As an example, to configure the Chord DHT algorithm, one needs to
275	   select values for the following parameters: size of successor list,
276	   stabilization interval, and size of the finger table.  To select an
277	   appropriate value for the stabilization interval, one needs to know
278	   the expected churn rate and overlay size.  According to

280	   [liben-nowell2002], a Chord network in a ring-like state remains in a
281	   ring-like state as long as peers send Omega(log2^2(N)) messages
282	   before N new peers join or N/2 peers fail.  Thus, in a 500-peer
283	   overlay churning at a rate such that one peer joins and one peer
284	   leaves the network every 30 seconds, an appropriate stabilization
285	   interval would be on the order of 93s.  According to [Chord], the
286	   size of the successor list and finger table should be on the order of
287	   log2(N).  Having a successor list of size O(log2(N)) makes it
288	   unlikely that a peer will lose all of its successors, which would
289	   cause the Chord ring to become disconnected.  Thus, in a 500-peer
290	   network each peer should maintain on the order of nine successors and
291	   fingers.  However, if the churn rate doubles and the network size
292	   remains unchanged, the stabilization rate should double as well.
293	   That is, the appropriate maintenance interval would now be on the
294	   order of 46s.  On the other hand, if the churn rate becomes e.g. six-
295	   fold and the size of the network grows to 2000 peers, on the order of
296	   eleven fingers and successors should be maintained and the
297	   stabilization interval should be on the order of 42s.  If one
298	   continued using the old values, this could result in inaccurate
299	   routing tables, network partitioning, and deteriorating performance.

301	3.3.  Adaptive Stabilization

303	   A self-tuning DHT takes into consideration the continuous evolution
304	   of network conditions and adapts to them.  In a self-tuning DHT, each
305	   peer collects statistical data about the network and dynamically
306	   adjusts its stabilization rate, neighborhood set size, and finger
307	   table size based on the analysis of the data [ghinita2006].
308	   Reference [mahajan2003] shows that by using self-tuning, it is
309	   possible to achieve high reliability and performance even in adverse
310	   conditions with low maintenance cost.  Adaptive stabilization has
311	   been shown to outperform periodic stabilization in terms of both
312	   lookup failures and communication overhead [ghinita2006].

314	4.  Introduction to Chord

316	   Chord [Chord] is a structured P2P algorithm that uses consistent
317	   hashing to build a DHT out of several independent peers.  Consistent
318	   hashing assigns each peer and resource a fixed-length identifier.
319	   Peers use SHA-1 as the base hash fuction to generate the identifiers.
320	   As specified in RELOAD base, the length of the identifiers is
321	   numBitsInNodeId=128 bits.  The identifiers are ordered on an
322	   identifier circle of size 2^numBitsInNodeId.  On the identifier
323	   circle, key k is assigned to the first peer whose identifier equals
324	   or follows the identifier of k in the identifier space.  The
325	   identifier circle is called the Chord ring.

327	   Different DHTs differ significantly in performance when bandwidth is
328	   limited.  It has been shown that when compared to other DHTs, the
329	   advantages of Chord include that it uses bandwidth efficiently and
330	   can achieve low lookup latencies at little cost [li2004].

332	   A simple lookup mechanism could be implemented on a Chord ring by
333	   requiring each peer to only know how to contact its current successor
334	   on the identifier circle.  Queries for a given identifier could then
335	   be passed around the circle via the successor pointers until they
336	   encounter the first peer whose identifier is equal to or larger than
337	   the desired identifier.  Such a lookup scheme uses a number of
338	   messages that grows linearly with the number of peers.  To reduce the
339	   cost of lookups, Chord maintains also additional routing information;
340	   each peer n maintains a data structure with up to numBitsInNodeId
341	   entries, called the finger table.  The first entry in the finger
342	   table of peer n contains the peer half-way around the ring from peer
343	   n.  The second entry contains the peer that is 1/4th of the way
344	   around, the third entry the peer that is 1/8th of the way around,
345	   etc.  In other words, the ith entry in the finger table at peer n
346	   contains the identity of the first peer s that succeeds n by at least
347	   2^(numBitsInNodeId-i) on the Chord ring.  This peer is called the ith
348	   finger of peer n.  The interval between two consecutive fingers is
349	   called a finger interval.  The ith finger interval of peer n covers
350	   the range [n.id + 2^(numBitsInNodeId-i), n.id +
351	   2^(numBitsInNodeId-i+1)) on the Chord ring.  In an N-peer network,
352	   each peer maintains information about O(log2(N)) other peers in its
353	   finger table.  As an example, if N=100000, it is sufficient to
354	   maintain 17 fingers.

356	   Chord needs all peers' successor pointers to be up to date in order
357	   to ensure that lookups produce correct results as the set of
358	   participating peers changes.  To achieve this, peers run a
359	   stabilization protocol periodically in the background.  The
360	   stabilization protocol of the original Chord algorithm uses two
361	   operations: successor stabilization and finger stabilization.
362	   However, the Chord algorithm of RELOAD base defines two additional
363	   stabilization components, as will be discussed below.

365	   To increase robustness in the event of peer failures, each Chord peer
366	   maintains a successor list of size r, containing the peer's first r
367	   successors.  The benefit of successor lists is that if each peer
368	   fails independently with probability p, the probability that all r
369	   successors fail simultaneously is only p^r.

371	   The original Chord algorithm maintains only a single predecessor
372	   pointer.  However, multiple predecessor pointers (i.e., a predecessor
373	   list) can be maintained to speed up recovery from predecessor
374	   failures.  The routing table of a peer consists of the successor
375	   list, finger table, and predecessor list.

377	5.  Extending Chord-reload to Support Self-tuning

379	   This section describes how the mandatory-to-implement chord-reload
380	   algorithm defined in RELOAD base [I-D.ietf-p2psip-base] can be
381	   extended to support self-tuning.

383	   The chord-reload algorithm supports both reactive and periodic
384	   recovery strategies.  When the self-tuning mechanisms defined in this
385	   document are used, the periodic recovery strategy MUST be used.
386	   Further, chord-reload specifies that at least three predecessors and
387	   three successors need to be maintained.  When the self-tuning
388	   mechanisms are used, the appropriate sizes of the successor list and
389	   predecessor list are determined in an adaptive fashion based on the
390	   estimated network size, as will be described in Section 6.

392	   As specified in RELOAD base, each peer MUST maintain a stabilization
393	   timer.  When the stabilization timer fires, the peer MUST restart the
394	   timer and carry out the overlay stabilization routine.  Overlay
395	   stabilization has four components in chord-reload:

397	   1.  Update the neighbor table.  We refer to this as neighbor
398	       stabilization.
399	   2.  Refreshing the finger table.  We refer to this as finger
400	       stabilization.
401	   3.  Adjusting finger table size.
402	   4.  Detecting partitioning.  We refer to this as strong
403	       stabilization.

405	   As specified in RELOAD base [I-D.ietf-p2psip-base], a peer sends
406	   periodic messages as part of the neighbor stabilization, finger
407	   stabilization, and strong stabilization routines.  In neighbor
408	   stabilization, a peer periodically sends an Update request to every
409	   peer in its Connection Table.  The default time is every ten minutes.
410	   In finger stabilization, a peer periodically searches for new peers
411	   to include in its finger table.  This time defaults to one hour.
412	   This document specifies how the neighbor stabilization and finger
413	   stabilization intervals can be determined in an adaptive fashion
414	   based on the operating conditions of the overlay.  The subsections
415	   below describe how this document extends the four components of
416	   stabilization.

418	5.1.  Update Requests

420	   As described in RELOAD base [I-D.ietf-p2psip-base], the neighbor and
421	   finger stabilization procedures are implemented using Update
422	   requests.  RELOAD base defines three types of Update requests:
423	   'peer_ready', 'neighbors', and 'full'.  Regardless of the type, all
424	   Update requests include an 'uptime' field.  Since the self-tuning
425	   extensions require information on the uptimes of peers in the routing
426	   table, the sender of an Update request MUST include its current
427	   uptime in seconds in the 'uptime' field.

429	   When self-tuning is used, each peer decides independently the
430	   appropriate size for the successor list, predecessor list and finger
431	   table.  Thus, the 'predecessors', 'successors', and 'fingers' fields
432	   included in RELOAD Update requests are of variable length.  As
433	   specified in RELOAD [I-D.ietf-p2psip-base], variable length fields
434	   are on the wire preceded by length bytes.  In the case of the
435	   successor list, predecessor list, and finger table, there are two
436	   length bytes (allowing lengths up to 2^16-1).  The number of NodeId
437	   structures included in each field can be calculated based on the
438	   length bytes since the size of a single NodeId structure is 16 bytes.
439	   If a peer receives more entries than fit into its successor list,
440	   predecessor list or finger table, the peer MUST ignore the extra
441	   entries.  If a peer receives less entries than it currently has in
442	   its own data structure, the peer MUST NOT drop the extra entries from
443	   its data structure.

445	5.2.  Neighbor Stabilization

447	   In the neighbor stabilization operation of chord-reload, a peer
448	   periodically sends an Update request to every peer in its Connection
449	   Table.  In a small, low-churn overlay, the amount of traffic this
450	   process generates is typically acceptable.  However, in a large-scale
451	   overlay churning at a moderate or high churn rate, the traffic load
452	   may no longer be acceptable since the size of the connection table is
453	   large and the stabilization interval relatively short.  The self-
454	   tuning mechanisms described in this document are especially designed
455	   for overlays of the latter type.  Therefore, when the self-tuning
456	   mechanisms are used, each peer MUST send a periodic Update request
457	   only to its first predecessor and first successor on the Chord ring.

459	   The neighbor stabilization routine MUST be executed when the
460	   stabilization timer fires.  To begin the neighbor stabilization
461	   routine, a peer MUST send an Update request to its first successor
462	   and its first predecessor.  The type of the Update request MUST be
463	   'neighbors'.  The Update request MUST include the successor and
464	   predecessor lists of the sender.  If a peer receiving such an Update
465	   request learns from the predecessor and successor lists included in
466	   the request that new peers can be included in its neighborhood set,
467	   it MUST send Attach requests to the new peers.

469	   After a new peer has been added to the predecessor or successor list,
470	   an Update request of type 'peer_ready' MUST be sent to the new peer.
471	   This allows the new peer to insert the sender into its neighborhood
472	   set.

474	5.3.  Finger Stabilization

476	   In the finger stabilization routine of chord-reload, a peer
477	   periodically searches for new peers to replace invalid (that is,
478	   repeated or failed) entries in the finger table.  Chord-reload
479	   provides two possible methods for searching for new peers to include
480	   in the finger table.  In alternative 1, a peer selects one entry from
481	   among the invalid entries in its finger table each time the
482	   stabilization timer fires and sends a Ping request to the selected
483	   entry.  In alternative 2, a peer sends a RouteQuery request to all
484	   invalid entries in the finger table.  After having received
485	   RouteQuery responses, the peer sends a Ping to those entries for
486	   which the RouteQuery response came from a peer not already present in
487	   the routing table.  When the self-tuning mechanisms defined in this
488	   draft are being used in the overlay, alternative 1 MUST be used since
489	   the traffic load it generates is lower and thus more appropriate for
490	   large-scale overlays experiencing churn.

492	   Immediately after a new peer has been added to the finger table, a
493	   Probe request MUST be sent to the new peer to fetch its uptime.  The
494	   requested_info field of the Probe request MUST be set to contain the
495	   ProbeInformationType 'uptime' defined in RELOAD base
496	   [I-D.ietf-p2psip-base].

498	5.4.  Adjusting Finger Table Size

500	   The chord-reload algorithm defines how a peer can make sure that the
501	   finger table is appropriately sized to allow for efficient routing.
502	   Since the self-tuning mechanisms specified in this document produce a
503	   network size estimate, this estimate can be directly used to
504	   calculate the optimal size for the finger table.  This mechanism MUST
505	   be used instead of the one specified by chord-reload.  A peer MUST
506	   use the network size estimate to determine whether it needs to adjust
507	   the size of its finger table each time when the stabilization timer
508	   fires.  The way this is done is explained in Section 6.2.

510	5.5.  Detecting Partitioning

512	   This document does not require any changes to the mechanism chord-
513	   reload uses to detect network partitioning.

515	5.6.  Leaving the Overlay

517	   As specified in RELOAD base [I-D.ietf-p2psip-base], a leaving peer
518	   SHOULD send a Leave request to all members of its neighbor table
519	   prior to leaving the overlay.  The overlay_specific_data field MUST
520	   contain the ChordLeaveData structure.  The Leave requests that are
521	   sent to successors MUST contain the predecessor list of the leaving
522	   peer.  The Leave requests that are sent to the predecessors MUST
523	   contain the successor list of the leaving peer.  If a given successor
524	   can identify better predecessors than are already included in its
525	   predecessor lists by investigating the predecessor list it receives
526	   from the leaving peer, it MUST send Attach requests to them.
527	   Similarly, if a given predecessor identifies better successors by
528	   investigating the successor list it receives from the leaving peer,
529	   it MUST send Attach requests to them.

531	6.  Self-tuning Chord Parameters

533	   This section specifies how to determine an appropriate stabilization
534	   rate and routing table size in an adaptive fashion.  The proposed
535	   mechanism is based on [mahajan2003], [liben-nowell2002], and
536	   [ghinita2006].  To calculate an appropriate stabilization rate, the
537	   values of three parameters MUST be estimated: overlay size N, failure
538	   rate U, and join rate L. To calculate an appropriate routing table
539	   size, the estimated network size N can be used.  Peers in the overlay
540	   MUST re-calculate the values of the parameters to self-tune the
541	   chord-reload algorithm at the end of each stabilization period before
542	   re-starting the stabilization timer.

544	6.1.  Estimating Overlay Size

546	   Techniques for estimating the size of an overlay network have been
547	   proposed for instance in [mahajan2003], [horowitz2003],
548	   [kostoulas2005], [binzenhofer2006], and [ghinita2006].  In Chord, the
549	   density of peer identifiers in the neighborhood set can be used to
550	   produce an estimate of the size of the overlay, N [mahajan2003].
551	   Since peer identifiers are picked randomly with uniform probability
552	   from the numBitsInNodeId-bit identifier space, the average distance
553	   between peer identifiers in the successor set is
554	   (2^numBitsInNodeId)/N.

556	   To estimate the overlay network size, a peer MUST compute the average
557	   inter-peer distance d between the successive peers starting from the
558	   most distant predecessor and ending to the most distant successor in
559	   the successor list.  The estimated network size MUST be calculated
560	   as:

562	                         2^numBitsInNodeId
563	                    N = -------------------
564	                                d

566	   This estimate has been found to be accurate within 15% of the real
567	   network size [ghinita2006].  Of course, the size of the neighborhood
568	   set affects the accuracy of the estimate.

570	   During the join process, a joining peer fills its routing table by
571	   sending a series of Ping and Attach requests, as specified in RELOAD
572	   base [I-D.ietf-p2psip-base].  Thus, a joining peer immediately has
573	   enough information at its disposal to calculate an estimate of the
574	   network size.

576	6.2.  Determining Routing Table Size

578	   As specified in RELOAD base, the finger table must contain at least
579	   16 entries.  When the self-tuning mechanisms are used, the size of
580	   the finger table MUST be set to max(log2(N), 16) using the estimated
581	   network size N.

583	   The size of the successor list MUST be set to log2(N).  An
584	   implementation MAY place a lower limit on the size of the successor
585	   list.  As an example, the implementation might require the size of
586	   the successor list to be always at least three.

588	   A peer MAY choose to maintain a fixed-size predecessor list with only
589	   three entries as specified in RELOAD base.  However, it is
590	   RECOMMENDED that a peer maintains log2(N) predecessors.

592	6.3.  Estimating Failure Rate

594	   A typical approach is to assume that peers join the overlay according
595	   to a Poisson process with rate L and leave according to a Poisson
596	   process with rate parameter U [mahajan2003].  The value of U can be
597	   estimated using peer failures in the finger table and neighborhood
598	   set [mahajan2003].  If peers fail with rate U, a peer with M unique
599	   peer identifiers in its routing table should observe K failures in
600	   time K/(M*U).  Every peer in the overlay MUST maintain a history of
601	   the last K failures.  The current time MUST be inserted into the
602	   history when the peer joins the overlay.  The estimate of U MUST be
603	   calculated as:

605	                             k
606	                     U = --------,
607	                          M * Tk

609	   where M is the number of unique peer identifiers in the routing
610	   table, Tk is the time between the first and the last failure in the
611	   history, and k is the number of failures in the history.  If k is
612	   smaller than K, the estimate MUST be computed as if there was a
613	   failure at the current time.  It has been shown that an estimate
614	   calculated in a similar manner is accurate within 17% of the real
615	   value of U [ghinita2006].

617	   The size of the failure history K affects the accuracy of the
618	   estimate of U. One can increase the accuracy by increasing K.
619	   However, this has the side effect of decreasing responsiveness to
620	   changes in the failure rate.  On the other hand, a small history size
621	   may cause a peer to overreact each time a new failure occurs.  In
622	   [ghinita2006], K is set 25% of the routing table size.  Use of this
623	   approach is RECOMMENDED.

625	6.3.1.  Detecting Failures

627	   A new failure MUST be inserted to the failure history in the
628	   following cases:

630	   1.  A Leave request is received from a neigbhor.
631	   2.  A peer fails to reply to a Ping request sent in the situation
632	       explained below.  If no packets have been received on a
633	       connection during the past 2*Tr seconds (where Tr is the
634	       inactivity timer defined by ICE [I-D.ietf-mmusic-ice]), a RELOAD
635	       Ping request MUST be sent to the remote peer.  RELOAD mandates
636	       the use of STUN [RFC5389] for keepalives.  STUN keepalives take
637	       the form of STUN Binding Indication transactions.  As specified
638	       in ICE [I-D.ietf-mmusic-ice], a peer sends a STUN Binding
639	       Indication if there has been no packet sent on a connection for
640	       Tr seconds.  Tr is configurable and has a default of 15 seconds.
641	       Although STUN Binding Indications do not generate a response, the
642	       fact that a peer has failed can be learned from the lack of
643	       packets (Binding Indications or application protocol packets)
644	       received from the peer.  If the remote peer fails to reply to the
645	       Ping request, the sender MUST consider the remote peer to have
646	       failed.

648	   As an alternative to relying on STUN keepalives to detect peer
649	   failure, a peer could send additional, frequent RELOAD messages to
650	   every peer in its Connection Table.  These messages could be Update
651	   requests, in which case they would serve two purposes: detecting peer
652	   failure and stabilization.  However, as the cost of this approach can
653	   be very high in terms of bandwidth consumption and traffic load,
654	   especially in large-scale overlays experiencing churn, its use is NOT
655	   RECOMMENDED.

657	6.4.  Estimating Join Rate

659	   Reference [ghinita2006] proposes that a peer can estimate the join
660	   rate based on the uptime of the peers in its routing table.  An
661	   increase in peer join rate will be reflected by a decrease in the
662	   average age of peers in the routing table.  Thus, each peer MUST
663	   maintain an array of the ages of the peers in its routing table
664	   sorted in increasing order.  Using this information, an estimate of
665	   the global peer join rate L MUST be calculated as:

667	                         N           1
668	                    L = --- * ---------------,
669	                         4     Ages[rsize/4]

671	   where Ages is an array containing the ages of the peers in the
672	   routing table sorted in increasing order and rsize is the size of the
673	   routing table.  It is RECOMMENDED that only the ages of the 25% of
674	   the youngest peers in the routing table (i.e., the 25th percentile)
675	   are used to reduce the bias that a small number of peers with very
676	   old ages can cause [ghinita2006].  It has been shown that the
677	   estimate obtained by using this method is accurate within 22% of the
678	   real join rate [ghinita2006].  Of course, the size of the routing
679	   table affects the accuracy.

681	   In order for this mechanism to work, peers need to exchange
682	   information about the time they have been present in the overlay.
683	   Peers receive the uptimes of their successors and predecessors during
684	   the stabilization operations since all Update requests carry uptime
685	   values.  A joining peer learns the uptime of the admitting peer since
686	   it receives an Update from the admitting peer during the join
687	   procedure.  Peers learn the uptimes of new fingers since they can
688	   fetch the uptime using a Probe request after having attached to the
689	   new finger.

691	6.5.  Estimate Sharing

693	   To improve the accuracy of network size, join rate, and leave rate
694	   estimates, peers MUST share their estimates.  When the stabilization
695	   timer fires, a peer MUST select number-of-peers-to-probe random peers
696	   from its finger table and send each of them a Probe request.  The
697	   targets of Probe requests are selected from the finger table rather
698	   than from the neighbor table since neighbors are likely to make
699	   similar errors when calculating their estimates. number-of-peers-to-
700	   probe is a new element in the overlay configuration document.  It is
701	   defined in Section 7 and has a default value of 4.  Both the Probe
702	   request and the answer returned by the target peer MUST contain a new
703	   message extension whose MessageExtensionType is 'self_tuning_data'.
704	   This extension type is defined in Section 9.1.  The
705	   extension_contents field of the MessageExtension structure MUST
706	   contain a SelfTuningData structure:

708	               struct {
709	                 uint32                   network_size;
710	                 uint32                   join_rate;
711	                 uint32                   leave_rate;
712	               } SelfTuningData;

714	   The contents of the SelfTuningData structure are as follows:

716	   network_size
717	      The latest network size estimate calculated by the sender.

719	   join_rate
720	      The latest join rate estimate calculated by the sender.

722	   leave_rate
723	      The latest leave rate estimate calculated by the sender.

725	   The join and leave rates are expressed as joins or failures per 24
726	   hours.  As an example, if the global join rate estimate a peer has
727	   calculated is 0.123 peers/s, it would include in the join_rate field
728	   the value 10627 (24*60*60*0.123 = 10627.2).

730	   The 'type' field of the MessageExtension structure MUST be set to
731	   contain the value 'self_tuning_data'.  The 'critical' field of the
732	   structure MUST be set to False.

734	   A peer MUST store all estimates it receives in Probe requests and
735	   answers during a stabilization interval.  When the stabilization
736	   timer fires, the peer MUST calculate the estimates to be used during
737	   the next stabilization interval by taking the 75th percentile of a
738	   data set containing its own estimate and the received estimates.

740	6.6.  Calculating the Stabilization Interval

742	   According to [liben-nowell2002], a Chord network in a ring-like state
743	   remains in a ring-like state as long as peers send Omega(log2^2(N))
744	   messages before N new peers join or N/2 peers fail.  We can use the
745	   estimate of peer failure rate, U, to calculate the time Tf in which
746	   N/2 peers fail:

748	                                  1
749	                           Tf = ------
750	                                 2*U

752	   Based on this estimate, a stabilization interval Tstab-1 MUST be
753	   calculated as:

755	                                         Tf
756	                           Tstab-1 = -----------
757	                                      log2^2(N)

759	   On the other hand, the estimated join rate L can be used to calculate
760	   the time in which N new peers join the overlay.  Based on the
761	   estimate of L, a stabilization interval Tstab-2 MUST be calculated
762	   as:

764	                                             N
765	                            Tstab-2 = ---------------
766	                                       L * log2^2(N)

768	   Finally, the actual stabilization interval Tstab that MUST be used
769	   can be obtained by taking the minimum of Tstab-1 and Tstab-2.

771	   The results obtained in [maenpaa2009] indicate that making the
772	   stabilization interval too small has the effect of making the overlay
773	   less stable (e.g., in terms of detected loops and path failures).
774	   Thus, a lower limit should be used for the stabilization period.
775	   Based on the results in [maenpaa2009], a lower limit of 15s is
776	   RECOMMENDED, since using a stabilization period smaller than this
777	   will with a high probability cause too much traffic in the overlay.

779	7.  Overlay Configuration Document Extension

781	   This document extends the RELOAD overlay configuration document by
782	   adding one new element, "number-of-peers-to-probe", inside each
783	   "configuration" element.

785	   number-of-peers-to-probe:  The number of fingers to which Probe
786	      requests are sent to obtain their network size, join rate, and
787	      leave rate estimates.  The default value is 4.

789	8.  Security Considerations

791	   There are no new security considerations introduced in this document.
792	   The security considerations of RELOAD [I-D.ietf-p2psip-base] apply.

794	9.  IANA Considerations
795	9.1.  Message Extensions

797	   This document introduces one additional extension to the "RELOAD
798	   Extensions" Registry:

800	                  +------------------+-------+---------------+
801	                  | Extension Name   |  Code | Specification |
802	                  +------------------+-------+---------------+
803	                  | self_tuning_data |     1 |      RFC-AAAA |
804	                  +------------------+-------+---------------+

806	   The contents of the extension are defined in Section 6.5.

808	10.  References

810	10.1.  Normative References

812	   [I-D.ietf-mmusic-ice]
813	              Rosenberg, J., "Interactive Connectivity Establishment
814	              (ICE): A Protocol for Network Address Translator (NAT)
815	              Traversal for Offer/Answer Protocols",
816	              draft-ietf-mmusic-ice-19 (work in progress), October 2007.

818	   [I-D.ietf-p2psip-base]
819	              Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and
820	              H. Schulzrinne, "REsource LOcation And Discovery (RELOAD)
821	              Base Protocol", draft-ietf-p2psip-base-07 (work in
822	              progress), February 2010.

824	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
825	              Requirement Levels", BCP 14, RFC 2119, March 1997.

827	   [RFC5389]  Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,
828	              "Session Traversal Utilities for NAT (STUN)", RFC 5389,
829	              October 2008.

831	10.2.  Informative References

833	   [CAN]      Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S.
834	              Schenker, "A scalable content-addressable network", In
835	              Proc. of the 2001 Conference on Applications,
836	              Technologies, Architectures and Protocols for Computer
837	              Communications 2001, pp. 161.172.

839	   [Chord]    Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.,
840	              Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A
841	              Scalable Peer-to-peer Lookup Service for Internet
842	              Applications", IEEE/ACM Transactions on Networking Volume
843	              11, Issue 1, 17-32, Feb 2003.

845	   [I-D.ietf-p2psip-concepts]
846	              Bryan, D., Matthews, P., Shim, E., Willis, D., and S.
847	              Dawkins, "Concepts and Terminology for Peer to Peer SIP",
848	              draft-ietf-p2psip-concepts-02 (work in progress),
849	              July 2008.

851	   [Pastry]   Rowstron, A. and P. Druschel, "Pastry: Scalable,
852	              Decentralized Object Location and Routing for Large-Scale
853	              Peer-to-Peer Systems", In Proc. of the IFIP/ACM
854	              International Conference on Distribued Systems
855	              Platforms Nov. 2001, pp. 329-350.

857	   [binzenhofer2006]
858	              Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable
859	              algorithm to monitor chord-based P2P systems at runtime",
860	              20th International Parallel and Distributed Processing
861	              Symposium April 2006.

863	   [ghinita2006]
864	              Ghinita, G. and Y. Teo, "An adaptive stabilization
865	              framework for distributed hash tables", 20th International
866	              Parallel and Distributed Processing Symposium April 2006.

868	   [horowitz2003]
869	              Horowitz, K. and D. Malkhi, "Estimating network size from
870	              local information", Information Processing Letters Dec.
871	              2003, Volume 88, Issue 5, pp. 237-243.

873	   [kostoulas2005]
874	              Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and
875	              A. Demers, "Decentralized schemes for size estimation in
876	              large and dynamic groups", Fourth IEEE International
877	              Symposium on Network Computing and Applications July 2005,
878	              pp. 41-48.

880	   [krishnamurthy2008]
881	              Krishnamurthy, S., El-Ansary, S., Aurell, E., and S.
882	              Haridi, "Comparing maintenance strategies for overlays",
883	              In Proc. of 16th Euromicro Conference on Parallel,
884	              Distributed and Network-Based Processing Feb. 2008, pp.
885	              473-482.

887	   [li2004]   Li, J., Strinbling, J., Gil, T., and M. Kaashoek,
888	              "Comparing the performance of distributed hash tables
889	              under churn", In Proc. of the 3rd International Workshop
890	              on Peer-to-Peer Systems 2004.

892	   [liben-nowell2002]
893	              Liben-Nowell, D., Balakrishnan, H., and D. Karger,
894	              "Observations on the dynamic evolution of peer-to-peer
895	              networks", In Proc. of the First International Workshop on
896	              Peer-to-Peer Systems March 2002.

898	   [maenpaa2009]
899	              Maenpaa, J. and G. Camarillo, "A study on maintenance
900	              operations in a Chord-based Peer-to-Peer Session
901	              Initiation Protocol overlay network", In Proc. of IPDPS
902	              2009 May 2009.

904	   [mahajan2003]
905	              Mahajan, R., Castro, M., and A. Rowstron, "Controlling the
906	              cost of reliability in peer-to-peer overlays", In
907	              Proceedings of the 2nd International Workshop on Peer-to-
908	              Peer Systems Feb. 2003.

910	   [rhea2004]
911	              Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz,
912	              "Handling churn in a DHT", In Proc. of the USENIX Annual
913	              Techincal Conference June 2004.

915	Authors' Addresses

917	   Jouni Maenpaa
918	   Ericsson
919	   Hirsalantie 11
920	   Jorvas  02420
921	   Finland

923	   Email: Jouni.Maenpaa@ericsson.com

925	   Gonzalo Camarillo
926	   Ericsson
927	   Hirsalantie 11
928	   Jorvas  02420
929	   Finland

931	   Email: Gonzalo.Camarillo@ericsson.com
932	   Jani Hautakorpi
933	   Ericsson
934	   Hirsalantie 11
935	   Jorvas  02420
936	   Finland

938	   Email: Jani.Hautakorpi@ericsson.com