idnits 2.17.1 

draft-maenpaa-p2psip-self-tuning-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** The document seems to lack a License Notice according IETF Trust
     Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009
     Section 6.b -- however, there's a paragraph with a matching beginning.
     Boilerplate error?

     (You're using the IETF Trust Provisions' Section 6.b License Notice from
     12 Feb 2009 rather than one of the newer Notices.  See
     https://trustee.ietf.org/license-info/.)


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to use 'NOT RECOMMENDED' as an RFC 2119 keyword, but
     does not include the phrase in its RFC 2119 key words list.

  -- The document date (October 26, 2009) is 5294 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Outdated reference: A later version (-26) exists of
     draft-ietf-p2psip-base-05

  == Outdated reference: A later version (-09) exists of
     draft-ietf-p2psip-concepts-02

  ** Downref: Normative reference to an Informational draft:
     draft-ietf-p2psip-concepts (ref. 'I-D.ietf-p2psip-concepts')

  ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489)


     Summary: 3 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	P2PSIP Working Group                                          J. Maenpaa
3	Internet-Draft                                              G. Camarillo
4	Intended status: Standards Track                           J. Hautakorpi
5	Expires: April 29, 2010                                         Ericsson
6	                                                        October 26, 2009

8	  A Self-tuning Distributed Hash Table (DHT) for REsource LOcation And
9	                           Discovery (RELOAD)
10	                draft-maenpaa-p2psip-self-tuning-01.txt

12	Status of this Memo

14	   This Internet-Draft is submitted to IETF in full conformance with the
15	   provisions of BCP 78 and BCP 79.

17	   Internet-Drafts are working documents of the Internet Engineering
18	   Task Force (IETF), its areas, and its working groups.  Note that
19	   other groups may also distribute working documents as Internet-
20	   Drafts.

22	   Internet-Drafts are draft documents valid for a maximum of six months
23	   and may be updated, replaced, or obsoleted by other documents at any
24	   time.  It is inappropriate to use Internet-Drafts as reference
25	   material or to cite them other than as "work in progress."

27	   The list of current Internet-Drafts can be accessed at
28	   http://www.ietf.org/ietf/1id-abstracts.txt.

30	   The list of Internet-Draft Shadow Directories can be accessed at
31	   http://www.ietf.org/shadow.html.

33	   This Internet-Draft will expire on April 29, 2010.

35	Copyright Notice

37	   Copyright (c) 2009 IETF Trust and the persons identified as the
38	   document authors.  All rights reserved.

40	   This document is subject to BCP 78 and the IETF Trust's Legal
41	   Provisions Relating to IETF Documents in effect on the date of
42	   publication of this document (http://trustee.ietf.org/license-info).
43	   Please review these documents carefully, as they describe your rights
44	   and restrictions with respect to this document.

46	Abstract

48	   REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P)
49	   signaling protocol that provides an overlay network service.  Peers
50	   in a RELOAD overlay network collectively run an overlay algorithm to
51	   organize the overlay, and to store and retrieve data.  This document
52	   describes how the default topology plugin of RELOAD can be extended
53	   to support self-tuning, that is, to adapt to changing operating
54	   conditions such as churn and network size.

56	Table of Contents

58	   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
59	   2.  Terminology  . . . . . . . . . . . . . . . . . . . . . . . . .  3
60	   3.  Introduction to Stabilization in DHTs  . . . . . . . . . . . .  5
61	     3.1.  Reactive vs. Periodic Stabilization  . . . . . . . . . . .  5
62	     3.2.  Configuring Periodic Stabilization . . . . . . . . . . . .  6
63	     3.3.  Adaptive Stabilization . . . . . . . . . . . . . . . . . .  7
64	   4.  Introduction to Chord  . . . . . . . . . . . . . . . . . . . .  7
65	   5.  Extending Chord-reload to Support Self-tuning  . . . . . . . .  9
66	     5.1.  Update Requests  . . . . . . . . . . . . . . . . . . . . . 10
67	     5.2.  Neighbor Stabilization . . . . . . . . . . . . . . . . . . 10
68	     5.3.  Finger Stabilization . . . . . . . . . . . . . . . . . . . 11
69	     5.4.  Adjusting Finger Table Size  . . . . . . . . . . . . . . . 11
70	     5.5.  Detecting Partitioning . . . . . . . . . . . . . . . . . . 12
71	     5.6.  Leaving the Overlay  . . . . . . . . . . . . . . . . . . . 12
72	       5.6.1.  Contents of the Leave Message  . . . . . . . . . . . . 12
73	   6.  Self-tuning Chord Parameters . . . . . . . . . . . . . . . . . 13
74	     6.1.  Estimating Overlay Size  . . . . . . . . . . . . . . . . . 13
75	     6.2.  Determining Routing Table Size . . . . . . . . . . . . . . 14
76	     6.3.  Estimating Failure Rate  . . . . . . . . . . . . . . . . . 14
77	       6.3.1.  Detecting Failures . . . . . . . . . . . . . . . . . . 15
78	     6.4.  Estimating Join Rate . . . . . . . . . . . . . . . . . . . 16
79	     6.5.  Calculating the Stabilization Interval . . . . . . . . . . 16
80	   7.  Overlay Configuration Document Extension . . . . . . . . . . . 17
81	   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 17
82	   9.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 17
83	   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18
84	     10.1. Normative References . . . . . . . . . . . . . . . . . . . 18
85	     10.2. Informative References . . . . . . . . . . . . . . . . . . 18
86	   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20

88	1.  Introduction

90	   REsource LOcation And Discovery (RELOAD) [I-D.ietf-p2psip-base] is a
91	   peer-to-peer signaling protocol that can be used to maintain an
92	   overlay network, and to store data in and retrieve data from the
93	   overlay.  For interoperability reasons, RELOAD specifies one overlay
94	   algorithm, called chord-reload, that is mandatory to implement.  This
95	   document extends the chord-reload algorithm by introducing self-
96	   tuning behavior.

98	   DHT-based overlay networks are self-organizing, scalable and
99	   reliable.  However, these features come at a cost: peers in the
100	   overlay network need to consume network bandwidth to maintain routing
101	   state.  Most DHTs use a periodic stabilization routine to counter the
102	   undesirable effects of churn on routing.  To configure the parameters
103	   of a DHT, some characteristics such as churn rate and network size
104	   need to be known in advance.  These characteristics are then used to
105	   configure the DHT in a static fashion by using fixed values for
106	   parameters such as the size of the successor set, size of the routing
107	   table, and rate of maintenance messages.  The problem with this
108	   approach is that it is not possible to achieve a low failure rate and
109	   a low communication overhead by using fixed parameters.  Instead, a
110	   better approach is to allow the system to take into account the
111	   evolution of network conditions and adapt to them.  This document
112	   extends the mandatory-to-implement chord-reload algorithm by making
113	   it self-tuning.  Two main advantages of self-tuning are that users no
114	   longer need to tune every DHT parameter correctly for a given
115	   operating environment and that the system adapts to changing
116	   operating conditions.

118	   The remainder of this document is structured as follows: Section 2
119	   provides definitions of terms used in this document.  Section 3
120	   discusses alternative approaches to stabilization operations in DHTs,
121	   including reactive stabilization, periodic stabilization, and
122	   adaptive stabilization.  Section 4 gives an introduction to the Chord
123	   DHT algorithm.  Section 5 describes how this document extends the
124	   stabilization routine of the chord-reload algorithm.  Section 6
125	   describes how the stabilization rate and routing table size are
126	   calculated in an adaptive fashion.

128	2.  Terminology

130	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
131	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
132	   document are to be interpreted as described in RFC 2119 [RFC2119].

134	   This document uses the terminology and definitions from the Concepts
135	   and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts]
136	   draft.

138	   Chord Ring:  The Chord DHT orders identifiers on an identifier circle
139	      of size 2^numBitsInNodeId (numBitsInNodeId is the number of bits
140	      in node identifiers).  This identifier circle is called the Chord
141	      ring.

143	   DHT:  Distributed Hash Tables (DHTs) are a class of decentralized
144	      distributed systems that provide a lookup service similar to a
145	      hash table.  Given a key, any participating peer can retrieve the
146	      value associated with that key.  The responsibility for
147	      maintaining the mapping from keys to values is distributed among
148	      the peers.

150	   Finger Table:  A data structure with up to numBitsInNodeId entries
151	      maintained by each peer in a Chord-based overlay.  The ith entry
152	      in the finger table of peer n contains the identity of the first
153	      peer that succeeds n by at least 2^(numBitsInNodeId-i) on the
154	      Chord ring.  This peer is called the ith finger of peer n.  As an
155	      example, the first entry in the finger table of peer n contains a
156	      peer half-way around the Chord ring from peer n.  The purpose of
157	      the finger table is to accelerate lookups.

159	   log2(N):  Logarithm of N with base 2.

161	   n.id:  Peer-ID of peer n.

163	   Neighborhood Set:  Consists of successor and predecessor lists.

165	   numBitsInNodeId:  Number of bits in a Node-ID.

167	   O(g(n)):  Informally, saying that some equation f(n) = O(g(n)) means
168	      that f(n) is less than some constant multiple of g(n).

170	   Omega(g(n)):  Informally, saying that some equation f(n) =
171	      Omega(g(n)) means that f(n) is more than some constant multiple of
172	      g(n).

174	   Predecessor List:  A data structure containing the predecessors of a
175	      peer on the Chord ring.

177	   Responsible Peer:  The peer storing the original copy of a resource.
178	      In Chord, this is the first peer whose identifier is equal to or
179	      follows the identifier of the resource on the Chord ring.  Also
180	      the term "Root Node" can be used for this concept.

182	   Routing Table:  The set of peers that a node can use to route overlay
183	      messages.  The routing table consists of the finger table,
184	      successor list and predecessor list.

186	   Successor List:  A data structure containing the first r successors
187	      of a peer on the Chord ring.

189	3.  Introduction to Stabilization in DHTs

191	   DHT-based peer-to-peer overlay networks are self-organizing, scalable
192	   and reliable.  However, these features come at a cost: peers in the
193	   overlay network need to consume network bandwidth to maintain routing
194	   state.  DHTs use stabilization routines to counter the undesirable
195	   effects of churn on routing.  The purpose of stabilization is to keep
196	   the routing information of each peer in the overlay consistent with
197	   the constantly changing overlay topology.  There are two alternative
198	   approaches to stabilization: periodic and reactive [rhea2004].
199	   Periodic stabilization can either use a fixed stabilization rate or
200	   calculate the stabilization rate in an adaptive fashion.

202	3.1.  Reactive vs. Periodic Stabilization

204	   In reactive stabilization, a peer reacts to the loss of a peer in its
205	   neighborhood set or to the appearance of a new peer that should be
206	   added to its neighborhood set by sending a copy of its neighbor table
207	   to all peers in the neighborhood set.  Periodic recovery, in
208	   contrast, takes place independently of changes in the neighborhood
209	   set.  In periodic recovery, a peer periodically shares its
210	   neighborhood set with each or a subset of the members of that set.

212	   The chord-reload algorithm [I-D.ietf-p2psip-base] supports both
213	   reactive and periodic stabilization.  It has been shown in [rhea2004]
214	   that reactive stabilization works well for small neighborhood sets
215	   (i.e., small overlays) and moderate churn.  However, in large-scale
216	   (e.g., 1000 peers or more [rhea2004]) or high-churn overlays,
217	   reactive stabilization runs the risk of creating a positive feedback
218	   cycle, which can eventually result in congestion collapse.  In
219	   [rhea2004], it is shown that a 1000-peer overlay under churn uses
220	   significantly less bandwidth and has lower latencies when periodic
221	   stabilization is used than when reactive stabilization is used.
222	   Although in the experiments carried out in [rhea2004], reactive
223	   stabilization performed well when there was no churn, its bandwidth
224	   use was observed to jump dramatically under churn.  At higher churn
225	   rates and larger scale overlays, periodic stabilization uses less
226	   bandwidth and the resulting lower contention for the network leads to
227	   lower latencies.  For this reason, most DHTs such as CAN [CAN], Chord
228	   [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic
229	   stabilization [ghinita2006].  As an example, the first version of
230	   Bamboo used reactive stabilization, which caused Bamboo to suffer
231	   from degradation in performance under churn.  To fix this problem,
232	   Bamboo was modified to use periodic stabilization.

234	   In Chord, periodic stabilization is typically done both for
235	   successors and fingers.  An alternative strategy is analyzed in
236	   [krishnamurthy2008].  In this strategy, called the correction-on-
237	   change maintenance strategy, a peer periodically stabilizes its
238	   successors but does not do so for its fingers.  Instead, finger
239	   pointers are stabilized in a reactive fashion.  The results obtained
240	   in [krishnamurthy2008] imply that although the correction-on-change
241	   strategy works well when churn is low, periodic stabilization
242	   outperforms the correction-on-change strategy when churn is high.

244	3.2.  Configuring Periodic Stabilization

246	   When periodic stabilization is used, one faces the problem of
247	   selecting an appropriate execution rate for the stabilization
248	   procedure.  If the execution rate of periodic stabilization is high,
249	   changes in the system can be quickly detected, but at the
250	   disadvantage of increased communication overhead.  Alternatively, if
251	   the stabilization rate is low and the churn rate is high, routing
252	   tables become inaccurate and DHT performance deteriorates.  Thus, the
253	   problem is setting the parameters so that the overlay achieves the
254	   desired reliability and performance even in challenging conditions,
255	   such as under heavy churn.  This naturally results in high cost
256	   during periods when the churn level is lower than expected, or
257	   alternatively, poor performance or even network partitioning in worse
258	   than expected conditions.

260	   In addition to selecting an appropriate stabilization interval,
261	   regardless of whether periodic stabilization is used or not, an
262	   appropriate size needs to be selected for the neighborhood set and
263	   for the routing table.

265	   The current approach is to configure overlays statically.  This works
266	   in situations where perfect information about the future is
267	   available.  In situations where the operating conditions of the
268	   network are known in advance and remain static throughout the
269	   lifetime of the system, it is possible to choose fixed optimal values
270	   for parameters such as stabilization rate, neighborhood set size and
271	   routing table size.  However, if the operating conditions (e.g., the
272	   size of the overlay and its churn rate) do not remain static but
273	   evolve with time, it is not possible to achieve both a low lookup
274	   failure rate and a low communication overhead by using fixed
275	   parameters [ghinita2006].

277	   As an example, to configure the Chord DHT algorithm, one needs to
278	   select values for the following parameters: size of successor list,
279	   stabilization interval, and size of the finger table.  To select an
280	   appropriate value for the stabilization interval, one needs to know
281	   the expected churn rate and overlay size.  According to
282	   [liben-nowell2002], a Chord network in a ring-like state remains in a
283	   ring-like state as long as peers send Omega(log2^2(N)) messages
284	   before N new peers join or N/2 peers fail.  Thus, in a 500-peer
285	   overlay churning at a rate such that one peer joins and one peer
286	   leaves the network every 30 seconds, an appropriate stabilization
287	   interval would be on the order of 93s.  According to [Chord], the
288	   size of the successor list and finger table should be on the order of
289	   log2(N).  Having a successor list of size O(log2(N)) makes it
290	   unlikely that a peer will lose all of its successors, which would
291	   cause the Chord ring to become disconnected.  Thus, in a 500-peer
292	   network each peer should maintain on the order of nine successors and
293	   fingers.  However, if the churn rate doubles and the network size
294	   remains unchanged, the stabilization rate should double as well.
295	   That is, the appropriate maintenance interval would now be on the
296	   order of 46s.  On the other hand, if the churn rate becomes e.g. six-
297	   fold and the size of the network grows to 2000 peers, on the order of
298	   eleven fingers and successors should be maintained and the
299	   stabilization interval should be on the order of 42s.  If one
300	   continued using the old values, this could result in inaccurate
301	   routing tables, network partitioning, and deteriorating performance.

303	3.3.  Adaptive Stabilization

305	   A self-tuning DHT takes into consideration the continuous evolution
306	   of network conditions and adapts to them.  In a self-tuning DHT, each
307	   peer collects statistical data about the network and dynamically
308	   adjusts its stabilization rate, neighborhood set size, and finger
309	   table size based on the analysis of the data [ghinita2006].
310	   Reference [mahajan2003] shows that by using self-tuning, it is
311	   possible to achieve high reliability and performance even in adverse
312	   conditions with low maintenance cost.  Adaptive stabilization has
313	   been shown to outperform periodic stabilization in terms of both
314	   lookup failures and communication overhead [ghinita2006].

316	4.  Introduction to Chord

318	   Chord [Chord] is a structured P2P algorithm that uses consistent
319	   hashing to build a DHT out of several independent peers.  Consistent
320	   hashing assigns each peer and resource a fixed-length identifier.
321	   Peers use SHA-1 as the base hash fuction to generate the identifiers.

323	   As specified in RELOAD base, the length of the identifiers is
324	   numBitsInNodeId=128 bits.  The identifiers are ordered on an
325	   identifier circle of size 2^numBitsInNodeId.  On the identifier
326	   circle, key k is assigned to the first peer whose identifier equals
327	   or follows the identifier of k in the identifier space.  The
328	   identifier circle is called the Chord ring.

330	   Different DHTs differ significantly in performance when bandwidth is
331	   limited.  It has been shown that when compared to other DHTs, the
332	   advantages of Chord include that it uses bandwidth efficiently and
333	   can achieve low lookup latencies at little cost [li2004].

335	   A simple lookup mechanism could be implemented on a Chord ring by
336	   requiring each peer to only know how to contact its current successor
337	   on the identifier circle.  Queries for a given identifier could then
338	   be passed around the circle via the successor pointers until they
339	   encounter the first peer whose identifier is equal to or larger than
340	   the desired identifier.  Such a lookup scheme uses a number of
341	   messages that grows linearly with the number of peers.  To reduce the
342	   cost of lookups, Chord maintains also additional routing information;
343	   each peer n maintains a data structure with up to numBitsInNodeId
344	   entries, called the finger table.  The first entry in the finger
345	   table of peer n contains the peer half-way around the ring from peer
346	   n.  The second entry contains the peer that is 1/4th of the way
347	   around, the third entry the peer that is 1/8th of the way around,
348	   etc.  In other words, the ith entry in the finger table at peer n
349	   contains the identity of the first peer s that succeeds n by at least
350	   2^(numBitsInNodeId-i) on the Chord ring.  This peer is called the ith
351	   finger of peer n.  The interval between two consecutive fingers is
352	   called a finger interval.  The ith finger interval of peer n covers
353	   the range [n.id + 2^(numBitsInNodeId-i), n.id +
354	   2^(numBitsInNodeId-i+1)) on the Chord ring.  In an N-peer network,
355	   each peer maintains information about O(log2(N)) other peers in its
356	   finger table.  As an example, if N=100000, it is sufficient to
357	   maintain 17 fingers.

359	   Chord needs all peers' successor pointers to be up to date in order
360	   to ensure that lookups produce correct results as the set of
361	   participating peers changes.  To achieve this, peers run a
362	   stabilization protocol periodically in the background.  The
363	   stabilization protocol of the original Chord algorithm uses two
364	   operations: successor stabilization and finger stabilization.
365	   However, the Chord algorithm of RELOAD base defines two additional
366	   stabilization components, as will be discussed below.

368	   To increase robustness in the event of peer failures, each Chord peer
369	   maintains a successor list of size r, containing the peer's first r
370	   successors.  The benefit of successor lists is that if each peer
371	   fails independently with probability p, the probability that all r
372	   successors fail simultaneously is only p^r.

374	   The original Chord algorithm maintains only a single predecessor
375	   pointer.  However, multiple predecessor pointers (i.e., a predecessor
376	   list) can be maintained to speed up recovery from predecessor
377	   failures.  The routing table of a peer consists of the successor
378	   list, finger table, and predecessor list.

380	5.  Extending Chord-reload to Support Self-tuning

382	   This section describes how the mandatory-to-implement chord-reload
383	   algorithm defined in RELOAD base [I-D.ietf-p2psip-base] can be
384	   extended to support self-tuning.

386	   The chord-reload algorithm supports both reactive and periodic
387	   recovery strategies.  When the self-tuning mechanisms defined in this
388	   document are used, the periodic recovery strategy MUST be used.
389	   Further, chord-reload specifies that at least three predecessors and
390	   three successors need to be maintained.  When the self-tuning
391	   mechanisms are used, the appropriate sizes of the successor list and
392	   predecessor list are determined in an adaptive fashion based on the
393	   estimated network size, as will be described in Section 6.

395	   As specified in RELOAD base, each peer MUST maintain a stabilization
396	   timer.  When the stabilization timer fires, the peer MUST restart the
397	   timer and carry out the overlay stabilization routine.  Overlay
398	   stabilization has four components in chord-reload:

400	   1.  Update the neighbor table.  We refer to this as neighbor
401	       stabilization.
402	   2.  Refreshing the finger table.  We refer to this as finger
403	       stabilization.
404	   3.  Adjusting finger table size.
405	   4.  Detecting partitioning.  We refer to this as strong
406	       stabilization.

408	   As specified in RELOAD base [I-D.ietf-p2psip-base], a peer sends
409	   periodic messages as part of the neighbor stabilization, finger
410	   stabilization, and strong stabilization routines.  In neighbor
411	   stabilization, a peer periodically sends an Update request to every
412	   peer in its Connection Table.  The default time is every ten minutes.
413	   In finger stabilization, a peer periodically searches for new peers
414	   to include in its finger table.  This time defaults to one hour.
415	   This document specifies how the neighbor stabilization and finger
416	   stabilization intervals can be determined in an adaptive fashion
417	   based on the operating conditions of the overlay.  The subsections
418	   below describe how this document extends the four components of
419	   stabilization.

421	5.1.  Update Requests

423	   The neighbor and finger stabilization procedures are implemented
424	   using the Update request defined in RELOAD base
425	   [I-D.ietf-p2psip-base].  RELOAD base defines three types of Update
426	   requests: 'peer_ready', 'neighbors', and 'full'.  Regardless of the
427	   type, all Update requests include an 'uptime' field.  Since the self-
428	   tuning extensions require information on the uptimes of peers in the
429	   routing table, the sender of an Update request MUST include its
430	   current uptime in seconds in the 'uptime' field.

432	   When self-tuning is used, each peer decides independently the
433	   appropriate size for the successor list, predecessor list and finger
434	   table.  Thus, the 'predecessors', 'successors', and 'fingers' fields
435	   included in RELOAD Update requests are of variable length.  As
436	   specified in RELOAD [I-D.ietf-p2psip-base], variable length fields
437	   are on the wire preceded by length bytes.  In the case of the
438	   successor list, predecessor list, and finger table, there are two
439	   length bytes (allowing lengths up to 2^16-1).  The number of NodeId
440	   structures included in each field can be calculated based on the
441	   length bytes since the size of a single NodeId structure is 16 bytes.
442	   If a peer receives more entries than fit into its successor list,
443	   predecessor list or finger table, the peer MUST ignore the extra
444	   entries.  If a peer receives less entries than it currently has in
445	   its own data structure, the peer MUST NOT drop the extra entries from
446	   its data structure.

448	5.2.  Neighbor Stabilization

450	   In the neighbor stabilization operation of chord-reload, a peer
451	   periodically sends an Update request to every peer in its Connection
452	   Table.  In a small, low-churn overlay, the amount of traffic this
453	   process generates is typically acceptable.  However, in a large-scale
454	   overlay churning at a moderate or high churn rate, the traffic load
455	   may no longer be acceptable since the size of the connection table is
456	   large and the stabilization interval relatively short.  The self-
457	   tuning mechanisms described in this document are especially designed
458	   for overlays of the latter type.  Therefore, when the self-tuning
459	   mechanisms are used, each peer MUST only send a periodic Update
460	   request to its first predecessor and first successor on the Chord
461	   ring.

463	   The neighbor stabilization routine MUST be executed when the
464	   stabilization timer fires.  To begin the neighbor stabilization
465	   routine, a peer MUST send an Update request to its first successor
466	   and its first predecessor.  The type of the Update request MUST be
467	   'neighbors'.  The Update request MUST include the successor and
468	   predecessor lists of the sender.  If a peer receiving such an Update
469	   request learns from the predecessor and successor lists included in
470	   the request that new peers can be included in its neighborhood set,
471	   it MUST send Attach requests to the new peers.

473	   After a new peer has been added to the predecessor or successor list,
474	   an Update request of type 'peer_ready' MUST be sent to the new peer.
475	   This allows the new peer to insert the sender into its neighborhood
476	   set.

478	5.3.  Finger Stabilization

480	   In the finger stabilization routine of chord-reload, a peer
481	   periodically searches for new peers to replace invalid (that is,
482	   repeated or failed) entries in the finger table.  Chord-reload
483	   provides two possible methods for searching for new peers to include
484	   in the finger table.  In alternative 1, a peer selects one entry from
485	   among the invalid entries in its finger table each time the
486	   stabilization timer fires and sends a Ping request to the selected
487	   entry.  In alternative 2, a peer sends a RouteQuery request to all
488	   invalid entries in the finger table.  After having received
489	   RouteQuery responses, the peer sends a Ping to those entries for
490	   which the RouteQuery response came from a peer not already present in
491	   the routing table.  When the self-tuning mechanisms defined in this
492	   draft are being used in the overlay, alternative 1 MUST be used since
493	   the traffic load it generates is lower and thus more appropriate for
494	   large-scale overlays experiencing churn.

496	   Immediately after a new peer has been added to the finger table, a
497	   Probe request MUST be sent to the new peer to fetch its uptime.  The
498	   requested_info field of the Probe request MUST be set to contain the
499	   ProbeInformationType 'uptime' defined in RELOAD base
500	   [I-D.ietf-p2psip-base].

502	5.4.  Adjusting Finger Table Size

504	   The chord-reload algorithm defines how a peer can make sure that the
505	   finger table is appropriately sized to allow for efficient routing.
506	   Since the self-tuning mechanisms specified in this document produce a
507	   network size estimate, this estimate can be directly used to
508	   calculate the optimal size for the finger table.  This mechanism MUST
509	   be used instead of the one specified by chord-reload.  A peer MUST
510	   use the network size estimate to determine whether it needs to adjust
511	   the size of its finger table each time when the stabilization timer
512	   fires.  The way this is done is explained in Section 6.2.

514	5.5.  Detecting Partitioning

516	   This document does not require any changes to the mechanism chord-
517	   reload uses to detect network partitioning.

519	5.6.  Leaving the Overlay

521	   This document extends the Leave request defined in RELOAD base
522	   [I-D.ietf-p2psip-base].  As specified in RELOAD base, a leaving peer
523	   SHOULD send a Leave request to all members of its neighbor table
524	   prior to leaving the overlay.  When the self-tuning extensions are
525	   used, the Leave requests sent to the first predecessor and first
526	   successor contain additional data.  The Leave request that is sent to
527	   the first successor MUST contain the predecessor list of the leaving
528	   peer.  The Leave request that is sent to the first predecessor MUST
529	   contain the successor list of the leaving peer.  If the first
530	   successor can identify better predecessors than are already included
531	   in its predecessor list by investigating the predecessor list it
532	   receives from the leaving peer, it MUST send Attach requests to them.
533	   Similarly, if the first predecessor identifies better successors by
534	   investigating the successor list it receives from the leaving peer,
535	   it MUST send Attach requests to them.

537	5.6.1.  Contents of the Leave Message

539	   The Leave request defined in RELOAD base [I-D.ietf-p2psip-base] has a
540	   field for overlay specific data.  When the self-tuning extensions are
541	   being used, this overlay_specific_data field MUST contain the
542	   ChordLeaveData structure defined below:

544	                 enum { reserved (0),
545	                        from_succ(1), from_pred(2), (255) }
546	                      ChordLeaveType;

548	                 struct {
549	                   ChordLeaveType         type;

551	                   select(type) {
552	                     case from_succ:
553	                       NodeId              successors<0..2^16-1>;
554	                     case from_pred:
555	                       NodeId              predecessors<0..2^16-1>;
556	                   };
557	                 } ChordLeaveData;

559	   The 'type' field indicates whether the Leave request was sent by a
560	   predecessor or a successor of the recipient:

562	   from_succ
563	      The Leave request was sent by a successor.

565	   from_pred
566	      The Leave request was sent by a predecessor.

568	   If the type of the request is 'from_succ', the contents will be:

570	   successors
571	      The sender's successor list.

573	   If the type of the request is 'from_pred', the contents will be:

575	   predecessors
576	      The sender's predecessor list.

578	6.  Self-tuning Chord Parameters

580	   This section specifies how to determine an appropriate stabilization
581	   rate and routing table size in an adaptive fashion.  The proposed
582	   mechanism is based on [mahajan2003], [liben-nowell2002], and
583	   [ghinita2006].  To calculate an appropriate stabilization rate, the
584	   values of three parameters MUST be estimated: overlay size N, failure
585	   rate U, and join rate L. To calculate an appropriate routing table
586	   size, the estimated network size N can be used.  Peers in the overlay
587	   MUST re-calculate the values of the parameters to self-tune the
588	   chord-reload algorithm at the end of each stabilization period before
589	   re-starting the stabilization timer.

591	6.1.  Estimating Overlay Size

593	   Techniques for estimating the size of an overlay network have been
594	   proposed for instance in [mahajan2003], [horowitz2003],
595	   [kostoulas2005], [binzenhofer2006], and [ghinita2006].  In Chord, the
596	   density of peer identifiers in the neighborhood set can be used to
597	   produce an estimate of the size of the overlay, N [mahajan2003].
598	   Since peer identifiers are picked randomly with uniform probability
599	   from the numBitsInNodeId-bit identifier space, the average distance
600	   between peer identifiers in the successor set is
601	   (2^numBitsInNodeId)/N.

603	   To estimate the overlay network size, a peer MUST compute the average
604	   inter-peer distance d between the successive peers starting from the
605	   most distant predecessor and ending to the most distant successor in
606	   the successor list.  The estimated network size MUST be calculated
607	   as:

609	                         2^numBitsInNodeId
610	                    N = -------------------
611	                                d

613	   This estimate has been found to be accurate within 15% of the real
614	   network size [ghinita2006].  Of course, the size of the neighborhood
615	   set affects the accuracy of the estimate.

617	   During the join process, a joining peer fills its routing table by
618	   sending a series of Ping and Attach requests, as specified in RELOAD
619	   base [I-D.ietf-p2psip-base].  Thus, a joining peer immediately has
620	   enough information at its disposal to calculate an estimate of the
621	   network size.

623	6.2.  Determining Routing Table Size

625	   As specified in RELOAD base, the finger table must contain at least
626	   16 entries.  When the self-tuning mechanisms are used, the size of
627	   the finger table MUST be set to max(log2(N), 16) using the estimated
628	   network size N.

630	   The size of the successor list MUST be set to log2(N).  An
631	   implementation MAY place a lower limit on the size of the successor
632	   list.  As an example, the implementation might require the size of
633	   the successor list to be always at least three.

635	   A peer MAY choose to maintain a fixed-size predecessor list with only
636	   three entries as specified in RELOAD base.  However, it is
637	   RECOMMENDED that a peer maintains log2(N) predecessors.

639	6.3.  Estimating Failure Rate

641	   A typical approach is to assume that peers join the overlay according
642	   to a Poisson process with rate L and leave according to a Poisson
643	   process with rate parameter U [mahajan2003].  The value of U can be
644	   estimated using peer failures in the finger table and neighborhood
645	   set [mahajan2003].  If peers fail with rate U, a peer with M unique
646	   peer identifiers in its routing table should observe K failures in
647	   time K/(M*U).  Every peer in the overlay MUST maintain a history of
648	   the last K failures.  The current time MUST be inserted into the
649	   history when the peer joins the overlay.  The estimate of U MUST be
650	   calculated as:

652	                             k
653	                     U = --------,
654	                          M * Tk

656	   where M is the number of unique peer identifiers in the routing
657	   table, Tk is the time between the first and the last failure in the
658	   history, and k is the number of failures in the history.  If k is
659	   smaller than K, the estimate MUST be computed as if there was a
660	   failure at the current time.  It has been shown that an estimate
661	   calculated in a similar manner is accurate within 17% of the real
662	   value of U [ghinita2006].

664	   The size of the failure history K affects the accuracy of the
665	   estimate of U. One can increase the accuracy by increasing K.
666	   However, this has the side effect of decreasing responsiveness to
667	   changes in the failure rate.  On the other hand, a small history size
668	   may cause a peer to overreact each time a new failure occurs.  In
669	   [ghinita2006], K is set 25% of the routing table size.  Use of this
670	   approach is RECOMMENDED.

672	6.3.1.  Detecting Failures

674	   A new failure MUST be inserted to the failure history in the
675	   following cases:

677	   1.  A Leave request is received from a neigbhor.
678	   2.  No data has been received from a peer in the Connection Table for
679	       a specified amount of time.  RELOAD mandates the use of STUN
680	       [RFC5389] for keepalives.  STUN keepalives take the form of STUN
681	       Binding Indication transactions.  As specified in ICE
682	       [I-D.ietf-mmusic-ice], a peer sends a STUN Binding Indication if
683	       there has been no packet sent on a connection for Tr seconds.  Tr
684	       is configurable and has a default of 15 seconds.  Although STUN
685	       Binding Indications do not generate a response, the fact that a
686	       peer has failed can be learned from the lack of packets (Binding
687	       Indications or application protocol packets) received from the
688	       peer.  A peer MUST be considered to have failed if no packets
689	       have been received from it for the past max-silent-periods*Tr
690	       seconds. max-silent-periods is a new element in the overlay
691	       configuration document.  The default value is three.

693	   As an alternative to relying on STUN keepalives to detect peer
694	   failure, a peer could send additional, frequent RELOAD messages to
695	   every peer in its Connection Table.  These messages could be Update
696	   requests, in which case they would serve two purposes: detecting peer
697	   failure and stabilization.  However, as the cost of this approach can
698	   be very high in terms of bandwidth consumption and traffic load,
699	   especially in large-scale overlays experiencing churn, its use is NOT
700	   RECOMMENDED.

702	6.4.  Estimating Join Rate

704	   Reference [ghinita2006] proposes that a peer can estimate the join
705	   rate based on the uptime of the peers in its routing table.  An
706	   increase in peer join rate will be reflected by a decrease in the
707	   average age of peers in the routing table.  Thus, each peer MUST
708	   maintain an array of the ages of the peers in its routing table
709	   sorted in increasing order.  Using this information, an estimate of
710	   the global peer join rate L MUST be calculated as:

712	                         N           1
713	                    L = --- * ---------------,
714	                         4     Ages[rsize/4]

716	   where Ages is an array containing the ages of the peers in the
717	   routing table sorted in increasing order and rsize is the size of the
718	   routing table.  It is RECOMMENDED that only the ages of the 25% of
719	   the youngest peers in the routing table (i.e., the 25th percentile)
720	   are used to reduce the bias that a small number of peers with very
721	   old ages can cause [ghinita2006].  It has been shown that the
722	   estimate obtained by using this method is accurate within 22% of the
723	   real join rate [ghinita2006].  Of course, the size of the routing
724	   table affects the accuracy.

726	   In order for this mechanism to work, peers need to exchange
727	   information about the time they have been present in the overlay.
728	   Peers receive the uptimes of their successors and predecessors during
729	   the stabilization operations since all Update requests carry uptime
730	   values.  A joining peer learns the uptime of the admitting peer since
731	   it receives an Update from the admitting peer during the join
732	   procedure.  Peers learn the uptimes of new fingers since they can
733	   fetch the uptime using a Probe request after having attached to the
734	   new finger.

736	6.5.  Calculating the Stabilization Interval

738	   According to [liben-nowell2002], a Chord network in a ring-like state
739	   remains in a ring-like state as long as peers send Omega(log2^2(N))
740	   messages before N new peers join or N/2 peers fail.  We can use the
741	   estimate of peer failure rate, U, to calculate the time Tf in which
742	   N/2 peers fail:

744	                                  1
745	                           Tf = ------
746	                                 2*U

748	   Based on this estimate, a stabilization interval Tstab-1 MUST be
749	   calculated as:

751	                                         Tf
752	                           Tstab-1 = -----------
753	                                      log2^2(N)

755	   On the other hand, the estimated join rate L can be used to calculate
756	   the time in which N new peers join the overlay.  Based on the
757	   estimate of L, a stabilization interval Tstab-2 MUST be calculated
758	   as:

760	                                             N
761	                            Tstab-2 = ---------------
762	                                       L * log2^2(N)

764	   Finally, the actual stabilization interval Tstab that MUST be used
765	   can be obtained by taking the minimum of Tstab-1 and Tstab-2.

767	   The results obtained in [maenpaa2009] indicate that making the
768	   stabilization interval too small has the effect of making the overlay
769	   less stable (e.g., in terms of detected loops and path failures).
770	   Thus, a lower limit should be used for the stabilization period.
771	   Based on the results in [maenpaa2009], a lower limit of 15s is
772	   RECOMMENDED, since using a stabilization period smaller than this
773	   will with a high probability cause too much traffic in the overlay.

775	7.  Overlay Configuration Document Extension

777	   This document extends the RELOAD overlay configuration document by
778	   adding one new element, "max-silent-periods", inside each
779	   "configuration" element.

781	   max-silent-periods:  A peer MUST be considered to have failed if no
782	      packets have been received from it for the past max-silent-
783	      periods*Tr seconds (Tr is defined in [I-D.ietf-mmusic-ice]).  The
784	      default value of the max-silent-periods element is 3.

786	8.  Security Considerations

788	   There are no new security considerations introduced in this document.
789	   The security considerations of RELOAD [I-D.ietf-p2psip-base] apply.

791	9.  IANA Considerations

793	   This document has no actions for IANA.

795	10.  References

797	10.1.  Normative References

799	   [I-D.ietf-mmusic-ice]
800	              Rosenberg, J., "Interactive Connectivity Establishment
801	              (ICE): A Protocol for Network Address  Translator (NAT)
802	              Traversal for Offer/Answer Protocols",
803	              draft-ietf-mmusic-ice-19 (work in progress), October 2007.

805	   [I-D.ietf-p2psip-base]
806	              Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and
807	              H. Schulzrinne, "REsource LOcation And Discovery (RELOAD)
808	              Base Protocol", draft-ietf-p2psip-base-05 (work in
809	              progress), October 2009.

811	   [I-D.ietf-p2psip-concepts]
812	              Bryan, D., Matthews, P., Shim, E., Willis, D., and S.
813	              Dawkins, "Concepts and Terminology for Peer to Peer SIP",
814	              draft-ietf-p2psip-concepts-02 (work in progress),
815	              July 2008.

817	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
818	              Requirement Levels", BCP 14, RFC 2119, March 1997.

820	   [RFC5389]  Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,
821	              "Session Traversal Utilities for NAT (STUN)", RFC 5389,
822	              October 2008.

824	10.2.  Informative References

826	   [CAN]      Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S.
827	              Schenker, "A scalable content-addressable network", In
828	              Proc. of the 2001 Conference on Applications,
829	              Technologies, Architectures and Protocols for Computer
830	              Communications 2001, pp. 161.172.

832	   [Chord]    Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.,
833	              Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A
834	              Scalable Peer-to-peer Lookup Service for Internet
835	              Applications", IEEE/ACM Transactions on Networking Volume
836	              11, Issue 1, 17-32, Feb 2003.

838	   [Pastry]   Rowstron, A. and P. Druschel, "Pastry: Scalable,
839	              Decentralized Object Location and Routing for Large-Scale
840	              Peer-to-Peer Systems", In Proc. of the IFIP/ACM
841	              International Conference on Distribued Systems
842	              Platforms Nov. 2001, pp. 329-350.

844	   [binzenhofer2006]
845	              Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable
846	              algorithm to monitor chord-based P2P systems at runtime",
847	              20th International Parallel and Distributed Processing
848	              Symposium April 2006.

850	   [ghinita2006]
851	              Ghinita, G. and Y. Teo, "An adaptive stabilization
852	              framework for distributed hash tables", 20th International
853	              Parallel and Distributed Processing Symposium April 2006.

855	   [horowitz2003]
856	              Horowitz, K. and D. Malkhi, "Estimating network size from
857	              local information", Information Processing Letters Dec.
858	              2003, Volume 88, Issue 5, pp. 237-243.

860	   [kostoulas2005]
861	              Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and
862	              A. Demers, "Decentralized schemes for size estimation in
863	              large and dynamic groups", Fourth IEEE International
864	              Symposium on Network Computing and Applications July 2005,
865	              pp. 41-48.

867	   [krishnamurthy2008]
868	              Krishnamurthy, S., El-Ansary, S., Aurell, E., and S.
869	              Haridi, "Comparing maintenance strategies for overlays",
870	              In Proc. of 16th Euromicro Conference on Parallel,
871	              Distributed and Network-Based Processing Feb. 2008, pp.
872	              473-482.

874	   [li2004]   Li, J., Strinbling, J., Gil, T., and M. Kaashoek,
875	              "Comparing the performance of distributed hash tables
876	              under churn", In Proc. of the 3rd International Workshop
877	              on Peer-to-Peer Systems 2004.

879	   [liben-nowell2002]
880	              Liben-Nowell, D., Balakrishnan, H., and D. Karger,
881	              "Observations on the dynamic evolution of peer-to-peer
882	              networks", In Proc. of the First International Workshop on
883	              Peer-to-Peer Systems March 2002.

885	   [maenpaa2009]
886	              Maenpaa, J. and G. Camarillo, "A study on maintenance
887	              operations in a Chord-based Peer-to-Peer Session
888	              Initiation Protocol overlay network", In Proc. of IPDPS
889	              2009 May 2009.

891	   [mahajan2003]
892	              Mahajan, R., Castro, M., and A. Rowstron, "Controlling the
893	              cost of reliability in peer-to-peer overlays", In
894	              Proceedings of the 2nd International Workshop on Peer-to-
895	              Peer Systems Feb. 2003.

897	   [rhea2004]
898	              Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz,
899	              "Handling churn in a DHT", In Proc. of the USENIX Annual
900	              Techincal Conference June 2004.

902	Authors' Addresses

904	   Jouni Maenpaa
905	   Ericsson
906	   Hirsalantie 11
907	   Jorvas  02420
908	   Finland

910	   Email: Jouni.Maenpaa@ericsson.com

912	   Gonzalo Camarillo
913	   Ericsson
914	   Hirsalantie 11
915	   Jorvas  02420
916	   Finland

918	   Email: Gonzalo.Camarillo@ericsson.com

920	   Jani Hautakorpi
921	   Ericsson
922	   Hirsalantie 11
923	   Jorvas  02420
924	   Finland

926	   Email: Jani.Hautakorpi@ericsson.com