idnits 2.17.1 

draft-ietf-p2psip-self-tuning-12.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (June 8, 2014) is 3604 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  ** Obsolete normative reference: RFC 5245 (Obsoleted by RFC 8445, RFC 8839)

  ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489)

  == Outdated reference: A later version (-09) exists of
     draft-ietf-p2psip-concepts-05


     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	P2PSIP Working Group                                          J. Maenpaa
3	Internet-Draft                                              G. Camarillo
4	Intended status: Standards Track                                Ericsson
5	Expires: December 10, 2014                                  June 8, 2014

7	   Self-tuning Distributed Hash Table (DHT) for REsource LOcation And
8	                           Discovery (RELOAD)
9	                  draft-ietf-p2psip-self-tuning-12.txt

11	Abstract

13	   REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P)
14	   signaling protocol that provides an overlay network service.  Peers
15	   in a RELOAD overlay network collectively run an overlay algorithm to
16	   organize the overlay, and to store and retrieve data.  This document
17	   describes how the default topology plugin of RELOAD can be extended
18	   to support self-tuning, that is, to adapt to changing operating
19	   conditions such as churn and network size.

21	Status of This Memo

23	   This Internet-Draft is submitted in full conformance with the
24	   provisions of BCP 78 and BCP 79.

26	   Internet-Drafts are working documents of the Internet Engineering
27	   Task Force (IETF).  Note that other groups may also distribute
28	   working documents as Internet-Drafts.  The list of current Internet-
29	   Drafts is at http://datatracker.ietf.org/drafts/current/.

31	   Internet-Drafts are draft documents valid for a maximum of six months
32	   and may be updated, replaced, or obsoleted by other documents at any
33	   time.  It is inappropriate to use Internet-Drafts as reference
34	   material or to cite them other than as "work in progress."

36	   This Internet-Draft will expire on December 10, 2014.

38	Copyright Notice

40	   Copyright (c) 2014 IETF Trust and the persons identified as the
41	   document authors.  All rights reserved.

43	   This document is subject to BCP 78 and the IETF Trust's Legal
44	   Provisions Relating to IETF Documents
45	   (http://trustee.ietf.org/license-info) in effect on the date of
46	   publication of this document.  Please review these documents
47	   carefully, as they describe your rights and restrictions with respect
48	   to this document.  Code Components extracted from this document must
49	   include Simplified BSD License text as described in Section 4.e of
50	   the Trust Legal Provisions and are provided without warranty as
51	   described in the Simplified BSD License.

53	Table of Contents

55	   1.  Introduction
56	   2.  Terminology
57	   3.  Introduction to Stabilization in DHTs
58	     3.1.  Reactive vs. Periodic Stabilization
59	     3.2.  Configuring Periodic Stabilization
60	     3.3.  Adaptive Stabilization
61	   4.  Introduction to Chord
62	   5.  Extending Chord-reload to Support Self-tuning
63	     5.1.  Update Requests
64	     5.2.  Neighbor Stabilization
65	     5.3.  Finger Stabilization
66	     5.4.  Adjusting Finger Table Size
67	     5.5.  Detecting Partitioning
68	     5.6.  Leaving the Overlay
69	   6.  Self-tuning Chord Parameters
70	     6.1.  Estimating Overlay Size
71	     6.2.  Determining Routing Table Size
72	     6.3.  Estimating Failure Rate
73	       6.3.1.  Detecting Failures
74	     6.4.  Estimating Join Rate
75	     6.5.  Estimate Sharing
76	     6.6.  Calculating the Stabilization Interval
77	   7.  Overlay Configuration Document Extension
78	   8.  Security Considerations
79	   9.  IANA Considerations
80	     9.1.  Message Extensions
81	     9.2.  A New IETF XML Registry
82	   10. Acknowledgments
83	   11. References
84	     11.1.  Normative References
85	     11.2.  Informative References
86	   Authors' Addresses

88	1.  Introduction

90	   REsource LOcation And Discovery (RELOAD) [RFC6940] is a peer-to-peer
91	   signaling protocol that can be used to maintain an overlay network,
92	   and to store data in and retrieve data from the overlay.  For
93	   interoperability reasons, RELOAD specifies one overlay algorithm,
94	   called chord-reload, that is mandatory to implement.  This document
95	   extends the chord-reload algorithm by introducing self-tuning
96	   behavior.

98	   Distributed Hash Table (DHT) based overlay networks are self-
99	   organizing, scalable and reliable.  However, these features come at a
100	   cost: peers in the overlay network need to consume network bandwidth
101	   to maintain routing state.  Most DHTs use a periodic stabilization
102	   routine to counter the undesirable effects of churn on routing.  To
103	   configure the parameters of a DHT, some characteristics such as churn
104	   rate and network size need to be known in advance.  These
105	   characteristics are then used to configure the DHT in a static
106	   fashion by using fixed values for parameters such as the size of the
107	   successor set, size of the routing table, and rate of maintenance
108	   messages.  The problem with this approach is that it is not possible
109	   to achieve a low failure rate and a low communication overhead by
110	   using fixed parameters.  Instead, a better approach is to allow the
111	   system to take into account the evolution of network conditions and
112	   adapt to them.  This document extends the mandatory-to-implement
113	   chord-reload algorithm by making it self-tuning.  Two main advantages
114	   of self-tuning are that users no longer need to tune every DHT
115	   parameter correctly for a given operating environment and that the
116	   system adapts to changing operating conditions.

118	   The remainder of this document is structured as follows: Section 2
119	   provides definitions of terms used in this document.  Section 3
120	   discusses alternative approaches to stabilization operations in DHTs,
121	   including reactive stabilization, periodic stabilization, and
122	   adaptive stabilization.  Section 4 gives an introduction to the Chord
123	   DHT algorithm.  Section 5 describes how this document extends the
124	   stabilization routine of the chord-reload algorithm.  Section 6
125	   describes how the stabilization rate and routing table size are
126	   calculated in an adaptive fashion.

128	2.  Terminology

130	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
131	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
132	   "OPTIONAL" in this document are to be interpreted as described in RFC
133	   2119 [RFC2119].

135	   This document uses the terminology and definitions from the Concepts
136	   and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts]
137	   draft.

139	   numBitsInNodeId:  Specifies the number of bits in a RELOAD Node-ID.

141	   DHT:  Distributed Hash Tables (DHTs) are a class of decentralized
142	      distributed systems that provide a lookup service similar to a
143	      regular hash table.  Given a key, any peer participating in the
144	      system can retrieve the value associated with that key.  The
145	      responsibility for maintaining the mapping from keys to values is
146	      distributed among the peers.

148	   Chord Ring:  The Chord DHT uses ring topology and orders identifiers
149	      on an identifier circle of size 2^numBitsInNodeId.  This
150	      identifier circle is called the Chord ring.  On the Chord ring,
151	      the responsibility for a key k is assigned to the node whose
152	      identifier is equal to or immediately follows k.

154	   Finger Table:  A data structure with up to (but typically fewer than)
155	      numBitsInNodeId entries maintained by each peer in a Chord-based
156	      overlay.  The ith entry in the finger table of peer n contains the
157	      identity of the first peer that succeeds n by at least
158	      2^(numBitsInNodeId-i) on the Chord ring.  This peer is called the
159	      ith finger of peer n.  As an example, the first entry in the
160	      finger table of peer n contains a peer half-way around the Chord
161	      ring from peer n.  The purpose of the finger table is to
162	      accelerate lookups.

164	   n.id:  An abbreviation used in this document to refer to the
165	      Node-ID of peer n.

167	   O(g(n)):  Informally, saying that some equation f(n) = O(g(n)) means
168	      that f(n) is less than some constant multiple of g(n).  For the
169	      formal definition, please refer to [weiss1998].

171	   Omega(g(n)):  Informally, saying that some equation f(n) =
172	      Omega(g(n)) means that f(n) is more than some constant multiple of
173	      g(n).  For the formal definition, please refer to [weiss1998].

175	   Percentile:  The Pth (0<=P<=100) percentile of N values arranged in
176	      ascending order is obtained by first calculating the (ordinal)
177	      rank n=(P/100)*N, rounding the result to the nearest integer, and
178	      then taking the value corresponding to that rank.

180	   Predecessor List:  A data structure containing the first r
181	      predecessors of a peer on the Chord ring.

183	   Successor List:  A data structure containing the first r successors
184	      of a peer on the Chord ring.

186	   Neighborhood Set:  A term used to refer to the set of peers included
187	      in the successor and predecessor lists of a given peer.

189	   Routing Table:  Contents of a given peer's routing table include the
190	      set of peers that the peer can use to route overlay messages.  The
191	      routing table is made up of the finger table, successor list and
192	      predecessor list.
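
The Percentile definition above can be sketched in Python as follows.  This is an illustration only, not part of the specification.  Two details are assumptions of this sketch because the definition does not state them: ties are rounded half up, and a computed rank of 0 (possible for small P) is clamped to 1.  The function name `percentile` is hypothetical.

```python
import math


def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) using the nearest-rank rule."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")

    ordered = sorted(values)              # arrange the N values in ascending order
    n = len(ordered)

    # Ordinal rank n = (P/100)*N, rounded to the nearest integer.
    # Assumption: ties are rounded half up.
    rank = math.floor(p / 100 * n + 0.5)

    # Assumption: clamp the rank into the valid 1..N range.
    rank = min(max(rank, 1), n)

    return ordered[rank - 1]              # ranks are 1-based
```

For ten values 1..10, the 50th percentile under this sketch is the value at rank 5.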

194	3.  Introduction to Stabilization in DHTs

196	   DHTs use stabilization routines to counter the undesirable effects of
197	   churn on routing.  The purpose of stabilization is to keep the
198	   routing information of each peer in the overlay consistent with the
199	   constantly changing overlay topology.  There are two alternative
200	   approaches to stabilization: periodic and reactive [rhea2004].
201	   Periodic stabilization can either use a fixed stabilization rate or
202	   calculate the stabilization rate in an adaptive fashion.

204	3.1.  Reactive vs. Periodic Stabilization

206	   In reactive stabilization, a peer reacts to the loss of a peer in its
207	   neighborhood set or to the appearance of a new peer that should be
208	   added to its neighborhood set by sending a copy of its neighbor table
209	   to all peers in the neighborhood set.  Periodic recovery, in
210	   contrast, takes place independently of changes in the neighborhood
211	   set.  In periodic recovery, a peer periodically shares its
212	   neighborhood set with each or a subset of the members of that set.

214	   The chord-reload algorithm [RFC6940] supports both reactive and
215	   periodic stabilization.  It has been shown in [rhea2004] that
216	   reactive stabilization works well for small neighborhood sets (i.e.,
217	   small overlays) and moderate churn.  However, in large-scale (e.g.,
218	   1000 peers or more [rhea2004]) or high-churn overlays, reactive
219	   stabilization runs the risk of creating a positive feedback cycle,
220	   which can eventually result in congestion collapse.  In [rhea2004],
221	   it is shown that a 1000-peer overlay under churn uses significantly
222	   less bandwidth and has lower latencies when periodic stabilization is
223	   used than when reactive stabilization is used.  Although in the
224	   experiments carried out in [rhea2004], reactive stabilization
225	   performed well when there was no churn, its bandwidth use was
226	   observed to jump dramatically under churn.  At higher churn rates and
227	   in larger-scale overlays, periodic stabilization uses less bandwidth and
228	   the resulting lower contention for the network leads to lower
229	   latencies.  For this reason, most DHTs such as CAN [CAN], Chord
230	   [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic
231	   stabilization [ghinita2006].  As an example, the first version of
232	   Bamboo used reactive stabilization, which caused Bamboo to suffer
233	   from degradation in performance under churn.  To fix this problem,
234	   Bamboo was modified to use periodic stabilization.

236	   In Chord, periodic stabilization is typically done both for
237	   successors and fingers.  An alternative strategy is analyzed in
238	   [krishnamurthy2008].  In this strategy, called the correction-on-
239	   change maintenance strategy, a peer periodically stabilizes its
240	   successors but does not do so for its fingers.  Instead, finger
241	   pointers are stabilized in a reactive fashion.  The results obtained
242	   in [krishnamurthy2008] imply that although the correction-on-change
243	   strategy works well when churn is low, periodic stabilization
244	   outperforms the correction-on-change strategy when churn is high.

246	3.2.  Configuring Periodic Stabilization

248	   When periodic stabilization is used, one faces the problem of
249	   selecting an appropriate execution rate for the stabilization
250	   procedure.  If the execution rate of periodic stabilization is high,
251	   changes in the system can be quickly detected, but at the
252	   disadvantage of increased communication overhead.  Alternatively, if
253	   the stabilization rate is low and the churn rate is high, routing
254	   tables become inaccurate and DHT performance deteriorates.  Thus, the
255	   problem is setting the parameters so that the overlay achieves the
256	   desired reliability and performance even in challenging conditions,
257	   such as under heavy churn.  This naturally results in high cost
258	   during periods when the churn level is lower than expected, or
259	   alternatively, poor performance or even network partitioning in worse
260	   than expected conditions.

262	   In addition to selecting an appropriate stabilization interval,
263	   regardless of whether periodic stabilization is used or not, an
264	   appropriate size needs to be selected for the neighborhood set and
265	   for the finger table.

267	   The current approach is to configure overlays statically.  This works
268	   in situations where perfect information about the future is
269	   available.  In situations where the operating conditions of the
270	   network are known in advance and remain static throughout the
271	   lifetime of the system, it is possible to choose fixed optimal values
272	   for parameters such as stabilization rate, neighborhood set size and
273	   routing table size.  However, if the operating conditions (e.g., the
274	   size of the overlay and its churn rate) do not remain static but
275	   evolve with time, it is not possible to achieve both a low lookup
276	   failure rate and a low communication overhead by using fixed
277	   parameters [ghinita2006].

279	   As an example, to configure the Chord DHT algorithm, one needs to
280	   select values for the following parameters: size of successor list,
281	   stabilization interval, and size of the finger table.  To select an
282	   appropriate value for the stabilization interval, one needs to know
283	   the expected churn rate and overlay size.  According to
284	   [liben-nowell2002], a Chord network in a ring-like state remains in a
285	   ring-like state as long as peers send Omega(square(log(N))) messages
286	   before N new peers join or N/2 peers fail.  Thus, in a 500-peer
287	   overlay churning at a rate such that one peer joins and one peer
288	   leaves the network every 30 seconds, an appropriate stabilization
289	   interval would be on the order of 93s.  According to [Chord], the
290	   size of the successor list and finger table should be on the order of
291	   log(N).  Even a successor list of modest size (e.g., log2(N) or
292	   2*log2(N), which is the successor list size used in [Chord]) makes it
293	   very unlikely that a peer will lose all of its successors, which
294	   would cause the Chord ring to become disconnected.  Thus, in a
295	   500-peer network each peer should maintain on the order of nine
296	   successors and fingers.  However, if the churn rate doubles and the
297	   network size remains unchanged, the stabilization rate should double
298	   as well.  That is, the appropriate maintenance interval would now be
299	   on the order of 46s.  On the other hand, if the churn rate increases
300	   six-fold, for example, and the network grows to 2000 peers, on the
301	   order of eleven fingers and successors should be maintained and the
302	   stabilization interval should be on the order of 42s.  If one
303	   continued using the old values, this could result in inaccurate
304	   routing tables, network partitioning, and deteriorating performance.
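
The figures in the example above can be reproduced approximately with the following Python sketch.  The formula is an assumption of this sketch, since the text gives only the results: it takes the time until N peers join or N/2 peers fail, whichever comes first, and divides it by (log2(N))^2 stabilization rounds.  The function names are hypothetical.

```python
import math


def stabilization_interval(n, seconds_per_join, seconds_per_failure):
    """Approximate stabilization interval in seconds for an n-peer overlay.

    Assumption: peers should complete (log2(n))^2 stabilization rounds
    before n new peers join or n/2 peers fail, whichever happens first.
    """
    time_to_n_joins = n * seconds_per_join
    time_to_half_failures = (n / 2) * seconds_per_failure
    window = min(time_to_n_joins, time_to_half_failures)
    return window / math.log2(n) ** 2


def table_size(n):
    """Order-of-magnitude successor list and finger table size, log2(n)."""
    return round(math.log2(n))


base = stabilization_interval(500, 30, 30)       # one join and one leave per 30 s
doubled = stabilization_interval(500, 15, 15)    # churn rate doubled
grown = stabilization_interval(2000, 5, 5)       # six-fold churn, 2000 peers
```

Under these assumptions the three intervals come out at about 93.3, 46.7, and 41.6 seconds, in line with the orders of magnitude quoted above, and `table_size` gives nine entries for 500 peers and eleven for 2000 peers.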

306	3.3.  Adaptive Stabilization

308	   A self-tuning DHT takes into consideration the continuous evolution
309	   of network conditions and adapts to them.  In a self-tuning DHT, each
310	   peer collects statistical data about the network and dynamically
311	   adjusts its stabilization rate, neighborhood set size, and finger
312	   table size based on the analysis of the data [ghinita2006].
313	   Reference [mahajan2003] shows that by using self-tuning, it is
314	   possible to achieve high reliability and performance even in adverse
315	   conditions with low maintenance cost.  Adaptive stabilization has
316	   been shown to outperform periodic stabilization in terms of both
317	   lookup failures and communication overhead [ghinita2006].

319	4.  Introduction to Chord

321	   Chord [Chord] is a structured P2P algorithm that uses consistent
322	   hashing to build a DHT out of several independent peers.  Consistent
323	   hashing assigns each peer and resource a fixed-length identifier.
324	   Peers use SHA-1 as the base hash function to generate the identifiers.
325	   As specified in RELOAD base, the length of the identifiers is
326	   numBitsInNodeId=128 bits.  The identifiers are ordered on an
327	   identifier circle of size 2^numBitsInNodeId.  On the identifier
328	   circle, key k is assigned to the first peer whose identifier equals
329	   or follows the identifier of k in the identifier space.  The
330	   identifier circle is called the Chord ring.
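
The key assignment rule above can be sketched in Python as follows.  This is an illustration only; the function name `responsible_peer` is hypothetical, and identifiers are modeled as plain integers on the identifier circle.

```python
import bisect


def responsible_peer(peer_ids, key):
    """Return the first peer whose identifier equals or follows key.

    peer_ids is a non-empty collection of integer identifiers.  If no
    identifier is greater than or equal to key, the search wraps around
    the ring to the smallest identifier.
    """
    ordered = sorted(peer_ids)
    index = bisect.bisect_left(ordered, key)   # first id >= key
    return ordered[index % len(ordered)]       # wrap around the ring
```

With peers 10, 80, and 200, a key of 81 is assigned to peer 200, and a key of 201 wraps around to peer 10.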

332	   Different DHTs differ significantly in performance when bandwidth is
333	   limited.  It has been shown that when compared to other DHTs, the
334	   advantages of Chord include that it uses bandwidth efficiently and
335	   can achieve low lookup latencies at little cost [li2004].

337	   A simple lookup mechanism could be implemented on a Chord ring by
338	   requiring each peer to only know how to contact its current successor
339	   on the identifier circle.  Queries for a given identifier could then
340	   be passed around the circle via the successor pointers until they
341	   encounter the first peer whose identifier is equal to or larger than
342	   the desired identifier.  Such a lookup scheme uses a number of
343	   messages that grows linearly with the number of peers.  To reduce the
344	   cost of lookups, Chord also maintains additional routing information;
345	   each peer n maintains a data structure with up to numBitsInNodeId
346	   entries, called the finger table.  The first entry in the finger
347	   table of peer n contains the peer half-way around the ring from peer
348	   n.  The second entry contains the peer that is 1/4th of the way
349	   around, the third entry the peer that is 1/8th of the way around,
350	   etc.  In other words, the ith entry in the finger table at peer n
351	   contains the identity of the first peer s that succeeds n by at least
352	   2^(numBitsInNodeId-i) on the Chord ring.  This peer is called the ith
353	   finger of peer n.  The interval between two consecutive fingers is
354	   called a finger interval.  The ith finger interval of peer n covers
355	   the range [n.id + 2^(numBitsInNodeId-i), n.id + 2^(numBitsInNodeId-
356	   i+1)) on the Chord ring.  In an N-peer network, each peer maintains
357	   information about O(log(N)) other peers in its finger table.  As an
358	   example, if N=100000, it is sufficient to maintain 17 fingers.
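
The finger intervals described above can be sketched in Python as follows.  This is an illustration only.  The function names are hypothetical, and using ceil(log2(N)) as the number of fingers is an assumption of this sketch that is consistent with the N=100000 example, not a rule stated in the text.

```python
import math


def finger_interval(n_id, i, bits=128):
    """Return (start, end) of the ith finger interval of peer n_id.

    The interval is [n.id + 2^(bits-i), n.id + 2^(bits-i+1)), taken
    modulo 2^bits so that it wraps around the Chord ring.
    """
    if not 1 <= i <= bits:
        raise ValueError("i must be between 1 and bits")
    modulus = 1 << bits
    start = (n_id + (1 << (bits - i))) % modulus
    end = (n_id + (1 << (bits - i + 1))) % modulus
    return start, end


def fingers_needed(n):
    """Assumed number of fingers for an n-peer overlay: ceil(log2(n))."""
    return math.ceil(math.log2(n))
```

On a small 8-bit ring, the first finger interval of peer 10 starts at 138 (half-way around) and the second at 74 (a quarter of the way around).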

360	   Chord needs all peers' successor pointers to be up to date in order
361	   to ensure that lookups produce correct results as the set of
362	   participating peers changes.  To achieve this, peers run a
363	   stabilization protocol periodically in the background.  The
364	   stabilization protocol of the original Chord algorithm uses two
365	   operations: successor stabilization and finger stabilization.
366	   However, the Chord algorithm of RELOAD base defines two additional
367	   stabilization components, as will be discussed below.

369	   To increase robustness in the event of peer failures, each Chord peer
370	   maintains a successor list of size r, containing the peer's first r
371	   successors.  The benefit of successor lists is that if each peer
372	   fails independently with probability p, the probability that all r
373	   successors fail simultaneously is only p^r.
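
The p^r relationship above can be explored with a short Python sketch.  This is an illustration only; the function names and the idea of choosing r from a target probability are assumptions of this sketch, not part of the specification.

```python
def all_successors_fail_probability(p, r):
    """Probability that all r successors fail, assuming independent failures."""
    return p ** r


def min_successor_list_size(p, target):
    """Smallest r for which p^r does not exceed the target probability."""
    if not 0 < p < 1 or not 0 < target < 1:
        raise ValueError("p and target must be strictly between 0 and 1")
    r = 1
    while p ** r > target:
        r += 1
    return r
```

For instance, if each peer fails independently with probability 0.5, a successor list of size 3 is lost entirely with probability 0.125.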

375	   The original Chord algorithm maintains only a single predecessor
376	   pointer.  However, multiple predecessor pointers (i.e., a predecessor
377	   list) can be maintained to speed up recovery from predecessor
378	   failures.  The routing table of a peer consists of the successor
379	   list, finger table, and predecessor list.

381	5.  Extending Chord-reload to Support Self-tuning

383	   This section describes how the mandatory-to-implement chord-reload
384	   algorithm defined in RELOAD base [RFC6940] can be extended to support
385	   self-tuning.

387	   The chord-reload algorithm supports both reactive and periodic
388	   recovery strategies.  When the self-tuning mechanisms defined in this
389	   document are used, the periodic recovery strategy MUST be used.
390	   Further, chord-reload specifies that at least three predecessors and
391	   three successors need to be maintained.  When the self-tuning
392	   mechanisms are used, the appropriate sizes of the successor list and
393	   predecessor list are determined in an adaptive fashion based on the
394	   estimated network size, as will be described in Section 6.

396	   As specified in RELOAD base, each peer MUST maintain a stabilization
397	   timer.  When the stabilization timer fires, the peer MUST restart the
398	   timer and carry out the overlay stabilization routine.  Overlay
399	   stabilization has four components in chord-reload:

401	   1.  Updating the neighbor table.  We refer to this as neighbor
402	       stabilization.

404	   2.  Refreshing the finger table.  We refer to this as finger
405	       stabilization.

407	   3.  Adjusting finger table size.

409	   4.  Detecting partitioning.  We refer to this as strong
410	       stabilization.

412	   As specified in RELOAD base [RFC6940], a peer sends periodic messages
413	   as part of the neighbor stabilization, finger stabilization, and
414	   strong stabilization routines.  In neighbor stabilization, a peer
415	   periodically sends an Update request to every peer in its Connection
416	   Table.  The default time is every ten minutes.  In finger
417	   stabilization, a peer periodically searches for new peers to include
418	   in its finger table.  This time defaults to one hour.  This document
419	   specifies how the neighbor stabilization and finger stabilization
420	   intervals can be determined in an adaptive fashion based on the
421	   operating conditions of the overlay.  The subsections below describe
422	   how this document extends the four components of stabilization.

424	5.1.  Update Requests

426	   As described in RELOAD base [RFC6940], the neighbor and finger
427	   stabilization procedures are implemented using Update requests.
428	   RELOAD base defines three types of Update requests: 'peer_ready',
429	   'neighbors', and 'full'.  Regardless of the type, all Update requests
430	   include an 'uptime' field.  Since the self-tuning extensions require
431	   information on the uptimes of peers in the routing table, the sender
432	   of an Update request MUST include its current uptime in seconds in
433	   the 'uptime' field.

435	   When self-tuning is used, each peer independently decides the
436	   appropriate size for the successor list, predecessor list and finger
437	   table.  Thus, the 'predecessors', 'successors', and 'fingers' fields
438	   included in RELOAD Update requests are of variable length.  As
439	   specified in RELOAD [RFC6940], variable-length fields are preceded
440	   on the wire by length bytes.  In the case of the successor list,
441	   predecessor list, and finger table, there are two length bytes
442	   (allowing lengths up to 2^16-1).  The number of NodeId structures
443	   included in each field can be calculated based on the length bytes
444	   since the size of a single NodeId structure is 16 bytes.  If a peer
445	   receives more entries than fit into its successor list, predecessor
446	   list or finger table, the peer MUST ignore the extra entries.  If a
447	   peer receives fewer entries than it currently has in its own data
448	   structure, the peer MUST NOT drop the extra entries from its data
449	   structure.
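
The length handling described above can be sketched in Python as follows.  This is an illustration only, not a RELOAD message parser: it assumes a buffer that holds a two-byte length in network byte order followed by 16-byte NodeId structures, and the function names are hypothetical.

```python
import struct

NODE_ID_LEN = 16  # size of a single NodeId structure in bytes


def parse_node_id_list(buf, offset=0):
    """Parse one variable-length NodeId list starting at offset.

    Returns (node_ids, next_offset).  The number of entries is derived
    from the two length bytes that precede the list.
    """
    (length,) = struct.unpack_from("!H", buf, offset)
    start = offset + 2
    end = start + length
    if length % NODE_ID_LEN != 0:
        raise ValueError("length is not a multiple of the NodeId size")
    if end > len(buf):
        raise ValueError("buffer is shorter than the declared length")
    node_ids = [buf[i:i + NODE_ID_LEN] for i in range(start, end, NODE_ID_LEN)]
    return node_ids, end


def cap_entries(received, capacity):
    """Ignore entries beyond what fits into the local data structure."""
    return received[:capacity]
```

A receiver that keeps at most one successor would, under this sketch, parse both entries of a two-entry list and then ignore the second one.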

451	5.2.  Neighbor Stabilization

453	   In the neighbor stabilization operation of chord-reload, a peer
454	   periodically sends an Update request to every peer in its Connection
455	   Table.  In a small, low-churn overlay, the amount of traffic this
456	   process generates is typically acceptable.  However, in a large-scale
457	   overlay churning at a moderate or high churn rate, the traffic load
458	   may no longer be acceptable since the size of the connection table is
459	   large and the stabilization interval relatively short.  The self-
460	   tuning mechanisms described in this document are especially designed
461	   for overlays of the latter type.  Therefore, when the self-tuning
462	   mechanisms are used, each peer MUST send a periodic Update request
463	   only to its first predecessor and first successor on the Chord ring.

465	   The neighbor stabilization routine MUST be executed when the
466	   stabilization timer fires.  To begin the neighbor stabilization
467	   routine, a peer MUST send an Update request to its first successor
468	   and its first predecessor.  The type of the Update request MUST be
469	   'neighbors'.  The Update request MUST include the successor and
470	   predecessor lists of the sender.  If a peer receiving such an Update
471	   request learns from the predecessor and successor lists included in
472	   the request that new peers can be included in its neighborhood set,
473	   it MUST send Attach requests to the new peers.

475	   After a new peer has been added to the predecessor or successor list,
476	   an Update request of type 'peer_ready' MUST be sent to the new peer.
477	   This allows the new peer to insert the sender into its neighborhood
478	   set.
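
One possible way for a receiver to decide which peers from a received Update request qualify for Attach requests is sketched below in Python.  This is an illustration only.  The criterion used here is an assumption of this sketch: a new peer "can be included" if it would rank among the r closest successors or the r closest predecessors.  The function names are hypothetical, and Node-IDs are modeled as integers.

```python
def clockwise_distance(a, b, bits=128):
    """Distance from a to b when moving clockwise on the Chord ring."""
    return (b - a) % (1 << bits)


def attach_candidates(self_id, successors, predecessors, received, r, bits=128):
    """Return the received peers that would enter the neighborhood set.

    successors and predecessors are the local lists, received holds the
    Node-IDs learned from the Update request, and r is the list size.
    """
    known = set(successors) | set(predecessors) | {self_id}
    pool = (set(received) | known) - {self_id}

    # The r closest peers in each direction, considering old and new peers.
    best_successors = sorted(
        pool, key=lambda p: clockwise_distance(self_id, p, bits))[:r]
    best_predecessors = sorted(
        pool, key=lambda p: clockwise_distance(p, self_id, bits))[:r]

    # Only peers that are not yet known need an Attach request.
    chosen = set(best_successors) | set(best_predecessors)
    return sorted(p for p in chosen if p not in known)
```

On an 8-bit ring, a peer with ID 100, successors 110 and 130, and predecessors 90 and 70 that learns of peers 105, 120, and 200 would select only peer 105 under this sketch.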

480	5.3.  Finger Stabilization

482	   Chord-reload specifies two alternative methods for finding new peers
483	   to add to the finger table.  Either alternative can be used with
484	   the self-tuning extensions defined in this document.

486	   Immediately after a new peer has been added to the finger table, a
487	   Probe request MUST be sent to the new peer to fetch its uptime.  The
488	   requested_info field of the Probe request MUST be set to contain the
489	   ProbeInformationType 'uptime' defined in RELOAD base [RFC6940].

491	5.4.  Adjusting Finger Table Size

493	   The chord-reload algorithm defines how a peer can make sure that the
494	   finger table is appropriately sized to allow for efficient routing.
495	   Since the self-tuning mechanisms specified in this document produce a
496	   network size estimate, this estimate can be directly used to
497	   calculate the optimal size for the finger table.  This mechanism MUST
498	   be used instead of the one specified by chord-reload.  A peer MUST
499	   use the network size estimate to determine whether it needs to adjust
500	   the size of its finger table each time the stabilization timer
501	   fires.  The way this is done is explained in Section 6.2.

503	5.5.  Detecting Partitioning

505	   This document does not require any changes to the mechanism chord-
506	   reload uses to detect network partitioning.

508	5.6.  Leaving the Overlay

510	   As specified in RELOAD base [RFC6940], a leaving peer SHOULD send a
511	   Leave request to all members of its neighbor table prior to leaving
512	   the overlay.  The overlay_specific_data field MUST contain the
513	   ChordLeaveData structure.  The Leave requests that are sent to
514	   successors MUST contain the predecessor list of the leaving peer.
515	   The Leave requests that are sent to the predecessors MUST contain the
516	   successor list of the leaving peer.  If a given successor can
517	   identify better predecessors than are already included in its
518	   predecessor list by investigating the predecessor list it receives
519	   from the leaving peer, it MUST send Attach requests to them.
520	   Similarly, if a given predecessor identifies better successors by
521	   investigating the successor list it receives from the leaving peer,
522	   it MUST send Attach requests to them.

524	6.  Self-tuning Chord Parameters

526	   This section specifies how to determine an appropriate stabilization
527	   rate and routing table size in an adaptive fashion.  The proposed
528	   mechanism is based on [mahajan2003], [liben-nowell2002], and
529	   [ghinita2006].  To calculate an appropriate stabilization rate, the
530	   values of three parameters must be estimated: overlay size N, failure
531	   rate U, and join rate L.  To calculate an appropriate routing table
532	   size, the estimated network size N can be used.  Peers in the overlay
533	   MUST re-calculate the values of the parameters to self-tune the
534	   chord-reload algorithm at the end of each stabilization period before
535	   re-starting the stabilization timer.

537	6.1.  Estimating Overlay Size

539	   Techniques for estimating the size of an overlay network have been
540	   proposed for instance in [mahajan2003], [horowitz2003],
541	   [kostoulas2005], [binzenhofer2006], and [ghinita2006].  In Chord, the
542	   density of peer identifiers in the neighborhood set can be used to
543	   produce an estimate of the size of the overlay, N [mahajan2003].
544	   Since peer identifiers are picked randomly with uniform probability
545	   from the numBitsInNodeId-bit identifier space, the average distance
546	   between peer identifiers in the successor set is
547	   (2^numBitsInNodeId)/N.

549	   To estimate the overlay network size, a peer MUST compute the average
550	   inter-peer distance d between the successive peers starting from the
551	   most distant predecessor and ending at the most distant successor in
552	   the successor list.  The estimated network size MUST be calculated
553	   as:

555	                         2^numBitsInNodeId
556	                    N = -------------------
557	                                d

559	   This estimate has been found to be accurate within 15% of the real
560	   network size [ghinita2006].  Of course, the size of the neighborhood
561	   set affects the accuracy of the estimate.
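
   The calculation above can be sketched as follows.  This is a
   non-normative illustration; the function name and the assumption
   that the caller passes the identifiers in ring order (most distant
   predecessor first, then the local peer, then the successors) are not
   part of this specification.

```python
def estimate_network_size(ring_ordered_ids, num_bits_in_node_id):
    """Estimate the overlay size N from the density of neighbor identifiers.

    ring_ordered_ids: identifiers in clockwise ring order, from the most
    distant predecessor to the most distant successor (assumed to include
    the local peer's own identifier in between).
    """
    if len(ring_ordered_ids) < 2:
        raise ValueError("at least two identifiers are needed")
    ring_size = 1 << num_bits_in_node_id
    # Sum the clockwise gaps between successive peers; the modulo handles
    # wrap-around past the top of the identifier space.
    span = 0
    for current, following in zip(ring_ordered_ids, ring_ordered_ids[1:]):
        span += (following - current) % ring_size
    if span == 0:
        raise ValueError("identifiers must be distinct")
    d = span / (len(ring_ordered_ids) - 1)  # average inter-peer distance
    return ring_size / d
```

   For instance, with an 8-bit identifier space and the identifiers
   250, 2, 10, 18, and 26, every gap is 8, so the estimate is 256/8 =
   32 peers.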

563	   During the join process, a joining peer fills its routing table by
564	   sending a series of Ping and Attach requests, as specified in RELOAD
565	   base [RFC6940].  Thus, a joining peer immediately has enough
566	   information at its disposal to calculate an estimate of the network
567	   size.

569	6.2.  Determining Routing Table Size

571	   As specified in RELOAD base, the finger table must contain at least
572	   16 entries.  When the self-tuning mechanisms are used, the size of
573	   the finger table MUST be set to max(ceiling(log2(N)), 16) using the
574	   estimated network size N.

576	   The size of the successor list MUST be set to ceiling(log2(N)).  An
577	   implementation MAY place a lower limit on the size of the successor
578	   list.  As an example, the implementation might require the size of
579	   the successor list to be always at least three.

581	   A peer MAY choose to maintain a fixed-size predecessor list with only
582	   three entries as specified in RELOAD base.  However, it is
583	   RECOMMENDED that a peer maintain ceiling(log2(N)) predecessors.
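
   A minimal, non-normative sketch of the sizing rules in this section
   is given below.  The function names are illustrative, and the lower
   limit of three for the successor list is only the example limit
   mentioned above, not a requirement.

```python
import math

MIN_FINGER_TABLE_SIZE = 16  # minimum finger table size stated above


def finger_table_size(n_estimate):
    """Return max(ceiling(log2(N)), 16) for the estimated size N."""
    if n_estimate < 1:
        raise ValueError("network size estimate must be at least 1")
    return max(math.ceil(math.log2(n_estimate)), MIN_FINGER_TABLE_SIZE)


def successor_list_size(n_estimate, lower_limit=3):
    """Return ceiling(log2(N)), bounded below by an optional local limit."""
    if n_estimate < 1:
        raise ValueError("network size estimate must be at least 1")
    return max(math.ceil(math.log2(n_estimate)), lower_limit)
```

   With N = 1000, the finger table keeps its minimum of 16 entries and
   the successor list holds 10 entries; the finger table only grows
   beyond 16 entries once N exceeds 2^16.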

585	6.3.  Estimating Failure Rate

587	   A typical approach is to assume that peers join the overlay according
588	   to a Poisson process with rate L and leave according to a Poisson
589	   process with rate parameter U [mahajan2003].  The value of U can be
590	   estimated using peer failures in the finger table and neighborhood
591	   set [mahajan2003].  If peers fail with rate U, a peer with M unique
592	   peer identifiers in its routing table should observe K failures in
593	   time K/(M*U).  Every peer in the overlay MUST maintain a history of
594	   the last K failures.  The current time MUST be inserted into the
595	   history when the peer joins the overlay.  The estimate of U MUST be
596	   calculated as:

598	                             k
599	                     U = --------,
600	                          M * Tk

602	   where M is the number of unique peer identifiers in the routing
603	   table, Tk is the time between the first and the last failure in the
604	   history, and k is the number of failures in the history.  If k is
605	   smaller than K, the estimate MUST be computed as if there were a
606	   failure at the current time.  It has been shown that an estimate
607	   calculated in a similar manner is accurate within 17% of the real
608	   value of U [ghinita2006].
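
   The following non-normative sketch shows one possible reading of the
   procedure above.  The class and method names are illustrative.  It
   assumes that timestamps are in seconds, that the join timestamp
   counts as an entry in the history, and that a history shorter than K
   is padded with one virtual failure at the current time; other
   readings of the text are possible.

```python
from collections import deque


class FailureHistory:
    """Bounded history of the last K failure timestamps (in seconds)."""

    def __init__(self, capacity, join_time):
        # The join time is inserted into the history when the peer joins.
        self._times = deque([join_time], maxlen=capacity)

    def record_failure(self, when):
        # Once K entries are stored, the oldest one is dropped.
        self._times.append(when)

    def estimate_failure_rate(self, unique_peers, now):
        """Return U = k / (M * Tk), or None if it cannot be computed yet."""
        times = list(self._times)
        if len(times) < self._times.maxlen:
            # Fewer than K entries: act as if a failure occurred now.
            times.append(now)
        k = len(times)
        tk = times[-1] - times[0]
        if tk <= 0 or unique_peers <= 0:
            return None
        return k / (unique_peers * tk)
```

   The result is a per-peer failure rate in failures per second, which
   is the unit assumed for U in the rest of these sketches.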

610	   The size of the failure history K affects the accuracy of the
611	   estimate of U.  One can increase the accuracy by increasing K.
612	   However, this has the side effect of decreasing responsiveness to
613	   changes in the failure rate.  On the other hand, a small history size
614	   may cause a peer to overreact each time a new failure occurs.  In
615	   [ghinita2006], K is set to 25% of the routing table size.  Use of
616	   this approach is RECOMMENDED.

618	6.3.1.  Detecting Failures

620	   A new failure MUST be inserted into the failure history in the
621	   following cases:

623	   1.  A Leave request is received from a neighbor.

625	   2.  A peer fails to reply to a Ping request sent in the situation
626	       explained below.  If no packets have been received on a
627	       connection during the past 2*Tr seconds (where Tr is the
628	       inactivity timer defined by ICE [RFC5245]), a RELOAD Ping request
629	       MUST be sent to the remote peer.  RELOAD mandates the use of STUN
630	       [RFC5389] for keepalives.  STUN keepalives take the form of STUN
631	       Binding Indication transactions.  As specified in ICE [RFC5245],
632	       a peer sends a STUN Binding Indication if there has been no
633	       packet sent on a connection for Tr seconds.  Tr is configurable
634	       and has a default of 15 seconds.  Although STUN Binding
635	       Indications do not generate a response, the fact that a peer has
636	       failed can be learned from the lack of packets (Binding
637	       Indications or application protocol packets) received from the
638	       peer.  If the remote peer fails to reply to the Ping request, the
639	       sender MUST consider the remote peer to have failed.

641	   As an alternative to relying on STUN keepalives to detect peer
642	   failure, a peer could send additional, frequent RELOAD messages to
643	   every peer in its Connection Table.  These messages could be Update
644	   requests, in which case they would serve two purposes: detecting peer
645	   failure and stabilization.  However, as the cost of this approach can
646	   be very high in terms of bandwidth consumption and traffic load,
647	   especially in large-scale overlays experiencing churn, its use is NOT
648	   RECOMMENDED.

650	6.4.  Estimating Join Rate

652	   Reference [ghinita2006] proposes that a peer can estimate the join
653	   rate based on the uptime of the peers in its routing table.  An
654	   increase in peer join rate will be reflected by a decrease in the
655	   average age of peers in the routing table.  Thus, each peer MUST
656	   maintain an array of the ages of the peers in its routing table
657	   sorted in increasing order.  Using this information, an estimate of
658	   the global peer join rate L MUST be calculated as:

660	                                  N
661	                    L = ----------------------,
662	                         Ages[floor(rsize/2)]

664	   where Ages is an array containing the ages of the peers in the
665	   routing table sorted in increasing order and rsize is the size of the
666	   routing table.  It has been shown that the estimate obtained by using
667	   this method is accurate within 22% of the real join rate
668	   [ghinita2006].  Of course, the size of the routing table affects the
669	   accuracy.
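
   As a non-normative illustration, the formula above can be written
   as shown below.  The function name is illustrative, and the uptimes
   are assumed to be expressed in seconds, which yields a join rate in
   peers per second.

```python
def estimate_join_rate(n_estimate, uptimes):
    """Return L = N / Ages[floor(rsize/2)] for the given peer uptimes."""
    ages = sorted(uptimes)  # ages in increasing order
    if not ages:
        raise ValueError("routing table is empty")
    median_age = ages[len(ages) // 2]  # Ages[floor(rsize/2)]
    if median_age <= 0:
        raise ValueError("median age must be positive")
    return n_estimate / median_age
```

   For example, with N = 1000 and uptimes of 100, 200, 300, 400, and
   500 seconds, the selected age is 300 seconds and the estimated join
   rate is about 3.33 peers per second.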

671	   In order for this mechanism to work, peers need to exchange
672	   information about the time they have been present in the overlay.
673	   Peers receive the uptimes of their successors and predecessors during
674	   the stabilization operations since all Update requests carry uptime
675	   values.  A joining peer learns the uptime of the admitting peer since
676	   it receives an Update from the admitting peer during the join
677	   procedure.  Peers learn the uptimes of new fingers since they can
678	   fetch the uptime using a Probe request after having attached to the
679	   new finger.

681	6.5.  Estimate Sharing

683	   To improve the accuracy of network size, join rate, and leave rate
684	   estimates, peers MUST share their estimates.  When the stabilization
685	   timer fires, a peer MUST select number-of-peers-to-probe random peers
686	   from its finger table and send each of them a Probe request.  The
687	   targets of Probe requests are selected from the finger table rather
688	   than from the neighbor table since neighbors are likely to make
689	   similar errors when calculating their estimates.  The new overlay
690	   configuration element number-of-peers-to-probe is defined in
691	   Section 7.  Both the Probe request and the answer returned
692	   by the target peer MUST contain a new message extension whose
693	   MessageExtensionType is 'self_tuning_data'.  This extension type is
694	   defined in Section 9.1.  The extension_contents field of the
695	   MessageExtension structure MUST contain a SelfTuningData structure:

697	               struct {
698	                 uint32                   network_size;
699	                 uint32                   join_rate;
700	                 uint32                   leave_rate;
701	               } SelfTuningData;

703	   The contents of the SelfTuningData structure are as follows:

705	   network_size

707	      The latest network size estimate calculated by the sender.

709	   join_rate

711	      The latest join rate estimate calculated by the sender.

713	   leave_rate

715	      The latest leave rate estimate calculated by the sender.

717	   The join and leave rates are expressed as joins or failures per 24
718	   hours.  As an example, if the global join rate estimate a peer has
719	   calculated is 0.123 peers/s, it would include in the join_rate field
720	   the ceiling of the value 10627.2 (24*60*60*0.123 = 10627.2), that is,
721	   the value 10628.
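
   A non-normative sketch of how the SelfTuningData structure could be
   serialized is shown below.  It assumes that the three uint32 fields
   are encoded in network byte order, as is the convention for RELOAD
   structures, and that the network size estimate is rounded to the
   nearest integer; the rounding rule and the function names are
   assumptions of this sketch.  Only the body of the structure is
   produced, not the enclosing MessageExtension.

```python
import math
import struct

SECONDS_PER_DAY = 24 * 60 * 60


def rate_per_day(rate_per_second):
    """Convert a per-second rate to the ceiling of the per-24-hour rate."""
    return math.ceil(rate_per_second * SECONDS_PER_DAY)


def encode_self_tuning_data(network_size, join_rate_per_s, leave_rate_per_s):
    """Pack network_size, join_rate, leave_rate as three big-endian uint32s."""
    return struct.pack(
        "!III",
        int(round(network_size)),  # rounding is an assumption of this sketch
        rate_per_day(join_rate_per_s),
        rate_per_day(leave_rate_per_s),
    )


def decode_self_tuning_data(data):
    """Return (network_size, join_rate, leave_rate) with rates per 24 hours."""
    return struct.unpack("!III", data)
```

   With the example above, a join rate of 0.123 peers/s is carried in
   the join_rate field as 10628.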

723	   The 'type' field of the MessageExtension structure MUST be set to
724	   contain the value 'self_tuning_data'.  The 'critical' field of the
725	   structure MUST be set to False.

727	   A peer MUST store all estimates it receives in Probe requests and
728	   answers during a stabilization interval.  When the stabilization
729	   timer fires, the peer MUST calculate the estimates to be used during
730	   the next stabilization interval by taking the 75th percentile (i.e.,
731	   third quartile) of a data set containing its own estimate and the
732	   received estimates.
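
   The text above does not prescribe a particular method for computing
   the 75th percentile.  The following non-normative sketch uses the
   nearest-rank method, which is an assumption of this example; the
   function name is illustrative.

```python
import math


def third_quartile(estimates):
    """Return the 75th percentile of the estimates (nearest-rank method)."""
    ordered = sorted(estimates)
    if not ordered:
        raise ValueError("at least one estimate is required")
    rank = math.ceil(0.75 * len(ordered))  # 1-based rank
    return ordered[rank - 1]
```

   With nine estimates, the nearest-rank method selects the seventh
   smallest value, so a single exceptionally high or low estimate does
   not change the result.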

734	   The default value for number-of-peers-to-probe is 4.  This default
735	   value is recommended to allow a peer to receive a sufficiently large
736	   set of estimates from other peers.  With a value of 4, a peer
737	   receives four estimates in Probe answers.  On average, each peer
738	   also receives four Probe requests, each carrying an estimate.  Thus,
739	   on average, each peer has nine estimates (including its own) that
740	   it can use at the end of the stabilization interval.  A value smaller
741	   than 4 is NOT RECOMMENDED, because it may leave peers with too few
742	   estimates.  As an example, if the value were 2, there would be
743	   peers in the overlay that would only receive two estimates during a
744	   stabilization interval.  Such peers would only have three estimates
745	   available at the end of the interval, which may not be reliable
746	   enough since even a single exceptionally high or low estimate can
747	   have a large impact.

749	6.6.  Calculating the Stabilization Interval

751	   According to [liben-nowell2002], a Chord network in a ring-like state
752	   remains in a ring-like state as long as peers send
753	   Omega(square(log(N))) messages before N new peers join or N/2 peers
754	   fail.  We can use the estimate of peer failure rate, U, to calculate
755	   the time Tf in which N/2 peers fail:

757	                                  1
758	                           Tf = ------
759	                                 2*U

761	   Based on this estimate, a stabilization interval Tstab-1 MUST be
762	   calculated as:

764	                                           Tf
765	                           Tstab-1 = -----------------
766	                                      square(log2(N))

768	   On the other hand, the estimated join rate L can be used to calculate
769	   the time in which N new peers join the overlay.  Based on the
770	   estimate of L, a stabilization interval Tstab-2 MUST be calculated
771	   as:

773	                                               N
774	                            Tstab-2 = ---------------------
775	                                       L * square(log2(N))

777	   Finally, the actual stabilization interval Tstab that MUST be used
778	   can be obtained by taking the minimum of Tstab-1 and Tstab-2.

780	   The results obtained in [maenpaa2009] indicate that making the
781	   stabilization interval too small has the effect of making the overlay
782	   less stable (e.g., in terms of detected loops and path failures).
783	   Thus, a lower limit should be used for the stabilization period.
784	   Based on the results in [maenpaa2009], a lower limit of 15s is
785	   RECOMMENDED, since using a stabilization period smaller than this
786	   is highly likely to cause too much traffic in the overlay.
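
   The calculations in this section can be combined as in the
   following non-normative sketch.  The function name is illustrative.
   The sketch assumes that U and L are expressed per second, so that
   the result is in seconds, and that all three estimates are positive
   with N greater than 1.

```python
import math

MIN_STABILIZATION_INTERVAL = 15.0  # lower limit (seconds) recommended above


def stabilization_interval(n_estimate, failure_rate, join_rate):
    """Return Tstab = max(min(Tstab-1, Tstab-2), 15 s)."""
    if n_estimate <= 1 or failure_rate <= 0 or join_rate <= 0:
        raise ValueError("N must exceed 1 and both rates must be positive")
    log_squared = math.log2(n_estimate) ** 2
    tf = 1.0 / (2.0 * failure_rate)  # time in which N/2 peers fail
    tstab_1 = tf / log_squared
    tstab_2 = n_estimate / (join_rate * log_squared)
    return max(min(tstab_1, tstab_2), MIN_STABILIZATION_INTERVAL)
```

   For example, with N = 1024, U = 0.0001 failures per second, and
   L = 0.1 joins per second, Tstab-1 is 50 seconds and Tstab-2 is 102.4
   seconds, so the interval is 50 seconds.  If L rises to 1 join per
   second, Tstab-2 drops to 10.24 seconds and the lower limit of 15
   seconds applies.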

788	7.  Overlay Configuration Document Extension

790	   This document extends the RELOAD overlay configuration document by
791	   adding one new element, "number-of-peers-to-probe", inside each
792	   "configuration" element.

794	   self-tuning:number-of-peers-to-probe:  The number of fingers to which
795	      Probe requests are sent to obtain their network size, join rate,
796	      and leave rate estimates.  The default value is 4.

798	   The Relax NG Grammar for this element is:

800	   namespace self-tuning = "urn:ietf:params:xml:ns:p2p:self-tuning"

802	   parameter &= element self-tuning:number-of-peers-to-probe {
803	   xsd:unsignedInt }?

805	   This namespace is added into the <mandatory-extension> element in the
806	   overlay configuration file.
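
   As a non-normative illustration, a peer could read the new element
   as sketched below.  The function name and the XML fragment are
   hypothetical; the fragment is heavily simplified and omits the other
   elements and namespaces of a real overlay configuration document.

```python
import xml.etree.ElementTree as ET

SELF_TUNING_NS = "urn:ietf:params:xml:ns:p2p:self-tuning"
DEFAULT_PEERS_TO_PROBE = 4  # default value defined in this section

# Simplified, hypothetical fragment used only for this example.
EXAMPLE = """
<configuration xmlns:self-tuning="urn:ietf:params:xml:ns:p2p:self-tuning">
  <self-tuning:number-of-peers-to-probe>6</self-tuning:number-of-peers-to-probe>
</configuration>
"""


def number_of_peers_to_probe(configuration_xml):
    """Return the configured value, or the default if the element is absent."""
    root = ET.fromstring(configuration_xml)
    tag = "{%s}number-of-peers-to-probe" % SELF_TUNING_NS
    element = next(root.iter(tag), None)
    if element is None or element.text is None:
        return DEFAULT_PEERS_TO_PROBE
    value = int(element.text.strip())
    if not 0 <= value < 2 ** 32:  # xsd:unsignedInt range
        raise ValueError("value out of range for xsd:unsignedInt")
    return value
```

   Calling number_of_peers_to_probe(EXAMPLE) yields 6, while a
   configuration without the element yields the default of 4.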

808	8.  Security Considerations

810	   In the same way as malicious or compromised peers implementing the
811	   RELOAD base protocol [RFC6940] can advertise false network metrics or
812	   distribute false routing table information for instance in RELOAD
813	   Update messages, malicious peers implementing this specification may
814	   share false join rate, leave rate, and network size estimates.  For
815	   such attacks, the same security concerns apply as in the RELOAD base
816	   specification.  In addition, as long as the number of malicious peers
817	   in the overlay remains modest, the statistical mechanisms applied in
818	   Section 6.5 (i.e., the use of 75th percentiles) to process the shared
819	   estimates a peer obtains help ensure that estimates that are clearly
820	   different from (i.e., larger or smaller than) other received
821	   estimates will not significantly influence the process of adapting
822	   the stabilization interval and routing table size.  However, it
823	   should be noted that if an attacker is able to impersonate a high
824	   number of other peers in the overlay in strategic locations, it may
825	   be able to send a high enough number of false estimates to a victim
826	   and therefore influence the victim's choice of a stabilization
827	   interval.

829	9.  IANA Considerations

831	9.1.  Message Extensions

833	   This document introduces one additional extension to the "RELOAD
834	   Extensions" Registry:

836	                  +------------------+-------+---------------+
837	                  | Extension Name   |  Code | Specification |
838	                  +------------------+-------+---------------+
839	                  | self_tuning_data |   0x3 |      RFC-AAAA |
840	                  +------------------+-------+---------------+

842	   The contents of the extension are defined in Section 6.5.

844	   Note to RFC Editor: please replace AAAA with the RFC number for this
845	   specification.

847	9.2.  A New IETF XML Registry

849	   This document registers one new URI for the self-tuning namespace in
850	   the IETF XML registry defined in [RFC3688].

852	   URI: urn:ietf:params:xml:ns:p2p:self-tuning

854	   Registrant Contact: The IESG

856	   XML: N/A, the requested URI is an XML namespace

858	10.  Acknowledgments

860	   The authors would like to thank Jani Hautakorpi for his contributions
861	   to the document.  The authors would also like to thank Carlos
862	   Bernardos and Martin Durst for their comments on the document.

864	11.  References

866	11.1.  Normative References

868	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
869	              Requirement Levels", BCP 14, RFC 2119, March 1997.

871	   [RFC3688]  Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688,
872	              January 2004.

874	   [RFC5245]  Rosenberg, J., "Interactive Connectivity Establishment
875	              (ICE): A Protocol for Network Address Translator (NAT)
876	              Traversal for Offer/Answer Protocols", RFC 5245, April
877	              2010.

879	   [RFC5389]  Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,
880	              "Session Traversal Utilities for NAT (STUN)", RFC 5389,
881	              October 2008.

883	   [RFC6940]  Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and
884	              H. Schulzrinne, "REsource LOcation And Discovery (RELOAD)
885	              Base Protocol", RFC 6940, January 2014.

887	11.2.  Informative References

889	   [CAN]      Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S.
890	              Schenker, "A Scalable Content-Addressable Network", In
891	              Proceedings of the 2001 Conference on Applications,
892	              Technologies, Architectures and Protocols for Computer
893	              Communications pp. 161-172, August 2001.

895	   [Chord]    Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.,
896	              Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A
897	              Scalable Peer-to-peer Lookup Service for Internet
898	              Applications", IEEE/ACM Transactions on Networking Volume
899	              11, Issue 1, pp. 17-32, February 2003.

901	   [I-D.ietf-p2psip-concepts]
902	              Bryan, D., Matthews, P., Shim, E., Willis, D., and S.
903	              Dawkins, "Concepts and Terminology for Peer to Peer SIP",
904	              draft-ietf-p2psip-concepts-05 (work in progress), July
905	              2013.

907	   [Pastry]   Rowstron, A. and P. Druschel, "Pastry: Scalable,
908	              Decentralized Object Location and Routing for Large-Scale
909	              Peer-to-Peer Systems", In Proceedings of the IFIP/ACM
910	              International Conference on Distributed Systems Platforms
911	              pp. 329-350, November 2001.

913	   [binzenhofer2006]
914	              Binzenhofer, A., Kunzmann, G., and R. Henjes, "A Scalable
915	              Algorithm to Monitor Chord-Based P2P Systems at Runtime",
916	              In Proceedings of the 20th IEEE International Parallel and
917	              Distributed Processing Symposium (IPDPS) pp. 1-8, April
918	              2006.

920	   [ghinita2006]
921	              Ghinita, G. and Y. Teo, "An Adaptive Stabilization
922	              Framework for Distributed Hash Tables", In Proceedings of
923	              the 20th IEEE International Parallel and Distributed
924	              Processing Symposium (IPDPS) pp. 29-38, April 2006.

926	   [horowitz2003]
927	              Horowitz, K. and D. Malkhi, "Estimating Network Size from
928	              Local Information", Information Processing Letters Volume
929	              88, Issue 5, pp. 237-243, December 2003.

931	   [kostoulas2005]
932	              Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and
933	              A. Demers, "Decentralized Schemes for Size Estimation in
934	              Large and Dynamic Groups", In Proceedings of the 4th IEEE
935	              International Symposium on Network Computing and
936	              Applications pp. 41-48, July 2005.

938	   [krishnamurthy2008]
939	              Krishnamurthy, S., El-Ansary, S., Aurell, E., and S.
940	              Haridi, "Comparing Maintenance Strategies for Overlays",
941	              In Proceedings of the 16th Euromicro Conference on
942	              Parallel, Distributed and Network-Based Processing pp.
943	              473-482, February 2008.

945	   [li2004]   Li, J., Stribling, J., Gil, T., Morris, R., and M.
946	              Kaashoek, "Comparing the Performance of Distributed Hash
947	              Tables Under Churn", Peer-to-Peer Systems III, volume 3279
948	              of Lecture Notes in Computer Science Springer, pp. 87-99,
949	              February 2005.

951	   [liben-nowell2002]
952	              Liben-Nowell, D., Balakrishnan, H., and D. Karger,
953	              "Observations on the Dynamic Evolution of Peer-to-Peer
954	              Networks", In Proceedings of the 1st International
955	              Workshop on Peer-to-Peer Systems (IPTPS) pp. 22-33, March
956	              2002.

958	   [maenpaa2009]
959	              Maenpaa, J. and G. Camarillo, "A Study on Maintenance
960	              Operations in a Chord-Based Peer-to-Peer Session
961	              Initiation Protocol Overlay Network", In Proceedings of
962	              the 23rd IEEE International Parallel and Distributed
963	              Processing Symposium (IPDPS) pp. 1-9, May 2009.

965	   [mahajan2003]
966	              Mahajan, R., Castro, M., and A. Rowstron, "Controlling the
967	              Cost of Reliability in Peer-to-Peer Overlays", In
968	              Proceedings of the 2nd International Workshop on Peer-to-
969	              Peer Systems (IPTPS) pp. 21-32, February 2003.

971	   [rhea2004]
972	              Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz,
973	              "Handling Churn in a DHT", In Proceedings of the USENIX
974	              Annual Technical Conference pp. 127-140, June 2004.

976	   [weiss1998]
977	              Weiss, M., "Data Structures and Algorithm Analysis in
978	              C++", Addison-Wesley Longman Publishing Co., Inc. 2nd
979	              Edition, ISBN:0201361221, 1998.

981	Authors' Addresses

983	   Jouni Maenpaa
984	   Ericsson
985	   Hirsalantie 11
986	   Jorvas  02420
987	   Finland

989	   Email: Jouni.Maenpaa@ericsson.com

991	   Gonzalo Camarillo
992	   Ericsson
993	   Hirsalantie 11
994	   Jorvas  02420
995	   Finland

997	   Email: Gonzalo.Camarillo@ericsson.com