idnits 2.17.1 draft-ietf-p2psip-self-tuning-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** You're using the IETF Trust Provisions' Section 6.b License Notice from 12 Sep 2009 rather than the newer Notice from 28 Dec 2009. (See https://trustee.ietf.org/license-info/) Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to use 'NOT RECOMMENDED' as an RFC 2119 keyword, but does not include the phrase in its RFC 2119 key words list. -- The document date (December 8, 2009) is 5252 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-26) exists of draft-ietf-p2psip-base-06 ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489) == Outdated reference: A later version (-09) exists of draft-ietf-p2psip-concepts-02 Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 P2PSIP Working Group J. Maenpaa 3 Internet-Draft G. Camarillo 4 Intended status: Standards Track J. Hautakorpi 5 Expires: June 11, 2010 Ericsson 6 December 8, 2009 8 A Self-tuning Distributed Hash Table (DHT) for REsource LOcation And 9 Discovery (RELOAD) 10 draft-ietf-p2psip-self-tuning-00.txt 12 Abstract 14 REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P) 15 signaling protocol that provides an overlay network service. Peers 16 in a RELOAD overlay network collectively run an overlay algorithm to 17 organize the overlay, and to store and retrieve data. This document 18 describes how the default topology plugin of RELOAD can be extended 19 to support self-tuning, that is, to adapt to changing operating 20 conditions such as churn and network size. 22 Status of this Memo 24 This Internet-Draft is submitted to IETF in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF), its areas, and its working groups. Note that 29 other groups may also distribute working documents as Internet- 30 Drafts. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 The list of current Internet-Drafts can be accessed at 38 http://www.ietf.org/ietf/1id-abstracts.txt. 40 The list of Internet-Draft Shadow Directories can be accessed at 41 http://www.ietf.org/shadow.html. 43 This Internet-Draft will expire on June 11, 2010. 45 Copyright Notice 47 Copyright (c) 2009 IETF Trust and the persons identified as the 48 document authors. All rights reserved. 50 This document is subject to BCP 78 and the IETF Trust's Legal 51 Provisions Relating to IETF Documents 52 (http://trustee.ietf.org/license-info) in effect on the date of 53 publication of this document. Please review these documents 54 carefully, as they describe your rights and restrictions with respect 55 to this document. Code Components extracted from this document must 56 include Simplified BSD License text as described in Section 4.e of 57 the Trust Legal Provisions and are provided without warranty as 58 described in the BSD License. 60 Table of Contents 62 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 63 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 64 3. Introduction to Stabilization in DHTs . . . . . . . . . . . . 5 65 3.1. Reactive vs. Periodic Stabilization . . . . . . . . . . . 5 66 3.2. Configuring Periodic Stabilization . . . . . . . . . . . . 6 67 3.3. Adaptive Stabilization . . . . . . . . . . . . . . . . . . 7 68 4. Introduction to Chord . . . . . . . . . . . . . . . . . . . . 7 69 5. Extending Chord-reload to Support Self-tuning . . . . . . . . 9 70 5.1. Update Requests . . . . . . . . . . . . . . . . . . . . . 10 71 5.2. Neighbor Stabilization . . . . . . . . . . . . . . . . . . 10 72 5.3. Finger Stabilization . . . . . . . . . . . . . . . . . . . 11 73 5.4. Adjusting Finger Table Size . . . . . . . . . . . . . . . 11 74 5.5. Detecting Partitioning . . . . . . . . . . . . . . . . . . 12 75 5.6. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 12 76 5.6.1. Contents of the Leave Message . . . . . . . . . . . . 12 77 6. Self-tuning Chord Parameters . . . . . . . . . . . . . . . . . 13 78 6.1. Estimating Overlay Size . . . . . . . . . . . . . . . . . 13 79 6.2. Determining Routing Table Size . . . . . . . . . . . . . . 14 80 6.3. Estimating Failure Rate . . . . . . . . . . . . . . . . . 14 81 6.3.1. Detecting Failures . . . . . . . . . . . . . . . . . . 15 82 6.4. Estimating Join Rate . . . . . . . . . . . . . . . . . . . 16 83 6.5. Calculating the Stabilization Interval . . . . . . . . . . 16 84 7. Overlay Configuration Document Extension . . . . . . . . . . . 17 85 8. Security Considerations . . . . . . . . . . . . . . . . . . . 17 86 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 87 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 88 10.1. Normative References . . . . . . . . . . . . . . . . . . . 18 89 10.2. Informative References . . . . . . . . . . . . . . . . . . 18 90 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20 92 1. Introduction 94 REsource LOcation And Discovery (RELOAD) [I-D.ietf-p2psip-base] is a 95 peer-to-peer signaling protocol that can be used to maintain an 96 overlay network, and to store data in and retrieve data from the 97 overlay. For interoperability reasons, RELOAD specifies one overlay 98 algorithm, called chord-reload, that is mandatory to implement. This 99 document extends the chord-reload algorithm by introducing self- 100 tuning behavior. 102 DHT-based overlay networks are self-organizing, scalable and 103 reliable. However, these features come at a cost: peers in the 104 overlay network need to consume network bandwidth to maintain routing 105 state. Most DHTs use a periodic stabilization routine to counter the 106 undesirable effects of churn on routing. To configure the parameters 107 of a DHT, some characteristics such as churn rate and network size 108 need to be known in advance. These characteristics are then used to 109 configure the DHT in a static fashion by using fixed values for 110 parameters such as the size of the successor set, size of the routing 111 table, and rate of maintenance messages. The problem with this 112 approach is that it is not possible to achieve a low failure rate and 113 a low communication overhead by using fixed parameters. Instead, a 114 better approach is to allow the system to take into account the 115 evolution of network conditions and adapt to them. This document 116 extends the mandatory-to-implement chord-reload algorithm by making 117 it self-tuning. Two main advantages of self-tuning are that users no 118 longer need to tune every DHT parameter correctly for a given 119 operating environment and that the system adapts to changing 120 operating conditions. 122 The remainder of this document is structured as follows: Section 2 123 provides definitions of terms used in this document. Section 3 124 discusses alternative approaches to stabilization operations in DHTs, 125 including reactive stabilization, periodic stabilization, and 126 adaptive stabilization. Section 4 gives an introduction to the Chord 127 DHT algorithm. Section 5 describes how this document extends the 128 stabilization routine of the chord-reload algorithm. Section 6 129 describes how the stabilization rate and routing table size are 130 calculated in an adaptive fashion. 132 2. Terminology 134 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 135 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 136 document are to be interpreted as described in RFC 2119 [RFC2119]. 138 This document uses the terminology and definitions from the Concepts 139 and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts] 140 draft. 142 Chord Ring: The Chord DHT orders identifiers on an identifier circle 143 of size 2^numBitsInNodeId (numBitsInNodeId is the number of bits 144 in node identifiers). This identifier circle is called the Chord 145 ring. 147 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 148 distributed systems that provide a lookup service similar to a 149 hash table. Given a key, any participating peer can retrieve the 150 value associated with that key. The responsibility for 151 maintaining the mapping from keys to values is distributed among 152 the peers. 154 Finger Table: A data structure with up to numBitsInNodeId entries 155 maintained by each peer in a Chord-based overlay. The ith entry 156 in the finger table of peer n contains the identity of the first 157 peer that succeeds n by at least 2^(numBitsInNodeId-i) on the 158 Chord ring. This peer is called the ith finger of peer n. As an 159 example, the first entry in the finger table of peer n contains a 160 peer half-way around the Chord ring from peer n. The purpose of 161 the finger table is to accelerate lookups. 163 log2(N): Logarithm of N with base 2. 165 n.id: Peer-ID of peer n. 167 Neighborhood Set: Consists of successor and predecessor lists. 169 numBitsInNodeId: Number of bits in a Node-ID. 171 O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means 172 that f(n) is less than some constant multiple of g(n). 174 Omega(g(n)): Informally, saying that some equation f(n) = 175 Omega(g(n)) means that f(n) is more than some constant multiple of 176 g(n). 178 Predecessor List: A data structure containing the predecessors of a 179 peer on the Chord ring. 181 Responsible Peer: The peer storing the original copy of a resource. 182 In Chord, this is the first peer whose identifier is equal to or 183 follows the identifier of the resource on the Chord ring. Also 184 the term "Root Node" can be used for this concept. 186 Routing Table: The set of peers that a node can use to route overlay 187 messages. The routing table consists of the finger table, 188 successor list and predecessor list. 190 Successor List: A data structure containing the first r successors 191 of a peer on the Chord ring. 193 3. Introduction to Stabilization in DHTs 195 DHT-based peer-to-peer overlay networks are self-organizing, scalable 196 and reliable. However, these features come at a cost: peers in the 197 overlay network need to consume network bandwidth to maintain routing 198 state. DHTs use stabilization routines to counter the undesirable 199 effects of churn on routing. The purpose of stabilization is to keep 200 the routing information of each peer in the overlay consistent with 201 the constantly changing overlay topology. There are two alternative 202 approaches to stabilization: periodic and reactive [rhea2004]. 203 Periodic stabilization can either use a fixed stabilization rate or 204 calculate the stabilization rate in an adaptive fashion. 206 3.1. Reactive vs. Periodic Stabilization 208 In reactive stabilization, a peer reacts to the loss of a peer in its 209 neighborhood set or to the appearance of a new peer that should be 210 added to its neighborhood set by sending a copy of its neighbor table 211 to all peers in the neighborhood set. Periodic recovery, in 212 contrast, takes place independently of changes in the neighborhood 213 set. In periodic recovery, a peer periodically shares its 214 neighborhood set with each or a subset of the members of that set. 216 The chord-reload algorithm [I-D.ietf-p2psip-base] supports both 217 reactive and periodic stabilization. It has been shown in [rhea2004] 218 that reactive stabilization works well for small neighborhood sets 219 (i.e., small overlays) and moderate churn. However, in large-scale 220 (e.g., 1000 peers or more [rhea2004]) or high-churn overlays, 221 reactive stabilization runs the risk of creating a positive feedback 222 cycle, which can eventually result in congestion collapse. In 223 [rhea2004], it is shown that a 1000-peer overlay under churn uses 224 significantly less bandwidth and has lower latencies when periodic 225 stabilization is used than when reactive stabilization is used. 226 Although in the experiments carried out in [rhea2004], reactive 227 stabilization performed well when there was no churn, its bandwidth 228 use was observed to jump dramatically under churn. At higher churn 229 rates and larger scale overlays, periodic stabilization uses less 230 bandwidth and the resulting lower contention for the network leads to 231 lower latencies. For this reason, most DHTs such as CAN [CAN], Chord 232 [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic 233 stabilization [ghinita2006]. As an example, the first version of 234 Bamboo used reactive stabilization, which caused Bamboo to suffer 235 from degradation in performance under churn. To fix this problem, 236 Bamboo was modified to use periodic stabilization. 238 In Chord, periodic stabilization is typically done both for 239 successors and fingers. An alternative strategy is analyzed in 240 [krishnamurthy2008]. In this strategy, called the correction-on- 241 change maintenance strategy, a peer periodically stabilizes its 242 successors but does not do so for its fingers. Instead, finger 243 pointers are stabilized in a reactive fashion. The results obtained 244 in [krishnamurthy2008] imply that although the correction-on-change 245 strategy works well when churn is low, periodic stabilization 246 outperforms the correction-on-change strategy when churn is high. 248 3.2. Configuring Periodic Stabilization 250 When periodic stabilization is used, one faces the problem of 251 selecting an appropriate execution rate for the stabilization 252 procedure. If the execution rate of periodic stabilization is high, 253 changes in the system can be quickly detected, but at the 254 disadvantage of increased communication overhead. Alternatively, if 255 the stabilization rate is low and the churn rate is high, routing 256 tables become inaccurate and DHT performance deteriorates. Thus, the 257 problem is setting the parameters so that the overlay achieves the 258 desired reliability and performance even in challenging conditions, 259 such as under heavy churn. This naturally results in high cost 260 during periods when the churn level is lower than expected, or 261 alternatively, poor performance or even network partitioning in worse 262 than expected conditions. 264 In addition to selecting an appropriate stabilization interval, 265 regardless of whether periodic stabilization is used or not, an 266 appropriate size needs to be selected for the neighborhood set and 267 for the routing table. 269 The current approach is to configure overlays statically. This works 270 in situations where perfect information about the future is 271 available. In situations where the operating conditions of the 272 network are known in advance and remain static throughout the 273 lifetime of the system, it is possible to choose fixed optimal values 274 for parameters such as stabilization rate, neighborhood set size and 275 routing table size. However, if the operating conditions (e.g., the 276 size of the overlay and its churn rate) do not remain static but 277 evolve with time, it is not possible to achieve both a low lookup 278 failure rate and a low communication overhead by using fixed 279 parameters [ghinita2006]. 281 As an example, to configure the Chord DHT algorithm, one needs to 282 select values for the following parameters: size of successor list, 283 stabilization interval, and size of the finger table. To select an 284 appropriate value for the stabilization interval, one needs to know 285 the expected churn rate and overlay size. According to 286 [liben-nowell2002], a Chord network in a ring-like state remains in a 287 ring-like state as long as peers send Omega(log2^2(N)) messages 288 before N new peers join or N/2 peers fail. Thus, in a 500-peer 289 overlay churning at a rate such that one peer joins and one peer 290 leaves the network every 30 seconds, an appropriate stabilization 291 interval would be on the order of 93s. According to [Chord], the 292 size of the successor list and finger table should be on the order of 293 log2(N). Having a successor list of size O(log2(N)) makes it 294 unlikely that a peer will lose all of its successors, which would 295 cause the Chord ring to become disconnected. Thus, in a 500-peer 296 network each peer should maintain on the order of nine successors and 297 fingers. However, if the churn rate doubles and the network size 298 remains unchanged, the stabilization rate should double as well. 299 That is, the appropriate maintenance interval would now be on the 300 order of 46s. On the other hand, if the churn rate becomes e.g. six- 301 fold and the size of the network grows to 2000 peers, on the order of 302 eleven fingers and successors should be maintained and the 303 stabilization interval should be on the order of 42s. If one 304 continued using the old values, this could result in inaccurate 305 routing tables, network partitioning, and deteriorating performance. 307 3.3. Adaptive Stabilization 309 A self-tuning DHT takes into consideration the continuous evolution 310 of network conditions and adapts to them. In a self-tuning DHT, each 311 peer collects statistical data about the network and dynamically 312 adjusts its stabilization rate, neighborhood set size, and finger 313 table size based on the analysis of the data [ghinita2006]. 314 Reference [mahajan2003] shows that by using self-tuning, it is 315 possible to achieve high reliability and performance even in adverse 316 conditions with low maintenance cost. Adaptive stabilization has 317 been shown to outperform periodic stabilization in terms of both 318 lookup failures and communication overhead [ghinita2006]. 320 4. Introduction to Chord 322 Chord [Chord] is a structured P2P algorithm that uses consistent 323 hashing to build a DHT out of several independent peers. Consistent 324 hashing assigns each peer and resource a fixed-length identifier. 325 Peers use SHA-1 as the base hash fuction to generate the identifiers. 327 As specified in RELOAD base, the length of the identifiers is 328 numBitsInNodeId=128 bits. The identifiers are ordered on an 329 identifier circle of size 2^numBitsInNodeId. On the identifier 330 circle, key k is assigned to the first peer whose identifier equals 331 or follows the identifier of k in the identifier space. The 332 identifier circle is called the Chord ring. 334 Different DHTs differ significantly in performance when bandwidth is 335 limited. It has been shown that when compared to other DHTs, the 336 advantages of Chord include that it uses bandwidth efficiently and 337 can achieve low lookup latencies at little cost [li2004]. 339 A simple lookup mechanism could be implemented on a Chord ring by 340 requiring each peer to only know how to contact its current successor 341 on the identifier circle. Queries for a given identifier could then 342 be passed around the circle via the successor pointers until they 343 encounter the first peer whose identifier is equal to or larger than 344 the desired identifier. Such a lookup scheme uses a number of 345 messages that grows linearly with the number of peers. To reduce the 346 cost of lookups, Chord maintains also additional routing information; 347 each peer n maintains a data structure with up to numBitsInNodeId 348 entries, called the finger table. The first entry in the finger 349 table of peer n contains the peer half-way around the ring from peer 350 n. The second entry contains the peer that is 1/4th of the way 351 around, the third entry the peer that is 1/8th of the way around, 352 etc. In other words, the ith entry in the finger table at peer n 353 contains the identity of the first peer s that succeeds n by at least 354 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the ith 355 finger of peer n. The interval between two consecutive fingers is 356 called a finger interval. The ith finger interval of peer n covers 357 the range [n.id + 2^(numBitsInNodeId-i), n.id + 358 2^(numBitsInNodeId-i+1)) on the Chord ring. In an N-peer network, 359 each peer maintains information about O(log2(N)) other peers in its 360 finger table. As an example, if N=100000, it is sufficient to 361 maintain 17 fingers. 363 Chord needs all peers' successor pointers to be up to date in order 364 to ensure that lookups produce correct results as the set of 365 participating peers changes. To achieve this, peers run a 366 stabilization protocol periodically in the background. The 367 stabilization protocol of the original Chord algorithm uses two 368 operations: successor stabilization and finger stabilization. 369 However, the Chord algorithm of RELOAD base defines two additional 370 stabilization components, as will be discussed below. 372 To increase robustness in the event of peer failures, each Chord peer 373 maintains a successor list of size r, containing the peer's first r 374 successors. The benefit of successor lists is that if each peer 375 fails independently with probability p, the probability that all r 376 successors fail simultaneously is only p^r. 378 The original Chord algorithm maintains only a single predecessor 379 pointer. However, multiple predecessor pointers (i.e., a predecessor 380 list) can be maintained to speed up recovery from predecessor 381 failures. The routing table of a peer consists of the successor 382 list, finger table, and predecessor list. 384 5. Extending Chord-reload to Support Self-tuning 386 This section describes how the mandatory-to-implement chord-reload 387 algorithm defined in RELOAD base [I-D.ietf-p2psip-base] can be 388 extended to support self-tuning. 390 The chord-reload algorithm supports both reactive and periodic 391 recovery strategies. When the self-tuning mechanisms defined in this 392 document are used, the periodic recovery strategy MUST be used. 393 Further, chord-reload specifies that at least three predecessors and 394 three successors need to be maintained. When the self-tuning 395 mechanisms are used, the appropriate sizes of the successor list and 396 predecessor list are determined in an adaptive fashion based on the 397 estimated network size, as will be described in Section 6. 399 As specified in RELOAD base, each peer MUST maintain a stabilization 400 timer. When the stabilization timer fires, the peer MUST restart the 401 timer and carry out the overlay stabilization routine. Overlay 402 stabilization has four components in chord-reload: 404 1. Update the neighbor table. We refer to this as neighbor 405 stabilization. 406 2. Refreshing the finger table. We refer to this as finger 407 stabilization. 408 3. Adjusting finger table size. 409 4. Detecting partitioning. We refer to this as strong 410 stabilization. 412 As specified in RELOAD base [I-D.ietf-p2psip-base], a peer sends 413 periodic messages as part of the neighbor stabilization, finger 414 stabilization, and strong stabilization routines. In neighbor 415 stabilization, a peer periodically sends an Update request to every 416 peer in its Connection Table. The default time is every ten minutes. 417 In finger stabilization, a peer periodically searches for new peers 418 to include in its finger table. This time defaults to one hour. 419 This document specifies how the neighbor stabilization and finger 420 stabilization intervals can be determined in an adaptive fashion 421 based on the operating conditions of the overlay. The subsections 422 below describe how this document extends the four components of 423 stabilization. 425 5.1. Update Requests 427 The neighbor and finger stabilization procedures are implemented 428 using the Update request defined in RELOAD base 429 [I-D.ietf-p2psip-base]. RELOAD base defines three types of Update 430 requests: 'peer_ready', 'neighbors', and 'full'. Regardless of the 431 type, all Update requests include an 'uptime' field. Since the self- 432 tuning extensions require information on the uptimes of peers in the 433 routing table, the sender of an Update request MUST include its 434 current uptime in seconds in the 'uptime' field. 436 When self-tuning is used, each peer decides independently the 437 appropriate size for the successor list, predecessor list and finger 438 table. Thus, the 'predecessors', 'successors', and 'fingers' fields 439 included in RELOAD Update requests are of variable length. As 440 specified in RELOAD [I-D.ietf-p2psip-base], variable length fields 441 are on the wire preceded by length bytes. In the case of the 442 successor list, predecessor list, and finger table, there are two 443 length bytes (allowing lengths up to 2^16-1). The number of NodeId 444 structures included in each field can be calculated based on the 445 length bytes since the size of a single NodeId structure is 16 bytes. 446 If a peer receives more entries than fit into its successor list, 447 predecessor list or finger table, the peer MUST ignore the extra 448 entries. If a peer receives less entries than it currently has in 449 its own data structure, the peer MUST NOT drop the extra entries from 450 its data structure. 452 5.2. Neighbor Stabilization 454 In the neighbor stabilization operation of chord-reload, a peer 455 periodically sends an Update request to every peer in its Connection 456 Table. In a small, low-churn overlay, the amount of traffic this 457 process generates is typically acceptable. However, in a large-scale 458 overlay churning at a moderate or high churn rate, the traffic load 459 may no longer be acceptable since the size of the connection table is 460 large and the stabilization interval relatively short. The self- 461 tuning mechanisms described in this document are especially designed 462 for overlays of the latter type. Therefore, when the self-tuning 463 mechanisms are used, each peer MUST only send a periodic Update 464 request to its first predecessor and first successor on the Chord 465 ring. 467 The neighbor stabilization routine MUST be executed when the 468 stabilization timer fires. To begin the neighbor stabilization 469 routine, a peer MUST send an Update request to its first successor 470 and its first predecessor. The type of the Update request MUST be 471 'neighbors'. The Update request MUST include the successor and 472 predecessor lists of the sender. If a peer receiving such an Update 473 request learns from the predecessor and successor lists included in 474 the request that new peers can be included in its neighborhood set, 475 it MUST send Attach requests to the new peers. 477 After a new peer has been added to the predecessor or successor list, 478 an Update request of type 'peer_ready' MUST be sent to the new peer. 479 This allows the new peer to insert the sender into its neighborhood 480 set. 482 5.3. Finger Stabilization 484 In the finger stabilization routine of chord-reload, a peer 485 periodically searches for new peers to replace invalid (that is, 486 repeated or failed) entries in the finger table. Chord-reload 487 provides two possible methods for searching for new peers to include 488 in the finger table. In alternative 1, a peer selects one entry from 489 among the invalid entries in its finger table each time the 490 stabilization timer fires and sends a Ping request to the selected 491 entry. In alternative 2, a peer sends a RouteQuery request to all 492 invalid entries in the finger table. After having received 493 RouteQuery responses, the peer sends a Ping to those entries for 494 which the RouteQuery response came from a peer not already present in 495 the routing table. When the self-tuning mechanisms defined in this 496 draft are being used in the overlay, alternative 1 MUST be used since 497 the traffic load it generates is lower and thus more appropriate for 498 large-scale overlays experiencing churn. 500 Immediately after a new peer has been added to the finger table, a 501 Probe request MUST be sent to the new peer to fetch its uptime. The 502 requested_info field of the Probe request MUST be set to contain the 503 ProbeInformationType 'uptime' defined in RELOAD base 504 [I-D.ietf-p2psip-base]. 506 5.4. Adjusting Finger Table Size 508 The chord-reload algorithm defines how a peer can make sure that the 509 finger table is appropriately sized to allow for efficient routing. 510 Since the self-tuning mechanisms specified in this document produce a 511 network size estimate, this estimate can be directly used to 512 calculate the optimal size for the finger table. This mechanism MUST 513 be used instead of the one specified by chord-reload. A peer MUST 514 use the network size estimate to determine whether it needs to adjust 515 the size of its finger table each time when the stabilization timer 516 fires. The way this is done is explained in Section 6.2. 518 5.5. Detecting Partitioning 520 This document does not require any changes to the mechanism chord- 521 reload uses to detect network partitioning. 523 5.6. Leaving the Overlay 525 This document extends the Leave request defined in RELOAD base 526 [I-D.ietf-p2psip-base]. As specified in RELOAD base, a leaving peer 527 SHOULD send a Leave request to all members of its neighbor table 528 prior to leaving the overlay. When the self-tuning extensions are 529 used, the Leave requests sent to the first predecessor and first 530 successor contain additional data. The Leave request that is sent to 531 the first successor MUST contain the predecessor list of the leaving 532 peer. The Leave request that is sent to the first predecessor MUST 533 contain the successor list of the leaving peer. If the first 534 successor can identify better predecessors than are already included 535 in its predecessor list by investigating the predecessor list it 536 receives from the leaving peer, it MUST send Attach requests to them. 537 Similarly, if the first predecessor identifies better successors by 538 investigating the successor list it receives from the leaving peer, 539 it MUST send Attach requests to them. 541 5.6.1. Contents of the Leave Message 543 The Leave request defined in RELOAD base [I-D.ietf-p2psip-base] has a 544 field for overlay specific data. When the self-tuning extensions are 545 being used, this overlay_specific_data field MUST contain the 546 ChordLeaveData structure defined below: 548 enum { reserved (0), 549 from_succ(1), from_pred(2), (255) } 550 ChordLeaveType; 552 struct { 553 ChordLeaveType type; 555 select(type) { 556 case from_succ: 557 NodeId successors<0..2^16-1>; 558 case from_pred: 559 NodeId predecessors<0..2^16-1>; 560 }; 561 } ChordLeaveData; 563 The 'type' field indicates whether the Leave request was sent by a 564 predecessor or a successor of the recipient: 566 from_succ 567 The Leave request was sent by a successor. 569 from_pred 570 The Leave request was sent by a predecessor. 572 If the type of the request is 'from_succ', the contents will be: 574 successors 575 The sender's successor list. 577 If the type of the request is 'from_pred', the contents will be: 579 predecessors 580 The sender's predecessor list. 582 6. Self-tuning Chord Parameters 584 This section specifies how to determine an appropriate stabilization 585 rate and routing table size in an adaptive fashion. The proposed 586 mechanism is based on [mahajan2003], [liben-nowell2002], and 587 [ghinita2006]. To calculate an appropriate stabilization rate, the 588 values of three parameters MUST be estimated: overlay size N, failure 589 rate U, and join rate L. To calculate an appropriate routing table 590 size, the estimated network size N can be used. Peers in the overlay 591 MUST re-calculate the values of the parameters to self-tune the 592 chord-reload algorithm at the end of each stabilization period before 593 re-starting the stabilization timer. 595 6.1. Estimating Overlay Size 597 Techniques for estimating the size of an overlay network have been 598 proposed for instance in [mahajan2003], [horowitz2003], 599 [kostoulas2005], [binzenhofer2006], and [ghinita2006]. In Chord, the 600 density of peer identifiers in the neighborhood set can be used to 601 produce an estimate of the size of the overlay, N [mahajan2003]. 602 Since peer identifiers are picked randomly with uniform probability 603 from the numBitsInNodeId-bit identifier space, the average distance 604 between peer identifiers in the successor set is 605 (2^numBitsInNodeId)/N. 607 To estimate the overlay network size, a peer MUST compute the average 608 inter-peer distance d between the successive peers starting from the 609 most distant predecessor and ending to the most distant successor in 610 the successor list. The estimated network size MUST be calculated 611 as: 613 2^numBitsInNodeId 614 N = ------------------- 615 d 617 This estimate has been found to be accurate within 15% of the real 618 network size [ghinita2006]. Of course, the size of the neighborhood 619 set affects the accuracy of the estimate. 621 During the join process, a joining peer fills its routing table by 622 sending a series of Ping and Attach requests, as specified in RELOAD 623 base [I-D.ietf-p2psip-base]. Thus, a joining peer immediately has 624 enough information at its disposal to calculate an estimate of the 625 network size. 627 6.2. Determining Routing Table Size 629 As specified in RELOAD base, the finger table must contain at least 630 16 entries. When the self-tuning mechanisms are used, the size of 631 the finger table MUST be set to max(log2(N), 16) using the estimated 632 network size N. 634 The size of the successor list MUST be set to log2(N). An 635 implementation MAY place a lower limit on the size of the successor 636 list. As an example, the implementation might require the size of 637 the successor list to be always at least three. 639 A peer MAY choose to maintain a fixed-size predecessor list with only 640 three entries as specified in RELOAD base. However, it is 641 RECOMMENDED that a peer maintains log2(N) predecessors. 643 6.3. Estimating Failure Rate 645 A typical approach is to assume that peers join the overlay according 646 to a Poisson process with rate L and leave according to a Poisson 647 process with rate parameter U [mahajan2003]. The value of U can be 648 estimated using peer failures in the finger table and neighborhood 649 set [mahajan2003]. If peers fail with rate U, a peer with M unique 650 peer identifiers in its routing table should observe K failures in 651 time K/(M*U). Every peer in the overlay MUST maintain a history of 652 the last K failures. The current time MUST be inserted into the 653 history when the peer joins the overlay. The estimate of U MUST be 654 calculated as: 656 k 657 U = --------, 658 M * Tk 660 where M is the number of unique peer identifiers in the routing 661 table, Tk is the time between the first and the last failure in the 662 history, and k is the number of failures in the history. If k is 663 smaller than K, the estimate MUST be computed as if there was a 664 failure at the current time. It has been shown that an estimate 665 calculated in a similar manner is accurate within 17% of the real 666 value of U [ghinita2006]. 668 The size of the failure history K affects the accuracy of the 669 estimate of U. One can increase the accuracy by increasing K. 670 However, this has the side effect of decreasing responsiveness to 671 changes in the failure rate. On the other hand, a small history size 672 may cause a peer to overreact each time a new failure occurs. In 673 [ghinita2006], K is set 25% of the routing table size. Use of this 674 approach is RECOMMENDED. 676 6.3.1. Detecting Failures 678 A new failure MUST be inserted to the failure history in the 679 following cases: 681 1. A Leave request is received from a neigbhor. 682 2. No data has been received from a peer in the Connection Table for 683 a specified amount of time. RELOAD mandates the use of STUN 684 [RFC5389] for keepalives. STUN keepalives take the form of STUN 685 Binding Indication transactions. As specified in ICE 686 [I-D.ietf-mmusic-ice], a peer sends a STUN Binding Indication if 687 there has been no packet sent on a connection for Tr seconds. Tr 688 is configurable and has a default of 15 seconds. Although STUN 689 Binding Indications do not generate a response, the fact that a 690 peer has failed can be learned from the lack of packets (Binding 691 Indications or application protocol packets) received from the 692 peer. A peer MUST be considered to have failed if no packets 693 have been received from it for the past max-silent-periods*Tr 694 seconds. max-silent-periods is a new element in the overlay 695 configuration document. The default value is three. 697 As an alternative to relying on STUN keepalives to detect peer 698 failure, a peer could send additional, frequent RELOAD messages to 699 every peer in its Connection Table. These messages could be Update 700 requests, in which case they would serve two purposes: detecting peer 701 failure and stabilization. However, as the cost of this approach can 702 be very high in terms of bandwidth consumption and traffic load, 703 especially in large-scale overlays experiencing churn, its use is NOT 704 RECOMMENDED. 706 6.4. Estimating Join Rate 708 Reference [ghinita2006] proposes that a peer can estimate the join 709 rate based on the uptime of the peers in its routing table. An 710 increase in peer join rate will be reflected by a decrease in the 711 average age of peers in the routing table. Thus, each peer MUST 712 maintain an array of the ages of the peers in its routing table 713 sorted in increasing order. Using this information, an estimate of 714 the global peer join rate L MUST be calculated as: 716 N 1 717 L = --- * ---------------, 718 4 Ages[rsize/4] 720 where Ages is an array containing the ages of the peers in the 721 routing table sorted in increasing order and rsize is the size of the 722 routing table. It is RECOMMENDED that only the ages of the 25% of 723 the youngest peers in the routing table (i.e., the 25th percentile) 724 are used to reduce the bias that a small number of peers with very 725 old ages can cause [ghinita2006]. It has been shown that the 726 estimate obtained by using this method is accurate within 22% of the 727 real join rate [ghinita2006]. Of course, the size of the routing 728 table affects the accuracy. 730 In order for this mechanism to work, peers need to exchange 731 information about the time they have been present in the overlay. 732 Peers receive the uptimes of their successors and predecessors during 733 the stabilization operations since all Update requests carry uptime 734 values. A joining peer learns the uptime of the admitting peer since 735 it receives an Update from the admitting peer during the join 736 procedure. Peers learn the uptimes of new fingers since they can 737 fetch the uptime using a Probe request after having attached to the 738 new finger. 740 6.5. Calculating the Stabilization Interval 742 According to [liben-nowell2002], a Chord network in a ring-like state 743 remains in a ring-like state as long as peers send Omega(log2^2(N)) 744 messages before N new peers join or N/2 peers fail. We can use the 745 estimate of peer failure rate, U, to calculate the time Tf in which 746 N/2 peers fail: 748 1 749 Tf = ------ 750 2*U 752 Based on this estimate, a stabilization interval Tstab-1 MUST be 753 calculated as: 755 Tf 756 Tstab-1 = ----------- 757 log2^2(N) 759 On the other hand, the estimated join rate L can be used to calculate 760 the time in which N new peers join the overlay. Based on the 761 estimate of L, a stabilization interval Tstab-2 MUST be calculated 762 as: 764 N 765 Tstab-2 = --------------- 766 L * log2^2(N) 768 Finally, the actual stabilization interval Tstab that MUST be used 769 can be obtained by taking the minimum of Tstab-1 and Tstab-2. 771 The results obtained in [maenpaa2009] indicate that making the 772 stabilization interval too small has the effect of making the overlay 773 less stable (e.g., in terms of detected loops and path failures). 774 Thus, a lower limit should be used for the stabilization period. 775 Based on the results in [maenpaa2009], a lower limit of 15s is 776 RECOMMENDED, since using a stabilization period smaller than this 777 will with a high probability cause too much traffic in the overlay. 779 7. Overlay Configuration Document Extension 781 This document extends the RELOAD overlay configuration document by 782 adding one new element, "max-silent-periods", inside each 783 "configuration" element. 785 max-silent-periods: A peer MUST be considered to have failed if no 786 packets have been received from it for the past max-silent- 787 periods*Tr seconds (Tr is defined in [I-D.ietf-mmusic-ice]). The 788 default value of the max-silent-periods element is 3. 790 8. Security Considerations 792 There are no new security considerations introduced in this document. 793 The security considerations of RELOAD [I-D.ietf-p2psip-base] apply. 795 9. IANA Considerations 797 This document has no actions for IANA. 799 10. References 801 10.1. Normative References 803 [I-D.ietf-mmusic-ice] 804 Rosenberg, J., "Interactive Connectivity Establishment 805 (ICE): A Protocol for Network Address Translator (NAT) 806 Traversal for Offer/Answer Protocols", 807 draft-ietf-mmusic-ice-19 (work in progress), October 2007. 809 [I-D.ietf-p2psip-base] 810 Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and 811 H. Schulzrinne, "REsource LOcation And Discovery (RELOAD) 812 Base Protocol", draft-ietf-p2psip-base-06 (work in 813 progress), November 2009. 815 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 816 Requirement Levels", BCP 14, RFC 2119, March 1997. 818 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, 819 "Session Traversal Utilities for NAT (STUN)", RFC 5389, 820 October 2008. 822 10.2. Informative References 824 [CAN] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S. 825 Schenker, "A scalable content-addressable network", In 826 Proc. of the 2001 Conference on Applications, 827 Technologies, Architectures and Protocols for Computer 828 Communications 2001, pp. 161.172. 830 [Chord] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., 831 Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A 832 Scalable Peer-to-peer Lookup Service for Internet 833 Applications", IEEE/ACM Transactions on Networking Volume 834 11, Issue 1, 17-32, Feb 2003. 836 [I-D.ietf-p2psip-concepts] 837 Bryan, D., Matthews, P., Shim, E., Willis, D., and S. 838 Dawkins, "Concepts and Terminology for Peer to Peer SIP", 839 draft-ietf-p2psip-concepts-02 (work in progress), 840 July 2008. 842 [Pastry] Rowstron, A. and P. Druschel, "Pastry: Scalable, 843 Decentralized Object Location and Routing for Large-Scale 844 Peer-to-Peer Systems", In Proc. of the IFIP/ACM 845 International Conference on Distribued Systems 846 Platforms Nov. 2001, pp. 329-350. 848 [binzenhofer2006] 849 Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable 850 algorithm to monitor chord-based P2P systems at runtime", 851 20th International Parallel and Distributed Processing 852 Symposium April 2006. 854 [ghinita2006] 855 Ghinita, G. and Y. Teo, "An adaptive stabilization 856 framework for distributed hash tables", 20th International 857 Parallel and Distributed Processing Symposium April 2006. 859 [horowitz2003] 860 Horowitz, K. and D. Malkhi, "Estimating network size from 861 local information", Information Processing Letters Dec. 862 2003, Volume 88, Issue 5, pp. 237-243. 864 [kostoulas2005] 865 Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and 866 A. Demers, "Decentralized schemes for size estimation in 867 large and dynamic groups", Fourth IEEE International 868 Symposium on Network Computing and Applications July 2005, 869 pp. 41-48. 871 [krishnamurthy2008] 872 Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. 873 Haridi, "Comparing maintenance strategies for overlays", 874 In Proc. of 16th Euromicro Conference on Parallel, 875 Distributed and Network-Based Processing Feb. 2008, pp. 876 473-482. 878 [li2004] Li, J., Strinbling, J., Gil, T., and M. Kaashoek, 879 "Comparing the performance of distributed hash tables 880 under churn", In Proc. of the 3rd International Workshop 881 on Peer-to-Peer Systems 2004. 883 [liben-nowell2002] 884 Liben-Nowell, D., Balakrishnan, H., and D. Karger, 885 "Observations on the dynamic evolution of peer-to-peer 886 networks", In Proc. of the First International Workshop on 887 Peer-to-Peer Systems March 2002. 889 [maenpaa2009] 890 Maenpaa, J. and G. Camarillo, "A study on maintenance 891 operations in a Chord-based Peer-to-Peer Session 892 Initiation Protocol overlay network", In Proc. of IPDPS 893 2009 May 2009. 895 [mahajan2003] 896 Mahajan, R., Castro, M., and A. Rowstron, "Controlling the 897 cost of reliability in peer-to-peer overlays", In 898 Proceedings of the 2nd International Workshop on Peer-to- 899 Peer Systems Feb. 2003. 901 [rhea2004] 902 Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, 903 "Handling churn in a DHT", In Proc. of the USENIX Annual 904 Techincal Conference June 2004. 906 Authors' Addresses 908 Jouni Maenpaa 909 Ericsson 910 Hirsalantie 11 911 Jorvas 02420 912 Finland 914 Email: Jouni.Maenpaa@ericsson.com 916 Gonzalo Camarillo 917 Ericsson 918 Hirsalantie 11 919 Jorvas 02420 920 Finland 922 Email: Gonzalo.Camarillo@ericsson.com 924 Jani Hautakorpi 925 Ericsson 926 Hirsalantie 11 927 Jorvas 02420 928 Finland 930 Email: Jani.Hautakorpi@ericsson.com