idnits 2.17.1 draft-maenpaa-p2psip-self-tuning-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** The document seems to lack a License Notice according IETF Trust Provisions of 28 Dec 2009, Section 6.b.ii or Provisions of 12 Sep 2009 Section 6.b -- however, there's a paragraph with a matching beginning. Boilerplate error? (You're using the IETF Trust Provisions' Section 6.b License Notice from 12 Feb 2009 rather than one of the newer Notices. See https://trustee.ietf.org/license-info/.) Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to use 'NOT RECOMMENDED' as an RFC 2119 keyword, but does not include the phrase in its RFC 2119 key words list. -- The document date (October 26, 2009) is 5294 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-26) exists of draft-ietf-p2psip-base-05 == Outdated reference: A later version (-09) exists of draft-ietf-p2psip-concepts-02 ** Downref: Normative reference to an Informational draft: draft-ietf-p2psip-concepts (ref. 'I-D.ietf-p2psip-concepts') ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489) Summary: 3 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 P2PSIP Working Group J. Maenpaa 3 Internet-Draft G. Camarillo 4 Intended status: Standards Track J. Hautakorpi 5 Expires: April 29, 2010 Ericsson 6 October 26, 2009 8 A Self-tuning Distributed Hash Table (DHT) for REsource LOcation And 9 Discovery (RELOAD) 10 draft-maenpaa-p2psip-self-tuning-01.txt 12 Status of this Memo 14 This Internet-Draft is submitted to IETF in full conformance with the 15 provisions of BCP 78 and BCP 79. 17 Internet-Drafts are working documents of the Internet Engineering 18 Task Force (IETF), its areas, and its working groups. Note that 19 other groups may also distribute working documents as Internet- 20 Drafts. 22 Internet-Drafts are draft documents valid for a maximum of six months 23 and may be updated, replaced, or obsoleted by other documents at any 24 time. It is inappropriate to use Internet-Drafts as reference 25 material or to cite them other than as "work in progress." 27 The list of current Internet-Drafts can be accessed at 28 http://www.ietf.org/ietf/1id-abstracts.txt. 30 The list of Internet-Draft Shadow Directories can be accessed at 31 http://www.ietf.org/shadow.html. 33 This Internet-Draft will expire on April 29, 2010. 35 Copyright Notice 37 Copyright (c) 2009 IETF Trust and the persons identified as the 38 document authors. All rights reserved. 40 This document is subject to BCP 78 and the IETF Trust's Legal 41 Provisions Relating to IETF Documents in effect on the date of 42 publication of this document (http://trustee.ietf.org/license-info). 43 Please review these documents carefully, as they describe your rights 44 and restrictions with respect to this document. 46 Abstract 48 REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P) 49 signaling protocol that provides an overlay network service. Peers 50 in a RELOAD overlay network collectively run an overlay algorithm to 51 organize the overlay, and to store and retrieve data. This document 52 describes how the default topology plugin of RELOAD can be extended 53 to support self-tuning, that is, to adapt to changing operating 54 conditions such as churn and network size. 56 Table of Contents 58 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 59 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 60 3. Introduction to Stabilization in DHTs . . . . . . . . . . . . 5 61 3.1. Reactive vs. Periodic Stabilization . . . . . . . . . . . 5 62 3.2. Configuring Periodic Stabilization . . . . . . . . . . . . 6 63 3.3. Adaptive Stabilization . . . . . . . . . . . . . . . . . . 7 64 4. Introduction to Chord . . . . . . . . . . . . . . . . . . . . 7 65 5. Extending Chord-reload to Support Self-tuning . . . . . . . . 9 66 5.1. Update Requests . . . . . . . . . . . . . . . . . . . . . 10 67 5.2. Neighbor Stabilization . . . . . . . . . . . . . . . . . . 10 68 5.3. Finger Stabilization . . . . . . . . . . . . . . . . . . . 11 69 5.4. Adjusting Finger Table Size . . . . . . . . . . . . . . . 11 70 5.5. Detecting Partitioning . . . . . . . . . . . . . . . . . . 12 71 5.6. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 12 72 5.6.1. Contents of the Leave Message . . . . . . . . . . . . 12 73 6. Self-tuning Chord Parameters . . . . . . . . . . . . . . . . . 13 74 6.1. Estimating Overlay Size . . . . . . . . . . . . . . . . . 13 75 6.2. Determining Routing Table Size . . . . . . . . . . . . . . 14 76 6.3. Estimating Failure Rate . . . . . . . . . . . . . . . . . 14 77 6.3.1. Detecting Failures . . . . . . . . . . . . . . . . . . 15 78 6.4. Estimating Join Rate . . . . . . . . . . . . . . . . . . . 16 79 6.5. Calculating the Stabilization Interval . . . . . . . . . . 16 80 7. Overlay Configuration Document Extension . . . . . . . . . . . 17 81 8. Security Considerations . . . . . . . . . . . . . . . . . . . 17 82 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 83 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 84 10.1. Normative References . . . . . . . . . . . . . . . . . . . 18 85 10.2. Informative References . . . . . . . . . . . . . . . . . . 18 86 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20 88 1. Introduction 90 REsource LOcation And Discovery (RELOAD) [I-D.ietf-p2psip-base] is a 91 peer-to-peer signaling protocol that can be used to maintain an 92 overlay network, and to store data in and retrieve data from the 93 overlay. For interoperability reasons, RELOAD specifies one overlay 94 algorithm, called chord-reload, that is mandatory to implement. This 95 document extends the chord-reload algorithm by introducing self- 96 tuning behavior. 98 DHT-based overlay networks are self-organizing, scalable and 99 reliable. However, these features come at a cost: peers in the 100 overlay network need to consume network bandwidth to maintain routing 101 state. Most DHTs use a periodic stabilization routine to counter the 102 undesirable effects of churn on routing. To configure the parameters 103 of a DHT, some characteristics such as churn rate and network size 104 need to be known in advance. These characteristics are then used to 105 configure the DHT in a static fashion by using fixed values for 106 parameters such as the size of the successor set, size of the routing 107 table, and rate of maintenance messages. The problem with this 108 approach is that it is not possible to achieve a low failure rate and 109 a low communication overhead by using fixed parameters. Instead, a 110 better approach is to allow the system to take into account the 111 evolution of network conditions and adapt to them. This document 112 extends the mandatory-to-implement chord-reload algorithm by making 113 it self-tuning. Two main advantages of self-tuning are that users no 114 longer need to tune every DHT parameter correctly for a given 115 operating environment and that the system adapts to changing 116 operating conditions. 118 The remainder of this document is structured as follows: Section 2 119 provides definitions of terms used in this document. Section 3 120 discusses alternative approaches to stabilization operations in DHTs, 121 including reactive stabilization, periodic stabilization, and 122 adaptive stabilization. Section 4 gives an introduction to the Chord 123 DHT algorithm. Section 5 describes how this document extends the 124 stabilization routine of the chord-reload algorithm. Section 6 125 describes how the stabilization rate and routing table size are 126 calculated in an adaptive fashion. 128 2. Terminology 130 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 131 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 132 document are to be interpreted as described in RFC 2119 [RFC2119]. 134 This document uses the terminology and definitions from the Concepts 135 and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts] 136 draft. 138 Chord Ring: The Chord DHT orders identifiers on an identifier circle 139 of size 2^numBitsInNodeId (numBitsInNodeId is the number of bits 140 in node identifiers). This identifier circle is called the Chord 141 ring. 143 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 144 distributed systems that provide a lookup service similar to a 145 hash table. Given a key, any participating peer can retrieve the 146 value associated with that key. The responsibility for 147 maintaining the mapping from keys to values is distributed among 148 the peers. 150 Finger Table: A data structure with up to numBitsInNodeId entries 151 maintained by each peer in a Chord-based overlay. The ith entry 152 in the finger table of peer n contains the identity of the first 153 peer that succeeds n by at least 2^(numBitsInNodeId-i) on the 154 Chord ring. This peer is called the ith finger of peer n. As an 155 example, the first entry in the finger table of peer n contains a 156 peer half-way around the Chord ring from peer n. The purpose of 157 the finger table is to accelerate lookups. 159 log2(N): Logarithm of N with base 2. 161 n.id: Peer-ID of peer n. 163 Neighborhood Set: Consists of successor and predecessor lists. 165 numBitsInNodeId: Number of bits in a Node-ID. 167 O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means 168 that f(n) is less than some constant multiple of g(n). 170 Omega(g(n)): Informally, saying that some equation f(n) = 171 Omega(g(n)) means that f(n) is more than some constant multiple of 172 g(n). 174 Predecessor List: A data structure containing the predecessors of a 175 peer on the Chord ring. 177 Responsible Peer: The peer storing the original copy of a resource. 178 In Chord, this is the first peer whose identifier is equal to or 179 follows the identifier of the resource on the Chord ring. Also 180 the term "Root Node" can be used for this concept. 182 Routing Table: The set of peers that a node can use to route overlay 183 messages. The routing table consists of the finger table, 184 successor list and predecessor list. 186 Successor List: A data structure containing the first r successors 187 of a peer on the Chord ring. 189 3. Introduction to Stabilization in DHTs 191 DHT-based peer-to-peer overlay networks are self-organizing, scalable 192 and reliable. However, these features come at a cost: peers in the 193 overlay network need to consume network bandwidth to maintain routing 194 state. DHTs use stabilization routines to counter the undesirable 195 effects of churn on routing. The purpose of stabilization is to keep 196 the routing information of each peer in the overlay consistent with 197 the constantly changing overlay topology. There are two alternative 198 approaches to stabilization: periodic and reactive [rhea2004]. 199 Periodic stabilization can either use a fixed stabilization rate or 200 calculate the stabilization rate in an adaptive fashion. 202 3.1. Reactive vs. Periodic Stabilization 204 In reactive stabilization, a peer reacts to the loss of a peer in its 205 neighborhood set or to the appearance of a new peer that should be 206 added to its neighborhood set by sending a copy of its neighbor table 207 to all peers in the neighborhood set. Periodic recovery, in 208 contrast, takes place independently of changes in the neighborhood 209 set. In periodic recovery, a peer periodically shares its 210 neighborhood set with each or a subset of the members of that set. 212 The chord-reload algorithm [I-D.ietf-p2psip-base] supports both 213 reactive and periodic stabilization. It has been shown in [rhea2004] 214 that reactive stabilization works well for small neighborhood sets 215 (i.e., small overlays) and moderate churn. However, in large-scale 216 (e.g., 1000 peers or more [rhea2004]) or high-churn overlays, 217 reactive stabilization runs the risk of creating a positive feedback 218 cycle, which can eventually result in congestion collapse. In 219 [rhea2004], it is shown that a 1000-peer overlay under churn uses 220 significantly less bandwidth and has lower latencies when periodic 221 stabilization is used than when reactive stabilization is used. 222 Although in the experiments carried out in [rhea2004], reactive 223 stabilization performed well when there was no churn, its bandwidth 224 use was observed to jump dramatically under churn. At higher churn 225 rates and larger scale overlays, periodic stabilization uses less 226 bandwidth and the resulting lower contention for the network leads to 227 lower latencies. For this reason, most DHTs such as CAN [CAN], Chord 228 [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic 229 stabilization [ghinita2006]. As an example, the first version of 230 Bamboo used reactive stabilization, which caused Bamboo to suffer 231 from degradation in performance under churn. To fix this problem, 232 Bamboo was modified to use periodic stabilization. 234 In Chord, periodic stabilization is typically done both for 235 successors and fingers. An alternative strategy is analyzed in 236 [krishnamurthy2008]. In this strategy, called the correction-on- 237 change maintenance strategy, a peer periodically stabilizes its 238 successors but does not do so for its fingers. Instead, finger 239 pointers are stabilized in a reactive fashion. The results obtained 240 in [krishnamurthy2008] imply that although the correction-on-change 241 strategy works well when churn is low, periodic stabilization 242 outperforms the correction-on-change strategy when churn is high. 244 3.2. Configuring Periodic Stabilization 246 When periodic stabilization is used, one faces the problem of 247 selecting an appropriate execution rate for the stabilization 248 procedure. If the execution rate of periodic stabilization is high, 249 changes in the system can be quickly detected, but at the 250 disadvantage of increased communication overhead. Alternatively, if 251 the stabilization rate is low and the churn rate is high, routing 252 tables become inaccurate and DHT performance deteriorates. Thus, the 253 problem is setting the parameters so that the overlay achieves the 254 desired reliability and performance even in challenging conditions, 255 such as under heavy churn. This naturally results in high cost 256 during periods when the churn level is lower than expected, or 257 alternatively, poor performance or even network partitioning in worse 258 than expected conditions. 260 In addition to selecting an appropriate stabilization interval, 261 regardless of whether periodic stabilization is used or not, an 262 appropriate size needs to be selected for the neighborhood set and 263 for the routing table. 265 The current approach is to configure overlays statically. This works 266 in situations where perfect information about the future is 267 available. In situations where the operating conditions of the 268 network are known in advance and remain static throughout the 269 lifetime of the system, it is possible to choose fixed optimal values 270 for parameters such as stabilization rate, neighborhood set size and 271 routing table size. However, if the operating conditions (e.g., the 272 size of the overlay and its churn rate) do not remain static but 273 evolve with time, it is not possible to achieve both a low lookup 274 failure rate and a low communication overhead by using fixed 275 parameters [ghinita2006]. 277 As an example, to configure the Chord DHT algorithm, one needs to 278 select values for the following parameters: size of successor list, 279 stabilization interval, and size of the finger table. To select an 280 appropriate value for the stabilization interval, one needs to know 281 the expected churn rate and overlay size. According to 282 [liben-nowell2002], a Chord network in a ring-like state remains in a 283 ring-like state as long as peers send Omega(log2^2(N)) messages 284 before N new peers join or N/2 peers fail. Thus, in a 500-peer 285 overlay churning at a rate such that one peer joins and one peer 286 leaves the network every 30 seconds, an appropriate stabilization 287 interval would be on the order of 93s. According to [Chord], the 288 size of the successor list and finger table should be on the order of 289 log2(N). Having a successor list of size O(log2(N)) makes it 290 unlikely that a peer will lose all of its successors, which would 291 cause the Chord ring to become disconnected. Thus, in a 500-peer 292 network each peer should maintain on the order of nine successors and 293 fingers. However, if the churn rate doubles and the network size 294 remains unchanged, the stabilization rate should double as well. 295 That is, the appropriate maintenance interval would now be on the 296 order of 46s. On the other hand, if the churn rate becomes e.g. six- 297 fold and the size of the network grows to 2000 peers, on the order of 298 eleven fingers and successors should be maintained and the 299 stabilization interval should be on the order of 42s. If one 300 continued using the old values, this could result in inaccurate 301 routing tables, network partitioning, and deteriorating performance. 303 3.3. Adaptive Stabilization 305 A self-tuning DHT takes into consideration the continuous evolution 306 of network conditions and adapts to them. In a self-tuning DHT, each 307 peer collects statistical data about the network and dynamically 308 adjusts its stabilization rate, neighborhood set size, and finger 309 table size based on the analysis of the data [ghinita2006]. 310 Reference [mahajan2003] shows that by using self-tuning, it is 311 possible to achieve high reliability and performance even in adverse 312 conditions with low maintenance cost. Adaptive stabilization has 313 been shown to outperform periodic stabilization in terms of both 314 lookup failures and communication overhead [ghinita2006]. 316 4. Introduction to Chord 318 Chord [Chord] is a structured P2P algorithm that uses consistent 319 hashing to build a DHT out of several independent peers. Consistent 320 hashing assigns each peer and resource a fixed-length identifier. 321 Peers use SHA-1 as the base hash fuction to generate the identifiers. 323 As specified in RELOAD base, the length of the identifiers is 324 numBitsInNodeId=128 bits. The identifiers are ordered on an 325 identifier circle of size 2^numBitsInNodeId. On the identifier 326 circle, key k is assigned to the first peer whose identifier equals 327 or follows the identifier of k in the identifier space. The 328 identifier circle is called the Chord ring. 330 Different DHTs differ significantly in performance when bandwidth is 331 limited. It has been shown that when compared to other DHTs, the 332 advantages of Chord include that it uses bandwidth efficiently and 333 can achieve low lookup latencies at little cost [li2004]. 335 A simple lookup mechanism could be implemented on a Chord ring by 336 requiring each peer to only know how to contact its current successor 337 on the identifier circle. Queries for a given identifier could then 338 be passed around the circle via the successor pointers until they 339 encounter the first peer whose identifier is equal to or larger than 340 the desired identifier. Such a lookup scheme uses a number of 341 messages that grows linearly with the number of peers. To reduce the 342 cost of lookups, Chord maintains also additional routing information; 343 each peer n maintains a data structure with up to numBitsInNodeId 344 entries, called the finger table. The first entry in the finger 345 table of peer n contains the peer half-way around the ring from peer 346 n. The second entry contains the peer that is 1/4th of the way 347 around, the third entry the peer that is 1/8th of the way around, 348 etc. In other words, the ith entry in the finger table at peer n 349 contains the identity of the first peer s that succeeds n by at least 350 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the ith 351 finger of peer n. The interval between two consecutive fingers is 352 called a finger interval. The ith finger interval of peer n covers 353 the range [n.id + 2^(numBitsInNodeId-i), n.id + 354 2^(numBitsInNodeId-i+1)) on the Chord ring. In an N-peer network, 355 each peer maintains information about O(log2(N)) other peers in its 356 finger table. As an example, if N=100000, it is sufficient to 357 maintain 17 fingers. 359 Chord needs all peers' successor pointers to be up to date in order 360 to ensure that lookups produce correct results as the set of 361 participating peers changes. To achieve this, peers run a 362 stabilization protocol periodically in the background. The 363 stabilization protocol of the original Chord algorithm uses two 364 operations: successor stabilization and finger stabilization. 365 However, the Chord algorithm of RELOAD base defines two additional 366 stabilization components, as will be discussed below. 368 To increase robustness in the event of peer failures, each Chord peer 369 maintains a successor list of size r, containing the peer's first r 370 successors. The benefit of successor lists is that if each peer 371 fails independently with probability p, the probability that all r 372 successors fail simultaneously is only p^r. 374 The original Chord algorithm maintains only a single predecessor 375 pointer. However, multiple predecessor pointers (i.e., a predecessor 376 list) can be maintained to speed up recovery from predecessor 377 failures. The routing table of a peer consists of the successor 378 list, finger table, and predecessor list. 380 5. Extending Chord-reload to Support Self-tuning 382 This section describes how the mandatory-to-implement chord-reload 383 algorithm defined in RELOAD base [I-D.ietf-p2psip-base] can be 384 extended to support self-tuning. 386 The chord-reload algorithm supports both reactive and periodic 387 recovery strategies. When the self-tuning mechanisms defined in this 388 document are used, the periodic recovery strategy MUST be used. 389 Further, chord-reload specifies that at least three predecessors and 390 three successors need to be maintained. When the self-tuning 391 mechanisms are used, the appropriate sizes of the successor list and 392 predecessor list are determined in an adaptive fashion based on the 393 estimated network size, as will be described in Section 6. 395 As specified in RELOAD base, each peer MUST maintain a stabilization 396 timer. When the stabilization timer fires, the peer MUST restart the 397 timer and carry out the overlay stabilization routine. Overlay 398 stabilization has four components in chord-reload: 400 1. Update the neighbor table. We refer to this as neighbor 401 stabilization. 402 2. Refreshing the finger table. We refer to this as finger 403 stabilization. 404 3. Adjusting finger table size. 405 4. Detecting partitioning. We refer to this as strong 406 stabilization. 408 As specified in RELOAD base [I-D.ietf-p2psip-base], a peer sends 409 periodic messages as part of the neighbor stabilization, finger 410 stabilization, and strong stabilization routines. In neighbor 411 stabilization, a peer periodically sends an Update request to every 412 peer in its Connection Table. The default time is every ten minutes. 413 In finger stabilization, a peer periodically searches for new peers 414 to include in its finger table. This time defaults to one hour. 415 This document specifies how the neighbor stabilization and finger 416 stabilization intervals can be determined in an adaptive fashion 417 based on the operating conditions of the overlay. The subsections 418 below describe how this document extends the four components of 419 stabilization. 421 5.1. Update Requests 423 The neighbor and finger stabilization procedures are implemented 424 using the Update request defined in RELOAD base 425 [I-D.ietf-p2psip-base]. RELOAD base defines three types of Update 426 requests: 'peer_ready', 'neighbors', and 'full'. Regardless of the 427 type, all Update requests include an 'uptime' field. Since the self- 428 tuning extensions require information on the uptimes of peers in the 429 routing table, the sender of an Update request MUST include its 430 current uptime in seconds in the 'uptime' field. 432 When self-tuning is used, each peer decides independently the 433 appropriate size for the successor list, predecessor list and finger 434 table. Thus, the 'predecessors', 'successors', and 'fingers' fields 435 included in RELOAD Update requests are of variable length. As 436 specified in RELOAD [I-D.ietf-p2psip-base], variable length fields 437 are on the wire preceded by length bytes. In the case of the 438 successor list, predecessor list, and finger table, there are two 439 length bytes (allowing lengths up to 2^16-1). The number of NodeId 440 structures included in each field can be calculated based on the 441 length bytes since the size of a single NodeId structure is 16 bytes. 442 If a peer receives more entries than fit into its successor list, 443 predecessor list or finger table, the peer MUST ignore the extra 444 entries. If a peer receives less entries than it currently has in 445 its own data structure, the peer MUST NOT drop the extra entries from 446 its data structure. 448 5.2. Neighbor Stabilization 450 In the neighbor stabilization operation of chord-reload, a peer 451 periodically sends an Update request to every peer in its Connection 452 Table. In a small, low-churn overlay, the amount of traffic this 453 process generates is typically acceptable. However, in a large-scale 454 overlay churning at a moderate or high churn rate, the traffic load 455 may no longer be acceptable since the size of the connection table is 456 large and the stabilization interval relatively short. The self- 457 tuning mechanisms described in this document are especially designed 458 for overlays of the latter type. Therefore, when the self-tuning 459 mechanisms are used, each peer MUST only send a periodic Update 460 request to its first predecessor and first successor on the Chord 461 ring. 463 The neighbor stabilization routine MUST be executed when the 464 stabilization timer fires. To begin the neighbor stabilization 465 routine, a peer MUST send an Update request to its first successor 466 and its first predecessor. The type of the Update request MUST be 467 'neighbors'. The Update request MUST include the successor and 468 predecessor lists of the sender. If a peer receiving such an Update 469 request learns from the predecessor and successor lists included in 470 the request that new peers can be included in its neighborhood set, 471 it MUST send Attach requests to the new peers. 473 After a new peer has been added to the predecessor or successor list, 474 an Update request of type 'peer_ready' MUST be sent to the new peer. 475 This allows the new peer to insert the sender into its neighborhood 476 set. 478 5.3. Finger Stabilization 480 In the finger stabilization routine of chord-reload, a peer 481 periodically searches for new peers to replace invalid (that is, 482 repeated or failed) entries in the finger table. Chord-reload 483 provides two possible methods for searching for new peers to include 484 in the finger table. In alternative 1, a peer selects one entry from 485 among the invalid entries in its finger table each time the 486 stabilization timer fires and sends a Ping request to the selected 487 entry. In alternative 2, a peer sends a RouteQuery request to all 488 invalid entries in the finger table. After having received 489 RouteQuery responses, the peer sends a Ping to those entries for 490 which the RouteQuery response came from a peer not already present in 491 the routing table. When the self-tuning mechanisms defined in this 492 draft are being used in the overlay, alternative 1 MUST be used since 493 the traffic load it generates is lower and thus more appropriate for 494 large-scale overlays experiencing churn. 496 Immediately after a new peer has been added to the finger table, a 497 Probe request MUST be sent to the new peer to fetch its uptime. The 498 requested_info field of the Probe request MUST be set to contain the 499 ProbeInformationType 'uptime' defined in RELOAD base 500 [I-D.ietf-p2psip-base]. 502 5.4. Adjusting Finger Table Size 504 The chord-reload algorithm defines how a peer can make sure that the 505 finger table is appropriately sized to allow for efficient routing. 506 Since the self-tuning mechanisms specified in this document produce a 507 network size estimate, this estimate can be directly used to 508 calculate the optimal size for the finger table. This mechanism MUST 509 be used instead of the one specified by chord-reload. A peer MUST 510 use the network size estimate to determine whether it needs to adjust 511 the size of its finger table each time when the stabilization timer 512 fires. The way this is done is explained in Section 6.2. 514 5.5. Detecting Partitioning 516 This document does not require any changes to the mechanism chord- 517 reload uses to detect network partitioning. 519 5.6. Leaving the Overlay 521 This document extends the Leave request defined in RELOAD base 522 [I-D.ietf-p2psip-base]. As specified in RELOAD base, a leaving peer 523 SHOULD send a Leave request to all members of its neighbor table 524 prior to leaving the overlay. When the self-tuning extensions are 525 used, the Leave requests sent to the first predecessor and first 526 successor contain additional data. The Leave request that is sent to 527 the first successor MUST contain the predecessor list of the leaving 528 peer. The Leave request that is sent to the first predecessor MUST 529 contain the successor list of the leaving peer. If the first 530 successor can identify better predecessors than are already included 531 in its predecessor list by investigating the predecessor list it 532 receives from the leaving peer, it MUST send Attach requests to them. 533 Similarly, if the first predecessor identifies better successors by 534 investigating the successor list it receives from the leaving peer, 535 it MUST send Attach requests to them. 537 5.6.1. Contents of the Leave Message 539 The Leave request defined in RELOAD base [I-D.ietf-p2psip-base] has a 540 field for overlay specific data. When the self-tuning extensions are 541 being used, this overlay_specific_data field MUST contain the 542 ChordLeaveData structure defined below: 544 enum { reserved (0), 545 from_succ(1), from_pred(2), (255) } 546 ChordLeaveType; 548 struct { 549 ChordLeaveType type; 551 select(type) { 552 case from_succ: 553 NodeId successors<0..2^16-1>; 554 case from_pred: 555 NodeId predecessors<0..2^16-1>; 556 }; 557 } ChordLeaveData; 559 The 'type' field indicates whether the Leave request was sent by a 560 predecessor or a successor of the recipient: 562 from_succ 563 The Leave request was sent by a successor. 565 from_pred 566 The Leave request was sent by a predecessor. 568 If the type of the request is 'from_succ', the contents will be: 570 successors 571 The sender's successor list. 573 If the type of the request is 'from_pred', the contents will be: 575 predecessors 576 The sender's predecessor list. 578 6. Self-tuning Chord Parameters 580 This section specifies how to determine an appropriate stabilization 581 rate and routing table size in an adaptive fashion. The proposed 582 mechanism is based on [mahajan2003], [liben-nowell2002], and 583 [ghinita2006]. To calculate an appropriate stabilization rate, the 584 values of three parameters MUST be estimated: overlay size N, failure 585 rate U, and join rate L. To calculate an appropriate routing table 586 size, the estimated network size N can be used. Peers in the overlay 587 MUST re-calculate the values of the parameters to self-tune the 588 chord-reload algorithm at the end of each stabilization period before 589 re-starting the stabilization timer. 591 6.1. Estimating Overlay Size 593 Techniques for estimating the size of an overlay network have been 594 proposed for instance in [mahajan2003], [horowitz2003], 595 [kostoulas2005], [binzenhofer2006], and [ghinita2006]. In Chord, the 596 density of peer identifiers in the neighborhood set can be used to 597 produce an estimate of the size of the overlay, N [mahajan2003]. 598 Since peer identifiers are picked randomly with uniform probability 599 from the numBitsInNodeId-bit identifier space, the average distance 600 between peer identifiers in the successor set is 601 (2^numBitsInNodeId)/N. 603 To estimate the overlay network size, a peer MUST compute the average 604 inter-peer distance d between the successive peers starting from the 605 most distant predecessor and ending to the most distant successor in 606 the successor list. The estimated network size MUST be calculated 607 as: 609 2^numBitsInNodeId 610 N = ------------------- 611 d 613 This estimate has been found to be accurate within 15% of the real 614 network size [ghinita2006]. Of course, the size of the neighborhood 615 set affects the accuracy of the estimate. 617 During the join process, a joining peer fills its routing table by 618 sending a series of Ping and Attach requests, as specified in RELOAD 619 base [I-D.ietf-p2psip-base]. Thus, a joining peer immediately has 620 enough information at its disposal to calculate an estimate of the 621 network size. 623 6.2. Determining Routing Table Size 625 As specified in RELOAD base, the finger table must contain at least 626 16 entries. When the self-tuning mechanisms are used, the size of 627 the finger table MUST be set to max(log2(N), 16) using the estimated 628 network size N. 630 The size of the successor list MUST be set to log2(N). An 631 implementation MAY place a lower limit on the size of the successor 632 list. As an example, the implementation might require the size of 633 the successor list to be always at least three. 635 A peer MAY choose to maintain a fixed-size predecessor list with only 636 three entries as specified in RELOAD base. However, it is 637 RECOMMENDED that a peer maintains log2(N) predecessors. 639 6.3. Estimating Failure Rate 641 A typical approach is to assume that peers join the overlay according 642 to a Poisson process with rate L and leave according to a Poisson 643 process with rate parameter U [mahajan2003]. The value of U can be 644 estimated using peer failures in the finger table and neighborhood 645 set [mahajan2003]. If peers fail with rate U, a peer with M unique 646 peer identifiers in its routing table should observe K failures in 647 time K/(M*U). Every peer in the overlay MUST maintain a history of 648 the last K failures. The current time MUST be inserted into the 649 history when the peer joins the overlay. The estimate of U MUST be 650 calculated as: 652 k 653 U = --------, 654 M * Tk 656 where M is the number of unique peer identifiers in the routing 657 table, Tk is the time between the first and the last failure in the 658 history, and k is the number of failures in the history. If k is 659 smaller than K, the estimate MUST be computed as if there was a 660 failure at the current time. It has been shown that an estimate 661 calculated in a similar manner is accurate within 17% of the real 662 value of U [ghinita2006]. 664 The size of the failure history K affects the accuracy of the 665 estimate of U. One can increase the accuracy by increasing K. 666 However, this has the side effect of decreasing responsiveness to 667 changes in the failure rate. On the other hand, a small history size 668 may cause a peer to overreact each time a new failure occurs. In 669 [ghinita2006], K is set 25% of the routing table size. Use of this 670 approach is RECOMMENDED. 672 6.3.1. Detecting Failures 674 A new failure MUST be inserted to the failure history in the 675 following cases: 677 1. A Leave request is received from a neigbhor. 678 2. No data has been received from a peer in the Connection Table for 679 a specified amount of time. RELOAD mandates the use of STUN 680 [RFC5389] for keepalives. STUN keepalives take the form of STUN 681 Binding Indication transactions. As specified in ICE 682 [I-D.ietf-mmusic-ice], a peer sends a STUN Binding Indication if 683 there has been no packet sent on a connection for Tr seconds. Tr 684 is configurable and has a default of 15 seconds. Although STUN 685 Binding Indications do not generate a response, the fact that a 686 peer has failed can be learned from the lack of packets (Binding 687 Indications or application protocol packets) received from the 688 peer. A peer MUST be considered to have failed if no packets 689 have been received from it for the past max-silent-periods*Tr 690 seconds. max-silent-periods is a new element in the overlay 691 configuration document. The default value is three. 693 As an alternative to relying on STUN keepalives to detect peer 694 failure, a peer could send additional, frequent RELOAD messages to 695 every peer in its Connection Table. These messages could be Update 696 requests, in which case they would serve two purposes: detecting peer 697 failure and stabilization. However, as the cost of this approach can 698 be very high in terms of bandwidth consumption and traffic load, 699 especially in large-scale overlays experiencing churn, its use is NOT 700 RECOMMENDED. 702 6.4. Estimating Join Rate 704 Reference [ghinita2006] proposes that a peer can estimate the join 705 rate based on the uptime of the peers in its routing table. An 706 increase in peer join rate will be reflected by a decrease in the 707 average age of peers in the routing table. Thus, each peer MUST 708 maintain an array of the ages of the peers in its routing table 709 sorted in increasing order. Using this information, an estimate of 710 the global peer join rate L MUST be calculated as: 712 N 1 713 L = --- * ---------------, 714 4 Ages[rsize/4] 716 where Ages is an array containing the ages of the peers in the 717 routing table sorted in increasing order and rsize is the size of the 718 routing table. It is RECOMMENDED that only the ages of the 25% of 719 the youngest peers in the routing table (i.e., the 25th percentile) 720 are used to reduce the bias that a small number of peers with very 721 old ages can cause [ghinita2006]. It has been shown that the 722 estimate obtained by using this method is accurate within 22% of the 723 real join rate [ghinita2006]. Of course, the size of the routing 724 table affects the accuracy. 726 In order for this mechanism to work, peers need to exchange 727 information about the time they have been present in the overlay. 728 Peers receive the uptimes of their successors and predecessors during 729 the stabilization operations since all Update requests carry uptime 730 values. A joining peer learns the uptime of the admitting peer since 731 it receives an Update from the admitting peer during the join 732 procedure. Peers learn the uptimes of new fingers since they can 733 fetch the uptime using a Probe request after having attached to the 734 new finger. 736 6.5. Calculating the Stabilization Interval 738 According to [liben-nowell2002], a Chord network in a ring-like state 739 remains in a ring-like state as long as peers send Omega(log2^2(N)) 740 messages before N new peers join or N/2 peers fail. We can use the 741 estimate of peer failure rate, U, to calculate the time Tf in which 742 N/2 peers fail: 744 1 745 Tf = ------ 746 2*U 748 Based on this estimate, a stabilization interval Tstab-1 MUST be 749 calculated as: 751 Tf 752 Tstab-1 = ----------- 753 log2^2(N) 755 On the other hand, the estimated join rate L can be used to calculate 756 the time in which N new peers join the overlay. Based on the 757 estimate of L, a stabilization interval Tstab-2 MUST be calculated 758 as: 760 N 761 Tstab-2 = --------------- 762 L * log2^2(N) 764 Finally, the actual stabilization interval Tstab that MUST be used 765 can be obtained by taking the minimum of Tstab-1 and Tstab-2. 767 The results obtained in [maenpaa2009] indicate that making the 768 stabilization interval too small has the effect of making the overlay 769 less stable (e.g., in terms of detected loops and path failures). 770 Thus, a lower limit should be used for the stabilization period. 771 Based on the results in [maenpaa2009], a lower limit of 15s is 772 RECOMMENDED, since using a stabilization period smaller than this 773 will with a high probability cause too much traffic in the overlay. 775 7. Overlay Configuration Document Extension 777 This document extends the RELOAD overlay configuration document by 778 adding one new element, "max-silent-periods", inside each 779 "configuration" element. 781 max-silent-periods: A peer MUST be considered to have failed if no 782 packets have been received from it for the past max-silent- 783 periods*Tr seconds (Tr is defined in [I-D.ietf-mmusic-ice]). The 784 default value of the max-silent-periods element is 3. 786 8. Security Considerations 788 There are no new security considerations introduced in this document. 789 The security considerations of RELOAD [I-D.ietf-p2psip-base] apply. 791 9. IANA Considerations 793 This document has no actions for IANA. 795 10. References 797 10.1. Normative References 799 [I-D.ietf-mmusic-ice] 800 Rosenberg, J., "Interactive Connectivity Establishment 801 (ICE): A Protocol for Network Address Translator (NAT) 802 Traversal for Offer/Answer Protocols", 803 draft-ietf-mmusic-ice-19 (work in progress), October 2007. 805 [I-D.ietf-p2psip-base] 806 Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and 807 H. Schulzrinne, "REsource LOcation And Discovery (RELOAD) 808 Base Protocol", draft-ietf-p2psip-base-05 (work in 809 progress), October 2009. 811 [I-D.ietf-p2psip-concepts] 812 Bryan, D., Matthews, P., Shim, E., Willis, D., and S. 813 Dawkins, "Concepts and Terminology for Peer to Peer SIP", 814 draft-ietf-p2psip-concepts-02 (work in progress), 815 July 2008. 817 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 818 Requirement Levels", BCP 14, RFC 2119, March 1997. 820 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, 821 "Session Traversal Utilities for NAT (STUN)", RFC 5389, 822 October 2008. 824 10.2. Informative References 826 [CAN] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S. 827 Schenker, "A scalable content-addressable network", In 828 Proc. of the 2001 Conference on Applications, 829 Technologies, Architectures and Protocols for Computer 830 Communications 2001, pp. 161.172. 832 [Chord] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., 833 Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A 834 Scalable Peer-to-peer Lookup Service for Internet 835 Applications", IEEE/ACM Transactions on Networking Volume 836 11, Issue 1, 17-32, Feb 2003. 838 [Pastry] Rowstron, A. and P. Druschel, "Pastry: Scalable, 839 Decentralized Object Location and Routing for Large-Scale 840 Peer-to-Peer Systems", In Proc. of the IFIP/ACM 841 International Conference on Distribued Systems 842 Platforms Nov. 2001, pp. 329-350. 844 [binzenhofer2006] 845 Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable 846 algorithm to monitor chord-based P2P systems at runtime", 847 20th International Parallel and Distributed Processing 848 Symposium April 2006. 850 [ghinita2006] 851 Ghinita, G. and Y. Teo, "An adaptive stabilization 852 framework for distributed hash tables", 20th International 853 Parallel and Distributed Processing Symposium April 2006. 855 [horowitz2003] 856 Horowitz, K. and D. Malkhi, "Estimating network size from 857 local information", Information Processing Letters Dec. 858 2003, Volume 88, Issue 5, pp. 237-243. 860 [kostoulas2005] 861 Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and 862 A. Demers, "Decentralized schemes for size estimation in 863 large and dynamic groups", Fourth IEEE International 864 Symposium on Network Computing and Applications July 2005, 865 pp. 41-48. 867 [krishnamurthy2008] 868 Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. 869 Haridi, "Comparing maintenance strategies for overlays", 870 In Proc. of 16th Euromicro Conference on Parallel, 871 Distributed and Network-Based Processing Feb. 2008, pp. 872 473-482. 874 [li2004] Li, J., Strinbling, J., Gil, T., and M. Kaashoek, 875 "Comparing the performance of distributed hash tables 876 under churn", In Proc. of the 3rd International Workshop 877 on Peer-to-Peer Systems 2004. 879 [liben-nowell2002] 880 Liben-Nowell, D., Balakrishnan, H., and D. Karger, 881 "Observations on the dynamic evolution of peer-to-peer 882 networks", In Proc. of the First International Workshop on 883 Peer-to-Peer Systems March 2002. 885 [maenpaa2009] 886 Maenpaa, J. and G. Camarillo, "A study on maintenance 887 operations in a Chord-based Peer-to-Peer Session 888 Initiation Protocol overlay network", In Proc. of IPDPS 889 2009 May 2009. 891 [mahajan2003] 892 Mahajan, R., Castro, M., and A. Rowstron, "Controlling the 893 cost of reliability in peer-to-peer overlays", In 894 Proceedings of the 2nd International Workshop on Peer-to- 895 Peer Systems Feb. 2003. 897 [rhea2004] 898 Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, 899 "Handling churn in a DHT", In Proc. of the USENIX Annual 900 Techincal Conference June 2004. 902 Authors' Addresses 904 Jouni Maenpaa 905 Ericsson 906 Hirsalantie 11 907 Jorvas 02420 908 Finland 910 Email: Jouni.Maenpaa@ericsson.com 912 Gonzalo Camarillo 913 Ericsson 914 Hirsalantie 11 915 Jorvas 02420 916 Finland 918 Email: Gonzalo.Camarillo@ericsson.com 920 Jani Hautakorpi 921 Ericsson 922 Hirsalantie 11 923 Jorvas 02420 924 Finland 926 Email: Jani.Hautakorpi@ericsson.com