idnits 2.17.1 draft-ietf-p2psip-self-tuning-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** You're using the IETF Trust Provisions' Section 6.b License Notice from 12 Sep 2009 rather than the newer Notice from 28 Dec 2009. (See https://trustee.ietf.org/license-info/) Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to use 'NOT RECOMMENDED' as an RFC 2119 keyword, but does not include the phrase in its RFC 2119 key words list. -- The document date (March 1, 2010) is 5170 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-26) exists of draft-ietf-p2psip-base-07 ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489) == Outdated reference: A later version (-09) exists of draft-ietf-p2psip-concepts-02 Summary: 2 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 P2PSIP Working Group J. Maenpaa 3 Internet-Draft G. Camarillo 4 Intended status: Standards Track J. Hautakorpi 5 Expires: September 2, 2010 Ericsson 6 March 1, 2010 8 A Self-tuning Distributed Hash Table (DHT) for REsource LOcation And 9 Discovery (RELOAD) 10 draft-ietf-p2psip-self-tuning-01.txt 12 Abstract 14 REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P) 15 signaling protocol that provides an overlay network service. Peers 16 in a RELOAD overlay network collectively run an overlay algorithm to 17 organize the overlay, and to store and retrieve data. This document 18 describes how the default topology plugin of RELOAD can be extended 19 to support self-tuning, that is, to adapt to changing operating 20 conditions such as churn and network size. 22 Status of this Memo 24 This Internet-Draft is submitted to IETF in full conformance with the 25 provisions of BCP 78 and BCP 79. 27 Internet-Drafts are working documents of the Internet Engineering 28 Task Force (IETF), its areas, and its working groups. Note that 29 other groups may also distribute working documents as Internet- 30 Drafts. 32 Internet-Drafts are draft documents valid for a maximum of six months 33 and may be updated, replaced, or obsoleted by other documents at any 34 time. It is inappropriate to use Internet-Drafts as reference 35 material or to cite them other than as "work in progress." 37 The list of current Internet-Drafts can be accessed at 38 http://www.ietf.org/ietf/1id-abstracts.txt. 40 The list of Internet-Draft Shadow Directories can be accessed at 41 http://www.ietf.org/shadow.html. 43 This Internet-Draft will expire on September 2, 2010. 45 Copyright Notice 47 Copyright (c) 2010 IETF Trust and the persons identified as the 48 document authors. All rights reserved. 50 This document is subject to BCP 78 and the IETF Trust's Legal 51 Provisions Relating to IETF Documents 52 (http://trustee.ietf.org/license-info) in effect on the date of 53 publication of this document. Please review these documents 54 carefully, as they describe your rights and restrictions with respect 55 to this document. Code Components extracted from this document must 56 include Simplified BSD License text as described in Section 4.e of 57 the Trust Legal Provisions and are provided without warranty as 58 described in the BSD License. 60 Table of Contents 62 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 63 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 64 3. Introduction to Stabilization in DHTs . . . . . . . . . . . . 5 65 3.1. Reactive vs. Periodic Stabilization . . . . . . . . . . . 5 66 3.2. Configuring Periodic Stabilization . . . . . . . . . . . . 6 67 3.3. Adaptive Stabilization . . . . . . . . . . . . . . . . . . 7 68 4. Introduction to Chord . . . . . . . . . . . . . . . . . . . . 7 69 5. Extending Chord-reload to Support Self-tuning . . . . . . . . 9 70 5.1. Update Requests . . . . . . . . . . . . . . . . . . . . . 9 71 5.2. Neighbor Stabilization . . . . . . . . . . . . . . . . . . 10 72 5.3. Finger Stabilization . . . . . . . . . . . . . . . . . . . 11 73 5.4. Adjusting Finger Table Size . . . . . . . . . . . . . . . 11 74 5.5. Detecting Partitioning . . . . . . . . . . . . . . . . . . 11 75 5.6. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 12 76 6. Self-tuning Chord Parameters . . . . . . . . . . . . . . . . . 12 77 6.1. Estimating Overlay Size . . . . . . . . . . . . . . . . . 12 78 6.2. Determining Routing Table Size . . . . . . . . . . . . . . 13 79 6.3. Estimating Failure Rate . . . . . . . . . . . . . . . . . 13 80 6.3.1. Detecting Failures . . . . . . . . . . . . . . . . . . 14 81 6.4. Estimating Join Rate . . . . . . . . . . . . . . . . . . . 15 82 6.5. Estimate Sharing . . . . . . . . . . . . . . . . . . . . . 15 83 6.6. Calculating the Stabilization Interval . . . . . . . . . . 16 84 7. Overlay Configuration Document Extension . . . . . . . . . . . 17 85 8. Security Considerations . . . . . . . . . . . . . . . . . . . 17 86 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 87 9.1. Message Extensions . . . . . . . . . . . . . . . . . . . . 18 88 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 89 10.1. Normative References . . . . . . . . . . . . . . . . . . . 18 90 10.2. Informative References . . . . . . . . . . . . . . . . . . 18 91 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20 93 1. Introduction 95 REsource LOcation And Discovery (RELOAD) [I-D.ietf-p2psip-base] is a 96 peer-to-peer signaling protocol that can be used to maintain an 97 overlay network, and to store data in and retrieve data from the 98 overlay. For interoperability reasons, RELOAD specifies one overlay 99 algorithm, called chord-reload, that is mandatory to implement. This 100 document extends the chord-reload algorithm by introducing self- 101 tuning behavior. 103 DHT-based overlay networks are self-organizing, scalable and 104 reliable. However, these features come at a cost: peers in the 105 overlay network need to consume network bandwidth to maintain routing 106 state. Most DHTs use a periodic stabilization routine to counter the 107 undesirable effects of churn on routing. To configure the parameters 108 of a DHT, some characteristics such as churn rate and network size 109 need to be known in advance. These characteristics are then used to 110 configure the DHT in a static fashion by using fixed values for 111 parameters such as the size of the successor set, size of the routing 112 table, and rate of maintenance messages. The problem with this 113 approach is that it is not possible to achieve a low failure rate and 114 a low communication overhead by using fixed parameters. Instead, a 115 better approach is to allow the system to take into account the 116 evolution of network conditions and adapt to them. This document 117 extends the mandatory-to-implement chord-reload algorithm by making 118 it self-tuning. Two main advantages of self-tuning are that users no 119 longer need to tune every DHT parameter correctly for a given 120 operating environment and that the system adapts to changing 121 operating conditions. 123 The remainder of this document is structured as follows: Section 2 124 provides definitions of terms used in this document. Section 3 125 discusses alternative approaches to stabilization operations in DHTs, 126 including reactive stabilization, periodic stabilization, and 127 adaptive stabilization. Section 4 gives an introduction to the Chord 128 DHT algorithm. Section 5 describes how this document extends the 129 stabilization routine of the chord-reload algorithm. Section 6 130 describes how the stabilization rate and routing table size are 131 calculated in an adaptive fashion. 133 2. Terminology 135 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 136 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 137 document are to be interpreted as described in RFC 2119 [RFC2119]. 139 This document uses the terminology and definitions from the Concepts 140 and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts] 141 draft. 143 Chord Ring: The Chord DHT orders identifiers on an identifier circle 144 of size 2^numBitsInNodeId (numBitsInNodeId is the number of bits 145 in node identifiers). This identifier circle is called the Chord 146 ring. 148 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 149 distributed systems that provide a lookup service similar to a 150 hash table. Given a key, any participating peer can retrieve the 151 value associated with that key. The responsibility for 152 maintaining the mapping from keys to values is distributed among 153 the peers. 155 Finger Table: A data structure with up to numBitsInNodeId entries 156 maintained by each peer in a Chord-based overlay. The ith entry 157 in the finger table of peer n contains the identity of the first 158 peer that succeeds n by at least 2^(numBitsInNodeId-i) on the 159 Chord ring. This peer is called the ith finger of peer n. As an 160 example, the first entry in the finger table of peer n contains a 161 peer half-way around the Chord ring from peer n. The purpose of 162 the finger table is to accelerate lookups. 164 log2(N): Logarithm of N with base 2. 166 n.id: Peer-ID of peer n. 168 Neighborhood Set: Consists of successor and predecessor lists. 170 numBitsInNodeId: Number of bits in a Node-ID. 172 O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means 173 that f(n) is less than some constant multiple of g(n). 175 Omega(g(n)): Informally, saying that some equation f(n) = 176 Omega(g(n)) means that f(n) is more than some constant multiple of 177 g(n). 179 Predecessor List: A data structure containing the predecessors of a 180 peer on the Chord ring. 182 Routing Table: The set of peers that a node can use to route overlay 183 messages. The routing table consists of the finger table, 184 successor list and predecessor list. 186 Successor List: A data structure containing the first r successors 187 of a peer on the Chord ring. 189 3. Introduction to Stabilization in DHTs 191 DHTs use stabilization routines to counter the undesirable effects of 192 churn on routing. The purpose of stabilization is to keep the 193 routing information of each peer in the overlay consistent with the 194 constantly changing overlay topology. There are two alternative 195 approaches to stabilization: periodic and reactive [rhea2004]. 196 Periodic stabilization can either use a fixed stabilization rate or 197 calculate the stabilization rate in an adaptive fashion. 199 3.1. Reactive vs. Periodic Stabilization 201 In reactive stabilization, a peer reacts to the loss of a peer in its 202 neighborhood set or to the appearance of a new peer that should be 203 added to its neighborhood set by sending a copy of its neighbor table 204 to all peers in the neighborhood set. Periodic recovery, in 205 contrast, takes place independently of changes in the neighborhood 206 set. In periodic recovery, a peer periodically shares its 207 neighborhood set with each or a subset of the members of that set. 209 The chord-reload algorithm [I-D.ietf-p2psip-base] supports both 210 reactive and periodic stabilization. It has been shown in [rhea2004] 211 that reactive stabilization works well for small neighborhood sets 212 (i.e., small overlays) and moderate churn. However, in large-scale 213 (e.g., 1000 peers or more [rhea2004]) or high-churn overlays, 214 reactive stabilization runs the risk of creating a positive feedback 215 cycle, which can eventually result in congestion collapse. In 216 [rhea2004], it is shown that a 1000-peer overlay under churn uses 217 significantly less bandwidth and has lower latencies when periodic 218 stabilization is used than when reactive stabilization is used. 219 Although in the experiments carried out in [rhea2004], reactive 220 stabilization performed well when there was no churn, its bandwidth 221 use was observed to jump dramatically under churn. At higher churn 222 rates and larger scale overlays, periodic stabilization uses less 223 bandwidth and the resulting lower contention for the network leads to 224 lower latencies. For this reason, most DHTs such as CAN [CAN], Chord 225 [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic 226 stabilization [ghinita2006]. As an example, the first version of 227 Bamboo used reactive stabilization, which caused Bamboo to suffer 228 from degradation in performance under churn. To fix this problem, 229 Bamboo was modified to use periodic stabilization. 231 In Chord, periodic stabilization is typically done both for 232 successors and fingers. An alternative strategy is analyzed in 233 [krishnamurthy2008]. In this strategy, called the correction-on- 234 change maintenance strategy, a peer periodically stabilizes its 235 successors but does not do so for its fingers. Instead, finger 236 pointers are stabilized in a reactive fashion. The results obtained 237 in [krishnamurthy2008] imply that although the correction-on-change 238 strategy works well when churn is low, periodic stabilization 239 outperforms the correction-on-change strategy when churn is high. 241 3.2. Configuring Periodic Stabilization 243 When periodic stabilization is used, one faces the problem of 244 selecting an appropriate execution rate for the stabilization 245 procedure. If the execution rate of periodic stabilization is high, 246 changes in the system can be quickly detected, but at the 247 disadvantage of increased communication overhead. Alternatively, if 248 the stabilization rate is low and the churn rate is high, routing 249 tables become inaccurate and DHT performance deteriorates. Thus, the 250 problem is setting the parameters so that the overlay achieves the 251 desired reliability and performance even in challenging conditions, 252 such as under heavy churn. This naturally results in high cost 253 during periods when the churn level is lower than expected, or 254 alternatively, poor performance or even network partitioning in worse 255 than expected conditions. 257 In addition to selecting an appropriate stabilization interval, 258 regardless of whether periodic stabilization is used or not, an 259 appropriate size needs to be selected for the neighborhood set and 260 for the finger table. 262 The current approach is to configure overlays statically. This works 263 in situations where perfect information about the future is 264 available. In situations where the operating conditions of the 265 network are known in advance and remain static throughout the 266 lifetime of the system, it is possible to choose fixed optimal values 267 for parameters such as stabilization rate, neighborhood set size and 268 routing table size. However, if the operating conditions (e.g., the 269 size of the overlay and its churn rate) do not remain static but 270 evolve with time, it is not possible to achieve both a low lookup 271 failure rate and a low communication overhead by using fixed 272 parameters [ghinita2006]. 274 As an example, to configure the Chord DHT algorithm, one needs to 275 select values for the following parameters: size of successor list, 276 stabilization interval, and size of the finger table. To select an 277 appropriate value for the stabilization interval, one needs to know 278 the expected churn rate and overlay size. According to 280 [liben-nowell2002], a Chord network in a ring-like state remains in a 281 ring-like state as long as peers send Omega(log2^2(N)) messages 282 before N new peers join or N/2 peers fail. Thus, in a 500-peer 283 overlay churning at a rate such that one peer joins and one peer 284 leaves the network every 30 seconds, an appropriate stabilization 285 interval would be on the order of 93s. According to [Chord], the 286 size of the successor list and finger table should be on the order of 287 log2(N). Having a successor list of size O(log2(N)) makes it 288 unlikely that a peer will lose all of its successors, which would 289 cause the Chord ring to become disconnected. Thus, in a 500-peer 290 network each peer should maintain on the order of nine successors and 291 fingers. However, if the churn rate doubles and the network size 292 remains unchanged, the stabilization rate should double as well. 293 That is, the appropriate maintenance interval would now be on the 294 order of 46s. On the other hand, if the churn rate becomes e.g. six- 295 fold and the size of the network grows to 2000 peers, on the order of 296 eleven fingers and successors should be maintained and the 297 stabilization interval should be on the order of 42s. If one 298 continued using the old values, this could result in inaccurate 299 routing tables, network partitioning, and deteriorating performance. 301 3.3. Adaptive Stabilization 303 A self-tuning DHT takes into consideration the continuous evolution 304 of network conditions and adapts to them. In a self-tuning DHT, each 305 peer collects statistical data about the network and dynamically 306 adjusts its stabilization rate, neighborhood set size, and finger 307 table size based on the analysis of the data [ghinita2006]. 308 Reference [mahajan2003] shows that by using self-tuning, it is 309 possible to achieve high reliability and performance even in adverse 310 conditions with low maintenance cost. Adaptive stabilization has 311 been shown to outperform periodic stabilization in terms of both 312 lookup failures and communication overhead [ghinita2006]. 314 4. Introduction to Chord 316 Chord [Chord] is a structured P2P algorithm that uses consistent 317 hashing to build a DHT out of several independent peers. Consistent 318 hashing assigns each peer and resource a fixed-length identifier. 319 Peers use SHA-1 as the base hash fuction to generate the identifiers. 320 As specified in RELOAD base, the length of the identifiers is 321 numBitsInNodeId=128 bits. The identifiers are ordered on an 322 identifier circle of size 2^numBitsInNodeId. On the identifier 323 circle, key k is assigned to the first peer whose identifier equals 324 or follows the identifier of k in the identifier space. The 325 identifier circle is called the Chord ring. 327 Different DHTs differ significantly in performance when bandwidth is 328 limited. It has been shown that when compared to other DHTs, the 329 advantages of Chord include that it uses bandwidth efficiently and 330 can achieve low lookup latencies at little cost [li2004]. 332 A simple lookup mechanism could be implemented on a Chord ring by 333 requiring each peer to only know how to contact its current successor 334 on the identifier circle. Queries for a given identifier could then 335 be passed around the circle via the successor pointers until they 336 encounter the first peer whose identifier is equal to or larger than 337 the desired identifier. Such a lookup scheme uses a number of 338 messages that grows linearly with the number of peers. To reduce the 339 cost of lookups, Chord maintains also additional routing information; 340 each peer n maintains a data structure with up to numBitsInNodeId 341 entries, called the finger table. The first entry in the finger 342 table of peer n contains the peer half-way around the ring from peer 343 n. The second entry contains the peer that is 1/4th of the way 344 around, the third entry the peer that is 1/8th of the way around, 345 etc. In other words, the ith entry in the finger table at peer n 346 contains the identity of the first peer s that succeeds n by at least 347 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the ith 348 finger of peer n. The interval between two consecutive fingers is 349 called a finger interval. The ith finger interval of peer n covers 350 the range [n.id + 2^(numBitsInNodeId-i), n.id + 351 2^(numBitsInNodeId-i+1)) on the Chord ring. In an N-peer network, 352 each peer maintains information about O(log2(N)) other peers in its 353 finger table. As an example, if N=100000, it is sufficient to 354 maintain 17 fingers. 356 Chord needs all peers' successor pointers to be up to date in order 357 to ensure that lookups produce correct results as the set of 358 participating peers changes. To achieve this, peers run a 359 stabilization protocol periodically in the background. The 360 stabilization protocol of the original Chord algorithm uses two 361 operations: successor stabilization and finger stabilization. 362 However, the Chord algorithm of RELOAD base defines two additional 363 stabilization components, as will be discussed below. 365 To increase robustness in the event of peer failures, each Chord peer 366 maintains a successor list of size r, containing the peer's first r 367 successors. The benefit of successor lists is that if each peer 368 fails independently with probability p, the probability that all r 369 successors fail simultaneously is only p^r. 371 The original Chord algorithm maintains only a single predecessor 372 pointer. However, multiple predecessor pointers (i.e., a predecessor 373 list) can be maintained to speed up recovery from predecessor 374 failures. The routing table of a peer consists of the successor 375 list, finger table, and predecessor list. 377 5. Extending Chord-reload to Support Self-tuning 379 This section describes how the mandatory-to-implement chord-reload 380 algorithm defined in RELOAD base [I-D.ietf-p2psip-base] can be 381 extended to support self-tuning. 383 The chord-reload algorithm supports both reactive and periodic 384 recovery strategies. When the self-tuning mechanisms defined in this 385 document are used, the periodic recovery strategy MUST be used. 386 Further, chord-reload specifies that at least three predecessors and 387 three successors need to be maintained. When the self-tuning 388 mechanisms are used, the appropriate sizes of the successor list and 389 predecessor list are determined in an adaptive fashion based on the 390 estimated network size, as will be described in Section 6. 392 As specified in RELOAD base, each peer MUST maintain a stabilization 393 timer. When the stabilization timer fires, the peer MUST restart the 394 timer and carry out the overlay stabilization routine. Overlay 395 stabilization has four components in chord-reload: 397 1. Update the neighbor table. We refer to this as neighbor 398 stabilization. 399 2. Refreshing the finger table. We refer to this as finger 400 stabilization. 401 3. Adjusting finger table size. 402 4. Detecting partitioning. We refer to this as strong 403 stabilization. 405 As specified in RELOAD base [I-D.ietf-p2psip-base], a peer sends 406 periodic messages as part of the neighbor stabilization, finger 407 stabilization, and strong stabilization routines. In neighbor 408 stabilization, a peer periodically sends an Update request to every 409 peer in its Connection Table. The default time is every ten minutes. 410 In finger stabilization, a peer periodically searches for new peers 411 to include in its finger table. This time defaults to one hour. 412 This document specifies how the neighbor stabilization and finger 413 stabilization intervals can be determined in an adaptive fashion 414 based on the operating conditions of the overlay. The subsections 415 below describe how this document extends the four components of 416 stabilization. 418 5.1. Update Requests 420 As described in RELOAD base [I-D.ietf-p2psip-base], the neighbor and 421 finger stabilization procedures are implemented using Update 422 requests. RELOAD base defines three types of Update requests: 423 'peer_ready', 'neighbors', and 'full'. Regardless of the type, all 424 Update requests include an 'uptime' field. Since the self-tuning 425 extensions require information on the uptimes of peers in the routing 426 table, the sender of an Update request MUST include its current 427 uptime in seconds in the 'uptime' field. 429 When self-tuning is used, each peer decides independently the 430 appropriate size for the successor list, predecessor list and finger 431 table. Thus, the 'predecessors', 'successors', and 'fingers' fields 432 included in RELOAD Update requests are of variable length. As 433 specified in RELOAD [I-D.ietf-p2psip-base], variable length fields 434 are on the wire preceded by length bytes. In the case of the 435 successor list, predecessor list, and finger table, there are two 436 length bytes (allowing lengths up to 2^16-1). The number of NodeId 437 structures included in each field can be calculated based on the 438 length bytes since the size of a single NodeId structure is 16 bytes. 439 If a peer receives more entries than fit into its successor list, 440 predecessor list or finger table, the peer MUST ignore the extra 441 entries. If a peer receives less entries than it currently has in 442 its own data structure, the peer MUST NOT drop the extra entries from 443 its data structure. 445 5.2. Neighbor Stabilization 447 In the neighbor stabilization operation of chord-reload, a peer 448 periodically sends an Update request to every peer in its Connection 449 Table. In a small, low-churn overlay, the amount of traffic this 450 process generates is typically acceptable. However, in a large-scale 451 overlay churning at a moderate or high churn rate, the traffic load 452 may no longer be acceptable since the size of the connection table is 453 large and the stabilization interval relatively short. The self- 454 tuning mechanisms described in this document are especially designed 455 for overlays of the latter type. Therefore, when the self-tuning 456 mechanisms are used, each peer MUST send a periodic Update request 457 only to its first predecessor and first successor on the Chord ring. 459 The neighbor stabilization routine MUST be executed when the 460 stabilization timer fires. To begin the neighbor stabilization 461 routine, a peer MUST send an Update request to its first successor 462 and its first predecessor. The type of the Update request MUST be 463 'neighbors'. The Update request MUST include the successor and 464 predecessor lists of the sender. If a peer receiving such an Update 465 request learns from the predecessor and successor lists included in 466 the request that new peers can be included in its neighborhood set, 467 it MUST send Attach requests to the new peers. 469 After a new peer has been added to the predecessor or successor list, 470 an Update request of type 'peer_ready' MUST be sent to the new peer. 471 This allows the new peer to insert the sender into its neighborhood 472 set. 474 5.3. Finger Stabilization 476 In the finger stabilization routine of chord-reload, a peer 477 periodically searches for new peers to replace invalid (that is, 478 repeated or failed) entries in the finger table. Chord-reload 479 provides two possible methods for searching for new peers to include 480 in the finger table. In alternative 1, a peer selects one entry from 481 among the invalid entries in its finger table each time the 482 stabilization timer fires and sends a Ping request to the selected 483 entry. In alternative 2, a peer sends a RouteQuery request to all 484 invalid entries in the finger table. After having received 485 RouteQuery responses, the peer sends a Ping to those entries for 486 which the RouteQuery response came from a peer not already present in 487 the routing table. When the self-tuning mechanisms defined in this 488 draft are being used in the overlay, alternative 1 MUST be used since 489 the traffic load it generates is lower and thus more appropriate for 490 large-scale overlays experiencing churn. 492 Immediately after a new peer has been added to the finger table, a 493 Probe request MUST be sent to the new peer to fetch its uptime. The 494 requested_info field of the Probe request MUST be set to contain the 495 ProbeInformationType 'uptime' defined in RELOAD base 496 [I-D.ietf-p2psip-base]. 498 5.4. Adjusting Finger Table Size 500 The chord-reload algorithm defines how a peer can make sure that the 501 finger table is appropriately sized to allow for efficient routing. 502 Since the self-tuning mechanisms specified in this document produce a 503 network size estimate, this estimate can be directly used to 504 calculate the optimal size for the finger table. This mechanism MUST 505 be used instead of the one specified by chord-reload. A peer MUST 506 use the network size estimate to determine whether it needs to adjust 507 the size of its finger table each time when the stabilization timer 508 fires. The way this is done is explained in Section 6.2. 510 5.5. Detecting Partitioning 512 This document does not require any changes to the mechanism chord- 513 reload uses to detect network partitioning. 515 5.6. Leaving the Overlay 517 As specified in RELOAD base [I-D.ietf-p2psip-base], a leaving peer 518 SHOULD send a Leave request to all members of its neighbor table 519 prior to leaving the overlay. The overlay_specific_data field MUST 520 contain the ChordLeaveData structure. The Leave requests that are 521 sent to successors MUST contain the predecessor list of the leaving 522 peer. The Leave requests that are sent to the predecessors MUST 523 contain the successor list of the leaving peer. If a given successor 524 can identify better predecessors than are already included in its 525 predecessor lists by investigating the predecessor list it receives 526 from the leaving peer, it MUST send Attach requests to them. 527 Similarly, if a given predecessor identifies better successors by 528 investigating the successor list it receives from the leaving peer, 529 it MUST send Attach requests to them. 531 6. Self-tuning Chord Parameters 533 This section specifies how to determine an appropriate stabilization 534 rate and routing table size in an adaptive fashion. The proposed 535 mechanism is based on [mahajan2003], [liben-nowell2002], and 536 [ghinita2006]. To calculate an appropriate stabilization rate, the 537 values of three parameters MUST be estimated: overlay size N, failure 538 rate U, and join rate L. To calculate an appropriate routing table 539 size, the estimated network size N can be used. Peers in the overlay 540 MUST re-calculate the values of the parameters to self-tune the 541 chord-reload algorithm at the end of each stabilization period before 542 re-starting the stabilization timer. 544 6.1. Estimating Overlay Size 546 Techniques for estimating the size of an overlay network have been 547 proposed for instance in [mahajan2003], [horowitz2003], 548 [kostoulas2005], [binzenhofer2006], and [ghinita2006]. In Chord, the 549 density of peer identifiers in the neighborhood set can be used to 550 produce an estimate of the size of the overlay, N [mahajan2003]. 551 Since peer identifiers are picked randomly with uniform probability 552 from the numBitsInNodeId-bit identifier space, the average distance 553 between peer identifiers in the successor set is 554 (2^numBitsInNodeId)/N. 556 To estimate the overlay network size, a peer MUST compute the average 557 inter-peer distance d between the successive peers starting from the 558 most distant predecessor and ending to the most distant successor in 559 the successor list. The estimated network size MUST be calculated 560 as: 562 2^numBitsInNodeId 563 N = ------------------- 564 d 566 This estimate has been found to be accurate within 15% of the real 567 network size [ghinita2006]. Of course, the size of the neighborhood 568 set affects the accuracy of the estimate. 570 During the join process, a joining peer fills its routing table by 571 sending a series of Ping and Attach requests, as specified in RELOAD 572 base [I-D.ietf-p2psip-base]. Thus, a joining peer immediately has 573 enough information at its disposal to calculate an estimate of the 574 network size. 576 6.2. Determining Routing Table Size 578 As specified in RELOAD base, the finger table must contain at least 579 16 entries. When the self-tuning mechanisms are used, the size of 580 the finger table MUST be set to max(log2(N), 16) using the estimated 581 network size N. 583 The size of the successor list MUST be set to log2(N). An 584 implementation MAY place a lower limit on the size of the successor 585 list. As an example, the implementation might require the size of 586 the successor list to be always at least three. 588 A peer MAY choose to maintain a fixed-size predecessor list with only 589 three entries as specified in RELOAD base. However, it is 590 RECOMMENDED that a peer maintains log2(N) predecessors. 592 6.3. Estimating Failure Rate 594 A typical approach is to assume that peers join the overlay according 595 to a Poisson process with rate L and leave according to a Poisson 596 process with rate parameter U [mahajan2003]. The value of U can be 597 estimated using peer failures in the finger table and neighborhood 598 set [mahajan2003]. If peers fail with rate U, a peer with M unique 599 peer identifiers in its routing table should observe K failures in 600 time K/(M*U). Every peer in the overlay MUST maintain a history of 601 the last K failures. The current time MUST be inserted into the 602 history when the peer joins the overlay. The estimate of U MUST be 603 calculated as: 605 k 606 U = --------, 607 M * Tk 609 where M is the number of unique peer identifiers in the routing 610 table, Tk is the time between the first and the last failure in the 611 history, and k is the number of failures in the history. If k is 612 smaller than K, the estimate MUST be computed as if there was a 613 failure at the current time. It has been shown that an estimate 614 calculated in a similar manner is accurate within 17% of the real 615 value of U [ghinita2006]. 617 The size of the failure history K affects the accuracy of the 618 estimate of U. One can increase the accuracy by increasing K. 619 However, this has the side effect of decreasing responsiveness to 620 changes in the failure rate. On the other hand, a small history size 621 may cause a peer to overreact each time a new failure occurs. In 622 [ghinita2006], K is set 25% of the routing table size. Use of this 623 approach is RECOMMENDED. 625 6.3.1. Detecting Failures 627 A new failure MUST be inserted to the failure history in the 628 following cases: 630 1. A Leave request is received from a neigbhor. 631 2. A peer fails to reply to a Ping request sent in the situation 632 explained below. If no packets have been received on a 633 connection during the past 2*Tr seconds (where Tr is the 634 inactivity timer defined by ICE [I-D.ietf-mmusic-ice]), a RELOAD 635 Ping request MUST be sent to the remote peer. RELOAD mandates 636 the use of STUN [RFC5389] for keepalives. STUN keepalives take 637 the form of STUN Binding Indication transactions. As specified 638 in ICE [I-D.ietf-mmusic-ice], a peer sends a STUN Binding 639 Indication if there has been no packet sent on a connection for 640 Tr seconds. Tr is configurable and has a default of 15 seconds. 641 Although STUN Binding Indications do not generate a response, the 642 fact that a peer has failed can be learned from the lack of 643 packets (Binding Indications or application protocol packets) 644 received from the peer. If the remote peer fails to reply to the 645 Ping request, the sender MUST consider the remote peer to have 646 failed. 648 As an alternative to relying on STUN keepalives to detect peer 649 failure, a peer could send additional, frequent RELOAD messages to 650 every peer in its Connection Table. These messages could be Update 651 requests, in which case they would serve two purposes: detecting peer 652 failure and stabilization. However, as the cost of this approach can 653 be very high in terms of bandwidth consumption and traffic load, 654 especially in large-scale overlays experiencing churn, its use is NOT 655 RECOMMENDED. 657 6.4. Estimating Join Rate 659 Reference [ghinita2006] proposes that a peer can estimate the join 660 rate based on the uptime of the peers in its routing table. An 661 increase in peer join rate will be reflected by a decrease in the 662 average age of peers in the routing table. Thus, each peer MUST 663 maintain an array of the ages of the peers in its routing table 664 sorted in increasing order. Using this information, an estimate of 665 the global peer join rate L MUST be calculated as: 667 N 1 668 L = --- * ---------------, 669 4 Ages[rsize/4] 671 where Ages is an array containing the ages of the peers in the 672 routing table sorted in increasing order and rsize is the size of the 673 routing table. It is RECOMMENDED that only the ages of the 25% of 674 the youngest peers in the routing table (i.e., the 25th percentile) 675 are used to reduce the bias that a small number of peers with very 676 old ages can cause [ghinita2006]. It has been shown that the 677 estimate obtained by using this method is accurate within 22% of the 678 real join rate [ghinita2006]. Of course, the size of the routing 679 table affects the accuracy. 681 In order for this mechanism to work, peers need to exchange 682 information about the time they have been present in the overlay. 683 Peers receive the uptimes of their successors and predecessors during 684 the stabilization operations since all Update requests carry uptime 685 values. A joining peer learns the uptime of the admitting peer since 686 it receives an Update from the admitting peer during the join 687 procedure. Peers learn the uptimes of new fingers since they can 688 fetch the uptime using a Probe request after having attached to the 689 new finger. 691 6.5. Estimate Sharing 693 To improve the accuracy of network size, join rate, and leave rate 694 estimates, peers MUST share their estimates. When the stabilization 695 timer fires, a peer MUST select number-of-peers-to-probe random peers 696 from its finger table and send each of them a Probe request. The 697 targets of Probe requests are selected from the finger table rather 698 than from the neighbor table since neighbors are likely to make 699 similar errors when calculating their estimates. number-of-peers-to- 700 probe is a new element in the overlay configuration document. It is 701 defined in Section 7 and has a default value of 4. Both the Probe 702 request and the answer returned by the target peer MUST contain a new 703 message extension whose MessageExtensionType is 'self_tuning_data'. 704 This extension type is defined in Section 9.1. The 705 extension_contents field of the MessageExtension structure MUST 706 contain a SelfTuningData structure: 708 struct { 709 uint32 network_size; 710 uint32 join_rate; 711 uint32 leave_rate; 712 } SelfTuningData; 714 The contents of the SelfTuningData structure are as follows: 716 network_size 717 The latest network size estimate calculated by the sender. 719 join_rate 720 The latest join rate estimate calculated by the sender. 722 leave_rate 723 The latest leave rate estimate calculated by the sender. 725 The join and leave rates are expressed as joins or failures per 24 726 hours. As an example, if the global join rate estimate a peer has 727 calculated is 0.123 peers/s, it would include in the join_rate field 728 the value 10627 (24*60*60*0.123 = 10627.2). 730 The 'type' field of the MessageExtension structure MUST be set to 731 contain the value 'self_tuning_data'. The 'critical' field of the 732 structure MUST be set to False. 734 A peer MUST store all estimates it receives in Probe requests and 735 answers during a stabilization interval. When the stabilization 736 timer fires, the peer MUST calculate the estimates to be used during 737 the next stabilization interval by taking the 75th percentile of a 738 data set containing its own estimate and the received estimates. 740 6.6. Calculating the Stabilization Interval 742 According to [liben-nowell2002], a Chord network in a ring-like state 743 remains in a ring-like state as long as peers send Omega(log2^2(N)) 744 messages before N new peers join or N/2 peers fail. We can use the 745 estimate of peer failure rate, U, to calculate the time Tf in which 746 N/2 peers fail: 748 1 749 Tf = ------ 750 2*U 752 Based on this estimate, a stabilization interval Tstab-1 MUST be 753 calculated as: 755 Tf 756 Tstab-1 = ----------- 757 log2^2(N) 759 On the other hand, the estimated join rate L can be used to calculate 760 the time in which N new peers join the overlay. Based on the 761 estimate of L, a stabilization interval Tstab-2 MUST be calculated 762 as: 764 N 765 Tstab-2 = --------------- 766 L * log2^2(N) 768 Finally, the actual stabilization interval Tstab that MUST be used 769 can be obtained by taking the minimum of Tstab-1 and Tstab-2. 771 The results obtained in [maenpaa2009] indicate that making the 772 stabilization interval too small has the effect of making the overlay 773 less stable (e.g., in terms of detected loops and path failures). 774 Thus, a lower limit should be used for the stabilization period. 775 Based on the results in [maenpaa2009], a lower limit of 15s is 776 RECOMMENDED, since using a stabilization period smaller than this 777 will with a high probability cause too much traffic in the overlay. 779 7. Overlay Configuration Document Extension 781 This document extends the RELOAD overlay configuration document by 782 adding one new element, "number-of-peers-to-probe", inside each 783 "configuration" element. 785 number-of-peers-to-probe: The number of fingers to which Probe 786 requests are sent to obtain their network size, join rate, and 787 leave rate estimates. The default value is 4. 789 8. Security Considerations 791 There are no new security considerations introduced in this document. 792 The security considerations of RELOAD [I-D.ietf-p2psip-base] apply. 794 9. IANA Considerations 795 9.1. Message Extensions 797 This document introduces one additional extension to the "RELOAD 798 Extensions" Registry: 800 +------------------+-------+---------------+ 801 | Extension Name | Code | Specification | 802 +------------------+-------+---------------+ 803 | self_tuning_data | 1 | RFC-AAAA | 804 +------------------+-------+---------------+ 806 The contents of the extension are defined in Section 6.5. 808 10. References 810 10.1. Normative References 812 [I-D.ietf-mmusic-ice] 813 Rosenberg, J., "Interactive Connectivity Establishment 814 (ICE): A Protocol for Network Address Translator (NAT) 815 Traversal for Offer/Answer Protocols", 816 draft-ietf-mmusic-ice-19 (work in progress), October 2007. 818 [I-D.ietf-p2psip-base] 819 Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and 820 H. Schulzrinne, "REsource LOcation And Discovery (RELOAD) 821 Base Protocol", draft-ietf-p2psip-base-07 (work in 822 progress), February 2010. 824 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 825 Requirement Levels", BCP 14, RFC 2119, March 1997. 827 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, 828 "Session Traversal Utilities for NAT (STUN)", RFC 5389, 829 October 2008. 831 10.2. Informative References 833 [CAN] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S. 834 Schenker, "A scalable content-addressable network", In 835 Proc. of the 2001 Conference on Applications, 836 Technologies, Architectures and Protocols for Computer 837 Communications 2001, pp. 161.172. 839 [Chord] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., 840 Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A 841 Scalable Peer-to-peer Lookup Service for Internet 842 Applications", IEEE/ACM Transactions on Networking Volume 843 11, Issue 1, 17-32, Feb 2003. 845 [I-D.ietf-p2psip-concepts] 846 Bryan, D., Matthews, P., Shim, E., Willis, D., and S. 847 Dawkins, "Concepts and Terminology for Peer to Peer SIP", 848 draft-ietf-p2psip-concepts-02 (work in progress), 849 July 2008. 851 [Pastry] Rowstron, A. and P. Druschel, "Pastry: Scalable, 852 Decentralized Object Location and Routing for Large-Scale 853 Peer-to-Peer Systems", In Proc. of the IFIP/ACM 854 International Conference on Distribued Systems 855 Platforms Nov. 2001, pp. 329-350. 857 [binzenhofer2006] 858 Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable 859 algorithm to monitor chord-based P2P systems at runtime", 860 20th International Parallel and Distributed Processing 861 Symposium April 2006. 863 [ghinita2006] 864 Ghinita, G. and Y. Teo, "An adaptive stabilization 865 framework for distributed hash tables", 20th International 866 Parallel and Distributed Processing Symposium April 2006. 868 [horowitz2003] 869 Horowitz, K. and D. Malkhi, "Estimating network size from 870 local information", Information Processing Letters Dec. 871 2003, Volume 88, Issue 5, pp. 237-243. 873 [kostoulas2005] 874 Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and 875 A. Demers, "Decentralized schemes for size estimation in 876 large and dynamic groups", Fourth IEEE International 877 Symposium on Network Computing and Applications July 2005, 878 pp. 41-48. 880 [krishnamurthy2008] 881 Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. 882 Haridi, "Comparing maintenance strategies for overlays", 883 In Proc. of 16th Euromicro Conference on Parallel, 884 Distributed and Network-Based Processing Feb. 2008, pp. 885 473-482. 887 [li2004] Li, J., Strinbling, J., Gil, T., and M. Kaashoek, 888 "Comparing the performance of distributed hash tables 889 under churn", In Proc. of the 3rd International Workshop 890 on Peer-to-Peer Systems 2004. 892 [liben-nowell2002] 893 Liben-Nowell, D., Balakrishnan, H., and D. Karger, 894 "Observations on the dynamic evolution of peer-to-peer 895 networks", In Proc. of the First International Workshop on 896 Peer-to-Peer Systems March 2002. 898 [maenpaa2009] 899 Maenpaa, J. and G. Camarillo, "A study on maintenance 900 operations in a Chord-based Peer-to-Peer Session 901 Initiation Protocol overlay network", In Proc. of IPDPS 902 2009 May 2009. 904 [mahajan2003] 905 Mahajan, R., Castro, M., and A. Rowstron, "Controlling the 906 cost of reliability in peer-to-peer overlays", In 907 Proceedings of the 2nd International Workshop on Peer-to- 908 Peer Systems Feb. 2003. 910 [rhea2004] 911 Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, 912 "Handling churn in a DHT", In Proc. of the USENIX Annual 913 Techincal Conference June 2004. 915 Authors' Addresses 917 Jouni Maenpaa 918 Ericsson 919 Hirsalantie 11 920 Jorvas 02420 921 Finland 923 Email: Jouni.Maenpaa@ericsson.com 925 Gonzalo Camarillo 926 Ericsson 927 Hirsalantie 11 928 Jorvas 02420 929 Finland 931 Email: Gonzalo.Camarillo@ericsson.com 932 Jani Hautakorpi 933 Ericsson 934 Hirsalantie 11 935 Jorvas 02420 936 Finland 938 Email: Jani.Hautakorpi@ericsson.com