2 P2PSIP Working Group J. Maenpaa 3 Internet-Draft G. Camarillo 4 Intended status: Standards Track Ericsson 5 Expires: August 20, 2013 February 16, 2013 7 A Self-tuning Distributed Hash Table (DHT) for REsource LOcation And 8 Discovery (RELOAD) 9 draft-ietf-p2psip-self-tuning-08.txt 11 Abstract 13 REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P) 14 signaling protocol that provides an overlay network service. Peers 15 in a RELOAD overlay network collectively run an overlay algorithm to 16 organize the overlay, and to store and retrieve data. This document 17 describes how the default topology plugin of RELOAD can be extended 18 to support self-tuning, that is, to adapt to changing operating 19 conditions such as churn and network size. 21 Status of this Memo 23 This Internet-Draft is submitted to IETF in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on August 20, 2013. 38 Copyright Notice 40 Copyright (c) 2013 IETF Trust and the persons identified as the 41 document authors. All rights reserved.
43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document. Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 3. Introduction to Stabilization in DHTs . . . . . . . . . . . . 5 58 3.1. Reactive vs. Periodic Stabilization . . . . . . . . . . . 5 59 3.2. Configuring Periodic Stabilization . . . . . . . . . . . . 6 60 3.3. Adaptive Stabilization . . . . . . . . . . . . . . . . . . 7 61 4. Introduction to Chord . . . . . . . . . . . . . . . . . . . . 7 62 5. Extending Chord-reload to Support Self-tuning . . . . . . . . 9 63 5.1. Update Requests . . . . . . . . . . . . . . . . . . . . . 9 64 5.2. Neighbor Stabilization . . . . . . . . . . . . . . . . . . 10 65 5.3. Finger Stabilization . . . . . . . . . . . . . . . . . . . 11 66 5.4. Adjusting Finger Table Size . . . . . . . . . . . . . . . 11 67 5.5. Detecting Partitioning . . . . . . . . . . . . . . . . . . 11 68 5.6. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 11 69 6. Self-tuning Chord Parameters . . . . . . . . . . . . . . . . . 12 70 6.1. Estimating Overlay Size . . . . . . . . . . . . . . . . . 12 71 6.2. Determining Routing Table Size . . . . . . . . . . . . . . 13 72 6.3. Estimating Failure Rate . . . . . . . . . . . . . . . . . 13 73 6.3.1. Detecting Failures . . . . . . . . . . . . . . . . . . 14 74 6.4. Estimating Join Rate . . . . . . . . . . . . . . . . . . . 14 75 6.5. Estimate Sharing . . . . . . . . . . . . . . . . . . . . . 15 76 6.6. Calculating the Stabilization Interval . . . . . . . . . . 16 77 7. Overlay Configuration Document Extension . . . . . . . . . . . 17 78 8. Security Considerations . . . . . . . . . . . . . . . . . . . 17 79 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 17 80 9.1. Message Extensions . . . . . . . . . . . . . . . . . . . . 18 81 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 18 82 11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 18 83 11.1. Normative References . . . . . . . . . . . . . . . . . . . 18 84 11.2. Informative References . . . . . . . . . . . . . . . . . . 18 85 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 20 87 1. Introduction 89 REsource LOcation And Discovery (RELOAD) [I-D.ietf-p2psip-base] is a 90 peer-to-peer signaling protocol that can be used to maintain an 91 overlay network, and to store data in and retrieve data from the 92 overlay. For interoperability reasons, RELOAD specifies one overlay 93 algorithm, called chord-reload, that is mandatory to implement. This 94 document extends the chord-reload algorithm by introducing self- 95 tuning behavior. 97 DHT-based overlay networks are self-organizing, scalable and 98 reliable. However, these features come at a cost: peers in the 99 overlay network need to consume network bandwidth to maintain routing 100 state. Most DHTs use a periodic stabilization routine to counter the 101 undesirable effects of churn on routing. 
To configure the parameters 102 of a DHT, some characteristics such as churn rate and network size 103 need to be known in advance. These characteristics are then used to 104 configure the DHT in a static fashion by using fixed values for 105 parameters such as the size of the successor set, size of the routing 106 table, and rate of maintenance messages. The problem with this 107 approach is that it is not possible to achieve a low failure rate and 108 a low communication overhead by using fixed parameters. Instead, a 109 better approach is to allow the system to take into account the 110 evolution of network conditions and adapt to them. This document 111 extends the mandatory-to-implement chord-reload algorithm by making 112 it self-tuning. Two main advantages of self-tuning are that users no 113 longer need to tune every DHT parameter correctly for a given 114 operating environment and that the system adapts to changing 115 operating conditions. 117 The remainder of this document is structured as follows: Section 2 118 provides definitions of terms used in this document. Section 3 119 discusses alternative approaches to stabilization operations in DHTs, 120 including reactive stabilization, periodic stabilization, and 121 adaptive stabilization. Section 4 gives an introduction to the Chord 122 DHT algorithm. Section 5 describes how this document extends the 123 stabilization routine of the chord-reload algorithm. Section 6 124 describes how the stabilization rate and routing table size are 125 calculated in an adaptive fashion. 127 2. Terminology 129 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 130 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this 131 document are to be interpreted as described in RFC 2119 [RFC2119]. 133 This document uses the terminology and definitions from the Concepts 134 and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts] 135 draft. 137 Chord Ring: The Chord DHT orders identifiers on an identifier circle 138 of size 2^numBitsInNodeId (numBitsInNodeId is the number of bits 139 in node identifiers). This identifier circle is called the Chord 140 ring. 142 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 143 distributed systems that provide a lookup service similar to a 144 hash table. Given a key, any participating peer can retrieve the 145 value associated with that key. The responsibility for 146 maintaining the mapping from keys to values is distributed among 147 the peers. 149 Finger Table: A data structure with up to numBitsInNodeId entries 150 maintained by each peer in a Chord-based overlay. The ith entry 151 in the finger table of peer n contains the identity of the first 152 peer that succeeds n by at least 2^(numBitsInNodeId-i) on the 153 Chord ring. This peer is called the ith finger of peer n. As an 154 example, the first entry in the finger table of peer n contains a 155 peer half-way around the Chord ring from peer n. The purpose of 156 the finger table is to accelerate lookups. 158 log2(N): Logarithm of N with base 2. 160 n.id: Peer-ID of peer n. 162 Neighborhood Set: Consists of successor and predecessor lists. 164 numBitsInNodeId: Number of bits in a Node-ID. 166 O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means 167 that f(n) is less than some constant multiple of g(n). 169 Omega(g(n)): Informally, saying that some equation f(n) = 170 Omega(g(n)) means that f(n) is more than some constant multiple of 171 g(n).
173 Predecessor List: A data structure containing the predecessors of a 174 peer on the Chord ring. 176 Routing Table: The set of peers that a node can use to route overlay 177 messages. The routing table consists of the finger table, 178 successor list and predecessor list. 180 Successor List: A data structure containing the first r successors 181 of a peer on the Chord ring. 183 3. Introduction to Stabilization in DHTs 185 DHTs use stabilization routines to counter the undesirable effects of 186 churn on routing. The purpose of stabilization is to keep the 187 routing information of each peer in the overlay consistent with the 188 constantly changing overlay topology. There are two alternative 189 approaches to stabilization: periodic and reactive [rhea2004]. 190 Periodic stabilization can either use a fixed stabilization rate or 191 calculate the stabilization rate in an adaptive fashion. 193 3.1. Reactive vs. Periodic Stabilization 195 In reactive stabilization, a peer reacts to the loss of a peer in its 196 neighborhood set or to the appearance of a new peer that should be 197 added to its neighborhood set by sending a copy of its neighbor table 198 to all peers in the neighborhood set. Periodic recovery, in 199 contrast, takes place independently of changes in the neighborhood 200 set. In periodic recovery, a peer periodically shares its 201 neighborhood set with each or a subset of the members of that set. 203 The chord-reload algorithm [I-D.ietf-p2psip-base] supports both 204 reactive and periodic stabilization. It has been shown in [rhea2004] 205 that reactive stabilization works well for small neighborhood sets 206 (i.e., small overlays) and moderate churn. However, in large-scale 207 (e.g., 1000 peers or more [rhea2004]) or high-churn overlays, 208 reactive stabilization runs the risk of creating a positive feedback 209 cycle, which can eventually result in congestion collapse. In 210 [rhea2004], it is shown that a 1000-peer overlay under churn uses 211 significantly less bandwidth and has lower latencies when periodic 212 stabilization is used than when reactive stabilization is used. 213 Although in the experiments carried out in [rhea2004], reactive 214 stabilization performed well when there was no churn, its bandwidth 215 use was observed to jump dramatically under churn. At higher churn 216 rates and larger scale overlays, periodic stabilization uses less 217 bandwidth and the resulting lower contention for the network leads to 218 lower latencies. For this reason, most DHTs such as CAN [CAN], Chord 219 [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic 220 stabilization [ghinita2006]. As an example, the first version of 221 Bamboo used reactive stabilization, which caused Bamboo to suffer 222 from degradation in performance under churn. To fix this problem, 223 Bamboo was modified to use periodic stabilization. 225 In Chord, periodic stabilization is typically done both for 226 successors and fingers. An alternative strategy is analyzed in 227 [krishnamurthy2008]. In this strategy, called the correction-on- 228 change maintenance strategy, a peer periodically stabilizes its 229 successors but does not do so for its fingers. Instead, finger 230 pointers are stabilized in a reactive fashion. The results obtained 231 in [krishnamurthy2008] imply that although the correction-on-change 232 strategy works well when churn is low, periodic stabilization 233 outperforms the correction-on-change strategy when churn is high. 235 3.2. 
Configuring Periodic Stabilization 237 When periodic stabilization is used, one faces the problem of 238 selecting an appropriate execution rate for the stabilization 239 procedure. If the execution rate of periodic stabilization is high, 240 changes in the system can be quickly detected, but at the 241 disadvantage of increased communication overhead. Alternatively, if 242 the stabilization rate is low and the churn rate is high, routing 243 tables become inaccurate and DHT performance deteriorates. Thus, the 244 problem is setting the parameters so that the overlay achieves the 245 desired reliability and performance even in challenging conditions, 246 such as under heavy churn. This naturally results in high cost 247 during periods when the churn level is lower than expected, or 248 alternatively, poor performance or even network partitioning in worse 249 than expected conditions. 251 In addition to selecting an appropriate stabilization interval, 252 regardless of whether periodic stabilization is used or not, an 253 appropriate size needs to be selected for the neighborhood set and 254 for the finger table. 256 The current approach is to configure overlays statically. This works 257 in situations where perfect information about the future is 258 available. In situations where the operating conditions of the 259 network are known in advance and remain static throughout the 260 lifetime of the system, it is possible to choose fixed optimal values 261 for parameters such as stabilization rate, neighborhood set size and 262 routing table size. However, if the operating conditions (e.g., the 263 size of the overlay and its churn rate) do not remain static but 264 evolve with time, it is not possible to achieve both a low lookup 265 failure rate and a low communication overhead by using fixed 266 parameters [ghinita2006]. 268 As an example, to configure the Chord DHT algorithm, one needs to 269 select values for the following parameters: size of successor list, 270 stabilization interval, and size of the finger table. To select an 271 appropriate value for the stabilization interval, one needs to know 272 the expected churn rate and overlay size. According to 274 [liben-nowell2002], a Chord network in a ring-like state remains in a 275 ring-like state as long as peers send Omega(log2^2(N)) messages 276 before N new peers join or N/2 peers fail. Thus, in a 500-peer 277 overlay churning at a rate such that one peer joins and one peer 278 leaves the network every 30 seconds, an appropriate stabilization 279 interval would be on the order of 93s. According to [Chord], the 280 size of the successor list and finger table should be on the order of 281 log2(N). Having a successor list of size O(log2(N)) makes it 282 unlikely that a peer will lose all of its successors, which would 283 cause the Chord ring to become disconnected. Thus, in a 500-peer 284 network each peer should maintain on the order of nine successors and 285 fingers. However, if the churn rate doubles and the network size 286 remains unchanged, the stabilization rate should double as well. 287 That is, the appropriate maintenance interval would now be on the 288 order of 46s. On the other hand, if the churn rate becomes e.g. six- 289 fold and the size of the network grows to 2000 peers, on the order of 290 eleven fingers and successors should be maintained and the 291 stabilization interval should be on the order of 42s. 
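The arithmetic behind these example figures can be reproduced with the following non-normative sketch (Python; the function names are illustrative only and not part of any protocol). The failure and join rates are the overlay-wide rates used in the example above, expressed in peers per second; Section 6.6 specifies how the stabilization interval is derived normatively from the estimated join and failure rates.

   import math

   def routing_table_entries(n):
       # Successor list and finger table sizes on the order of
       # log2(N), as suggested in [Chord].
       return math.ceil(math.log2(n))

   def stabilization_interval(n, failure_rate, join_rate):
       # Peers should send Omega(log2^2(N)) maintenance messages
       # before N/2 peers fail or N new peers join
       # [liben-nowell2002].
       time_for_half_to_fail = (n / 2.0) / failure_rate
       time_for_n_to_join = n / join_rate
       return min(time_for_half_to_fail,
                  time_for_n_to_join) / (math.log2(n) ** 2)

   # 500-peer overlay, one join and one failure every 30 seconds:
   routing_table_entries(500)                        # -> 9
   stabilization_interval(500, 1 / 30.0, 1 / 30.0)   # -> ~93 s
   # 2000-peer overlay with six-fold churn:
   routing_table_entries(2000)                       # -> 11
   stabilization_interval(2000, 6 / 30.0, 6 / 30.0)  # -> ~42 s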
If one 292 continued using the old values, this could result in inaccurate 293 routing tables, network partitioning, and deteriorating performance. 295 3.3. Adaptive Stabilization 297 A self-tuning DHT takes into consideration the continuous evolution 298 of network conditions and adapts to them. In a self-tuning DHT, each 299 peer collects statistical data about the network and dynamically 300 adjusts its stabilization rate, neighborhood set size, and finger 301 table size based on the analysis of the data [ghinita2006]. 302 Reference [mahajan2003] shows that by using self-tuning, it is 303 possible to achieve high reliability and performance even in adverse 304 conditions with low maintenance cost. Adaptive stabilization has 305 been shown to outperform periodic stabilization in terms of both 306 lookup failures and communication overhead [ghinita2006]. 308 4. Introduction to Chord 310 Chord [Chord] is a structured P2P algorithm that uses consistent 311 hashing to build a DHT out of several independent peers. Consistent 312 hashing assigns each peer and resource a fixed-length identifier. 313 Peers use SHA-1 as the base hash function to generate the identifiers. 314 As specified in RELOAD base, the length of the identifiers is 315 numBitsInNodeId=128 bits. The identifiers are ordered on an 316 identifier circle of size 2^numBitsInNodeId. On the identifier 317 circle, key k is assigned to the first peer whose identifier equals 318 or follows the identifier of k in the identifier space. The 319 identifier circle is called the Chord ring. 321 Different DHTs differ significantly in performance when bandwidth is 322 limited. It has been shown that when compared to other DHTs, the 323 advantages of Chord include that it uses bandwidth efficiently and 324 can achieve low lookup latencies at little cost [li2004]. 326 A simple lookup mechanism could be implemented on a Chord ring by 327 requiring each peer to only know how to contact its current successor 328 on the identifier circle. Queries for a given identifier could then 329 be passed around the circle via the successor pointers until they 330 encounter the first peer whose identifier is equal to or larger than 331 the desired identifier. Such a lookup scheme uses a number of 332 messages that grows linearly with the number of peers. To reduce the 333 cost of lookups, Chord also maintains additional routing information; 334 each peer n maintains a data structure with up to numBitsInNodeId 335 entries, called the finger table. The first entry in the finger 336 table of peer n contains the peer half-way around the ring from peer 337 n. The second entry contains the peer that is 1/4th of the way 338 around, the third entry the peer that is 1/8th of the way around, 339 etc. In other words, the ith entry in the finger table at peer n 340 contains the identity of the first peer s that succeeds n by at least 341 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the ith 342 finger of peer n. The interval between two consecutive fingers is 343 called a finger interval. The ith finger interval of peer n covers 344 the range [n.id + 2^(numBitsInNodeId-i), n.id + 345 2^(numBitsInNodeId-i+1)) on the Chord ring. In an N-peer network, 346 each peer maintains information about O(log2(N)) other peers in its 347 finger table. As an example, if N=100000, it is sufficient to 348 maintain 17 fingers.
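As a non-normative illustration of the finger table geometry described above, the following Python sketch computes the target identifiers that a peer's fingers must equal or follow; the names used are illustrative only and not part of any protocol.

   NUM_BITS = 128               # numBitsInNodeId in RELOAD base
   ID_SPACE = 2 ** NUM_BITS

   def finger_targets(node_id, num_fingers):
       # The ith finger of peer n is the first peer whose identifier
       # equals or follows n.id + 2^(numBitsInNodeId-i) on the ring.
       return [(node_id + 2 ** (NUM_BITS - i)) % ID_SPACE
               for i in range(1, num_fingers + 1)]

   # With N = 100000 peers, about log2(N), i.e. 17, fingers suffice;
   # the peer tracks the first peer at or after each target below.
   targets = finger_targets(node_id=2 ** 100, num_fingers=17)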
350 Chord needs all peers' successor pointers to be up to date in order 351 to ensure that lookups produce correct results as the set of 352 participating peers changes. To achieve this, peers run a 353 stabilization protocol periodically in the background. The 354 stabilization protocol of the original Chord algorithm uses two 355 operations: successor stabilization and finger stabilization. 356 However, the Chord algorithm of RELOAD base defines two additional 357 stabilization components, as will be discussed below. 359 To increase robustness in the event of peer failures, each Chord peer 360 maintains a successor list of size r, containing the peer's first r 361 successors. The benefit of successor lists is that if each peer 362 fails independently with probability p, the probability that all r 363 successors fail simultaneously is only p^r. 365 The original Chord algorithm maintains only a single predecessor 366 pointer. However, multiple predecessor pointers (i.e., a predecessor 367 list) can be maintained to speed up recovery from predecessor 368 failures. The routing table of a peer consists of the successor 369 list, finger table, and predecessor list. 371 5. Extending Chord-reload to Support Self-tuning 373 This section describes how the mandatory-to-implement chord-reload 374 algorithm defined in RELOAD base [I-D.ietf-p2psip-base] can be 375 extended to support self-tuning. 377 The chord-reload algorithm supports both reactive and periodic 378 recovery strategies. When the self-tuning mechanisms defined in this 379 document are used, the periodic recovery strategy MUST be used. 380 Further, chord-reload specifies that at least three predecessors and 381 three successors need to be maintained. When the self-tuning 382 mechanisms are used, the appropriate sizes of the successor list and 383 predecessor list are determined in an adaptive fashion based on the 384 estimated network size, as will be described in Section 6. 386 As specified in RELOAD base, each peer MUST maintain a stabilization 387 timer. When the stabilization timer fires, the peer MUST restart the 388 timer and carry out the overlay stabilization routine. Overlay 389 stabilization has four components in chord-reload: 391 1. Update the neighbor table. We refer to this as neighbor 392 stabilization. 393 2. Refreshing the finger table. We refer to this as finger 394 stabilization. 395 3. Adjusting finger table size. 396 4. Detecting partitioning. We refer to this as strong 397 stabilization. 399 As specified in RELOAD base [I-D.ietf-p2psip-base], a peer sends 400 periodic messages as part of the neighbor stabilization, finger 401 stabilization, and strong stabilization routines. In neighbor 402 stabilization, a peer periodically sends an Update request to every 403 peer in its Connection Table. The default time is every ten minutes. 404 In finger stabilization, a peer periodically searches for new peers 405 to include in its finger table. This time defaults to one hour. 406 This document specifies how the neighbor stabilization and finger 407 stabilization intervals can be determined in an adaptive fashion 408 based on the operating conditions of the overlay. The subsections 409 below describe how this document extends the four components of 410 stabilization. 412 5.1. Update Requests 414 As described in RELOAD base [I-D.ietf-p2psip-base], the neighbor and 415 finger stabilization procedures are implemented using Update 416 requests. 
RELOAD base defines three types of Update requests: 417 'peer_ready', 'neighbors', and 'full'. Regardless of the type, all 418 Update requests include an 'uptime' field. Since the self-tuning 419 extensions require information on the uptimes of peers in the routing 420 table, the sender of an Update request MUST include its current 421 uptime in seconds in the 'uptime' field. 423 When self-tuning is used, each peer decides independently the 424 appropriate size for the successor list, predecessor list and finger 425 table. Thus, the 'predecessors', 'successors', and 'fingers' fields 426 included in RELOAD Update requests are of variable length. As 427 specified in RELOAD [I-D.ietf-p2psip-base], variable-length fields 428 are preceded on the wire by length bytes. In the case of the 429 successor list, predecessor list, and finger table, there are two 430 length bytes (allowing lengths up to 2^16-1). The number of NodeId 431 structures included in each field can be calculated based on the 432 length bytes since the size of a single NodeId structure is 16 bytes. 433 If a peer receives more entries than fit into its successor list, 434 predecessor list or finger table, the peer MUST ignore the extra 435 entries. If a peer receives fewer entries than it currently has in 436 its own data structure, the peer MUST NOT drop the extra entries from 437 its data structure. 439 5.2. Neighbor Stabilization 441 In the neighbor stabilization operation of chord-reload, a peer 442 periodically sends an Update request to every peer in its Connection 443 Table. In a small, low-churn overlay, the amount of traffic this 444 process generates is typically acceptable. However, in a large-scale 445 overlay churning at a moderate or high churn rate, the traffic load 446 may no longer be acceptable since the size of the connection table is 447 large and the stabilization interval relatively short. The self- 448 tuning mechanisms described in this document are especially designed 449 for overlays of the latter type. Therefore, when the self-tuning 450 mechanisms are used, each peer MUST send a periodic Update request 451 only to its first predecessor and first successor on the Chord ring. 453 The neighbor stabilization routine MUST be executed when the 454 stabilization timer fires. To begin the neighbor stabilization 455 routine, a peer MUST send an Update request to its first successor 456 and its first predecessor. The type of the Update request MUST be 457 'neighbors'. The Update request MUST include the successor and 458 predecessor lists of the sender. If a peer receiving such an Update 459 request learns from the predecessor and successor lists included in 460 the request that new peers can be included in its neighborhood set, 461 it MUST send Attach requests to the new peers. 463 After a new peer has been added to the predecessor or successor list, 464 an Update request of type 'peer_ready' MUST be sent to the new peer. 465 This allows the new peer to insert the sender into its neighborhood 466 set. 468 5.3. Finger Stabilization 470 Chord-reload specifies two alternative methods for finding new 471 peers for the finger table. Both of the alternatives can be used with 472 the self-tuning extensions defined in this document. 474 Immediately after a new peer has been added to the finger table, a 475 Probe request MUST be sent to the new peer to fetch its uptime.
The 476 requested_info field of the Probe request MUST be set to contain the 477 ProbeInformationType 'uptime' defined in RELOAD base 478 [I-D.ietf-p2psip-base]. 480 5.4. Adjusting Finger Table Size 482 The chord-reload algorithm defines how a peer can make sure that the 483 finger table is appropriately sized to allow for efficient routing. 484 Since the self-tuning mechanisms specified in this document produce a 485 network size estimate, this estimate can be directly used to 486 calculate the optimal size for the finger table. This mechanism MUST 487 be used instead of the one specified by chord-reload. A peer MUST 488 use the network size estimate to determine whether it needs to adjust 489 the size of its finger table each time the stabilization timer 490 fires. The way this is done is explained in Section 6.2. 492 5.5. Detecting Partitioning 494 This document does not require any changes to the mechanism chord- 495 reload uses to detect network partitioning. 497 5.6. Leaving the Overlay 499 As specified in RELOAD base [I-D.ietf-p2psip-base], a leaving peer 500 SHOULD send a Leave request to all members of its neighbor table 501 prior to leaving the overlay. The overlay_specific_data field MUST 502 contain the ChordLeaveData structure. The Leave requests that are 503 sent to successors MUST contain the predecessor list of the leaving 504 peer. The Leave requests that are sent to the predecessors MUST 505 contain the successor list of the leaving peer. If a given successor 506 can identify better predecessors than are already included in its 507 predecessor list by investigating the predecessor list it receives 508 from the leaving peer, it MUST send Attach requests to them. 509 Similarly, if a given predecessor identifies better successors by 510 investigating the successor list it receives from the leaving peer, 511 it MUST send Attach requests to them. 513 6. Self-tuning Chord Parameters 515 This section specifies how to determine an appropriate stabilization 516 rate and routing table size in an adaptive fashion. The proposed 517 mechanism is based on [mahajan2003], [liben-nowell2002], and 518 [ghinita2006]. To calculate an appropriate stabilization rate, the 519 values of three parameters MUST be estimated: overlay size N, failure 520 rate U, and join rate L. To calculate an appropriate routing table 521 size, the estimated network size N can be used. Peers in the overlay 522 MUST re-calculate the values of the parameters to self-tune the 523 chord-reload algorithm at the end of each stabilization period before 524 re-starting the stabilization timer. 526 6.1. Estimating Overlay Size 528 Techniques for estimating the size of an overlay network have been 529 proposed for instance in [mahajan2003], [horowitz2003], 530 [kostoulas2005], [binzenhofer2006], and [ghinita2006]. In Chord, the 531 density of peer identifiers in the neighborhood set can be used to 532 produce an estimate of the size of the overlay, N [mahajan2003]. 533 Since peer identifiers are picked randomly with uniform probability 534 from the numBitsInNodeId-bit identifier space, the average distance 535 between peer identifiers in the successor set is 536 (2^numBitsInNodeId)/N. 538 To estimate the overlay network size, a peer MUST compute the average 539 inter-peer distance d between the successive peers starting from the 540 most distant predecessor and ending at the most distant successor in 541 the successor list.
The estimated network size MUST be calculated 542 as: 544 2^numBitsInNodeId 545 N = ------------------- 546 d 548 This estimate has been found to be accurate within 15% of the real 549 network size [ghinita2006]. Of course, the size of the neighborhood 550 set affects the accuracy of the estimate. 552 During the join process, a joining peer fills its routing table by 553 sending a series of Ping and Attach requests, as specified in RELOAD 554 base [I-D.ietf-p2psip-base]. Thus, a joining peer immediately has 555 enough information at its disposal to calculate an estimate of the 556 network size. 558 6.2. Determining Routing Table Size 560 As specified in RELOAD base, the finger table must contain at least 561 16 entries. When the self-tuning mechanisms are used, the size of 562 the finger table MUST be set to max(log2(N), 16) using the estimated 563 network size N. 565 The size of the successor list MUST be set to log2(N). An 566 implementation MAY place a lower limit on the size of the successor 567 list. As an example, the implementation might require the size of 568 the successor list to be always at least three. 570 A peer MAY choose to maintain a fixed-size predecessor list with only 571 three entries as specified in RELOAD base. However, it is 572 RECOMMENDED that a peer maintain log2(N) predecessors. 574 6.3. Estimating Failure Rate 576 A typical approach is to assume that peers join the overlay according 577 to a Poisson process with rate L and leave according to a Poisson 578 process with rate parameter U [mahajan2003]. The value of U can be 579 estimated using peer failures in the finger table and neighborhood 580 set [mahajan2003]. If peers fail with rate U, a peer with M unique 581 peer identifiers in its routing table should observe K failures in 582 time K/(M*U). Every peer in the overlay MUST maintain a history of 583 the last K failures. The current time MUST be inserted into the 584 history when the peer joins the overlay. The estimate of U MUST be 585 calculated as: 587 k 588 U = --------, 589 M * Tk 591 where M is the number of unique peer identifiers in the routing 592 table, Tk is the time between the first and the last failure in the 593 history, and k is the number of failures in the history. If k is 594 smaller than K, the estimate MUST be computed as if there were a 595 failure at the current time. It has been shown that an estimate 596 calculated in a similar manner is accurate within 17% of the real 597 value of U [ghinita2006]. 599 The size of the failure history K affects the accuracy of the 600 estimate of U. One can increase the accuracy by increasing K. 601 However, this has the side effect of decreasing responsiveness to 602 changes in the failure rate. On the other hand, a small history size 603 may cause a peer to overreact each time a new failure occurs. In 604 [ghinita2006], K is set to 25% of the routing table size. Use of this 605 approach is RECOMMENDED. 607 6.3.1. Detecting Failures 609 A new failure MUST be inserted into the failure history in the 610 following cases: 612 1. A Leave request is received from a neighbor. 613 2. A peer fails to reply to a Ping request sent in the situation 614 explained below. If no packets have been received on a 615 connection during the past 2*Tr seconds (where Tr is the 616 inactivity timer defined by ICE [I-D.ietf-mmusic-ice]), a RELOAD 617 Ping request MUST be sent to the remote peer. RELOAD mandates 618 the use of STUN [RFC5389] for keepalives.
STUN keepalives take 619 the form of STUN Binding Indication transactions. As specified 620 in ICE [I-D.ietf-mmusic-ice], a peer sends a STUN Binding 621 Indication if there has been no packet sent on a connection for 622 Tr seconds. Tr is configurable and has a default of 15 seconds. 623 Although STUN Binding Indications do not generate a response, the 624 fact that a peer has failed can be learned from the lack of 625 packets (Binding Indications or application protocol packets) 626 received from the peer. If the remote peer fails to reply to the 627 Ping request, the sender MUST consider the remote peer to have 628 failed. 630 As an alternative to relying on STUN keepalives to detect peer 631 failure, a peer could send additional, frequent RELOAD messages to 632 every peer in its Connection Table. These messages could be Update 633 requests, in which case they would serve two purposes: detecting peer 634 failure and stabilization. However, as the cost of this approach can 635 be very high in terms of bandwidth consumption and traffic load, 636 especially in large-scale overlays experiencing churn, its use is NOT 637 RECOMMENDED. 639 6.4. Estimating Join Rate 641 Reference [ghinita2006] proposes that a peer can estimate the join 642 rate based on the uptime of the peers in its routing table. An 643 increase in peer join rate will be reflected by a decrease in the 644 average age of peers in the routing table. Thus, each peer MUST 645 maintain an array of the ages of the peers in its routing table 646 sorted in increasing order. Using this information, an estimate of 647 the global peer join rate L MUST be calculated as: 649 N 1 650 L = --- * ---------------, 651 4 Ages[rsize/4] 653 where Ages is an array containing the ages of the peers in the 654 routing table sorted in increasing order and rsize is the size of the 655 routing table. It is RECOMMENDED that only the ages of the 25% of 656 the youngest peers in the routing table (i.e., the 25th percentile) 657 are used to reduce the bias that a small number of peers with very 658 old ages can cause [ghinita2006]. It has been shown that the 659 estimate obtained by using this method is accurate within 22% of the 660 real join rate [ghinita2006]. Of course, the size of the routing 661 table affects the accuracy. 663 In order for this mechanism to work, peers need to exchange 664 information about the time they have been present in the overlay. 665 Peers receive the uptimes of their successors and predecessors during 666 the stabilization operations since all Update requests carry uptime 667 values. A joining peer learns the uptime of the admitting peer since 668 it receives an Update from the admitting peer during the join 669 procedure. Peers learn the uptimes of new fingers since they can 670 fetch the uptime using a Probe request after having attached to the 671 new finger. 673 6.5. Estimate Sharing 675 To improve the accuracy of network size, join rate, and leave rate 676 estimates, peers MUST share their estimates. When the stabilization 677 timer fires, a peer MUST select number-of-peers-to-probe random peers 678 from its finger table and send each of them a Probe request. The 679 targets of Probe requests are selected from the finger table rather 680 than from the neighbor table since neighbors are likely to make 681 similar errors when calculating their estimates. number-of-peers-to- 682 probe is a new element in the overlay configuration document. It is 683 defined in Section 7 and has a default value of 4. 
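The estimates exchanged in this procedure are the ones computed locally as described in Sections 6.1, 6.3, and 6.4. As a non-normative illustration, the following Python sketch shows one way a peer could compute them; the helper names and data structures are purely illustrative and not part of the protocol.

   import time

   NUM_BITS = 128                       # numBitsInNodeId
   ID_SPACE = 2 ** NUM_BITS

   def estimate_network_size(neighborhood_ids):
       # Section 6.1: N = 2^numBitsInNodeId / d, where d is the
       # average distance between successive peers, listed in ring
       # order from the most distant predecessor to the most
       # distant successor.
       gaps = [(b - a) % ID_SPACE
               for a, b in zip(neighborhood_ids, neighborhood_ids[1:])]
       d = sum(gaps) / len(gaps)
       return ID_SPACE / d

   def estimate_failure_rate(failure_times, m, big_k):
       # Section 6.3: U = k / (M * Tk).  If fewer than K failures
       # have been recorded, act as if one occurred at the current
       # time.
       times = sorted(failure_times)
       if len(times) < big_k:
           times.append(time.time())
       k = len(times)
       tk = times[-1] - times[0]
       return k / (m * tk)

   def estimate_join_rate(n_estimate, ages):
       # Section 6.4: L = (N / 4) * 1 / Ages[rsize/4], where Ages
       # holds the ages of the routing table entries in increasing
       # order (the 25th percentile of peer ages).
       ages = sorted(ages)
       return (n_estimate / 4.0) / ages[len(ages) // 4]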
Both the Probe 684 request and the answer returned by the target peer MUST contain a new 685 message extension whose MessageExtensionType is 'self_tuning_data'. 686 This extension type is defined in Section 9.1. The 687 extension_contents field of the MessageExtension structure MUST 688 contain a SelfTuningData structure: 690 struct { 691 uint32 network_size; 692 uint32 join_rate; 693 uint32 leave_rate; 694 } SelfTuningData; 696 The contents of the SelfTuningData structure are as follows: 698 network_size 699 The latest network size estimate calculated by the sender. 701 join_rate 702 The latest join rate estimate calculated by the sender. 704 leave_rate 705 The latest leave rate estimate calculated by the sender. 707 The join and leave rates are expressed as joins or failures per 24 708 hours. As an example, if the global join rate estimate a peer has 709 calculated is 0.123 peers/s, it would include in the join_rate field 710 the value 10627 (24*60*60*0.123 = 10627.2). 712 The 'type' field of the MessageExtension structure MUST be set to 713 contain the value 'self_tuning_data'. The 'critical' field of the 714 structure MUST be set to False. 716 A peer MUST store all estimates it receives in Probe requests and 717 answers during a stabilization interval. When the stabilization 718 timer fires, the peer MUST calculate the estimates to be used during 719 the next stabilization interval by taking the 75th percentile of a 720 data set containing its own estimate and the received estimates. 722 6.6. Calculating the Stabilization Interval 724 According to [liben-nowell2002], a Chord network in a ring-like state 725 remains in a ring-like state as long as peers send Omega(log2^2(N)) 726 messages before N new peers join or N/2 peers fail. We can use the 727 estimate of peer failure rate, U, to calculate the time Tf in which 728 N/2 peers fail: 730 1 731 Tf = ------ 732 2*U 734 Based on this estimate, a stabilization interval Tstab-1 MUST be 735 calculated as: 737 Tf 738 Tstab-1 = ----------- 739 log2^2(N) 741 On the other hand, the estimated join rate L can be used to calculate 742 the time in which N new peers join the overlay. Based on the 743 estimate of L, a stabilization interval Tstab-2 MUST be calculated 744 as: 746 N 747 Tstab-2 = --------------- 748 L * log2^2(N) 750 Finally, the actual stabilization interval Tstab that MUST be used 751 can be obtained by taking the minimum of Tstab-1 and Tstab-2. 753 The results obtained in [maenpaa2009] indicate that making the 754 stabilization interval too small has the effect of making the overlay 755 less stable (e.g., in terms of detected loops and path failures). 756 Thus, a lower limit should be used for the stabilization period. 757 Based on the results in [maenpaa2009], a lower limit of 15s is 758 RECOMMENDED, since using a stabilization period smaller than this 759 will with a high probability cause too much traffic in the overlay. 761 7. Overlay Configuration Document Extension 763 This document extends the RELOAD overlay configuration document by 764 adding one new element, "number-of-peers-to-probe", inside each 765 "configuration" element. 767 self-tuning:number-of-peers-to-probe: The number of fingers to which 768 Probe requests are sent to obtain their network size, join rate, 769 and leave rate estimates. The default value is 4. 
771 This new element is formally defined as follows: 773 namespace self-tuning = "urn:ietf:params:xml:ns:p2p:self-tuning" 775 parameter &= element self-tuning:number-of-peers-to-probe { xsd: 776 unsignedInt } 778 This namespace is added into the <mandatory-extension> element in the 779 overlay configuration file. 781 8. Security Considerations 783 There are no new security considerations introduced in this document. 784 The security considerations of RELOAD [I-D.ietf-p2psip-base] apply. 786 9. IANA Considerations 787 9.1. Message Extensions 789 This document introduces one additional extension to the "RELOAD 790 Extensions" Registry: 792 +------------------+-------+---------------+ 793 | Extension Name | Code | Specification | 794 +------------------+-------+---------------+ 795 | self_tuning_data | 3 | RFC-AAAA | 796 +------------------+-------+---------------+ 798 The contents of the extension are defined in Section 6.5. 800 10. Acknowledgments 802 The authors would like to thank Jani Hautakorpi for his contributions 803 to the document. 805 11. References 807 11.1. Normative References 809 [I-D.ietf-mmusic-ice] 810 Rosenberg, J., "Interactive Connectivity Establishment 811 (ICE): A Protocol for Network Address Translator (NAT) 812 Traversal for Offer/Answer Protocols", 813 draft-ietf-mmusic-ice-19 (work in progress), October 2007. 815 [I-D.ietf-p2psip-base] 816 Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and 817 H. Schulzrinne, "REsource LOcation And Discovery (RELOAD) 818 Base Protocol", draft-ietf-p2psip-base-24 (work in 819 progress), January 2013. 821 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 822 Requirement Levels", BCP 14, RFC 2119, March 1997. 824 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, 825 "Session Traversal Utilities for NAT (STUN)", RFC 5389, 826 October 2008. 828 11.2. Informative References 830 [CAN] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S. 831 Shenker, "A scalable content-addressable network", In 832 Proc. of the 2001 Conference on Applications, 833 Technologies, Architectures and Protocols for Computer 834 Communications 2001, pp. 161-172. 836 [Chord] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., 837 Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A 838 Scalable Peer-to-peer Lookup Service for Internet 839 Applications", IEEE/ACM Transactions on Networking Volume 840 11, Issue 1, 17-32, Feb 2003. 842 [I-D.ietf-p2psip-concepts] 843 Bryan, D., Willis, D., Shim, E., Matthews, P., and S. 844 Dawkins, "Concepts and Terminology for Peer to Peer SIP", 845 draft-ietf-p2psip-concepts-04 (work in progress), 846 October 2011. 848 [Pastry] Rowstron, A. and P. Druschel, "Pastry: Scalable, 849 Decentralized Object Location and Routing for Large-Scale 850 Peer-to-Peer Systems", In Proc. of the IFIP/ACM 851 International Conference on Distributed Systems 852 Platforms Nov. 2001, pp. 329-350. 854 [binzenhofer2006] 855 Binzenhofer, A., Kunzmann, G., and R. Henjes, "A scalable 856 algorithm to monitor chord-based P2P systems at runtime", 857 20th International Parallel and Distributed Processing 858 Symposium April 2006. 860 [ghinita2006] 861 Ghinita, G. and Y. Teo, "An adaptive stabilization 862 framework for distributed hash tables", 20th International 863 Parallel and Distributed Processing Symposium April 2006. 865 [horowitz2003] 866 Horowitz, K. and D. Malkhi, "Estimating network size from 867 local information", Information Processing Letters Dec. 868 2003, Volume 88, Issue 5, pp. 237-243.
870 [kostoulas2005] 871 Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and 872 A. Demers, "Decentralized schemes for size estimation in 873 large and dynamic groups", Fourth IEEE International 874 Symposium on Network Computing and Applications July 2005, 875 pp. 41-48. 877 [krishnamurthy2008] 878 Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. 879 Haridi, "Comparing maintenance strategies for overlays", 880 In Proc. of 16th Euromicro Conference on Parallel, 881 Distributed and Network-Based Processing Feb. 2008, pp. 882 473-482. 884 [li2004] Li, J., Stribling, J., Gil, T., and M. Kaashoek, 885 "Comparing the performance of distributed hash tables 886 under churn", In Proc. of the 3rd International Workshop 887 on Peer-to-Peer Systems 2004. 889 [liben-nowell2002] 890 Liben-Nowell, D., Balakrishnan, H., and D. Karger, 891 "Observations on the dynamic evolution of peer-to-peer 892 networks", In Proc. of the First International Workshop on 893 Peer-to-Peer Systems March 2002. 895 [maenpaa2009] 896 Maenpaa, J. and G. Camarillo, "A study on maintenance 897 operations in a Chord-based Peer-to-Peer Session 898 Initiation Protocol overlay network", In Proc. of IPDPS 899 2009 May 2009. 901 [mahajan2003] 902 Mahajan, R., Castro, M., and A. Rowstron, "Controlling the 903 cost of reliability in peer-to-peer overlays", In 904 Proceedings of the 2nd International Workshop on Peer-to- 905 Peer Systems Feb. 2003. 907 [rhea2004] 908 Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, 909 "Handling churn in a DHT", In Proc. of the USENIX Annual 910 Technical Conference June 2004. 912 Authors' Addresses 914 Jouni Maenpaa 915 Ericsson 916 Hirsalantie 11 917 Jorvas 02420 918 Finland 920 Email: Jouni.Maenpaa@ericsson.com 921 Gonzalo Camarillo 922 Ericsson 923 Hirsalantie 11 924 Jorvas 02420 925 Finland 927 Email: Gonzalo.Camarillo@ericsson.com