P2PSIP Working Group                                          J. Maenpaa
Internet-Draft                                               G. Camarillo
Intended status: Standards Track                                 Ericsson
Expires: December 28, 2014                                  June 26, 2014

   Self-tuning Distributed Hash Table (DHT) for REsource LOcation And
                           Discovery (RELOAD)
                   draft-ietf-p2psip-self-tuning-15.txt

Abstract

REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P) signaling protocol that provides an overlay network service. Peers in a RELOAD overlay network collectively run an overlay algorithm to organize the overlay, and to store and retrieve data. This document describes how the default topology plugin of RELOAD can be extended to support self-tuning, that is, to adapt to changing operating conditions such as churn and network size.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on December 28, 2014.

Copyright Notice

Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 3. Introduction to Stabilization in DHTs . . . . . . . . . . . . 5 58 3.1. Reactive vs. Periodic Stabilization . . . . . . . . . . . 6 59 3.2. Configuring Periodic Stabilization . . . . . . . . . . . 6 60 3.3. Adaptive Stabilization . . . . . . . . . . . . . . . . . 8 61 4. Introduction to Chord . . . . . . . . . . . . . . . . . . . . 8 62 5. Extending Chord-reload to Support Self-tuning . . . . . . . . 9 63 5.1. Update Requests . . . . . . . . . . . . . . . . . . . . . 10 64 5.2. Neighbor Stabilization . . . . . . . . . . . . . . . . . 11 65 5.3. Finger Stabilization . . . . . . . . . . . . . . . . . . 11 66 5.4. Adjusting Finger Table Size . . . . . . . . . . . . . . . 12 67 5.5. Detecting Partitioning . . . . . . . . . . . . . . . . . 12 68 5.6. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 12 69 6. Self-tuning Chord Parameters . . . . . . . . . . . . . . . . 12 70 6.1. Estimating Overlay Size . . . . . . . . . . . . . . . . . 13 71 6.2. Determining Routing Table Size . . . . . . . . . . . . . 13 72 6.3. Estimating Failure Rate . . . . . . . . . . . . . . . . . 14 73 6.3.1. Detecting Failures . . . . . . . . . . . . . . . . . 14 74 6.4. Estimating Join Rate . . . . . . . . . . . . . . . . . . 15 75 6.5. Estimate Sharing . . . . . . . . . . . . . . . . . . . . 16 76 6.6. Calculating the Stabilization Interval . . . . . . . . . 17 77 7. Overlay Configuration Document Extension . . . . . . . . . . 18 78 8. Security Considerations . . . . . . . . . . . . . . . . . . . 18 79 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 80 9.1. Message Extensions . . . . . . . . . . . . . . . . . . . 19 81 9.2. New Overlay Algorithm Type . . . . . . . . . . . . . . . 19 82 9.3. A New IETF XML Registry . . . . . . . . . . . . . . . . . 19 83 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 20 84 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 20 85 11.1. Normative References . . . . . . . . . . . . . . . . . . 20 86 11.2. Informative References . . . . . . . . . . . . . . . . . 20 87 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 22 89 1. Introduction 91 REsource LOcation And Discovery (RELOAD) [RFC6940] is a peer-to-peer 92 signaling protocol that can be used to maintain an overlay network, 93 and to store data in and retrieve data from the overlay. For 94 interoperability reasons, RELOAD specifies one overlay algorithm, 95 called chord-reload, that is mandatory to implement. This document 96 extends the chord-reload algorithm by introducing self-tuning 97 behavior. 99 Distributed Hash Table (DHT) based overlay networks are self- 100 organizing, scalable and reliable. However, these features come at a 101 cost: peers in the overlay network need to consume network bandwidth 102 to maintain routing state. Most DHTs use a periodic stabilization 103 routine to counter the undesirable effects of churn on routing. To 104 configure the parameters of a DHT, some characteristics such as churn 105 rate and network size need to be known in advance. 
These 106 characteristics are then used to configure the DHT in a static 107 fashion by using fixed values for parameters such as the size of the 108 successor set, size of the routing table, and rate of maintenance 109 messages. The problem with this approach is that it is not possible 110 to achieve a low failure rate and a low communication overhead by 111 using fixed parameters. Instead, a better approach is to allow the 112 system to take into account the evolution of network conditions and 113 adapt to them. 115 This document extends the mandatory-to-implement chord-reload 116 algorithm by making it self-tuning. The use of the self-tuning 117 feature is optional. However, when used, it needs to be supported by 118 all peers in the RELOAD overlay network. The fact that a RELOAD 119 overlay uses the self-tuning feature is indicated in the RELOAD 120 overlay configuration document using the CHORD-SELF-TUNING algorithm 121 name specified in Section 9.2 in the topology-plugin element. Two 122 main advantages of self-tuning are that users no longer need to tune 123 every DHT parameter correctly for a given operating environment and 124 that the system adapts to changing operating conditions. 126 The remainder of this document is structured as follows: Section 2 127 provides definitions of terms used in this document. Section 3 128 discusses alternative approaches to stabilization operations in DHTs, 129 including reactive stabilization, periodic stabilization, and 130 adaptive stabilization. Section 4 gives an introduction to the Chord 131 DHT algorithm. Section 5 describes how this document extends the 132 stabilization routine of the chord-reload algorithm. Section 6 133 describes how the stabilization rate and routing table size are 134 calculated in an adaptive fashion. 136 2. Terminology 138 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 139 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 140 "OPTIONAL" in this document are to be interpreted as described in RFC 141 2119 [RFC2119]. 143 This document uses terminology and definitions from the RELOAD base 144 specification [RFC6940]. 146 numBitsInNodeId: Specifies the number of bits in a RELOAD Node-ID. 148 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 149 distributed systems that provide a lookup service similar to a 150 regular hash table. Given a key, any peer participating in the 151 system can retrieve the value associated with that key. The 152 responsibility for maintaining the mapping from keys to values is 153 distributed among the peers. 155 Chord Ring: The Chord DHT uses ring topology and orders identifiers 156 on an identifier circle of size 2^numBitsInNodeId. This 157 identifier circle is called the Chord ring. On the Chord ring, 158 the responsibility for a key k is assigned to the node whose 159 identifier equals to or immediately follows k. 161 Finger Table: A data structure with up to (but typically less than) 162 numBitsInNodeId entries maintained by each peer in a Chord-based 163 overlay. The ith entry in the finger table of peer n contains the 164 identity of the first peer that succeeds n by at least 165 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the 166 ith finger of peer n. As an example, the first entry in the 167 finger table of peer n contains a peer half-way around the Chord 168 ring from peer n. The purpose of the finger table is to 169 accelerate lookups. 
n.id: An abbreviation used in this document to refer to the Node-ID of peer n.

O(g(n)): Informally, f(n) = O(g(n)) means that f(n) is less than some constant multiple of g(n). For the formal definition, please refer to [weiss1998].

Omega(g(n)): Informally, f(n) = Omega(g(n)) means that f(n) is more than some constant multiple of g(n). For the formal definition, please refer to [weiss1998].

Percentile: The Pth (0<=P<=100) percentile of N values arranged in ascending order is obtained by first calculating the (ordinal) rank n=(P/100)*N, rounding the result to the nearest integer, and then taking the value corresponding to that rank.

Predecessor List: A data structure containing the first r predecessors of a peer on the Chord ring.

Successor List: A data structure containing the first r successors of a peer on the Chord ring.

Neighborhood Set: A term used to refer to the set of peers included in the successor and predecessor lists of a given peer.

Routing Table: The set of peers that a given peer can use to route overlay messages. The routing table is made up of the finger table, successor list, and predecessor list.

3. Introduction to Stabilization in DHTs

DHTs use stabilization routines to counter the undesirable effects of churn on routing. The purpose of stabilization is to keep the routing information of each peer in the overlay consistent with the constantly changing overlay topology. There are two alternative approaches to stabilization: periodic and reactive [rhea2004]. Periodic stabilization can either use a fixed stabilization rate or calculate the stabilization rate in an adaptive fashion.

3.1. Reactive vs. Periodic Stabilization

In reactive stabilization, a peer reacts to the loss of a peer in its neighborhood set, or to the appearance of a new peer that should be added to its neighborhood set, by sending a copy of its neighbor table to all peers in the neighborhood set. Periodic recovery, in contrast, takes place independently of changes in the neighborhood set. In periodic recovery, a peer periodically shares its neighborhood set with each or a subset of the members of that set.

The chord-reload algorithm [RFC6940] supports both reactive and periodic stabilization. It has been shown in [rhea2004] that reactive stabilization works well for small neighborhood sets (i.e., small overlays) and moderate churn. However, in large-scale (e.g., 1000 peers or more [rhea2004]) or high-churn overlays, reactive stabilization runs the risk of creating a positive feedback cycle, which can eventually result in congestion collapse. In [rhea2004], it is shown that a 1000-peer overlay under churn uses significantly less bandwidth and has lower latencies when periodic stabilization is used than when reactive stabilization is used. Although reactive stabilization performed well in the no-churn experiments of [rhea2004], its bandwidth use was observed to jump dramatically under churn. At higher churn rates and in larger-scale overlays, periodic stabilization uses less bandwidth, and the resulting lower contention for the network leads to lower latencies. For this reason, most DHTs such as CAN [CAN], Chord [Chord], Pastry [Pastry], Bamboo [rhea2004], etc.
use periodic 238 stabilization [ghinita2006]. As an example, the first version of 239 Bamboo used reactive stabilization, which caused Bamboo to suffer 240 from degradation in performance under churn. To fix this problem, 241 Bamboo was modified to use periodic stabilization. 243 In Chord, periodic stabilization is typically done both for 244 successors and fingers. An alternative strategy is analyzed in 245 [krishnamurthy2008]. In this strategy, called the correction-on- 246 change maintenance strategy, a peer periodically stabilizes its 247 successors but does not do so for its fingers. Instead, finger 248 pointers are stabilized in a reactive fashion. The results obtained 249 in [krishnamurthy2008] imply that although the correction-on-change 250 strategy works well when churn is low, periodic stabilization 251 outperforms the correction-on-change strategy when churn is high. 253 3.2. Configuring Periodic Stabilization 255 When periodic stabilization is used, one faces the problem of 256 selecting an appropriate execution rate for the stabilization 257 procedure. If the execution rate of periodic stabilization is high, 258 changes in the system can be quickly detected, but at the 259 disadvantage of increased communication overhead. Alternatively, if 260 the stabilization rate is low and the churn rate is high, routing 261 tables become inaccurate and DHT performance deteriorates. Thus, the 262 problem is setting the parameters so that the overlay achieves the 263 desired reliability and performance even in challenging conditions, 264 such as under heavy churn. This naturally results in high cost 265 during periods when the churn level is lower than expected, or 266 alternatively, poor performance or even network partitioning in worse 267 than expected conditions. 269 In addition to selecting an appropriate stabilization interval, 270 regardless of whether periodic stabilization is used or not, an 271 appropriate size needs to be selected for the neighborhood set and 272 for the finger table. 274 The current approach is to configure overlays statically. This works 275 in situations where perfect information about the future is 276 available. In situations where the operating conditions of the 277 network are known in advance and remain static throughout the 278 lifetime of the system, it is possible to choose fixed optimal values 279 for parameters such as stabilization rate, neighborhood set size and 280 routing table size. However, if the operating conditions (e.g., the 281 size of the overlay and its churn rate) do not remain static but 282 evolve with time, it is not possible to achieve both a low lookup 283 failure rate and a low communication overhead by using fixed 284 parameters [ghinita2006]. 286 As an example, to configure the Chord DHT algorithm, one needs to 287 select values for the following parameters: size of successor list, 288 stabilization interval, and size of the finger table. To select an 289 appropriate value for the stabilization interval, one needs to know 290 the expected churn rate and overlay size. According to 291 [liben-nowell2002], a Chord network in a ring-like state remains in a 292 ring-like state as long as peers send Omega(square(log(N))) messages 293 before N new peers join or N/2 peers fail. Thus, in a 500-peer 294 overlay churning at a rate such that one peer joins and one peer 295 leaves the network every 30 seconds, an appropriate stabilization 296 interval would be on the order of 93s. 
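As a rough, non-normative illustration of the arithmetic behind this example, the Python sketch below computes the interval at which each peer must stabilize so that it performs about square(log2(N)) maintenance operations before N/2 peers have failed; the function and parameter names are assumptions of this sketch, and Section 6.6 gives the formulas actually used by the self-tuning mechanism.

   import math

   def stabilization_interval(n_peers, seconds_per_failure):
       # Time in which N/2 peers fail, divided by square(log2(N))
       # (the [liben-nowell2002] bound discussed above).
       time_for_half_to_fail = (n_peers / 2) * seconds_per_failure
       return time_for_half_to_fail / (math.log2(n_peers) ** 2)

   # A 500-peer overlay in which one peer leaves every 30 seconds:
   print(int(stabilization_interval(500, 30)))   # -> 93
   # The same overlay with a doubled churn rate:
   print(int(stabilization_interval(500, 15)))   # -> 46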
According to [Chord], the size of the successor list and finger table should be on the order of log(N). Already a successor list of a modest size (e.g., log2(N) or 2*log2(N), which is the successor list size used in [Chord]) makes it very unlikely that a peer will lose all of its successors, which would cause the Chord ring to become disconnected. Thus, in a 500-peer network each peer should maintain on the order of nine successors and fingers. However, if the churn rate doubles and the network size remains unchanged, the stabilization rate should double as well. That is, the appropriate maintenance interval would now be on the order of 46s. On the other hand, if the churn rate becomes, for example, six-fold and the size of the network grows to 2000 peers, on the order of eleven fingers and successors should be maintained and the stabilization interval should be on the order of 42s. If one continued using the old values, this could result in inaccurate routing tables, network partitioning, and deteriorating performance.

3.3. Adaptive Stabilization

A self-tuning DHT takes into consideration the continuous evolution of network conditions and adapts to them. In a self-tuning DHT, each peer collects statistical data about the network and dynamically adjusts its stabilization rate, neighborhood set size, and finger table size based on the analysis of the data [ghinita2006]. Reference [mahajan2003] shows that by using self-tuning, it is possible to achieve high reliability and performance even in adverse conditions with low maintenance cost. Adaptive stabilization has been shown to outperform periodic stabilization in terms of both lookup failures and communication overhead [ghinita2006].

4. Introduction to Chord

Chord [Chord] is a structured P2P algorithm that uses consistent hashing to build a DHT out of several independent peers. Consistent hashing assigns each peer and resource a fixed-length identifier. Peers use SHA-1 as the base hash function to generate the identifiers. As specified in RELOAD base, the length of the identifiers is numBitsInNodeId=128 bits. The identifiers are ordered on an identifier circle of size 2^numBitsInNodeId. On the identifier circle, key k is assigned to the first peer whose identifier equals or follows the identifier of k in the identifier space. The identifier circle is called the Chord ring.

Different DHTs differ significantly in performance when bandwidth is limited. It has been shown that when compared to other DHTs, the advantages of Chord include that it uses bandwidth efficiently and can achieve low lookup latencies at little cost [li2004].

A simple lookup mechanism could be implemented on a Chord ring by requiring each peer to only know how to contact its current successor on the identifier circle. Queries for a given identifier could then be passed around the circle via the successor pointers until they encounter the first peer whose identifier is equal to or larger than the desired identifier. Such a lookup scheme uses a number of messages that grows linearly with the number of peers. To reduce the cost of lookups, Chord also maintains additional routing information; each peer n maintains a data structure with up to numBitsInNodeId entries, called the finger table.
The first entry in the finger table of peer n contains the peer half-way around the ring from peer n. The second entry contains the peer that is 1/4th of the way around, the third entry the peer that is 1/8th of the way around, and so on. In other words, the ith entry in the finger table at peer n contains the identity of the first peer s that succeeds n by at least 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the ith finger of peer n. The interval between two consecutive fingers is called a finger interval. The ith finger interval of peer n covers the range [n.id + 2^(numBitsInNodeId-i), n.id + 2^(numBitsInNodeId-i+1)) on the Chord ring. In an N-peer network, each peer maintains information about O(log(N)) other peers in its finger table. As an example, if N=100000, it is sufficient to maintain 17 fingers.

Chord needs all peers' successor pointers to be up to date in order to ensure that lookups produce correct results as the set of participating peers changes. To achieve this, peers run a stabilization protocol periodically in the background. The stabilization protocol of the original Chord algorithm uses two operations: successor stabilization and finger stabilization. However, the Chord algorithm of RELOAD base defines two additional stabilization components, as will be discussed below.

To increase robustness in the event of peer failures, each Chord peer maintains a successor list of size r, containing the peer's first r successors. The benefit of successor lists is that if each peer fails independently with probability p, the probability that all r successors fail simultaneously is only p^r.

The original Chord algorithm maintains only a single predecessor pointer. However, multiple predecessor pointers (i.e., a predecessor list) can be maintained to speed up recovery from predecessor failures. The routing table of a peer consists of the successor list, finger table, and predecessor list.

5. Extending Chord-reload to Support Self-tuning

This section describes how the mandatory-to-implement chord-reload algorithm defined in RELOAD base [RFC6940] can be extended to support self-tuning.

The chord-reload algorithm supports both reactive and periodic recovery strategies. When the self-tuning mechanisms defined in this document are used, the periodic recovery strategy is used. Further, chord-reload specifies that at least three predecessors and three successors need to be maintained. When the self-tuning mechanisms are used, the appropriate sizes of the successor list and predecessor list are determined in an adaptive fashion based on the estimated network size, as will be described in Section 6.

As specified in RELOAD base, each peer maintains a stabilization timer. When the stabilization timer fires, the peer restarts the timer and carries out the overlay stabilization routine. Overlay stabilization has four components in chord-reload (a sketch of the resulting stabilization loop follows the list):

1. Updating the neighbor table. We refer to this as neighbor stabilization.

2. Refreshing the finger table. We refer to this as finger stabilization.

3. Adjusting the finger table size.

4. Detecting partitioning. We refer to this as strong stabilization.
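The following non-normative sketch illustrates how an implementation might arrange these four components around the stabilization timer described above. It is written in Python purely for illustration; the class and method names are assumptions of this sketch and are not defined by RELOAD or by this document.

   import threading

   class ChordStabilizer:
       def __init__(self, interval_s):
           # Stabilization interval; when self-tuning is used, it is
           # re-calculated at the end of each period (see Section 6).
           self.interval_s = interval_s

       def start(self):
           # (Re)arm the stabilization timer.
           threading.Timer(self.interval_s, self.on_timer).start()

       def on_timer(self):
           # Restart the timer, then carry out the stabilization routine.
           self.start()
           self.neighbor_stabilization()    # component 1
           self.finger_stabilization()      # component 2
           self.adjust_finger_table_size()  # component 3
           self.strong_stabilization()      # component 4

       # The four components are specified by chord-reload [RFC6940] and
       # by the subsections below; their bodies are omitted here.
       def neighbor_stabilization(self): ...
       def finger_stabilization(self): ...
       def adjust_finger_table_size(self): ...
       def strong_stabilization(self): ...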
As specified in RELOAD base [RFC6940], a peer sends periodic messages as part of the neighbor stabilization, finger stabilization, and strong stabilization routines. In neighbor stabilization, a peer periodically sends an Update request to every peer in its Connection Table; the default interval is ten minutes. In finger stabilization, a peer periodically searches for new peers to include in its finger table; this interval defaults to one hour. This document specifies how the neighbor stabilization and finger stabilization intervals can be determined in an adaptive fashion based on the operating conditions of the overlay. The subsections below describe how this document extends the four components of stabilization.

5.1. Update Requests

As described in RELOAD base [RFC6940], the neighbor and finger stabilization procedures are implemented using Update requests. RELOAD base defines three types of Update requests: 'peer_ready', 'neighbors', and 'full'. The self-tuning extensions require information on the uptimes of the peers in the routing table. Therefore, regardless of its type, every Update request MUST include an 'uptime' field, in which the sender includes its current uptime in seconds.

When self-tuning is used, each peer independently decides the appropriate size for the successor list, predecessor list, and finger table. Thus, the 'predecessors', 'successors', and 'fingers' fields included in RELOAD Update requests are of variable length. As specified in RELOAD [RFC6940], variable-length fields are preceded on the wire by length bytes. In the case of the successor list, predecessor list, and finger table, there are two length bytes (allowing lengths up to 2^16-1). The number of NodeId structures included in each field can be calculated from the length bytes, since the size of a single NodeId structure is 16 bytes. If a peer receives more entries than fit into its successor list, predecessor list, or finger table, the peer MUST ignore the extra entries. A peer may also receive fewer entries than it currently has in its own data structure. In that case, it uses the received entries to update only a subset of the entries in its data structure. As an example, a peer that has a successor list of size 8 may receive a successor list of size 4 from its immediate successor. In that case, the received successor list can only be used to update the first few successors on the peer's successor list. The rest of the successors will remain intact.

5.2. Neighbor Stabilization

In the neighbor stabilization operation of chord-reload, a peer periodically sends an Update request to every peer in its Connection Table. In a small, low-churn overlay, the amount of traffic this process generates is typically acceptable. However, in a large-scale overlay churning at a moderate or high rate, the traffic load may no longer be acceptable, since the size of the connection table is large and the stabilization interval relatively short. The self-tuning mechanisms described in this document are especially designed for overlays of the latter type. Therefore, when the self-tuning mechanisms are used, each peer only sends a periodic Update request to its first predecessor and first successor on the Chord ring; it MUST NOT send Update requests to others.

The neighbor stabilization routine is executed when the stabilization timer fires.
To begin the neighbor stabilization routine, a peer sends an Update request to its first successor and its first predecessor. The type of the Update request MUST be 'neighbors'. The Update request includes the successor and predecessor lists of the sender. If a peer receiving such an Update request learns from the predecessor and successor lists included in the request that new peers can be included in its neighborhood set, it sends Attach requests to the new peers.

After a new peer has been added to the predecessor or successor list, an Update request of type 'peer_ready' is sent to the new peer. This allows the new peer to insert the sender into its neighborhood set.

5.3. Finger Stabilization

Chord-reload specifies two alternative methods for searching for new peers to include in the finger table. Both of the alternatives can be used with the self-tuning extensions defined in this document.

Immediately after a new peer has been added to the finger table, a Probe request is sent to the new peer to fetch its uptime. The requested_info field of the Probe request MUST be set to contain the ProbeInformationType 'uptime' defined in RELOAD base [RFC6940].

5.4. Adjusting Finger Table Size

The chord-reload algorithm defines how a peer can make sure that the finger table is appropriately sized to allow for efficient routing. Since the self-tuning mechanisms specified in this document produce a network size estimate, this estimate can be directly used to calculate the optimal size for the finger table. This mechanism is used instead of the one specified by chord-reload. Each time the stabilization timer fires, a peer uses the network size estimate to determine whether it needs to adjust the size of its finger table, as explained in Section 6.2.

5.5. Detecting Partitioning

This document does not require any changes to the mechanism chord-reload uses to detect network partitioning.

5.6. Leaving the Overlay

As specified in RELOAD base [RFC6940], a leaving peer SHOULD send a Leave request to all members of its neighbor table prior to leaving the overlay. The overlay_specific_data field MUST contain the ChordLeaveData structure. The Leave requests that are sent to successors contain the predecessor list of the leaving peer. The Leave requests that are sent to predecessors contain the successor list of the leaving peer. By investigating the predecessor list received from the leaving peer, a given successor may identify predecessors that are better (that is, closer to it on the Chord ring) than the ones already in its predecessor list; if so, it sends Attach requests to them. Similarly, if a given predecessor identifies better successors by investigating the successor list it receives from the leaving peer, it sends Attach requests to them.

6. Self-tuning Chord Parameters

This section specifies how to determine an appropriate stabilization rate and routing table size in an adaptive fashion. The proposed mechanism is based on [mahajan2003], [liben-nowell2002], and [ghinita2006]. To calculate an appropriate stabilization rate, the values of three parameters must be estimated: overlay size N, failure rate U, and join rate L. To calculate an appropriate routing table size, the estimated network size N can be used.
Peers in the overlay MUST re-calculate the values of the parameters to self-tune the chord-reload algorithm at the end of each stabilization period before re-starting the stabilization timer.

6.1. Estimating Overlay Size

Techniques for estimating the size of an overlay network have been proposed, for instance, in [mahajan2003], [horowitz2003], [kostoulas2005], [binzenhofer2006], and [ghinita2006]. In Chord, the density of peer identifiers in the neighborhood set can be used to produce an estimate of the size of the overlay, N [mahajan2003]. Since peer identifiers are picked randomly with uniform probability from the numBitsInNodeId-bit identifier space, the average distance between peer identifiers in the successor set is (2^numBitsInNodeId)/N.

To estimate the overlay network size, a peer computes the average inter-peer distance d between the successive peers, starting from the most distant predecessor and ending at the most distant successor in the successor list. The estimated network size is calculated as:

        2^numBitsInNodeId
   N = -------------------
                d

This estimate has been found to be accurate within 15% of the real network size [ghinita2006]. Of course, the size of the neighborhood set affects the accuracy of the estimate.

During the join process, a joining peer fills its routing table by sending a series of Ping and Attach requests, as specified in RELOAD base [RFC6940]. Thus, a joining peer immediately has enough information at its disposal to calculate an estimate of the network size.

6.2. Determining Routing Table Size

As specified in RELOAD base, the finger table must contain at least 16 entries. When the self-tuning mechanisms are used, the size of the finger table MUST be set to max(ceiling(log2(N)), 16) using the estimated network size N.

The size of the successor list MUST be set to a maximum of ceiling(log2(N)). An implementation can place a lower limit on the size of the successor list. As an example, the implementation might require the size of the successor list to always be at least three.

The size of the predecessor list MUST be set to ceiling(log2(N)).

6.3. Estimating Failure Rate

A typical approach is to assume that peers join the overlay according to a Poisson process with rate L and leave according to a Poisson process with rate parameter U [mahajan2003]. The value of U can be estimated using peer failures in the finger table and neighborhood set [mahajan2003]. If peers fail with rate U, a peer with M unique peer identifiers in its routing table should observe K failures in time K/(M*U). Every peer in the overlay maintains a history of the last K failures. The current time is inserted into the history when the peer joins the overlay. The estimate of U is calculated as:

           k
   U = -------- ,
        M * Tk

where M is the number of unique peer identifiers in the routing table, Tk is the time between the first and the last failure in the history, and k is the number of failures in the history. If k is smaller than K, the estimate is computed as if there were a failure at the current time. It has been shown that an estimate calculated in a similar manner is accurate within 17% of the real value of U [ghinita2006].
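The following non-normative Python sketch pulls together the estimators described in Sections 6.1 through 6.3. The function and parameter names are assumptions of this sketch, and the inputs (neighborhood Node-IDs, routing table contents, and failure timestamps) are assumed to be available from the implementation's own bookkeeping.

   import math

   NUM_BITS_IN_NODE_ID = 128
   ID_SPACE = 2 ** NUM_BITS_IN_NODE_ID

   def estimate_network_size(predecessors, successors, own_id):
       # Section 6.1: N = 2^numBitsInNodeId / d, where d is the average
       # inter-peer distance from the most distant predecessor to the
       # most distant successor.  Both lists are assumed to be ordered
       # nearest-first, so predecessors[-1] is the most distant one.
       base = predecessors[-1]
       ring = sorted((i - base) % ID_SPACE
                     for i in predecessors + [own_id] + successors)
       d = (ring[-1] - ring[0]) / (len(ring) - 1)
       return ID_SPACE / d

   def routing_table_sizes(n_estimate):
       # Section 6.2: finger table, successor list, and predecessor list
       # sizes derived from the network size estimate.  The lower limit
       # of three successors is the example given in the text.
       log2n = math.ceil(math.log2(n_estimate))
       return max(log2n, 16), max(log2n, 3), log2n

   def estimate_failure_rate(failure_times, m_unique_ids, history_size, now):
       # Section 6.3: U = k / (M * Tk).  failure_times holds the
       # timestamps (oldest first) of the last K observed failures; if
       # fewer than K failures have been seen, the current time counts
       # as a failure.
       times = list(failure_times)
       if len(times) < history_size:
           times.append(now)
       k = len(times)
       tk = times[-1] - times[0]
       return k / (m_unique_ids * tk)

A peer would run such calculations at the end of each stabilization period, as required above.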
The size of the failure history K affects the accuracy of the estimate of U. One can increase the accuracy by increasing K. However, this has the side effect of decreasing responsiveness to changes in the failure rate. On the other hand, a small history size may cause a peer to overreact each time a new failure occurs. In [ghinita2006], K is set to 25% of the routing table size. Use of this value is RECOMMENDED.

6.3.1. Detecting Failures

A new failure is inserted into the failure history in the following cases:

1. A Leave request is received from a neighbor.

2. A peer fails to reply to a Ping request sent in the situation explained below. If no packets have been received on a connection during the past 2*Tr seconds (where Tr is the inactivity timer defined by ICE [RFC5245]), a RELOAD Ping request MUST be sent to the remote peer. RELOAD mandates the use of STUN [RFC5389] for keepalives. STUN keepalives take the form of STUN Binding Indication transactions. As specified in ICE [RFC5245], a peer sends a STUN Binding Indication if there has been no packet sent on a connection for Tr seconds. Tr is configurable and has a default of 15 seconds. Although STUN Binding Indications do not generate a response, the fact that a peer has failed can be learned from the lack of packets (Binding Indications or application protocol packets) received from the peer. If the remote peer fails to reply to the Ping request, the sender should consider the remote peer to have failed.

As an alternative to relying on STUN keepalives to detect peer failure, a peer could send additional, frequent RELOAD messages to every peer in its Connection Table. These messages could be Update requests, in which case they would serve two purposes: detecting peer failure and stabilization. However, as the cost of this approach can be very high in terms of bandwidth consumption and traffic load, especially in large-scale overlays experiencing churn, its use is NOT RECOMMENDED.

6.4. Estimating Join Rate

Reference [ghinita2006] proposes that a peer can estimate the join rate based on the uptimes of the peers in its routing table. An increase in the peer join rate will be reflected by a decrease in the average age of the peers in the routing table. Thus, each peer maintains an array of the ages of the peers in its routing table, sorted in increasing order. Using this information, an estimate of the global peer join rate L is calculated as:

                 N
   L = -------------------- ,
       Ages[floor(rsize/2)]

where Ages is an array containing the ages of the peers in the routing table sorted in increasing order and rsize is the size of the routing table. It has been shown that the estimate obtained by using this method is accurate within 22% of the real join rate [ghinita2006]. Of course, the size of the routing table affects the accuracy.

In order for this mechanism to work, peers need to exchange information about the time they have been present in the overlay. Peers receive the uptimes of their successors and predecessors during the stabilization operations, since all Update requests carry uptime values. A joining peer learns the uptime of the admitting peer since it receives an Update from the admitting peer during the join procedure. Peers learn the uptimes of new fingers since they can fetch the uptime using a Probe request after having attached to the new finger.
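As a non-normative illustration, the join rate estimator above could be implemented along the following lines; the function and parameter names are assumptions of this sketch.

   import math

   def estimate_join_rate(peer_ages, n_estimate):
       # Section 6.4: L = N / Ages[floor(rsize/2)].  peer_ages holds the
       # ages (observed uptimes, in seconds) of the peers in the routing
       # table; n_estimate is the overlay size estimate from Section 6.1.
       ages = sorted(peer_ages)                    # increasing order
       median_age = ages[math.floor(len(ages) / 2)]
       return n_estimate / median_age              # global joins per second

With ages expressed in seconds, the result is a join rate in peers per second; as described in the next section, such a rate is converted to joins per 24 hours before it is shared with other peers.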
6.5. Estimate Sharing

To improve the accuracy of the network size, join rate, and leave rate estimates, peers share their estimates. When the stabilization timer fires, a peer selects number-of-peers-to-probe random peers from its finger table and sends each of them a Probe request. The targets of Probe requests are selected from the finger table rather than from the neighbor table, since neighbors are likely to make similar errors when calculating their estimates. number-of-peers-to-probe is a new element in the overlay configuration document. It is defined in Section 7. Both the Probe request and the answer returned by the target peer MUST contain a new message extension whose MessageExtensionType is 'self_tuning_data'. This extension type is defined in Section 9.1. The extension_contents field of the MessageExtension structure MUST contain a SelfTuningData structure:

   struct {
     uint32          network_size;
     uint32          join_rate;
     uint32          leave_rate;
   } SelfTuningData;

The contents of the SelfTuningData structure are as follows:

network_size
   The latest network size estimate calculated by the sender.

join_rate
   The latest join rate estimate calculated by the sender.

leave_rate
   The latest leave rate estimate calculated by the sender.

The join and leave rates are expressed as joins or failures per 24 hours. As an example, if the global join rate estimate a peer has calculated is 0.123 peers/s, it would include in the join_rate field the ceiling of the value 10627.2 (24*60*60*0.123 = 10627.2), that is, the value 10628.

The 'type' field of the MessageExtension structure MUST be set to contain the value 'self_tuning_data'. The 'critical' field of the structure MUST be set to False.

A peer stores all estimates it receives in Probe requests and answers during a stabilization interval. When the stabilization timer fires, the peer calculates the estimates to be used during the next stabilization interval by taking the 75th percentile (i.e., third quartile) of a data set containing its own estimate and the received estimates.

The default value for number-of-peers-to-probe is 4. This default value is recommended to allow a peer to receive a sufficiently large set of estimates from other peers. With a value of 4, a peer receives four estimates in Probe answers. On average, each peer also receives four Probe requests, each carrying an estimate. Thus, on average, each peer has nine estimates (including its own) that it can use at the end of the stabilization interval. A value smaller than 4 is NOT RECOMMENDED, to keep the number of received estimates high enough. As an example, if the value were 2, there would be peers in the overlay that would only receive two estimates during a stabilization interval. Such peers would only have three estimates available at the end of the interval, which may not be reliable enough, since even a single exceptionally high or low estimate can have a large impact.
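The following non-normative Python sketch shows the encoding and aggregation steps just described; the function names are assumptions of this sketch. The percentile computation follows the definition given in Section 2.

   import math

   def to_per_24h(rate_per_second):
       # Encode a join or leave rate for the SelfTuningData structure as
       # events per 24 hours, rounded up.
       return math.ceil(rate_per_second * 24 * 60 * 60)

   def percentile(values, p):
       # Pth percentile as defined in Section 2: compute the ordinal rank
       # round((P/100)*N) and take the value at that (1-based) rank.
       ordered = sorted(values)
       rank = round((p / 100.0) * len(ordered))
       rank = min(max(rank, 1), len(ordered))
       return ordered[rank - 1]

   def aggregate_estimates(own_estimate, received_estimates):
       # Estimate used during the next stabilization interval: the 75th
       # percentile of the peer's own estimate and the estimates received
       # in Probe requests and answers during the past interval.
       return percentile([own_estimate] + received_estimates, 75)

For example, to_per_24h(0.123) yields 10628, matching the join_rate example above, and a peer holding its own estimate plus eight received estimates uses the seventh smallest of the nine values as the 75th percentile.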
6.6. Calculating the Stabilization Interval

According to [liben-nowell2002], a Chord network in a ring-like state remains in a ring-like state as long as peers send Omega(square(log(N))) messages before N new peers join or N/2 peers fail. We can use the estimate of the peer failure rate, U, to calculate the time Tf in which N/2 peers fail:

          1
   Tf = -----
         2*U

Based on this estimate, a stabilization interval Tstab-1 is calculated as:

                   Tf
   Tstab-1 = ---------------
             square(log2(N))

On the other hand, the estimated join rate L can be used to calculate the time in which N new peers join the overlay. Based on the estimate of L, a stabilization interval Tstab-2 is calculated as:

                      N
   Tstab-2 = -------------------
             L * square(log2(N))

Finally, the actual stabilization interval Tstab that is used can be obtained by taking the minimum of Tstab-1 and Tstab-2.

The results obtained in [maenpaa2009] indicate that making the stabilization interval too small has the effect of making the overlay less stable (e.g., in terms of detected loops and path failures). Thus, a lower limit should be used for the stabilization period. Based on the results in [maenpaa2009], a lower limit of 15s is RECOMMENDED, since using a stabilization period smaller than this will, with high probability, cause too much traffic in the overlay.

7. Overlay Configuration Document Extension

This document extends the RELOAD overlay configuration document by adding one new element, "number-of-peers-to-probe", inside each "configuration" element.

self-tuning:number-of-peers-to-probe: The number of fingers to which Probe requests are sent to obtain their network size, join rate, and leave rate estimates. The default value is 4.

The Relax NG Grammar for this element is:

   namespace self-tuning = "urn:ietf:params:xml:ns:p2p:self-tuning"

   parameter &= element self-tuning:number-of-peers-to-probe {
                   xsd:unsignedInt }?

This namespace is added into the <mandatory-extension> element in the overlay configuration file.

8. Security Considerations

In the same way as malicious or compromised peers implementing the RELOAD base protocol [RFC6940] can advertise false network metrics or distribute false routing table information, for instance in RELOAD Update messages, malicious peers implementing this specification may share false join rate, leave rate, and network size estimates. For such attacks, the same security concerns apply as in the RELOAD base specification. In addition, as long as the number of malicious peers in the overlay remains modest, the statistical mechanisms applied in Section 6.5 (i.e., the use of 75th percentiles) to process the shared estimates a peer obtains help ensure that estimates that are clearly different from (i.e., larger or smaller than) other received estimates will not significantly influence the process of adapting the stabilization interval and routing table size. However, it should be noted that if an attacker is able to impersonate a high number of other peers in the overlay in strategic locations, it may be able to send a high enough number of false estimates to a victim and therefore influence the victim's choice of a stabilization interval.

9. IANA Considerations
9.1. Message Extensions

This document introduces one additional extension to the "RELOAD Extensions" Registry:

   +------------------+-------+---------------+
   | Extension Name   | Code  | Specification |
   +------------------+-------+---------------+
   | self_tuning_data | 0x3   | RFC-AAAA      |
   +------------------+-------+---------------+

The contents of the extension are defined in Section 6.5.

Note to RFC Editor: please replace AAAA with the RFC number for this specification.

9.2. New Overlay Algorithm Type

This document introduces one additional overlay algorithm type to the "RELOAD Overlay Algorithm Types" Registry:

   +-------------------+-----------+
   | Algorithm Name    | Reference |
   +-------------------+-----------+
   | CHORD-SELF-TUNING | RFC-AAAA  |
   +-------------------+-----------+

Note to RFC Editor: please replace AAAA with the RFC number for this specification.

9.3. A New IETF XML Registry

This document registers one new URI for the self-tuning namespace in the "ns" subregistry of the IETF XML registry defined in [RFC3688].

   URI: urn:ietf:params:xml:ns:p2p:self-tuning
   Registrant Contact: The IESG
   XML: N/A, the requested URI is an XML namespace

10. Acknowledgments

The authors would like to thank Jani Hautakorpi for his contributions to the document. The authors would also like to thank Carlos Bernardos, Martin Durst, Alissa Cooper, Tobias Gondrom, and Barry Leiba for their comments on the document.

11. References

11.1. Normative References

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC5245] Rosenberg, J., "Interactive Connectivity Establishment (ICE): A Protocol for Network Address Translator (NAT) Traversal for Offer/Answer Protocols", RFC 5245, April 2010.

[RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, "Session Traversal Utilities for NAT (STUN)", RFC 5389, October 2008.

[RFC6940] Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and H. Schulzrinne, "REsource LOcation And Discovery (RELOAD) Base Protocol", RFC 6940, January 2014.

11.2. Informative References

[CAN] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S. Schenker, "A Scalable Content-Addressable Network", In Proceedings of the 2001 Conference on Applications, Technologies, Architectures and Protocols for Computer Communications, pp. 161-172, August 2001.

[Chord] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications", IEEE/ACM Transactions on Networking, Volume 11, Issue 1, pp. 17-32, February 2003.

[Pastry] Rowstron, A. and P. Druschel, "Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-to-Peer Systems", In Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms, pp. 329-350, November 2001.

[RFC3688] Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688, January 2004.

[binzenhofer2006] Binzenhofer, A., Kunzmann, G., and R. Henjes, "A Scalable Algorithm to Monitor Chord-Based P2P Systems at Runtime", In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-8, April 2006.

[ghinita2006] Ghinita, G. and Y. Teo, "An Adaptive Stabilization Framework for Distributed Hash Tables", In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 29-38, April 2006.
Teo, "An Adaptive Stabilization 939 Framework for Distributed Hash Tables", In Proceedings of 940 the 20th IEEE International Parallel and Distributed 941 Processing Symposium (IPDPS) pp. 29-38, April 2006. 943 [horowitz2003] 944 Horowitz, K. and D. Malkhi, "Estimating Network Size from 945 Local Information", Information Processing Letters Volume 946 88, Issue 5, pp. 237-243, December 2003. 948 [kostoulas2005] 949 Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and 950 A. Demers, "Decentralized Schemes for Size Estimation in 951 Large and Dynamic Groups", In Proceedings of the 4th IEEE 952 International Symposium on Network Computing and 953 Applications pp. 41-48, July 2005. 955 [krishnamurthy2008] 956 Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. 957 Haridi, "Comparing Maintenance Strategies for Overlays", 958 In Proceedings of the 16th Euromicro Conference on 959 Parallel, Distributed and Network-Based Processing pp. 960 473-482, February 2008. 962 [li2004] Li, J., Strinbling, J., Gil, T., Morris, R., and M. 963 Kaashoek, "Comparing the Performance of Distributed Hash 964 Tables Under Churn", Peer-to-Peer Systems III, volume 3279 965 of Lecture Notes in Computer Science Springer, pp. 87-99, 966 February 2005. 968 [liben-nowell2002] 969 Liben-Nowell, D., Balakrishnan, H., and D. Karger, 970 "Observations on the Dynamic Evolution of Peer-to-Peer 971 Networks", In Proceedings of the 1st International 972 Workshop on Peer-to-Peer Systems (IPTPS) pp. 22-33, March 973 2002. 975 [maenpaa2009] 976 Maenpaa, J. and G. Camarillo, "A Study on Maintenance 977 Operations in a Chord-Based Peer-to-Peer Session 978 Initiation Protocol Overlay Network", In Proceedings of 979 the 23rd IEEE International Parallel and Distributed 980 Processing Symposium (IPDPS) pp. 1-9, May 2009. 982 [mahajan2003] 983 Mahajan, R., Castro, M., and A. Rowstron, "Controlling the 984 Cost of Reliability in Peer-to-Peer Overlays", In 985 Proceedings of the 2nd International Workshop on Peer-to- 986 Peer Systems (IPTPS) pp. 21-32, February 2003. 988 [rhea2004] 989 Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, 990 "Handling Churn in a DHT", In Proceedings of the USENIX 991 Annual Technical Conference pp. 127-140, June 2004. 993 [weiss1998] 994 Weiss, M., "Data Structures and Algorithm Analysis in 995 C++", Addison-Wesley Longman Publishin Co., Inc. 2nd 996 Edition, ISBN:0201361221, 1998. 998 Authors' Addresses 1000 Jouni Maenpaa 1001 Ericsson 1002 Hirsalantie 11 1003 Jorvas 02420 1004 Finland 1006 Email: Jouni.Maenpaa@ericsson.com 1008 Gonzalo Camarillo 1009 Ericsson 1010 Hirsalantie 11 1011 Jorvas 02420 1012 Finland 1014 Email: Gonzalo.Camarillo@ericsson.com