idnits 2.17.1 draft-ietf-p2psip-self-tuning-10.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (February 2, 2014) is 3730 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 5245 (Obsoleted by RFC 8445, RFC 8839) ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489) == Outdated reference: A later version (-09) exists of draft-ietf-p2psip-concepts-05 Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 P2PSIP Working Group J. Maenpaa 3 Internet-Draft G. Camarillo 4 Intended status: Standards Track Ericsson 5 Expires: August 6, 2014 February 2, 2014 7 Self-tuning Distributed Hash Table (DHT) for REsource LOcation And 8 Discovery (RELOAD) 9 draft-ietf-p2psip-self-tuning-10.txt 11 Abstract 13 REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P) 14 signaling protocol that provides an overlay network service. Peers 15 in a RELOAD overlay network collectively run an overlay algorithm to 16 organize the overlay, and to store and retrieve data. This document 17 describes how the default topology plugin of RELOAD can be extended 18 to support self-tuning, that is, to adapt to changing operating 19 conditions such as churn and network size. 21 Status of This Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on August 6, 2014. 38 Copyright Notice 40 Copyright (c) 2014 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document. Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 3. Introduction to Stabilization in DHTs . . . . . . . . . . . . 5 58 3.1. Reactive vs. Periodic Stabilization . . . . . . . . . . . 5 59 3.2. Configuring Periodic Stabilization . . . . . . . . . . . 6 60 3.3. Adaptive Stabilization . . . . . . . . . . . . . . . . . 7 61 4. Introduction to Chord . . . . . . . . . . . . . . . . . . . . 8 62 5. Extending Chord-reload to Support Self-tuning . . . . . . . . 9 63 5.1. Update Requests . . . . . . . . . . . . . . . . . . . . . 10 64 5.2. Neighbor Stabilization . . . . . . . . . . . . . . . . . 10 65 5.3. Finger Stabilization . . . . . . . . . . . . . . . . . . 11 66 5.4. Adjusting Finger Table Size . . . . . . . . . . . . . . . 11 67 5.5. Detecting Partitioning . . . . . . . . . . . . . . . . . 11 68 5.6. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 12 69 6. Self-tuning Chord Parameters . . . . . . . . . . . . . . . . 12 70 6.1. Estimating Overlay Size . . . . . . . . . . . . . . . . . 12 71 6.2. Determining Routing Table Size . . . . . . . . . . . . . 13 72 6.3. Estimating Failure Rate . . . . . . . . . . . . . . . . . 13 73 6.3.1. Detecting Failures . . . . . . . . . . . . . . . . . 14 74 6.4. Estimating Join Rate . . . . . . . . . . . . . . . . . . 15 75 6.5. Estimate Sharing . . . . . . . . . . . . . . . . . . . . 15 76 6.6. Calculating the Stabilization Interval . . . . . . . . . 16 77 7. Overlay Configuration Document Extension . . . . . . . . . . 17 78 8. Security Considerations . . . . . . . . . . . . . . . . . . . 18 79 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 18 80 9.1. Message Extensions . . . . . . . . . . . . . . . . . . . 18 81 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 18 82 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 19 83 11.1. Normative References . . . . . . . . . . . . . . . . . . 19 84 11.2. Informative References . . . . . . . . . . . . . . . . . 19 85 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 21 87 1. Introduction 89 REsource LOcation And Discovery (RELOAD) [I-D.ietf-p2psip-base] is a 90 peer-to-peer signaling protocol that can be used to maintain an 91 overlay network, and to store data in and retrieve data from the 92 overlay. For interoperability reasons, RELOAD specifies one overlay 93 algorithm, called chord-reload, that is mandatory to implement. This 94 document extends the chord-reload algorithm by introducing self- 95 tuning behavior. 97 Distributed Hash Table (DHT) based overlay networks are self- 98 organizing, scalable and reliable. However, these features come at a 99 cost: peers in the overlay network need to consume network bandwidth 100 to maintain routing state. Most DHTs use a periodic stabilization 101 routine to counter the undesirable effects of churn on routing. To 102 configure the parameters of a DHT, some characteristics such as churn 103 rate and network size need to be known in advance. These 104 characteristics are then used to configure the DHT in a static 105 fashion by using fixed values for parameters such as the size of the 106 successor set, size of the routing table, and rate of maintenance 107 messages. The problem with this approach is that it is not possible 108 to achieve a low failure rate and a low communication overhead by 109 using fixed parameters. Instead, a better approach is to allow the 110 system to take into account the evolution of network conditions and 111 adapt to them. This document extends the mandatory-to-implement 112 chord-reload algorithm by making it self-tuning. Two main advantages 113 of self-tuning are that users no longer need to tune every DHT 114 parameter correctly for a given operating environment and that the 115 system adapts to changing operating conditions. 117 The remainder of this document is structured as follows: Section 2 118 provides definitions of terms used in this document. Section 3 119 discusses alternative approaches to stabilization operations in DHTs, 120 including reactive stabilization, periodic stabilization, and 121 adaptive stabilization. Section 4 gives an introduction to the Chord 122 DHT algorithm. Section 5 describes how this document extends the 123 stabilization routine of the chord-reload algorithm. Section 6 124 describes how the stabilization rate and routing table size are 125 calculated in an adaptive fashion. 127 2. Terminology 129 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 130 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 131 "OPTIONAL" in this document are to be interpreted as described in RFC 132 2119 [RFC2119]. 134 This document uses the terminology and definitions from the Concepts 135 and Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts] 136 draft. 138 numBitsInNodeId: Specifies the number of bits in a RELOAD Node-ID. 140 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 141 distributed systems that provide a lookup service similar to a 142 regular hash table. Given a key, any peer participating in the 143 system can retrieve the value associated with that key. The 144 responsibility for maintaining the mapping from keys to values is 145 distributed among the peers. 147 Chord Ring: The Chord DHT uses ring topology and orders identifiers 148 on an identifier circle of size 2^numBitsInNodeId. This 149 identifier circle is called the Chord ring. On the Chord ring, 150 the responsibility for a key k is assigned to the node whose 151 identifier equals to or immediately follows k. 153 Finger Table: A data structure with up to (but typically less than) 154 numBitsInNodeId entries maintained by each peer in a Chord-based 155 overlay. The ith entry in the finger table of peer n contains the 156 identity of the first peer that succeeds n by at least 157 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the 158 ith finger of peer n. As an example, the first entry in the finger 159 table of peer n contains a peer half-way around the Chord ring 160 from peer n. The purpose of the finger table is to accelerate 161 lookups. 163 n.id: An abbreviation that is in this document used refer to the 164 Node-ID of peer n. 166 O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means 167 that f(n) is less than some constant multiple of g(n). 169 Omega(g(n)): Informally, saying that some equation f(n) = 170 Omega(g(n)) means that f(n) is more than some constant multiple of 171 g(n). 173 Predecessor List: A data structure containing the first predecessors 174 of a peer on the Chord ring. 176 Successor List: A data structure containing the first r successors 177 of a peer on the Chord ring. 179 Neighborhood Set: A term used to refer to the set of peers included 180 in the successor and predecessor lists of a given peer. 182 Routing Table: Contents of a given peer's routing table include the 183 set of peers that the peer can use to route overlay messages. The 184 routing table is made up of the finger table, successor list and 185 predecessor list. 187 3. Introduction to Stabilization in DHTs 189 DHTs use stabilization routines to counter the undesirable effects of 190 churn on routing. The purpose of stabilization is to keep the 191 routing information of each peer in the overlay consistent with the 192 constantly changing overlay topology. There are two alternative 193 approaches to stabilization: periodic and reactive [rhea2004]. 194 Periodic stabilization can either use a fixed stabilization rate or 195 calculate the stabilization rate in an adaptive fashion. 197 3.1. Reactive vs. Periodic Stabilization 199 In reactive stabilization, a peer reacts to the loss of a peer in its 200 neighborhood set or to the appearance of a new peer that should be 201 added to its neighborhood set by sending a copy of its neighbor table 202 to all peers in the neighborhood set. Periodic recovery, in 203 contrast, takes place independently of changes in the neighborhood 204 set. In periodic recovery, a peer periodically shares its 205 neighborhood set with each or a subset of the members of that set. 207 The chord-reload algorithm [I-D.ietf-p2psip-base] supports both 208 reactive and periodic stabilization. It has been shown in [rhea2004] 209 that reactive stabilization works well for small neighborhood sets 210 (i.e., small overlays) and moderate churn. However, in large-scale 211 (e.g., 1000 peers or more [rhea2004]) or high-churn overlays, 212 reactive stabilization runs the risk of creating a positive feedback 213 cycle, which can eventually result in congestion collapse. In 214 [rhea2004], it is shown that a 1000-peer overlay under churn uses 215 significantly less bandwidth and has lower latencies when periodic 216 stabilization is used than when reactive stabilization is used. 217 Although in the experiments carried out in [rhea2004], reactive 218 stabilization performed well when there was no churn, its bandwidth 219 use was observed to jump dramatically under churn. At higher churn 220 rates and larger scale overlays, periodic stabilization uses less 221 bandwidth and the resulting lower contention for the network leads to 222 lower latencies. For this reason, most DHTs such as CAN [CAN], Chord 223 [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic 224 stabilization [ghinita2006]. As an example, the first version of 225 Bamboo used reactive stabilization, which caused Bamboo to suffer 226 from degradation in performance under churn. To fix this problem, 227 Bamboo was modified to use periodic stabilization. 229 In Chord, periodic stabilization is typically done both for 230 successors and fingers. An alternative strategy is analyzed in 231 [krishnamurthy2008]. In this strategy, called the correction-on- 232 change maintenance strategy, a peer periodically stabilizes its 233 successors but does not do so for its fingers. Instead, finger 234 pointers are stabilized in a reactive fashion. The results obtained 235 in [krishnamurthy2008] imply that although the correction-on-change 236 strategy works well when churn is low, periodic stabilization 237 outperforms the correction-on-change strategy when churn is high. 239 3.2. Configuring Periodic Stabilization 241 When periodic stabilization is used, one faces the problem of 242 selecting an appropriate execution rate for the stabilization 243 procedure. If the execution rate of periodic stabilization is high, 244 changes in the system can be quickly detected, but at the 245 disadvantage of increased communication overhead. Alternatively, if 246 the stabilization rate is low and the churn rate is high, routing 247 tables become inaccurate and DHT performance deteriorates. Thus, the 248 problem is setting the parameters so that the overlay achieves the 249 desired reliability and performance even in challenging conditions, 250 such as under heavy churn. This naturally results in high cost 251 during periods when the churn level is lower than expected, or 252 alternatively, poor performance or even network partitioning in worse 253 than expected conditions. 255 In addition to selecting an appropriate stabilization interval, 256 regardless of whether periodic stabilization is used or not, an 257 appropriate size needs to be selected for the neighborhood set and 258 for the finger table. 260 The current approach is to configure overlays statically. This works 261 in situations where perfect information about the future is 262 available. In situations where the operating conditions of the 263 network are known in advance and remain static throughout the 264 lifetime of the system, it is possible to choose fixed optimal values 265 for parameters such as stabilization rate, neighborhood set size and 266 routing table size. However, if the operating conditions (e.g., the 267 size of the overlay and its churn rate) do not remain static but 268 evolve with time, it is not possible to achieve both a low lookup 269 failure rate and a low communication overhead by using fixed 270 parameters [ghinita2006]. 272 As an example, to configure the Chord DHT algorithm, one needs to 273 select values for the following parameters: size of successor list, 274 stabilization interval, and size of the finger table. To select an 275 appropriate value for the stabilization interval, one needs to know 276 the expected churn rate and overlay size. According to 277 [liben-nowell2002], a Chord network in a ring-like state remains in a 278 ring-like state as long as peers send Omega(log2^2(N)) messages 279 before N new peers join or N/2 peers fail. Thus, in a 500-peer 280 overlay churning at a rate such that one peer joins and one peer 281 leaves the network every 30 seconds, an appropriate stabilization 282 interval would be on the order of 93s. According to [Chord], the size 283 of the successor list and finger table should be on the order of 284 log2(N). Having a successor list of size O(log2(N)) makes it 285 unlikely that a peer will lose all of its successors, which would 286 cause the Chord ring to become disconnected. Thus, in a 500-peer 287 network each peer should maintain on the order of nine successors and 288 fingers. However, if the churn rate doubles and the network size 289 remains unchanged, the stabilization rate should double as well. 290 That is, the appropriate maintenance interval would now be on the 291 order of 46s. On the other hand, if the churn rate becomes e.g. six- 292 fold and the size of the network grows to 2000 peers, on the order of 293 eleven fingers and successors should be maintained and the 294 stabilization interval should be on the order of 42s. If one 295 continued using the old values, this could result in inaccurate 296 routing tables, network partitioning, and deteriorating performance. 298 3.3. Adaptive Stabilization 300 A self-tuning DHT takes into consideration the continuous evolution 301 of network conditions and adapts to them. In a self-tuning DHT, each 302 peer collects statistical data about the network and dynamically 303 adjusts its stabilization rate, neighborhood set size, and finger 304 table size based on the analysis of the data [ghinita2006]. 305 Reference [mahajan2003] shows that by using self-tuning, it is 306 possible to achieve high reliability and performance even in adverse 307 conditions with low maintenance cost. Adaptive stabilization has 308 been shown to outperform periodic stabilization in terms of both 309 lookup failures and communication overhead [ghinita2006]. 311 4. Introduction to Chord 313 Chord [Chord] is a structured P2P algorithm that uses consistent 314 hashing to build a DHT out of several independent peers. Consistent 315 hashing assigns each peer and resource a fixed-length identifier. 316 Peers use SHA-1 as the base hash fuction to generate the identifiers. 317 As specified in RELOAD base, the length of the identifiers is 318 numBitsInNodeId=128 bits. The identifiers are ordered on an 319 identifier circle of size 2^numBitsInNodeId. On the identifier 320 circle, key k is assigned to the first peer whose identifier equals 321 or follows the identifier of k in the identifier space. The 322 identifier circle is called the Chord ring. 324 Different DHTs differ significantly in performance when bandwidth is 325 limited. It has been shown that when compared to other DHTs, the 326 advantages of Chord include that it uses bandwidth efficiently and 327 can achieve low lookup latencies at little cost [li2004]. 329 A simple lookup mechanism could be implemented on a Chord ring by 330 requiring each peer to only know how to contact its current successor 331 on the identifier circle. Queries for a given identifier could then 332 be passed around the circle via the successor pointers until they 333 encounter the first peer whose identifier is equal to or larger than 334 the desired identifier. Such a lookup scheme uses a number of 335 messages that grows linearly with the number of peers. To reduce the 336 cost of lookups, Chord maintains also additional routing information; 337 each peer n maintains a data structure with up to numBitsInNodeId 338 entries, called the finger table. The first entry in the finger 339 table of peer n contains the peer half-way around the ring from peer 340 n. The second entry contains the peer that is 1/4th of the way 341 around, the third entry the peer that is 1/8th of the way around, 342 etc. In other words, the ith entry in the finger table at peer n 343 contains the identity of the first peer s that succeeds n by at least 344 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the ith 345 finger of peer n. The interval between two consecutive fingers is 346 called a finger interval. The ith finger interval of peer n covers 347 the range [n.id + 2^(numBitsInNodeId-i), n.id + 348 2^(numBitsInNodeId-i+1)) on the Chord ring. In an N-peer network, 349 each peer maintains information about O(log2(N)) other peers in its 350 finger table. As an example, if N=100000, it is sufficient to 351 maintain 17 fingers. 353 Chord needs all peers' successor pointers to be up to date in order 354 to ensure that lookups produce correct results as the set of 355 participating peers changes. To achieve this, peers run a 356 stabilization protocol periodically in the background. The 357 stabilization protocol of the original Chord algorithm uses two 358 operations: successor stabilization and finger stabilization. 360 However, the Chord algorithm of RELOAD base defines two additional 361 stabilization components, as will be discussed below. 363 To increase robustness in the event of peer failures, each Chord peer 364 maintains a successor list of size r, containing the peer's first r 365 successors. The benefit of successor lists is that if each peer 366 fails independently with probability p, the probability that all r 367 successors fail simultaneously is only p^r. 369 The original Chord algorithm maintains only a single predecessor 370 pointer. However, multiple predecessor pointers (i.e., a predecessor 371 list) can be maintained to speed up recovery from predecessor 372 failures. The routing table of a peer consists of the successor 373 list, finger table, and predecessor list. 375 5. Extending Chord-reload to Support Self-tuning 377 This section describes how the mandatory-to-implement chord-reload 378 algorithm defined in RELOAD base [I-D.ietf-p2psip-base] can be 379 extended to support self-tuning. 381 The chord-reload algorithm supports both reactive and periodic 382 recovery strategies. When the self-tuning mechanisms defined in this 383 document are used, the periodic recovery strategy MUST be used. 384 Further, chord-reload specifies that at least three predecessors and 385 three successors need to be maintained. When the self-tuning 386 mechanisms are used, the appropriate sizes of the successor list and 387 predecessor list are determined in an adaptive fashion based on the 388 estimated network size, as will be described in Section 6. 390 As specified in RELOAD base, each peer MUST maintain a stabilization 391 timer. When the stabilization timer fires, the peer MUST restart the 392 timer and carry out the overlay stabilization routine. Overlay 393 stabilization has four components in chord-reload: 395 1. Update the neighbor table. We refer to this as neighbor 396 stabilization. 398 2. Refreshing the finger table. We refer to this as finger 399 stabilization. 401 3. Adjusting finger table size. 403 4. Detecting partitioning. We refer to this as strong 404 stabilization. 406 As specified in RELOAD base [I-D.ietf-p2psip-base], a peer sends 407 periodic messages as part of the neighbor stabilization, finger 408 stabilization, and strong stabilization routines. In neighbor 409 stabilization, a peer periodically sends an Update request to every 410 peer in its Connection Table. The default time is every ten minutes. 411 In finger stabilization, a peer periodically searches for new peers 412 to include in its finger table. This time defaults to one hour. 413 This document specifies how the neighbor stabilization and finger 414 stabilization intervals can be determined in an adaptive fashion 415 based on the operating conditions of the overlay. The subsections 416 below describe how this document extends the four components of 417 stabilization. 419 5.1. Update Requests 421 As described in RELOAD base [I-D.ietf-p2psip-base], the neighbor and 422 finger stabilization procedures are implemented using Update 423 requests. RELOAD base defines three types of Update requests: 424 'peer_ready', 'neighbors', and 'full'. Regardless of the type, all 425 Update requests include an 'uptime' field. Since the self-tuning 426 extensions require information on the uptimes of peers in the routing 427 table, the sender of an Update request MUST include its current 428 uptime in seconds in the 'uptime' field. 430 When self-tuning is used, each peer decides independently the 431 appropriate size for the successor list, predecessor list and finger 432 table. Thus, the 'predecessors', 'successors', and 'fingers' fields 433 included in RELOAD Update requests are of variable length. As 434 specified in RELOAD [I-D.ietf-p2psip-base], variable length fields 435 are on the wire preceded by length bytes. In the case of the 436 successor list, predecessor list, and finger table, there are two 437 length bytes (allowing lengths up to 2^16-1). The number of NodeId 438 structures included in each field can be calculated based on the 439 length bytes since the size of a single NodeId structure is 16 bytes. 440 If a peer receives more entries than fit into its successor list, 441 predecessor list or finger table, the peer MUST ignore the extra 442 entries. If a peer receives less entries than it currently has in 443 its own data structure, the peer MUST NOT drop the extra entries from 444 its data structure. 446 5.2. Neighbor Stabilization 448 In the neighbor stabilization operation of chord-reload, a peer 449 periodically sends an Update request to every peer in its Connection 450 Table. In a small, low-churn overlay, the amount of traffic this 451 process generates is typically acceptable. However, in a large-scale 452 overlay churning at a moderate or high churn rate, the traffic load 453 may no longer be acceptable since the size of the connection table is 454 large and the stabilization interval relatively short. The self- 455 tuning mechanisms described in this document are especially designed 456 for overlays of the latter type. Therefore, when the self-tuning 457 mechanisms are used, each peer MUST send a periodic Update request 458 only to its first predecessor and first successor on the Chord ring. 460 The neighbor stabilization routine MUST be executed when the 461 stabilization timer fires. To begin the neighbor stabilization 462 routine, a peer MUST send an Update request to its first successor 463 and its first predecessor. The type of the Update request MUST be 464 'neighbors'. The Update request MUST include the successor and 465 predecessor lists of the sender. If a peer receiving such an Update 466 request learns from the predecessor and successor lists included in 467 the request that new peers can be included in its neighborhood set, 468 it MUST send Attach requests to the new peers. 470 After a new peer has been added to the predecessor or successor list, 471 an Update request of type 'peer_ready' MUST be sent to the new peer. 472 This allows the new peer to insert the sender into its neighborhood 473 set. 475 5.3. Finger Stabilization 477 Chord-reload specifies two alternative methods for searching for new 478 peers to the finger table. Both of the alternatives can be used with 479 the self-tuning extensions defined in this document. 481 Immediately after a new peer has been added to the finger table, a 482 Probe request MUST be sent to the new peer to fetch its uptime. The 483 requested_info field of the Probe request MUST be set to contain the 484 ProbeInformationType 'uptime' defined in RELOAD base 485 [I-D.ietf-p2psip-base]. 487 5.4. Adjusting Finger Table Size 489 The chord-reload algorithm defines how a peer can make sure that the 490 finger table is appropriately sized to allow for efficient routing. 491 Since the self-tuning mechanisms specified in this document produce a 492 network size estimate, this estimate can be directly used to 493 calculate the optimal size for the finger table. This mechanism MUST 494 be used instead of the one specified by chord-reload. A peer MUST 495 use the network size estimate to determine whether it needs to adjust 496 the size of its finger table each time when the stabilization timer 497 fires. The way this is done is explained in Section 6.2. 499 5.5. Detecting Partitioning 501 This document does not require any changes to the mechanism chord- 502 reload uses to detect network partitioning. 504 5.6. Leaving the Overlay 506 As specified in RELOAD base [I-D.ietf-p2psip-base], a leaving peer 507 SHOULD send a Leave request to all members of its neighbor table 508 prior to leaving the overlay. The overlay_specific_data field MUST 509 contain the ChordLeaveData structure. The Leave requests that are 510 sent to successors MUST contain the predecessor list of the leaving 511 peer. The Leave requests that are sent to the predecessors MUST 512 contain the successor list of the leaving peer. If a given successor 513 can identify better predecessors than are already included in its 514 predecessor lists by investigating the predecessor list it receives 515 from the leaving peer, it MUST send Attach requests to them. 516 Similarly, if a given predecessor identifies better successors by 517 investigating the successor list it receives from the leaving peer, 518 it MUST send Attach requests to them. 520 6. Self-tuning Chord Parameters 522 This section specifies how to determine an appropriate stabilization 523 rate and routing table size in an adaptive fashion. The proposed 524 mechanism is based on [mahajan2003], [liben-nowell2002], and 525 [ghinita2006]. To calculate an appropriate stabilization rate, the 526 values of three parameters MUST be estimated: overlay size N, failure 527 rate U, and join rate L. To calculate an appropriate routing table 528 size, the estimated network size N can be used. Peers in the overlay 529 MUST re-calculate the values of the parameters to self-tune the 530 chord-reload algorithm at the end of each stabilization period before 531 re-starting the stabilization timer. 533 6.1. Estimating Overlay Size 535 Techniques for estimating the size of an overlay network have been 536 proposed for instance in [mahajan2003], [horowitz2003], 537 [kostoulas2005], [binzenhofer2006], and [ghinita2006]. In Chord, the 538 density of peer identifiers in the neighborhood set can be used to 539 produce an estimate of the size of the overlay, N [mahajan2003]. 540 Since peer identifiers are picked randomly with uniform probability 541 from the numBitsInNodeId-bit identifier space, the average distance 542 between peer identifiers in the successor set is (2^numBitsInNodeId)/ 543 N. 545 To estimate the overlay network size, a peer MUST compute the average 546 inter-peer distance d between the successive peers starting from the 547 most distant predecessor and ending to the most distant successor in 548 the successor list. The estimated network size MUST be calculated 549 as: 551 2^numBitsInNodeId 552 N = ------------------- 553 d 555 This estimate has been found to be accurate within 15% of the real 556 network size [ghinita2006]. Of course, the size of the neighborhood 557 set affects the accuracy of the estimate. 559 During the join process, a joining peer fills its routing table by 560 sending a series of Ping and Attach requests, as specified in RELOAD 561 base [I-D.ietf-p2psip-base]. Thus, a joining peer immediately has 562 enough information at its disposal to calculate an estimate of the 563 network size. 565 6.2. Determining Routing Table Size 567 As specified in RELOAD base, the finger table must contain at least 568 16 entries. When the self-tuning mechanisms are used, the size of 569 the finger table MUST be set to max(log2(N), 16) using the estimated 570 network size N. 572 The size of the successor list MUST be set to log2(N). An 573 implementation MAY place a lower limit on the size of the successor 574 list. As an example, the implementation might require the size of 575 the successor list to be always at least three. 577 A peer MAY choose to maintain a fixed-size predecessor list with only 578 three entries as specified in RELOAD base. However, it is 579 RECOMMENDED that a peer maintains log2(N) predecessors. 581 6.3. Estimating Failure Rate 583 A typical approach is to assume that peers join the overlay according 584 to a Poisson process with rate L and leave according to a Poisson 585 process with rate parameter U [mahajan2003]. The value of U can be 586 estimated using peer failures in the finger table and neighborhood 587 set [mahajan2003]. If peers fail with rate U, a peer with M unique 588 peer identifiers in its routing table should observe K failures in 589 time K/(M*U). Every peer in the overlay MUST maintain a history of 590 the last K failures. The current time MUST be inserted into the 591 history when the peer joins the overlay. The estimate of U MUST be 592 calculated as: 594 k 595 U = --------, 596 M * Tk 598 where M is the number of unique peer identifiers in the routing 599 table, Tk is the time between the first and the last failure in the 600 history, and k is the number of failures in the history. If k is 601 smaller than K, the estimate MUST be computed as if there was a 602 failure at the current time. It has been shown that an estimate 603 calculated in a similar manner is accurate within 17% of the real 604 value of U [ghinita2006]. 606 The size of the failure history K affects the accuracy of the 607 estimate of U. One can increase the accuracy by increasing K. 608 However, this has the side effect of decreasing responsiveness to 609 changes in the failure rate. On the other hand, a small history size 610 may cause a peer to overreact each time a new failure occurs. In 611 [ghinita2006], K is set 25% of the routing table size. Use of this 612 approach is RECOMMENDED. 614 6.3.1. Detecting Failures 616 A new failure MUST be inserted to the failure history in the 617 following cases: 619 1. A Leave request is received from a neigbhor. 621 2. A peer fails to reply to a Ping request sent in the situation 622 explained below. If no packets have been received on a 623 connection during the past 2*Tr seconds (where Tr is the 624 inactivity timer defined by ICE [RFC5245]), a RELOAD Ping request 625 MUST be sent to the remote peer. RELOAD mandates the use of STUN 626 [RFC5389] for keepalives. STUN keepalives take the form of STUN 627 Binding Indication transactions. As specified in ICE [RFC5245], 628 a peer sends a STUN Binding Indication if there has been no 629 packet sent on a connection for Tr seconds. Tr is configurable 630 and has a default of 15 seconds. Although STUN Binding 631 Indications do not generate a response, the fact that a peer has 632 failed can be learned from the lack of packets (Binding 633 Indications or application protocol packets) received from the 634 peer. If the remote peer fails to reply to the Ping request, the 635 sender MUST consider the remote peer to have failed. 637 As an alternative to relying on STUN keepalives to detect peer 638 failure, a peer could send additional, frequent RELOAD messages to 639 every peer in its Connection Table. These messages could be Update 640 requests, in which case they would serve two purposes: detecting peer 641 failure and stabilization. However, as the cost of this approach can 642 be very high in terms of bandwidth consumption and traffic load, 643 especially in large-scale overlays experiencing churn, its use is NOT 644 RECOMMENDED. 646 6.4. Estimating Join Rate 648 Reference [ghinita2006] proposes that a peer can estimate the join 649 rate based on the uptime of the peers in its routing table. An 650 increase in peer join rate will be reflected by a decrease in the 651 average age of peers in the routing table. Thus, each peer MUST 652 maintain an array of the ages of the peers in its routing table 653 sorted in increasing order. Using this information, an estimate of 654 the global peer join rate L MUST be calculated as: 656 N 657 L = ---------------, 658 Ages[rsize/2] 660 where Ages is an array containing the ages of the peers in the 661 routing table sorted in increasing order and rsize is the size of the 662 routing table. It has been shown that the estimate obtained by using 663 this method is accurate within 22% of the real join rate 664 [ghinita2006]. Of course, the size of the routing table affects the 665 accuracy. 667 In order for this mechanism to work, peers need to exchange 668 information about the time they have been present in the overlay. 669 Peers receive the uptimes of their successors and predecessors during 670 the stabilization operations since all Update requests carry uptime 671 values. A joining peer learns the uptime of the admitting peer since 672 it receives an Update from the admitting peer during the join 673 procedure. Peers learn the uptimes of new fingers since they can 674 fetch the uptime using a Probe request after having attached to the 675 new finger. 677 6.5. Estimate Sharing 679 To improve the accuracy of network size, join rate, and leave rate 680 estimates, peers MUST share their estimates. When the stabilization 681 timer fires, a peer MUST select number-of-peers-to-probe random peers 682 from its finger table and send each of them a Probe request. The 683 targets of Probe requests are selected from the finger table rather 684 than from the neighbor table since neighbors are likely to make 685 similar errors when calculating their estimates. number-of-peers-to- 686 probe is a new element in the overlay configuration document. It is 687 defined in Section 7 and has a default value of 4. Both the Probe 688 request and the answer returned by the target peer MUST contain a new 689 message extension whose MessageExtensionType is 'self_tuning_data'. 690 This extension type is defined in Section 9.1. The 691 extension_contents field of the MessageExtension structure MUST 692 contain a SelfTuningData structure: 694 struct { 695 uint32 network_size; 696 uint32 join_rate; 697 uint32 leave_rate; 698 } SelfTuningData; 700 The contents of the SelfTuningData structure are as follows: 702 network_size 704 The latest network size estimate calculated by the sender. 706 join_rate 708 The latest join rate estimate calculated by the sender. 710 leave_rate 712 The latest leave rate estimate calculated by the sender. 714 The join and leave rates are expressed as joins or failures per 24 715 hours. As an example, if the global join rate estimate a peer has 716 calculated is 0.123 peers/s, it would include in the join_rate field 717 the ceiling of the value 10627.2 (24*60*60*0.123 = 10627.2), that is, 718 the value 10628. 720 The 'type' field of the MessageExtension structure MUST be set to 721 contain the value 'self_tuning_data'. The 'critical' field of the 722 structure MUST be set to False. 724 A peer MUST store all estimates it receives in Probe requests and 725 answers during a stabilization interval. When the stabilization 726 timer fires, the peer MUST calculate the estimates to be used during 727 the next stabilization interval by taking the 75th percentile of a 728 data set containing its own estimate and the received estimates. 730 6.6. Calculating the Stabilization Interval 732 According to [liben-nowell2002], a Chord network in a ring-like state 733 remains in a ring-like state as long as peers send Omega(log2(N))^2 734 messages before N new peers join or N/2 peers fail. We can use the 735 estimate of peer failure rate, U, to calculate the time Tf in which N 736 /2 peers fail: 738 1 739 Tf = ------ 740 2*U 742 Based on this estimate, a stabilization interval Tstab-1 MUST be 743 calculated as: 745 Tf 746 Tstab-1 = ----------- 747 log2^2(N) 749 On the other hand, the estimated join rate L can be used to calculate 750 the time in which N new peers join the overlay. Based on the 751 estimate of L, a stabilization interval Tstab-2 MUST be calculated 752 as: 754 N 755 Tstab-2 = --------------- 756 L * log2^2(N) 758 Finally, the actual stabilization interval Tstab that MUST be used 759 can be obtained by taking the minimum of Tstab-1 and Tstab-2. 761 The results obtained in [maenpaa2009] indicate that making the 762 stabilization interval too small has the effect of making the overlay 763 less stable (e.g., in terms of detected loops and path failures). 764 Thus, a lower limit should be used for the stabilization period. 765 Based on the results in [maenpaa2009], a lower limit of 15s is 766 RECOMMENDED, since using a stabilization period smaller than this 767 will with a high probability cause too much traffic in the overlay. 769 7. Overlay Configuration Document Extension 771 This document extends the RELOAD overlay configuration document by 772 adding one new element, "number-of-peers-to-probe", inside each 773 "configuration" element. 775 self-tuning:number-of-peers-to-probe: The number of fingers to which 776 Probe requests are sent to obtain their network size, join rate, 777 and leave rate estimates. The default value is 4. 779 This new element is formally defined as follows: 781 namespace self-tuning = "urn:ietf:params:xml:ns:p2p:self-tuning" 782 parameter &= element self-tuning:number-of-peers-to-probe { 783 xsd:unsignedInt } 785 This namespace is added into the element in the 786 overlay configuration file. 788 8. Security Considerations 790 In the same way as malicious or compromised peers implementing the 791 RELOAD base protocol [I-D.ietf-p2psip-base] can advertise false 792 network metrics or distribute false routing table information for 793 instance in RELOAD Update messages, malicious peers implementing this 794 specification may share false join rate, leave rate, and network size 795 estimates. For such attacks, the same security concerns apply as in 796 the RELOAD base specification. In addition, as long as the amount of 797 malicious peers in the overlay remains modest, the statistical 798 mechanisms applied in Section 6.5 (i.e., the use of 75th percentiles) 799 to process the shared estimates a peer obtains help ensuring that 800 estimates that are clearly different from (i.e., larger or smaller 801 than) other received estimates will not significantly influence the 802 process of adapting the stabilization interval and routing table 803 size. 805 9. IANA Considerations 807 9.1. Message Extensions 809 This document introduces one additional extension to the "RELOAD 810 Extensions" Registry: 812 +------------------+-------+---------------+ 813 | Extension Name | Code | Specification | 814 +------------------+-------+---------------+ 815 | self_tuning_data | 3 | RFC-AAAA | 816 +------------------+-------+---------------+ 818 The contents of the extension are defined in Section 6.5. 820 Note to RFC Editor: please replace AAAA with the RFC number for this 821 specification. 823 10. Acknowledgments 825 The authors would like to thank Jani Hautakorpi for his contributions 826 to the document. The authors would also like to thank Carlos 827 Bernardos for his comments on the document. 829 11. References 831 11.1. Normative References 833 [I-D.ietf-p2psip-base] 834 Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and 835 H. Schulzrinne, "REsource LOcation And Discovery (RELOAD) 836 Base Protocol", draft-ietf-p2psip-base-26 (work in 837 progress), February 2013. 839 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 840 Requirement Levels", BCP 14, RFC 2119, March 1997. 842 [RFC5245] Rosenberg, J., "Interactive Connectivity Establishment 843 (ICE): A Protocol for Network Address Translator (NAT) 844 Traversal for Offer/Answer Protocols", RFC 5245, April 845 2010. 847 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, 848 "Session Traversal Utilities for NAT (STUN)", RFC 5389, 849 October 2008. 851 11.2. Informative References 853 [CAN] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S. 854 Schenker, "A Scalable Content-Addressable Network", In 855 Proceedings of the 2001 Conference on Applications, 856 Technologies, Architectures and Protocols for Computer 857 Communications pp. 161-172, August 2001. 859 [Chord] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., 860 Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A 861 Scalable Peer-to-peer Lookup Service for Internet 862 Applications", IEEE/ACM Transactions on Networking Volume 863 11, Issue 1, pp. 17-32, February 2003. 865 [I-D.ietf-p2psip-concepts] 866 Bryan, D., Matthews, P., Shim, E., Willis, D., and S. 867 Dawkins, "Concepts and Terminology for Peer to Peer SIP", 868 draft-ietf-p2psip-concepts-05 (work in progress), July 869 2013. 871 [Pastry] Rowstron, A. and P. Druschel, "Pastry: Scalable, 872 Decentralized Object Location and Routing for Large-Scale 873 Peer-to-Peer Systems", In Proceedings of the IFIP/ACM 874 International Conference on Distribued Systems Platforms 875 pp. 329-350, November 2001. 877 [binzenhofer2006] 878 Binzenhofer, A., Kunzmann, G., and R. Henjes, "A Scalable 879 Algorithm to Monitor Chord-Based P2P Systems at Runtime", 880 In Proceedings of the 20th IEEE International Parallel and 881 Distributed Processing Symposium (IPDPS) pp. 1-8, April 882 2006. 884 [ghinita2006] 885 Ghinita, G. and Y. Teo, "An Adaptive Stabilization 886 Framework for Distributed Hash Tables", In Proceedings of 887 the 20th IEEE International Parallel and Distributed 888 Processing Symposium (IPDPS) pp. 29-38, April 2006. 890 [horowitz2003] 891 Horowitz, K. and D. Malkhi, "Estimating Network Size from 892 Local Information", Information Processing Letters Volume 893 88, Issue 5, pp. 237-243, December 2003. 895 [kostoulas2005] 896 Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and 897 A. Demers, "Decentralized Schemes for Size Estimation in 898 Large and Dynamic Groups", In Proceedings of the 4th IEEE 899 International Symposium on Network Computing and 900 Applications pp. 41-48, July 2005. 902 [krishnamurthy2008] 903 Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. 904 Haridi, "Comparing Maintenance Strategies for Overlays", 905 In Proceedings of the 16th Euromicro Conference on 906 Parallel, Distributed and Network-Based Processing pp. 907 473-482, February 2008. 909 [li2004] Li, J., Strinbling, J., Gil, T., Morris, R., and M. 910 Kaashoek, "Comparing the Performance of Distributed Hash 911 Tables Under Churn", Peer-to-Peer Systems III, volume 3279 912 of Lecture Notes in Computer Science Springer, pp. 87-99, 913 February 2005. 915 [liben-nowell2002] 916 Liben-Nowell, D., Balakrishnan, H., and D. Karger, 917 "Observations on the Dynamic Evolution of Peer-to-Peer 918 Networks", In Proceedings of the 1st International 919 Workshop on Peer-to-Peer Systems (IPTPS) pp. 22-33, March 920 2002. 922 [maenpaa2009] 923 Maenpaa, J. and G. Camarillo, "A Study on Maintenance 924 Operations in a Chord-Based Peer-to-Peer Session 925 Initiation Protocol Overlay Network", In Proceedings of 926 the 23rd IEEE International Parallel and Distributed 927 Processing Symposium (IPDPS) pp. 1-9, May 2009. 929 [mahajan2003] 930 Mahajan, R., Castro, M., and A. Rowstron, "Controlling the 931 Cost of Reliability in Peer-to-Peer Overlays", In 932 Proceedings of the 2nd International Workshop on Peer-to- 933 Peer Systems (IPTPS) pp. 21-32, February 2003. 935 [rhea2004] 936 Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, 937 "Handling Churn in a DHT", In Proceedings of the USENIX 938 Annual Technical Conference pp. 127-140, June 2004. 940 Authors' Addresses 942 Jouni Maenpaa 943 Ericsson 944 Hirsalantie 11 945 Jorvas 02420 946 Finland 948 Email: Jouni.Maenpaa@ericsson.com 950 Gonzalo Camarillo 951 Ericsson 952 Hirsalantie 11 953 Jorvas 02420 954 Finland 956 Email: Gonzalo.Camarillo@ericsson.com