idnits 2.17.1 draft-ietf-p2psip-self-tuning-14.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (June 16, 2014) is 3573 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) ** Obsolete normative reference: RFC 5245 (Obsoleted by RFC 8445, RFC 8839) ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489) Summary: 2 errors (**), 0 flaws (~~), 1 warning (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 P2PSIP Working Group J. Maenpaa 3 Internet-Draft G. Camarillo 4 Intended status: Standards Track Ericsson 5 Expires: December 18, 2014 June 16, 2014 7 Self-tuning Distributed Hash Table (DHT) for REsource LOcation And 8 Discovery (RELOAD) 9 draft-ietf-p2psip-self-tuning-14.txt 11 Abstract 13 REsource LOcation And Discovery (RELOAD) is a peer-to-peer (P2P) 14 signaling protocol that provides an overlay network service. Peers 15 in a RELOAD overlay network collectively run an overlay algorithm to 16 organize the overlay, and to store and retrieve data. This document 17 describes how the default topology plugin of RELOAD can be extended 18 to support self-tuning, that is, to adapt to changing operating 19 conditions such as churn and network size. 21 Status of This Memo 23 This Internet-Draft is submitted in full conformance with the 24 provisions of BCP 78 and BCP 79. 26 Internet-Drafts are working documents of the Internet Engineering 27 Task Force (IETF). Note that other groups may also distribute 28 working documents as Internet-Drafts. The list of current Internet- 29 Drafts is at http://datatracker.ietf.org/drafts/current/. 31 Internet-Drafts are draft documents valid for a maximum of six months 32 and may be updated, replaced, or obsoleted by other documents at any 33 time. It is inappropriate to use Internet-Drafts as reference 34 material or to cite them other than as "work in progress." 36 This Internet-Draft will expire on December 18, 2014. 38 Copyright Notice 40 Copyright (c) 2014 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with respect 48 to this document. Code Components extracted from this document must 49 include Simplified BSD License text as described in Section 4.e of 50 the Trust Legal Provisions and are provided without warranty as 51 described in the Simplified BSD License. 53 Table of Contents 55 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 56 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 57 3. Introduction to Stabilization in DHTs . . . . . . . . . . . . 5 58 3.1. Reactive vs. Periodic Stabilization . . . . . . . . . . . 5 59 3.2. Configuring Periodic Stabilization . . . . . . . . . . . 6 60 3.3. Adaptive Stabilization . . . . . . . . . . . . . . . . . 8 61 4. Introduction to Chord . . . . . . . . . . . . . . . . . . . . 8 62 5. Extending Chord-reload to Support Self-tuning . . . . . . . . 9 63 5.1. Update Requests . . . . . . . . . . . . . . . . . . . . . 10 64 5.2. Neighbor Stabilization . . . . . . . . . . . . . . . . . 11 65 5.3. Finger Stabilization . . . . . . . . . . . . . . . . . . 11 66 5.4. Adjusting Finger Table Size . . . . . . . . . . . . . . . 11 67 5.5. Detecting Partitioning . . . . . . . . . . . . . . . . . 12 68 5.6. Leaving the Overlay . . . . . . . . . . . . . . . . . . . 12 69 6. Self-tuning Chord Parameters . . . . . . . . . . . . . . . . 12 70 6.1. Estimating Overlay Size . . . . . . . . . . . . . . . . . 12 71 6.2. Determining Routing Table Size . . . . . . . . . . . . . 13 72 6.3. Estimating Failure Rate . . . . . . . . . . . . . . . . . 13 73 6.3.1. Detecting Failures . . . . . . . . . . . . . . . . . 14 74 6.4. Estimating Join Rate . . . . . . . . . . . . . . . . . . 15 75 6.5. Estimate Sharing . . . . . . . . . . . . . . . . . . . . 15 76 6.6. Calculating the Stabilization Interval . . . . . . . . . 17 77 7. Overlay Configuration Document Extension . . . . . . . . . . 18 78 8. Security Considerations . . . . . . . . . . . . . . . . . . . 18 79 9. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 80 9.1. Message Extensions . . . . . . . . . . . . . . . . . . . 19 81 9.2. A New IETF XML Registry . . . . . . . . . . . . . . . . . 19 82 10. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 19 83 11. References . . . . . . . . . . . . . . . . . . . . . . . . . 19 84 11.1. Normative References . . . . . . . . . . . . . . . . . . 19 85 11.2. Informative References . . . . . . . . . . . . . . . . . 20 86 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 22 88 1. Introduction 90 REsource LOcation And Discovery (RELOAD) [RFC6940] is a peer-to-peer 91 signaling protocol that can be used to maintain an overlay network, 92 and to store data in and retrieve data from the overlay. For 93 interoperability reasons, RELOAD specifies one overlay algorithm, 94 called chord-reload, that is mandatory to implement. This document 95 extends the chord-reload algorithm by introducing self-tuning 96 behavior. 98 Distributed Hash Table (DHT) based overlay networks are self- 99 organizing, scalable and reliable. However, these features come at a 100 cost: peers in the overlay network need to consume network bandwidth 101 to maintain routing state. Most DHTs use a periodic stabilization 102 routine to counter the undesirable effects of churn on routing. To 103 configure the parameters of a DHT, some characteristics such as churn 104 rate and network size need to be known in advance. These 105 characteristics are then used to configure the DHT in a static 106 fashion by using fixed values for parameters such as the size of the 107 successor set, size of the routing table, and rate of maintenance 108 messages. The problem with this approach is that it is not possible 109 to achieve a low failure rate and a low communication overhead by 110 using fixed parameters. Instead, a better approach is to allow the 111 system to take into account the evolution of network conditions and 112 adapt to them. 114 This document extends the mandatory-to-implement chord-reload 115 algorithm by making it self-tuning. The use of the self-tuning 116 feature is optional. However, when used, it needs to be supported by 117 all peers in the RELOAD overlay network. Two main advantages of 118 self-tuning are that users no longer need to tune every DHT parameter 119 correctly for a given operating environment and that the system 120 adapts to changing operating conditions. 122 The remainder of this document is structured as follows: Section 2 123 provides definitions of terms used in this document. Section 3 124 discusses alternative approaches to stabilization operations in DHTs, 125 including reactive stabilization, periodic stabilization, and 126 adaptive stabilization. Section 4 gives an introduction to the Chord 127 DHT algorithm. Section 5 describes how this document extends the 128 stabilization routine of the chord-reload algorithm. Section 6 129 describes how the stabilization rate and routing table size are 130 calculated in an adaptive fashion. 132 2. Terminology 134 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 135 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 136 "OPTIONAL" in this document are to be interpreted as described in RFC 137 2119 [RFC2119]. 139 This document uses terminology and definitions from the RELOAD base 140 specification [RFC6940]. 142 numBitsInNodeId: Specifies the number of bits in a RELOAD Node-ID. 144 DHT: Distributed Hash Tables (DHTs) are a class of decentralized 145 distributed systems that provide a lookup service similar to a 146 regular hash table. Given a key, any peer participating in the 147 system can retrieve the value associated with that key. The 148 responsibility for maintaining the mapping from keys to values is 149 distributed among the peers. 151 Chord Ring: The Chord DHT uses ring topology and orders identifiers 152 on an identifier circle of size 2^numBitsInNodeId. This 153 identifier circle is called the Chord ring. On the Chord ring, 154 the responsibility for a key k is assigned to the node whose 155 identifier equals to or immediately follows k. 157 Finger Table: A data structure with up to (but typically less than) 158 numBitsInNodeId entries maintained by each peer in a Chord-based 159 overlay. The ith entry in the finger table of peer n contains the 160 identity of the first peer that succeeds n by at least 161 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the 162 ith finger of peer n. As an example, the first entry in the 163 finger table of peer n contains a peer half-way around the Chord 164 ring from peer n. The purpose of the finger table is to 165 accelerate lookups. 167 n.id: An abbreviation that is in this document used refer to the 168 Node-ID of peer n. 170 O(g(n)): Informally, saying that some equation f(n) = O(g(n)) means 171 that f(n) is less than some constant multiple of g(n). For the 172 formal definition, please refer to [weiss1998]. 174 Omega(g(n)): Informally, saying that some equation f(n) = 175 Omega(g(n)) means that f(n) is more than some constant multiple of 176 g(n). For the formal definition, please refer to [weiss1998] 178 Percentile: The Pth (0<=P<=100) percentile of N values arranged in 179 ascending order is obtained by first calculating the (ordinal) 180 rank n=(P/100)*N, rounding the result to the nearest integer, and 181 then taking the value corresponding to that rank. 183 Predecessor List: A data structure containing the first r 184 predecessors of a peer on the Chord ring. 186 Successor List: A data structure containing the first r successors 187 of a peer on the Chord ring. 189 Neighborhood Set: A term used to refer to the set of peers included 190 in the successor and predecessor lists of a given peer. 192 Routing Table: Contents of a given peer's routing table include the 193 set of peers that the peer can use to route overlay messages. The 194 routing table is made up of the finger table, successor list and 195 predecessor list. 197 3. Introduction to Stabilization in DHTs 199 DHTs use stabilization routines to counter the undesirable effects of 200 churn on routing. The purpose of stabilization is to keep the 201 routing information of each peer in the overlay consistent with the 202 constantly changing overlay topology. There are two alternative 203 approaches to stabilization: periodic and reactive [rhea2004]. 204 Periodic stabilization can either use a fixed stabilization rate or 205 calculate the stabilization rate in an adaptive fashion. 207 3.1. Reactive vs. Periodic Stabilization 209 In reactive stabilization, a peer reacts to the loss of a peer in its 210 neighborhood set or to the appearance of a new peer that should be 211 added to its neighborhood set by sending a copy of its neighbor table 212 to all peers in the neighborhood set. Periodic recovery, in 213 contrast, takes place independently of changes in the neighborhood 214 set. In periodic recovery, a peer periodically shares its 215 neighborhood set with each or a subset of the members of that set. 217 The chord-reload algorithm [RFC6940] supports both reactive and 218 periodic stabilization. It has been shown in [rhea2004] that 219 reactive stabilization works well for small neighborhood sets (i.e., 220 small overlays) and moderate churn. However, in large-scale (e.g., 221 1000 peers or more [rhea2004]) or high-churn overlays, reactive 222 stabilization runs the risk of creating a positive feedback cycle, 223 which can eventually result in congestion collapse. In [rhea2004], 224 it is shown that a 1000-peer overlay under churn uses significantly 225 less bandwidth and has lower latencies when periodic stabilization is 226 used than when reactive stabilization is used. Although in the 227 experiments carried out in [rhea2004], reactive stabilization 228 performed well when there was no churn, its bandwidth use was 229 observed to jump dramatically under churn. At higher churn rates and 230 larger scale overlays, periodic stabilization uses less bandwidth and 231 the resulting lower contention for the network leads to lower 232 latencies. For this reason, most DHTs such as CAN [CAN], Chord 233 [Chord], Pastry [Pastry], Bamboo [rhea2004], etc. use periodic 234 stabilization [ghinita2006]. As an example, the first version of 235 Bamboo used reactive stabilization, which caused Bamboo to suffer 236 from degradation in performance under churn. To fix this problem, 237 Bamboo was modified to use periodic stabilization. 239 In Chord, periodic stabilization is typically done both for 240 successors and fingers. An alternative strategy is analyzed in 241 [krishnamurthy2008]. In this strategy, called the correction-on- 242 change maintenance strategy, a peer periodically stabilizes its 243 successors but does not do so for its fingers. Instead, finger 244 pointers are stabilized in a reactive fashion. The results obtained 245 in [krishnamurthy2008] imply that although the correction-on-change 246 strategy works well when churn is low, periodic stabilization 247 outperforms the correction-on-change strategy when churn is high. 249 3.2. Configuring Periodic Stabilization 251 When periodic stabilization is used, one faces the problem of 252 selecting an appropriate execution rate for the stabilization 253 procedure. If the execution rate of periodic stabilization is high, 254 changes in the system can be quickly detected, but at the 255 disadvantage of increased communication overhead. Alternatively, if 256 the stabilization rate is low and the churn rate is high, routing 257 tables become inaccurate and DHT performance deteriorates. Thus, the 258 problem is setting the parameters so that the overlay achieves the 259 desired reliability and performance even in challenging conditions, 260 such as under heavy churn. This naturally results in high cost 261 during periods when the churn level is lower than expected, or 262 alternatively, poor performance or even network partitioning in worse 263 than expected conditions. 265 In addition to selecting an appropriate stabilization interval, 266 regardless of whether periodic stabilization is used or not, an 267 appropriate size needs to be selected for the neighborhood set and 268 for the finger table. 270 The current approach is to configure overlays statically. This works 271 in situations where perfect information about the future is 272 available. In situations where the operating conditions of the 273 network are known in advance and remain static throughout the 274 lifetime of the system, it is possible to choose fixed optimal values 275 for parameters such as stabilization rate, neighborhood set size and 276 routing table size. However, if the operating conditions (e.g., the 277 size of the overlay and its churn rate) do not remain static but 278 evolve with time, it is not possible to achieve both a low lookup 279 failure rate and a low communication overhead by using fixed 280 parameters [ghinita2006]. 282 As an example, to configure the Chord DHT algorithm, one needs to 283 select values for the following parameters: size of successor list, 284 stabilization interval, and size of the finger table. To select an 285 appropriate value for the stabilization interval, one needs to know 286 the expected churn rate and overlay size. According to 287 [liben-nowell2002], a Chord network in a ring-like state remains in a 288 ring-like state as long as peers send Omega(square(log(N))) messages 289 before N new peers join or N/2 peers fail. Thus, in a 500-peer 290 overlay churning at a rate such that one peer joins and one peer 291 leaves the network every 30 seconds, an appropriate stabilization 292 interval would be on the order of 93s. According to [Chord], the 293 size of the successor list and finger table should be on the order of 294 log(N). Already a successor list of a modest size (e.g., log2(N) or 295 2*log2(N), which is the successor list size used in [Chord]) makes it 296 very unlikely that a peer will lose all of its successors, which 297 would cause the Chord ring to become disconnected. Thus, in a 298 500-peer network each peer should maintain on the order of nine 299 successors and fingers. However, if the churn rate doubles and the 300 network size remains unchanged, the stabilization rate should double 301 as well. That is, the appropriate maintenance interval would now be 302 on the order of 46s. On the other hand, if the churn rate becomes 303 e.g. six-fold and the size of the network grows to 2000 peers, on the 304 order of eleven fingers and successors should be maintained and the 305 stabilization interval should be on the order of 42s. If one 306 continued using the old values, this could result in inaccurate 307 routing tables, network partitioning, and deteriorating performance. 309 3.3. Adaptive Stabilization 311 A self-tuning DHT takes into consideration the continuous evolution 312 of network conditions and adapts to them. In a self-tuning DHT, each 313 peer collects statistical data about the network and dynamically 314 adjusts its stabilization rate, neighborhood set size, and finger 315 table size based on the analysis of the data [ghinita2006]. 316 Reference [mahajan2003] shows that by using self-tuning, it is 317 possible to achieve high reliability and performance even in adverse 318 conditions with low maintenance cost. Adaptive stabilization has 319 been shown to outperform periodic stabilization in terms of both 320 lookup failures and communication overhead [ghinita2006]. 322 4. Introduction to Chord 324 Chord [Chord] is a structured P2P algorithm that uses consistent 325 hashing to build a DHT out of several independent peers. Consistent 326 hashing assigns each peer and resource a fixed-length identifier. 327 Peers use SHA-1 as the base hash fuction to generate the identifiers. 328 As specified in RELOAD base, the length of the identifiers is 329 numBitsInNodeId=128 bits. The identifiers are ordered on an 330 identifier circle of size 2^numBitsInNodeId. On the identifier 331 circle, key k is assigned to the first peer whose identifier equals 332 or follows the identifier of k in the identifier space. The 333 identifier circle is called the Chord ring. 335 Different DHTs differ significantly in performance when bandwidth is 336 limited. It has been shown that when compared to other DHTs, the 337 advantages of Chord include that it uses bandwidth efficiently and 338 can achieve low lookup latencies at little cost [li2004]. 340 A simple lookup mechanism could be implemented on a Chord ring by 341 requiring each peer to only know how to contact its current successor 342 on the identifier circle. Queries for a given identifier could then 343 be passed around the circle via the successor pointers until they 344 encounter the first peer whose identifier is equal to or larger than 345 the desired identifier. Such a lookup scheme uses a number of 346 messages that grows linearly with the number of peers. To reduce the 347 cost of lookups, Chord maintains also additional routing information; 348 each peer n maintains a data structure with up to numBitsInNodeId 349 entries, called the finger table. The first entry in the finger 350 table of peer n contains the peer half-way around the ring from peer 351 n. The second entry contains the peer that is 1/4th of the way 352 around, the third entry the peer that is 1/8th of the way around, 353 etc. In other words, the ith entry in the finger table at peer n 354 contains the identity of the first peer s that succeeds n by at least 355 2^(numBitsInNodeId-i) on the Chord ring. This peer is called the ith 356 finger of peer n. The interval between two consecutive fingers is 357 called a finger interval. The ith finger interval of peer n covers 358 the range [n.id + 2^(numBitsInNodeId-i), n.id + 2^(numBitsInNodeId- 359 i+1)) on the Chord ring. In an N-peer network, each peer maintains 360 information about O(log(N)) other peers in its finger table. As an 361 example, if N=100000, it is sufficient to maintain 17 fingers. 363 Chord needs all peers' successor pointers to be up to date in order 364 to ensure that lookups produce correct results as the set of 365 participating peers changes. To achieve this, peers run a 366 stabilization protocol periodically in the background. The 367 stabilization protocol of the original Chord algorithm uses two 368 operations: successor stabilization and finger stabilization. 369 However, the Chord algorithm of RELOAD base defines two additional 370 stabilization components, as will be discussed below. 372 To increase robustness in the event of peer failures, each Chord peer 373 maintains a successor list of size r, containing the peer's first r 374 successors. The benefit of successor lists is that if each peer 375 fails independently with probability p, the probability that all r 376 successors fail simultaneously is only p^r. 378 The original Chord algorithm maintains only a single predecessor 379 pointer. However, multiple predecessor pointers (i.e., a predecessor 380 list) can be maintained to speed up recovery from predecessor 381 failures. The routing table of a peer consists of the successor 382 list, finger table, and predecessor list. 384 5. Extending Chord-reload to Support Self-tuning 386 This section describes how the mandatory-to-implement chord-reload 387 algorithm defined in RELOAD base [RFC6940] can be extended to support 388 self-tuning. 390 The chord-reload algorithm supports both reactive and periodic 391 recovery strategies. When the self-tuning mechanisms defined in this 392 document are used, the periodic recovery strategy MUST be used. 393 Further, chord-reload specifies that at least three predecessors and 394 three successors need to be maintained. When the self-tuning 395 mechanisms are used, the appropriate sizes of the successor list and 396 predecessor list are determined in an adaptive fashion based on the 397 estimated network size, as will be described in Section 6. 399 As specified in RELOAD base, each peer MUST maintain a stabilization 400 timer. When the stabilization timer fires, the peer MUST restart the 401 timer and carry out the overlay stabilization routine. Overlay 402 stabilization has four components in chord-reload: 404 1. Update the neighbor table. We refer to this as neighbor 405 stabilization. 407 2. Refreshing the finger table. We refer to this as finger 408 stabilization. 410 3. Adjusting finger table size. 412 4. Detecting partitioning. We refer to this as strong 413 stabilization. 415 As specified in RELOAD base [RFC6940], a peer sends periodic messages 416 as part of the neighbor stabilization, finger stabilization, and 417 strong stabilization routines. In neighbor stabilization, a peer 418 periodically sends an Update request to every peer in its Connection 419 Table. The default time is every ten minutes. In finger 420 stabilization, a peer periodically searches for new peers to include 421 in its finger table. This time defaults to one hour. This document 422 specifies how the neighbor stabilization and finger stabilization 423 intervals can be determined in an adaptive fashion based on the 424 operating conditions of the overlay. The subsections below describe 425 how this document extends the four components of stabilization. 427 5.1. Update Requests 429 As described in RELOAD base [RFC6940], the neighbor and finger 430 stabilization procedures are implemented using Update requests. 431 RELOAD base defines three types of Update requests: 'peer_ready', 432 'neighbors', and 'full'. Regardless of the type, all Update requests 433 include an 'uptime' field. Since the self-tuning extensions require 434 information on the uptimes of peers in the routing table, the sender 435 of an Update request MUST include its current uptime in seconds in 436 the 'uptime' field. 438 When self-tuning is used, each peer decides independently the 439 appropriate size for the successor list, predecessor list and finger 440 table. Thus, the 'predecessors', 'successors', and 'fingers' fields 441 included in RELOAD Update requests are of variable length. As 442 specified in RELOAD [RFC6940], variable length fields are on the wire 443 preceded by length bytes. In the case of the successor list, 444 predecessor list, and finger table, there are two length bytes 445 (allowing lengths up to 2^16-1). The number of NodeId structures 446 included in each field can be calculated based on the length bytes 447 since the size of a single NodeId structure is 16 bytes. If a peer 448 receives more entries than fit into its successor list, predecessor 449 list or finger table, the peer MUST ignore the extra entries. If a 450 peer receives less entries than it currently has in its own data 451 structure, the peer MUST NOT drop the extra entries from its data 452 structure. 454 5.2. Neighbor Stabilization 456 In the neighbor stabilization operation of chord-reload, a peer 457 periodically sends an Update request to every peer in its Connection 458 Table. In a small, low-churn overlay, the amount of traffic this 459 process generates is typically acceptable. However, in a large-scale 460 overlay churning at a moderate or high churn rate, the traffic load 461 may no longer be acceptable since the size of the connection table is 462 large and the stabilization interval relatively short. The self- 463 tuning mechanisms described in this document are especially designed 464 for overlays of the latter type. Therefore, when the self-tuning 465 mechanisms are used, each peer MUST send a periodic Update request 466 only to its first predecessor and first successor on the Chord ring. 468 The neighbor stabilization routine MUST be executed when the 469 stabilization timer fires. To begin the neighbor stabilization 470 routine, a peer MUST send an Update request to its first successor 471 and its first predecessor. The type of the Update request MUST be 472 'neighbors'. The Update request MUST include the successor and 473 predecessor lists of the sender. If a peer receiving such an Update 474 request learns from the predecessor and successor lists included in 475 the request that new peers can be included in its neighborhood set, 476 it MUST send Attach requests to the new peers. 478 After a new peer has been added to the predecessor or successor list, 479 an Update request of type 'peer_ready' MUST be sent to the new peer. 480 This allows the new peer to insert the sender into its neighborhood 481 set. 483 5.3. Finger Stabilization 485 Chord-reload specifies two alternative methods for searching for new 486 peers to the finger table. Both of the alternatives can be used with 487 the self-tuning extensions defined in this document. 489 Immediately after a new peer has been added to the finger table, a 490 Probe request MUST be sent to the new peer to fetch its uptime. The 491 requested_info field of the Probe request MUST be set to contain the 492 ProbeInformationType 'uptime' defined in RELOAD base [RFC6940]. 494 5.4. Adjusting Finger Table Size 496 The chord-reload algorithm defines how a peer can make sure that the 497 finger table is appropriately sized to allow for efficient routing. 498 Since the self-tuning mechanisms specified in this document produce a 499 network size estimate, this estimate can be directly used to 500 calculate the optimal size for the finger table. This mechanism MUST 501 be used instead of the one specified by chord-reload. A peer MUST 502 use the network size estimate to determine whether it needs to adjust 503 the size of its finger table each time when the stabilization timer 504 fires. The way this is done is explained in Section 6.2. 506 5.5. Detecting Partitioning 508 This document does not require any changes to the mechanism chord- 509 reload uses to detect network partitioning. 511 5.6. Leaving the Overlay 513 As specified in RELOAD base [RFC6940], a leaving peer SHOULD send a 514 Leave request to all members of its neighbor table prior to leaving 515 the overlay. The overlay_specific_data field MUST contain the 516 ChordLeaveData structure. The Leave requests that are sent to 517 successors MUST contain the predecessor list of the leaving peer. 518 The Leave requests that are sent to the predecessors MUST contain the 519 successor list of the leaving peer. If a given successor can 520 identify better predecessors than are already included in its 521 predecessor lists by investigating the predecessor list it receives 522 from the leaving peer, it MUST send Attach requests to them. 523 Similarly, if a given predecessor identifies better successors by 524 investigating the successor list it receives from the leaving peer, 525 it MUST send Attach requests to them. 527 6. Self-tuning Chord Parameters 529 This section specifies how to determine an appropriate stabilization 530 rate and routing table size in an adaptive fashion. The proposed 531 mechanism is based on [mahajan2003], [liben-nowell2002], and 532 [ghinita2006]. To calculate an appropriate stabilization rate, the 533 values of three parameters must be estimated: overlay size N, failure 534 rate U, and join rate L. To calculate an appropriate routing table 535 size, the estimated network size N can be used. Peers in the overlay 536 MUST re-calculate the values of the parameters to self-tune the 537 chord-reload algorithm at the end of each stabilization period before 538 re-starting the stabilization timer. 540 6.1. Estimating Overlay Size 542 Techniques for estimating the size of an overlay network have been 543 proposed for instance in [mahajan2003], [horowitz2003], 544 [kostoulas2005], [binzenhofer2006], and [ghinita2006]. In Chord, the 545 density of peer identifiers in the neighborhood set can be used to 546 produce an estimate of the size of the overlay, N [mahajan2003]. 548 Since peer identifiers are picked randomly with uniform probability 549 from the numBitsInNodeId-bit identifier space, the average distance 550 between peer identifiers in the successor set is 551 (2^numBitsInNodeId)/N. 553 To estimate the overlay network size, a peer MUST compute the average 554 inter-peer distance d between the successive peers starting from the 555 most distant predecessor and ending to the most distant successor in 556 the successor list. The estimated network size MUST be calculated 557 as: 559 2^numBitsInNodeId 560 N = ------------------- 561 d 563 This estimate has been found to be accurate within 15% of the real 564 network size [ghinita2006]. Of course, the size of the neighborhood 565 set affects the accuracy of the estimate. 567 During the join process, a joining peer fills its routing table by 568 sending a series of Ping and Attach requests, as specified in RELOAD 569 base [RFC6940]. Thus, a joining peer immediately has enough 570 information at its disposal to calculate an estimate of the network 571 size. 573 6.2. Determining Routing Table Size 575 As specified in RELOAD base, the finger table must contain at least 576 16 entries. When the self-tuning mechanisms are used, the size of 577 the finger table MUST be set to max(ceiling(log2(N)), 16) using the 578 estimated network size N. 580 The size of the successor list MUST be set to ceiling(log2(N)). An 581 implementation MAY place a lower limit on the size of the successor 582 list. As an example, the implementation might require the size of 583 the successor list to be always at least three. 585 A peer MAY choose to maintain a fixed-size predecessor list with only 586 three entries as specified in RELOAD base. However, it is 587 RECOMMENDED that a peer maintains ceiling(log2(N)) predecessors. 589 6.3. Estimating Failure Rate 591 A typical approach is to assume that peers join the overlay according 592 to a Poisson process with rate L and leave according to a Poisson 593 process with rate parameter U [mahajan2003]. The value of U can be 594 estimated using peer failures in the finger table and neighborhood 595 set [mahajan2003]. If peers fail with rate U, a peer with M unique 596 peer identifiers in its routing table should observe K failures in 597 time K/(M*U). Every peer in the overlay MUST maintain a history of 598 the last K failures. The current time MUST be inserted into the 599 history when the peer joins the overlay. The estimate of U MUST be 600 calculated as: 602 k 603 U = --------, 604 M * Tk 606 where M is the number of unique peer identifiers in the routing 607 table, Tk is the time between the first and the last failure in the 608 history, and k is the number of failures in the history. If k is 609 smaller than K, the estimate MUST be computed as if there was a 610 failure at the current time. It has been shown that an estimate 611 calculated in a similar manner is accurate within 17% of the real 612 value of U [ghinita2006]. 614 The size of the failure history K affects the accuracy of the 615 estimate of U. One can increase the accuracy by increasing K. 616 However, this has the side effect of decreasing responsiveness to 617 changes in the failure rate. On the other hand, a small history size 618 may cause a peer to overreact each time a new failure occurs. In 619 [ghinita2006], K is set to 25% of the routing table size. Use of 620 this value is RECOMMENDED. 622 6.3.1. Detecting Failures 624 A new failure MUST be inserted to the failure history in the 625 following cases: 627 1. A Leave request is received from a neigbhor. 629 2. A peer fails to reply to a Ping request sent in the situation 630 explained below. If no packets have been received on a 631 connection during the past 2*Tr seconds (where Tr is the 632 inactivity timer defined by ICE [RFC5245]), a RELOAD Ping request 633 MUST be sent to the remote peer. RELOAD mandates the use of STUN 634 [RFC5389] for keepalives. STUN keepalives take the form of STUN 635 Binding Indication transactions. As specified in ICE [RFC5245], 636 a peer sends a STUN Binding Indication if there has been no 637 packet sent on a connection for Tr seconds. Tr is configurable 638 and has a default of 15 seconds. Although STUN Binding 639 Indications do not generate a response, the fact that a peer has 640 failed can be learned from the lack of packets (Binding 641 Indications or application protocol packets) received from the 642 peer. If the remote peer fails to reply to the Ping request, the 643 sender MUST consider the remote peer to have failed. 645 As an alternative to relying on STUN keepalives to detect peer 646 failure, a peer could send additional, frequent RELOAD messages to 647 every peer in its Connection Table. These messages could be Update 648 requests, in which case they would serve two purposes: detecting peer 649 failure and stabilization. However, as the cost of this approach can 650 be very high in terms of bandwidth consumption and traffic load, 651 especially in large-scale overlays experiencing churn, its use is NOT 652 RECOMMENDED. 654 6.4. Estimating Join Rate 656 Reference [ghinita2006] proposes that a peer can estimate the join 657 rate based on the uptime of the peers in its routing table. An 658 increase in peer join rate will be reflected by a decrease in the 659 average age of peers in the routing table. Thus, each peer MUST 660 maintain an array of the ages of the peers in its routing table 661 sorted in increasing order. Using this information, an estimate of 662 the global peer join rate L MUST be calculated as: 664 N 665 L = ----------------------, 666 Ages[floor(rsize/2)] 668 where Ages is an array containing the ages of the peers in the 669 routing table sorted in increasing order and rsize is the size of the 670 routing table. It has been shown that the estimate obtained by using 671 this method is accurate within 22% of the real join rate 672 [ghinita2006]. Of course, the size of the routing table affects the 673 accuracy. 675 In order for this mechanism to work, peers need to exchange 676 information about the time they have been present in the overlay. 677 Peers receive the uptimes of their successors and predecessors during 678 the stabilization operations since all Update requests carry uptime 679 values. A joining peer learns the uptime of the admitting peer since 680 it receives an Update from the admitting peer during the join 681 procedure. Peers learn the uptimes of new fingers since they can 682 fetch the uptime using a Probe request after having attached to the 683 new finger. 685 6.5. Estimate Sharing 687 To improve the accuracy of network size, join rate, and leave rate 688 estimates, peers MUST share their estimates. When the stabilization 689 timer fires, a peer MUST select number-of-peers-to-probe random peers 690 from its finger table and send each of them a Probe request. The 691 targets of Probe requests are selected from the finger table rather 692 than from the neighbor table since neighbors are likely to make 693 similar errors when calculating their estimates. number-of-peers-to- 694 probe is a new element in the overlay configuration document. It is 695 defined in Section 7. Both the Probe request and the answer returned 696 by the target peer MUST contain a new message extension whose 697 MessageExtensionType is 'self_tuning_data'. This extension type is 698 defined in Section 9.1. The extension_contents field of the 699 MessageExtension structure MUST contain a SelfTuningData structure: 701 struct { 702 uint32 network_size; 703 uint32 join_rate; 704 uint32 leave_rate; 705 } SelfTuningData; 707 The contents of the SelfTuningData structure are as follows: 709 network_size 711 The latest network size estimate calculated by the sender. 713 join_rate 715 The latest join rate estimate calculated by the sender. 717 leave_rate 719 The latest leave rate estimate calculated by the sender. 721 The join and leave rates are expressed as joins or failures per 24 722 hours. As an example, if the global join rate estimate a peer has 723 calculated is 0.123 peers/s, it would include in the join_rate field 724 the ceiling of the value 10627.2 (24*60*60*0.123 = 10627.2), that is, 725 the value 10628. 727 The 'type' field of the MessageExtension structure MUST be set to 728 contain the value 'self_tuning_data'. The 'critical' field of the 729 structure MUST be set to False. 731 A peer MUST store all estimates it receives in Probe requests and 732 answers during a stabilization interval. When the stabilization 733 timer fires, the peer MUST calculate the estimates to be used during 734 the next stabilization interval by taking the 75th percentile (i.e., 735 third quartile) of a data set containing its own estimate and the 736 received estimates. 738 The default value for number-of-peers-to-probe is 4. This default 739 value is recommended to allow a peer to receive a sufficiently large 740 set of estimates from other peers. With a value of 4, a peer 741 receives four estimates in Probe answers. On the average, each peer 742 also receives four Probe requests each carrying an estimate. Thus, 743 on the average, each peer has nine estimates (including its own) that 744 it can use at the end of the stablization interval. A value smaller 745 than 4 is NOT RECOMMENDED to keep the number of received estimates 746 high enough. As an example, if the value were 2, there would be 747 peers in the overlay that would only receive two estimates during a 748 stabilization interval. Such peers would only have three estimates 749 available at the end of the interval, which may not be reliable 750 enough since even a single exceptionally high or low estimate can 751 have a large impact. 753 6.6. Calculating the Stabilization Interval 755 According to [liben-nowell2002], a Chord network in a ring-like state 756 remains in a ring-like state as long as peers send 757 Omega(square(log(N))) messages before N new peers join or N/2 peers 758 fail. We can use the estimate of peer failure rate, U, to calculate 759 the time Tf in which N/2 peers fail: 761 1 762 Tf = ------ 763 2*U 765 Based on this estimate, a stabilization interval Tstab-1 MUST be 766 calculated as: 768 Tf 769 Tstab-1 = ----------------- 770 square(log2(N)) 772 On the other hand, the estimated join rate L can be used to calculate 773 the time in which N new peers join the overlay. Based on the 774 estimate of L, a stabilization interval Tstab-2 MUST be calculated 775 as: 777 N 778 Tstab-2 = --------------------- 779 L * square(log2(N)) 781 Finally, the actual stabilization interval Tstab that MUST be used 782 can be obtained by taking the minimum of Tstab-1 and Tstab-2. 784 The results obtained in [maenpaa2009] indicate that making the 785 stabilization interval too small has the effect of making the overlay 786 less stable (e.g., in terms of detected loops and path failures). 787 Thus, a lower limit should be used for the stabilization period. 788 Based on the results in [maenpaa2009], a lower limit of 15s is 789 RECOMMENDED, since using a stabilization period smaller than this 790 will with a high probability cause too much traffic in the overlay. 792 7. Overlay Configuration Document Extension 794 This document extends the RELOAD overlay configuration document by 795 adding one new element, "number-of-peers-to-probe", inside each 796 "configuration" element. 798 self-tuning:number-of-peers-to-probe: The number of fingers to which 799 Probe requests are sent to obtain their network size, join rate, 800 and leave rate estimates. The default value is 4. 802 The Relax NG Grammar for this element is: 804 namespace self-tuning = "urn:ietf:params:xml:ns:p2p:self-tuning" 806 parameter &= element self-tuning:number-of-peers-to-probe { 807 xsd:unsignedInt }? 809 This namespace is added into the element in the 810 overlay configuration file. 812 8. Security Considerations 814 In the same way as malicious or compromised peers implementing the 815 RELOAD base protocol [RFC6940] can advertise false network metrics or 816 distribute false routing table information for instance in RELOAD 817 Update messages, malicious peers implementing this specification may 818 share false join rate, leave rate, and network size estimates. For 819 such attacks, the same security concerns apply as in the RELOAD base 820 specification. In addition, as long as the amount of malicious peers 821 in the overlay remains modest, the statistical mechanisms applied in 822 Section 6.5 (i.e., the use of 75th percentiles) to process the shared 823 estimates a peer obtains help ensure that estimates that are clearly 824 different from (i.e., larger or smaller than) other received 825 estimates will not significantly influence the process of adapting 826 the stabilization interval and routing table size. However, it 827 should be noted that if an attacker is able to impersonate a high 828 number of other peers in the overlay in strategic locations, it may 829 be able to send a high enough number of false estimates to a victim 830 and therefore influence the victim's choice of a stabilization 831 interval. 833 9. IANA Considerations 835 9.1. Message Extensions 837 This document introduces one additional extension to the "RELOAD 838 Extensions" Registry: 840 +------------------+-------+---------------+ 841 | Extension Name | Code | Specification | 842 +------------------+-------+---------------+ 843 | self_tuning_data | 0x3 | RFC-AAAA | 844 +------------------+-------+---------------+ 846 The contents of the extension are defined in Section 6.5. 848 Note to RFC Editor: please replace AAAA with the RFC number for this 849 specification. 851 9.2. A New IETF XML Registry 853 This document registers one new URI for the self-tuning namespace in 854 the "ns" subregistry of the IETF XML registry defined in [RFC3688]. 856 URI: urn:ietf:params:xml:ns:p2p:self-tuning 858 Registrant Contact: The IESG 860 XML: N/A, the requested URI is an XML namespace 862 10. Acknowledgments 864 The authors would like to thank Jani Hautakorpi for his contributions 865 to the document. The authors would also like to thank Carlos 866 Bernardos, Martin Durst, Alissa Cooper, Tobias Gondrom, and Barry 867 Leiba for their comments on the document. 869 11. References 871 11.1. Normative References 873 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 874 Requirement Levels", BCP 14, RFC 2119, March 1997. 876 [RFC5245] Rosenberg, J., "Interactive Connectivity Establishment 877 (ICE): A Protocol for Network Address Translator (NAT) 878 Traversal for Offer/Answer Protocols", RFC 5245, April 879 2010. 881 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, 882 "Session Traversal Utilities for NAT (STUN)", RFC 5389, 883 October 2008. 885 [RFC6940] Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and 886 H. Schulzrinne, "REsource LOcation And Discovery (RELOAD) 887 Base Protocol", RFC 6940, January 2014. 889 11.2. Informative References 891 [CAN] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and S. 892 Schenker, "A Scalable Content-Addressable Network", In 893 Proceedings of the 2001 Conference on Applications, 894 Technologies, Architectures and Protocols for Computer 895 Communications pp. 161-172, August 2001. 897 [Chord] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., 898 Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A 899 Scalable Peer-to-peer Lookup Service for Internet 900 Applications", IEEE/ACM Transactions on Networking Volume 901 11, Issue 1, pp. 17-32, February 2003. 903 [Pastry] Rowstron, A. and P. Druschel, "Pastry: Scalable, 904 Decentralized Object Location and Routing for Large-Scale 905 Peer-to-Peer Systems", In Proceedings of the IFIP/ACM 906 International Conference on Distribued Systems Platforms 907 pp. 329-350, November 2001. 909 [RFC3688] Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688, 910 January 2004. 912 [binzenhofer2006] 913 Binzenhofer, A., Kunzmann, G., and R. Henjes, "A Scalable 914 Algorithm to Monitor Chord-Based P2P Systems at Runtime", 915 In Proceedings of the 20th IEEE International Parallel and 916 Distributed Processing Symposium (IPDPS) pp. 1-8, April 917 2006. 919 [ghinita2006] 920 Ghinita, G. and Y. Teo, "An Adaptive Stabilization 921 Framework for Distributed Hash Tables", In Proceedings of 922 the 20th IEEE International Parallel and Distributed 923 Processing Symposium (IPDPS) pp. 29-38, April 2006. 925 [horowitz2003] 926 Horowitz, K. and D. Malkhi, "Estimating Network Size from 927 Local Information", Information Processing Letters Volume 928 88, Issue 5, pp. 237-243, December 2003. 930 [kostoulas2005] 931 Kostoulas, D., Psaltoulis, D., Gupta, I., Birman, K., and 932 A. Demers, "Decentralized Schemes for Size Estimation in 933 Large and Dynamic Groups", In Proceedings of the 4th IEEE 934 International Symposium on Network Computing and 935 Applications pp. 41-48, July 2005. 937 [krishnamurthy2008] 938 Krishnamurthy, S., El-Ansary, S., Aurell, E., and S. 939 Haridi, "Comparing Maintenance Strategies for Overlays", 940 In Proceedings of the 16th Euromicro Conference on 941 Parallel, Distributed and Network-Based Processing pp. 942 473-482, February 2008. 944 [li2004] Li, J., Strinbling, J., Gil, T., Morris, R., and M. 945 Kaashoek, "Comparing the Performance of Distributed Hash 946 Tables Under Churn", Peer-to-Peer Systems III, volume 3279 947 of Lecture Notes in Computer Science Springer, pp. 87-99, 948 February 2005. 950 [liben-nowell2002] 951 Liben-Nowell, D., Balakrishnan, H., and D. Karger, 952 "Observations on the Dynamic Evolution of Peer-to-Peer 953 Networks", In Proceedings of the 1st International 954 Workshop on Peer-to-Peer Systems (IPTPS) pp. 22-33, March 955 2002. 957 [maenpaa2009] 958 Maenpaa, J. and G. Camarillo, "A Study on Maintenance 959 Operations in a Chord-Based Peer-to-Peer Session 960 Initiation Protocol Overlay Network", In Proceedings of 961 the 23rd IEEE International Parallel and Distributed 962 Processing Symposium (IPDPS) pp. 1-9, May 2009. 964 [mahajan2003] 965 Mahajan, R., Castro, M., and A. Rowstron, "Controlling the 966 Cost of Reliability in Peer-to-Peer Overlays", In 967 Proceedings of the 2nd International Workshop on Peer-to- 968 Peer Systems (IPTPS) pp. 21-32, February 2003. 970 [rhea2004] 971 Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, 972 "Handling Churn in a DHT", In Proceedings of the USENIX 973 Annual Technical Conference pp. 127-140, June 2004. 975 [weiss1998] 976 Weiss, M., "Data Structures and Algorithm Analysis in 977 C++", Addison-Wesley Longman Publishin Co., Inc. 2nd 978 Edition, ISBN:0201361221, 1998. 980 Authors' Addresses 982 Jouni Maenpaa 983 Ericsson 984 Hirsalantie 11 985 Jorvas 02420 986 Finland 988 Email: Jouni.Maenpaa@ericsson.com 990 Gonzalo Camarillo 991 Ericsson 992 Hirsalantie 11 993 Jorvas 02420 994 Finland 996 Email: Gonzalo.Camarillo@ericsson.com