idnits 2.17.1 draft-mazieres-dinrg-scp-03.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (June 13, 2018) is 2143 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '32' on line 344 == Outdated reference: A later version (-02) exists of draft-paillisse-sidrops-blockchain-01 -- Obsolete informational reference (is this intentional?): RFC 6962 (Obsoleted by RFC 9162) Summary: 1 error (**), 0 flaws (~~), 2 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group N. Barry 3 Internet-Draft Stellar Development Foundation 4 Intended status: Experimental G. Losa 5 Expires: December 15, 2018 UCLA 6 D. Mazieres 7 Stanford University 8 J. McCaleb 9 Stellar Development Foundation 10 S. Polu 11 Stripe Inc. 12 June 13, 2018 14 The Stellar Consensus Protocol (SCP) 15 draft-mazieres-dinrg-scp-03 17 Abstract 19 SCP is an open Byzantine agreement protocol resistant to Sybil 20 attacks. It allows Internet infrastructure stakeholders to reach 21 agreement on a series of values without unanimous agreement on what 22 constitutes the set of important stakeholders. A big differentiator 23 from other Byzantine agreement protocols is that, in SCP, nodes 24 determine the composition of quorums in a decentralized way: each 25 node selects sets of nodes it considers large or important enough to 26 speak for the whole network, and a quorum must contain such a set for 27 each of its members. 29 Status of This Memo 31 This Internet-Draft is submitted in full conformance with the 32 provisions of BCP 78 and BCP 79. 34 Internet-Drafts are working documents of the Internet Engineering 35 Task Force (IETF). Note that other groups may also distribute 36 working documents as Internet-Drafts. The list of current Internet- 37 Drafts is at https://datatracker.ietf.org/drafts/current/. 39 Internet-Drafts are draft documents valid for a maximum of six months 40 and may be updated, replaced, or obsoleted by other documents at any 41 time. It is inappropriate to use Internet-Drafts as reference 42 material or to cite them other than as "work in progress." 44 This Internet-Draft will expire on December 15, 2018. 46 Copyright Notice 48 Copyright (c) 2018 IETF Trust and the persons identified as the 49 document authors. All rights reserved. 51 This document is subject to BCP 78 and the IETF Trust's Legal 52 Provisions Relating to IETF Documents 53 (https://trustee.ietf.org/license-info) in effect on the date of 54 publication of this document. Please review these documents 55 carefully, as they describe your rights and restrictions with respect 56 to this document. Code Components extracted from this document must 57 include Simplified BSD License text as described in Section 4.e of 58 the Trust Legal Provisions and are provided without warranty as 59 described in the Simplified BSD License. 61 Table of Contents 63 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 64 2. The Model . . . . . . . . . . . . . . . . . . . . . . . . . . 3 65 2.1. Configuration . . . . . . . . . . . . . . . . . . . . . . 3 66 2.2. Input and output . . . . . . . . . . . . . . . . . . . . 4 67 3. Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 5 68 3.1. Federated voting . . . . . . . . . . . . . . . . . . . . 5 69 3.2. Basic types . . . . . . . . . . . . . . . . . . . . . . . 7 70 3.3. Quorum slices . . . . . . . . . . . . . . . . . . . . . . 8 71 3.4. Nomination . . . . . . . . . . . . . . . . . . . . . . . 10 72 3.5. Ballots . . . . . . . . . . . . . . . . . . . . . . . . . 12 73 3.6. Prepare messages . . . . . . . . . . . . . . . . . . . . 13 74 3.7. Commit messages . . . . . . . . . . . . . . . . . . . . . 16 75 3.8. Externalize messages . . . . . . . . . . . . . . . . . . 18 76 3.9. Summary of phases . . . . . . . . . . . . . . . . . . . . 18 77 3.10. Message envelopes . . . . . . . . . . . . . . . . . . . . 19 78 4. Security considerations . . . . . . . . . . . . . . . . . . . 20 79 5. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 20 80 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 21 81 6.1. Normative References . . . . . . . . . . . . . . . . . . 21 82 6.2. Informative References . . . . . . . . . . . . . . . . . 21 83 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 22 85 1. Introduction 87 Various aspects of Internet infrastructure depend on irreversible and 88 transparent updates to data sets such as authenticated mappings [cite 89 Li-Man-Watson draft]. Examples include public key certificates and 90 revocations, transparency logs [RFC6962], preload lists for HSTS 91 [RFC6797] and HPKP [RFC7469], and IP address delegation 92 [I-D.paillisse-sidrops-blockchain]. 94 The Stellar Consensus Protocol (SCP) specified in this draft allows 95 Internet infrastructure stakeholders to collaborate in applying 96 irreversible transactions to public state. SCP is an open Byzantine 97 agreement protocol that resists Sybil attacks by allowing individual 98 parties to specify minimum quorum memberships in terms of specific 99 trusted peers. Each participant chooses combinations of peers on 100 which to depend such that these combinations can be trusted in 101 aggregate. The protocol guarantees safety so long as these 102 dependency sets transitively overlap and contain sufficiently many 103 honest nodes correctly obeying the protocol. 105 Though bad configurations are theoretically possible, several 106 analogies provide an intuition for why transitive dependencies 107 overlap in practice. For example, given multiple entirely disjoint 108 Internet-protocol networks, people would have no trouble agreeing on 109 the fact that the network containing the world's top web sites is 110 _the_ Internet. Such a consensus can hold even without unanimous 111 agreement on what constitute the world's top web sites. Similarly, 112 if network operators listed all the ASes from whom they would 113 consider peering or transit worthwhile, the transitive closures of 114 these sets would contain significant overlap, even without unanimous 115 agreement on the "tier-1 ISP" designation. Finally, while different 116 browsers and operating systems have slightly different lists of valid 117 certificate authorities, there is significant overlap in the sets, so 118 that a hypothetical system requiring validation from "all CAs" would 119 be unlikely to diverge. 121 A more detailed abstract description of SCP and its rationale, 122 including an English-language proof of safety, is available in [SCP]. 123 In particular, that reference shows that a necessary property for 124 safety, termed _quorum intersection despite ill-behaved nodes_, is 125 sufficient to guarantee safety under SCP, making SCP optimally safe 126 against Byzantine node failure for any given configuration. 128 This document specifies the end-system logic and wire format of the 129 messages in SCP. 131 2. The Model 133 This section describes the configuration and input/output values of 134 the consensus protocol. 136 2.1. Configuration 138 Each participant or _node_ in the SCP protocol has a digital 139 signature key and is named by the corresponding public key, which we 140 term a "NodeID". 142 Each node also selects one or more sets of nodes (each of which 143 includes itself) called _quorum slices_. A quorum slice represents a 144 large or important enough set of peers that the node selecting the 145 quorum slice believes the slice collectively speaks for the whole 146 network. 148 A _quorum_ is a non-empty set of nodes containing at least one quorum 149 slice of each of its members. For instance, suppose "v1" has the 150 single quorum slice "{v1, v2, v3}", while each of "v2", "v3", and 151 "v4" has the single quorum slice "{v2, v3, v4}". In this case, "{v2, 152 v3, v4}" is a quorum because it contains a slice for each member. On 153 the other hand "{v1, v2, v3}" is not a quorum, because it does not 154 contain a quorum slice for "v2" or "v3". The smallest quorum 155 including "v1" in this example is the set of all nodes "{v1, v2, v3, 156 v4}". 158 Unlike traditional Byzantine agreement protocols, nodes in SCP only 159 care about quorums to which they belong themselves (and hence that 160 contain at least one of their quorum slices). Intuitively, this is 161 what protects nodes from Sybil attacks. In the example above, if 162 "v3" deviates from the protocol, maliciously inventing 96 Sybils "v5, 163 v6, ..., v100", the honest nodes' quorums will all still include one 164 another, ensuring that "v1", "v2", and "v4" continue to agree on 165 output values. 167 Every message in the SCP protocol specifies the sender's quorum 168 slices. Hence, by collecting messages, a node dynamically learns 169 what constitutes a quorum and can decide when a particular message 170 has been sent by a quorum to which it belongs. (Again, nodes do not 171 care about quorums to which they do not belong themselves.) 173 2.2. Input and output 175 SCP produces a series of output _values_ for consecutively numbered 176 _slots_. At the start of a slot, higher-layer software on each node 177 supplies a candidate input value. SCP's goal is to ensure that non- 178 faulty nodes agree on one or a combination of nodes' input values for 179 the slot's output. 5 seconds after completing one slot, the protocol 180 runs again for the next slot. 182 A value typically encodes a set of actions to apply to a replicated 183 state machine. During the pause between slots, nodes accumulate the 184 next set of actions, thus amortizing the cost of consensus over 185 arbitrarily many individual state machine operations. 187 In practice, only one or a small number of nodes' input values 188 actually affect the output value for any given slot. As discussed in 189 Section 3.4, which nodes' input values to use depends on a 190 cryptographic hash of the slot number, output history, and node 191 public keys. A node's chances of affecting the output value depend 192 on how often it appears in other nodes' quorum slices. 194 From SCP's perspective, values are just opaque byte arrays whose 195 interpretation is left to higher-layer software. However, SCP 196 requires a _validity_ function (to check whether a value is valid) 197 and a _combining function_ that reduces multiple candidate values 198 into a single _composite_ value. When nodes nominate multiple values 199 for a slot, SCP nodes invoke this function to converge on a single 200 composite value. By way of example, in an application where values 201 consist of sets of transactions, the combining function could take 202 the union of transaction sets. Alternatively, if values represent a 203 timestamp and a set of transactions, the combining function might 204 pair the highest nominated timestamp with the transaction set that 205 has the highest hash value. 207 3. Protocol 209 The protocol consists of exchanging digitally-signed messages bound 210 to nodes' quorum slices. The format of all messages is specified 211 using XDR [RFC4506]. In addition to quorum slices, messages 212 compactly convey votes on sets of conceptual statements. The core 213 technique of voting with quorum slices is termed _federated voting_. 214 We describe federated voting next, then detail protocol messages in 215 the subsections that follow. 217 The protocol goes through four phases: NOMINATION, PREPARE, COMMIT, 218 and EXTERNALIZE. The NOMINATION and PREPARE phases run concurrently 219 (though NOMINATIONS's messages are sent earlier and it ends before 220 PREPARE ends). The COMMIT and EXTERNALIZE phrases are exclusive, 221 with COMMIT occurring immediately after PREPARE and EXTERNALIZE 222 immediately after COMMIT. 224 3.1. Federated voting 226 Federated voting is a process through which nodes _confirm_ 227 statements. Not every attempt at federated voting may succeed--an 228 attempt to vote on some statement "a" may get stuck, with the result 229 that nodes can confirm neither "a" nor its negation "!a". However, 230 when a node succeeds in confirming a statement "a", federated voting 231 guarantees two things: 233 1. No two well-behaved nodes will confirm contradictory statements 234 in any configuration and failure scenario in which any protocol 235 can guarantee safety for the two nodes (i.e., quorum intersection 236 for the two nodes holds despite ill-behaved nodes). 238 2. If a node that is guaranteed safety by #1 confirms a statement 239 "a", and that node is a member of one or more quorums consisting 240 entirely of well-behaved nodes, then eventually every member of 241 every such quorum will also confirm "a". 243 Intuitively, these conditions are key to ensuring agreement among 244 nodes as well as a weak form of liveness (the non-blocking property 245 [building-blocks]) that is compatible with the FLP impossibility 246 result [FLP]. 248 As a node "v" collects signed copies of a federated voting message 249 "m" from peers, two thresholds trigger state transitions in "v" 250 depending on the message. We define these thresholds as follows: 252 o _quorum threshold_: When every member of a quorum to which "v" 253 belongs (including "v" itself) has issued message "m" 255 o _blocking threshold_: When at least one member of each of "v"'s 256 quorum slices (a set that does not necessarily include "v" itself) 257 has issued message "m" 259 Each node "v" can send several types of message with respect to a 260 statement "a" during federated voting: 262 o _vote_ "a" states that "a" is a valid statement and constitutes a 263 promise by "v" not to vote for any contradictory statement, such 264 as "!a". 266 o _accept_ "a" says that nodes may or may not come to agree on "a", 267 but if they don't, then the system has experienced a catastrophic 268 set of Byzantine failures to the point that no quorum containing 269 "v" consists entirely of correct nodes. (Nonetheless, accepting 270 "a" is not sufficient to act on it, as doing so could violate 271 agreement, which is worse than merely getting stuck from lack of a 272 correct quorum.) 274 o _vote-or-accept_ "a" is the disjunction of the above two messages. 275 A node implicitly sends such a message if it sends either _vote_ 276 "a" or _accept_ "a". Where it is inconvenient and unnecessary to 277 differentiate between _vote_ and _accept_, a node can explicitly 278 send a _vote-or-accept_ message. 280 o _confirm_ "a" indicates that _accept_ "a" has reached quorum 281 threshold at the sender. This message is interpreted the same as 282 _accept_ "a", but allows recipients to optimize their quorum 283 checks by ignoring the sender's quorum slices, as the sender 284 asserts it has already checked them. 286 Figure 1 illustrates the federated voting process. A node "v" votes 287 for a valid statement "a" that doesn't contradict statements in past 288 _vote_ or _accept_ messages sent by "v". When the _vote_ message 289 reaches quorum threshold, the node accepts "a". In fact, "v" accepts 290 "a" if the _vote-or-accept_ message reaches quorum threshold, as some 291 nodes may accept "a" without first voting for it. Specifically, a 292 node that cannot vote for "a" because it has voted for "a"'s negation 293 "!a" still accepts "a" when the message _accept_ "a" reaches blocking 294 threshold (meaning assertions about "!a" have no hope of reaching 295 quorum threshold barring catastrophic Byzantine failure). 297 If and when the message _accept_ "a" reaches quorum threshold, then 298 "v" has confirmed "a" and the federated vote has succeeded. In 299 effect, the _accept_ messages constitute a second vote on the fact 300 that the initial vote messages succeeded. Once "v" enters the 301 confirmed state, it may issue a _confirm_ "a" message to help other 302 nodes confirm "a" more efficiently by pruning their quorum search at 303 "v". 305 "vote-or-accept a" "accept a" 306 reaches reaches 307 quorum threshold quorum threshold 308 +-----------------+ +-----------------+ 309 | | | | 310 | V | V 311 +-----------+ +-----------+ +-----------+ 312 a is +---->| voted a | |accepted a | |confirmed a| 313 valid | +-----------+ +-----------+ +-----------+ 314 | | ^ 315 +-----------+ | | "accept a" reaches 316 |uncommitted|------+-----------------+ blocking threshold 317 +-----------+ | 318 | | 319 | +-----------+ 320 +---->| voted !a | 321 +-----------+ 323 Figure 1: Federated voting process 325 3.2. Basic types 327 SCP employs 32- and 64-bit integers, as defined below. 329 typedef unsigned int uint32; 330 typedef int int32; 331 typedef unsigned hyper uint64; 332 typedef hyper int64; 334 SCP uses the SHA-256 cryptograhpic hash function [RFC6234], and 335 represents hash values as a simple array of 32 bytes. 337 typedef opaque Hash[32]; 339 SCP employs the Ed25519 digital signature algorithm [RFC8032]. For 340 cryptographic agility, however, public keys are represented as a 341 union type that can later be compatibly extended with other key 342 types. 344 typedef opaque uint256[32]; 346 enum PublicKeyType 347 { 348 PUBLIC_KEY_TYPE_ED25519 = 0 349 }; 351 union PublicKey switch (PublicKeyType type) 352 { 353 case PUBLIC_KEY_TYPE_ED25519: 354 uint256 ed25519; 355 }; 357 // variable size as the size depends on the signature scheme used 358 typedef opaque Signature<64>; 360 Nodes are public keys, while values are simply opaque arrays of 361 bytes. 363 typedef PublicKey NodeID; 365 typedef opaque Value<>; 367 3.3. Quorum slices 369 Theoretically a quorum slice can be an arbitrary set of nodes. 370 However, arbitrary predicates on sets cannot be encoded concisely. 371 Instead we specify quorum slices as any set of k-of-n members, where 372 each of the n members can either be an individual node ID, or, 373 recursively, another k-of-n set. 375 // supports things like: A,B,C,(D,E,F),(G,H,(I,J,K,L)) 376 // only allows 2 levels of nesting 377 struct SCPQuorumSet 378 { 379 uint32 threshold; // the k in k-of-n 380 PublicKey validators<>; 381 SCPQuorumSet1 innerSets<>; 382 }; 383 struct SCPQuorumSet1 384 { 385 uint32 threshold; // the k in k-of-n 386 PublicKey validators<>; 387 SCPQuorumSet2 innerSets<>; 388 }; 389 struct SCPQuorumSet2 390 { 391 uint32 threshold; // the k in k-of-n 392 PublicKey validators<>; 393 }; 395 Let "k" be the value of "threshold" and "n" the sum of the sizes of 396 the "validators" and "innerSets" vectors in a message sent by some 397 node "v". A message "m" sent by "v" reaches quorum threshold at "v" 398 when three things hold: 400 1. "v" itself has issued (digitally signed) the message, 402 2. The number of nodes in "validators" who have signed "m" plus the 403 number "innerSets" that (recursively) meet this condition is at 404 least "k", and 406 3. These three conditions apply (recursively) at some combination of 407 nodes sufficient for condition #2. 409 A message reaches blocking threshold at "v" when the number of 410 "validators" making the statement plus (recursively) the number 411 "innerSets" reaching blocking threshold exceeds "n-k". (Blocking 412 threshold depends only on the local node's quorum slices and hence 413 does not require a recursive check on other nodes like step #3 414 above.) 416 As described in Section 3.10, every protocol message is paired with a 417 cryptographic hash of the sender's "SCPQuorumSet" and digitally 418 signed. Inner protocol messages described in the next few sections 419 should be understood to be received alongside such a quorum slice 420 specification and digital signature. 422 3.4. Nomination 424 For each slot, the SCP protocol begins in a NOMINATION phase whose 425 goal is to devise one or more candidate output values for the 426 consensus protocol. In this phase, nodes send nomination messages 427 comprising a monotonically growing set of values: 429 struct SCPNomination 430 { 431 Value voted<>; // X 432 Value accepted<>; // Y 433 }; 435 The "voted" and "accepted" sets are disjoint; any value that is 436 eligible for both sets is placed only in the "accepted" set. 438 "voted" consists of candidate values that the sender has voted to 439 nominate. Each node progresses through a series of nomination 440 _rounds_ in which it may increase the set of values in its own 441 "voted" field by adding the contents of the "voted" and "accepted" 442 fields of "SCPNomination" messages received from a growing set of 443 peers. In round "n" of slot "i", each node determines an additional 444 peer whose nominated values it should incorporate in its own 445 "SCPNomination" message as follows: 447 o Let "Gi(m) = SHA-256(i || m)", where "||" denotes the 448 concatenation of serialized XDR values. Treat the output of "Gi" 449 as a 256-bit binary number in big-endian format. 451 o For each peer "v", define "weight(v)" as the fraction of quorum 452 slices containing "v". 454 o Define the set of nodes "neighbors(n)" as the set of nodes v for 455 which "Gi(1 || n || v) < 2^{256} * weight(v)", where "1" and "n" 456 are both 32-bit XDR "int" values. 458 o Define "priority(n, v)" as "Gi(2 || n || v)", where "2" and "n" 459 are both 32-bit XDR "int" values. 461 For each round "n" until nomination has finished (see below), a node 462 starts _echoing_ the available peer "v" with the highest value of 463 "priority(n, v)" from among the nodes in "neighbors(n)". To echo 464 "v", the node merges any valid values from "v"'s "voted" and 465 "accepted" sets into its own "voted" set. 467 XXX - expand "voted" with only the 10 values with lowest Gi hash in 468 any given round to avoid blowing out the message size? 469 Note that when echoing nominations, nodes must exclude and neither 470 vote for nor accept values rejected by the higher-layer application's 471 validity function. This validity function must not depend on state 472 that can permanently differ across nodes. By way of example, it is 473 okay to reject values that are syntactically ill-formed, that are 474 semantically incompatible with the previous slot's value, that 475 contain invalid digital signatures, that contain timestamps in the 476 future, or that specify upgrades to unknown versions of the protocol. 477 By contrast, the application cannot reject values that are 478 incompatible with the results of a DNS query or some dynamically 479 retrieved TLS certificate, as different nodes could see different 480 results when doing such queries. 482 Nodes must not send an "SCPNomination" message until at least one of 483 the "voted" or "accepted" fields is non-empty. When these fields are 484 both empty, a node that has the highest priority among its neighbors 485 in the current round (and hence should be echoing its own votes) adds 486 the higher-layer software's input value to its "voted" field. Nodes 487 that do not have the highest priority wait to hear "SCPNomination" 488 messages from the nodes whose nominations they are echoing. 490 If a particular valid value "x" reaches quorum threshold in the 491 messages sent by peers (meaning that every node in a quorum contains 492 "x" either in the "voted" or the "accepted" field), then the node at 493 which this happens moves "x" from its "voted" field to its "accepted" 494 field and broadcasts a new "SCPNomination" message. Similarly, if 495 "x" reaches blocking threshold in a node's peers' "accepted" field 496 (meaning every one of a node's quorum slices contains at least one 497 node with "x" in its "accepted" field), then the node adds "x" to its 498 own "accepted" field (removing it from "voted" if applicable). These 499 two cases correspond to the two conditions for entering the 500 "accepted" state in Figure 1. 502 A node stops adding any new values to its "voted" set as soon as any 503 value "x" reaches quorum threshold in the "accepted" fields of 504 received "SCPNomination" messages. Following the terminology of 505 Section 3.1, this condition corresponds to when the node confirms "x" 506 as nominated. Note, however, that the node continues adding new 507 values to "accepted" as appropriate. Doing so may lead to more 508 values becoming confirmed nominated even after the "voted" set is 509 closed to new values. 511 A node always begins nomination in round "1". Round "n" lasts for 512 "1+n" seconds, after which, if no value has been confirmed nominated, 513 the node proceeds to round "n+1". A node continues to echo votes 514 from the highest priority neighbor in prior rounds as well as the 515 current round. In particular, until any value is confirmed 516 nominated, a node continues expanding its "voted" field with values 517 nominated by highest priority neighbors from prior rounds even when 518 the values appeared after the end of those prior rounds. 520 As defined in the next two sections, the NOMINATION phase ends when a 521 node has confirmed "prepare(b)" for some any ballot "b", as this is 522 the point at which the nomination outcome no longer influences the 523 protocol. Until this point, a node must continue to transmit 524 "SCPNomination" messages as well as to expand its "accepted" set 525 (even if "voted" is closed because some value has been confirmed 526 nominated). 528 3.5. Ballots 530 Once there is a candidate on which to try to reach consensus, a node 531 moves through three phases of balloting: PREPARE, COMMIT, and 532 EXTERNALIZE. Balloting employs federated voting to chose between 533 _commit_ and _abort_ statements for ballots. A ballot is a pair 534 consisting of a counter and candidate value: 536 // Structure representing ballot 537 struct SCPBallot 538 { 539 uint32 counter; // n 540 Value value; // x 541 }; 543 We use the notation "" to represent a ballot with "counter == 544 n" and "value == x". 546 Ballots are totally ordered with "counter" more significant than 547 "value". Hence, we write "b1 < b2" to mean that either "(b1.counter 548 < b2.counter)" or "(b1.counter == b2.counter && b1.value < 549 b2.value)". Values are compared lexicographically as a strings of 550 unsigned octets. 552 The protocol moves through federated voting on successively higher 553 ballots until nodes confirm "commit b" for some ballot "b", at which 554 point consensus terminates and outputs "b.value" for the slot. To 555 ensure that only one value can be chosen for a slot and that the 556 protocol cannot get stuck if individual ballots get stuck, there are 557 two restrictions on voting: 559 1. A node cannot vote for both "commit b" and "abort b" on the same 560 ballot (the two outcomes are contradictory), and 562 2. A node may not vote for or accept "commit b" for any ballot "b" 563 unless it has confirmed "abort" for every lesser ballot with a 564 different value. 566 The second condition requires voting to abort large numbers of 567 ballots before voting to commit a ballot "b". We call this 568 _preparing_ ballot "b", and introduce the following notation for the 569 associated set of abort statements. 571 o "prepare(b)" encodes an "abort" statement for every ballot less 572 than "b" containing a value other than "b.value", i.e., 573 "prepare(b) = { abort b1 | b1 < b AND b1.value != b.value }". 575 o "vote prepare(b)" stands for a set of _vote_ messages for every 576 "abort" statement in "prepare(b)". 578 o Similarly, "accept prepare(b)", "vote-or-accept prepare(b)", and 579 "confirm prepare(b)" encode sets of _accept_, _vote-or-accept_, 580 and _confirm_ messages for every "abort" statement in 581 "prepare(b)". 583 Using this terminology, a node must confirm "prepare(b)" before 584 issuing a _vote_ or _accept_ message for the statement "commit b". 586 3.6. Prepare messages 588 The first phase of balloting is the PREPARE phase. During this 589 phase, as soon as a node has a valid candidate value (see the rules 590 for "ballot.value" below), it begins sending the following message: 592 struct SCPPrepare 593 { 594 SCPBallot ballot; // b 595 SCPBallot *prepared; // p 596 SCPBallot *preparedPrime; // p' 597 uint32 hCounter; // h.counter or 0 if h == NULL 598 uint32 cCounter; // c.counter or 0 if c == NULL 599 }; 601 This message compactly conveys the following (conceptual) federated 602 voting messages: 604 o "vote-or-accept prepare(ballot)" 606 o If "prepared != NULL": "accept prepare(prepared)" 608 o If "preparedPrime != NULL": "accept prepare(preparedPrime)" 610 o If "hCounter != 0": "confirm prepare()" 612 o If "cCounter != 0": "vote commit " for every 613 "cCounter <= n <= hCounter" 615 Note that to be valid, an "SCPPrepare" message must satisfy the 616 following conditions: 618 o If "prepared != NULL", then "prepared <= ballot", 620 o If "preparedPrime != NULL", then "prepared != NULL" and 621 "preparedPrime < prepared", and 623 o "cCounter <= hCounter <= ballot.counter". 625 Based on the federated vote messages received, each node keeps track 626 of what ballots have been accepted and confirmed prepared. It uses 627 these ballots to set the following fields of its own "SCPPrepare" 628 messages as follows. 630 ballot 631 The current ballot that a node is attempting to prepare and 632 commit. The rules for setting each field are detailed below. 633 Note that the "value" is updated when and only when "counter" 634 changes. 636 ballot.counter 637 The counter is set according to the following rules: 639 * Upon entering the PREPARE phase, the "counter" field is 640 initialized to 1. 642 * When a node sees messages from a quorum to which it belongs 643 such that each message's "ballot.counter" is greater than or 644 equal to the local "ballot.counter", the node arms a timer to 645 fire in a number of seconds equal to its "ballot.counter + 1" 646 (so the timeout lengthens linearly as the counter increases). 647 Note that for the purposes of determining whether a quorum has 648 a particular "ballot.counter", a node considers "ballot" fields 649 in "SCPPrepare" and "SCPCommit" messages. It also considers 650 "SCPExternalize" messages to convey an implicit 651 "ballot.counter" of "infinity". 653 * If the timer fires, a node increments the ballot counter by 1. 655 * If nodes forming a blocking threshold all have "ballot.counter" 656 values greater than the local "ballot.counter", then the local 657 node immediately cancels any pending timer, increases 658 "ballot.counter" to the lowest value such that this is no 659 longer the case, and if appropriate according to the rules 660 above arms a new timer. Note that the blocking threshold may 661 include ballots from "SCPCommit" messages as well as 662 "SCPExternalize" messages, which implicitly have an infinite 663 ballot counter. 665 * *Exception*: To avoid exhausting "ballot.counter", its value 666 must always be less then 1,000 plus the number of seconds a 667 node has been running SCP on the current slot. Should any of 668 the above rules require increasing the counter beyond this 669 value, a node either increases "ballot.counter" to the maximum 670 permissible value, or, if it is already at this maximum, waits 671 up to one second before increasing the value. 673 ballot.value 674 Each time the ballot counter is changed, the value is also 675 recomputed as follows: 677 * If any ballot has been confirmed prepared, then "ballot.value" 678 is taken to to be "h.value" for the highest confirmed prepared 679 ballot "h". (Note that once this is the case, the node can 680 stop sending "SCPNomination" messages, as "h.value" supersedes 681 any output of the nomination protocol.) 683 * Otherwise (if no such "h" exists), if one or more values are 684 confirmed nominated, then "ballot.value" is taken as the output 685 of the deterministic combining function applied to all 686 confirmed nominated values. Note that because the NOMINATION 687 and PREPARE phases run concurrently, the set of confirmed 688 nominated values may continue to grow during balloting, 689 changing "ballot.value" even if no ballots are confirmed 690 prepared. 692 * Otherwise, if no ballot is confirmed prepared and no value is 693 confirmed nominated, but the node has accepted a ballot 694 prepared (because "prepare(b)" meets blocking threshold for 695 some ballot "b"), then "ballot.value" is taken as the value of 696 the highest such accepted prepared ballot. 698 * Otherwise, if no value is confirmed nominated and no value is 699 accepted prepared, then a node cannot yet send an "SCPPrepare" 700 message and must continue sending only "SCPNomination" 701 messages. 703 prepared 704 The highest accepted prepared ballot not exceeding the "ballot" 705 field, or NULL if no ballot has been accepted prepared. Recall 706 that ballots with equal counters are totally ordered by the value. 707 Hence, if "ballot = " and the highest prepared ballot is 708 "" where "x < y", then the "prepared" field in sent messages 709 must be set to "" instead of "", as the latter would 710 exceed "ballot". In the event that "n = 1", the prepared field 711 may be set to "<0, y>", meaning 0 is a valid "prepared.counter" 712 even though it is not a valid "ballot.counter". It is possible to 713 confirm "prepare(<0, y>)", in which case the next "ballot.value" 714 is set to "y". However, it is not possible to vote to commit a 715 ballot with counter 0. 717 preparedPrime 718 The highest accepted prepared ballot such that "preparedPrime < 719 prepared" and "preparedPrime.value != prepared.value", or NULL if 720 there is no such ballot. Note that together, "prepared" and 721 "preparedPrime" concisely encode all "abort" statements (below 722 "ballot") that the sender has accepted. 724 hCounter 725 If "h" is the highest confirmed prepared ballot and "h.value == 726 ballot.value", then this field is set to "h.counter". Otherwise, 727 if no ballot is confirmed prepared or if "h.value != 728 ballot.value", then this field is 0. Note that by the rules 729 above, if "h" exists, then "ballot.value" will be set to "h.value" 730 the next time "ballot" is updated. 732 cCounter 733 The value "cCounter" is maintained based on an internally- 734 maintained _commit ballot_ "c", initially "NULL". "cCounter" is 0 735 while "c" is "NULL" and is "c.counter" otherwise. "c" is updated 736 as follows: 738 * If either "(prepared > c && prepared.value != c.value)" or 739 "(preparedPrime > c && preparedPrime.value != c.value)", then 740 reset "c = NULL". 742 * If "c == NULL" and "hCounter == ballot.counter" (meaning 743 "ballot" is confirmed prepared), then set "c" to "ballot". 745 A node leaves the PREPARE phase and proceeds to the COMMIT phase when 746 there is some ballot "b" for which the node confirms "prepare(b)" and 747 accepts "commit b". (If nodes never changed quorum slice mid- 748 protocol, it would suffice to accept "commit b". Also waiting to 749 confirm "prepare(b)" makes it easier to recover from liveness 750 failures by removing Byzantine faulty nodes from quorum slices.) 752 3.7. Commit messages 754 In the COMMIT phase, a node has accepted "commit b" for some ballot 755 "b", and must confirm that statement to act on the value in 756 "b.counter". A node sends the following message in this phase: 758 struct SCPCommit 759 { 760 SCPBallot ballot; // b 761 uint32 preparedCounter; // prepared.counter 762 uint32 hCounter; // h.counter 763 uint32 cCounter; // c.counter 764 }; 766 The message conveys the following federated vote messages, where 767 "infinity" is 2^{32} (a value greater than any ballot counter 768 representable in serialized form): 770 o "accept commit " for every "cCounter <= n <= 771 hCounter" 773 o "vote-or-accept prepare()" 775 o "accept prepare()" 777 o "confirm prepare()" 779 o "vote commit " for every "n >= cCounter" 781 A node computes the fields in the "SCPCommit" messages it sends as 782 follows: 784 ballot 785 This field is maintained identically to how it is maintained in 786 the PREPARE phase, though "ballot.value" can no longer change, 787 only "ballot.counter". Note that the value "ballot.counter" does 788 not figure in any of the federated voting messages. The purpose 789 of continuing to update and send this field is to assist other 790 nodes still in the PREPARE phase in synchronizing their counters. 792 preparedCounter 793 This field is the counter of the highest accepted prepared 794 ballot--maintained identically to the "prepared" field in the 795 PREPARE phase. Since the "value" field will always be the same as 796 "ballot", only the counter is sent in the COMMIT phase. 798 cCounter 799 The counter of the lowest ballot "c" for which the node has 800 accepted "commit c". (No value is included in messages since 801 "c.value == ballot.value".) 803 hCounter 804 The counter of the highest ballot "h" for which the node has 805 accepted "commit h". (No value is included in messages since 806 "h.value == ballot.value".) 808 As soon as a node confirms "commit b" for any ballot "b", it moves to 809 the EXTERNALIZE phase. 811 3.8. Externalize messages 813 A node enters the EXTERNALIZE phase when it confirms "commit b" for 814 any ballot "b". As soon as this happens, SCP outputs "b.value" as 815 the value of the current slot. In order to help other nodes achieve 816 consensus on the slot more quickly, a node reaching this phase also 817 sends the following message: 819 struct SCPExternalize 820 { 821 SCPBallot commit; // c 822 uint32 hCounter; // h.counter 823 }; 825 An "SCPExternalize" message conveys the following federated voting 826 messages: 828 o "accept commit " for every "n >= commit.counter" 830 o "confirm commit " for every "commit.counter <= n 831 <= hCounter" 833 o "confirm prepare()" 835 The fields are set as follows: 837 commit 838 The lowest confirmed committed ballot. 840 hCounter 841 The counter of the highest confirmed committed ballot. 843 3.9. Summary of phases 845 Table 1 summarizes the phases of SCP for each slot. The NOMINATION 846 and PREPARE phases begin concurrently. However, a node initially 847 does not send "SCPPrepare" messages but only listens for ballot 848 messages in case "accept prepare(b)" reaches blocking threshold for 849 some ballot "b". The COMMIT and EXTERNALIZE phases then run in turn 850 after PREPARE ends. A node may externalize (act upon) a value as 851 soon as it enters the EXTERNALIZE phase. 853 The point of "SCPExternalize" messages is to help straggling nodes 854 catch up more quickly. As such, the EXTERNALIZE phase never ends. 855 Rather, a node should archive an "SCPExternalize" message for as long 856 as it retains slot state. 858 +-------------+--------------------------------+--------------------+ 859 | Phase | Begin | End | 860 +-------------+--------------------------------+--------------------+ 861 | NOMINATION | previous slot externalized and | some value is | 862 | | 5 seconds have elapsed since | confirmed | 863 | | NOMINATION ended for that slot | nominated or some | 864 | | | ballot is | 865 | | | confirmed prepared | 866 | | | | 867 | PREPARE | begin with NOMINATION, but | accept "commit b" | 868 | | send "SCPPrepare" only once | for some ballot | 869 | | some value confirmed nominated | "b" | 870 | | or accept "prepare(b)" for | | 871 | | some ballot b | | 872 | | | | 873 | COMMIT | accept "commit b" for some | confirm "commit b" | 874 | | ballot "b" | for some ballot | 875 | | | "b" | 876 | | | | 877 | EXTERNALIZE | confirm "commit b" for some | never | 878 | | ballot "b" | | 879 +-------------+--------------------------------+--------------------+ 881 Table 1 883 3.10. Message envelopes 885 In order to provide full context for each signed message, all signed 886 messages are part of an "SCPStatement" union type that includes the 887 "slotIndex" naming the slot to which the message applies, as well as 888 the "type" of the message. A signed message and its signature are 889 packed together in an "SCPEnvelope" structure. 891 enum SCPStatementType 892 { 893 SCP_ST_PREPARE = 0, 894 SCP_ST_COMMIT = 1, 895 SCP_ST_EXTERNALIZE = 2, 896 SCP_ST_NOMINATE = 3 897 }; 899 struct SCPStatement 900 { 901 NodeID nodeID; // v (node signing message) 902 uint64 slotIndex; // i 903 Hash quorumSetHash; // hash of serialized SCPQuorumSet 905 union switch (SCPStatementType type) 906 { 907 case SCP_ST_PREPARE: 908 SCPPrepare prepare; 909 case SCP_ST_COMMIT: 910 SCPCommit commit; 911 case SCP_ST_EXTERNALIZE: 912 SCPExternalize externalize; 913 case SCP_ST_NOMINATE: 914 SCPNomination nominate; 915 } 916 pledges; 917 }; 919 struct SCPEnvelope 920 { 921 SCPStatement statement; 922 Signature signature; 923 }; 925 4. Security considerations 927 If nodes do not pick quorum slices well, the protocol will not be 928 safe. 930 5. Acknowledgments 932 The Stellar development foundation supported development of the 933 protocol and produced the first production deployment of SCP. The 934 IRTF DIN group including Dirk Kutscher, Sydney Li, Colin Man, Melinda 935 Shore, and Jean-Luc Watson helped with the framing and motivation for 936 this specification. We also thank Bob Glickstein for finding bugs in 937 drafts of this document and offering many useful suggestions. 939 6. References 941 6.1. Normative References 943 [RFC4506] Eisler, M., Ed., "XDR: External Data Representation 944 Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 945 2006, . 947 [RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms 948 (SHA and SHA-based HMAC and HKDF)", RFC 6234, 949 DOI 10.17487/RFC6234, May 2011, 950 . 952 [RFC8032] Josefsson, S. and I. Liusvaara, "Edwards-Curve Digital 953 Signature Algorithm (EdDSA)", RFC 8032, 954 DOI 10.17487/RFC8032, January 2017, 955 . 957 6.2. Informative References 959 [building-blocks] 960 Song, Y., van Renesse, R., Schneider, F., and D. Dolev, 961 "The Building Blocks of Consensus", 9th International 962 Conference on Distributed Computing and Networking pp. 963 54-72, 2008. 965 [FLP] Fischer, M., Lynch, N., and M. Lynch, "Impossibility of 966 Distributed Consensus with One Faulty Process", Journal of 967 the ACM 32(2):374-382, 1985. 969 [I-D.paillisse-sidrops-blockchain] 970 Paillisse, J., Rodriguez-Natal, A., Ermagan, V., Maino, 971 F., and A. Cabellos-Aparicio, "An analysis of the 972 applicability of blockchain to secure IP addresses 973 allocation, delegation and bindings.", draft-paillisse- 974 sidrops-blockchain-01 (work in progress), October 2017. 976 [RFC6797] Hodges, J., Jackson, C., and A. Barth, "HTTP Strict 977 Transport Security (HSTS)", RFC 6797, 978 DOI 10.17487/RFC6797, November 2012, 979 . 981 [RFC6962] Laurie, B., Langley, A., and E. Kasper, "Certificate 982 Transparency", RFC 6962, DOI 10.17487/RFC6962, June 2013, 983 . 985 [RFC7469] Evans, C., Palmer, C., and R. Sleevi, "Public Key Pinning 986 Extension for HTTP", RFC 7469, DOI 10.17487/RFC7469, April 987 2015, . 989 [SCP] Mazieres, D., "The Stellar Consensus Protocol: A Federated 990 Model for Internet-level Consensus", Stellar Development 991 Foundation whitepaper , 2015, 992 . 995 Authors' Addresses 997 Nicolas Barry 998 Stellar Development Foundation 999 170 Capp St., Suite A 1000 San Francisco, CA 94110 1001 US 1003 Email: nicolas@stellar.org 1005 Giuliano Losa 1006 UCLA 1007 3753 Keystone Avenue #10 1008 Los Angeles, CA 90034 1009 US 1011 Email: giuliano@cs.ucla.edu 1013 David Mazieres 1014 Stanford University 1015 353 Serra Mall, Room 290 1016 Stanford, CA 94305 1017 US 1019 Email: dm@uun.org 1021 Jed McCaleb 1022 Stellar Development Foundation 1023 170 Capp St., Suite A 1024 San Francisco, CA 94110 1025 US 1027 Email: jed@stellar.org 1028 Stanislas Polu 1029 Stripe Inc. 1030 185 Berry Street, Suite 550 1031 San Francisco, CA 94107 1032 US 1034 Email: stan@stripe.com