idnits 2.17.1 draft-mazieres-dinrg-scp-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (June 29, 2018) is 2100 days in the past. Is this intentional? Checking references for intended status: Experimental ---------------------------------------------------------------------------- -- Looks like a reference, but probably isn't: '32' on line 345 -- Obsolete informational reference (is this intentional?): RFC 6962 (Obsoleted by RFC 9162) Summary: 1 error (**), 0 flaws (~~), 1 warning (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group N. Barry 3 Internet-Draft Stellar Development Foundation 4 Intended status: Experimental G. Losa 5 Expires: December 31, 2018 UCLA 6 D. Mazieres 7 Stanford University 8 J. McCaleb 9 Stellar Development Foundation 10 S. Polu 11 Stripe Inc. 12 June 29, 2018 14 The Stellar Consensus Protocol (SCP) 15 draft-mazieres-dinrg-scp-04 17 Abstract 19 SCP is an open Byzantine agreement protocol resistant to Sybil 20 attacks. It allows Internet infrastructure stakeholders to reach 21 agreement on a series of values without unanimous agreement on what 22 constitutes the set of important stakeholders. A big differentiator 23 from other Byzantine agreement protocols is that, in SCP, nodes 24 determine the composition of quorums in a decentralized way: each 25 node selects sets of nodes it considers large or important enough to 26 speak for the whole network, and a quorum must contain such a set for 27 each of its members. 29 Status of This Memo 31 This Internet-Draft is submitted in full conformance with the 32 provisions of BCP 78 and BCP 79. 34 Internet-Drafts are working documents of the Internet Engineering 35 Task Force (IETF). Note that other groups may also distribute 36 working documents as Internet-Drafts. The list of current Internet- 37 Drafts is at https://datatracker.ietf.org/drafts/current/. 39 Internet-Drafts are draft documents valid for a maximum of six months 40 and may be updated, replaced, or obsoleted by other documents at any 41 time. It is inappropriate to use Internet-Drafts as reference 42 material or to cite them other than as "work in progress." 44 This Internet-Draft will expire on December 31, 2018. 46 Copyright Notice 48 Copyright (c) 2018 IETF Trust and the persons identified as the 49 document authors. All rights reserved. 51 This document is subject to BCP 78 and the IETF Trust's Legal 52 Provisions Relating to IETF Documents 53 (https://trustee.ietf.org/license-info) in effect on the date of 54 publication of this document. Please review these documents 55 carefully, as they describe your rights and restrictions with respect 56 to this document. Code Components extracted from this document must 57 include Simplified BSD License text as described in Section 4.e of 58 the Trust Legal Provisions and are provided without warranty as 59 described in the Simplified BSD License. 61 Table of Contents 63 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 2 64 2. The Model . . . . . . . . . . . . . . . . . . . . . . . . . . 3 65 2.1. Configuration . . . . . . . . . . . . . . . . . . . . . . 3 66 2.2. Input and output . . . . . . . . . . . . . . . . . . . . 4 67 3. Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 5 68 3.1. Federated voting . . . . . . . . . . . . . . . . . . . . 5 69 3.2. Basic types . . . . . . . . . . . . . . . . . . . . . . . 7 70 3.3. Quorum slices . . . . . . . . . . . . . . . . . . . . . . 8 71 3.4. Nominate message . . . . . . . . . . . . . . . . . . . . 10 72 3.5. Ballots . . . . . . . . . . . . . . . . . . . . . . . . . 12 73 3.6. Prepare message . . . . . . . . . . . . . . . . . . . . . 13 74 3.7. Commit message . . . . . . . . . . . . . . . . . . . . . 16 75 3.8. Externalize message . . . . . . . . . . . . . . . . . . . 18 76 3.9. Summary of phases . . . . . . . . . . . . . . . . . . . . 18 77 3.10. Message envelopes . . . . . . . . . . . . . . . . . . . . 19 78 4. Security considerations . . . . . . . . . . . . . . . . . . . 20 79 5. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 20 80 6. References . . . . . . . . . . . . . . . . . . . . . . . . . 21 81 6.1. Normative References . . . . . . . . . . . . . . . . . . 21 82 6.2. Informative References . . . . . . . . . . . . . . . . . 21 83 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 22 85 1. Introduction 87 Various aspects of Internet infrastructure depend on irreversible and 88 transparent updates to data sets such as authenticated mappings [cite 89 Li-Man-Watson draft]. Examples include public key certificates and 90 revocations, transparency logs [RFC6962], preload lists for HSTS 91 [RFC6797] and HPKP [RFC7469], and IP address delegation 92 [I-D.paillisse-sidrops-blockchain]. 94 The Stellar Consensus Protocol (SCP) specified in this draft allows 95 Internet infrastructure stakeholders to collaborate in applying 96 irreversible transactions to public state. SCP is an open Byzantine 97 agreement protocol that resists Sybil attacks by allowing individual 98 parties to specify minimum quorum memberships in terms of specific 99 trusted peers. Each participant chooses combinations of peers on 100 which to depend such that these combinations can be trusted in 101 aggregate. The protocol guarantees safety so long as these 102 dependency sets transitively overlap and contain sufficiently many 103 honest nodes correctly obeying the protocol. 105 Though bad configurations are theoretically possible, several 106 analogies provide an intuition for why transitive dependencies 107 overlap in practice. For example, given multiple entirely disjoint 108 Internet-protocol networks, people would have no trouble agreeing on 109 the fact that the network containing the world's top web sites is 110 _the_ Internet. Such a consensus can hold even without unanimous 111 agreement on what constitute the world's top web sites. Similarly, 112 if network operators listed all the ASes from whom they would 113 consider peering or transit worthwhile, the transitive closures of 114 these sets would contain significant overlap, even without unanimous 115 agreement on the "tier-1 ISP" designation. Finally, while different 116 browsers and operating systems have slightly different lists of valid 117 certificate authorities, there is significant overlap in the sets, so 118 that a hypothetical system requiring validation from "all CAs" would 119 be unlikely to diverge. 121 A more detailed abstract description of SCP and its rationale, 122 including an English-language proof of safety, is available in [SCP]. 123 In particular, that reference shows that a necessary property for 124 safety, termed _quorum intersection despite ill-behaved nodes_, is 125 sufficient to guarantee safety under SCP, making SCP optimally safe 126 against Byzantine node failure for any given configuration. 128 This document specifies the end-system logic and wire format of the 129 messages in SCP. 131 2. The Model 133 This section describes the configuration and input/output values of 134 the consensus protocol. 136 2.1. Configuration 138 Each participant or _node_ in the SCP protocol has a digital 139 signature key and is named by the corresponding public key, which we 140 term a "NodeID". 142 Each node also selects one or more sets of nodes (each of which 143 includes itself) called _quorum slices_. A quorum slice represents a 144 large or important enough set of peers that the node selecting the 145 quorum slice believes the slice collectively speaks for the whole 146 network. 148 A _quorum_ is a non-empty set of nodes containing at least one quorum 149 slice of each of its members. For instance, suppose "v1" has the 150 single quorum slice "{v1, v2, v3}", while each of "v2", "v3", and 151 "v4" has the single quorum slice "{v2, v3, v4}". In this case, "{v2, 152 v3, v4}" is a quorum because it contains a slice for each member. On 153 the other hand "{v1, v2, v3}" is not a quorum, because it does not 154 contain a quorum slice for "v2" or "v3". The smallest quorum 155 including "v1" in this example is the set of all nodes "{v1, v2, v3, 156 v4}". 158 Unlike traditional Byzantine agreement protocols, nodes in SCP only 159 care about quorums to which they belong themselves (and hence that 160 contain at least one of their quorum slices). Intuitively, this is 161 what protects nodes from Sybil attacks. In the example above, if 162 "v3" deviates from the protocol, maliciously inventing 96 Sybils "v5, 163 v6, ..., v100", the honest nodes' quorums will all still include one 164 another, ensuring that "v1", "v2", and "v4" continue to agree on 165 output values. 167 Every message in the SCP protocol specifies the sender's quorum 168 slices. Hence, by collecting messages, a node dynamically learns 169 what constitutes a quorum and can decide when a particular message 170 has been sent by a quorum to which it belongs. (Again, nodes do not 171 care about quorums to which they do not belong themselves.) 173 2.2. Input and output 175 SCP produces a series of output _values_ for consecutively numbered 176 _slots_. At the start of a slot, higher-layer software on each node 177 supplies a candidate input value. Nodes then exchange protocol 178 messages to agree on one or a combination of nodes' input values as 179 the slot's output value. After a pause to assemble new input values, 180 the process repeats for the next slot, with a 5-second interval 181 between slots. 183 A value typically encodes a set of actions to apply to a replicated 184 state machine. During the pause between slots, nodes accumulate the 185 next set of actions, amortizing the cost of consensus on one slot 186 over arbitrarily many individual state machine operations. 188 In practice, only one or a small number of nodes' input values 189 actually affect the output value for any given slot. As discussed in 190 Section 3.4, which nodes' input values to use depends on a 191 cryptographic hash of the slot number and node public keys. A node's 192 chances of affecting the output value depend on how often it appears 193 in other nodes' quorum slices. 195 From SCP's perspective, values are just opaque byte arrays whose 196 interpretation is left to higher-layer software. However, SCP 197 requires a _validity_ function (to check whether a value is valid) 198 and a _combining function_ that reduces multiple candidate values 199 into a single _composite_ value. When nodes nominate multiple values 200 for a slot, SCP nodes invoke this function to converge on a single 201 composite value. By way of example, in an application where values 202 consist of sets of transactions, the combining function could take 203 the union of transaction sets. Alternatively, if values represent a 204 timestamp and a set of transactions, the combining function might 205 pair the highest nominated timestamp with the transaction set that 206 has the highest hash value. 208 3. Protocol 210 The protocol consists of exchanging digitally-signed messages bound 211 to nodes' quorum slices. The format of all messages is specified 212 using XDR [RFC4506]. In addition to quorum slices, messages 213 compactly convey votes on sets of conceptual statements. The core 214 technique of voting with quorum slices is termed _federated voting_. 215 We describe federated voting next, then detail protocol messages in 216 the subsections that follow. 218 The protocol goes through four phases: NOMINATE, PREPARE, COMMIT, and 219 EXTERNALIZE. The NOMINATE and PREPARE phases run concurrently 220 (though NOMINATE's messages are sent earlier and it ends before 221 PREPARE ends). The COMMIT and EXTERNALIZE phrases are exclusive, 222 with COMMIT occurring immediately after PREPARE and EXTERNALIZE 223 immediately after COMMIT. 225 3.1. Federated voting 227 Federated voting is a process through which nodes _confirm_ 228 statements. Not every attempt at federated voting may succeed--an 229 attempt to vote on some statement "a" may get stuck, with the result 230 that nodes can confirm neither "a" nor its negation "!a". However, 231 when a node succeeds in confirming a statement "a", federated voting 232 guarantees two things: 234 1. No two well-behaved nodes will confirm contradictory statements 235 in any configuration and failure scenario in which any protocol 236 can guarantee safety for the two nodes (i.e., quorum intersection 237 for the two nodes holds despite ill-behaved nodes). 239 2. If a node that is guaranteed safety by #1 confirms a statement 240 "a", and that node is a member of one or more quorums consisting 241 entirely of well-behaved nodes, then eventually every member of 242 every such quorum will also confirm "a". 244 Intuitively, these conditions are key to ensuring agreement among 245 nodes as well as a weak form of liveness (the non-blocking property 246 [building-blocks]) that is compatible with the FLP impossibility 247 result [FLP]. 249 As a node "v" collects signed copies of a federated voting message 250 "m" from peers, two thresholds trigger state transitions in "v" 251 depending on the message. We define these thresholds as follows: 253 o _quorum threshold_: When every member of a quorum to which "v" 254 belongs (including "v" itself) has issued message "m" 256 o _blocking threshold_: When at least one member of each of "v"'s 257 quorum slices (a set that does not necessarily include "v" itself) 258 has issued message "m" 260 Each node "v" can send several types of message with respect to a 261 statement "a" during federated voting: 263 o _vote_ "a" states that "a" is a valid statement and constitutes a 264 promise by "v" not to vote for any contradictory statement, such 265 as "!a". 267 o _accept_ "a" says that nodes may or may not come to agree on "a", 268 but if they don't, then the system has experienced a catastrophic 269 set of Byzantine failures to the point that no quorum containing 270 "v" consists entirely of correct nodes. (Nonetheless, accepting 271 "a" is not sufficient to act on it, as doing so could violate 272 agreement, which is worse than merely getting stuck from lack of a 273 correct quorum.) 275 o _vote-or-accept_ "a" is the disjunction of the above two messages. 276 A node implicitly sends such a message if it sends either _vote_ 277 "a" or _accept_ "a". Where it is inconvenient and unnecessary to 278 differentiate between _vote_ and _accept_, a node can explicitly 279 send a _vote-or-accept_ message. 281 o _confirm_ "a" indicates that _accept_ "a" has reached quorum 282 threshold at the sender. This message is interpreted the same as 283 _accept_ "a", but allows recipients to optimize their quorum 284 checks by ignoring the sender's quorum slices, as the sender 285 asserts it has already checked them. 287 Figure 1 illustrates the federated voting process. A node "v" votes 288 for a valid statement "a" that doesn't contradict statements in past 289 _vote_ or _accept_ messages sent by "v". When the _vote_ message 290 reaches quorum threshold, the node accepts "a". In fact, "v" accepts 291 "a" if the _vote-or-accept_ message reaches quorum threshold, as some 292 nodes may accept "a" without first voting for it. Specifically, a 293 node that cannot vote for "a" because it has voted for "a"'s negation 294 "!a" still accepts "a" when the message _accept_ "a" reaches blocking 295 threshold (meaning assertions about "!a" have no hope of reaching 296 quorum threshold barring catastrophic Byzantine failure). 298 If and when the message _accept_ "a" reaches quorum threshold, then 299 "v" has confirmed "a" and the federated vote has succeeded. In 300 effect, the _accept_ messages constitute a second vote on the fact 301 that the initial vote messages succeeded. Once "v" enters the 302 confirmed state, it may issue a _confirm_ "a" message to help other 303 nodes confirm "a" more efficiently by pruning their quorum search at 304 "v". 306 "vote-or-accept a" "accept a" 307 reaches reaches 308 quorum threshold quorum threshold 309 +-----------------+ +-----------------+ 310 | | | | 311 | V | V 312 +-----------+ +-----------+ +-----------+ 313 a is +---->| voted a | |accepted a | |confirmed a| 314 valid | +-----------+ +-----------+ +-----------+ 315 | | ^ 316 +-----------+ | | "accept a" reaches 317 |uncommitted|------+-----------------+ blocking threshold 318 +-----------+ | 319 | | 320 | +-----------+ 321 +---->| voted !a | 322 +-----------+ 324 Figure 1: Federated voting process 326 3.2. Basic types 328 SCP employs 32- and 64-bit integers, as defined below. 330 typedef unsigned int uint32; 331 typedef int int32; 332 typedef unsigned hyper uint64; 333 typedef hyper int64; 335 SCP uses the SHA-256 cryptograhpic hash function [RFC6234], and 336 represents hash values as a simple array of 32 bytes. 338 typedef opaque Hash[32]; 340 SCP employs the Ed25519 digital signature algorithm [RFC8032]. For 341 cryptographic agility, however, public keys are represented as a 342 union type that can later be compatibly extended with other key 343 types. 345 typedef opaque uint256[32]; 347 enum PublicKeyType 348 { 349 PUBLIC_KEY_TYPE_ED25519 = 0 350 }; 352 union PublicKey switch (PublicKeyType type) 353 { 354 case PUBLIC_KEY_TYPE_ED25519: 355 uint256 ed25519; 356 }; 358 // variable size as the size depends on the signature scheme used 359 typedef opaque Signature<64>; 361 Nodes are public keys, while values are simply opaque arrays of 362 bytes. 364 typedef PublicKey NodeID; 366 typedef opaque Value<>; 368 3.3. Quorum slices 370 Theoretically a quorum slice can be an arbitrary set of nodes. 371 However, arbitrary predicates on sets cannot be encoded concisely. 372 Instead we specify quorum slices as any set of k-of-n members, where 373 each of the n members can either be an individual node ID, or, 374 recursively, another k-of-n set. 376 // supports things like: A,B,C,(D,E,F),(G,H,(I,J,K,L)) 377 // only allows 2 levels of nesting 378 struct SCPSlices 379 { 380 uint32 threshold; // the k in k-of-n 381 PublicKey validators<>; 382 SCPSlices1 innerSets<>; 383 }; 384 struct SCPSlices1 385 { 386 uint32 threshold; // the k in k-of-n 387 PublicKey validators<>; 388 SCPSlices2 innerSets<>; 389 }; 390 struct SCPSlices2 391 { 392 uint32 threshold; // the k in k-of-n 393 PublicKey validators<>; 394 }; 396 Let "k" be the value of "threshold" and "n" the sum of the sizes of 397 the "validators" and "innerSets" vectors in a message sent by some 398 node "v". A message "m" sent by "v" reaches quorum threshold at "v" 399 when three things hold: 401 1. "v" itself has issued (digitally signed) the message, 403 2. The number of nodes in "validators" who have signed "m" plus the 404 number "innerSets" that (recursively) meet this condition is at 405 least "k", and 407 3. These three conditions apply (recursively) at some combination of 408 nodes sufficient for condition #2. 410 A message reaches blocking threshold at "v" when the number of 411 "validators" making the statement plus (recursively) the number 412 "innerSets" reaching blocking threshold exceeds "n-k". (Blocking 413 threshold depends only on the local node's quorum slices and hence 414 does not require a recursive check on other nodes like step #3 415 above.) 417 As described in Section 3.10, every protocol message is paired with a 418 cryptographic hash of the sender's "SCPSlices" and digitally signed. 419 Inner protocol messages described in the next few sections should be 420 understood to be received alongside such a quorum slice specification 421 and digital signature. 423 3.4. Nominate message 425 For each slot, the SCP protocol begins in a NOMINATE phase, whose 426 goal is to devise one or more candidate output values for the 427 consensus protocol. In this phase, nodes send nomination messages 428 comprising a monotonically growing set of values: 430 struct SCPNominate 431 { 432 Value voted<>; // X 433 Value accepted<>; // Y 434 }; 436 The "voted" and "accepted" sets are disjoint; any value that is 437 eligible for both sets is placed only in the "accepted" set. 439 "voted" consists of candidate values that the sender has voted to 440 nominate. Each node progresses through a series of nomination 441 _rounds_ in which it may increase the set of values in its own 442 "voted" field by adding the contents of the "voted" and "accepted" 443 fields of "SCPNominate" messages received from a growing set of 444 peers. In round "n" of slot "i", each node determines an additional 445 peer whose nominated values it should incorporate in its own 446 "SCPNominate" message as follows: 448 o Let "Gi(m) = SHA-256(i || m)", where "||" denotes the 449 concatenation of serialized XDR values. Treat the output of "Gi" 450 as a 256-bit binary number in big-endian format. 452 o For each peer "v", define "weight(v)" as the fraction of quorum 453 slices containing "v". 455 o Define the set of nodes "neighbors(n)" as the set of nodes v for 456 which "Gi(1 || n || v) < 2^{256} * weight(v)", where "1" and "n" 457 are both 32-bit XDR "int" values. 459 o Define "priority(n, v)" as "Gi(2 || n || v)", where "2" and "n" 460 are both 32-bit XDR "int" values. 462 For each round "n" until nomination has finished (see below), a node 463 starts _echoing_ the available peer "v" with the highest value of 464 "priority(n, v)" from among the nodes in "neighbors(n)". To echo 465 "v", the node merges any valid values from "v"'s "voted" and 466 "accepted" sets into its own "voted" set. 468 XXX - expand "voted" with only the 10 values with lowest Gi hash in 469 any given round to avoid blowing out the message size? 470 Note that when echoing nominations, nodes must exclude and neither 471 vote for nor accept values rejected by the higher-layer application's 472 validity function. This validity function must not depend on state 473 that can permanently differ across nodes. By way of example, it is 474 okay to reject values that are syntactically ill-formed, that are 475 semantically incompatible with the previous slot's value, that 476 contain invalid digital signatures, that contain timestamps in the 477 future, or that specify upgrades to unknown versions of the protocol. 478 By contrast, the application cannot reject values that are 479 incompatible with the results of a DNS query or some dynamically 480 retrieved TLS certificate, as different nodes could see different 481 results when doing such queries. 483 Nodes must not send an "SCPNominate" message until at least one of 484 the "voted" or "accepted" fields is non-empty. When these fields are 485 both empty, a node that has the highest priority among its neighbors 486 in the current round (and hence should be echoing its own votes) adds 487 the higher-layer software's input value to its "voted" field. Nodes 488 that do not have the highest priority wait to hear "SCPNominate" 489 messages from the nodes whose nominations they are echoing. 491 If a particular valid value "x" reaches quorum threshold in the 492 messages sent by peers (meaning that every node in a quorum contains 493 "x" either in the "voted" or the "accepted" field), then the node at 494 which this happens moves "x" from its "voted" field to its "accepted" 495 field and broadcasts a new "SCPNominate" message. Similarly, if "x" 496 reaches blocking threshold in a node's peers' "accepted" field 497 (meaning every one of a node's quorum slices contains at least one 498 node with "x" in its "accepted" field), then the node adds "x" to its 499 own "accepted" field (removing it from "voted" if applicable). These 500 two cases correspond to the two conditions for entering the 501 "accepted" state in Figure 1. 503 A node stops adding any new values to its "voted" set as soon as any 504 value "x" reaches quorum threshold in the "accepted" fields of 505 received "SCPNominate" messages. Following the terminology of 506 Section 3.1, this condition corresponds to when the node confirms "x" 507 as nominated. Note, however, that the node continues adding new 508 values to "accepted" as appropriate. Doing so may lead to more 509 values becoming confirmed nominated even after the "voted" set is 510 closed to new values. 512 A node always begins nomination in round "1". Round "n" lasts for 513 "1+n" seconds, after which, if no value has been confirmed nominated, 514 the node proceeds to round "n+1". A node continues to echo votes 515 from the highest priority neighbor in prior rounds as well as the 516 current round. In particular, until any value is confirmed 517 nominated, a node continues expanding its "voted" field with values 518 nominated by highest priority neighbors from prior rounds even when 519 the values appeared after the end of those prior rounds. 521 As defined in the next two sections, the NOMINATE phase ends when a 522 node has confirmed "prepare(b)" for some any ballot "b", as this is 523 the point at which the nomination outcome no longer influences the 524 protocol. Until this point, a node must continue to transmit 525 "SCPNominate" messages as well as to expand its "accepted" set (even 526 if "voted" is closed because some value has been confirmed 527 nominated). 529 3.5. Ballots 531 Once there is a candidate on which to try to reach consensus, a node 532 moves through three phases of balloting: PREPARE, COMMIT, and 533 EXTERNALIZE. Balloting employs federated voting to chose between 534 _commit_ and _abort_ statements for ballots. A ballot is a pair 535 consisting of a counter and candidate value: 537 // Structure representing ballot 538 struct SCPBallot 539 { 540 uint32 counter; // n 541 Value value; // x 542 }; 544 We use the notation "" to represent a ballot with "counter == 545 n" and "value == x". 547 Ballots are totally ordered with "counter" more significant than 548 "value". Hence, we write "b1 < b2" to mean that either "(b1.counter 549 < b2.counter)" or "(b1.counter == b2.counter && b1.value < 550 b2.value)". Values are compared lexicographically as a strings of 551 unsigned octets. 553 The protocol moves through federated voting on successively higher 554 ballots until nodes confirm "commit(b)" for some ballot "b", at which 555 point consensus terminates and outputs "b.value" for the slot. To 556 ensure that only one value can be chosen for a slot and that the 557 protocol cannot get stuck if individual ballots get stuck, there are 558 two restrictions on voting: 560 1. A node cannot vote for both "commit(b)" and "abort(b)" on the 561 same ballot (the two outcomes are contradictory), and 563 2. A node may not vote for or accept "commit(b)" for any ballot "b" 564 unless it has confirmed "abort" for every lesser ballot with a 565 different value. 567 The second condition requires voting to abort large numbers of 568 ballots before voting to commit a ballot "b". We call this 569 _preparing_ ballot "b", and introduce the following notation for the 570 associated set of abort statements. 572 o "prepare(b)" encodes an "abort" statement for every ballot less 573 than "b" containing a value other than "b.value", i.e., 574 "prepare(b) = { abort(b1) | b1 < b AND b1.value != b.value }". 576 o "vote prepare(b)" stands for a set of _vote_ messages for every 577 "abort" statement in "prepare(b)". 579 o Similarly, "accept prepare(b)", "vote-or-accept prepare(b)", and 580 "confirm prepare(b)" encode sets of _accept_, _vote-or-accept_, 581 and _confirm_ messages for every "abort" statement in 582 "prepare(b)". 584 Using this terminology, a node must confirm "prepare(b)" before 585 issuing a _vote_ or _accept_ message for the statement "commit(b)". 587 3.6. Prepare message 589 The first phase of balloting is the PREPARE phase. During this 590 phase, as soon as a node has a valid candidate value (see the rules 591 for "ballot.value" below), it begins sending the following message: 593 struct SCPPrepare 594 { 595 SCPBallot ballot; // b 596 SCPBallot *prepared; // p 597 SCPBallot *preparedPrime; // p' 598 uint32 hCounter; // h.counter or 0 if h == NULL 599 uint32 cCounter; // c.counter or 0 if !c || !hCounter 600 }; 602 This message compactly conveys the following (conceptual) federated 603 voting messages: 605 o "vote-or-accept prepare(ballot)" 607 o If "prepared != NULL": "accept prepare(prepared)" 609 o If "preparedPrime != NULL": "accept prepare(preparedPrime)" 611 o If "hCounter != 0": "confirm prepare()" 613 o If "cCounter != 0": "vote commit()" for every 614 "cCounter <= n <= hCounter" 616 Note that to be valid, an "SCPPrepare" message must satisfy the 617 following conditions: 619 o If "prepared != NULL", then "prepared <= ballot", 621 o If "preparedPrime != NULL", then "prepared != NULL" and 622 "preparedPrime < prepared", and 624 o "cCounter <= hCounter <= ballot.counter". 626 Based on the federated vote messages received, each node keeps track 627 of what ballots have been accepted and confirmed prepared. It uses 628 these ballots to set the following fields of its own "SCPPrepare" 629 messages as follows. 631 ballot 632 The current ballot that a node is attempting to prepare and 633 commit. The rules for setting each field are detailed below. 634 Note that the "value" is updated when and only when "counter" 635 changes. 637 ballot.counter 638 The counter is set according to the following rules: 640 * Upon entering the PREPARE phase, the "counter" field is 641 initialized to 1. 643 * When a node sees messages from a quorum to which it belongs 644 such that each message's "ballot.counter" is greater than or 645 equal to the local "ballot.counter", the node arms a timer to 646 fire in a number of seconds equal to its "ballot.counter + 1" 647 (so the timeout lengthens linearly as the counter increases). 648 Note that for the purposes of determining whether a quorum has 649 a particular "ballot.counter", a node considers "ballot" fields 650 in "SCPPrepare" and "SCPCommit" messages. It also considers 651 "SCPExternalize" messages to convey an implicit 652 "ballot.counter" of "infinity". 654 * If the timer fires, a node increments the ballot counter by 1. 656 * If nodes forming a blocking threshold all have "ballot.counter" 657 values greater than the local "ballot.counter", then the local 658 node immediately cancels any pending timer, increases 659 "ballot.counter" to the lowest value such that this is no 660 longer the case, and if appropriate according to the rules 661 above arms a new timer. Note that the blocking threshold may 662 include ballots from "SCPCommit" messages as well as 663 "SCPExternalize" messages, which implicitly have an infinite 664 ballot counter. 666 * *Exception*: To avoid exhausting "ballot.counter", its value 667 must always be less then 1,000 plus the number of seconds a 668 node has been running SCP on the current slot. Should any of 669 the above rules require increasing the counter beyond this 670 value, a node either increases "ballot.counter" to the maximum 671 permissible value, or, if it is already at this maximum, waits 672 up to one second before increasing the value. 674 ballot.value 675 Each time the ballot counter is changed, the value is also 676 recomputed as follows: 678 * If any ballot has been confirmed prepared, then "ballot.value" 679 is taken to to be "h.value" for the highest confirmed prepared 680 ballot "h". (Note that once this is the case, the node can 681 stop sending "SCPNominate" messages, as "h.value" supersedes 682 any output of the nomination protocol.) 684 * Otherwise (if no such "h" exists), if one or more values are 685 confirmed nominated, then "ballot.value" is taken as the output 686 of the deterministic combining function applied to all 687 confirmed nominated values. Note that because the NOMINATE and 688 PREPARE phases run concurrently, the set of confirmed nominated 689 values may continue to grow during balloting, changing 690 "ballot.value" even if no ballots are confirmed prepared. 692 * Otherwise, if no ballot is confirmed prepared and no value is 693 confirmed nominated, but the node has accepted a ballot 694 prepared (because "prepare(b)" meets blocking threshold for 695 some ballot "b"), then "ballot.value" is taken as the value of 696 the highest such accepted prepared ballot. 698 * Otherwise, if no value is confirmed nominated and no value is 699 accepted prepared, then a node cannot yet send an "SCPPrepare" 700 message and must continue sending only "SCPNominate" messages. 702 prepared 703 The highest accepted prepared ballot not exceeding the "ballot" 704 field, or NULL if no ballot has been accepted prepared. Recall 705 that ballots with equal counters are totally ordered by the value. 706 Hence, if "ballot = " and the highest prepared ballot is 707 "" where "x < y", then the "prepared" field in sent messages 708 must be set to "" instead of "", as the latter would 709 exceed "ballot". In the event that "n = 1", the prepared field 710 may be set to "<0, y>", meaning 0 is a valid "prepared.counter" 711 even though it is not a valid "ballot.counter". It is possible to 712 confirm "prepare(<0, y>)", in which case the next "ballot.value" 713 is set to "y". However, it is not possible to vote to commit a 714 ballot with counter 0. 716 preparedPrime 717 The highest accepted prepared ballot such that "preparedPrime < 718 prepared" and "preparedPrime.value != prepared.value", or NULL if 719 there is no such ballot. Note that together, "prepared" and 720 "preparedPrime" concisely encode all "abort" statements (below 721 "ballot") that the sender has accepted. 723 hCounter 724 If "h" is the highest confirmed prepared ballot and "h.value == 725 ballot.value", then this field is set to "h.counter". Otherwise, 726 if no ballot is confirmed prepared or if "h.value != 727 ballot.value", then this field is 0. Note that by the rules 728 above, if "h" exists, then "ballot.value" will be set to "h.value" 729 the next time "ballot" is updated. 731 cCounter 732 The value "cCounter" is maintained based on an internally- 733 maintained _commit ballot_ "c", initially "NULL". "cCounter" is 0 734 while "c == NULL" or "hCounter == 0", and is "c.counter" 735 otherwise. "c" is updated as follows: 737 * If either "(prepared > c && prepared.value != c.value)" or 738 "(preparedPrime > c && preparedPrime.value != c.value)", then 739 reset "c = NULL". 741 * If "c == NULL" and "hCounter == ballot.counter" (meaning 742 "ballot" is confirmed prepared), then set "c" to "ballot". 744 A node leaves the PREPARE phase and proceeds to the COMMIT phase when 745 there is some ballot "b" for which the node confirms "prepare(b)" and 746 accepts "commit(b)". (If nodes never changed quorum slice mid- 747 protocol, it would suffice to accept "commit(b)". Also waiting to 748 confirm "prepare(b)" makes it easier to recover from liveness 749 failures by removing Byzantine faulty nodes from quorum slices.) 751 3.7. Commit message 753 In the COMMIT phase, a node has accepted "commit(b)" for some ballot 754 "b", and must confirm that statement to act on the value in 755 "b.counter". A node sends the following message in this phase: 757 struct SCPCommit 758 { 759 SCPBallot ballot; // b 760 uint32 preparedCounter; // prepared.counter 761 uint32 hCounter; // h.counter 762 uint32 cCounter; // c.counter 763 }; 765 The message conveys the following federated vote messages, where 766 "infinity" is 2^{32} (a value greater than any ballot counter 767 representable in serialized form): 769 o "accept commit()" for every "cCounter <= n <= 770 hCounter" 772 o "vote-or-accept prepare()" 774 o "accept prepare()" 776 o "confirm prepare()" 778 o "vote commit()" for every "n >= cCounter" 780 A node computes the fields in the "SCPCommit" messages it sends as 781 follows: 783 ballot 784 This field is maintained identically to how it is maintained in 785 the PREPARE phase, though "ballot.value" can no longer change, 786 only "ballot.counter". Note that the value "ballot.counter" does 787 not figure in any of the federated voting messages. The purpose 788 of continuing to update and send this field is to assist other 789 nodes still in the PREPARE phase in synchronizing their counters. 791 preparedCounter 792 This field is the counter of the highest accepted prepared 793 ballot--maintained identically to the "prepared" field in the 794 PREPARE phase. Since the "value" field will always be the same as 795 "ballot", only the counter is sent in the COMMIT phase. 797 cCounter 798 The counter of the lowest ballot "c" for which the node has 799 accepted "commit(c)". (No value is included in messages since 800 "c.value == ballot.value".) 802 hCounter 803 The counter of the highest ballot "h" for which the node has 804 accepted "commit(h)". (No value is included in messages since 805 "h.value == ballot.value".) 807 As soon as a node confirms "commit(b)" for any ballot "b", it moves 808 to the EXTERNALIZE phase. 810 3.8. Externalize message 812 A node enters the EXTERNALIZE phase when it confirms "commit(b)" for 813 any ballot "b". As soon as this happens, SCP outputs "b.value" as 814 the value of the current slot. In order to help other nodes achieve 815 consensus on the slot more quickly, a node reaching this phase also 816 sends the following message: 818 struct SCPExternalize 819 { 820 SCPBallot commit; // c 821 uint32 hCounter; // h.counter 822 }; 824 An "SCPExternalize" message conveys the following federated voting 825 messages: 827 o "accept commit()" for every "n >= commit.counter" 829 o "confirm commit()" for every "commit.counter <= n 830 <= hCounter" 832 o "confirm prepare()" 834 The fields are set as follows: 836 commit 837 The lowest confirmed committed ballot. 839 hCounter 840 The counter of the highest confirmed committed ballot. 842 3.9. Summary of phases 844 Table 1 summarizes the phases of SCP for each slot. The NOMINATE and 845 PREPARE phases begin concurrently. However, a node initially does 846 not send "SCPPrepare" messages but only listens for ballot messages 847 in case "accept prepare(b)" reaches blocking threshold for some 848 ballot "b". The COMMIT and EXTERNALIZE phases then run in turn after 849 PREPARE ends. A node may externalize (act upon) a value as soon as 850 it enters the EXTERNALIZE phase. 852 The point of "SCPExternalize" messages is to help straggling nodes 853 catch up more quickly. As such, the EXTERNALIZE phase never ends. 854 Rather, a node should archive an "SCPExternalize" message for as long 855 as it retains slot state. 857 +-------------+---------------------------------+-------------------+ 858 | Phase | Begin | End | 859 +-------------+---------------------------------+-------------------+ 860 | NOMINATE | previous slot externalized and | some ballot is | 861 | | 5 seconds have elapsed since | confirmed | 862 | | NOMINATE ended for that slot | prepared | 863 | | | | 864 | PREPARE | begin with NOMINATE, but send | accept | 865 | | "SCPPrepare" only once some | "commit(b)" for | 866 | | value confirmed nominated or | some ballot "b" | 867 | | accept "prepare(b)" for some | | 868 | | ballot b | | 869 | | | | 870 | COMMIT | accept "commit(b)" for some | confirm | 871 | | ballot "b" | "commit(b)" for | 872 | | | some ballot "b" | 873 | | | | 874 | EXTERNALIZE | confirm "commit(b)" for some | slot state | 875 | | ballot "b" | garbage-collected | 876 +-------------+---------------------------------+-------------------+ 878 Table 1: Phases of SCP for a slot 880 3.10. Message envelopes 882 In order to provide full context for each signed message, all signed 883 messages are part of an "SCPStatement" union type that includes the 884 "slotIndex" naming the slot to which the message applies, as well as 885 the "type" of the message. A signed message and its signature are 886 packed together in an "SCPEnvelope" structure. 888 enum SCPStatementType 889 { 890 SCP_ST_PREPARE = 0, 891 SCP_ST_COMMIT = 1, 892 SCP_ST_EXTERNALIZE = 2, 893 SCP_ST_NOMINATE = 3 894 }; 896 struct SCPStatement 897 { 898 NodeID nodeID; // v (node signing message) 899 uint64 slotIndex; // i 900 Hash quorumSetHash; // hash of serialized SCPSlices 902 union switch (SCPStatementType type) 903 { 904 case SCP_ST_PREPARE: 905 SCPPrepare prepare; 906 case SCP_ST_COMMIT: 907 SCPCommit commit; 908 case SCP_ST_EXTERNALIZE: 909 SCPExternalize externalize; 910 case SCP_ST_NOMINATE: 911 SCPNominate nominate; 912 } 913 pledges; 914 }; 916 struct SCPEnvelope 917 { 918 SCPStatement statement; 919 Signature signature; 920 }; 922 4. Security considerations 924 If nodes do not pick quorum slices well, the protocol will not be 925 safe. 927 5. Acknowledgments 929 The Stellar development foundation supported development of the 930 protocol and produced the first production deployment of SCP. The 931 IRTF DIN group including Dirk Kutscher, Sydney Li, Colin Man, Piers 932 Powlesland, Melinda Shore, and Jean-Luc Watson helped with the 933 framing and motivation for this specification. We also thank Bob 934 Glickstein for finding bugs in drafts of this document and offering 935 many useful suggestions. 937 6. References 939 6.1. Normative References 941 [RFC4506] Eisler, M., Ed., "XDR: External Data Representation 942 Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 943 2006, . 945 [RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms 946 (SHA and SHA-based HMAC and HKDF)", RFC 6234, 947 DOI 10.17487/RFC6234, May 2011, 948 . 950 [RFC8032] Josefsson, S. and I. Liusvaara, "Edwards-Curve Digital 951 Signature Algorithm (EdDSA)", RFC 8032, 952 DOI 10.17487/RFC8032, January 2017, 953 . 955 6.2. Informative References 957 [building-blocks] 958 Song, Y., van Renesse, R., Schneider, F., and D. Dolev, 959 "The Building Blocks of Consensus", 9th International 960 Conference on Distributed Computing and Networking pp. 961 54-72, 2008. 963 [FLP] Fischer, M., Lynch, N., and M. Lynch, "Impossibility of 964 Distributed Consensus with One Faulty Process", Journal of 965 the ACM 32(2):374-382, 1985. 967 [I-D.paillisse-sidrops-blockchain] 968 Paillisse, J., Rodriguez-Natal, A., Ermagan, V., Maino, 969 F., Vegoda, L., and A. Cabellos-Aparicio, "An analysis of 970 the applicability of blockchain to secure IP addresses 971 allocation, delegation and bindings.", draft-paillisse- 972 sidrops-blockchain-02 (work in progress), June 2018. 974 [RFC6797] Hodges, J., Jackson, C., and A. Barth, "HTTP Strict 975 Transport Security (HSTS)", RFC 6797, 976 DOI 10.17487/RFC6797, November 2012, 977 . 979 [RFC6962] Laurie, B., Langley, A., and E. Kasper, "Certificate 980 Transparency", RFC 6962, DOI 10.17487/RFC6962, June 2013, 981 . 983 [RFC7469] Evans, C., Palmer, C., and R. Sleevi, "Public Key Pinning 984 Extension for HTTP", RFC 7469, DOI 10.17487/RFC7469, April 985 2015, . 987 [SCP] Mazieres, D., "The Stellar Consensus Protocol: A Federated 988 Model for Internet-level Consensus", Stellar Development 989 Foundation whitepaper , 2015, 990 . 993 Authors' Addresses 995 Nicolas Barry 996 Stellar Development Foundation 997 170 Capp St., Suite A 998 San Francisco, CA 94110 999 US 1001 Email: nicolas@stellar.org 1003 Giuliano Losa 1004 UCLA 1005 3753 Keystone Avenue #10 1006 Los Angeles, CA 90034 1007 US 1009 Email: giuliano@cs.ucla.edu 1011 David Mazieres 1012 Stanford University 1013 353 Serra Mall, Room 290 1014 Stanford, CA 94305 1015 US 1017 Email: dm@uun.org 1019 Jed McCaleb 1020 Stellar Development Foundation 1021 170 Capp St., Suite A 1022 San Francisco, CA 94110 1023 US 1025 Email: jed@stellar.org 1026 Stanislas Polu 1027 Stripe Inc. 1028 185 Berry Street, Suite 550 1029 San Francisco, CA 94107 1030 US 1032 Email: stan@stripe.com