idnits 2.17.1 draft-ietf-ppsp-peer-protocol-04.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There are 5 instances of lines with control characters in the document. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (December 6, 2012) is 4157 days in the past. Is this intentional? Checking references for intended status: Informational ---------------------------------------------------------------------------- == Unused Reference: 'RFC4960' is defined on line 2372, but no explicit reference was found in the text == Unused Reference: 'RFC5226' is defined on line 2375, but no explicit reference was found in the text ** Obsolete normative reference: RFC 2459 (Obsoleted by RFC 3280) -- Obsolete informational reference (is this intentional?): RFC 4960 (Obsoleted by RFC 9260) -- Obsolete informational reference (is this intentional?): RFC 5226 (Obsoleted by RFC 8126) -- Obsolete informational reference (is this intentional?): RFC 5389 (Obsoleted by RFC 8489) -- Obsolete informational reference (is this intentional?): RFC 6347 (Obsoleted by RFC 9147) Summary: 2 errors (**), 0 flaws (~~), 3 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 PPSP A. Bakker 3 Internet-Draft R. Petrocco 4 Intended status: Informational V. Grishchenko 5 Expires: June 9, 2013 Technische Universiteit Delft 6 December 6, 2012 8 Peer-to-Peer Streaming Peer Protocol (PPSPP) 9 draft-ietf-ppsp-peer-protocol-04 11 Abstract 13 The Peer-to-Peer Streaming Peer Protocol (PPSPP) is a transport 14 protocol for disseminating the same content to a group of interested 15 parties in a streaming fashion. PPSPP supports streaming of both 16 pre-recorded (on-demand) and live audio/video content. It is based 17 on the peer-to-peer paradigm, where clients consuming the content are 18 put on equal footing with the servers initially providing the 19 content, to create a system where everyone can potentially provide 20 upload bandwidth. It has been designed to provide short time-till- 21 playback for the end user, and to prevent disruption of the streams 22 by malicious peers. PPSPP has also been designed to be flexible and 23 extensible. It can use different mechanisms to optimize peer 24 uploading, prevent freeriding, and work with different peer discovery 25 schemes (centralized trackers or Distributed Hash Tables). It 26 supports multiple methods for content integrity protection and chunk 27 addressing. Designed as a generic protocol that can run on top of 28 various transport protocols, it currently runs on top of UDP using 29 LEDBAT for congestion control. 31 Status of this Memo 33 This Internet-Draft is submitted in full conformance with the 34 provisions of BCP 78 and BCP 79. 36 Internet-Drafts are working documents of the Internet Engineering 37 Task Force (IETF). Note that other groups may also distribute 38 working documents as Internet-Drafts. The list of current Internet- 39 Drafts is at http://datatracker.ietf.org/drafts/current/. 41 Internet-Drafts are draft documents valid for a maximum of six months 42 and may be updated, replaced, or obsoleted by other documents at any 43 time. It is inappropriate to use Internet-Drafts as reference 44 material or to cite them other than as "work in progress." 46 This Internet-Draft will expire on June 9, 2013. 48 Copyright Notice 49 Copyright (c) 2012 IETF Trust and the persons identified as the 50 document authors. All rights reserved. 52 This document is subject to BCP 78 and the IETF Trust's Legal 53 Provisions Relating to IETF Documents 54 (http://trustee.ietf.org/license-info) in effect on the date of 55 publication of this document. Please review these documents 56 carefully, as they describe your rights and restrictions with respect 57 to this document. Code Components extracted from this document must 58 include Simplified BSD License text as described in Section 4.e of 59 the Trust Legal Provisions and are provided without warranty as 60 described in the Simplified BSD License. 62 Table of Contents 64 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 65 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 5 66 1.2. Requirements Language . . . . . . . . . . . . . . . . . . 6 67 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 6 68 2. Overall Operation . . . . . . . . . . . . . . . . . . . . . . 8 69 2.1. Joining a Swarm . . . . . . . . . . . . . . . . . . . . . 8 70 2.2. Exchanging Chunks . . . . . . . . . . . . . . . . . . . . 9 71 2.3. Leaving a Swarm . . . . . . . . . . . . . . . . . . . . . 9 72 3. Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 73 3.1. HANDSHAKE . . . . . . . . . . . . . . . . . . . . . . . . 10 74 3.2. HAVE . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 75 3.3. DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 76 3.4. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 77 3.5. INTEGRITY . . . . . . . . . . . . . . . . . . . . . . . . 11 78 3.6. SIGNED_INTEGRITY . . . . . . . . . . . . . . . . . . . . . 11 79 3.7. REQUEST . . . . . . . . . . . . . . . . . . . . . . . . . 11 80 3.8. CANCEL . . . . . . . . . . . . . . . . . . . . . . . . . . 12 81 3.9. CHOKE and UNCHOKE . . . . . . . . . . . . . . . . . . . . 12 82 3.10. Peer Address Exchange and NAT Hole Punching . . . . . . . 13 83 3.10.1. PEX_REQ and PEX_RES Messages . . . . . . . . . . . . 13 84 3.10.2. Hole Punching via PPSPP Messages . . . . . . . . . . 13 85 3.11. Keep Alive Signalling . . . . . . . . . . . . . . . . . . 13 86 3.12. Storage Independence . . . . . . . . . . . . . . . . . . . 14 87 4. Chunk Addressing Schemes . . . . . . . . . . . . . . . . . . . 14 88 4.1. Start-End Ranges . . . . . . . . . . . . . . . . . . . . . 14 89 4.1.1. Chunk Ranges . . . . . . . . . . . . . . . . . . . . 14 90 4.1.2. Byte Ranges . . . . . . . . . . . . . . . . . . . . . 14 91 4.2. Bin Numbers . . . . . . . . . . . . . . . . . . . . . . . 14 92 4.3. In Messages . . . . . . . . . . . . . . . . . . . . . . . 16 93 4.3.1. In HAVE Messages . . . . . . . . . . . . . . . . . . 16 94 4.3.2. In ACK Messages . . . . . . . . . . . . . . . . . . . 16 95 4.4. Compatibility . . . . . . . . . . . . . . . . . . . . . . 16 97 5. Content Integrity Protection . . . . . . . . . . . . . . . . . 17 98 5.1. Merkle Hash Tree Scheme . . . . . . . . . . . . . . . . . 18 99 5.2. Content Integrity Verification . . . . . . . . . . . . . . 19 100 5.3. The Atomic Datagram Principle . . . . . . . . . . . . . . 20 101 5.4. INTEGRITY Messages . . . . . . . . . . . . . . . . . . . . 21 102 5.5. Discussion and Overhead . . . . . . . . . . . . . . . . . 21 103 5.6. Automatic Detection of Content Size . . . . . . . . . . . 22 104 5.6.1. Peak Hashes . . . . . . . . . . . . . . . . . . . . . 22 105 5.6.2. Procedure . . . . . . . . . . . . . . . . . . . . . . 24 106 6. Live Streaming . . . . . . . . . . . . . . . . . . . . . . . . 25 107 6.1. Content Authentication . . . . . . . . . . . . . . . . . . 25 108 6.1.1. Sign All . . . . . . . . . . . . . . . . . . . . . . 25 109 6.1.2. Unified Merkle Tree . . . . . . . . . . . . . . . . . 26 110 6.2. Forgetting Chunks . . . . . . . . . . . . . . . . . . . . 29 111 7. Protocol Options . . . . . . . . . . . . . . . . . . . . . . . 29 112 7.1. End Option . . . . . . . . . . . . . . . . . . . . . . . . 29 113 7.2. Version . . . . . . . . . . . . . . . . . . . . . . . . . 29 114 7.3. Minimum Version . . . . . . . . . . . . . . . . . . . . . 30 115 7.4. Swarm Identifier . . . . . . . . . . . . . . . . . . . . . 30 116 7.5. Content Integrity Protection Method . . . . . . . . . . . 30 117 7.6. Merkle Tree Hash Function . . . . . . . . . . . . . . . . 31 118 7.7. Live Signature Algorithm . . . . . . . . . . . . . . . . . 31 119 7.8. Chunk Addressing Method . . . . . . . . . . . . . . . . . 32 120 7.9. Live Discard Window . . . . . . . . . . . . . . . . . . . 32 121 7.10. Supported Messages . . . . . . . . . . . . . . . . . . . . 33 122 8. UDP Encapsulation . . . . . . . . . . . . . . . . . . . . . . 33 123 8.1. Chunk Size . . . . . . . . . . . . . . . . . . . . . . . . 33 124 8.2. Datagrams and Messages . . . . . . . . . . . . . . . . . . 34 125 8.3. Channels . . . . . . . . . . . . . . . . . . . . . . . . . 35 126 8.4. HANDSHAKE . . . . . . . . . . . . . . . . . . . . . . . . 35 127 8.5. HAVE . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 128 8.6. DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 129 8.7. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 130 8.8. INTEGRITY . . . . . . . . . . . . . . . . . . . . . . . . 37 131 8.9. SIGNED_INTEGRITY . . . . . . . . . . . . . . . . . . . . . 37 132 8.10. REQUEST . . . . . . . . . . . . . . . . . . . . . . . . . 37 133 8.11. CANCEL . . . . . . . . . . . . . . . . . . . . . . . . . . 37 134 8.12. CHOKE and UNCHOKE . . . . . . . . . . . . . . . . . . . . 38 135 8.13. PEX_REQ, PEX_RESv4, PEX_RESv6 and PEX_REScert . . . . . . 38 136 8.14. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . . 38 137 8.15. Flow and Congestion Control . . . . . . . . . . . . . . . 39 138 9. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . 39 139 9.1. Chunk Picking Algorithms . . . . . . . . . . . . . . . . . 39 140 9.2. Reciprocity Algorithms . . . . . . . . . . . . . . . . . . 39 141 10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 39 142 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 40 143 12. Security Considerations . . . . . . . . . . . . . . . . . . . 40 144 12.1. Security of the Handshake Procedure . . . . . . . . . . . 40 145 12.1.1. Protection against attack 1 . . . . . . . . . . . . . 41 146 12.1.2. Protection against attack 2 . . . . . . . . . . . . . 42 147 12.1.3. Protection against attack 3 . . . . . . . . . . . . . 42 148 12.2. Secure Peer Address Exchange . . . . . . . . . . . . . . . 42 149 12.2.1. Protection against the Amplification Attack . . . . . 43 150 12.2.2. Example: Tracker as Certification Authority . . . . . 43 151 12.2.3. Protection Against Eclipse Attacks . . . . . . . . . 44 152 12.3. Support for Closed Swarms (PPSP.SEC.REQ-1) . . . . . . . . 45 153 12.4. Confidentiality of Streamed Content (PPSP.SEC.REQ-2+3) . . 45 154 12.5. Limit Potential Damage and Resource Exhaustion by Bad 155 or Broken Peers (PPSP.SEC.REQ-4+6) . . . . . . . . . . . . 45 156 12.5.1. HANDSHAKE . . . . . . . . . . . . . . . . . . . . . . 45 157 12.5.2. HAVE . . . . . . . . . . . . . . . . . . . . . . . . 46 158 12.5.3. DATA . . . . . . . . . . . . . . . . . . . . . . . . 46 159 12.5.4. ACK . . . . . . . . . . . . . . . . . . . . . . . . . 46 160 12.5.5. INTEGRITY and SIGNED_INTEGRITY . . . . . . . . . . . 47 161 12.5.6. REQUEST . . . . . . . . . . . . . . . . . . . . . . . 47 162 12.5.7. CANCEL . . . . . . . . . . . . . . . . . . . . . . . 47 163 12.5.8. CHOKE . . . . . . . . . . . . . . . . . . . . . . . . 47 164 12.5.9. UNCHOKE . . . . . . . . . . . . . . . . . . . . . . . 48 165 12.5.10. PEX_RES . . . . . . . . . . . . . . . . . . . . . . . 48 166 12.5.11. Unsolicited Messages in General . . . . . . . . . . . 48 167 12.6. Exclude Bad or Broken Peers (PPSP.SEC.REQ-5) . . . . . . . 48 168 13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 48 169 13.1. Normative References . . . . . . . . . . . . . . . . . . . 48 170 13.2. Informative References . . . . . . . . . . . . . . . . . . 50 171 Appendix A. Revision History . . . . . . . . . . . . . . . . . . 53 172 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 57 174 1. Introduction 176 1.1. Purpose 178 This document describes the Peer-to-Peer Streaming Peer Protocol 179 (PPSPP), designed for disseminating the same content to a group of 180 interested parties in a streaming fashion. PPSPP supports streaming 181 of both pre-recorded (on-demand) and live audio/video content. It is 182 based on the peer-to-peer paradigm where clients consuming the 183 content are put on equal footing with the servers initially providing 184 the content, to create a system where everyone can potentially 185 provide upload bandwidth. 187 PPSPP has been designed to provide short time-till-playback for the 188 end user, and to prevent disruption of the streams by malicious 189 peers. Central in this design is a simple method of identifying 190 content based on self-certification. In particular, content in PPSPP 191 is identified by a single cryptographic hash that is the root hash in 192 a Merkle hash tree calculated recursively from the content 193 [MERKLE][ABMRKL]. This self-certifying hash tree allows every peer 194 to directly detect when a malicious peer tries to distribute fake 195 content. The tree can be used for both static and live content. 196 Moreover, it ensures only a small amount of information is needed to 197 start a download and to verify incoming chunks of content, thus 198 ensuring short start-up times. 200 PPSPP has also been designed to be extensible for different 201 transports and use cases. Hence, PPSPP is a generic protocol which 202 can run directly on top of UDP, TCP, or other protocols. As such, 203 PPSPP defines a common set of messages that make up the protocol, 204 which can have different representations on the wire depending on the 205 lower-level protocol used. When the lower-level transport allows, 206 PPSPP can also use different congestion control algorithms. 208 At present, PPSPP is set to run on top of UDP using LEDBAT for 209 congestion control [I-D.ietf-ledbat-congestion]. Using LEDBAT 210 enables PPSPP to serve the content after playback (seeding) without 211 disrupting the user who may have moved to different tasks that use 212 its network connection. LEDBAT may be replaced with a different 213 algorithm when the work of the IETF working group on RTP Media 214 Congestion Avoidance Techniques (RMCAT) [RMCATCHART] matures. 216 PPSPP is also flexible and extensible in the mechanisms it uses to 217 promote client contribution and prevent freeriding, that is, how to 218 deal with peers that only download content but never upload to 219 others. It also allows different schemes for chunk addressing and 220 content integrity protection, if the defaults are not fit for a 221 particular use case. In addition, it can work with different peer 222 discovery schemes, such as centralized trackers or fast Distributed 223 Hash Tables [JIM11]. Finally, in this default setup, PPSPP maintains 224 only a small amount of state per peer. A reference implementation of 225 PPSPP over UDP is available [SWIFTIMPL]. 227 1.2. Requirements Language 229 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 230 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 231 document are to be interpreted as described in RFC 2119 [RFC2119]. 233 1.3. Terminology 235 message 236 The basic unit of PPSPP communication. A message will have 237 different representations on the wire depending on the transport 238 protocol used. Messages are typically multiplexed into a 239 datagram for transmission. 241 datagram 242 A sequence of messages that is offered as a unit to the 243 underlying transport protocol (UDP, etc.). The datagram is 244 PPSPP's Protocol Data Unit (PDU). 246 content 247 Either a live transmission, a pre-recorded multimedia asset, or a 248 file. 250 chunk 251 The basic unit in which the content is divided. E.g. a block of 252 N kilobyte. 254 chunk ID 255 Unique identifier for a chunk of content (e.g. an integer). Its 256 type depends on the chunk addressing scheme used. 258 chunk specification 259 An expression that denotes one or more chunk IDs. 261 chunk addressing scheme 262 Scheme for identifying chunks and expressing the chunk 263 availability map of a peer in a compact fashion. 265 chunk availability map 266 The set of chunks a peer has successfully downloaded and checked 267 the integrity of. 269 bin 270 A number denoting a specific binary interval of the content 271 (i.e., one or more consecutive chunks) in the bin numbers chunk 272 addressing scheme (see Section 4). 274 content integrity protection scheme 275 Scheme for protecting the integrity of the content while it is 276 being distributed via the peer-to-peer network. I.e. methods for 277 receiving peers to detect whether a requested chunk has been 278 maliciously modified by the sending peer. 280 hash 281 The result of applying a cryptographic hash function, more 282 specifically a modification detection code (MDC) [HAC01], such as 283 SHA-1 [FIPS180-3], to a piece of data. 285 Merkle hash tree 286 A tree of hashes whose base is formed by the hashes of the chunks 287 of content, and its higher nodes are calculated by recursively 288 computing the hash of the concatenation of the two child hashes 289 (see Section 5.1). 291 root hash 292 The root in a Merkle hash tree calculated recursively from the 293 content (see Section 5.1). 295 swarm 296 A group of peers participating in the distribution of the same 297 content. 299 swarm ID 300 Unique identifier for a swarm of peers, in PPSPP a sequence of 301 bytes. When Merkle hash trees are used for content integrity 302 protection, the identifier is the so-called root hash of the 303 content (video-on-demand). For live streaming, the swarm ID is a 304 public key. 306 tracker 307 An entity that records the addresses of peers participating in a 308 swarm, usually for a set of swarms, and makes this membership 309 information available to other peers on request. 311 choking 312 When a peer A is choking peer B it means that A is currently not 313 willing to accept requests for content from B. 315 seeding/leeching 316 When a peer A is seeding it means that A has downloaded a static 317 content asset completely and is now offering it for others to 318 download. If peer A does not yet have all content or is not 319 offering it for download, A is said to be leeching. 321 2. Overall Operation 323 The basic unit of communication in PPSPP is the message. Multiple 324 messages are multiplexed into a single datagram for transmission. A 325 datagram (and hence the messages it contains) will have different 326 representations on the wire depending on the transport protocol used 327 (see Section 8). 329 The overall operation of PPSPP is illustrated in the following 330 examples. The examples assume that UDP is used for transport, the 331 recommended method for content integrity protection (Merkle hash 332 trees) is used, and that a specific policy is used for selecting 333 which chunks to download. 335 2.1. Joining a Swarm 337 Consider a peer A that wants to download a certain content asset. To 338 commence a PPSPP download, peer A must have the swarm ID of the 339 content and a list of one or more tracker contact points (e.g. host+ 340 port). The list of trackers is optional in the presence of a 341 decentralized tracking mechanism. 343 Peer A now registers with the tracker following e.g. the PPSP tracker 344 protocol [I-D.ietf-ppsp-reqs] and receives the IP address and port of 345 peers already in the swarm, say B, C, and D. Peer A now sends a 346 datagram containing a HANDSHAKE message to B, C, and D. This message 347 conveys protocol options and may serve as an end-to-end check that 348 the peers are actually in the correct swarm (in which case it 349 contains the ID of the swarm). 351 Peer B and C respond with datagrams containing a HANDSHAKE message 352 and one or more HAVE messages. A HAVE message conveys (part of) the 353 chunk availability of a peer and thus contains a chunk specification 354 that denotes what chunks of the content peer B, resp. C have. Peer D 355 sends a datagram with a HANDSHAKE and HAVE messages, but also with a 356 CHOKE message. The latter indicates that D is not willing to upload 357 chunks to A at present. 359 2.2. Exchanging Chunks 361 In response to B and C, A sends new datagrams to B and C containing 362 REQUEST messages. A REQUEST message indicates the chunks that a peer 363 wants to download, and thus contains a chunk specification. The 364 REQUEST messages to B and C refer to disjunct sets of chunks. B and 365 C respond with datagrams containing HAVE, DATA and, in this example, 366 INTEGRITY messages. In the Merkle hash tree content protection 367 scheme (see Section 5.1), the INTEGRITY messages contain all 368 cryptographic hashes that peer A needs to verify the integrity of the 369 content chunk sent in the DATA message. Using these hashes peer A 370 verifies that the chunks received from B and C are correct. It also 371 updates the chunk availability of B and C using the information in 372 the received HAVE messages. 374 After processing, A sends a datagram containing HAVE messages for the 375 chunks it just received to all its peers. In the datagram to B and C 376 it includes an ACK message acknowledging the receipt of the chunks, 377 and adds REQUEST messages for new chunks. ACK messages are not used 378 when a reliable transport protocol is used. When e.g. C finds that 379 A obtained a chunk (from B) that C did not yet have, C's next 380 datagram includes a REQUEST for that chunk. 382 Peer D also sends HAVE messages to A when it downloads chunks from 383 other peers. When D is willing to accept REQUESTs from A, D sends a 384 datagram with an UNCHOKE message to inform A. If B or C decide to 385 choke A they sending a CHOKE message and A should then re-request 386 from other peers. B and C may continue to send HAVE, REQUEST, or 387 periodic KEEPALIVE messages such that A keeps sending them HAVE 388 messages. 390 Once peer A has received all content (video-on-demand use case) it 391 stops sending messages to all other peers that have all content 392 (a.k.a. seeders). Peer A MAY also contact the tracker or another 393 source again to obtain more peer addresses. 395 2.3. Leaving a Swarm 397 To leave a swarm in a graceful way, peer A sends a "close-channel" 398 datagram to all its peers and deregisters from the tracker following 399 the (PPSP) tracker protocol. Peers receiving the datagram should 400 remove A from their current peer list. If A crashes ungracefully, 401 peers should remove A from their peer list when they detect it no 402 longer sends messages. 404 3. Messages 406 In general, no error codes or responses are used in the protocol; 407 absence of any response indicates an error. Invalid messages are 408 discarded, and further communication with the peer SHOULD be stopped. 409 The rationale is that it is sufficient to classify peers as either 410 good (i.e., responding with chunks) or bad and only use the good 411 ones. This behavior allows a peer to deal with slow, crashed and 412 (silent) malicious peers. 414 For the sake of simplicity, one swarm of peers deals with one content 415 asset (e.g. file) only. Retrieval of a collections of files can be 416 done either by using multiple swarms or by using an external storage 417 mapping from the linear byte space of a single swarm to different 418 files, transparent to the protocol, as described in Section 3.12. 420 3.1. HANDSHAKE 422 The initiating peer and the addressed peer MUST send a HANDSHAKE 423 message as the first message in the first datagrams they exchange. 424 The payload of the HANDSHAKE message is a sequence of protocol 425 options. Example options are the content integrity protection scheme 426 used and an option to specify the swarm identifier. The latter 427 option MAY be used as an end-to-end check that the peers are actually 428 in the correct swarm. Protocol options are specified in Section 7. 430 After the handshakes are exchanged, the initiator knows that the peer 431 really responds. Hence, the second datagram the initiator sends MAY 432 already contain some heavy payload, e.g. DATA messages. To minimize 433 the number of initialization round-trips, the first two datagrams 434 exchanged MAY also contain some minor payload, e.g. HAVE messages to 435 indicate the current progress of a peer or a REQUEST (see 436 Section 3.7). 438 3.2. HAVE 440 The HAVE message is used to convey which chunks a peer has available 441 for download. The set of chunks it has available may be expressed 442 using different chunk addressing and availability map compression 443 schemes, described in Section 4. HAVE messages can be used both for 444 sending a complete overview of a peer's chunk availability as well as 445 for updates to that set. 447 In particular, whenever a receiving peer has successfully checked the 448 integrity of a chunk or interval of chunks it MUST send a HAVE 449 message to all peers it wants to interact with in the near future. 450 The latter confinement allows the HAVE message to be used as a method 451 of choking. The HAVE message MUST contain the chunk specification of 452 the received chunks. A receiving peer MUST NOT send a HAVE message 453 to peers for which the handshake procedure is still incomplete, see 454 Section 12.1. 456 3.3. DATA 458 The DATA message is used to transfer chunks of content. The DATA 459 message MUST contain the chunk ID of the chunk and chunk itself. A 460 peer MAY send the DATA messages for multiple chunks in the same 461 datagram. The DATA message MAY contain additional information if 462 needed by the specific congestion control mechanism used. At present 463 PPSPP uses LEDBAT [I-D.ietf-ledbat-congestion] for congestion 464 control, which requires the current system time to be sent along with 465 the DATA message, so the current system time MUST be included. 467 3.4. ACK 469 When PPSPP is run over an unreliable transport protocol, an 470 implementation MAY choose to use ACK messages to acknowledge received 471 data. When used, a receiving peer that has successfully checked the 472 integrity of a chunk or interval of chunks C it MUST send an ACK 473 message containing a chunk specification for C. As LEDBAT is used, an 474 ACK message MUST contain the one-way delay, computed from the peer's 475 current system time received in the DATA message. A peer MAY delay 476 sending ACK messages as defined in the LEDBAT specification. 478 3.5. INTEGRITY 480 The INTEGRITY message carries information required by the receiver to 481 verify the integrity of a chunk. Its payload depends on the content 482 integrity protection scheme used. When the recommended method of 483 Merkle hash trees is used, the datagram carrying the DATA message 484 MUST include the cryptographic hashes that are necessary for a 485 receiver to check the integrity of the chunk in the form of INTEGRITY 486 messages. What are the necessary hashes is explained in Section 5.3. 488 3.6. SIGNED_INTEGRITY 490 The SIGNED_INTEGRITY message carries digitally signed information 491 required by the receiver to verify the integrity of a chunk in live 492 streaming. It logically contains a chunk specification and a digital 493 signature. Its exact payload depends on the live content integrity 494 protection scheme used, see Section 6.1. 496 3.7. REQUEST 498 While bulk download protocols normally do explicit requests for 499 certain ranges of data (i.e., use a pull model, for example, 500 BitTorrent [BITTORRENT]), live streaming protocols quite often use a 501 request-less push model to save round trips. PPSPP supports both 502 models of operation. 504 A peer MAY send a REQUEST message that MUST contain the specification 505 of the chunks it wants to download. A peer receiving a REQUEST 506 message MAY send out the requested chunks. When peer Q receives 507 multiple REQUESTs from the same peer P peer Q SHOULD process the 508 REQUESTs sequentially. Multiple REQUEST messages MAY be sent in one 509 datagram, for example, when a peer wants to request several rare 510 chunks at once. 512 When live streaming via a push model, a peer receiving REQUESTs also 513 MAY send some other chunks in case it runs out of requests or for 514 some other reason. In that case the only purpose of REQUEST messages 515 is to provide hints and coordinate peers to avoid unnecessary data 516 retransmission. 518 3.8. CANCEL 520 When downloading on demand or live streaming content, a peer MAY 521 request urgent data from multiple peers to increase the probability 522 of it being delivered on time. In particular, when the specific 523 chunk picking algorithm (see Section 9.1), detects that a request for 524 urgent data might not be served on time, a request for the same data 525 MAY be sent to a different peer. When a peer P decides to request 526 urgent data from a peer Q, peer P SHOULD send a CANCEL message to all 527 the peers to which the data has been previously requested. The 528 CANCEL message contains the specification of the chunks P no longer 529 wants to request. In addition, when peer Q receives a HAVE message 530 for the urgent data from peer P, peer Q MUST also cancel the previous 531 REQUEST(s) from P. In other words, the HAVE message acts as an 532 implicit CANCEL. 534 3.9. CHOKE and UNCHOKE 536 Peer A MAY send a CHOKE message to peer B to signal it will no longer 537 be responding to REQUEST messages from B, for example, because A's 538 upload capacity is exhausted. Peer A MAY send a subsequent UNCHOKE 539 message to signal that it will respond to new REQUESTs from B again 540 (A SHOULD discard old requests). When peer B receives a CHOKE 541 message from A it MUST NOT send new REQUEST messages and SHOULD NOT 542 expect answers to any outstanding ones. The CHOKE and UNCHOKE 543 messages are informational as a peer is not required to respond to 544 REQUESTs, see Section 3.7. 546 3.10. Peer Address Exchange and NAT Hole Punching 548 3.10.1. PEX_REQ and PEX_RES Messages 550 Peer address exchange messages (or PEX messages for short) are common 551 in many peer-to-peer protocols. By exchanging peer addresses in 552 gossip fashion, peers relieve central coordinating entities (the 553 trackers) from unnecessary work. PPSPP optionally features two types 554 of PEX messages: PEX_REQ and PEX_RES. A peer that wants to retrieve 555 some peer addresses MUST send a PEX_REQ message. The receiving peer 556 MAY respond with a PEX_RES message containing the (potentially 557 signed) addresses of several peers. The addresses MUST be of peers 558 it has exchanged messages with in the last 60 seconds to guarantee 559 liveliness. Alternatively, the receiving peer MAY ignore PEX 560 messages if uninterested in obtaining new peers or because of 561 security considerations (rate limiting) or any other reason. The PEX 562 messages can be used to construct a dedicated tracker peer. 564 As peer-address exchange enables a number of attacks it should not be 565 used outside a benign environment unless extra security measures are 566 in place. These security measures, which involve exchanging 567 addresses in cryptographically signed swarm-membership certificates, 568 are described in Section 12.2. 570 3.10.2. Hole Punching via PPSPP Messages 572 PPSPP can be used in combination with STUN [RFC5389]. In addition, 573 the native PEX messages can be used to do simple NAT hole punching 574 [SNP], as follows. When peer A introduces peer B to peer C by 575 sending a PEX_RES message to C, it SHOULD also send a PEX_RES message 576 to B introducing C. These messages SHOULD be within 2 seconds from 577 each other, but MAY not be simultaneous, instead leaving a gap of 578 twice the "typical" RTT, i.e. 300-600 ms. As a result, the peers are 579 supposed to initiate handshakes to each other thus forming a simple 580 NAT hole punching pattern where the introducing peer effectively acts 581 as a STUN server. Note that the PEX_RES message is sent without a 582 prior PEX_REQ in this case. Also note the PEX_RES from A to B is 583 likely to arrive because recent communication between A and B is a 584 prerequisite for A introducing B to C, see previous section. 586 3.11. Keep Alive Signalling 588 A peer MUST send a "keep alive" message periodically to each peer it 589 wants to interact with in the future, but has no other messages to 590 send them at present. PPSPP does not define an explicit message type 591 for "keep alive" messages. In the PPSP-over-UDP encapsulation they 592 are implemented as simple datagrams consisting of a 4-byte channel 593 number only, see Section 8.3 and Section 8.4. 595 3.12. Storage Independence 597 Note PPSPP does not prescribe how chunks are stored. This also 598 allows users of PPSPP to map different files into a single swarm as 599 in BitTorrent's multi-file torrents [BITTORRENT], and more innovative 600 storage solutions when variable-sized chunks are used. 602 4. Chunk Addressing Schemes 604 PPSPP can use different methods of chunk addressing, that is, support 605 different ways of identifying chunks and different ways of expressing 606 the chunk availability map of a peer in a compact fashion. 608 4.1. Start-End Ranges 610 A chunk specification consists of a single (start specification,end 611 specification) pair that identifies a range of chunks (end 612 inclusive). The start and end specifications can use one of multiple 613 addressing schemes. Two schemes are currently defined, chunk ranges 614 and byte ranges. 616 4.1.1. Chunk Ranges 618 The start and end specification are both chunk identifiers. A PPSPP 619 peer MUST support this scheme. 621 4.1.2. Byte Ranges 623 The start and end specification are byte offsets in the content. A 624 PPSPP peer MAY support this scheme. 626 4.2. Bin Numbers 628 PPSPP introduces a novel method of addressing chunks of content 629 called "bin numbers" (or "bins" for short). Bin numbers allow the 630 addressing of a binary interval of data using a single integer. This 631 reduces the amount of state that needs to be recorded per peer and 632 the space needed to denote intervals on the wire, making the protocol 633 light-weight. In general, this numbering system allows PPSPP to work 634 with simpler data structures, e.g. to use arrays instead of binary 635 trees, thus reducing complexity. A PPSPP peer MAY support this 636 scheme. 638 In bin addressing, the smallest binary interval is a single chunk 639 (e.g. a block of bytes which may be of variable size), the largest 640 interval is a complete range of 2**63 chunks. In a novel addition to 641 the classical scheme, these intervals are numbered in a way which 642 lays them out into a vector nicely, which is called bin numbering, as 643 follows. Consider an chunk interval of width W. To derive the bin 644 numbers of the complete interval and the subintervals, a minimal 645 balanced binary tree is built that is at least W chunks wide at the 646 base. The leaves from left-to-right correspond to the chunks 0..W-1 647 in the interval, and have bin number I*2 where I is the index of the 648 chunk (counting beyond W-1 to balance the tree). The bin number of 649 higher level nodes P in the tree is calculated as follows: 651 binP = (binL + binR) / 2 653 where binL is the bin of node P's left-hand child and binR is the bin 654 of node P's right-hand child. Given that each node in the tree 655 represents a subinterval of the original interval, each such 656 subinterval now is addressable by a bin number, a single integer. 657 The bin number tree of an interval of width W=8 looks like this: 659 7 660 / \ 661 / \ 662 / \ 663 / \ 664 3 11 665 / \ / \ 666 / \ / \ 667 / \ / \ 668 1 5 9 13 669 / \ / \ / \ / \ 670 0 2 4 6 8 10 12 14 672 C0 C1 C2 C3 C4 C5 C6 C7 674 The bin number tree of an interval of width W=8 676 Figure 1 678 So bin 7 represents the complete interval, bin 3 represents the 679 interval of chunk 0..3, bin 1 represents the interval of chunks 0 and 680 1, and bin 2 represents chunk C1. The special numbers 0xFFFFFFFF 681 (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit) stands for an empty interval, 682 and 0x7FFF...FFF stands for "everything". 684 When bin numbering is used, the ID of a chunk is its corresponding 685 (leaf) bin number in the tree and the chunk specification in HAVE and 686 ACK messages is equal to a single bin number, as follows. 688 4.3. In Messages 690 4.3.1. In HAVE Messages 692 When a receiving peer has successfully checked the integrity of a 693 chunk or interval of chunks it MUST send a HAVE message to all peers 694 it wants to interact with. The latter allows the HAVE message to be 695 used as a method of choking. The HAVE message MUST contain the chunk 696 specification of the biggest complete interval of all chunks the 697 receiver has received and checked so far that fully includes the 698 interval of chunks just received. So the chunk specification MUST 699 denote at least the interval received, but the receiver is supposed 700 to aggregate and acknowledge bigger intervals, when possible. 702 As a result, every single chunk is acknowledged a logarithmic number 703 of times. That provides some necessary redundancy of acknowledgments 704 and sufficiently compensates for unreliable transport protocols. 706 Implementation note: 708 To record which chunks a peer has in the state that an 709 implementation keeps for each peer, an implementation MAY use the 710 efficient "binmap" data structure, which is a hybrid of a bitmap 711 and a binary tree, discussed in detail in [BINMAP]. 713 4.3.2. In ACK Messages 715 When PPSPP is run over an unreliable transport protocol, an 716 implementation MAY choose to use ACK messages to acknowledge received 717 data. When a receiving peer has successfully checked the integrity 718 of a chunk or interval of chunks C it MUST send a ACK message 719 containing the chunk specification of its biggest, complete, interval 720 covering C to the sending peer (see HAVE). 722 4.4. Compatibility 724 In principle, peers using range addressing and peers using bin 725 numbering can interact, with some limitations. Alternatively, a peer 726 A MAY refuse to interact with a peer B using a different addressing 727 scheme. In that case, A MUST respond to B'S HANDSHAKE message by 728 sending an explicit close (see Section 8.4). PPSPP presently 729 supports only interaction between willing peers when fixed sized 730 chunks are used, as follows: 732 When a bin peer sends a message containing a chunk specification to a 733 byte-range peer it MUST translate its internal bin numbers to byte 734 ranges. When a byte range peer sends a message with a chunk 735 specification message to a bin peer, it MUST round its internal byte 736 ranges to 1 or more bins. For the latter translation, the byte-range 737 peer MUST know the fixed chunk size used (which it should receive 738 along with the swarm identifier). When a range translates to 739 multiple bins, the byte-range peer should send multiple e.g. HAVE 740 messages. Note that the bin peer may not be able to request all 741 content the byte-range peer has if it does not have an integral 742 number of chunks. 744 Aside: Translation from bytes to bins is possible for variable sized 745 chunks only when the byte-range peer has extra information. In 746 particular, it will need to know the individual sizes of the chunks 747 from the start of the content till the byte range it wants to convey 748 to the bin peer. 750 A similar translation MUST be done for translating between bins and 751 chunk ranges. Chunk ranges are directly translatable to bins. 752 Assuming ranges are intervals of a list of chunks numbered 0...N, for 753 a given bin number "bin" and bitwise operations AND and OR: 755 startrange = (bin AND (bin + 1))/2 757 endrange = ((bin OR (bin + 1)) - 1)/2 759 The reverse translation may require a chunk range to be rounded to 760 the largest binary interval it covers, or for a range be translated 761 to a series of bin numbers that should be sent using multiple (e.g. 762 HAVE) messages. 764 Finally, byte-range peers can interact with chunk-range peers, by 765 using the direct translation from chunks into bytes and by rounding 766 byte ranges into chunk ranges. The latter requires the byte-range 767 peer to know the fixed chunk size. 769 5. Content Integrity Protection 771 PPSPP can use different methods for protecting the integrity of the 772 content while it is being distributed via the peer-to-peer network. 773 More specifically, PPSPP can use different methods for receiving 774 peers to detect whether a requested chunk has been maliciously 775 modified by the sending peer. 777 This section describes the recommended method for bad content 778 detection, the Merkle Hash Tree scheme, which SHOULD be implemented 779 for protecting static content. It can also be efficiently used in 780 protecting live streams, as explained below and in Section 6.1. 782 The Merkle hash tree scheme can use different chunk addressing 783 schemes. All it requires is the ability to address a range of 784 chunks. In the following description abstract node IDs are used to 785 identify nodes in the tree. On the wire these are translated to the 786 corresponding range of chunks in the chosen chunk addressing scheme. 787 When bin numbering is used, node IDs correspond directly to bin 788 numbers in the INTEGRITY message, see below. 790 5.1. Merkle Hash Tree Scheme 792 PPSPP uses a method of naming content based on self-certification. 793 In particular, content in PPSPP is identified by a single 794 cryptographic hash that is the root hash in a Merkle hash tree 795 calculated recursively from the content [ABMRKL]. This self- 796 certifying hash tree allows every peer to directly detect when a 797 malicious peer tries to distribute fake content. It also ensures 798 only a small the amount of information is needed to start a download 799 (the root hash and some peer addresses). For live streaming a 800 dynamic tree and a public key are used, see below. 802 The Merkle hash tree of a content asset that is divided into N chunks 803 is constructed as follows. Note the construction does not assume 804 chunks of content to be fixed size. Given a cryptographic hash 805 function, more specifically a modification detection code (MDC) 806 [HAC01] , such as SHA1, the hashes of all the chunks of the content 807 are calculated. Next, a binary tree of sufficient height is created. 808 Sufficient height means that the lowest level in the tree has enough 809 nodes to hold all chunk hashes in the set, as with bin numbering. 810 The figure below shows the tree for a content asset consisting of 7 811 chunks. As before with the content addressing scheme, the leaves of 812 the tree correspond to a chunk and in this case are assigned the hash 813 of that chunk, starting at the left-most leaf. As the base of the 814 tree may be wider than the number of chunks, any remaining leaves in 815 the tree are assigned an empty hash value of all zeros. Finally, the 816 hash values of the higher levels in the tree are calculated, by 817 concatenating the hash values of the two children (again left to 818 right) and computing the hash of that aggregate. If the two children 819 are empty hashes, the parent is an empty all zeros hash as well (to 820 save computation). This process ends in a hash value for the root 821 node, which is called the "root hash". Note the root hash only 822 depends on the content and any modification of the content will 823 result in a different root hash. 825 7 = root hash 826 / \ 827 / \ 828 / \ 829 / \ 830 3* 11 831 / \ / \ 832 / \ / \ 833 / \ / \ 834 1 5 9 13* = uncle hash 835 / \ / \ / \ / \ 836 0 2 4 6 8 10* 12 14 838 C0 C1 C2 C3 C4 C5 C6 E 839 =chunk index ^^ = empty hash 841 The Merkle hash tree of an interval of width W=8 843 Figure 2 845 5.2. Content Integrity Verification 847 Assuming a peer receives the root hash of the content it wants to 848 download from a trusted source, it can check the integrity of any 849 chunk of that content it receives as follows. It first calculates 850 the hash of the chunk it received, for example chunk C4 in the 851 previous figure. Along with this chunk it MUST receive the hashes 852 required to check the integrity of that chunk. In principle, these 853 are the hash of the chunk's sibling (C5) and that of its "uncles". A 854 chunk's uncles are the sibling Y of its parent X, and the uncle of 855 that Y, recursively until the root is reached. For chunk C4 its 856 uncles are nodes 13 and 3, marked with * in the figure. Using this 857 information the peer recalculates the root hash of the tree, and 858 compares it to the root hash it received from the trusted source. If 859 they match the chunk of content has been positively verified to be 860 the requested part of the content. Otherwise, the sending peer 861 either sent the wrong content or the wrong sibling or uncle hashes. 862 For simplicity, the set of sibling and uncles hashes is collectively 863 referred to as the "uncle hashes". 865 In the case of live streaming the tree of chunks grows dynamically 866 and the root hash is undefined or, more precisely, transient, as long 867 as new data is generated by the live source. Section 6.1.2 defines a 868 method for content integrity verification for live streams that works 869 with such a dynamic tree. Although the tree is dynamic, content 870 verification works the same for both live and predefined content, 871 resulting in a unified method for both types of streaming. 873 5.3. The Atomic Datagram Principle 875 As explained above, a datagram consists of a sequence of messages. 876 Ideally, every datagram sent must be independent of other datagrams, 877 so each datagram SHOULD be processed separately and a loss of one 878 datagram MUST NOT disrupt the flow. Thus, as a datagram carries zero 879 or more messages, neither messages nor message interdependencies 880 SHOULD span over multiple datagrams. 882 This principle implies that as any chunk is verified using its uncle 883 hashes the necessary hashes SHOULD be put into the same datagram as 884 the chunk's data. If this is not possible because of a limitation on 885 datagram size, the necessary hashes MUST be sent first in one or more 886 datagrams. As a general rule, if some additional data is still 887 missing to process a message within a datagram, the message SHOULD be 888 dropped. 890 The hashes necessary to verify a chunk are in principle its sibling's 891 hash and all its uncle hashes, but the set of hashes to send can be 892 optimized. Before sending a packet of data to the receiver, the 893 sender inspects the receiver's previous acknowledgments (HAVE or ACK) 894 to derive which hashes the receiver already has for sure. Suppose, 895 the receiver had acknowledged chunks C0 and C1 (first two chunks of 896 the file), then it must already have uncle hashes 5, 11 and so on. 897 That is because those hashes are necessary to check C0 and C1 against 898 the root hash. Then, hashes 3, 7 and so on must be also known as 899 they are calculated in the process of checking the uncle hash chain. 900 Hence, to send chunk C7, the sender needs to include just the hashes 901 for nodes 14 and 9, which let the data be checked against hash 11 902 which is already known to the receiver. 904 The sender MAY optimistically skip hashes which were sent out in 905 previous, still unacknowledged datagrams. It is an optimization 906 trade-off between redundant hash transmission and possibility of 907 collateral data loss in the case some necessary hashes were lost in 908 the network so some delivered data cannot be verified and thus has to 909 be dropped. In either case, the receiver builds the Merkle tree on- 910 demand, incrementally, starting from the root hash, and uses it for 911 data validation. 913 In short, the sender MUST put into the datagram the missing hashes 914 necessary for the receiver to verify the chunk. The receiver MUST 915 remember all the hashes it needs to verify missing chunks that it 916 still wants to download. Note that the latter implies that a 917 hardware-limited receiver MAY forget some hashes if it does not plan 918 to announce possession of these chunks to others (i.e., does not plan 919 to send HAVE messages.) 921 5.4. INTEGRITY Messages 923 Concretely, a peer that wants to send a chunk of content creates a 924 datagram that MUST consist of a list of INTEGRITY messages followed 925 by a DATA message. If the INTEGRITY messages and DATA message cannot 926 be put into a single datagram because of a limitation on datagram 927 size, the INTEGRITY messages MUST be sent first in one or more 928 datagrams. The list of INTEGRITY messages sent MUST contain a 929 INTEGRITY message for each hash the receiver misses for integrity 930 checking. A INTEGRITY message for a hash MUST contain the chunk 931 specification corresponding to the node ID of the hash and the hash 932 data itself. The chunk specification corresponding to a node ID is 933 defined as the range of chunks formed by the leaves of the subtree 934 rooted at the node. For example, node 3 in Figure 2 denotes chunks 935 0,2,4,6, so the chunk specification should denote that interval. The 936 list of INTEGRITY messages MUST be sorted in order of tree height of 937 the nodes, descending. The DATA message MUST contain the chunk 938 specification of the chunk and chunk itself. A peer MAY send the 939 required messages for multiple chunks in the same datagram, depending 940 on the encapsulation. 942 5.5. Discussion and Overhead 944 The current method for protecting content integrity in BitTorrent 945 [BITTORRENT] is not suited for streaming. It involves providing 946 clients with the hashes of the content's chunks before the download 947 commences by means of metadata files (called .torrent files in 948 BitTorrent.) However, when chunks are small as in the current UDP 949 encapsulation of PPSPP this implies having to download a large number 950 of hashes before content download can begin. This, in turn, 951 increases time-till-playback for end users, making this method 952 unsuited for streaming. 954 The overhead of using Merkle hash trees is limited. The size of the 955 hash tree expressed as the total number of nodes depends on the 956 number of chunks the content is divided (and hence the size of 957 chunks) following this formula: 959 nnodes = math.pow(2,math.log(nchunks,2)+1) 961 In principle, the hash values of all these nodes will have to be sent 962 to a peer once for it to verify all chunks. Hence the maximum on- 963 the-wire overhead is hashsize * nnodes. However, the actual number 964 of hashes transmitted can be optimized as described in Section 5.3. 965 To see a peer can verify all chunks whilst receiving not all hashes, 966 consider the example tree in Section 5.1. 968 In case of a simple progressive download, of chunks 0,2,4,6, etc. the 969 sending peer will send the following hashes: 971 +-------+---------------------------------------------+ 972 | Chunk | Node IDs of hashes sent | 973 +-------+---------------------------------------------+ 974 | 0 | 2,5,11 | 975 | 2 | - (receiver already knows all) | 976 | 4 | 6 | 977 | 6 | - | 978 | 8 | 10,13 (hash 3 can be calculated from 0,2,5) | 979 | 10 | - | 980 | 12 | 14 | 981 | 14 | - | 982 | Total | # hashes 7 | 983 +-------+---------------------------------------------+ 985 Table 1: Overhead for the example tree 987 So the number of hashes sent in total (7) is less than the total 988 number of hashes in the tree (16), as a peer does not need to send 989 hashes that are calculated and verified as part of earlier chunks. 991 5.6. Automatic Detection of Content Size 993 In PPSPP, the root hash of a static content asset, such as a video 994 file, along with some peer addresses is sufficient to start a 995 download. In addition, PPSPP can reliably and automatically derive 996 the size of such content from information received from the network 997 when fixed sized chunks are used. As a result, it is not necessary 998 to include the size of the content asset as the metadata of the 999 content, in addition to the root hash. Implementations of PPSPP MAY 1000 use this automatic detection feature. Note this feature is the only 1001 feature of PPSPP that requires that a fixed-sized chunk is used. 1003 5.6.1. Peak Hashes 1005 The ability for a newcomer peer to detect the size of the content 1006 depends heavily on the concept of peak hashes. Peak hashes, in 1007 general, enable two cornerstone features of PPSPP: reliable file size 1008 detection and download/live streaming unification (see Section 6). 1009 The concept of peak hashes depends on the concepts of filled and 1010 incomplete nodes. Recall that when constructing the binary trees for 1011 content verification and addressing the base of the tree may have 1012 more leaves than the number of chunks in the content. In the Merkle 1013 hash tree these leaves were assigned empty all-zero hashes to be able 1014 to calculate the higher level hashes. A filled node is now defined 1015 as a node that corresponds to an interval of leaves that consists 1016 only of hashes of content chunks, not empty hashes. Reversely, an 1017 incomplete (not filled) node corresponds to an interval that contains 1018 also empty hashes, typically an interval that extends past the end of 1019 the file. In the following figure nodes 7, 11, 13 and 14 are 1020 incomplete the rest is filled. 1022 Formally, a peak hash is the hash of a filled node in the Merkle 1023 tree, whose sibling is an incomplete node. Practically, suppose a 1024 file is 7162 bytes long and a chunk is 1 kilobyte. That file fits 1025 into 7 chunks, the tail chunk being 1018 bytes long. The Merkle tree 1026 for that file is shown in Figure 3. Following the definition the 1027 peak hashes of this file are in nodes 3, 9 and 12, denoted with a *. 1028 E denotes an empty hash. 1030 7 1031 / \ 1032 / \ 1033 / \ 1034 / \ 1035 3* 11 1036 / \ / \ 1037 / \ / \ 1038 / \ / \ 1039 1 5 9* 13 1040 / \ / \ / \ / \ 1041 0 2 4 6 8 10 12* 14 1043 C0 C1 C2 C3 C4 C5 C6 E 1044 = 1018 bytes 1046 Peak hashes in a Merkle hash tree. 1048 Figure 3 1050 Peak hashes can be explained by the binary representation of the 1051 number of chunks the file occupies. The binary representation for 7 1052 is 111. Every "1" in binary representation of the file's packet 1053 length corresponds to a peak hash. For this particular file there 1054 are indeed three peaks, nodes 3, 9, 12. The number of peak hashes 1055 for a file is therefore also at most logarithmic with its size. 1057 A peer knowing which nodes contain the peak hashes for the file can 1058 therefore calculate the number of chunks it consists of, and thus get 1059 an estimate of the file size (given all chunks but the last are fixed 1060 size). Which nodes are the peaks can be securely communicated from 1061 one (untrusted) peer A to another B by letting A send the peak hashes 1062 and their node IDs to B. It can be shown that the root hash that B 1063 obtained from a trusted source is sufficient to verify that these are 1064 indeed the right peak hashes, as follows. 1066 Lemma: Peak hashes can be checked against the root hash. 1068 Proof: (a) Any peak hash is always the left sibling. Otherwise, be 1069 it the right sibling, its left neighbor/sibling must also be a filled 1070 node, because of the way chunks are laid out in the leaves, 1071 contradiction. (b) For the rightmost peak hash, its right sibling is 1072 zero. (c) For any peak hash, its right sibling might be calculated 1073 using peak hashes to the left and zeros for empty nodes. (d) Once the 1074 right sibling of the leftmost peak hash is calculated, its parent 1075 might be calculated. (e) Once that parent is calculated, we might 1076 trivially get to the root hash by concatenating the hash with zeros 1077 and hashing it repeatedly. 1079 Informally, the Lemma might be expressed as follows: peak hashes 1080 cover all data, so the remaining hashes are either trivial (zeros) or 1081 might be calculated from peak hashes and zero hashes. 1083 Finally, once peer B has obtained the number of chunks in the content 1084 it can determine the exact file size as follows. Given that all 1085 chunks except the last are fixed size B just needs to know the size 1086 of the last chunk. Knowing the number of chunks B can calculate the 1087 node ID of the last chunk and download it. As always B verifies the 1088 integrity of this chunk against the trusted root hash. As there is 1089 only one chunk of data that leads to a successful verification the 1090 size of this chunk must be correct. B can then determine the exact 1091 file size as 1093 (number of chunks -1) * fixed chunk size + size of last chunk 1095 5.6.2. Procedure 1097 A PPSPP implementation that wants to use automatic size detection 1098 MUST operate as follows. When a peer B sends a DATA message for the 1099 first time to a peer A, B MUST first send all the peak hashes for the 1100 content, unless A has already signalled earlier in the exchange that 1101 it knows the peak hashes by having acknowledged any chunk. If they 1102 are needed, the peak hashes MUST be sent as an extra list of uncle 1103 hashes for the chunk, before the list of actual uncle hashes of the 1104 chunk as described in Section 5.3. The receiver A MUST check the 1105 peak hashes against the root hash to determine the approximate 1106 content size. To obtain the definite content size peer A MUST 1107 download the last chunk of the content from any peer that offers it. 1109 6. Live Streaming 1111 The set of messages defined above can be used for live streaming as 1112 well. In a pull-based model, a live streaming injector can announce 1113 the chunks it generates via HAVE messages, and peers can retrieve 1114 them via REQUEST messages. Areas that need special attention are 1115 content authentication and chunk addressing (to achieve an infinite 1116 stream of chunks). 1118 6.1. Content Authentication 1120 For live streaming, PPSPP supports two methods for a peer to 1121 authenticate the content it receives from another peer, called "Sign 1122 All" and "Unified Merkle Tree". 1124 In the "Sign All" method, the live injector signs each chunk of 1125 content using a private key and peers, upon receiving the chunk, 1126 check the signature using the corresponding public key obtained from 1127 a trusted source. In particular, in PPSP, the swarm ID of the live 1128 stream is that public key. The signature is sent along with the DATA 1129 message containing the relevant chunk using the SIGNED_INTEGRITY 1130 message. 1132 In the "Unified Merkle Tree" method, PPSPP combines the Merkle hash 1133 tree scheme for static content with signatures to unify the video-on- 1134 demand and live streaming case. The use of Merkle hash trees reduces 1135 the number of signing and verification operations per second, that 1136 is, provide signature amortization similar to the approach described 1137 in [SIGMCAST]. 1139 In both methods the swarm ID consists of a public key encoded as in a 1140 DNSSEC DNSKEY resource record without BASE-64 encoding [RFC4034]. In 1141 particular, the swarm ID consists of a 1 byte Algorithm field that 1142 identifies the public key's cryptographic algorithm and determines 1143 the format of the Public Key field that follows. 1145 6.1.1. Sign All 1147 Even with "Sign All", the number of cryptographic operations per 1148 second may be limited. For example, consider a 25 frame/second video 1149 transmitted over UDP. When each frame is transmitted in its own 1150 chunk, only 25 signature verification operations per second are 1151 required, for the receiving peer, for bitrates up to ~12.8 megabit/ 1152 second over UDP. For higher bitrates multiple UDP packets per frame 1153 are needed. 1155 6.1.2. Unified Merkle Tree 1157 In this method, the chunks of content are used as the basis for a 1158 Merkle hash tree as for static content. However, because chunks are 1159 continuously generated, this tree is not static, but dynamic. As a 1160 result, the tree does not have a root hash, or more precisely has a 1161 transient root hash. A public key therefore serves as swarm ID of 1162 the content. It is used to digitally sign updates to the tree, 1163 allowing peers to expand it based on trusted information using the 1164 following procedure. 1166 The live injector creates a number of chunks N, a fixed power of 2 1167 (N>=2), which are added as new leaves to the existing hash tree, 1168 expanding the tree as required. As a result of this expansion, the 1169 tree will have gotten a set of new peak hashes (see Section 5.6.1). 1170 The injector now signs only the peak hashes in this set that are not 1171 in the old set of peak hashes. For N being a power of 2 there will 1172 just be one new peak hash (see below). This complementary signed 1173 peak is distributed to the peers. Receiving peers will verify the 1174 signature on the signed peak against the swarm ID, update their tree 1175 and request the new chunks. 1177 To illustrate this procedure, consider the injector has generated the 1178 tree shown in Figure 4 and it is connected to several peers that 1179 currently have the same tree and all chunks. In this tree the root 1180 node 3 is also the peak node for this tree. Now the injector 1181 generates N=2 new chunks. As a result the tree expands as shown in 1182 Figure 5. The two new pieces 8 and 10 extend the tree on the right 1183 side, and to accommodate them a new root is created, node 7. As this 1184 tree is wider at the base than the actual number of chunks, there are 1185 currently two empty leaves. The peak nodes in this tree are 3 and 9. 1187 3* 1188 / \ 1189 / \ 1190 / \ 1191 1 5 1192 / \ / \ 1193 0 2 4 6 1195 Current live tree 1197 Figure 4 1198 7 1199 / \ 1200 / \ 1201 / \ 1202 / \ 1203 3* 11 1204 / \ / \ 1205 / \ / \ 1206 / \ / \ 1207 1 5 9* 13 1208 / \ / \ / \ / \ 1209 0 2 4 6 8 10 E E 1211 Next current live tree 1213 Figure 5 1215 The injector now needs to inform its peers of the changed tree, in 1216 particular the addition of the new complementary peak hash 9. To 1217 this extent, it sends an INTEGRITY message with the hash of node 9, a 1218 SIGNED_INTEGRITY message with the signature of the hash of node 9 and 1219 a HAVE message for node 9. The receiving peers now expand their view 1220 of the tree. Next, the peers will request e.g. chunk 8 from the 1221 injector by sending a REQUEST message. The injector responds by 1222 sending the requester an INTEGRITY message with the hash of node 10, 1223 and a DATA message with chunk 8. This allows the peer to verify the 1224 chunk against peak hash 9 which is signed by the trusted injector. 1226 The injector MAY send HAVE messages for the chunks it creates 1227 immediately, and allow peers to retrieve them. This optimizes the 1228 use of the injector's bandwidth. Peers MUST NOT forward these chunks 1229 to others until they have received and checked the peak hash 1230 signature and the necessary hashes. 1232 This procedure generates just 1 new peak hash for every N blocks, so 1233 it requires just one signature on each iteration, making it N times 1234 cheaper than "Sign All". To see why just 1 new peak hash is 1235 generated each iteration let's return to the definition of a peak 1236 hash in a tree, from Section 5.6.1. A peak hash is the hash of a 1237 filled node in the Merkle tree, whose sibling is an incomplete node. 1238 Now consider the above procedure where N chunks (with N a power of 2) 1239 are added to a tree at each iteration. In the first iteration, the 1240 tree consists of just N leaves, therefore the only peak is the root 1241 of the tree. In the second iteration, the tree consists of 2N chunks 1242 and the only peak is the root of that bigger tree (depicted in 1243 Figure 4 for N=2). In the third iteration, we have 3N chunks as 1244 leaves and a tree that is 4N wide (to span the 3N chunks) and hence 1245 has N empty leaves (depicted in Figure 5 for N=2). This implies that 1246 the tree has 2 peaks, notably the peak from the previous iteration 1247 (node 3 in the figure) and the top of the subtree of the N chunks 1248 that were added last (node 9 in the figure). Although this iteration 1249 has two peaks, there is only one new peak as the expanded tree 1250 overlaps with the tree from the previous iteration. In the fourth 1251 iteration, we have a complete balanced tree again, and just a single 1252 new peak. It is now easy to see that this process in which previous 1253 peaks are either consumed into a single new peak, or peak sets 1254 overlap with just 1 new addition yields a single new peak per N 1255 chunks. 1257 From this we can conclude that the injector has to sign less hashes 1258 than in the "Sign All" method. A receiving peer therefore also has 1259 to verify less signatures. It does additionally need to check one or 1260 more hashes per chunk via the Merkle Tree scheme, but this requires 1261 less CPU than a signature verification for every chunk. This 1262 approach is similar to signature amortization via Merkle Tree 1263 Chaining [SIGMCAST]. The downside of this amortization of signature 1264 costs over several chunks is that latency will increase. A receiving 1265 peer now has to wait for the signature before delivering the chunks 1266 to the higher layers responsible for playback [POLLIVE], unless some 1267 (optimistic) optimisations are made. It MUST check the signature 1268 before forwarding the chunks to other peers. 1270 The number of chunks per signature N MUST be a fixed power of 2 1271 (N>=2). The procedure does not preclude using variable-sized chunks. 1272 Using a variable number N, however, is not allowed as this breaks the 1273 property of generating just 1 new peak per iteration. 1275 Unification of static content checking and live content checking is 1276 achieved by sending the signed peak hashes on-demand, ahead of the 1277 actual data. As before, the sender SHOULD use acknowledgments to 1278 derive which content range the receiver has peak hashes for, and to 1279 prepend the data hashes with the necessary (signed) peak hashes. 1280 Except for the fact that the set of peak hashes changes with time, 1281 other parts of the protocol work as described in Section 5.1. 1283 This method of integrity verification has an added benefit if the 1284 system includes some peers that saved the complete broadcast. The 1285 benefit is that as soon as the broadcast ends, the content is 1286 available as a video-on-demand download using the now stabilized tree 1287 and the final root hash as swarm identifier. Peers that have saved 1288 all chunks can now announce this root hash to the tracking 1289 infrastructure and instantly seed it. 1291 6.2. Forgetting Chunks 1293 As a live broadcast progresses a peer may want to discard the chunks 1294 that it already played out. Ideally, other peers should be aware of 1295 this fact such that they will not try to request these chunks from 1296 this peer. This could happen in scenarios where live streams may be 1297 paused by viewers, or viewers are allowed to start late in a live 1298 broadcast (e.g., start watching a broadcast at 20:35 whereas it began 1299 at 20:30). 1301 PPSPP provides a simple solution for peers to stay up-to-date with 1302 the chunk availability of a discarding peer. A discarding peer in a 1303 live stream MUST enable the Live Discard Window protocol option, 1304 specifying how many chunks/bytes it caches before the last chunk/byte 1305 it advertised as being available (see Section 7.9). Its peers SHOULD 1306 apply this number as a sliding window filter over the peer's chunk 1307 availability as conveyed via its HAVE messages. 1309 7. Protocol Options 1311 The HANDSHAKE message in PPSPP can contain the following protocol 1312 options (cf. [RFC2132] (DHCP options)). Each element in a protocol 1313 option is 8 bits wide, unless stated otherwise. 1315 7.1. End Option 1317 A peer MUST conclude the list of protocol options with the end 1318 option. Subsequent octets should be considered protocol messages. 1319 The code for the end option is 255, and its length is 1 octet. 1321 +------+ 1322 | Code | 1323 +------+ 1324 | 255 | 1325 +------+ 1327 7.2. Version 1329 A peer MUST include the maximum version of the PPSPP protocol it 1330 supports as the first protocol option in the list. 1332 +------+-------------+ 1333 | Code | Max Version | 1334 +------+-------------+ 1335 | 0 | v | 1336 +------+-------------+ 1338 Currently one value is defined, 1 = protocol as described in this 1339 document. 1341 7.3. Minimum Version 1343 When a peer initiates the handshake it MUST include the minimum 1344 version of the PPSPP protocol it supports in the list of protocol 1345 options, following the Min/max versioning scheme defined in 1346 [RFC6709], Section 4.1. 1348 +------+-------------+ 1349 | Code | Min Version | 1350 +------+-------------+ 1351 | 1 | v | 1352 +------+-------------+ 1354 Currently one value is defined, 1 = protocol as described in this 1355 document. 1357 7.4. Swarm Identifier 1359 To enable end-to-end checking of any peer discovery process a peer 1360 MAY include a swarm identifier option. 1362 +------+-------------+------------------+ 1363 | Code | Length | Swarm Identifier | 1364 +------+-------------+------------------+ 1365 | 2 | n (16 bits) | i1,i2,... | 1366 +------+-------------+------------------+ 1368 Each PPSPP peer knows the IDs of the swarms it joins so this 1369 information can be immediately verified upon receipt. The length 1370 field is 2 octets to allow for large public keys as identifiers in 1371 live streaming. 1373 7.5. Content Integrity Protection Method 1375 A peer MUST include the content integrity method used by a swarm, 1376 unless it uses the default, in which case it MAY include the method. 1378 +------+--------+ 1379 | Code | Method | 1380 +------+--------+ 1381 | 3 | m | 1382 +------+--------+ 1384 Currently three values are defined for the method, 0 = No integrity 1385 protection, 1 = Merkle Hash Trees (for static content, see 1386 Section 5.1), 2 = Sign All, and 3 = Unified Merkle Tree (for live 1387 content, see Section 6.1). 1389 The veracity of this information will come out when the receiver 1390 successfully verifies the first chunk from any peer. 1392 7.6. Merkle Tree Hash Function 1394 When the content integrity protection method is Merkle Hash Trees 1395 this option defining which hash function is used for the tree MUST 1396 also be defined. 1398 +------+-----------+ 1399 | Code | Hash Func | 1400 +------+-----------+ 1401 | 4 | h | 1402 +------+-----------+ 1404 Currently the following values are defined for the hash function, 0 = 1405 SHA1, 1 = SHA-224, 2 = SHA-256, 3 = SHA-384, and 4 = SHA-512 1406 [FIPS180-3]. 1408 The veracity of this information will come out when the receiver 1409 successfully verifies the first chunk from any peer. 1411 7.7. Live Signature Algorithm 1413 When the content integrity protection method is "Sign All" or 1414 "Unified Merkle Tree" this option MUST also be defined. 1416 +------+---------+ 1417 | Code | Sig Alg | 1418 +------+---------+ 1419 | 5 | s | 1420 +------+---------+ 1422 The value of this option is one of the Domain Name System Security 1423 (DNSSEC) Algorithm Numbers [IANADNSSECALGNUM]. If necessary, the key 1424 size that impacts signature length can be derived from the swarm 1425 identifier which is the signing public key in live streaming. 1427 The veracity of this information will come out when the receiver 1428 successfully verifies the first chunk from any peer. 1430 7.8. Chunk Addressing Method 1432 A peer MUST include the chunk addressing method it uses, unless it 1433 uses the default, in which case it MAY include the method. 1435 +------+--------+ 1436 | Code | Scheme | 1437 +------+--------+ 1438 | 6 | a | 1439 +------+--------+ 1441 Currently six values are defined for the chunk addressing scheme, 1442 0=32-bit bins, 1=64-bit byte ranges, and 2=32-bit chunk ranges, 3=64- 1443 bit bins, 4=64-bit chunk ranges. 1445 The veracity of this information will come out when the receiver 1446 parses the first message containing a chunk specification from any 1447 peer. 1449 7.9. Live Discard Window 1451 A peer in a live swarm MUST include the discard window it uses. The 1452 unit of the discard window depends on the chunk addressing method 1453 used. For bins and chunk ranges it is a number of chunks, for byte 1454 ranges it is a number of bytes. Its data type is the same as for a 1455 bin, or one value in a range specification. In other words, a 32-bit 1456 or 64-bit integer in big endian format. This option MUST therefore 1457 appear after the Chunk Addressing option, if present in the list of 1458 protocol options. 1460 +------+-------------------+ 1461 | Code | Scheme | 1462 +------+-------------------+ 1463 | 7 | w (32 or 64-bits) | 1464 +------+-------------------+ 1466 A peer that does not, under normal circumstances, discard chunks MUST 1467 set this option to the special value 0xFFFFFFFF (32-bit) or 1468 0xFFFFFFFFFFFFFFFF (64-bit). For example, peers that record a 1469 complete broadcast to offer it directly as a static asset after the 1470 broadcast ends (see Section 6.1.2). 1472 The veracity of this information does not impact a receiving peer 1473 more than when a sender peer just does not respond to REQUEST 1474 messages. 1476 7.10. Supported Messages 1478 Peers may support just a subset of the PPSPP messages. For example, 1479 peers running over TCP may not accept ACK messages, or peers used 1480 with a centralized tracking infrastructure may not accept PEX 1481 messages. For these reasons, peers who support only a proper subset 1482 of the PPSPP messages MUST signal which subset they support by means 1483 of this protocol option. The value of this option is a 256-bit 1484 bitmap where each bit represents a message type. The bitmap may be 1485 truncated to the last non-zero byte. 1487 +------+--------+----------------+ 1488 | Code | Length | Message Bitmap | 1489 +------+--------+----------------+ 1490 | 8 | n | m1,m2,... | 1491 +------+--------+----------------+ 1493 8. UDP Encapsulation 1495 Currently, PPSPP-over-UDP is the preferred deployment option. 1496 Effectively, UDP allows the use of IP with minimal overhead and it 1497 also allows userspace implementations. LEDBAT is used for congestion 1498 control [I-D.ietf-ledbat-congestion]. Using LEDBAT enables PPSPP to 1499 serve the content after playback (seeding) without disrupting the 1500 user who may have moved to different tasks that use its network 1501 connection. LEDBAT may be replaced with a different algorithm when 1502 the work of the IETF working group on RTP Media Congestion Avoidance 1503 Techniques (RMCAT) [RMCATCHART] matures. 1505 8.1. Chunk Size 1507 In general, an UDP datagram containing PPSPP messages SHOULD fit 1508 inside a single IP packet, so its maximum size depends on the MTU of 1509 the network. The default is to use fixed-sized chunks of 1 kilobyte 1510 such that a UDP datagram with a DATA message can be transmitted as a 1511 single IP packet over an Ethernet network with 1500-byte frames. 1512 PPSPP implementations can use larger chunk sizes. For example, on 1513 CPU-limited hardware 8 kilobyte chunks MAY be used, transported as a 1514 single UDP datagram fragmented over multiple IP packets (with the 1515 increased chance of that UDP datagram getting lost). The chunk 1516 addressing schemes can all work with different chunk sizes, see 1517 Section 4. 1519 The chunk size used for a particular swarm MUST be part of the 1520 swarm's metadata (which then consists of the swarm ID and the chunk 1521 size), unless it is the 1 KB default. 1523 8.2. Datagrams and Messages 1525 When using UDP, the abstract datagram described above corresponds 1526 directly to a UDP datagram. Most messages within a datagram have a 1527 fixed length, which generally depends on the type of the message. 1528 The first byte of a message denotes its type. The currently defined 1529 types are: 1531 o HANDSHAKE = 0x00 1533 o DATA = 0x01 1535 o ACK = 0x02 1537 o HAVE = 0x03 1539 o INTEGRITY = 0x04 1541 o PEX_RESv4 = 0x05 1543 o PEX_REQ = 0x06 1545 o SIGNED_INTEGRITY = 0x07 1547 o REQUEST = 0x08 1549 o CANCEL = 0x09 1551 o CHOKE = 0x0a 1553 o UNCHOKE = 0x0b 1555 o PEX_RESv6 = 0x0c 1557 o PEX_REScert = 0x0d 1559 Furthermore, integers are serialized in the network (big-endian) byte 1560 order. So consider the example of a HAVE message (Section 3.2) using 1561 bin chunk addressing. It has message type of 0x02 and a payload of a 1562 bin number, a four-byte integer (say, 1); hence, its on the wire 1563 representation for UDP can be written in hex as: "0200000001". 1565 All messages are idempotent or recognizable as duplicates. In 1566 particular, a peer MAY resend DATA, ACK, HAVE, INTEGRITY, PEX_*, 1567 SIGNED_INTEGRITY, REQUEST, CANCEL, CHOKE and UNCHOKE messages without 1568 problems when loss is suspected. When a peer resends a HANDSHAKE 1569 message it can be recognized as duplicate by the receiver and be 1570 dealt with. 1572 8.3. Channels 1574 As it is increasingly complex for peers to enable UDP communication 1575 between each other due to NATs and firewalls, PPSPP-over-UDP uses a 1576 multiplexing scheme, called "channels", to allow multiple swarms to 1577 use the same UDP port. Channels loosely correspond to TCP 1578 connections and each channel belongs to a single swarm. To support 1579 channels, each datagram starts with four bytes corresponding to the 1580 receiving channel number. 1582 8.4. HANDSHAKE 1584 A channel is established with a handshake. To start a handshake, the 1585 initiating peer needs to know: 1587 1. the IP address of a peer 1589 2. peer's UDP port and 1591 3. the swarm ID of the content (see Section 5.1 and Section 6). 1593 4. the chunk size used, unless the 1 KB default 1595 To do the handshake the initiating peer sends a datagram that MUST 1596 start with an all 0-zeros channel number, followed by a HANDSHAKE 1597 message, whose payload is a locally unused channel number and a list 1598 of protocol options. 1600 On the wire the datagram will look something like this: 1602 (CHANNEL) 00000000 HANDSHAKE 00000011 v=01 si=123...1234 ca=0 end 1604 (to unknown channel, handshake from channel 0x11 speaking protocol 1605 version 0x01, initiating a transfer of a file with a root hash 1606 123...1234 using bins for chunk addressing) 1608 The receiving peer MAY respond in which case the returned datagram 1609 MUST consist of the channel number from the sender's HANDSHAKE 1610 message, a HANDSHAKE message, whose payload is a locally unused 1611 channel number and a list of protocol options, followed by any other 1612 messages it wants to send. 1614 Peer's response datagram on the wire: 1616 (CHANNEL) 00000011 HANDSHAKE 00000022 v=01 protocol options end 1618 (peer to the initiator: use channel number 0x22 for this transfer and 1619 proto version 0x01.) 1620 At this point, the initiator knows that the peer really responds; for 1621 that purpose channel IDs MUST be random enough to prevent easy 1622 guessing. So, the third datagram of a handshake MAY already contain 1623 some heavy payload. To minimize the number of initialization 1624 roundtrips, the first two datagrams MAY also contain some minor 1625 payload, e.g. a couple of HAVE messages roughly indicating the 1626 current progress of a peer or a REQUEST (see Section 3.7). When 1627 receiving the third datagram, both peers have the proof they really 1628 talk to each other; the three-way handshake is complete. 1630 A peer MAY explicit close a channel by sending a HANDSHAKE message 1631 that MUST contain an all 0-zeros channel number and a list of 1632 protocol options. The list MUST be either empty or contain the 1633 maximum version number the sender supports, following the Min/max 1634 versioning scheme defined in [RFC6709], Section 4.1. 1636 On the wire: 1638 (CHANNEL) 00000022 HANDSHAKE 00000000 end 1640 8.5. HAVE 1642 A HAVE message (type 0x03) consists of a chunk specification that 1643 states that the sending peer has those chunks and successfully 1644 checked their integrity. A bin consists of a single integer, and a 1645 chunk or byte range of two integers, of the width specified by the 1646 Chunk Addressing protocol options, encoded big endian. 1648 A HAVE message for bin 3 on the wire: 1650 HAVE 00000003 1652 (received and checked first four kilobytes of a file/stream) 1654 8.6. DATA 1656 A DATA message (type 0x01) consists of a chunk specification, a 1657 timestamp and the actual chunk. In case a datagram contains one DATA 1658 message, a sender MUST always put the DATA message in the tail of the 1659 datagram. A datagram MAY contain multiple DATA messages unless one 1660 of the chunks is the last chunk and smaller than the chunk size. As 1661 the LEDBAT congestion control is used, a sender MUST include a 1662 timestamp, in particular, a 64-bit integer representing the current 1663 system time with microsecond accuracy. The timestamp MUST be 1664 included between chunk specification and the actual chunk. 1666 A DATA message for bin 0, with timestamp 12345678, and some data on 1667 the wire: 1669 DATA 00000000 12345678 48656c6c6f20776f726c6421 1671 (This message accommodates an entire file: "Hello world!") 1673 8.7. ACK 1675 An ACK message (type 0x02) acknowledges data that was received from 1676 its addressee; to comply with the LEDBAT delay-based congestion 1677 control an ACK message consists of a chunk specification and a 1678 timestamp representing an one-way delay sample. The one-way delay 1679 sample is a 64-bit integer with microsecond accuracy, and is computed 1680 from the timestamp received from the previous DATA message containing 1681 the chunk being acknowledged following the LEDBAT specification. 1683 An ACK message for bin 2 with one-way delay 12345678 on the wire: 1685 ACK 00000002 12345678 1687 8.8. INTEGRITY 1689 An INTEGRITY message (type 0x04) consists of a chunk specification 1690 and the cryptographic hash for the specified chunk or node. The type 1691 and format of the hash depends on the protocol options. 1693 An INTEGRITY message for bin 0 with a SHA1 hash on the wire: 1695 INTEGRITY 00000000 1234123412341234123412341234123412341234 1697 8.9. SIGNED_INTEGRITY 1699 A SIGNED_INTEGRITY message (type 0x07) consists of a chunk 1700 specification and a digital signature encoded as in DNSSEC without 1701 the BASE-64 encoding [RFC4034]. The signature algorithm is defined 1702 by the Live Signature Algorithm protocol option, see Section 7.7. 1704 8.10. REQUEST 1706 A REQUEST message (type 0x08) consists of a chunk specification for 1707 the chunks the requester wants to download. 1709 8.11. CANCEL 1711 A CANCEL message (type 0x09) consists of a chunk specification for 1712 the chunks the requester no longer is interested in. 1714 8.12. CHOKE and UNCHOKE 1716 Both CHOKE and UNCHOKE messages (types 0x0a and 0x0b, respectively) 1717 carry no payload. 1719 8.13. PEX_REQ, PEX_RESv4, PEX_RESv6 and PEX_REScert 1721 A PEX_REQ (0x06) message has no payload. A PEX_RES (0x05) message 1722 consists of an IPv4 address in big endian format followed by a UDP 1723 port number in big endian format. A PEX_RESv6 (0x0c) message 1724 contains a 128-bit IPv6 address instead of an IPv4 one. If sender of 1725 the PEX_REQ message does not have a private or link-local address, 1726 then the PEX_RES* messages MUST NOT contain such addresses 1727 [RFC1918][RFC4291]. 1729 A PEX_REScert (0x0d) message consists of a 16-bit integer in big 1730 endian specifying the size of the membership certificate that 1731 follows, see Section 12.2.1. This membership certificate states that 1732 peer P at time T is a member of swarm S and is a X.509v3 certificate 1733 [RFC2459] that is encoded using the ASN.1 distinguished encoding 1734 rules (DER) [CCITT.X208.1988]. The certificate MUST contain a 1735 "Subject Alternative Name" extension, marked as critical, of type 1736 uniformResourceIdentifier. 1738 The URL the name extension contains MUST follow the generic syntax 1739 for URLs [RFC3986], where its scheme component is "ppsp", the host in 1740 the authority component is the DNS name or IP address of peer P, the 1741 port in the authority component is the port of peer P, and the path 1742 contains the swarm identifier for swarm S, in hexadecimal form. In 1743 particular, the preferred form of the swarm identifier is xxyyzz..., 1744 where the 'x's, 'y's and 'z's are 2 hexadecimal digits of the 8-bit 1745 pieces of the identifier. The validity time of the certificate is 1746 set with notBefore UTCTime set to T and notAfter UTCTime set to T 1747 plus some expiry time defined by the issuer. An example URL: 1749 ppsp://192.168.0.1:6778/e5a12c7ad2d8fab33c699d1e198d66f79fa610c3 1751 8.14. KEEPALIVE 1753 Keepalives do not have a message type on UDP. They are just simple 1754 datagrams consisting of a 4-byte channel number only. 1756 On the wire: 1758 (CHANNEL) 00000022 1760 A guideline for declaring a peer dead consist of a 3 minute delay 1761 since that last packet has been received from that peer, and at least 1762 3 datagrams were sent to that peer during the same period. 1764 8.15. Flow and Congestion Control 1766 Explicit flow control is not necessary in PPSPP-over-UDP. In the 1767 case of video-on-demand the receiver will request data explicitly 1768 from peers and is therefore in control of how much data is coming 1769 towards it. In the case of live streaming, where a push-model may be 1770 used, the amount of data incoming is limited to the bitrate, which 1771 the receiver must be able to process otherwise it cannot play the 1772 stream. Should, for any reason, the receiver get saturated with data 1773 that situation is perfectly detected by the congestion control. 1774 PPSPP-over-UDP can support different congestion control algorithms. 1775 At present, it uses the LEDBAT congestion control algorithm 1776 [I-D.ietf-ledbat-congestion]. 1778 9. Extensibility 1780 9.1. Chunk Picking Algorithms 1782 Chunk (or piece) picking entirely depends on the receiving peer. The 1783 sender peer is made aware of preferred chunks by the means of REQUEST 1784 messages. In some (live) scenarios it may be beneficial to allow the 1785 sender to ignore those hints and send unrequested data. 1787 The chunk picking algorithm is external to the PPSPP protocol and 1788 will generally be a pluggable policy that uses the mechanisms 1789 provided by PPSPP. The algorithm will handle the choices made by the 1790 user consuming the content, such as seeking, switching audio tracks 1791 or subtitles. Example policies for P2P streaming can be found in 1792 [BITOS], and [EPLIVEPERF]. 1794 9.2. Reciprocity Algorithms 1796 The role of reciprocity algorithms in peer-to-peer systems is to 1797 promote client contribution and prevent freeriding. A peer is said 1798 to be freeriding if it only downloads content but never uploads to 1799 others. Examples of reciprocity algorithms are tit-for-tat as used 1800 in BitTorrent [TIT4TAT] and Give-to-Get [GIVE2GET]. In PPSPP, 1801 reciprocity enforcement is the sole responsibility of the sender 1802 peer. 1804 10. Acknowledgements 1806 Arno Bakker, Riccardo Petrocco and Victor Grishchenko are partially 1807 supported by the P2P-Next project (http://www.p2p-next.org/), a 1808 research project supported by the European Community under its 7th 1809 Framework Programme (grant agreement no. 216217). The views and 1810 conclusions contained herein are those of the authors and should not 1811 be interpreted as necessarily representing the official policies or 1812 endorsements, either expressed or implied, of the P2P-Next project or 1813 the European Commission. 1815 The PPSPP protocol was designed by Victor Grishchenko at Technische 1816 Universiteit Delft. The authors would like to thank the following 1817 people for their contributions to this draft: the chairs and members 1818 of the IETF PPSP working group, and Mihai Capota, Raul Jimenez, 1819 Flutra Osmani, Johan Pouwelse, and Raynor Vliegendhart. 1821 11. IANA Considerations 1823 To be determined. 1825 12. Security Considerations 1827 As any other network protocol, the PPSPP faces a common set of 1828 security challenges. An implementation must consider the possibility 1829 of buffer overruns, DoS attacks and manipulation (i.e. reflection 1830 attacks). Any guarantee of privacy seems unlikely, as the user is 1831 exposing its IP address to the peers. A probable exception is the 1832 case of the user being hidden behind a public NAT or proxy. This 1833 section discusses the protocol's security considerations in detail. 1835 12.1. Security of the Handshake Procedure 1837 Borrowing from the analysis in [RFC5971], the PPSP peer protocol may 1838 be attacked with 3 types of denial-of-service attacks: 1840 1. DOS amplification attack: attackers try to use a PPSPP peer to 1841 generate more traffic to a victim. 1843 2. DOS flood attack: attackers try to deny service to other peers by 1844 allocating lots of state at a PPSPP peer. 1846 3. Disrupt service to an individual peer: attackers send bogus e.g. 1847 REQUEST and HAVE messages appearing to come from victim peer A to 1848 the peers B1..Bn serving that peer. This causes A to receive 1849 chunks it did not request or to not receive the chunks it 1850 requested. 1852 The basic scheme to protect against these attacks is the use of a 1853 secure handshake procedure. In the UDP encapsulation the handshake 1854 procedure is secured by the use of randomly chosen channel IDs as 1855 follows. The channel IDs must be generated following the 1856 requirements in [RFC4960](Sec. 5.1.3). 1858 When UDP is used, all datagrams carrying PPSPP messages are prefixed 1859 with a 4-byte channel ID. These channel IDs are random numbers, 1860 established during the handshake phase as follows. Peer A initiates 1861 an exchange with peer B by sending a datagram containing a HANDSHAKE 1862 message prefixed with the channel ID consisting of all 0s. Peer A's 1863 HANDSHAKE contains a randomly chosen channel ID, chanA: 1865 A->B: chan0 + HANDSHAKE(chanA) + ... 1867 When peer B receives this datagram, it creates some state for peer A, 1868 that at least contains the channel ID chanA. Next, peer B sends a 1869 response to A, consisting of a datagram containing a HANDSHAKE 1870 message prefixed with the chanA channel ID. Peer B's HANDSHAKE 1871 contains a randomly chosen channel ID, chanB. 1873 B->A: chanA + HANDSHAKE(chanB) + ... 1875 Peer A now knows that peer B really responds, as it echoed chanA. So 1876 the next datagram that A sends may already contain heavy payload, 1877 i.e., a chunk. This next datagram to B will be prefixed with the 1878 chanB channel ID. When B receives this datagram, both peers have the 1879 proof they are really talking to each other, the three-way handshake 1880 is complete. In other words, the randomly chosen channel IDs act as 1881 tags (cf. [RFC4960](Sec. 5.1)). 1883 A->B: chanB + HAVE + DATA + ... 1885 12.1.1. Protection against attack 1 1887 In short, PPSPP does a so-called return routability check before 1888 heavy payload is sent. This means that attack 1 is fended off: PPSPP 1889 does not send back much more data than it received, unless it knows 1890 it is talking to a live peer. Attackers sending a spoofed HANDSHAKE 1891 to B pretending to be A now need to intercept the message from B to A 1892 to get B to send heavy payload, and ensure that that heavy payload 1893 goes to the victim, something assumed too hard to be a practical 1894 attack. 1896 Note the rule is that no heavy payload may be sent until the third 1897 datagram. This has implications for PPSPP implementations that use 1898 chunk addressing schemes that are verbose. If a PPSPP implementation 1899 uses large bitmaps to convey chunk availability these may not be sent 1900 by peer B in the second datagram. 1902 12.1.2. Protection against attack 2 1904 On receiving the first datagram peer B will record some state about 1905 peer A. At present this state consists of the chanA channel ID, and 1906 the results of processing the other messages in the first datagram. 1907 In particular, if A included some HAVE messages, B may add a chunk 1908 availability map to A's state. In addition, B may request some 1909 chunks from A in the second datagram, and B will maintain state about 1910 these outgoing requests. 1912 So presently, PPSPP is somewhat vulnerable to attack 2. An attacker 1913 could send many datagrams with HANDSHAKEs and HAVEs and thus allocate 1914 state at the PPSPP peer. Therefore peer A MUST respond immediately 1915 to the second datagram, if it is still interested in peer B. 1917 The reason for using this slightly vulnerable three-way handshake 1918 instead of the safer handshake procedure of SCTP [RFC4960](Sec. 5.1) 1919 is quicker response time for the user. In the SCTP procedure, peer A 1920 and B cannot request chunks until datagrams 3 and 4 respectively, as 1921 opposed to 2 and 1 in the proposed procedure. This means that the 1922 user has to wait shorter in PPSPP between starting the video stream 1923 and seeing the first images. 1925 12.1.3. Protection against attack 3 1927 In general, channel IDs serve to authenticate a peer. Hence, to 1928 attack, a malicious peer T would need to be able to eavesdrop on 1929 conversations between victim A and a benign peer B to obtain the 1930 channel ID B assigned to A, chanB. Furthermore, attacker T would 1931 need to be able to spoof e.g. REQUEST and HAVE messages from A to 1932 cause B to send heavy DATA messages to A, or prevent B from sending 1933 them, respectively. 1935 The capability to eavesdrop is not common, so the protection afforded 1936 by channel IDs will be sufficient in most cases. If not, point-to- 1937 point encryption of traffic should be used, see below. 1939 12.2. Secure Peer Address Exchange 1941 As described in Section 3.10, a peer A can send a Peer-Exchange 1942 message PEX_RES to a peer B, which contains the IP address and port 1943 of other peers that are supposedly also in the current swarm. The 1944 strength of this mechanism is that it allows decentralized tracking: 1945 after an initial bootstrap no central tracker is needed anymore. The 1946 vulnerability of this mechanism (and DHTs) is that malicious peers 1947 can use it for an Amplification attack. 1949 In particular, a malicious peer T could send a PEX_RES to well- 1950 behaved peer A containing a list of address B1,B2,...,BN and on 1951 receipt, peer A could send a HANDSHAKE to all these peers. So in the 1952 worst case, a single datagram results in N datagrams. The actual 1953 damage depends on A's behaviour. E.g. when A already has sufficient 1954 connections it may not connect to the offered ones at all, but if it 1955 is a fresh peer it may connect to all directly. 1957 In addition, PEX can be used in Eclipse attacks [ECLIPSE] where 1958 malicious peers try to isolate a particular peer such that it only 1959 interacts with malicious peers. Let us distinguish two specific 1960 attacks: 1962 E1. Malicious peers try to eclipse the single injector in live 1963 streaming. 1965 E2. Malicious peers try to eclipse a specific consumer peer. 1967 Attack E1 has the most impact on the system as it would disrupt all 1968 peers. 1970 12.2.1. Protection against the Amplification Attack 1972 If peer addresses are relatively stable, strong protection against 1973 the attack can be provided by using public key cryptography and 1974 certification. In particular, a PEX_RES message will carry swarm- 1975 membership certificates rather than IP address and port. A 1976 membership certificate for peer B states that peer B at address 1977 (ipB,portB) is part of swarm S at time T and is cryptographically 1978 signed. The receiver A can check the cert for a valid signature, the 1979 right swarm and liveliness and only then consider contacting B. These 1980 swarm-membership certificates correspond to signed node descriptors 1981 in secure decentralized peer sampling services [SPS]. 1983 Several designs are possible for the security environment for these 1984 membership certificates. That is, there are different designs 1985 possible for who signs the membership certificates and how public 1986 keys are distributed. As an example, we describe a design where the 1987 PPSP tracker acts as certification authority. 1989 12.2.2. Example: Tracker as Certification Authority 1991 A peer A wanting to join swarm S sends a certificate request message 1992 to a tracker X for that swarm. Upon receipt, the tracker creates a 1993 membership certificate from the request with swarm ID S, a timestamp 1994 T and the external IP and port it received the message from, signed 1995 with the tracker's private key. This certificate is returned to A. 1997 Peer A then includes this certificate when it sends a PEX_RES to peer 1998 B. Receiver B verifies it against the tracker public key. This 1999 tracker public key should be part of the swarm's metadata, which B 2000 received from a trusted source. Subsequently, peer B can send the 2001 member certificate of A to other peers in PEX_RES messages. 2003 Peer A can send the certification request when it first contacts the 2004 tracker, or at a later time. Furthermore, the responses the tracker 2005 sends could contain membership certificates instead of plain 2006 addresses, such that they can be gossiped securely as well. 2008 We assume the tracker is protected against attacks and does a return 2009 routability check. The latter ensures that malicious peers cannot 2010 obtain a certificate for a random host, just for hosts where they can 2011 eavesdrop on incoming traffic. 2013 The load generated on the tracker depends on churn and the lifetime 2014 of a certificate. Certificates can be fairly long lived, given that 2015 the main goal of the membership certificates is to prevent that 2016 malicious peer T can cause good peer A to contact *random* hosts. 2017 The freshness of the timestamp just adds extra protection in addition 2018 to achieving that goal. It protects against malicious hosts causing 2019 a good peer A to contact hosts that previously participated in the 2020 swarm. 2022 The membership certificate mechanism itself can be used for a kind of 2023 amplification attack against good peers. Malicious peer T can cause 2024 peer A to spend some CPU to verify the signatures on the membership 2025 certificates that T sends. To counter this, A SHOULD check a few of 2026 the certificates sent and discard the rest if they are defective. 2028 The same membership certificates described above can be registered in 2029 a Distributed Hash Table that has been secured against the well-known 2030 DHT specific attacks [SECDHTS]. 2032 Note that this scheme does not work for peers behind a symmetric 2033 Network Address Translator, but neither does normal tracker 2034 registration. 2036 12.2.3. Protection Against Eclipse Attacks 2038 Before we can discuss Eclipse attacks we first need to establish the 2039 security properties of the central tracker. A tracker is vulnerable 2040 to Amplification attacks too. A malicious peer T could register a 2041 victim B with the tracker, and many peers joining the swarm will 2042 contact B. Trackers can also be used in Eclipse attacks. If many 2043 malicious peers register themselves at the tracker, the percentage of 2044 bad peers in the returned address list may become high. Leaving the 2045 protection of the tracker to the PPSP tracker protocol specification, 2046 we assume for the following discussion that it returns a true random 2047 sample of the actual swarm membership (achieved via Sybil attack 2048 protection). This means that if 50% of the peers is bad, you'll 2049 still get 50% good addresses from the tracker. 2051 Attack E1 on PEX can be fended off by letting live injectors disable 2052 PEX. Or at least, let live injectors ensure that part of their 2053 connections are to peers whose addresses came from the trusted 2054 tracker. 2056 The same measures defend against attack E2 on PEX. They can also be 2057 employed dynamically. When the current set of peers B that peer A is 2058 connected to doesn't provide good quality of service, A can contact 2059 the tracker to find new candidates. 2061 12.3. Support for Closed Swarms (PPSP.SEC.REQ-1) 2063 The Closed Swarms [CLOSED] and Enhanced Closed Swarms [ECS] 2064 mechanisms provide swarm-level access control. The basic idea is 2065 that a peer cannot download from another peer unless it shows a 2066 Proof-of-Access. Enhanced Closed Swarms improve on the original 2067 Closed Swarms by adding on-the-wire encryption against man-in-the- 2068 middle attacks and more flexible access control rules. 2070 The exact mapping of ECS to PPSPP is defined in 2071 [I-D.gabrijelcic-ppsp-ecs]. 2073 12.4. Confidentiality of Streamed Content (PPSP.SEC.REQ-2+3) 2075 No extra mechanism is needed to support confidentiality in PPSPP. A 2076 content publisher wishing confidentiality should just distribute 2077 content in cyphertext / DRM-ed format. In that case it is assumed a 2078 higher layer handles key management out-of-band. Alternatively, pure 2079 point-to-point encryption of content and traffic can be provided by 2080 the proposed Closed Swarms access control mechanism, or by DTLS 2081 [RFC6347] or IPsec [RFC4301]. 2083 12.5. Limit Potential Damage and Resource Exhaustion by Bad or Broken 2084 Peers (PPSP.SEC.REQ-4+6) 2086 In this section an analysis is given of the potential damage a 2087 malicious peer can do with each message in the protocol, and how it 2088 is prevented by the protocol (implementation). 2090 12.5.1. HANDSHAKE 2091 o Secured against DoS amplification attacks as described in 2092 Section 12.1. 2094 o Threat HS.1: An Eclipse attack where peers T1..TN fill all 2095 connection slots of A by initiating the connection to A. 2097 Solution: Peer A must not let other peers fill all its available 2098 connection slots, i.e., A must initiate connections itself too, to 2099 prevent isolation. 2101 12.5.2. HAVE 2103 o Threat HAVE.1: Malicious peer T can claim to have content which it 2104 hasn't. Subsequently T won't respond to requests. 2106 Solution: peer A will consider T to be a slow peer and not ask it 2107 again. 2109 o Threat HAVE.2: Malicious peer T can claim not to have content. 2110 Hence it won't contribute. 2112 Solution: Peer and chunk selection algorithms external to the 2113 protocol will implement fairness and provide sharing incentives. 2115 12.5.3. DATA 2117 o Threat DATA.1: peer T sending bogus chunks. 2119 Solution: The content integrity protection schemes defend against 2120 this. 2122 o Threat DATA.2: peer T sends peer A unrequested chunks. 2124 To protect against this threat we need network-level DoS 2125 prevention. 2127 12.5.4. ACK 2129 o Threat ACK.1: peer T acknowledges wrong chunks. 2131 Solution: peer A will detect inconsistencies with the data it sent 2132 to T. 2134 o Threat ACK.2: peer T modifies timestamp in ACK to peer A used for 2135 time-based congestion control. 2137 Solution: In theory, by decreasing the timestamp peer T could fake 2138 there is no congestion when in fact there is, causing A to send 2139 more data than it should. [I-D.ietf-ledbat-congestion] does not 2140 list this as a security consideration. Possibly this attack can 2141 be detected by the large resulting asymmetry between round-trip 2142 time and measured one-way delay. 2144 12.5.5. INTEGRITY and SIGNED_INTEGRITY 2146 o Threat INTEGRITY.1: An amplification attack where peer T sends 2147 bogus INTEGRITY or SIGNED_INTEGRITY messages, causing peer A to 2148 checks hashes or signatures, thus spending CPU unnecessarily. 2150 Solution: If the hashes/signatures don't check out A will stop 2151 asking T because of the atomic datagram principle and the content 2152 integrity protection. Subsequent unsolicited traffic from T will 2153 be ignored. 2155 12.5.6. REQUEST 2157 o Threat REQUEST.1: peer T could request lots from A, leaving A 2158 without resources for others. 2160 Solution: A limit is imposed on the upload capacity a single peer 2161 can consume, for example, by using an upload bandwidth scheduler 2162 that takes into account the need of multiple peers. A natural 2163 upper limit of this upload quotum is the bitrate of the content, 2164 taking into account that this may be variable. 2166 12.5.7. CANCEL 2168 o Threat CANCEL.1: peer T sends CANCEL messages for content it never 2169 requested to peer A. 2171 Solution: peer A will detect the inconsistency of the messages and 2172 ignore them. Note that CANCEL messages may be received 2173 unexpectedly when a transport is used where REQUEST messages may 2174 be lost or reordered with respect to the subsequent CANCELs. 2176 12.5.8. CHOKE 2178 o Threat CHOKE.1: peer T sends REQUEST messages after peer A sent B 2179 a CHOKE message. 2181 Solution: peer A will just discard the unwanted REQUESTs and 2182 resend the CHOKE, assuming it got lost. 2184 12.5.9. UNCHOKE 2186 o Threat UNCHOKE.1: peer T sends an UNCHOKE message to peer A 2187 without having sent a CHOKE message before. 2189 Solution: peer A can easily detect this violation of protocol 2190 state, and ignore it. Note this can also happen due to loss of a 2191 CHOKE message sent by a benign peer. 2193 o Threat UNCHOKE.2: peer T sends an UNCHOKE message to peer A, but 2194 subsequently does not respond to its REQUESTs. 2196 Solution: peer A will consider T to be a slow peer and not ask it 2197 again. 2199 12.5.10. PEX_RES 2201 o Secured against amplification and Eclipse attacks as described in 2202 Section 12.2. 2204 12.5.11. Unsolicited Messages in General 2206 o Threat: peer T could send a spoofed PEX_REQ or REQUEST from peer B 2207 to peer A, causing A to send a PEX_RES/DATA to B. 2209 Solution: the message from peer T won't be accepted unless T does 2210 a handshake first, in which case the reply goes to T, not victim 2211 B. 2213 12.6. Exclude Bad or Broken Peers (PPSP.SEC.REQ-5) 2215 A receiving peer can detect malicious or faulty senders as just 2216 described, which it can then subsequently ignore. However, excluding 2217 such a bad peer from the system completely is complex. Random 2218 monitoring by trusted peers that would blacklist bad peers as 2219 described in [DETMAL] is one option. This mechanism does require 2220 extra capacity to run such trusted peers, which must be 2221 indistinguishable from regular peers, and requires a solution for the 2222 timely distribution of this blacklist to peers in a scalable manner. 2224 13. References 2226 13.1. Normative References 2228 [CCITT.X208.1988] 2229 International International Telephone and Telegraph 2230 Consultative Committee, "Specification of Abstract Syntax 2231 Notation One (ASN.1)", CCITT Recommendation X.208, 2232 November 1988. 2234 [FIPS180-3] 2235 Information Technology Laboratory, National Institute of 2236 Standards and Technology, "Federal Information Processing 2237 Standards: Secure Hash Standard (SHS)", Publication 180-3, 2238 Oct 2008. 2240 [I-D.ietf-ledbat-congestion] 2241 Shalunov, S., Hazel, G., Iyengar, J., and M. Kuehlewind, 2242 "Low Extra Delay Background Transport (LEDBAT)", 2243 draft-ietf-ledbat-congestion-10 (work in progress), 2244 September 2012. 2246 [IANADNSSECALGNUM] 2247 IANA, "Domain Name System Security (DNSSEC) Algorithm 2248 Numbers", 2249 . 2251 [RFC1918] Rekhter, Y., Moskowitz, R., Karrenberg, D., Groot, G., and 2252 E. Lear, "Address Allocation for Private Internets", 2253 BCP 5, RFC 1918, February 1996. 2255 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 2256 Requirement Levels", BCP 14, RFC 2119, March 1997. 2258 [RFC2459] Housley, R., Ford, W., Polk, T., and D. Solo, "Internet 2259 X.509 Public Key Infrastructure Certificate and CRL 2260 Profile", RFC 2459, January 1999. 2262 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 2263 Resource Identifier (URI): Generic Syntax", STD 66, 2264 RFC 3986, January 2005. 2266 [RFC4034] Arends, R., Austein, R., Larson, M., Massey, D., and S. 2267 Rose, "Resource Records for the DNS Security Extensions", 2268 RFC 4034, March 2005. 2270 [RFC4291] Hinden, R. and S. Deering, "IP Version 6 Addressing 2271 Architecture", RFC 4291, February 2006. 2273 [RFC6709] Carpenter, B., Aboba, B., and S. Cheshire, "Design 2274 Considerations for Protocol Extensions", RFC 6709, 2275 September 2012. 2277 13.2. Informative References 2279 [ABMRKL] Bakker, A., "Merkle hash torrent extension", BitTorrent 2280 Enhancement Proposal 30, Mar 2009, 2281 . 2283 [BINMAP] Grishchenko, V. and J. Pouwelse, "Binmaps: hybridizing 2284 bitmaps and binary trees", Technical Report PDS-2011-005, 2285 Parallel and Distributed Systems Group, Fac. of 2286 Electrical Engineering, Mathematics, and Computer 2287 Science, Delft University of Technology, The Netherlands, 2288 Apr 2009. 2290 [BITOS] Vlavianos, A., Iliofotou, M., Mathieu, F., and M. 2291 Faloutsos, "BiToS: Enhancing BitTorrent for Supporting 2292 Streaming Applications", IEEE INFOCOM Global Internet 2293 Symposium Barcelona, Spain, Apr 2006. 2295 [BITTORRENT] 2296 Cohen, B., "The BitTorrent Protocol Specification", 2297 BitTorrent Enhancement Proposal 3, Feb 2008, 2298 . 2300 [CLOSED] Borch, N., Mitchell, K., Arntzen, I., and D. Gabrijelcic, 2301 "Access Control to BitTorrent Swarms Using Closed Swarms", 2302 ACM workshop on Advanced Video Streaming Techniques for 2303 Peer-to-Peer Networks and Social Networking (AVSTP2P '10), 2304 Florence, Italy, Oct 2010, 2305 . 2307 [DETMAL] Shetty, S., Galdames, P., Tavanapong, W., and Ying. Cai, 2308 "Detecting Malicious Peers in Overlay Multicast 2309 Streaming", IEEE Conference on Local Computer 2310 Networks (LCN'06). Tampa, FL, USA, Nov 2006. 2312 [ECLIPSE] Sit, E. and R. Morris, "Security Considerations for Peer- 2313 to-Peer Distributed Hash Tables", IPTPS '01: Revised 2314 Papers from the First International Workshop on Peer-to- 2315 Peer Systems pp. 261-269, Springer-Verlag, 2002. 2317 [ECS] Jovanovikj, V., Gabrijelcic, D., and T. Klobucar, "Access 2318 Control in BitTorrent P2P Networks Using the Enhanced 2319 Closed Swarms Protocol", International Conference on 2320 Emerging Security Information, Systems and 2321 Technologies (SECURWARE 2011), pp. 97-102, Nice, France, 2322 Aug 2011. 2324 [EPLIVEPERF] 2325 Bonald, T., Massoulie, L., Mathieu, F., Perino, D., and A. 2326 Twigg, "Epidemic Live Streaming: Optimal Performance 2327 Trade-offs", Proceedings of the 2008 ACM SIGMETRICS 2328 International Conference on Measurement and Modeling of 2329 Computer Systems Annapolis, MD, USA, Jun 2008. 2331 [GIVE2GET] 2332 Mol, J., Pouwelse, J., Meulpolder, M., Epema, D., and H. 2333 Sips, "Give-to-Get: Free-riding Resilient Video-on-demand 2334 in P2P Systems", Proceedings Multimedia Computing and 2335 Networking conference (Proceedings of SPIE Vol. 6818) San 2336 Jose, California, USA, Jan 2008. 2338 [HAC01] Menezes, A., van Oorschot, P., and S. Vanstone, "Handbook 2339 of Applied Cryptography", CRC Press, (Fifth Printing, 2340 August 2001), Oct 1996. 2342 [I-D.gabrijelcic-ppsp-ecs] 2343 Gabrijelcic, D., "Enhanced Closed Swarm protocol", 2344 draft-ppsp-gabrijelcic-ecs (work in progress), 2345 November 2012. 2347 [I-D.ietf-ppsp-reqs] 2348 Williams, C., Xiao, L., Zong, N., Pascual, V., and Y. 2349 Zhang, "P2P Streaming Protocol (PPSP) Requirements", 2350 draft-ietf-ppsp-reqs-05 (work in progress), October 2011. 2352 [JIM11] Jimenez, R., Osmani, F., and B. Knutsson, "Sub-Second 2353 Lookups on a Large-Scale Kademlia-Based Overlay", IEEE 2354 International Conference on Peer-to-Peer 2355 Computing (P2P'11), Kyoto, Japan, Aug 2011. 2357 [MERKLE] Merkle, R., "Secrecy, Authentication, and Public Key 2358 Systems", Ph.D. thesis Dept. of Electrical Engineering, 2359 Stanford University, CA, USA, pp 40-45, 1979. 2361 [POLLIVE] Dhungel, P., Hei, Xiaojun., Ross, K., and N. Saxena, 2362 "Pollution in P2P Live Video Streaming", International 2363 Journal of Computer Networks & Communications 2364 (IJCNC) Vol.1, No.2, Jul 2009. 2366 [RFC2132] Alexander, S. and R. Droms, "DHCP Options and BOOTP Vendor 2367 Extensions", RFC 2132, March 1997. 2369 [RFC4301] Kent, S. and K. Seo, "Security Architecture for the 2370 Internet Protocol", RFC 4301, December 2005. 2372 [RFC4960] Stewart, R., "Stream Control Transmission Protocol", 2373 RFC 4960, September 2007. 2375 [RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an 2376 IANA Considerations Section in RFCs", BCP 26, RFC 5226, 2377 May 2008. 2379 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, 2380 "Session Traversal Utilities for NAT (STUN)", RFC 5389, 2381 October 2008. 2383 [RFC5971] Schulzrinne, H. and R. Hancock, "GIST: General Internet 2384 Signalling Transport", RFC 5971, October 2010. 2386 [RFC6347] Rescorla, E. and N. Modadugu, "Datagram Transport Layer 2387 Security Version 1.2", RFC 6347, January 2012. 2389 [RMCATCHART] 2390 Eggert, L. and others, "RTP Media Congestion Avoidance 2391 Techniques (rmcat) Description of Working Group", 2012, 2392 . 2394 [SECDHTS] Urdaneta, G., Pierre, G., and M. van Steen, "A Survey of 2395 DHT Security Techniques", ACM Computing Surveys vol. 2396 43(2), Jun 2011. 2398 [SIGMCAST] 2399 Wong, C. and S. Lam, "Digital Signatures for Flows and 2400 Multicasts", IEEE/ACM Transactions on Networking 7(4), pp. 2401 502-513, 1999. 2403 [SNP] Ford, B., Srisuresh, P., and D. Kegel, "Peer-to-Peer 2404 Communication Across Network Address Translators", 2405 Feb 2005, . 2407 [SPS] Jesi, G., Montresor, A., and M. van Steen, "Secure Peer 2408 Sampling", Computer Networks vol. 54(12), pp. 2086-2098, 2409 Elsevier, Aug 2010. 2411 [SWIFTIMPL] 2412 Grishchenko, V., Paananen, J., Pronchenkov, A., Bakker, 2413 A., and R. Petrocco, "Swift reference implementation", 2414 2012, . 2416 [TIT4TAT] Cohen, B., "Incentives Build Robustness in BitTorrent", 2417 1st Workshop on Economics of Peer-to-Peer 2418 Systems, Berkeley, CA, USA, Jun 2003. 2420 Appendix A. Revision History 2422 -00 2011-12-19 Initial version. 2424 -01 2012-01-30 Minor text revision: 2426 * Changed heading to "A. Bakker" 2428 * Changed title to *Peer* Protocol, and abbreviation PPSPP. 2430 * Replaced swift with PPSPP. 2432 * Removed Sec. 6.4. "HTTP (as PPSP)". 2434 * Renamed Sec. 8.4. to "Chunk Picking Algorithms". 2436 * Resolved Ticket #3: Removed sentence about random set of 2437 peers. 2439 * Resolved Ticket #6: Added clarification to "Chunk Picking 2440 Algorithms" section. 2442 * Resolved Ticket #11: Added Sec. 3.12 on Storage Independence 2444 * Resolved Ticket #14: Added clarification to "Automatic Size 2445 Detection" section. 2447 * Resolved Ticket #15: Operation section now states it shows 2448 example behaviour for a specific set of policies and schemes. 2450 * Resolved Ticket #30: Explained why multiple REQUESTs in one 2451 datagram. 2453 * Resolved Ticket #31: Renamed PEX_ADD message to PEX_RES. 2455 * Resolved Ticket #32: Renamed Sec 3.8. to "Keep Alive 2456 Signaling", and updated explanation. 2458 * Resolved Ticket #33: Explained NAT hole punching via only 2459 PPSPP messages. 2461 * Resolved Ticket #34: Added section about limited overhead of 2462 the Merkle hash tree scheme. 2464 -02 2012-04-17 Major revision 2466 * Allow different chunk addressing and content integrity 2467 protection schemes (ticket #13): 2469 * Added chunk ID, chunk specification, chunk addressing scheme, 2470 etc. to terminology. 2472 * Created new Sections 4 and 5 discussing chunk addressing and 2473 content integrity protection schemes, respectively and moved 2474 relevant sections on bin numbering and Merkle hash trees 2475 there. 2477 * Renamed Section 4 to "Merkle Hash Trees and The Automatic 2478 Detection of Content Size". 2480 * Reformulated automatic size detection in terms of nodes, not 2481 bins. 2483 * Extended HANDSHAKE message to carry protocol options and 2484 created Section 8 on Protocol options. VERSION and 2485 MSGTYPE_RCVD messages replaced with protocol options. 2487 * Renamed HASH message to INTEGRITY. 2489 * Renamed HINT to REQUEST. 2491 * Added description of chunk addressing via (start,end) ranges. 2493 * Resolved Ticket #26: Extended "Security Considerations" with 2494 section on the handshake procedure. 2496 * Resolved Ticket #17: Defined recently as "in last 60 seconds" 2497 in PEX. 2499 * Resolved Ticket #20: Extended "Security Considerations" with 2500 design to make Peer Address Exchange more secure. 2502 * Resolved Ticket #38+39 / PPSP.SEC.REQ-2+3: Extended "Security 2503 Considerations" with a section on confidentiality of content. 2505 * Resolved Ticket #40+42 / PPSP.SEC.REQ-4+6: Extended "Security 2506 Considerations" with a per-message analysis of threats and 2507 how PPSPP is protected from them. 2509 * Progressed Ticket #41 / PPSP.SEC.REQ-5: Extended "Security 2510 Considerations" with a section on possible ways of excluding 2511 bad or broken peers from the system. 2513 * Moved Rationale to Appendix. 2515 * Resolved Ticket #43: Updated Live Streaming section to 2516 include "Sign All" content authentication, and reference to 2517 [SIGMCAST] following discussion with Fabio Picconi. 2519 * Resolved Ticket #12: Added a CANCEL message to cancel 2520 REQUESTs for the same data that were sent to multiple peers 2521 at the same time in time-critical situations. 2523 -03 2012-10-22 Major revision 2525 * Updated Abstract and Introduction, removing download case. 2527 * Resolved Ticket #4: Added explicit CHOKE/UNCHOKE messages. 2529 * Removed directory lists unused in streaming. 2531 * Resolved Ticket #22, #23, #28: Failure behaviour, error codes 2532 and dealing with peer crashes. 2534 * Resolved Ticket #13: Chunk ranges are the default chunk 2535 addressing scheme that all peers MUST support. 2537 * Added a section on compatibility between chunk addressing 2538 schemes. 2540 * Expanded the explanation of Unified Merkle Trees as a method 2541 for content integrity protection for live streams. 2543 * Added a section on forgetting chunks in live streaming. 2545 * Added "End" option to protocol options and corrected bugs in 2546 UDP encapsulation, following Karl Knutsson's comments. 2548 * Added SHA-2 support for Merkle Hash functions. 2550 * Added content integrity protection methods for live streaming 2551 to the relevant protocol option. 2553 * Added a Live Signature Algorithm protocol option. 2555 * Resolved Ticket #24+27: The choice for UDP + LEDBAT as 2556 transport has now been reflected in the draft. TCP and RTP 2557 encapsulations have been removed. 2559 * Superfluous parts of Section 10 on extensibility have been 2560 removed. 2562 * Removed appendix with Rationale. 2564 * Resolved Ticket #21+25: PPSPP currently uses LEDBAT and the 2565 DATA and ACK messages now contain the time fields it 2566 requires. Should other congestion control algorithms be 2567 supported in the future, a protocol option will be added. 2569 -04 2012-11-07 Minor revision 2571 * Corrected typos. 2573 * Added empty protocol option list when HANDSHAKE is used for 2574 explicitly closing a channel in the UDP encapsulation. 2576 * Corrected definition of a range chunk specification to be a 2577 single (start,end) pair. To send multiple disjunct ranges 2578 multiple messages should be used. 2580 * Clarified that in a range chunk specification the end is 2581 inclusive. I.e., [start,end] not [start,end) 2583 * Added PEX_REScert message to carry a membership certificate. 2584 Renamed PEX_RES to PEX_RESv4. 2586 * Added a guideline about private and link-local addresses in 2587 PEX_RES messages. 2589 * Defined the format of the public key that is used as swarm ID 2590 in live streaming. 2592 * Clarified that a HANDSHAKE message must be the first message 2593 in a datagram. 2595 * Clarified sending INTEGRITY messages ahead in a separate 2596 datagram if not all necessary hashes that still need to be 2597 sent and the chunk fit into a single datagram. Defined an 2598 order for the INTEGRITY messages. 2600 * Clarified rare case of sending multiple DATA messages in one 2601 datagram. 2603 * Clarified UDP datagrams carrying PPSPP should adhere to the 2604 network's MTU to avoid IP fragmentation. 2606 * Defined value for version protocol option. 2608 * Added small clarifications and corrected typos. 2610 * Extended versioning scheme to Min/max versioning scheme 2611 defined in [RFC6709], Section 4.1, following Riccardo 2612 Bernardini's suggestion. 2614 * Processed comments on unclear phrasing from Riccardo 2615 Bernardini. 2617 * Added a guideline on when to declare a peer dead. 2619 * Made sure all essential references are listed as Normative 2620 references following RFC3967. 2622 Authors' Addresses 2624 Arno Bakker 2625 Technische Universiteit Delft 2626 Mekelweg 4 2627 Delft, 2628CD 2628 The Netherlands 2630 Phone: 2631 Email: arno@cs.vu.nl 2633 Riccardo Petrocco 2634 Technische Universiteit Delft 2635 Mekelweg 4 2636 Delft, 2628CD 2637 The Netherlands 2639 Phone: 2640 Email: r.petrocco@gmail.com 2642 Victor Grishchenko 2643 Technische Universiteit Delft 2644 Mekelweg 4 2645 Delft, 2628CD 2646 The Netherlands 2648 Phone: 2649 Email: victor.grishchenko@gmail.com