idnits 2.17.1 draft-ietf-p2psip-base-12.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- -- The document has examples using IPv4 documentation addresses according to RFC6890, but does not use any IPv6 documentation addresses. Maybe there should be IPv6 examples, too? Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 1821 has weird spacing: '...Options optio...' == Line 2078 has weird spacing: '...hReqAns con...' == Line 2115 has weird spacing: '...ionType type;...' == Line 2265 has weird spacing: '...te_type typ...' == Line 2325 has weird spacing: '...tyValue ide...' == (5 more instances...) -- The document seems to contain a disclaimer for pre-RFC5378 work, and may have content which was first submitted before 10 November 2008. The disclaimer is necessary when there are original authors that you have been unable to contact, or if some do not wish to grant the BCP78 rights to the IETF Trust. If you are able to get all authors (current and original) to grant those rights, you can and should remove the disclaimer; otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (November 10, 2010) is 4910 days in the past. Is this intentional? -- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'NodeIdLength' is mentioned on line 1716, but not defined -- Looks like a reference, but probably isn't: '0' on line 4549 == Missing Reference: 'RFC-AAAA' is mentioned on line 6182, but not defined ** Obsolete normative reference: RFC 5245 (Obsoleted by RFC 8445, RFC 8839) ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489) ** Obsolete normative reference: RFC 5766 (Obsoleted by RFC 8656) ** Obsolete normative reference: RFC 5246 (Obsoleted by RFC 8446) ** Obsolete normative reference: RFC 4347 (Obsoleted by RFC 6347) ** Obsolete normative reference: RFC 2988 (Obsoleted by RFC 6298) ** Obsolete normative reference: RFC 4395 (Obsoleted by RFC 7595) ** Obsolete normative reference: RFC 2818 (Obsoleted by RFC 9110) ** Obsolete normative reference: RFC 3447 (Obsoleted by RFC 8017) == Outdated reference: A later version (-16) exists of draft-ietf-mmusic-ice-tcp-11 -- Obsolete informational reference (is this intentional?): RFC 5201 (Obsoleted by RFC 7401) == Outdated reference: A later version (-09) exists of draft-ietf-p2psip-concepts-03 -- Obsolete informational reference (is this intentional?): RFC 5785 (Obsoleted by RFC 8615) == Outdated reference: A later version (-10) exists of draft-ietf-hip-reload-instance-02 == Outdated reference: A later version (-21) exists of draft-ietf-p2psip-sip-05 Summary: 9 errors (**), 0 flaws (~~), 13 warnings (==), 7 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 P2PSIP C. Jennings 3 Internet-Draft Cisco 4 Intended status: Standards Track B. B. Lowekamp, Ed. 5 Expires: May 14, 2011 E.K. Rescorla 6 Skype 7 S.A. Baset 8 H.G. Schulzrinne 9 Columbia University 10 November 10, 2010 12 REsource LOcation And Discovery (RELOAD) Base Protocol 13 draft-ietf-p2psip-base-12 15 Abstract 17 This specification defines REsource LOcation And Discovery (RELOAD), 18 a peer-to-peer (P2P) signaling protocol for use on the Internet. A 19 P2P signaling protocol provides its clients with an abstract storage 20 and messaging service between a set of cooperating peers that form 21 the overlay network. RELOAD is designed to support a P2P Session 22 Initiation Protocol (P2PSIP) network, but can be utilized by other 23 applications with similar requirements by defining new usages that 24 specify the kinds of data that must be stored for a particular 25 application. RELOAD defines a security model based on a certificate 26 enrollment service that provides unique identities. NAT traversal is 27 a fundamental service of the protocol. RELOAD also allows access 28 from "client" nodes that do not need to route traffic or store data 29 for others. 31 Legal 33 THIS DOCUMENT AND THE INFORMATION CONTAINED THEREIN ARE PROVIDED ON 34 AN "AS IS" BASIS AND THE CONTRIBUTOR, THE ORGANIZATION HE/SHE 35 REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE 36 IETF TRUST, AND THE INTERNET ENGINEERING TASK FORCE, DISCLAIM ALL 37 WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY 38 WARRANTY THAT THE USE OF THE INFORMATION THEREIN WILL NOT INFRINGE 39 ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS 40 FOR A PARTICULAR PURPOSE. 42 Status of this Memo 44 This Internet-Draft is submitted in full conformance with the 45 provisions of BCP 78 and BCP 79. 47 Internet-Drafts are working documents of the Internet Engineering 48 Task Force (IETF). Note that other groups may also distribute 49 working documents as Internet-Drafts. The list of current Internet- 50 Drafts is at http://datatracker.ietf.org/drafts/current/. 52 Internet-Drafts are draft documents valid for a maximum of six months 53 and may be updated, replaced, or obsoleted by other documents at any 54 time. It is inappropriate to use Internet-Drafts as reference 55 material or to cite them other than as "work in progress." 57 This Internet-Draft will expire on May 14, 2011. 59 Copyright Notice 61 Copyright (c) 2010 IETF Trust and the persons identified as the 62 document authors. All rights reserved. 64 This document is subject to BCP 78 and the IETF Trust's Legal 65 Provisions Relating to IETF Documents 66 (http://trustee.ietf.org/license-info) in effect on the date of 67 publication of this document. Please review these documents 68 carefully, as they describe your rights and restrictions with respect 69 to this document. Code Components extracted from this document must 70 include Simplified BSD License text as described in Section 4.e of 71 the Trust Legal Provisions and are provided without warranty as 72 described in the Simplified BSD License. 74 This document may contain material from IETF Documents or IETF 75 Contributions published or made publicly available before November 76 10, 2008. The person(s) controlling the copyright in some of this 77 material may not have granted the IETF Trust the right to allow 78 modifications of such material outside the IETF Standards Process. 79 Without obtaining an adequate license from the person(s) controlling 80 the copyright in such materials, this document may not be modified 81 outside the IETF Standards Process, and derivative works of it may 82 not be created outside the IETF Standards Process, except to format 83 it for publication as an RFC or to translate it into languages other 84 than English. 86 Table of Contents 88 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 8 89 1.1. Basic Setting . . . . . . . . . . . . . . . . . . . . . 9 90 1.2. Architecture . . . . . . . . . . . . . . . . . . . . . . 10 91 1.2.1. Usage Layer . . . . . . . . . . . . . . . . . . . . 13 92 1.2.2. Message Transport . . . . . . . . . . . . . . . . . 14 93 1.2.3. Storage . . . . . . . . . . . . . . . . . . . . . . 14 94 1.2.4. Topology Plugin . . . . . . . . . . . . . . . . . . 15 95 1.2.5. Forwarding and Link Management Layer . . . . . . . . 15 96 1.3. Security . . . . . . . . . . . . . . . . . . . . . . . . 16 97 1.4. Structure of This Document . . . . . . . . . . . . . . . 17 98 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 17 99 3. Overlay Management Overview . . . . . . . . . . . . . . . . . 20 100 3.1. Security and Identification . . . . . . . . . . . . . . 20 101 3.1.1. Shared-Key Security . . . . . . . . . . . . . . . . 21 102 3.2. Clients . . . . . . . . . . . . . . . . . . . . . . . . 21 103 3.2.1. Client Routing . . . . . . . . . . . . . . . . . . . 22 104 3.2.2. Minimum Functionality Requirements for Clients . . . 22 105 3.3. Routing . . . . . . . . . . . . . . . . . . . . . . . . 23 106 3.4. Connectivity Management . . . . . . . . . . . . . . . . 25 107 3.5. Overlay Algorithm Support . . . . . . . . . . . . . . . 26 108 3.5.1. Support for Pluggable Overlay Algorithms . . . . . . 26 109 3.5.2. Joining, Leaving, and Maintenance Overview . . . . . 26 110 3.6. First-Time Setup . . . . . . . . . . . . . . . . . . . . 28 111 3.6.1. Initial Configuration . . . . . . . . . . . . . . . 28 112 3.6.2. Enrollment . . . . . . . . . . . . . . . . . . . . . 28 113 4. Application Support Overview . . . . . . . . . . . . . . . . 29 114 4.1. Data Storage . . . . . . . . . . . . . . . . . . . . . . 29 115 4.1.1. Storage Permissions . . . . . . . . . . . . . . . . 30 116 4.1.2. Replication . . . . . . . . . . . . . . . . . . . . 31 117 4.2. Usages . . . . . . . . . . . . . . . . . . . . . . . . . 31 118 4.3. Service Discovery . . . . . . . . . . . . . . . . . . . 32 119 4.4. Application Connectivity . . . . . . . . . . . . . . . . 32 120 5. Overlay Management Protocol . . . . . . . . . . . . . . . . . 33 121 5.1. Message Receipt and Forwarding . . . . . . . . . . . . . 33 122 5.1.1. Responsible ID . . . . . . . . . . . . . . . . . . . 33 123 5.1.2. Other ID . . . . . . . . . . . . . . . . . . . . . . 34 124 5.1.3. Private ID . . . . . . . . . . . . . . . . . . . . . 35 125 5.2. Symmetric Recursive Routing . . . . . . . . . . . . . . 35 126 5.2.1. Request Origination . . . . . . . . . . . . . . . . 36 127 5.2.2. Response Origination . . . . . . . . . . . . . . . . 36 128 5.3. Message Structure . . . . . . . . . . . . . . . . . . . 37 129 5.3.1. Presentation Language . . . . . . . . . . . . . . . 38 130 5.3.1.1. Common Definitions . . . . . . . . . . . . . . . 38 131 5.3.2. Forwarding Header . . . . . . . . . . . . . . . . . 41 132 5.3.2.1. Processing Configuration Sequence Numbers . . . . 43 133 5.3.2.2. Destination and Via Lists . . . . . . . . . . . . 44 134 5.3.2.3. Forwarding Options . . . . . . . . . . . . . . . 46 135 5.3.2.4. Direct Return Response Forwarding Options . . . . 47 136 5.3.3. Message Contents Format . . . . . . . . . . . . . . 47 137 5.3.3.1. Response Codes and Response Errors . . . . . . . 49 138 5.3.4. Security Block . . . . . . . . . . . . . . . . . . . 51 139 5.4. Overlay Topology . . . . . . . . . . . . . . . . . . . . 54 140 5.4.1. Topology Plugin Requirements . . . . . . . . . . . . 54 141 5.4.2. Methods and types for use by topology plugins . . . 54 142 5.4.2.1. Join . . . . . . . . . . . . . . . . . . . . . . 54 143 5.4.2.2. Leave . . . . . . . . . . . . . . . . . . . . . . 55 144 5.4.2.3. Update . . . . . . . . . . . . . . . . . . . . . 56 145 5.4.2.4. RouteQuery . . . . . . . . . . . . . . . . . . . 56 146 5.4.2.5. Probe . . . . . . . . . . . . . . . . . . . . . . 57 147 5.5. Forwarding and Link Management Layer . . . . . . . . . . 59 148 5.5.1. Attach . . . . . . . . . . . . . . . . . . . . . . . 60 149 5.5.1.1. Request Definition . . . . . . . . . . . . . . . 60 150 5.5.1.2. Response Definition . . . . . . . . . . . . . . . 63 151 5.5.1.3. Using ICE With RELOAD . . . . . . . . . . . . . . 63 152 5.5.1.4. Collecting STUN Servers . . . . . . . . . . . . . 63 153 5.5.1.5. Gathering Candidates . . . . . . . . . . . . . . 64 154 5.5.1.6. Prioritizing Candidates . . . . . . . . . . . . . 65 155 5.5.1.7. Encoding the Attach Message . . . . . . . . . . . 65 156 5.5.1.8. Verifying ICE Support . . . . . . . . . . . . . . 66 157 5.5.1.9. Role Determination . . . . . . . . . . . . . . . 66 158 5.5.1.10. Full ICE . . . . . . . . . . . . . . . . . . . . 66 159 5.5.1.11. No-ICE . . . . . . . . . . . . . . . . . . . . . 67 160 5.5.1.12. Subsequent Offers and Answers . . . . . . . . . . 67 161 5.5.1.13. Sending Media . . . . . . . . . . . . . . . . . . 67 162 5.5.1.14. Receiving Media . . . . . . . . . . . . . . . . . 67 163 5.5.2. AppAttach . . . . . . . . . . . . . . . . . . . . . 67 164 5.5.2.1. Request Definition . . . . . . . . . . . . . . . 68 165 5.5.2.2. Response Definition . . . . . . . . . . . . . . . 68 166 5.5.3. Ping . . . . . . . . . . . . . . . . . . . . . . . . 69 167 5.5.3.1. Request Definition . . . . . . . . . . . . . . . 69 168 5.5.3.2. Response Definition . . . . . . . . . . . . . . . 69 169 5.5.4. ConfigUpdate . . . . . . . . . . . . . . . . . . . . 70 170 5.5.4.1. Request Definition . . . . . . . . . . . . . . . 70 171 5.5.4.2. Response Definition . . . . . . . . . . . . . . . 71 172 5.6. Overlay Link Layer . . . . . . . . . . . . . . . . . . . 72 173 5.6.1. Future Overlay Link Protocols . . . . . . . . . . . 73 174 5.6.1.1. HIP . . . . . . . . . . . . . . . . . . . . . . . 73 175 5.6.1.2. ICE-TCP . . . . . . . . . . . . . . . . . . . . . 74 176 5.6.1.3. Message-oriented Transports . . . . . . . . . . . 74 177 5.6.1.4. Tunneled Transports . . . . . . . . . . . . . . . 74 178 5.6.2. Framing Header . . . . . . . . . . . . . . . . . . . 74 179 5.6.3. Simple Reliability . . . . . . . . . . . . . . . . . 76 180 5.6.3.1. Retransmission and Flow Control . . . . . . . . . 76 181 5.6.4. DTLS/UDP with SR . . . . . . . . . . . . . . . . . . 78 182 5.6.5. TLS/TCP with FH, No-ICE . . . . . . . . . . . . . . 78 183 5.6.6. DTLS/UDP with SR, No-ICE . . . . . . . . . . 79 184 5.7. Fragmentation and Reassembly . . . . . . . . . . . . . . 79 185 6. Data Storage Protocol . . . . . . . . . . . . . . . . . . . . 80 186 6.1. Data Signature Computation . . . . . . . . . . . . . . . 81 187 6.2. Data Models . . . . . . . . . . . . . . . . . . . . . . 82 188 6.2.1. Single Value . . . . . . . . . . . . . . . . . . . . 83 189 6.2.2. Array . . . . . . . . . . . . . . . . . . . . . . . 84 190 6.2.3. Dictionary . . . . . . . . . . . . . . . . . . . . . 84 191 6.3. Access Control Policies . . . . . . . . . . . . . . . . 85 192 6.3.1. USER-MATCH . . . . . . . . . . . . . . . . . . . . . 85 193 6.3.2. NODE-MATCH . . . . . . . . . . . . . . . . . . . . . 85 194 6.3.3. USER-NODE-MATCH . . . . . . . . . . . . . . . . . . 85 195 6.3.4. NODE-MULTIPLE . . . . . . . . . . . . . . . . . . . 85 196 6.4. Data Storage Methods . . . . . . . . . . . . . . . . . . 86 197 6.4.1. Store . . . . . . . . . . . . . . . . . . . . . . . 86 198 6.4.1.1. Request Definition . . . . . . . . . . . . . . . 86 199 6.4.1.2. Response Definition . . . . . . . . . . . . . . . 90 200 6.4.1.3. Removing Values . . . . . . . . . . . . . . . . . 91 201 6.4.2. Fetch . . . . . . . . . . . . . . . . . . . . . . . 92 202 6.4.2.1. Request Definition . . . . . . . . . . . . . . . 93 203 6.4.2.2. Response Definition . . . . . . . . . . . . . . . 95 204 6.4.3. Stat . . . . . . . . . . . . . . . . . . . . . . . . 95 205 6.4.3.1. Request Definition . . . . . . . . . . . . . . . 96 206 6.4.3.2. Response Definition . . . . . . . . . . . . . . . 96 207 6.4.4. Find . . . . . . . . . . . . . . . . . . . . . . . . 98 208 6.4.4.1. Request Definition . . . . . . . . . . . . . . . 98 209 6.4.4.2. Response Definition . . . . . . . . . . . . . . . 98 210 6.4.5. Defining New Kinds . . . . . . . . . . . . . . . . . 99 211 7. Certificate Store Usage . . . . . . . . . . . . . . . . . . . 100 212 8. TURN Server Usage . . . . . . . . . . . . . . . . . . . . . . 101 213 9. Chord Algorithm . . . . . . . . . . . . . . . . . . . . . . . 102 214 9.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 103 215 9.2. Routing . . . . . . . . . . . . . . . . . . . . . . . . 104 216 9.3. Redundancy . . . . . . . . . . . . . . . . . . . . . . . 104 217 9.4. Joining . . . . . . . . . . . . . . . . . . . . . . . . 105 218 9.5. Routing Attaches . . . . . . . . . . . . . . . . . . . . 106 219 9.6. Updates . . . . . . . . . . . . . . . . . . . . . . . . 106 220 9.6.1. Handling Neighbor Failures . . . . . . . . . . . . . 107 221 9.6.2. Handling Finger Table Entry Failure . . . . . . . . 108 222 9.6.3. Receiving Updates . . . . . . . . . . . . . . . . . 108 223 9.6.4. Stabilization . . . . . . . . . . . . . . . . . . . 109 224 9.6.4.1. Updating neighbor table . . . . . . . . . . . . . 110 225 9.6.4.2. Refreshing finger table . . . . . . . . . . . . . 110 226 9.6.4.3. Adjusting finger table size . . . . . . . . . . . 111 227 9.6.4.4. Detecting partitioning . . . . . . . . . . . . . 111 228 9.7. Route query . . . . . . . . . . . . . . . . . . . . . . 112 229 9.8. Leaving . . . . . . . . . . . . . . . . . . . . . . . . 112 231 10. Enrollment and Bootstrap . . . . . . . . . . . . . . . . . . 113 232 10.1. Overlay Configuration . . . . . . . . . . . . . . . . . 113 233 10.1.1. Relax NG Grammar . . . . . . . . . . . . . . . . . . 118 234 10.2. Discovery Through Enrollment Server . . . . . . . . . . 120 235 10.3. Credentials . . . . . . . . . . . . . . . . . . . . . . 121 236 10.3.1. Self-Generated Credentials . . . . . . . . . . . . . 122 237 10.4. Searching for a Bootstrap Node . . . . . . . . . . . . . 123 238 10.5. Contacting a Bootstrap Node . . . . . . . . . . . . . . 123 239 11. Message Flow Example . . . . . . . . . . . . . . . . . . . . 123 240 12. Security Considerations . . . . . . . . . . . . . . . . . . . 130 241 12.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 130 242 12.2. Attacks on P2P Overlays . . . . . . . . . . . . . . . . 131 243 12.3. Certificate-based Security . . . . . . . . . . . . . . . 131 244 12.4. Shared-Secret Security . . . . . . . . . . . . . . . . . 132 245 12.5. Storage Security . . . . . . . . . . . . . . . . . . . . 133 246 12.5.1. Authorization . . . . . . . . . . . . . . . . . . . 133 247 12.5.2. Distributed Quota . . . . . . . . . . . . . . . . . 134 248 12.5.3. Correctness . . . . . . . . . . . . . . . . . . . . 134 249 12.5.4. Residual Attacks . . . . . . . . . . . . . . . . . . 134 250 12.6. Routing Security . . . . . . . . . . . . . . . . . . . . 135 251 12.6.1. Background . . . . . . . . . . . . . . . . . . . . . 135 252 12.6.2. Admissions Control . . . . . . . . . . . . . . . . . 136 253 12.6.3. Peer Identification and Authentication . . . . . . . 136 254 12.6.4. Protecting the Signaling . . . . . . . . . . . . . . 137 255 12.6.5. Residual Attacks . . . . . . . . . . . . . . . . . . 137 256 13. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 138 257 13.1. Well-Known URI Registration . . . . . . . . . . . . . . 138 258 13.2. Port Registrations . . . . . . . . . . . . . . . . . . . 138 259 13.3. Overlay Algorithm Types . . . . . . . . . . . . . . . . 139 260 13.4. Access Control Policies . . . . . . . . . . . . . . . . 139 261 13.5. Application-ID . . . . . . . . . . . . . . . . . . . . . 139 262 13.6. Data Kind-ID . . . . . . . . . . . . . . . . . . . . . . 140 263 13.7. Data Model . . . . . . . . . . . . . . . . . . . . . . . 140 264 13.8. Message Codes . . . . . . . . . . . . . . . . . . . . . 140 265 13.9. Error Codes . . . . . . . . . . . . . . . . . . . . . . 141 266 13.10. Overlay Link Types . . . . . . . . . . . . . . . . . . . 142 267 13.11. Overlay Link Protocols . . . . . . . . . . . . . . . . . 142 268 13.12. Forwarding Options . . . . . . . . . . . . . . . . . . . 143 269 13.13. Probe Information Types . . . . . . . . . . . . . . . . 143 270 13.14. Message Extensions . . . . . . . . . . . . . . . . . . . 143 271 13.15. reload URI Scheme . . . . . . . . . . . . . . . . . . . 144 272 13.15.1. URI Registration . . . . . . . . . . . . . . . . . . 144 273 14. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 145 274 15. References . . . . . . . . . . . . . . . . . . . . . . . . . 145 275 15.1. Normative References . . . . . . . . . . . . . . . . . . 145 276 15.2. Informative References . . . . . . . . . . . . . . . . . 147 277 Appendix A. Routing Alternatives . . . . . . . . . . . . . . . . 150 278 A.1. Iterative vs Recursive . . . . . . . . . . . . . . . . . 150 279 A.2. Symmetric vs Forward response . . . . . . . . . . . . . 150 280 A.3. Direct Response . . . . . . . . . . . . . . . . . . . . 151 281 A.4. Relay Peers . . . . . . . . . . . . . . . . . . . . . . 152 282 A.5. Symmetric Route Stability . . . . . . . . . . . . . . . 152 283 Appendix B. Why Clients? . . . . . . . . . . . . . . . . . . . . 153 284 B.1. Why Not Only Peers? . . . . . . . . . . . . . . . . . . 153 285 B.2. Clients as Application-Level Agents . . . . . . . . . . 154 286 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 154 288 1. Introduction 290 This document defines REsource LOcation And Discovery (RELOAD), a 291 peer-to-peer (P2P) signaling protocol for use on the Internet. It 292 provides a generic, self-organizing overlay network service, allowing 293 nodes to efficiently route messages to other nodes and to efficiently 294 store and retrieve data in the overlay. RELOAD provides several 295 features that are critical for a successful P2P protocol for the 296 Internet: 298 Security Framework: A P2P network will often be established among a 299 set of peers that do not trust each other. RELOAD leverages a 300 central enrollment server to provide credentials for each peer 301 which can then be used to authenticate each operation. This 302 greatly reduces the possible attack surface. 304 Usage Model: RELOAD is designed to support a variety of 305 applications, including P2P multimedia communications with the 306 Session Initiation Protocol [I-D.ietf-p2psip-sip]. RELOAD allows 307 the definition of new application usages, each of which can define 308 its own data types, along with the rules for their use. This 309 allows RELOAD to be used with new applications through a simple 310 documentation process that supplies the details for each 311 application. 313 NAT Traversal: RELOAD is designed to function in environments where 314 many if not most of the nodes are behind NATs or firewalls. 315 Operations for NAT traversal are part of the base design, 316 including using ICE to establish new RELOAD or application 317 protocol connections. 319 High Performance Routing: The very nature of overlay algorithms 320 introduces a requirement that peers participating in the P2P 321 network route requests on behalf of other peers in the network. 322 This introduces a load on those other peers, in the form of 323 bandwidth and processing power. RELOAD has been defined with a 324 simple, lightweight forwarding header, thus minimizing the amount 325 of effort required by intermediate peers. 327 Pluggable Overlay Algorithms: RELOAD has been designed with an 328 abstract interface to the overlay layer to simplify implementing a 329 variety of structured (e.g., distributed hash tables) and 330 unstructured overlay algorithms. This specification also defines 331 how RELOAD is used with the Chord DHT algorithm, which is 332 mandatory to implement. Specifying a default "must implement" 333 overlay algorithm promotes interoperability, while extensibility 334 allows selection of overlay algorithms optimized for a particular 335 application. 337 These properties were designed specifically to meet the requirements 338 for a P2P protocol to support SIP. This document defines the base 339 protocol for the distributed storage and location service, as well as 340 critical usages for NAT traversal and security. The SIP Usage itself 341 is described separately in [I-D.ietf-p2psip-sip]. RELOAD is not 342 limited to usage by SIP and could serve as a tool for supporting 343 other P2P applications with similar needs. RELOAD is also based on 344 the concepts introduced in [I-D.ietf-p2psip-concepts]. 346 1.1. Basic Setting 348 In this section, we provide a brief overview of the operational 349 setting for RELOAD. See the concepts 350 document[I-D.ietf-p2psip-concepts] for more details. A RELOAD 351 Overlay Instance consists of a set of nodes arranged in a connected 352 graph. Each node in the overlay is assigned a numeric Node-ID which, 353 together with the specific overlay algorithm in use, determines its 354 position in the graph and the set of nodes it connects to. The 355 figure below shows a trivial example which isn't drawn from any 356 particular overlay algorithm, but was chosen for convenience of 357 representation. 359 +--------+ +--------+ +--------+ 360 | Node 10|--------------| Node 20|--------------| Node 30| 361 +--------+ +--------+ +--------+ 362 | | | 363 | | | 364 +--------+ +--------+ +--------+ 365 | Node 40|--------------| Node 50|--------------| Node 60| 366 +--------+ +--------+ +--------+ 367 | | | 368 | | | 369 +--------+ +--------+ +--------+ 370 | Node 70|--------------| Node 80|--------------| Node 90| 371 +--------+ +--------+ +--------+ 372 | 373 | 374 +--------+ 375 | Node 85| 376 |(Client)| 377 +--------+ 379 Because the graph is not fully connected, when a node wants to send a 380 message to another node, it may need to route it through the network. 381 For instance, Node 10 can talk directly to nodes 20 and 40, but not 382 to Node 70. In order to send a message to Node 70, it would first 383 send it to Node 40 with instructions to pass it along to Node 70. 384 Different overlay algorithms will have different connectivity graphs, 385 but the general idea behind all of them is to allow any node in the 386 graph to efficiently reach every other node within a small number of 387 hops. 389 The RELOAD network is not only a messaging network. It is also a 390 storage network. Records are stored under numeric addresses which 391 occupy the same space as node identifiers. Peers are responsible for 392 storing the data associated with some set of addresses as determined 393 by their Node-ID. For instance, we might say that every peer is 394 responsible for storing any data value which has an address less than 395 or equal to its own Node-ID, but greater than the next lowest 396 Node-ID. Thus, Node-20 would be responsible for storing values 397 11-20. 399 RELOAD also supports clients. These are nodes which have Node-IDs 400 but do not participate in routing or storage. For instance, in the 401 figure above Node 85 is a client. It can route to the rest of the 402 RELOAD network via Node 80, but no other node will route through it 403 and Node 90 is still responsible for all addresses between 81-90. We 404 refer to non-client nodes as peers. 406 Other applications (for instance, SIP) can be defined on top of 407 RELOAD and use these two basic RELOAD services to provide their own 408 services. 410 1.2. Architecture 412 RELOAD is fundamentally an overlay network. The following figure 413 shows the layered RELOAD architecture. 415 Application 417 +-------+ +-------+ 418 | SIP | | XMPP | ... 419 | Usage | | Usage | 420 +-------+ +-------+ 421 ------------------------------------ Messaging Service Boundary 422 +------------------+ +---------+ 423 | Message |<--->| Storage | 424 | Transport | +---------+ 425 +------------------+ ^ 426 ^ ^ | 427 | v v 428 | +-------------------+ 429 | | Topology | 430 | | Plugin | 431 | +-------------------+ 432 | ^ 433 v v 434 +------------------+ 435 | Forwarding & | 436 | Link Management | 437 +------------------+ 438 ------------------------------------ Overlay Link Service Boundary 439 +-------+ +------+ 440 |TLS | |DTLS | ... 441 +-------+ +------+ 443 The major components of RELOAD are: 445 Usage Layer: Each application defines a RELOAD usage; a set of data 446 kinds and behaviors which describe how to use the services 447 provided by RELOAD. These usages all talk to RELOAD through a 448 common Message Transport Service. 450 Message Transport: Handles end-to-end reliability, manages request 451 state for the usages, and forwards Store and Fetch operations to 452 the Storage component. Delivers message responses to the 453 component initiating the request. 455 Storage: The Storage component is responsible for processing 456 messages relating to the storage and retrieval of data. It talks 457 directly to the Topology Plugin to manage data replication and 458 migration, and it talks to the Message Transport component to send 459 and receive messages. 461 Topology Plugin: The Topology Plugin is responsible for implementing 462 the specific overlay algorithm being used. It uses the Message 463 Transport component to send and receive overlay management 464 messages, to the Storage component to manage data replication, and 465 directly to the Forwarding Layer to control hop-by-hop message 466 forwarding. This component closely parallels conventional routing 467 algorithms, but is more tightly coupled to the Forwarding Layer 468 because there is no single "routing table" equivalent used by all 469 overlay algorithms. 471 Forwarding and Link Management Layer: Stores and implements the 472 routing table by providing packet forwarding services between 473 nodes. It also handles establishing new links between nodes, 474 including setting up connections across NATs using ICE. 476 Overlay Link Layer: Responsible for actually transporting traffic 477 directly between nodes. Each such protocol includes the 478 appropriate provisions for per-hop framing or hop-by-hop ACKs 479 required by unreliable transports. TLS [RFC5246] and DTLS 480 [RFC4347] are the currently defined "link layer" protocols used by 481 RELOAD for hop-by-hop communication. New protocols MAY be 482 defined, as described in Section 5.6.1 and Section 10.1. As this 483 document defines only TLS and DTLS, we use those terms throughout 484 the remainder of the document with the understanding that some 485 future specification may add new overlay link layers. 487 To further clarify the roles of the various layers, this figure 488 parallels the architecture with each layer's role from an overlay 489 perspective and implementation layer in the internet: 491 | Internet Model | 492 Real | Equivalent | Reload 493 Internet | in Overlay | Architecture 494 -------------+-----------------+------------------------------------ 495 | | +-------+ +-------+ 496 | Application | | SIP | | XMPP | ... 497 | | | Usage | | Usage | 498 | | +-------+ +-------+ 499 | | ---------------------------------- 500 | |+------------------+ +---------+ 501 | Transport || Message |<--->| Storage | 502 | || Transport | +---------+ 503 | |+------------------+ ^ 504 | | ^ ^ | 505 | | | v v 506 Application | | | +-------------------+ 507 | (Routing) | | | Topology | 508 | | | | Plugin | 509 | | | +-------------------+ 510 | | | ^ 511 | | v v 512 | Network | +------------------+ 513 | | | Forwarding & | 514 | | | Link Management | 515 | | +------------------+ 516 | | ---------------------------------- 517 Transport | Link | +-------+ +------+ 518 | | |TLS | |DTLS | ... 519 | | +-------+ +------+ 520 -------------+-----------------+------------------------------------ 521 Network | 522 | 523 Link | 525 1.2.1. Usage Layer 527 The top layer, called the Usage Layer, has application usages, such 528 as the SIP Location Usage [I-D.ietf-p2psip-sip], that use the 529 abstract Message Transport Service provided by RELOAD. The goal of 530 this layer is to implement application-specific usages of the generic 531 overlay services provided by RELOAD. The usage defines how a 532 specific application maps its data into something that can be stored 533 in the overlay, where to store the data, how to secure the data, and 534 finally how applications can retrieve and use the data. 536 The architecture diagram shows both a SIP usage and an XMPP usage. A 537 single application may require multiple usages; for example a 538 softphone application may also require a voicemail usage. A usage 539 may define multiple kinds of data that are stored in the overlay and 540 may also rely on kinds originally defined by other usages. 542 Because the security and storage policies for each kind are dictated 543 by the usage defining the kind, the usages may be coupled with the 544 Storage component to provide security policy enforcement and to 545 implement appropriate storage strategies according to the needs of 546 the usage. The exact implementation of such an interface is outside 547 the scope of this specification. 549 1.2.2. Message Transport 551 The Message Transport component provides a generic message routing 552 service for the overlay. The Message Transport layer is responsible 553 for end-to-end message transactions, including retransmissions. Each 554 peer is identified by its location in the overlay as determined by 555 its Node-ID. A component that is a client of the Message Transport 556 can perform two basic functions: 558 o Send a message to a given peer specified by Node-ID or to the peer 559 responsible for a particular Resource-ID. 560 o Receive messages that other peers sent to a Node-ID or Resource-ID 561 for which the receiving peer is responsible. 563 All usages rely on the Message Transport component to send and 564 receive messages from peers. For instance, when a usage wants to 565 store data, it does so by sending Store requests. Note that the 566 Storage component and the Topology Plugin are themselves clients of 567 the Message Transport, because they need to send and receive messages 568 from other peers. 570 The Message Transport Service is similar to those described as 571 providing "Key based routing" (KBR), although as RELOAD supports 572 different overlay algorithms (including non-DHT overlay algorithms) 573 that calculate keys in different ways, the actual interface must 574 accept Resource Names rather than actual keys. 576 1.2.3. Storage 578 One of the major functions of RELOAD is to allow nodes to store data 579 in the overlay and to retrieve data stored by other nodes or by 580 themselves. The Storage component is responsible for processing data 581 storage and retrieval messages. For instance, the Storage component 582 might receive a Store request for a given resource from the Message 583 Transport. It would then query the appropriate usage before storing 584 the data value(s) in its local data store and sending a response to 585 the Message Transport for delivery to the requesting node. 586 Typically, these messages will come from other nodes, but depending 587 on the overlay topology, a node might be responsible for storing data 588 for itself as well, especially if the overlay is small. 590 A peer's Node-ID determines the set of resources that it will be 591 responsible for storing. However, the exact mapping between these is 592 determined by the overlay algorithm in use. The Storage component 593 will only receive a Store request from the Message Transport if this 594 peer is responsible for that Resource-ID. The Storage component is 595 notified by the Topology Plugin when the Resource-IDs for which it is 596 responsible change, and the Storage component is then responsible for 597 migrating resources to other peers, as required. 599 1.2.4. Topology Plugin 601 RELOAD is explicitly designed to work with a variety of overlay 602 algorithms. In order to facilitate this, the overlay algorithm 603 implementation is provided by a Topology Plugin so that each overlay 604 can select an appropriate overlay algorithm that relies on the common 605 RELOAD core protocols and code. 607 The Topology Plugin is responsible for maintaining the overlay 608 algorithm Routing Table, which is consulted by the Forwarding and 609 Link Management Layer before routing a message. When connections are 610 made or broken, the Forwarding and Link Management Layer notifies the 611 Topology Plugin, which adjusts the routing table as appropriate. The 612 Topology Plugin will also instruct the Forwarding and Link Management 613 Layer to form new connections as dictated by the requirements of the 614 overlay algorithm Topology. The Topology Plugin issues periodic 615 update requests through Message Transport to maintain and update its 616 Routing Table. 618 As peers enter and leave, resources may be stored on different peers, 619 so the Topology Plugin also keeps track of which peers are 620 responsible for which resources. As peers join and leave, the 621 Topology Plugin instructs the Storage component to issue resource 622 migration requests as appropriate, in order to ensure that other 623 peers have whatever resources they are now responsible for. The 624 Topology Plugin is also responsible for providing for redundant data 625 storage to protect against loss of information in the event of a peer 626 failure and to protect against compromised or subversive peers. 628 1.2.5. Forwarding and Link Management Layer 630 The Forwarding and Link Management Layer is responsible for getting a 631 message to the next peer, as determined by the Topology Plugin. This 632 Layer establishes and maintains the network connections as required 633 by the Topology Plugin. This layer is also responsible for setting 634 up connections to other peers through NATs and firewalls using ICE, 635 and it can elect to forward traffic using relays for NAT and firewall 636 traversal. 638 This layer provides a generic interface that allows the topology 639 plugin to control the overlay and resource operations and messages. 640 Since each overlay algorithm is defined and functions differently, we 641 generically refer to the table of other peers that the overlay 642 algorithm maintains and uses to route requests (neighbors) as a 643 Routing Table. The Topology Plugin actually owns the Routing Table, 644 and forwarding decisions are made by querying the Topology Plugin for 645 the next hop for a particular Node-ID or Resource-ID. If this node 646 is the destination of the message, the message is delivered to the 647 Message Transport. 649 This layer also utilizes a framing header to encapsulate messages as 650 they are forwarding along each hop. This header aids reliability 651 congestion control, flow control, etc. It has meaning only in the 652 context of that individual link. 654 The Forwarding and Link Management Layer sits on top of the Overlay 655 Link Layer protocols that carry the actual traffic. This 656 specification defines how to use DTLS and TLS protocols to carry 657 RELOAD messages. 659 1.3. Security 661 RELOAD's security model is based on each node having one or more 662 public key certificates. In general, these certificates will be 663 assigned by a central server which also assigns Node-IDs, although 664 self-signed certificates can be used in closed networks. These 665 credentials can be leveraged to provide communications security for 666 RELOAD messages. RELOAD provides communications security at three 667 levels: 669 Connection Level: Connections between peers are secured with TLS, 670 DTLS, or potentially some to be defined future protocol. 671 Message Level: Each RELOAD message must be signed. 672 Object Level: Stored objects must be signed by the storing peer. 674 These three levels of security work together to allow peers to verify 675 the origin and correctness of data they receive from other peers, 676 even in the face of malicious activity by other peers in the overlay. 677 RELOAD also provides access control built on top of these 678 communications security features. Because the peer responsible for 679 storing a piece of data can validate the signature on the data being 680 stored, the responsible peer can determine whether a given operation 681 is permitted or not. 683 RELOAD also provides an optional shared secret based admission 684 control feature using shared secrets and TLS-PSK. In order to form a 685 TLS connection to any node in the overlay, a new node needs to know 686 the shared overlay key, thus restricting access to authorized users 687 only. This feature is used together with certificate-based access 688 control, not as a replacement for it. It is typically used when 689 self-signed certificates are being used but would generally not be 690 used when the certificates were all signed by an enrollment server. 692 1.4. Structure of This Document 694 The remainder of this document is structured as follows. 696 o Section 2 provides definitions of terms used in this document. 697 o Section 3 provides an overview of the mechanisms used to establish 698 and maintain the overlay. 699 o Section 4 provides an overview of the mechanism RELOAD provides to 700 support other applications. 701 o Section 5 defines the protocol messages that RELOAD uses to 702 establish and maintain the overlay. 703 o Section 6 defines the protocol messages that are used to store and 704 retrieve data using RELOAD. 705 o Section 7 defines the Certificate Store Usage that is fundamental 706 to RELOAD security. 707 o Section 8 defines the TURN Server Usage needed to locate TURN 708 servers for NAT traversal. 709 o Section 9 defines a specific Topology Plugin using Chord. 710 o Section 10 defines the mechanisms that new RELOAD nodes use to 711 join the overlay for the first time. 712 o Section 11 provides an extended example. 714 2. Terminology 716 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 717 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 718 document are to be interpreted as described in RFC 2119 [RFC2119]. 720 We use the terminology and definitions from the Concepts and 721 Terminology for Peer to Peer SIP [I-D.ietf-p2psip-concepts] draft 722 extensively in this document. Other terms used in this document are 723 defined inline when used and are also defined below for reference. 725 DHT: A distributed hash table. A DHT is an abstract hash table 726 service realized by storing the contents of the hash table across 727 a set of peers. 729 Overlay Algorithm: An overlay algorithm defines the rules for 730 determining which peers in an overlay store a particular piece of 731 data and for determining a topology of interconnections amongst 732 peers in order to find a piece of data. 734 Overlay Instance: A specific overlay algorithm and the collection of 735 peers that are collaborating to provide read and write access to 736 it. There can be any number of overlay instances running in an IP 737 network at a time, and each operates in isolation of the others. 739 Peer: A host that is participating in the overlay. Peers are 740 responsible for holding some portion of the data that has been 741 stored in the overlay and also route messages on behalf of other 742 hosts as required by the Overlay Algorithm. 744 Client: A host that is able to store data in and retrieve data from 745 the overlay but which is not participating in routing or data 746 storage for the overlay. 748 Kind: A kind defines a particular type of data that can be stored in 749 the overlay. Applications define new Kinds to story the data they 750 use. Each Kind is identified with a unique IANA assigned integer 751 called a Kind-ID. 753 Node: We use the term "Node" to refer to a host that may be either a 754 Peer or a Client. Because RELOAD uses the same protocol for both 755 clients and peers, much of the text applies equally to both. 756 Therefore we use "Node" when the text applies to both Clients and 757 Peers and the more specific term (i.e. client or peer) when the 758 text applies only to Clients or only to Peers. 760 Node-ID: A fixed-length value that uniquely identifies a node. 761 Node-IDs of all 0s and all 1s are reserved and are invalid Node- 762 IDs. A value of zero is not used in the wire protocol but can be 763 used to indicate an invalid node in implementations and APIs. The 764 Node-ID of all 1s is used on the wire protocol as a wildcard. 766 Resource: An object or group of objects associated with a string 767 identifier. See "Resource Name" below. 769 Resource Name: The potentially human readable name by which a 770 resource is identified. In unstructured P2P networks, the 771 resource name is sometimes used directly as a Resource-ID. In 772 structured P2P networks the resource name is typically mapped into 773 a Resource-ID by using the string as the input to hash function. 774 A SIP resource, for example, is often identified by its AOR which 775 is an example of a Resource Name. 777 Resource-ID: A value that identifies some resources and which is 778 used as a key for storing and retrieving the resource. Often this 779 is not human friendly/readable. One way to generate a Resource-ID 780 is by applying a mapping function to some other unique name (e.g., 781 user name or service name) for the resource. The Resource-ID is 782 used by the distributed database algorithm to determine the peer 783 or peers that are responsible for storing the data for the 784 overlay. In structured P2P networks, Resource-IDs are generally 785 fixed length and are formed by hashing the resource name. In 786 unstructured networks, resource names may be used directly as 787 Resource-IDs and may be variable lengths. 789 Connection Table: The set of nodes to which a node is directly 790 connected. This includes nodes with which Attach handshakes have 791 been done but which have not sent any Updates. 793 Routing Table: The set of peers which a node can use to route 794 overlay messages. In general, these peers will all be on the 795 connection table but not vice versa, because some peers will have 796 Attached but not sent updates. Peers may send messages directly 797 to peers that are in the connection table but may only route 798 messages to other peers through peers that are in the routing 799 table. 801 Destination List: A list of IDs through which a message is to be 802 routed. A single Node-ID is a trivial form of destination list. 804 Usage: A usage is an application that wishes to use the overlay for 805 some purpose. Each application wishing to use the overlay defines 806 a set of data kinds that it wishes to use. The SIP usage defines 807 the location data kind. 809 The term "maximum request lifetime" is the maximum time a request 810 will wait for a response; it defaults to 15 seconds. The term 811 "successor replacement hold-down time" is the amount of time to wait 812 before starting replication when a new successor is found; it 813 defaults to 30 seconds. 815 3. Overlay Management Overview 817 The most basic function of RELOAD is as a generic overlay network. 818 Nodes need to be able to join the overlay, form connections to other 819 nodes, and route messages through the overlay to nodes to which they 820 are not directly connected. This section provides an overview of the 821 mechanisms that perform these functions. 823 3.1. Security and Identification 825 Every node in the RELOAD overlay is identified by a Node-ID. The 826 Node-ID is used for three major purposes: 828 o To address the node itself. 829 o To determine its position in the overlay topology when the overlay 830 is structured. 831 o To determine the set of resources for which the node is 832 responsible. 834 Each node has a certificate [RFC5280] containing a Node-ID, which is 835 unique within an overlay instance. 837 The certificate serves multiple purposes: 839 o It entitles the user to store data at specific locations in the 840 Overlay Instance. Each data kind defines the specific rules for 841 determining which certificates can access each Resource-ID/Kind-ID 842 pair. For instance, some kinds might allow anyone to write at a 843 given location, whereas others might restrict writes to the owner 844 of a single certificate. 845 o It entitles the user to operate a node that has a Node-ID found in 846 the certificate. When the node forms a connection to another 847 peer, it uses this certificate so that a node connecting to it 848 knows it is connected to the correct node (technically: a (D)TLS 849 association with client authentication is formed.) In addition, 850 the node can sign messages, thus providing integrity and 851 authentication for messages which are sent from the node. 852 o It entitles the user to use the user name found in the 853 certificate. 855 If a user has more than one device, typically they would get one 856 certificate for each device. This allows each device to act as a 857 separate peer. 859 RELOAD supports multiple certificate issuance models. The first is 860 based on a central enrollment process which allocates a unique name 861 and Node-ID and puts them in a certificate for the user. All peers 862 in a particular Overlay Instance have the enrollment server as a 863 trust anchor and so can verify any other peer's certificate. 865 In some settings, a group of users want to set up an overlay network 866 but are not concerned about attack by other users in the network. 867 For instance, users on a LAN might want to set up a short term ad hoc 868 network without going to the trouble of setting up an enrollment 869 server. RELOAD supports the use of self-generated, self-signed 870 certificates. When self-signed certificates are used, the node also 871 generates its own Node-ID and username. The Node-ID is computed as a 872 digest of the public key, to prevent Node-ID theft; however this 873 model is still subject to a number of known attacks (most notably 874 Sybil attacks [Sybil]) and can only be safely used in closed networks 875 where users are mutually trusting. 877 The general principle here is that the security mechanisms (TLS and 878 message signatures) are always used, even if the certificates are 879 self-signed. This allows for a single set of code paths in the 880 systems with the only difference being whether certificate 881 verification is required to chain to a single root of trust. 883 3.1.1. Shared-Key Security 885 RELOAD also provides an admission control system based on shared 886 keys. In this model, the peers all share a single key which is used 887 to authenticate the peer-to-peer connections via TLS-PSK/TLS-SRP. 889 3.2. Clients 891 RELOAD defines a single protocol that is used both as the peer 892 protocol and as the client protocol for the overlay. This simplifies 893 implementation, particularly for devices that may act in either role, 894 and allows clients to inject messages directly into the overlay. 896 We use the term "peer" to identify a node in the overlay that routes 897 messages for nodes other than those to which it is directly 898 connected. Peers typically also have storage responsibilities. We 899 use the term "client" to refer to nodes that do not have routing or 900 storage responsibilities. When text applies to both peers and 901 clients, we will simply refer such devices as "nodes." 903 RELOAD's client support allows nodes that are not participating in 904 the overlay as peers to utilize the same implementation and to 905 benefit from the same security mechanisms as the peers. Clients 906 possess and use certificates that authorize the user to store data at 907 certain locations in the overlay. The Node-ID in the certificate is 908 used to identify the particular client as a member of the overlay and 909 to authenticate its messages. 911 In RELOAD, unlike some other designs, clients are not a first-class 912 concept. From the perspective of a peer, a client is simply a node 913 which has not yet sent any Updates or Joins. It might never do so 914 (if it's a client) or it might eventually do so (if it's just a node 915 that's taking a long time to join). The routing and storage rules 916 for RELOAD provide for correct behavior by peers regardless of 917 whether other nodes attached to them are clients or peers. Of 918 course, a client implementation must know that it intends to be a 919 client, but this localizes complexity only to that node. 921 For more discussion of the motivation for RELOAD's client support, 922 see Appendix B. 924 3.2.1. Client Routing 926 Clients may insert themselves in the overlay in two ways: 928 o Establish a connection to the peer responsible for the client's 929 Node-ID in the overlay. Then requests may be sent from/to the 930 client using its Node-ID in the same manner as if it were a peer, 931 because the responsible peer in the overlay will handle the final 932 step of routing to the client. This may require a TURN relay in 933 cases where NATs or firewalls prevent a client from forming a 934 direct connections with its responsible peer. Note that clients 935 that choose this option MUST process Update messages from the 936 peer. Those updates can indicate that the peer no longer is 937 responsible for the Client's Node-ID. The client then MUST form a 938 connection to the appropriate peer. Failure to do so will result 939 in the client no longer receiving messages. 940 o Establish a connection with an arbitrary peer in the overlay 941 (perhaps based on network proximity or an inability to establish a 942 direct connection with the responsible peer). In this case, the 943 client will rely on RELOAD's Destination List feature to ensure 944 reachability. The client can initiate requests, and any node in 945 the overlay that knows the Destination List to its current 946 location can reach it, but the client is not directly reachable 947 using only its Node-ID. If the client is to receive incoming 948 requests from other members of the overlay, the Destination List 949 required to reach it must be learnable via other mechanisms, such 950 as being stored in the overlay by a usage. 952 3.2.2. Minimum Functionality Requirements for Clients 954 A node may act as a client simply because it does not have the 955 resources or even an implementation of the topology plugin required 956 to act as a peer in the overlay. In order to exchange RELOAD 957 messages with a peer, a client must meet a minimum level of 958 functionality. Such a client must: 960 o Implement RELOAD's connection-management operations that are used 961 to establish the connection with the peer. 962 o Implement RELOAD's data retrieval methods (with client 963 functionality). 964 o Be able to calculate Resource-IDs used by the overlay. 965 o Possess security credentials required by the overlay it is 966 implementing. 968 A client speaks the same protocol as the peers, knows how to 969 calculate Resource-IDs, and signs its requests in the same manner as 970 peers. While a client does not necessarily require a full 971 implementation of the overlay algorithm, calculating the Resource-ID 972 requires an implementation of the appropriate algorithm for the 973 overlay. 975 3.3. Routing 977 This section will discuss the requirements RELOAD's routing 978 capabilities must meet, then describe the routing features in the 979 protocol, and then provide a brief overview of how they are used. 980 Appendix A discusses some alternative designs and the tradeoffs that 981 would be necessary to support them. 983 RELOAD's routing capabilities must meet the following requirements: 985 NAT Traversal: RELOAD must support establishing and using 986 connections between nodes separated by one or more NATs, including 987 locating peers behind NATs for those overlays allowing/requiring 988 it. 989 Clients: RELOAD must support requests from and to clients that do 990 not participate in overlay routing. 991 Client promotion: RELOAD must support clients that become peers at a 992 later point as determined by the overlay algorithm and deployment. 993 Low state: RELOAD's routing algorithms must not require 994 significant state to be stored on intermediate peers. 995 Return routability in unstable topologies: At some points in 996 times, different nodes may have inconsistent information about the 997 connectivity of the routing graph. In all cases, the response to 998 a request needs to delivered to the node that sent the request and 999 not to some other node. 1001 RELOAD's routing provides three mechanisms designed to assist in 1002 meeting these needs: 1004 Destination Lists: While in principle it is possible to just 1005 inject a message into the overlay with a bare Node-ID as the 1006 destination, RELOAD provides a source routing capability in the 1007 form of "Destination Lists". A "Destination List provides a list 1008 of the nodes through which a message must flow. 1009 Via Lists: In order to allow responses to follow the same path as 1010 requests, each message also contains a "Via List", which is added 1011 to by each node a message traverses. This via list can then be 1012 inverted and used as a destination list for the response. 1013 RouteQuery: The RouteQuery method allows a node to query a peer 1014 for the next hop it will use to route a message. This method is 1015 useful for diagnostics and for iterative routing. 1017 The basic routing mechanism used by RELOAD is Symmetric Recursive. 1018 We will first describe symmetric recursive routing and then discuss 1019 its advantages in terms of the requirements discussed above. 1021 Symmetric recursive routing requires that a message follow a path 1022 through the overlay to the destination without returning to the 1023 originating node: each peer forwards the message closer to its 1024 destination. The return path of the response is then the same path 1025 followed in reverse. For example, a message following a route from A 1026 to Z through B and X: 1028 A B X Z 1029 ------------------------------- 1031 ----------> 1032 Dest=Z 1033 ----------> 1034 Via=A 1035 Dest=Z 1036 ----------> 1037 Via=A, B 1038 Dest=Z 1040 <---------- 1041 Dest=X, B, A 1042 <---------- 1043 Dest=B, A 1044 <---------- 1045 Dest=A 1047 Note that the preceding Figure does not indicate whether A is a 1048 client or peer: A forwards its request to B and the response is 1049 returned to A in the same manner regardless of A's role in the 1050 overlay. 1052 This figure shows use of full via-lists by intermediate peers B and 1053 X. However, if B and/or X are willing to store state, then they may 1054 elect to truncate the lists, save that information internally (keyed 1055 by the transaction id), and return the response message along the 1056 path from which it was received when the response is received. This 1057 option requires greater state to be stored on intermediate peers but 1058 saves a small amount of bandwidth and reduces the need for modifying 1059 the message en route. Selection of this mode of operation is a 1060 choice for the individual peer; the techniques are interoperable even 1061 on a single message. The figure below shows B using full via lists 1062 but X truncating them to X1 and saving the state internally. 1064 A B X Z 1065 ------------------------------- 1067 ----------> 1068 Dest=Z 1069 ----------> 1070 Via=A 1071 Dest=Z 1072 ----------> 1073 Dest=Z, X1 1075 <---------- 1076 Dest=X,X1 1077 <---------- 1078 Dest=B, A 1079 <---------- 1080 Dest=A 1082 RELOAD also supports a basic Iterative routing mode (where the 1083 intermediate peers merely return a response indicating the next hop, 1084 but do not actually forward the message to that next hop themselves). 1085 Iterative routing is implemented using the RouteQuery method, which 1086 requests this behavior. Note that iterative routing is selected only 1087 by the initiating node. 1089 3.4. Connectivity Management 1091 In order to provide efficient routing, a peer needs to maintain a set 1092 of direct connections to other peers in the Overlay Instance. Due to 1093 the presence of NATs, these connections often cannot be formed 1094 directly. Instead, we use the Attach request to establish a 1095 connection. Attach uses ICE [RFC5245] to establish the connection. 1096 It is assumed that the reader is familiar with ICE. 1098 Say that peer A wishes to form a direct connection to peer B. It 1099 gathers ICE candidates and packages them up in an Attach request 1100 which it sends to B through usual overlay routing procedures. B does 1101 its own candidate gathering and sends back a response with its 1102 candidates. A and B then do ICE connectivity checks on the candidate 1103 pairs. The result is a connection between A and B. At this point, A 1104 and B can add each other to their routing tables and send messages 1105 directly between themselves without going through other overlay 1106 peers. 1108 There is one special case in which Attach cannot be used: when a 1109 peer is joining the overlay and is not connected to any peers. In 1110 order to support this case, some small number of "bootstrap nodes" 1111 typically need to be publicly accessible so that new peers can 1112 directly connect to them. Section 10 contains more detail on this. 1114 In general, a peer needs to maintain connections to all of the peers 1115 near it in the Overlay Instance and to enough other peers to have 1116 efficient routing (the details depend on the specific overlay). If a 1117 peer cannot form a connection to some other peer, this isn't 1118 necessarily a disaster; overlays can route correctly even without 1119 fully connected links. However, a peer should try to maintain the 1120 specified link set and if it detects that it has fewer direct 1121 connections, should form more as required. This also implies that 1122 peers need to periodically verify that the connected peers are still 1123 alive and if not try to reform the connection or form an alternate 1124 one. 1126 3.5. Overlay Algorithm Support 1128 The Topology Plugin allows RELOAD to support a variety of overlay 1129 algorithms. This specification defines a DHT based on Chord [Chord], 1130 which is mandatory to implement, but the base RELOAD protocol is 1131 designed to support a variety of overlay algorithms. 1133 3.5.1. Support for Pluggable Overlay Algorithms 1135 RELOAD defines three methods for overlay maintenance: Join, Update, 1136 and Leave. However, the contents of those messages, when they are 1137 sent, and their precise semantics are specified by the actual overlay 1138 algorithm; RELOAD merely provides a framework of commonly-needed 1139 methods that provides uniformity of notation (and ease of debugging) 1140 for a variety of overlay algorithms. 1142 3.5.2. Joining, Leaving, and Maintenance Overview 1144 When a new peer wishes to join the Overlay Instance, it must have a 1145 Node-ID that it is allowed to use and a set of credentials which 1146 match that Node-ID. When an enrollment server is used that Node-ID 1147 will be in the certificate the node received from the enrollment 1148 server. The details of the joining procedure are defined by the 1149 overlay algorithm, but the general steps for joining an Overlay 1150 Instance are: 1152 o Forming connections to some other peers. 1153 o Acquiring the data values this peer is responsible for storing. 1154 o Informing the other peers which were previously responsible for 1155 that data that this peer has taken over responsibility. 1157 The first thing the peer needs to do is to form a connection to some 1158 "bootstrap node". Because this is the first connection the peer 1159 makes, these nodes must have public IP addresses so that they can be 1160 connected to directly. Once a peer has connected to one or more 1161 bootstrap nodes, it can form connections in the usual way by routing 1162 Attach messages through the overlay to other nodes. Once a peer has 1163 connected to the overlay for the first time, it can cache the set of 1164 nodes it has connected to with public IP addresses for use as future 1165 bootstrap nodes. 1167 Once a peer has connected to a bootstrap node, it then needs to take 1168 up its appropriate place in the overlay. This requires two major 1169 operations: 1171 o Forming connections to other peers in the overlay to populate its 1172 Routing Table. 1173 o Getting a copy of the data it is now responsible for storing and 1174 assuming responsibility for that data. 1176 The second operation is performed by contacting the Admitting Peer 1177 (AP), the node which is currently responsible for that section of the 1178 overlay. 1180 The details of this operation depend mostly on the overlay algorithm 1181 involved, but a typical case would be: 1183 1. JP (Joining Peer) sends a Join request to AP (Admitting Peer) 1184 announcing its intention to join. 1185 2. AP sends a Join response. 1186 3. AP does a sequence of Stores to JP to give it the data it will 1187 need. 1188 4. AP does Updates to JP and to other peers to tell it about its own 1189 routing table. At this point, both JP and AP consider JP 1190 responsible for some section of the Overlay Instance. 1191 5. JP makes its own connections to the appropriate peers in the 1192 Overlay Instance. 1194 After this process is completed, JP is a full member of the Overlay 1195 Instance and can process Store/Fetch requests. 1197 Note that the first node is a special case. When ordinary nodes 1198 cannot form connections to the bootstrap nodes, then they are not 1199 part of the overlay. However, the first node in the overlay can 1200 obviously not connect to other nodes. In order to support this case, 1201 potential first nodes (which must also serve as bootstrap nodes 1202 initially) must somehow be instructed (perhaps by configuration 1203 settings) that they are the entire overlay, rather than not part of 1204 it. 1206 Note that clients do not perform either of these operations. 1208 3.6. First-Time Setup 1210 Previous sections addressed how RELOAD works once a node has 1211 connected. This section provides an overview of how users get 1212 connected to the overlay for the first time. RELOAD is designed so 1213 that users can start with the name of the overlay they wish to join 1214 and perhaps a username and password, and leverage that into having a 1215 working peer with minimal user intervention. This helps avoid the 1216 problems that have been experienced with conventional SIP clients 1217 where users are required to manually configure a large number of 1218 settings. 1220 3.6.1. Initial Configuration 1222 In the first phase of the process, the user starts out with the name 1223 of the overlay and uses this to download an initial set of overlay 1224 configuration parameters. The node does a DNS SRV lookup on the 1225 overlay name to get the address of a configuration server. It can 1226 then connect to this server with HTTPS to download a configuration 1227 document which contains the basic overlay configuration parameters as 1228 well as a set of bootstrap nodes which can be used to join the 1229 overlay. 1231 If a node already has the valid configuration document that it 1232 received by some out of band method, this step can be skipped. 1234 3.6.2. Enrollment 1236 If the overlay is using centralized enrollment, then a user needs to 1237 acquire a certificate before joining the overlay. The certificate 1238 attests both to the user's name within the overlay and to the Node- 1239 IDs which they are permitted to operate. In that case, the 1240 configuration document will contain the address of an enrollment 1241 server which can be used to obtain such a certificate. The 1242 enrollment server may (and probably will) require some sort of 1243 username and password before issuing the certificate. The enrollment 1244 server's ability to restrict attackers' access to certificates in the 1245 overlay is one of the cornerstones of RELOAD's security. 1247 4. Application Support Overview 1249 RELOAD is not intended to be used alone, but rather as a substrate 1250 for other applications. These applications can use RELOAD for a 1251 variety of purposes: 1253 o To store data in the overlay and retrieve data stored by other 1254 nodes. 1255 o As a discovery mechanism for services such as TURN. 1256 o To form direct connections which can be used to transmit 1257 application-level messages without using the overlay. 1259 This section provides an overview of these services. 1261 4.1. Data Storage 1263 RELOAD provides operations to Store and Fetch data. Each location in 1264 the Overlay Instance is referenced by a Resource-ID. However, each 1265 location may contain data elements corresponding to multiple kinds 1266 (e.g., certificate, SIP registration). Similarly, there may be 1267 multiple elements of a given kind, as shown below: 1269 +--------------------------------+ 1270 | Resource-ID | 1271 | | 1272 | +------------+ +------------+ | 1273 | | Kind 1 | | Kind 2 | | 1274 | | | | | | 1275 | | +--------+ | | +--------+ | | 1276 | | | Value | | | | Value | | | 1277 | | +--------+ | | +--------+ | | 1278 | | | | | | 1279 | | +--------+ | | +--------+ | | 1280 | | | Value | | | | Value | | | 1281 | | +--------+ | | +--------+ | | 1282 | | | +------------+ | 1283 | | +--------+ | | 1284 | | | Value | | | 1285 | | +--------+ | | 1286 | +------------+ | 1287 +--------------------------------+ 1289 Each kind is identified by a Kind-ID, which is a code point assigned 1290 by IANA. As part of the kind definition, protocol designers may 1291 define constraints, such as limits on size, on the values which may 1292 be stored. For many kinds, the set may be restricted to a single 1293 value; some sets may be allowed to contain multiple identical items 1294 while others may only have unique items. Note that a kind may be 1295 employed by multiple usages and new usages are encouraged to use 1296 previously defined kinds where possible. We define the following 1297 data models in this document, though other usages can define their 1298 own structures: 1300 single value: There can be at most one item in the set and any value 1301 overwrites the previous item. 1303 array: Many values can be stored and addressed by a numeric index. 1305 dictionary: The values stored are indexed by a key. Often this key 1306 is one of the values from the certificate of the peer sending the 1307 Store request. 1309 In order to protect stored data from tampering, by other nodes, each 1310 stored value is digitally signed by the node which created it. When 1311 a value is retrieved, the digital signature can be verified to detect 1312 tampering. 1314 4.1.1. Storage Permissions 1316 A major issue in peer-to-peer storage networks is minimizing the 1317 burden of becoming a peer, and in particular minimizing the amount of 1318 data which any peer is required to store for other nodes. RELOAD 1319 addresses this issue by only allowing any given node to store data at 1320 a small number of locations in the overlay, with those locations 1321 being determined by the node's certificate. When a peer uses a Store 1322 request to place data at a location authorized by its certificate, it 1323 signs that data with the private key that corresponds to its 1324 certificate. Then the peer responsible for storing the data is able 1325 to verify that the peer issuing the request is authorized to make 1326 that request. Each data kind defines the exact rules for determining 1327 what certificate is appropriate. 1329 The most natural rule is that a certificate authorizes a user to 1330 store data keyed with their user name X. This rule is used for all 1331 the kinds defined in this specification. Thus, only a user with a 1332 certificate for "alice@example.org" could write to that location in 1333 the overlay. However, other usages can define any rules they choose, 1334 including publicly writable values. 1336 The digital signature over the data serves two purposes. First, it 1337 allows the peer responsible for storing the data to verify that this 1338 Store is authorized. Second, it provides integrity for the data. 1340 The signature is saved along with the data value (or values) so that 1341 any reader can verify the integrity of the data. Of course, the 1342 responsible peer can "lose" the value but it cannot undetectably 1343 modify it. 1345 The size requirements of the data being stored in the overlay are 1346 variable. For instance, a SIP AOR and voicemail differ widely in the 1347 storage size. RELOAD leaves it to the Usage and overlay 1348 configuration to limit size imbalance of various kinds. 1350 4.1.2. Replication 1352 Replication in P2P overlays can be used to provide: 1354 persistence: if the responsible peer crashes and/or if the storing 1355 peer leaves the overlay 1356 security: to guard against DoS attacks by the responsible peer or 1357 routing attacks to that responsible peer 1358 load balancing: to balance the load of queries for popular 1359 resources. 1361 A variety of schemes are used in P2P overlays to achieve some of 1362 these goals. Common techniques include replicating on neighbors of 1363 the responsible peer, randomly locating replicas around the overlay, 1364 or replicating along the path to the responsible peer. 1366 The core RELOAD specification does not specify a particular 1367 replication strategy. Instead, the first level of replication 1368 strategies are determined by the overlay algorithm, which can base 1369 the replication strategy on its particular topology. For example, 1370 Chord places replicas on successor peers, which will take over 1371 responsibility should the responsible peer fail [Chord]. 1373 If additional replication is needed, for example if data persistence 1374 is particularly important for a particular usage, then that usage may 1375 specify additional replication, such as implementing random 1376 replications by inserting a different well known constant into the 1377 Resource Name used to store each replicated copy of the resource. 1378 Such replication strategies can be added independent of the 1379 underlying algorithm, and their usage can be determined based on the 1380 needs of the particular usage. 1382 4.2. Usages 1384 By itself, the distributed storage layer just provides infrastructure 1385 on which applications are built. In order to do anything useful, a 1386 usage must be defined. Each Usage needs to specify several things: 1388 o Registers Kind-ID code points for any kinds that the Usage 1389 defines. 1390 o Defines the data structure for each of the kinds. 1391 o Defines access control rules for each of the kinds. 1392 o Defines how the Resource Name is formed that is hashed to form the 1393 Resource-ID where each kind is stored. 1394 o Describes how values will be merged after a network partition. 1395 Unless otherwise specified, the default merging rule is to act as 1396 if all the values that need to be merged were stored and as if the 1397 order they were stored in corresponds to the stored time values 1398 associated with (and carried in) their values. Because the stored 1399 time values are those associated with the peer which did the 1400 writing, clock skew is generally not an issue. If two nodes are 1401 on different partitions, write to the same location, and have 1402 clock skew, this can create merge conflicts. However because 1403 RELOAD deliberately segregates storage so that data from different 1404 users and peers is stored in different locations, and a single 1405 peer will typically only be in a single network partition, this 1406 case will generally not arise. 1408 The kinds defined by a usage may also be applied to other usages. 1409 However, a need for different parameters, such as different size 1410 limits, would imply the need to create a new kind. 1412 4.3. Service Discovery 1414 RELOAD does not currently define a generic service discovery 1415 algorithm as part of the base protocol, although a simplistic TURN- 1416 specific discovery mechanism is provided. A variety of service 1417 discovery algorithms can be implemented as extensions to the base 1418 protocol, such as the service discovery algorithm ReDIR 1419 [opendht-sigcomm05]. 1421 4.4. Application Connectivity 1423 There is no requirement that a RELOAD usage must use RELOAD's 1424 primitives for establishing its own communication if it already 1425 possesses its own means of establishing connections. For example, 1426 one could design a RELOAD-based resource discovery protocol which 1427 used HTTP to retrieve the actual data. 1429 For more common situations, however, it is the overlay itself - 1430 rather than an external authority such as DNS - which is used to 1431 establish a connection. RELOAD provides connectivity to applications 1432 using the AppAttach method. For example, if a P2PSIP node wishes to 1433 establish a SIP dialog with another P2PSIP node, it will use 1434 AppAttach to establish a direct connection with the other node. This 1435 new connection is separate from the peer protocol connection. It is 1436 a dedicated UDP or TCP flow used only for the SIP dialog. 1438 5. Overlay Management Protocol 1440 This section defines the basic protocols used to create, maintain, 1441 and use the RELOAD overlay network. We start by defining the basic 1442 concept of how message destinations are interpreted when routing 1443 messages. We then describe the symmetric recursive routing model, 1444 which is RELOAD's default routing algorithm. We then define the 1445 message structure and then finally define the messages used to join 1446 and maintain the overlay. 1448 5.1. Message Receipt and Forwarding 1450 When a peer receives a message, it first examines the overlay, 1451 version, and other header fields to determine whether the message is 1452 one it can process. If any of these are incorrect (e.g., the message 1453 is for an overlay in which the peer does not participate) it is an 1454 error. The peer SHOULD generate an appropriate error but local 1455 policy can override this and cause the messages to be silently 1456 dropped. 1458 Once the peer has determined that the message is correctly formatted, 1459 it examines the first entry on the destination list. There are three 1460 possible cases here: 1462 o The first entry on the destination list is an ID for which the 1463 peer is responsible. 1464 o The first entry on the destination list is an ID for which another 1465 peer is responsible. 1466 o The first entry on the destination list is a private ID that is 1467 being used for destination list compression. This is described 1468 later (note that private IDs can be distinguished from Node-IDs 1469 and Resource-IDs on the wire; see Section 5.3.2.2). 1471 These cases are handled as discussed below. 1473 5.1.1. Responsible ID 1475 If the first entry on the destination list is an ID for which the 1476 node is responsible, there are several sub-cases to consider. 1478 o 1479 o If the entry is a Resource-ID, then it MUST be the only entry on 1480 the destination list. If there are other entries, the message 1481 MUST be silently dropped. Otherwise, the message is destined for 1482 this node and it passes it up to the upper layers. 1484 o If the entry is a Node-ID which equals this node's Node-ID, then 1485 the message is destined for this node. If this is the only entry 1486 on the destination list, the message is destined for this node and 1487 is passed up to the upper layers. Otherwise the entry is removed 1488 from the destination list and the message is passed to the Message 1489 Transport. If the message is a response and there is state for 1490 the transaction ID, the state is reinserted into the destination 1491 list before the message is further processed. 1492 o If the entry is a Node-ID which is not equal to this node, then 1493 the node MUST drop the message silently unless the Node-ID 1494 corresponds to a node which is directly connected to this node 1495 (i.e., a client). In that case, it MUST forward the message to 1496 the destination node as described in the next section. 1498 Note that this implies that in order to address a message to "the 1499 peer that controls region X", a sender sends to Resource-ID X, not 1500 Node-ID X. 1502 5.1.2. Other ID 1504 If neither of the other three cases applies, then the peer MUST 1505 forward the message towards the first entry on the destination list. 1506 This means that it MUST select one of the peers to which it is 1507 connected and which is likely to be responsible for the first entry 1508 on the destination list. If the first entry on the destination list 1509 is in the peer's connection table, then it SHOULD forward the message 1510 to that peer directly. Otherwise, the peer consults the routing 1511 table to forward the message. 1513 Any intermediate peer which forwards a RELOAD message MUST arrange 1514 that if it receives a response to that message the response can be 1515 routed back through the set of nodes through which the request 1516 passed. This may be arranged in one of two ways: 1518 o The peer MAY add an entry to the via list in the forwarding header 1519 that will enable it to determine the correct node. 1520 o The peer MAY keep per-transaction state which will allow it to 1521 determine the correct node. 1523 As an example of the first strategy, if node D receives a message 1524 from node C with via list (A, B), then D would forward to the next 1525 node (E) with via list (A, B, C). Now, if E wants to respond to the 1526 message, it reverses the via list to produce the destination list, 1527 resulting in (D, C, B, A). When D forwards the response to C, the 1528 destination list will contain (C, B, A). 1530 As an example of the second strategy, if node D receives a message 1531 from node C with transaction ID X and via list (A, B), it could store 1532 (X, C) in its state database and forward the message with the via 1533 list unchanged. When D receives the response, it consults its state 1534 database for transaction id X, determines that the request came from 1535 C, and forwards the response to C. 1537 Intermediate peers which modify the via list are not required to 1538 simply add entries. The only requirement is that the peer be able to 1539 reconstruct the correct destination list on the return route. RELOAD 1540 provides explicit support for this functionality in the form of 1541 private IDs, which can replace any number of via list entries. For 1542 instance, in the above example, Node D might send E a via list 1543 containing only the private ID (I). E would then use the destination 1544 list (D, I) to send its return message. When D processes this 1545 destination list, it would detect that I is a private ID, recover the 1546 via list (A, B, C), and reverse that to produce the correct 1547 destination list (C, B, A) before sending it to C. This feature is 1548 called List Compression. It MAY either be a compressed version of 1549 the original via list or an index into a state database containing 1550 the original via list. 1552 No matter what mechanism for storing via list state is used, if an 1553 intermediate peer exits the overlay, then on the return trip the 1554 message cannot be forwarded and will be dropped. The ordinary 1555 timeout and retransmission mechanisms provide stability over this 1556 type of failure. 1558 Note that if an intermediate peer retains per-transaction state 1559 instead of modifying the via list, it needs some mechanism for timing 1560 out that state, otherwise its state database will grow without bound. 1561 Whatever algorithm is used, state MUST be maintained for at least the 1562 value of the overlay reliability timer (3 seconds) and MAY be kept 1563 longer. 1565 5.1.3. Private ID 1567 If the first entry in the destination list is a private id (e.g., a 1568 compressed via list), the peer MUST replace that entry with the 1569 original via list that it replaced and then re-examine the 1570 destination list to determine which of the above cases now applies. 1572 5.2. Symmetric Recursive Routing 1574 This Section defines RELOAD's symmetric recursive routing algorithm, 1575 which is the default algorithm used by nodes to route messages 1576 through the overlay. All implementations MUST implement this routing 1577 algorithm. An overlay may be configured to use alternative routing 1578 algorithms, and alternative routing algorithms may be selected on a 1579 per-message basis. 1581 5.2.1. Request Origination 1583 In order to originate a message to a given Node-ID or Resource-ID, a 1584 node constructs an appropriate destination list. The simplest such 1585 destination list is a single entry containing the Node-ID or 1586 Resource-ID. The resulting message will use the normal overlay 1587 routing mechanisms to forward the message to that destination. The 1588 node can also construct a more complicated destination list for 1589 source routing. 1591 Once the message is constructed, the node sends the message to some 1592 adjacent peer. If the first entry on the destination list is 1593 directly connected, then the message MUST be routed down that 1594 connection. Otherwise, the topology plugin MUST be consulted to 1595 determine the appropriate next hop. 1597 Parallel searches for the resource are a common solution to improve 1598 reliability in the face of churn or of subversive peers. Parallel 1599 searches for usage-specified replicas are managed by the usage layer. 1600 However, a single request can also be routed through multiple 1601 adjacent peers, even when known to be sub-optimal, to improve 1602 reliability [vulnerabilities-acsac04]. Such parallel searches MAY BE 1603 specified by the topology plugin. 1605 Because messages may be lost in transit through the overlay, RELOAD 1606 incorporates an end-to-end reliability mechanism. When an 1607 originating node transmits a request it MUST set a 3 second timer. 1608 If a response has not been received when the timer fires, the request 1609 is retransmitted with the same transaction identifier. The request 1610 MAY be retransmitted up to 4 times (for a total of 5 messages). 1611 After the timer for the fifth transmission fires, the message SHALL 1612 be considered to have failed. Note that this retransmission 1613 procedure is not followed by intermediate nodes. They follow the 1614 hop-by-hop reliability procedure described in Section 5.6.3. 1616 The above algorithm can result in multiple requests being delivered 1617 to a node. Receiving nodes MUST generate semantically equivalent 1618 responses to retransmissions of the same request (this can be 1619 determined by transaction id) if the request is received within the 1620 maximum request lifetime (15 seconds). For some requests (e.g., 1621 Fetch) this can be accomplished merely by processing the request 1622 again. For other requests, (e.g., Store) it may be necessary to 1623 maintain state for the duration of the request lifetime. 1625 5.2.2. Response Origination 1627 When a peer sends a response to a request using this routing 1628 algorithm, it MUST construct the destination list by reversing the 1629 order of the entries on the via list. This has the result that the 1630 response traverses the same peers as the request traversed, except in 1631 reverse order (symmetric routing). 1633 5.3. Message Structure 1635 RELOAD is a message-oriented request/response protocol. The messages 1636 are encoded using binary fields. All integers are represented in 1637 network byte order. The general philosophy behind the design was to 1638 use Type, Length, Value fields to allow for extensibility. However, 1639 for the parts of a structure that were required in all messages, we 1640 just define these in a fixed position, as adding a type and length 1641 for them is unnecessary and would simply increase bandwidth and 1642 introduces new potential for interoperability issues. 1644 Each message has three parts, concatenated as shown below: 1646 +-------------------------+ 1647 | Forwarding Header | 1648 +-------------------------+ 1649 | Message Contents | 1650 +-------------------------+ 1651 | Security Block | 1652 +-------------------------+ 1654 The contents of these parts are as follows: 1656 Forwarding Header: Each message has a generic header which is used 1657 to forward the message between peers and to its final destination. 1658 This header is the only information that an intermediate peer 1659 (i.e., one that is not the target of a message) needs to examine. 1661 Message Contents: The message being delivered between the peers. 1662 From the perspective of the forwarding layer, the contents are 1663 opaque, however, they are interpreted by the higher layers. 1665 Security Block: A security block containing certificates and a 1666 digital signature over the ""Message Contents". Note that this 1667 signature can be computed without parsing the message contents. 1668 All messages MUST be signed by their originator. 1670 The following sections describe the format of each part of the 1671 message. 1673 5.3.1. Presentation Language 1675 The structures defined in this document are defined using a C-like 1676 syntax based on the presentation language used to define TLS. 1677 [RFC5246] Advantages of this style include: 1679 o It familiar enough looking that most readers can grasp it quickly. 1680 o The ability to define nested structures allows a separation 1681 between high-level and low-level message structures. 1682 o It has a straightforward wire encoding that allows quick 1683 implementation, but the structures can be comprehended without 1684 knowing the encoding. 1685 o The ability to mechanically compile encoders and decoders. 1687 Several idiosyncrasies of this language are worth noting. 1689 o All lengths are denoted in bytes, not objects. 1690 o Variable length values are denoted like arrays with angle 1691 brackets. 1692 o "select" is used to indicate variant structures. 1694 For instance, "uint16 array<0..2^8-2>;" represents up to 254 bytes 1695 but only up to 127 values of two bytes (16 bits) each. 1697 5.3.1.1. Common Definitions 1699 The following definitions are used throughout RELOAD and so are 1700 defined here. They also provide a convenient introduction to how to 1701 read the presentation language. 1703 An enum represents an enumerated type. The values associated with 1704 each possibility are represented in parentheses and the maximum value 1705 is represented as a nameless value, for purposes of describing the 1706 width of the containing integral type. For instance, Boolean 1707 represents a true or false: 1709 enum { false (0), true(1), (255)} Boolean; 1711 A boolean value is either a 1 or a 0. The max value of 255 indicates 1712 this is represented as a single byte on the wire. 1714 The NodeId, shown below, represents a single Node-ID. 1716 typedef opaque NodeId[NodeIdLength]; 1718 A NodeId is a fixed-length structure represented as a series of 1719 bytes, with the most significant byte first. The length is set on a 1720 per-overlay basis within the range of 16-20 bytes (128 to 160 bits). 1721 (See Section 10.1 for how NodeIdLength is set.) Note: the use of 1722 "typedef" here is an extension to the TLS language, but its meaning 1723 should be relatively obvious. Note the [ size ] syntax defines a 1724 fixed length element that does not include the length of the element 1725 in the on the wire encoding. 1727 A ResourceId, shown below, represents a single Resource-ID. 1729 typedef opaque ResourceId<0..2^8-1>; 1731 Like a NodeId, a ResourceId is an opaque string of bytes, but unlike 1732 NodeIds, ResourceIds are variable length, up to 255 bytes (2048 bits) 1733 in length. On the wire, each ResourceId is preceded by a single 1734 length byte (allowing lengths up to 255). Thus, the 3-byte value 1735 "FOO" would be encoded as: 03 46 4f 4f. Note the < range > syntax 1736 defines a variable length element that does include the length of the 1737 element in the on the wire encoding. The number of bytes to encode 1738 the length on the wire is derived by range; i.e., it is the minimum 1739 number of bytes which can encode the largest range value. 1741 A more complicated example is IpAddressPort, which represents a 1742 network address and can be used to carry either an IPv6 or IPv4 1743 address: 1745 enum {reservedAddr(0), ipv4_address (1), ipv6_address (2), 1746 (255)} AddressType; 1748 struct { 1749 uint32 addr; 1750 uint16 port; 1751 } IPv4AddrPort; 1753 struct { 1754 uint128 addr; 1755 uint16 port; 1756 } IPv6AddrPort; 1758 struct { 1759 AddressType type; 1760 uint8 length; 1762 select (type) { 1763 case ipv4_address: 1764 IPv4AddrPort v4addr_port; 1766 case ipv6_address: 1767 IPv6AddrPort v6addr_port; 1769 /* This structure can be extended */ 1771 } IpAddressPort; 1773 The first two fields in the structure are the same no matter what 1774 kind of address is being represented: 1776 type: the type of address (v4 or v6). 1777 length: the length of the rest of the structure. 1779 By having the type and the length appear at the beginning of the 1780 structure regardless of the kind of address being represented, an 1781 implementation which does not understand new address type X can still 1782 parse the IpAddressPort field and then discard it if it is not 1783 needed. 1785 The rest of the IpAddressPort structure is either an IPv4AddrPort or 1786 an IPv6AddrPort. Both of these simply consist of an address 1787 represented as an integer and a 16-bit port. As an example, here is 1788 the wire representation of the IPv4 address "192.0.2.1" with port 1789 "6100". 1791 01 ; type = IPv4 1792 06 ; length = 6 1793 c0 00 02 01 ; address = 192.0.2.1 1794 17 d4 ; port = 6100 1796 Unless a given structure that uses a select explicitly allows for 1797 unknown types in the select, any unknown type SHOULD be treated as an 1798 parsing error and the whole message discarded with no response. 1800 5.3.2. Forwarding Header 1802 The forwarding header is defined as a ForwardingHeader structure, as 1803 shown below. 1805 struct { 1806 uint32 relo_token; 1807 uint32 overlay; 1808 uint16 configuration_sequence; 1809 uint8 version; 1810 uint8 ttl; 1811 uint32 fragment; 1812 uint32 length; 1813 uint64 transaction_id; 1814 uint32 max_response_length; 1815 uint16 via_list_length; 1816 uint16 destination_list_length; 1817 uint16 options_length; 1818 Destination via_list[via_list_length]; 1819 Destination destination_list 1820 [destination_list_length]; 1821 ForwardingOptions options[options_length]; 1822 } ForwardingHeader; 1824 The contents of the structure are: 1826 relo_token: The first four bytes identify this message as a RELOAD 1827 message. This field MUST contain the value 0xd2454c4f (the string 1828 'RELO' with the high bit of the first byte set.). 1830 overlay: The 32 bit checksum/hash of the overlay being used. The 1831 variable length string representing the overlay name is hashed 1832 with SHA-1 and the low order 32 bits are used. The purpose of 1833 this field is to allow nodes to participate in multiple overlays 1834 and to detect accidental misconfiguration. This is not a security 1835 critical function. 1837 configuration_sequence: The sequence number of the configuration 1838 file. 1840 version: The version of the RELOAD protocol being used. This is a 1841 fixed point integer between 0.1 and 25.4. This document describes 1842 version 0.1, with a value of 0x01. [[ Note to RFC Editor: Please 1843 update this to version 1.0 with value of 0x0a and remove this 1844 note. ]] 1846 ttl: An 8 bit field indicating the number of iterations, or hops, a 1847 message can experience before it is discarded. The TTL value MUST 1848 be decremented by one at every hop along the route the message 1849 traverses. If the TTL is 0, the message MUST NOT be propagated 1850 further and MUST be discarded, and a "Error_TTL_Exceeded" error 1851 should be generated. The initial value of the TTL SHOULD be 100 1852 unless defined otherwise by the overlay configuration. 1854 fragment: This field is used to handle fragmentation. The high 1855 order two bits are used to indicate the fragmentation status: If 1856 the high bit (0x80000000) is set, it indicates that the message is 1857 a fragment. If the next bit (0x40000000) is set, it indicates 1858 that this is the last fragment. The next six bits (0x20000000 to 1859 0x01000000) are reserved and SHOULD be set to zero. The remainder 1860 of the field is used to indicate the fragment offset; see 1861 Section 5.7 1863 length: The count in bytes of the size of the message, including the 1864 header. 1866 transaction_id: A unique 64 bit number that identifies this 1867 transaction and also allows receivers to disambiguate transactions 1868 which are otherwise identical. In order to provide a high 1869 probability that transaction IDs are unique, they MUST be randomly 1870 generated. Responses use the same Transaction ID as the request 1871 they correspond to. Transaction IDs are also used for fragment 1872 reassembly. 1874 max_response_length: The maximum size in bytes of a response. Used 1875 by requesting nodes to avoid receiving (unexpected) very large 1876 responses. If this value is non-zero, responding peers MUST check 1877 that any response would not exceed it and if so generate an 1878 Error_Response_Too_Large value. This value SHOULD be set to zero 1879 for responses. 1881 via_list_length: The length of the via list in bytes. Note that in 1882 this field and the following two length fields we depart from the 1883 usual variable-length convention of having the length immediately 1884 precede the value in order to make it easier for hardware decoding 1885 engines to quickly determine the length of the header. 1887 destination_list_length: The length of the destination list in 1888 bytes. 1890 options_length: The length of the header options in bytes. 1892 via_list: The via_list contains the sequence of destinations through 1893 which the message has passed. The via_list starts out empty and 1894 grows as the message traverses each peer. 1896 destination_list: The destination_list contains a sequence of 1897 destinations which the message should pass through. The 1898 destination list is constructed by the message originator. The 1899 first element in the destination list is where the message goes 1900 next. The list shrinks as the message traverses each listed peer. 1902 options: Contains a series of ForwardingOptions entries. See 1903 Section 5.3.2.3. 1905 5.3.2.1. Processing Configuration Sequence Numbers 1907 In order to be part of the overlay, a node MUST have a copy of the 1908 overlay configuration document. In order to allow for configuration 1909 document changes, each version of the configuration document has a 1910 sequence number which is monotonically increasing mod 65536. Because 1911 the sequence number may in principle wrap, greater than or less than 1912 are interpreted by modulo arithmetic as in TCP. 1914 When a destination node receives a request, it MUST check that the 1915 configuration_sequence field is equal to its own configuration 1916 sequence number. If they do not match, it MUST generate an error, 1917 either Error_Config_Too_Old or Error_Config_Too_New. In addition, if 1918 the configuration file in the request is too old, it MUST generate a 1919 ConfigUpdate message to update the requesting node. This allows new 1920 configuration documents to propagate quickly throughout the system. 1921 The one exception to this rule is that if the configuration_sequence 1922 field is equal to 0xffff, and the message type is ConfigUpdate, then 1923 the message MUST be accepted regardless of the receiving node's 1924 configuration sequence number. Since 65535 is a special value, peers 1925 sending a new configuration when the configuration sequence is 1926 currently 65534 MUST set the configuration sequence number to 0 when 1927 they send out a new configuration. 1929 5.3.2.2. Destination and Via Lists 1931 The destination list and via lists are sequences of Destination 1932 values: 1934 enum {reserved(0), node(1), resource(2), compressed(3), 1935 /* 128-255 not allowed */ (255) } 1936 DestinationType; 1938 select (destination_type) { 1939 case node: 1940 NodeId node_id; 1942 case resource: 1943 ResourceId resource_id; 1945 case compressed: 1946 opaque compressed_id<0..2^8-1>; 1948 /* This structure may be extended with new types */ 1949 } DestinationData; 1951 struct { 1952 DestinationType type; 1953 uint8 length; 1954 DestinationData destination_data; 1955 } Destination; 1957 struct { 1958 uint16 compressed_id; /* top bit MUST be 1 */ 1959 } Destination; 1961 If a destination structure has its first bit set to 1, then it is a 1962 16 bit integer. If the first bit is not set, then it is a structure 1963 starting with DestinationType. If it is a 16 bit integer, it is 1964 treated as if it were a full structure with a DestinationType of 1965 compressed and a compressed_id that was 2 bytes long with the value 1966 of the 16 bit integer. When the destination structure is not a 16 1967 bit integer, it is the TLV structure with the following contents: 1969 type 1970 The type of the DestinationData Payload Data Unit (PDU). This may 1971 be one of "node", "resource", or "compressed". 1973 length 1974 The length of the destination_data. 1976 destination_value 1977 The destination value itself, which is an encoded DestinationData 1978 structure, depending on the value of "type". 1980 Note: This structure encodes a type, length, value. The length 1981 field specifies the length of the DestinationData values, which 1982 allows the addition of new DestinationTypes. This allows an 1983 implementation which does not understand a given DestinationType 1984 to skip over it. 1986 A DestinationData can be one of three types: 1988 node 1989 A Node-ID. 1991 compressed 1992 A compressed list of Node-IDs and/or resources. Because this 1993 value was compressed by one of the peers, it is only meaningful to 1994 that peer and cannot be decoded by other peers. Thus, it is 1995 represented as an opaque string. 1997 resource 1998 The Resource-ID of the resource which is desired. This type MUST 1999 only appear in the final location of a destination list and MUST 2000 NOT appear in a via list. It is meaningless to try to route 2001 through a resource. 2003 One possible encoding of the 16 bit integer version as an opaque 2004 identifier is to encode an index into a connection table. To avoid 2005 misrouting responses in the event a response is delayed and the 2006 connection table entry has changed, the identifier SHOULD be split 2007 between an index and a generation counter for that index. At 2008 startup, the generation counters should be initialized to random 2009 values. An implementation could use 12 bits for the connection table 2010 index and 3 bits for the generation counter. (Note that this does 2011 not suggest a 4096 entry connection table for every node, only the 2012 ability to encode for a larger connection table.) When a connection 2013 table slot is used for a new connection, the generation counter is 2014 incremented (with wrapping). Connection table slots are used on a 2015 rotating basis to maximize the time interval between uses of the same 2016 slot for different connections. When routing a message to an entry 2017 in the destination list encoding a connection table entry, the node 2018 confirms that the generation counter matches the current generation 2019 counter of that index before forwarding the message. If it does not 2020 match, the message is silently dropped. 2022 5.3.2.3. Forwarding Options 2024 The Forwarding header can be extended with forwarding header options, 2025 which are a series of ForwardingOptions structures: 2027 enum { reservedForwarding(0), 2028 directResponseForwarding(1), (255) } ForwardingOptionsType; 2030 struct { 2031 ForwardingOptionsType type; 2032 uint8 flags; 2033 uint16 length; 2034 select (type) { 2035 case directResponseForwarding: 2036 DirectResponseForwardingOption directResponseForwardingOption; 2037 /* This type may be extended */ 2038 } option; 2039 } ForwardingOption; 2041 Each ForwardingOption consists of the following values: 2043 type 2044 The type of the option. This structure allows for unknown options 2045 types. 2047 length 2048 The length of the rest of the structure. 2050 flags 2051 Three flags are defined FORWARD_CRITICAL(0x01), 2052 DESTINATION_CRITICAL(0x02), and RESPONSE_COPY(0x04). These flags 2053 MUST NOT be set in a response. If the FORWARD_CRITICAL flag is 2054 set, any node that would forward the message but does not 2055 understand this options MUST reject the request with an 2056 Error_Unsupported_Forwarding_Option error response. If the 2057 DESTINATION_CRITICAL flag is set, any node that generates a 2058 response to the message but does not understand the forwarding 2059 option MUST reject the request with an 2060 Error_Unsupported_Forwarding_Option error response. If the 2061 RESPONSE_COPY flag is set, any node generating a response MUST 2062 copy the option from the request to the response except that the 2063 RESPONSE_COPY, FORWARD_CRITICAL and DESTINATION_CRITICAL flags 2064 must be cleared. 2066 option 2067 The option value. 2069 5.3.2.4. Direct Return Response Forwarding Options 2071 This section defines an OPTIONAL forwarding option that allows the 2072 originator of a request to signal that the node responding to the 2073 request should try to route the response directly to the node that 2074 made the request instead of having the responses traverse the 2075 overlay. : 2077 struct { 2078 AttachReqAns connection_information; 2079 NodeID requesting_node; 2080 } DirectResponseForwardingOption; 2082 Each ForwardingOption consists of the following values: 2084 connection_information 2085 All of the information needed to initiate a new connection to the 2086 requesting node. This type is defined in Section 5.5.1.1. 2088 requesting_node 2089 The NodeID of the node that originated the request. This is used 2090 to match the TLS certificate. 2092 This option can only be used if the direct-return-response-permitted 2093 flag in the configuration for the overlay is set to TRUE. The 2094 RESPONSE_COPY flag SHOULD be set to false while the FORWARD_CRITICAL 2095 and DESTINATION_CRITICAL MUST be set to true. When a node that 2096 supports this forwarding options receives a request with it, it acts 2097 as if it had send an Attach request to the requesting_node and it had 2098 received the connection_information in the answer. This causes it to 2099 form a new connection directly to that node. Once that is complete 2100 the response to this request is sent over that connection. If a 2101 connection already exists directly to that node, it is used instead 2102 of a new connection being formed. The connection MAY be closed at 2103 any point but is typically kept open until it has not been used for a 2104 significant period of time or one of the nodes needs to reclaim 2105 resources. 2107 5.3.3. Message Contents Format 2109 The second major part of a RELOAD message is the contents part, which 2110 is defined by MessageContents: 2112 enum { reservedMessagesExtension(0), (2^16-1) } MessageExtensionType; 2114 struct { 2115 MessageExtensionType type; 2116 Boolean critical; 2117 opaque extension_contents<0..2^32-1>; 2118 } MessageExtension; 2120 struct { 2121 uint16 message_code; 2122 opaque message_body<0..2^32-1>; 2123 MessageExtensions extensions<0..2^32-1>; 2124 } MessageContents; 2126 The contents of this structure are as follows: 2128 message_code 2129 This indicates the message that is being sent. The code space is 2130 broken up as follows. 2132 0 Reserved 2134 1 .. 0x7fff Requests and responses. These code points are always 2135 paired, with requests being odd and the corresponding response 2136 being the request code plus 1. Thus, "probe_request" (the 2137 Probe request) has value 1 and "probe_answer" (the Probe 2138 response) has value 2 2140 0xffff Error 2141 The message codes are defined in Section 13.8 2142 message_body 2143 The message body itself, represented as a variable-length string 2144 of bytes. The bytes themselves are dependent on the code value. 2145 See the sections describing the various RELOAD methods (Join, 2146 Update, Attach, Store, Fetch, etc.) for the definitions of the 2147 payload contents. 2148 extensions 2149 Extensions to the message. Currently no extensions are defined, 2150 but new extensions can be defined by the process described in 2151 Section 13.14. 2153 All extensions have the following form: 2155 type 2156 The extension type. 2158 critical 2159 Whether this extension must be understood in order to process the 2160 message. If critical = True and the recipient does not understand 2161 the message, it MUST generate an Error_Unknown_Extension error. 2162 If critical = False, the recipient MAY choose to process the 2163 message even if it does not understand the extension. 2165 extension_contents 2166 The contents of the extension (extension-dependent). 2168 5.3.3.1. Response Codes and Response Errors 2170 A peer processing a request returns its status in the message_code 2171 field. If the request was a success, then the message code is the 2172 response code that matches the request (i.e., the next code up). The 2173 response payload is then as defined in the request/response 2174 descriptions. 2176 If the request has failed, then the message code is set to 0xffff 2177 (error) and the payload MUST be an error_response PDU, as shown 2178 below. 2180 When the message code is 0xffff, the payload MUST be an 2181 ErrorResponse. 2183 public struct { 2184 uint16 error_code; 2185 opaque error_info<0..2^16-1>; 2186 } ErrorResponse; 2188 The contents of this structure are as follows: 2190 error_code 2191 A numeric error code indicating the error that occurred. 2193 error_info 2194 An optional arbitrary byte string. Unless otherwise specified, 2195 this will be a UTF-8 text string providing further information 2196 about what went wrong. 2198 The following error code values are defined. The numeric values for 2199 these are defined in Section 13.9. 2201 Error_Forbidden: The requesting node does not have permission to 2202 make this request. 2204 Error_Not_Found: The resource or peer cannot be found or does not 2205 exist. 2207 Error_Request_Timeout: A response to the request has not been 2208 received in a suitable amount of time. The requesting node MAY 2209 resend the request at a later time. 2211 Error_Data_Too_Old: A store cannot be completed because the 2212 storage_time precedes the existing value. 2214 Error_Data_Too_Old: A store cannot be completed because the 2215 storage_time precedes the existing value. 2217 Error_Data_Too_Large: A store cannot be completed because the 2218 requested object exceeds the size limits for that kind. 2220 Error_Generation_Counter_Too_Low: A store cannot be completed 2221 because the generation counter precedes the existing value. 2223 Error_Incompatible_with_Overlay: A peer receiving the request is 2224 using a different overlay, overlay algorithm, or hash algorithm. 2226 Error_Unsupported_Forwarding_Option: A peer receiving the request 2227 with a forwarding options flagged as critical but the peer does 2228 not support this option. See section Section 5.3.2.3. 2230 Error_TTL_Exceeded: A peer receiving the request where the TTL got 2231 decremented to zero. See section Section 5.3.2. 2233 Error_Message_Too_Large: A peer receiving the request that was too 2234 large. See section Section 5.6. 2236 Error_Response_Too_Large: A peer would have generated a response 2237 that is too large per the max_response_length field. 2239 Error_Config_Too_Old: A destination peer received a request with a 2240 configuration sequence that's too old. A node which generates 2241 this response MUST then generate a ConfigUpdate message containing 2242 the correct configuration. 2244 Error_Config_Too_New: A destination node received a request with a 2245 configuration sequence that's too new. A node which receives this 2246 error MUST generate a ConfigUpdate message to send a new copy of 2247 the configuration document to the node which generated the error. 2249 Error_Unknown_Kind: A destination node received a request with an 2250 unknown kind-id. A node which receives this error MUST generate a 2251 ConfigUpdate message which contains the appropriate kind 2252 definition (assuming that in fact a kind was used which was 2253 defined in the configuration document). 2254 Error_Unknown_Extension: A destination node received a request with 2255 an unknown extension. 2257 5.3.4. Security Block 2259 The third part of a RELOAD message is the security block. The 2260 security block is represented by a SecurityBlock structure: 2262 enum { x509(0), (255) } certificate_type; 2264 struct { 2265 certificate_type type; 2266 opaque certificate<0..2^16-1>; 2267 } GenericCertificate; 2269 struct { 2270 GenericCertificate certificates<0..2^16-1>; 2271 Signature signature; 2272 } SecurityBlock; 2274 The contents of this structure are: 2276 certificates 2277 A bucket of certificates. 2279 signature 2280 A signature over the message contents. 2282 The certificates bucket SHOULD contain all the certificates necessary 2283 to verify every signature in both the message and the internal 2284 message objects. This is the only location in the message which 2285 contains certificates, thus allowing for only a single copy of each 2286 certificate to be sent. In systems which have some alternate 2287 certificate distribution mechanism, some certificates MAY be omitted. 2288 However, implementors should note that this creates the possibility 2289 that messages may not be immediately verifiable because certificates 2290 must first be retrieved. 2292 Each certificate is represented by a GenericCertificate structure, 2293 which has the following contents: 2295 type 2296 The type of the certificate. Only one type is defined: x509 2297 representing an X.509 certificate. 2299 certificate 2300 The encoded version of the certificate. For X.509 certificates, 2301 it is the DER form. 2303 The signature is computed over the payload and parts of the 2304 forwarding header. The payload, in case of a Store, may contain an 2305 additional signature computed over a StoreReq structure. All 2306 signatures are formatted using the Signature element. This element 2307 is also used in other contexts where signatures are needed. The 2308 input structure to the signature computation varies depending on the 2309 data element being signed. 2311 enum { reservedSignerIdentity(0), 2312 cert_hash(1), (255)} SignerIdentityType; 2314 select (identity_type) { 2315 case cert_hash; 2316 HashAlgorithm hash_alg; // From TLS 2317 opaque certificate_hash<0..2^8-1>; 2319 /* This structure may be extended with new types if necessary*/ 2320 } SignerIdentityValue; 2322 struct { 2323 SignerIdentityType identity_type; 2324 uint16 length; 2325 SignerIdentityValue identity[SignerIdentity.length]; 2326 } SignerIdentity; 2328 struct { 2329 SignatureAndHashAlgorithm algorithm; // From TLS 2330 SignerIdentity identity; 2331 opaque signature_value<0..2^16-1>; 2332 } Signature; 2334 The signature construct contains the following values: 2336 algorithm 2337 The signature algorithm in use. The algorithm definitions are 2338 found in the IANA TLS SignatureAlgorithm Registry. All 2339 implementations MUST support RSASSA-PKCS1-v1_5 [RFC3447] 2340 signatures with SHA-256 hashes. 2342 identity 2343 The identity used to form the signature. 2345 signature_value 2346 The value of the signature. 2348 The only currently permitted identity format is a hash of the 2349 signer's certificate. The hash_alg field is used to indicate the 2350 algorithm used to produce the hash. The certificate_hash contains 2351 the hash of the certificate object. The SignerIdentity structure is 2352 typed purely to allow for future (unanticipated) extensibility. 2354 For signatures over messages the input to the signature is computed 2355 over: 2357 overlay + transaction_id + MessageContents + SignerIdentity 2359 where overlay and transaction_id come from the forwarding header and 2360 + indicates concatenation. 2362 The input to signatures over data values is different, and is 2363 described in Section 6.1. 2365 All RELOAD messages MUST be signed. Upon receipt, the receiving node 2366 MUST verify the signature and the authorizing certificate. This 2367 check provides a minimal level of assurance that the sending node is 2368 a valid part of the overlay as well as cryptographic authentication 2369 of the sending node. In addition, responses MUST be checked as 2370 follows: 2372 1. The response to a message sent to a specific Node-ID MUST have 2373 been sent by that Node-ID. 2374 2. The response to a message sent to a Resource-Id MUST have been 2375 sent by a Node-ID which is as close to or closer to the target 2376 Resource-Id than any node in the requesting node's neighbor 2377 table. 2379 The second condition serves as a primitive check for responses from 2380 wildly wrong nodes but is not a complete check. Note that in periods 2381 of churn, it is possible for the requesting node to obtain a closer 2382 neighbor while the request is outstanding. This will cause the 2383 response to be rejected and the request to be retransmitted. 2385 In addition, some methods (especially Store) have additional 2386 authentication requirements, which are described in the sections 2387 covering those methods. 2389 5.4. Overlay Topology 2391 As discussed in previous sections, RELOAD does not itself implement 2392 any overlay topology. Rather, it relies on Topology Plugins, which 2393 allow a variety of overlay algorithms to be used while maintaining 2394 the same RELOAD core. This section describes the requirements for 2395 new topology plugins and the methods that RELOAD provides for overlay 2396 topology maintenance. 2398 5.4.1. Topology Plugin Requirements 2400 When specifying a new overlay algorithm, at least the following need 2401 to be described: 2403 o Joining procedures, including the contents of the Join message. 2404 o Stabilization procedures, including the contents of the Update 2405 message, the frequency of topology probes and keepalives, and the 2406 mechanism used to detect when peers have disconnected. 2407 o Exit procedures, including the contents of the Leave message. 2408 o The length of the Resource-IDs. For DHTs, the hash algorithm to 2409 compute the hash of an identifier. 2410 o The procedures that peers use to route messages. 2411 o The replication strategy used to ensure data redundancy. 2413 All overlay algorithms MUST specify maintenance procedures that send 2414 Updates to clients and peers that have established connections to the 2415 peer responsible for a particular ID when the responsibility for that 2416 ID changes. Because tracking this information is difficult, overlay 2417 algorithms MAY simply specify that an Update is sent to all members 2418 of the Connection Table whenever the range of IDs for which the peer 2419 is responsible changes. 2421 5.4.2. Methods and types for use by topology plugins 2423 This section describes the methods that topology plugins use to join, 2424 leave, and maintain the overlay. 2426 5.4.2.1. Join 2428 A new peer (but one that already has credentials) uses the JoinReq 2429 message to join the overlay. The JoinReq is sent to the responsible 2430 peer depending on the routing mechanism described in the topology 2431 plugin. This notifies the responsible peer that the new peer is 2432 taking over some of the overlay and it needs to synchronize its 2433 state. 2435 struct { 2436 NodeId joining_peer_id; 2437 opaque overlay_specific_data<0..2^16-1>; 2438 } JoinReq; 2440 The minimal JoinReq contains only the Node-ID which the sending peer 2441 wishes to assume. Overlay algorithms MAY specify other data to 2442 appear in this request. Receivers of the JoinReq MUST verify that 2443 the joining_peer_id field matches the Node-ID used to sign the 2444 message and if not MUST reject the message with an Error_Forbidden 2445 error. 2447 If the request succeeds, the responding peer responds with a JoinAns 2448 message, as defined below: 2450 struct { 2451 opaque overlay_specific_data<0..2^16-1>; 2452 } JoinAns; 2454 If the request succeeds, the responding peer MUST follow up by 2455 executing the right sequence of Stores and Updates to transfer the 2456 appropriate section of the overlay space to the joining peer. In 2457 addition, overlay algorithms MAY define data to appear in the 2458 response payload that provides additional info. 2460 In general, nodes which cannot form connections SHOULD report an 2461 error. However, implementations MUST provide some mechanism whereby 2462 nodes can determine that they are potentially the first node and take 2463 responsibility for the overlay. This specification does not mandate 2464 any particular mechanism, but a configuration flag or setting seems 2465 appropriate. 2467 5.4.2.2. Leave 2469 The LeaveReq message is used to indicate that a node is exiting the 2470 overlay. A node SHOULD send this message to each peer with which it 2471 is directly connected prior to exiting the overlay. 2473 public struct { 2474 NodeId leaving_peer_id; 2475 opaque overlay_specific_data<0..2^16-1>; 2476 } LeaveReq; 2478 LeaveReq contains only the Node-ID of the leaving peer. Overlay 2479 algorithms MAY specify other data to appear in this request. 2480 Receivers of the JoinReq MUST verify that the joining_peer_id field 2481 matches the Node-ID used to sign the message and if not MUST reject 2482 the message with an Error_Forbidden error. 2484 Upon receiving a Leave request, a peer MUST update its own routing 2485 table, and send the appropriate Store/Update sequences to re- 2486 stabilize the overlay. 2488 5.4.2.3. Update 2490 Update is the primary overlay-specific maintenance message. It is 2491 used by the sender to notify the recipient of the sender's view of 2492 the current state of the overlay (its routing state), and it is up to 2493 the recipient to take whatever actions are appropriate to deal with 2494 the state change. In general, peers send Update messages to all 2495 their adjacencies whenever they detect a topology shift. 2497 When a peer detects through an Update that it is no longer 2498 responsible for any data value it is storing, it MUST attempt to 2499 Store a copy to the correct node unless it knows the newly 2500 responsible node already has a copy of the data. This prevents data 2501 loss during large-scale topology shifts such as the merging of 2502 partitioned overlays. 2504 The contents of the UpdateReq message are completely overlay- 2505 specific. The UpdateAns response is expected to be either success or 2506 an error. 2508 5.4.2.4. RouteQuery 2510 The RouteQuery request allows the sender to ask a peer where they 2511 would route a message directed to a given destination. In other 2512 words, a RouteQuery for a destination X requests the Node-ID for the 2513 node that the receiving peer would next route to in order to get to 2514 X. A RouteQuery can also request that the receiving peer initiate an 2515 Update request to transfer the receiving peer's routing table. 2517 One important use of the RouteQuery request is to support iterative 2518 routing. The sender selects one of the peers in its routing table 2519 and sends it a RouteQuery message with the destination_object set to 2520 the Node-ID or Resource-ID it wishes to route to. The receiving peer 2521 responds with information about the peers to which the request would 2522 be routed. The sending peer MAY then use the Attach method to attach 2523 to that peer(s), and repeat the RouteQuery. Eventually, the sender 2524 gets a response from a peer that is closest to the identifier in the 2525 destination_object as determined by the topology plugin. At that 2526 point, the sender can send messages directly to that peer. 2528 5.4.2.4.1. Request Definition 2530 A RouteQueryReq message indicates the peer or resource that the 2531 requesting node is interested in. It also contains a "send_update" 2532 option allowing the requesting node to request a full copy of the 2533 other peer's routing table. 2535 struct { 2536 Boolean send_update; 2537 Destination destination; 2538 opaque overlay_specific_data<0..2^16-1>; 2539 } RouteQueryReq; 2541 The contents of the RouteQueryReq message are as follows: 2543 send_update 2544 A single byte. This may be set to "true" to indicate that the 2545 requester wishes the responder to initiate an Update request 2546 immediately. Otherwise, this value MUST be set to "false". 2548 destination 2549 The destination which the requester is interested in. This may be 2550 any valid destination object, including a Node-ID, compressed ids, 2551 or Resource-ID. 2553 overlay_specific_data 2554 Other data as appropriate for the overlay. 2556 5.4.2.4.2. Response Definition 2558 A response to a successful RouteQueryReq request is a RouteQueryAns 2559 message. This is completely overlay specific. 2561 5.4.2.5. Probe 2563 Probe provides primitive "exploration" services: it allows node to 2564 determine which resources another node is responsible for; and it 2565 allows some discovery services using multicast, anycast, or 2566 broadcast. A probe can be addressed to a specific Node-ID, or the 2567 peer controlling a given location (by using a resource ID). In 2568 either case, the target Node-IDs respond with a simple response 2569 containing some status information. 2571 5.4.2.5.1. Request Definition 2573 The ProbeReq message contains a list (potentially empty) of the 2574 pieces of status information that the requester would like the 2575 responder to provide. 2577 enum { reservedProbeInformation(0), responsible_set(1), 2578 num_resources(2), uptime(3), (255)} 2579 ProbeInformationType; 2581 struct { 2582 ProbeInformationType requested_info<0..2^8-1>; 2583 } ProbeReq 2585 The currently defined values for ProbeInformation are: 2587 responsible_set 2588 indicates that the peer should Respond with the fraction of the 2589 overlay for which the responding peer is responsible. 2591 num_resources 2592 indicates that the peer should Respond with the number of 2593 resources currently being stored by the peer. 2595 uptime 2596 indicates that the peer should Respond with how long the peer has 2597 been up in seconds. 2599 5.4.2.5.2. Response Definition 2601 A successful ProbeAns response contains the information elements 2602 requested by the peer. 2604 struct { 2605 select (type) { 2606 case responsible_set: 2607 uint32 responsible_ppb; 2609 case num_resources: 2610 uint32 num_resources; 2612 case uptime: 2613 uint32 uptime; 2614 /* This type may be extended */ 2616 }; 2617 } ProbeInformationData; 2619 struct { 2620 ProbeInformationType type; 2621 uint8 length; 2622 ProbeInformationData value; 2623 } ProbeInformation; 2625 struct { 2626 ProbeInformation probe_info<0..2^16-1>; 2627 } ProbeAns; 2629 A ProbeAns message contains a sequence of ProbeInformation 2630 structures. Each has a "length" indicating the length of the 2631 following value field. This structure allows for unknown option 2632 types. 2634 Each of the current possible Probe information types is a 32-bit 2635 unsigned integer. For type "responsible_ppb", it is the fraction of 2636 the overlay for which the peer is responsible in parts per billion. 2637 For type "num_resources", it is the number of resources the peer is 2638 storing. For the type "uptime" it is the number of seconds the peer 2639 has been up. 2641 The responding peer SHOULD include any values that the requesting 2642 node requested and that it recognizes. They SHOULD be returned in 2643 the requested order. Any other values MUST NOT be returned. 2645 5.5. Forwarding and Link Management Layer 2647 Each node maintains connections to a set of other nodes defined by 2648 the topology plugin. This section defines the methods RELOAD uses to 2649 form and maintain connections between nodes in the overlay. Three 2650 methods are defined: 2652 Attach: used to form RELOAD connections between nodes. When node 2653 A wants to connect to node B, it sends an Attach message to node B 2654 through the overlay. The Attach contains A's ICE parameters. B 2655 responds with its ICE parameters and the two nodes perform ICE to 2656 form connection. Attach also allows two nodes to connect via No- 2657 ICE instead of full ICE. 2658 AppAttach: used to form application layer connections between 2659 nodes. 2660 Ping: is a simple request/response which is used to verify 2661 connectivity of the target peer. 2663 5.5.1. Attach 2665 A node sends an Attach request when it wishes to establish a direct 2666 TCP or UDP connection to another node for the purpose of sending 2667 RELOAD messages. 2669 As described in Section 5.1, an Attach may be routed to either a 2670 Node-ID or to a Resource-ID. An Attach routed to a specific Node-ID 2671 will fail if that node is not reached. An Attach routed to a 2672 Resource-ID will establish a connection with the peer currently 2673 responsible for that Resource-ID, which may be useful in establishing 2674 a direct connection to the responsible peer for use with frequent or 2675 large resource updates. 2677 An Attach in and of itself does not result in updating the routing 2678 table of either node. That function is performed by Updates. If 2679 node A has Attached to node B, but not received any Updates from B, 2680 it MAY route messages which are directly addressed to B through that 2681 channel but MUST NOT route messages through B to other peers via that 2682 channel. The process of Attaching is separate from the process of 2683 becoming a peer (using Join and Update), to prevent half-open states 2684 where a node has started to form connections but is not really ready 2685 to act as a peer. Thus, clients (unlike peers) can simply Attach 2686 without sending Join or Update. 2688 5.5.1.1. Request Definition 2690 An Attach request message contains the requesting node ICE connection 2691 parameters formatted into a binary structure. 2693 enum { reservedOverlayLink(0), DTLS-UDP-SR(1), 2694 DTLS-UDP-SR-NO-ICE(3), TLS-TCP-FH-NO-ICE(4), 2695 (255) } OverlayLinkType; 2697 enum { reservedCand(0), host(1), srflx(2), prflx(3), relay(4), 2698 (255) } CandType; 2700 struct { 2701 opaque name<0..2^16-1>; 2702 opaque value<0..2^16-1>; 2703 } IceExtension; 2705 struct { 2706 IpAddressPort addr_port; 2707 OverlayLinkType overlay_link; 2708 opaque foundation<0..255>; 2709 uint32 priority; 2710 CandType type; 2711 select (type){ 2712 case host: 2713 ; /* Nothing */ 2714 case srflx: 2715 case prflx: 2716 case relay: 2717 IpAddressPort rel_addr_port; 2718 } 2719 IceExtension extensions<0..2^16-1>; 2720 } IceCandidate; 2722 struct { 2723 opaque ufrag<0..2^8-1>; 2724 opaque password<0..2^8-1>; 2725 opaque role<0..2^8-1>; 2726 IceCandidate candidates<0..2^16-1>; 2727 Boolean send_update; 2728 } AttachReqAns; 2730 The values contained in AttachReqAns are: 2732 ufrag 2733 The username fragment (from ICE). 2735 password 2736 The ICE password. 2738 role 2739 An active/passive/actpass attribute from RFC 4145 [RFC4145]. This 2740 value MUST be 'passive' for the offerer (the peer sending the 2741 Attach request) and 'active' for the answerer (the peer sending 2742 the Attach response). 2744 candidates 2745 One or more ICE candidate values, as described below. 2746 send_update 2747 Has the same meaning as the send_update field in RouteQueryReq. 2749 Each ICE candidate is represented as an IceCandidate structure, which 2750 is a direct translation of the information from the ICE string 2751 structures, with the exception of the component ID. Since there is 2752 only one component, it is always 1, and thus left out of the PDU. 2753 The remaining values are specified as follows: 2755 addr_port 2756 corresponds to the connection-address and port productions. 2758 overlay_link 2759 corresponds to the OverlayLinkType production, Overlay Link 2760 protocols used with No-ICE MUST specify "No-ICE" in their 2761 description. Future overlay link values can be added be defining 2762 new OverlayLinkType values in the IANA registry in Section 13.10. 2763 Future extensions to the encapsulation or framing that provide for 2764 backward compatibility with that specified by a previously defined 2765 OverlayLinkType values MUST use that previous value. 2766 OverlayLinkType protocols are defined in Section 5.6 2767 A single AttachReqAns MUST NOT include both candidates whose 2768 OverlayLinkType protocols use ICE (the default) and candidates 2769 that specify "No-ICE". 2771 foundation 2772 corresponds to the foundation production. 2774 priority 2775 corresponds to the priority production. 2777 type 2778 corresponds to the cand-type production. 2780 rel_addr_port 2781 corresponds to the rel-addr and rel-port productions. Only 2782 present for type "relay". 2784 extensions 2785 ICE extensions. The name and value fields correspond to binary 2786 translations of the equivalent fields in the ICE extensions. 2788 These values should be generated using the procedures described in 2789 Section 5.5.1.3. 2791 5.5.1.2. Response Definition 2793 If a peer receives an Attach request, it MUST process the request and 2794 SHOULD generate its own response with a AttachReqAns. A peer which 2795 is overloaded or detects some other kind of error may of course 2796 generate an error instead of an AttachReqAns. It should then begin 2797 ICE checks. When a peer receives an Attach response, it SHOULD parse 2798 the response and begin its own ICE checks. 2800 5.5.1.3. Using ICE With RELOAD 2802 This section describes the profile of ICE that is used with RELOAD. 2803 RELOAD implementations MUST implement full ICE. 2805 In ICE as defined by [RFC5245], SDP is used to carry the ICE 2806 parameters. In RELOAD, this function is performed by a binary 2807 encoding in the Attach method. This encoding is more restricted than 2808 the SDP encoding because the RELOAD environment is simpler: 2810 o Only a single media stream is supported. 2811 o In this case, the "stream" refers not to RTP or other types of 2812 media, but rather to a connection for RELOAD itself or for SIP 2813 signaling. 2814 o RELOAD only allows for a single offer/answer exchange. Unlike the 2815 usage of ICE within SIP, there is never a need to send a 2816 subsequent offer to update the default candidates to match the 2817 ones selected by ICE. 2819 An agent follows the ICE specification as described in [RFC5245] with 2820 the changes and additional procedures described in the subsections 2821 below. 2823 5.5.1.4. Collecting STUN Servers 2825 ICE relies on the node having one or more STUN servers to use. In 2826 conventional ICE, it is assumed that nodes are configured with one or 2827 more STUN servers through some out of band mechanism. This is still 2828 possible in RELOAD but RELOAD also learns STUN servers as it connects 2829 to other peers. Because all RELOAD peers implement ICE and use STUN 2830 keepalives, every peer is a capable of responding to STUN Binding 2831 requests [RFC5389]. Accordingly, any peer that a node knows about 2832 can be used like a STUN server -- though of course it may be behind a 2833 NAT. 2835 A peer on a well-provisioned wide-area overlay will be configured 2836 with one or more bootstrap nodes. These nodes make an initial list 2837 of STUN servers. However, as the peer forms connections with 2838 additional peers, it builds more peers it can use like STUN servers. 2840 Because complicated NAT topologies are possible, a peer may need more 2841 than one STUN server. Specifically, a peer that is behind a single 2842 NAT will typically observe only two IP addresses in its STUN checks: 2843 its local address and its server reflexive address from a STUN server 2844 outside its NAT. However, if there are more NATs involved, it may 2845 learn additional server reflexive addresses (which vary based on 2846 where in the topology the STUN server is). To maximize the chance of 2847 achieving a direct connection, a peer SHOULD group other peers by the 2848 peer-reflexive addresses it discovers through them. It SHOULD then 2849 select one peer from each group to use as a STUN server for future 2850 connections. 2852 Only peers to which the peer currently has connections may be used. 2853 If the connection to that host is lost, it MUST be removed from the 2854 list of stun servers and a new server from the same group MUST be 2855 selected unless there are no others servers in the group in which 2856 case some other peer MAY be used. 2858 5.5.1.5. Gathering Candidates 2860 When a node wishes to establish a connection for the purposes of 2861 RELOAD signaling or application signaling, it follows the process of 2862 gathering candidates as described in Section 4 of ICE [RFC5245]. 2863 RELOAD utilizes a single component. Consequently, gathering for 2864 these "streams" requires a single component. In the case where a 2865 node has not yet found a TURN server, the agent would not include a 2866 relayed candidate. 2868 The ICE specification assumes that an ICE agent is configured with, 2869 or somehow knows of, TURN and STUN servers. RELOAD provides a way 2870 for an agent to learn these by querying the overlay, as described in 2871 Section 5.5.1.4 and Section 8. 2873 The default candidate selection described in Section 4.1.4 of ICE is 2874 ignored; defaults are not signaled or utilized by RELOAD. 2876 An alternative to using the full ICE supported by the Attach request 2877 is to use No-ICE mechanism by providing candidates with "No-ICE" 2878 Overlay Link protocols. Configuration for the overlay indicates 2879 whether or not these Overlay Link protocols can be used. An overlay 2880 MUST be either all ICE or all No-ICE. 2882 No-ICE will not work in all of the scenarios where ICE would work, 2883 but in some cases, particularly those with no NATs or firewalls, it 2884 will work. Therefore it is RECOMMENDED that full ICE be used even 2885 for a node that has a public, unfiltered IP address, to take 2886 advantage of STUN connectivity checks, etc. 2888 5.5.1.6. Prioritizing Candidates 2890 At the time of writing, UDP is the only transport protocol specified 2891 to work with ICE. However, standardization of additional protocols 2892 for use with ICE is expected, including TCP and datagram-oriented 2893 protocols such as SCTP and DCCP. In particular, UDP encapsulations 2894 for SCTP and DCCP are expected to be standardized in the near future, 2895 greatly expanding the available Overlay Link protocols available for 2896 RELOAD. When additional protocols are available, the following 2897 prioritization is RECOMMENDED: 2899 o Highest priority is assigned to message-oriented protocols that 2900 offer well-understood congestion and flow control without head of 2901 line blocking. For example, SCTP without message ordering, DCCP, 2902 or those protocols encapsulated using UDP. 2903 o Second highest priority is assigned to stream-oriented protocols, 2904 e.g. TCP. 2905 o Lowest priority is assigned to protocols encapsulated over UDP 2906 that do not implement well-established congestion control 2907 algorithms. The DTLS/UDP with SR overlay link protocol is an 2908 example of such a protocol. 2910 5.5.1.7. Encoding the Attach Message 2912 Section 4.3 of ICE describes procedures for encoding the SDP for 2913 conveying RELOAD candidates. Instead of actually encoding an SDP, 2914 the candidate information (IP address and port and transport 2915 protocol, priority, foundation, type and related address) is carried 2916 within the attributes of the Attach request or its response. 2917 Similarly, the username fragment and password are carried in the 2918 Attach message or its response. Section 5.5.1 describes the detailed 2919 attribute encoding for Attach. The Attach request and its response 2920 do not contain any default candidates or the ice-lite attribute, as 2921 these features of ICE are not used by RELOAD. 2923 Since the Attach request contains the candidate information and short 2924 term credentials, it is considered as an offer for a single media 2925 stream that happens to be encoded in a format different than SDP, but 2926 is otherwise considered a valid offer for the purposes of following 2927 the ICE specification. Similarly, the Attach response is considered 2928 a valid answer for the purposes of following the ICE specification. 2930 5.5.1.8. Verifying ICE Support 2932 An agent MUST skip the verification procedures in Section 5.1 and 6.1 2933 of ICE. Since RELOAD requires full ICE from all agents, this check 2934 is not required. 2936 5.5.1.9. Role Determination 2938 The roles of controlling and controlled as described in Section 5.2 2939 of ICE are still utilized with RELOAD. However, the offerer (the 2940 entity sending the Attach request) will always be controlling, and 2941 the answerer (the entity sending the Attach response) will always be 2942 controlled. The connectivity checks MUST still contain the ICE- 2943 CONTROLLED and ICE-CONTROLLING attributes, however, even though the 2944 role reversal capability for which they are defined will never be 2945 needed with RELOAD. This is to allow for a common codebase between 2946 ICE for RELOAD and ICE for SDP. 2948 5.5.1.10. Full ICE 2950 When neither side has provided an No-ICE candidate, connectivity 2951 checks and nominations are used as in regular ICE. 2953 5.5.1.10.1. Connectivity Checks 2955 The processes of forming check lists in Section 5.7 of ICE, 2956 scheduling checks in Section 5.8, and checking connectivity checks in 2957 Section 7 are used with RELOAD without change. 2959 5.5.1.10.2. Concluding ICE 2961 The procedures in Section 8 of ICE are followed to conclude ICE, with 2962 the following exceptions: 2964 o The controlling agent MUST NOT attempt to send an updated offer 2965 once the state of its single media stream reaches Completed. 2966 o Once the state of ICE reaches Completed, the agent can immediately 2967 free all unused candidates. This is because RELOAD does not have 2968 the concept of forking, and thus the three second delay in Section 2969 8.3 of ICE does not apply. 2971 5.5.1.10.3. Media Keepalives 2973 STUN MUST be utilized for the keepalives described in Section 10 of 2974 ICE. 2976 5.5.1.11. No-ICE 2978 No-ICE is selected when either side has provided "no ICE" Overlay 2979 Link candidates. STUN is not used for connectivity checks when doing 2980 No-ICE; instead the DTLS or TLS handshake (or similar security layer 2981 of future overlay link protocols) forms the connectivity check. The 2982 certificate exchanged during the (D)TLS handshake must match the node 2983 that sent the AttachReqAns and if it does not, the connection MUST be 2984 closed. 2986 5.5.1.12. Subsequent Offers and Answers 2988 An agent MUST NOT send a subsequent offer or answer. Thus, the 2989 procedures in Section 9 of ICE MUST be ignored. 2991 5.5.1.13. Sending Media 2993 The procedures of Section 11 of ICE apply to RELOAD as well. 2994 However, in this case, the "media" takes the form of application 2995 layer protocols (RELOAD) over TLS or DTLS. Consequently, once ICE 2996 processing completes, the agent will begin TLS or DTLS procedures to 2997 establish a secure connection. The node which sent the Attach 2998 request MUST be the TLS server. The other node MUST be the TLS 2999 client. The server MUST request TLS client authentication. The 3000 nodes MUST verify that the certificate presented in the handshake 3001 matches the identity of the other peer as found in the Attach 3002 message. Once the TLS or DTLS signaling is complete, the application 3003 protocol is free to use the connection. 3005 The concept of a previous selected pair for a component does not 3006 apply to RELOAD, since ICE restarts are not possible with RELOAD. 3008 5.5.1.14. Receiving Media 3010 An agent MUST be prepared to receive packets for the application 3011 protocol (TLS or DTLS carrying RELOAD, SIP or anything else) at any 3012 time. The jitter and RTP considerations in Section 11 of ICE do not 3013 apply to RELOAD. 3015 5.5.2. AppAttach 3017 A node sends an AppAttach request when it wishes to establish a 3018 direct connection to another node for the purposes of sending 3019 application layer messages. AppAttach is nearly identical to Attach, 3020 except for the purpose of the connection: it is used to transport 3021 non-RELOAD "media". A separate request is used to avoid implementor 3022 confusion between the two methods (this was found to be a real 3023 problem with initial implementations). The AppAttach request and its 3024 response contain an application attribute, which indicates what 3025 protocol is to be run over the connection. 3027 5.5.2.1. Request Definition 3029 An AppAttachReq message contains the requesting node's ICE connection 3030 parameters formatted into a binary structure. 3032 struct { 3033 opaque ufrag<0..2^8-1>; 3034 opaque password<0..2^8-1>; 3035 uint16 application; 3036 opaque role<0..2^8-1>; 3037 IceCandidate candidates<0..2^16-1>; 3038 } AppAttachReq; 3040 The values contained in AppAttachReq and AppAttachAns are: 3042 ufrag 3043 The username fragment (from ICE) 3045 password 3046 The ICE password. 3048 application 3049 A 16-bit application-id as defined in the Section 13.5. This 3050 number represents the IANA registered application that is going to 3051 send data on this connection. For SIP, this is 5060 or 5061. 3053 role 3054 An active/passive/actpass attribute from RFC 4145 [RFC4145]. 3056 candidates 3057 One or more ICE candidate values 3059 5.5.2.2. Response Definition 3061 If a peer receives an AppAttach request, it SHOULD process the 3062 request and generate its own response with a AppAttachAns. It should 3063 then begin ICE checks. When a peer receives an AppAttach response, 3064 it SHOULD parse the response and begin its own ICE checks. If the 3065 application ID is not supported, the peer MUST reply with an 3066 Error_Not_Found error. 3068 struct { 3069 opaque ufrag<0..2^8-1>; 3070 opaque password<0..2^8-1>; 3071 uint16 application; 3072 opaque role<0..2^8-1>; 3073 IceCandidate candidates<0..2^16-1>; 3074 } AppAttachAns; 3076 The meaning of the fields is the same as in the AppAttachReq. 3078 5.5.3. Ping 3080 Ping is used to test connectivity along a path. A ping can be 3081 addressed to a specific Node-ID, to the peer controlling a given 3082 location (by using a resource ID), or to the broadcast Node-ID 3083 (2^128-1). 3085 5.5.3.1. Request Definition 3087 struct { 3088 opaque<0..2^16-1> padding; 3089 } PingReq 3091 The Ping request is empty of meaningful contents. However, it may 3092 contain up to 65535 bytes of padding to facilitate the discovery of 3093 overlay maximum packet sizes. 3095 5.5.3.2. Response Definition 3097 A successful PingAns response contains the information elements 3098 requested by the peer. 3100 struct { 3101 uint64 response_id; 3102 uint64 time; 3103 } PingAns; 3105 A PingAns message contains the following elements: 3107 response_id 3108 A randomly generated 64-bit response ID. This is used to 3109 distinguish Ping responses. 3111 time 3112 The time when the ping responses was created represented in the 3113 same way as storage_time defined in Section 6. 3115 5.5.4. ConfigUpdate 3117 The ConfigUpdate method is used to push updated configuration data 3118 across the overlay. Whenever a node detects that another node has 3119 old configuration data, it MUST generate a ConfigUpdate request. The 3120 ConfigUpdate request allows updating of two kinds of data: the 3121 configuration data (Section 5.3.2.1) and kind information 3122 (Section 6.4.1.1). 3124 5.5.4.1. Request Definition 3126 enum { reservedConfigUpdate(0), config(1), kind(2), (255) } 3127 ConfigUpdateType; 3129 typedef uint16 KindId; 3130 typedef opaque KindDescription<0..2^16-1>; 3132 struct { 3133 ConfigUpdateType type; 3134 uint32 length; 3136 select (type) { 3137 case config: 3138 opaque config_data<0..2^24-1>; 3140 case kind: 3141 KindDescription kinds<0..2^24-1>; 3143 /* This structure may be extended with new types*/ 3144 }; 3145 } ConfigUpdateReq; 3147 The ConfigUpdateReq message contains the following elements: 3149 type 3150 The type of the contents of the message. This structure allows 3151 for unknown content types. 3152 length 3153 The length of the remainder of the message. This is included to 3154 preserve backward compatibility and is 32 bits instead of 24 to 3155 facilitate easy conversion between network and host byte order. 3156 config_data (type==config) 3157 The contents of the configuration document. 3158 kinds (type==kind) 3159 One or more XML kind-block productions (see Section 10.1). These 3160 MUST be encoded with UTF-8 and assume a default namespace of 3161 "urn:ietf:params:xml:ns:p2p:config-base". 3163 5.5.4.2. Response Definition 3165 struct { 3166 } ConfigUpdateRsp 3168 If the ConfigUpdateReq is of type "config" it MUST only be processed 3169 if all the following are true: 3170 o The sequence number in the document is greater than the current 3171 configuration sequence number. 3172 o The configuration document is correctly digitally signed (see 3173 Section 10 for details on signatures. 3174 Otherwise appropriate errors MUST be generated. 3176 If the ConfigUpdateReq is of type "kind" it MUST only be processed if 3177 it is correctly digitally signed by an acceptable kind signer as 3178 specified in the configuration file. Details on kind-signer field in 3179 the configuration file is described in Section 10.1. In addition, if 3180 the kind update conflicts with an existing known kind (i.e., it is 3181 signed by a different signer), then it should be rejected with 3182 "Error_Forbidden". This should not happen in correctly functioning 3183 overlays. 3185 If the update is acceptable, then the node MUST reconfigure itself to 3186 match the new information. This may include adding permissions for 3187 new kinds, deleting old kinds, or even, in extreme circumstances, 3188 exiting and reentering the overlay, if, for instance, the DHT 3189 algorithm has changed. 3191 The response for ConfigUpdate is empty. 3193 5.6. Overlay Link Layer 3195 RELOAD can use multiple Overlay Link protocols to send its messages. 3196 Because ICE is used to establish connections (see Section 5.5.1.3), 3197 RELOAD nodes are able to detect which Overlay Link protocols are 3198 offered by other nodes and establish connections between them. Any 3199 link protocol needs to be able to establish a secure, authenticated 3200 connection and to provide data origin authentication and message 3201 integrity for individual data elements. RELOAD currently supports 3202 three Overlay Link protocols: 3204 o DTLS [RFC4347] over UDP with Simple Reliability (SR) 3205 o TLS [RFC5246] over TCP with Framing Header, No-ICE 3206 o DTLS [RFC4347] over UDP with SR, No-ICE 3208 Note that although UDP does not properly have "connections", both TLS 3209 and DTLS have a handshake which establishes a similar, stateful 3210 association, and we simply refer to these as "connections" for the 3211 purposes of this document. 3213 If a peer receives a message that is larger than value of max- 3214 message-size defined in the overlay configuration, the peer SHOULD 3215 send an Error_Message_Too_Large error and then close the TLS or DTLS 3216 session from which the message was received. Note that this error 3217 can be sent and the session closed before receiving the complete 3218 message. If the forwarding header is larger than the max-message- 3219 size, the receiver SHOULD close the TLS or DTLS session without 3220 sending an error. 3222 The Framing Header (FH) is used to frame messages and provide timing 3223 when used on a reliable stream-based transport protocol. Simple 3224 Reliability (SR) makes use of the FH to provide congestion control 3225 and semi-reliability when using unreliable message-oriented transport 3226 protocols. We will first define each of these algorithms, then 3227 define overlay link protocols that use them. 3229 Note: We expect future Overlay Link protocols to define replacements 3230 for all components of these protocols, including the framing header. 3231 These protocols have been chosen for simplicity of implementation and 3232 reasonable performance. 3234 Note to implementers: There are inherent tradeoffs in utilizing 3235 short timeouts to determine when a link has failed. To balance the 3236 tradeoffs, an implementation should be able to quickly act to remove 3237 entries from the routing table when there is reason to suspect the 3238 link has failed. For example, in a Chord-derived overlay algorithm, 3239 a closer finger table entry could be substituted for an entry in the 3240 finger table that has experienced a timeout. That entry can be 3241 restored if it proves to resume functioning, or replaced at some 3242 point in the future if necessary. End-to-end retransmissions will 3243 handle any lost messages, but only if the failing entries do not 3244 remain in the finger table for subsequent retransmissions. 3246 5.6.1. Future Overlay Link Protocols 3248 The only currently defined overlay link protocols are TLS and DTLS. 3249 It is possible to define new link-layer protocols and apply them to a 3250 new overlay using the "overlay-link-protocol" configuration directive 3251 (see Section 10.1.). However, any new protocols MUST meet the 3252 following requirements. 3254 Endpoint authentication When a node forms an association with 3255 another endpoint, it MUST be possible to cryptographically verify 3256 that the endpoint has a given NodeId. 3258 Traffic origin authentication and integrity When a node receives 3259 traffic from another endpoint, it MUST be possible to 3260 cryptographically verify that the traffic came from a given 3261 association and that it has not been modified in transit from the 3262 other endpoint in the association. The overlay link protocol MUST 3263 also provide replay prevention/detection. 3265 Traffic confidentiality When a node sends traffic to another 3266 endpoint, it MUST NOT be possible for a third party not involved 3267 in the association to determine the contents of that traffic. 3269 Any new overlay protocol MUST be defined via RFC 5226 Standards 3270 Action; see Section 13.11. 3272 5.6.1.1. HIP 3274 In a Host Identity Protocol Based Overlay Networking Environment (HIP 3275 BONE) [I-D.ietf-hip-bone] HIP [RFC5201] provides connection 3276 management (e.g., NAT traversal and mobility) and security for the 3277 overlay network. The P2PSIP Working Group has expressed interest in 3278 supporting a HIP-based link protocol. Such support would require 3279 specifying such details as: 3281 o How to issue certificates which provided identities meaningful to 3282 the HIP base exchange. We anticipate that this would require a 3283 mapping between ORCHIDs and NodeIds. 3284 o How to carry the HIP I1 and I2 messages. We anticipate that this 3285 would require defining a HIP Tunnel usage. 3286 o How to carry RELOAD messages over HIP. 3288 [I-D.ietf-hip-reload-instance] documents work in progress on using 3289 RELOAD with the HIP BONE. 3291 5.6.1.2. ICE-TCP 3293 The ICE-TCP draft [I-D.ietf-mmusic-ice-tcp] should allow TCP to be 3294 supported as an Overlay Link protocol that can be added using ICE. 3296 5.6.1.3. Message-oriented Transports 3298 Modern message-oriented transports offer high performance, good 3299 congestion control, and avoid head of line blocking in case of lost 3300 data. These characteristics make them preferable as underlying 3301 transport protocols for RELOAD links. SCTP without message ordering 3302 and DCCP are two examples of such protocols. However, currently they 3303 are not well-supported by commonly available NATs, and specifications 3304 for ICE session establishment are not available. 3306 5.6.1.4. Tunneled Transports 3308 As of the time of this writing, there is significant interest in the 3309 IETF community in tunneling other transports over UDP, motivated by 3310 the situation that UDP is well-supported by modern NAT hardware, and 3311 similar performance can be achieved to native implementation. 3312 Currently SCTP, DCCP, and a generic tunneling extension are being 3313 proposed for message-oriented protocols. Baset et al. have proposed 3314 tunneling TCP over UDP for similar reasons 3315 [I-D.baset-tsvwg-tcp-over-udp]. Once ICE traversal has been 3316 specified for these tunneled protocols, they should be 3317 straightforward to support as overlay link protocols. 3319 5.6.2. Framing Header 3321 In order to support unreliable links and to allow for quick detection 3322 of link failures when using reliable end-to-end transports, each 3323 message is wrapped in a very simple framing layer (FramedMessage) 3324 which is only used for each hop. This layer contains a sequence 3325 number which can then be used for ACKs. The same header is used for 3326 both reliable and unreliable transports for simplicity of 3327 implementation - not all aspects of the header apply to both types of 3328 transports. 3330 The definition of FramedMessage is: 3332 enum { data(128), ack(129), (255)} FramedMessageType; 3334 struct { 3335 FramedMessageType type; 3337 select (type) { 3338 case data: 3339 uint32 sequence; 3340 opaque message<0..2^24-1>; 3342 case ack: 3343 uint32 ack_sequence; 3344 uint32 received; 3345 }; 3346 } FramedMessage; 3348 The type field of the PDU is set to indicate whether the message is 3349 data or an acknowledgement. 3351 If the message is of type "data", then the remainder of the PDU is as 3352 follows: 3354 sequence 3355 the sequence number. This increments by 1 for each framed message 3356 sent over this transport session. 3358 message 3359 the message that is being transmitted. 3361 Each connection has it own sequence number space. Initially the 3362 value is zero and it increments by exactly one for each message sent 3363 over that connection. 3365 When the receiver receives a message, it SHOULD immediately send an 3366 ACK message. The receiver MUST keep track of the 32 most recent 3367 sequence numbers received on this association in order to generate 3368 the appropriate ack. 3370 If the PDU is of type "ack", the contents are as follows: 3372 ack_sequence 3373 The sequence number of the message being acknowledged. 3375 received 3376 A bitmask indicating if each of the previous 32 sequence numbers 3377 before this packet has been among the 32 packets most recently 3378 received on this connection. When a packet is received with a 3379 sequence number N, the receiver looks at the sequence number of 3380 the previously 32 packets received on this connection. Call the 3381 previously received packet number M. For each of the previous 32 3382 packets, if the sequence number M is less than N but greater than 3383 N-32, the N-M bit of the received bitmask is set to one; otherwise 3384 it is zero. Note that a bit being set to one indicates positively 3385 that a particular packet was received, but a bit being set to zero 3386 means only that it is unknown whether or not the packet has been 3387 received, because it might have been received before the 32 most 3388 recently received packets. 3390 The received field bits in the ACK provide a high degree of 3391 redundancy so that the sender can figure out which packets the 3392 receiver has received and can then estimate packet loss rates. If 3393 the sender also keeps track of the time at which recent sequence 3394 numbers have been sent, the RTT can be estimated. 3396 5.6.3. Simple Reliability 3398 When RELOAD is carried over DTLS or another unreliable link protocol, 3399 it needs to be used with a reliability and congestion control 3400 mechanism, which is provided on a hop-by-hop basis. The basic 3401 principle is that each message, regardless of whether or not it 3402 carries a request or response, will get an ACK and be reliably 3403 retransmitted. The receiver's job is very simple, limited to just 3404 sending ACKs. All the complexity is at the sender side. This allows 3405 the sending implementation to trade off performance versus 3406 implementation complexity without affecting the wire protocol. 3408 5.6.3.1. Retransmission and Flow Control 3410 Because the receiver's role is limited to providing packet 3411 acknowledgements, a wide variety of congestion control algorithms can 3412 be implemented on the sender side while using the same basic wire 3413 protocol. Senders MUST implement a retransmission and congestion 3414 control scheme no more aggressive then TFRC[RFC5348]. One way to do 3415 that is for senders to implement the scheme in the following section. 3416 Another alternative would be TFRC-SP [RFC4828] and use the received 3417 bitmask to allow the sender to compute packet loss event rates. 3419 5.6.3.1.1. Trivial Retransmission 3421 A node SHOULD retransmit a message if it has not received an ACK 3422 after an interval of RTO ("Retransmission TimeOut"). The node MUST 3423 double the time to wait after each retransmission. In each 3424 retransmission, the sequence number is incremented. 3426 The RTO is an estimate of the round-trip time (RTT). Implementations 3427 can use a static value for RTO or a dynamic estimate which will 3428 result in better performance. For implementations that use a static 3429 value, the default value for RTO is 500 ms. Nodes MAY use smaller 3430 values of RTO if it is known that all nodes are within the local 3431 network. The default RTO MAY be chosen larger, and this is 3432 RECOMMENDED if it is known in advance (such as on high latency access 3433 links) that the round-trip time is larger. 3435 Implementations that use a dynamic estimate to compute the RTO MUST 3436 use the algorithm described in RFC 2988[RFC2988], with the exception 3437 that the value of RTO SHOULD NOT be rounded up to the nearest second 3438 but instead rounded up to the nearest millisecond. The RTT of a 3439 successful STUN transaction from the ICE stage is used as the initial 3440 measurement for formula 2.2 of RFC 2988. The sender keeps track of 3441 the time each message was sent for all recently sent messages. Any 3442 time an ACK is received, the sender can compute the RTT for that 3443 message by looking at the time the ACK was received and the time when 3444 the message was sent. This is used as a subsequent RTT measurement 3445 for formula 2.3 of RFC 2988 to update the RTO estimate. (Note that 3446 because retransmissions receive new sequence numbers, all received 3447 ACKs are used.) 3449 The value for RTO is calculated separately for each DTLS session. 3451 Retransmissions continue until a response is received, or until a 3452 total of 5 requests have been sent or there has been a hard ICMP 3453 error [RFC1122] or a TLS alert. The sender knows a response was 3454 received when it receives an ACK with a sequence number that 3455 indicates it is a response to one of the transmissions of this 3456 messages. For example, assuming an RTO of 500 ms, requests would be 3457 sent at times 0 ms, 500 ms, 1500 ms, 3500 ms, and 7500 ms. If all 3458 retransmissions for a message fail, then the sending node SHOULD 3459 close the connection routing the message. 3461 To determine when a link may be failing without waiting for the final 3462 timeout, observe when no ACKs have been received for an entire RTO 3463 interval, and then wait for three retransmissions to occur beyond 3464 that point. If no ACKs have been received by the time the third 3465 retransmission occurs, it is RECOMMENDED that the link be removed 3466 from the routing table. The link MAY be restored to the routing 3467 table if ACKs resume before the connection is closed, as described 3468 above. 3470 Once an ACK has been received for a message, the next message can be 3471 sent, but the peer SHOULD ensure that there is at least 10 ms between 3472 sending any two messages. The only time a value less than 10 ms can 3473 be used is when it is known that all nodes are on a network that can 3474 support retransmissions faster than 10 ms with no congestion issues. 3476 5.6.4. DTLS/UDP with SR 3478 This overlay link protocol consists of DTLS over UDP while 3479 implementing the Simple Reliability protocol. STUN Connectivity 3480 checks and keepalives are used. 3482 5.6.5. TLS/TCP with FH, No-ICE 3484 This overlay link protocol consists of TLS over TCP with the framing 3485 header. Because ICE is not used, STUN connectivity checks are not 3486 used upon establishing the TCP connection, nor are they used for 3487 keepalives. 3489 Because the TCP layer's application-level timeout is too slow to be 3490 useful for overlay routing, the Overlay Link implementation MUST use 3491 the framing header to measure the RTT of the connection and calculate 3492 an RTO as specified in Section 2 of [RFC2988]. The resulting RTO is 3493 not used for retransmissions, but as a timeout to indicate when the 3494 link SHOULD be removed from the routing table. It is RECOMMENDED 3495 that such a connection be retained for 30s to determine if the 3496 failure was transient before concluding the link has failed 3497 permanently. 3499 When sending candidates for TLS/TCP with FH, No-ICE, a passive 3500 candidate MUST be provided. The following table shows which side of 3501 the exchange initiates the connection depending on whether they 3502 provided ICE or No-ICE candidates. Note that the active TCP role 3503 does not alter the TLS server/client determination. 3505 +----------------------+----------+-----------------+ 3506 | Offeror | Answerer | TCP Active Role | 3507 +----------------------+----------+-----------------+ 3508 | ICE | No-ICE | Offeror | 3509 | No-ICE | ICE | Answerer | 3510 | No-ICE | No-ICE | Offeror | 3511 +----------------------+----------+-----------------+ 3513 Table 1: Determining Active Role for No-ICE 3515 5.6.6. DTLS/UDP with SR, No-ICE 3517 This overlay link protocol consists of DTLS over UDP while 3518 implementing the Simple Reliability protocol. Because ICE is not 3519 used, no STUN connectivity checks or keepalives are used. 3521 5.7. Fragmentation and Reassembly 3523 In order to allow transmission over datagram protocols such as DTLS, 3524 RELOAD messages may be fragmented. 3526 Any node along the path can fragment the message but only the final 3527 destination reassembles the fragments. When a node takes a packet 3528 and fragments it, each fragment has a full copy of the Forwarding 3529 Header but the data after the Forwarding Header is broken up in 3530 appropriate sized chunks. The size of the payload chunks needs to 3531 take into account space to allow the via and destination lists to 3532 grow. Each fragment MUST contain a full copy of the via and 3533 destination list and MUST contain at least 256 bytes of the message 3534 body. If the via and destination list are so large that this is not 3535 possible, RELOAD fragmentation is not performed and IP-layer 3536 fragmentation is allowed to occur. When a message must be 3537 fragmented, it SHOULD be split into equal-sized fragments that are no 3538 larger than the PMTU of the next overlay link minus 32 bytes. This 3539 is to allow the via list to grow before further fragmentation is 3540 required. 3542 Note that this fragmentation is not optimal for the end-to-end path - 3543 a message may be refragmented multiple times as it traverses the 3544 overlay but is only assembled at the final destination. This option 3545 has been chosen as it is far easier to implement than e2e PMTU 3546 discovery across an ever-changing overlay, and it effectively 3547 addresses the reliability issues of relying on IP-layer 3548 fragmentation. However, PING can be used to allow e2e PMTU to be 3549 implemented if desired. 3551 Upon receipt of a fragmented message by the intended peer, the peer 3552 holds the fragments in a holding buffer until the entire message has 3553 been received. The message is then reassembled into a single message 3554 and processed. In order to mitigate denial of service attacks, 3555 receivers SHOULD time out incomplete fragments after maximum request 3556 lifetime (15 seconds). Note this time was derived from looking at 3557 the end to end retransmission time and saving fragments long enough 3558 for the full end to end retransmissions to take place. Ideally the 3559 receiver would have enough buffer space to deal with as many 3560 fragments as can arrive in the maximum request lifetime. However, if 3561 the receiver runs out of buffer space to reassemble the messages it 3562 MUST drop the message. 3564 When a message is fragmented, the fragment offset value is stored in 3565 the lower 24 bits of the fragment field of the forwarding header. 3566 The offset is the number of bytes between the end of the forwarding 3567 header and the start of the data. The first fragment therefore has 3568 an offset of 0. The first and last bit indicators MUST be 3569 appropriately set. If the message is not fragmented, then both the 3570 first and last fragment bits are set to 1 and the offset is 0 3571 resulting in a fragment value of 0xC0000000. 3573 6. Data Storage Protocol 3575 RELOAD provides a set of generic mechanisms for storing and 3576 retrieving data in the Overlay Instance. These mechanisms can be 3577 used for new applications simply by defining new code points and a 3578 small set of rules. No new protocol mechanisms are required. 3580 The basic unit of stored data is a single StoredData structure: 3582 struct { 3583 uint32 length; 3584 uint64 storage_time; 3585 uint32 lifetime; 3586 StoredDataValue value; 3587 Signature signature; 3588 } StoredData; 3590 The contents of this structure are as follows: 3592 length 3593 The size of the StoredData structure in octets excluding the size 3594 of length itself. 3596 storage_time 3597 The time when the data was stored represented as the number of 3598 milliseconds elapsed since midnight Jan 1, 1970 UTC not counting 3599 leap seconds. This will have the same values for seconds as 3600 standard UNIX time or POSIX time. More information can be found 3601 at [UnixTime]. Any attempt to store a data value with a storage 3602 time before that of a value already stored at this location MUST 3603 generate a Error_Data_Too_Old error. This prevents rollback 3604 attacks. Note that this does not require synchronized clocks: 3605 the receiving peer uses the storage time in the previous store, 3606 not its own clock. 3608 A node that is attempting to store new data in response to a user 3609 request (rather than as an overlay maintenance operation such as 3610 occurs during unpartitioning) is rejected with an 3611 Error_Data_Too_Old error, the node MAY elect to perform its store 3612 using a storage_time that increments the value used with the 3613 previous store. This situation may occur when the clocks of nodes 3614 storing to this location are not properly synchronized. 3616 lifetime 3617 The validity period for the data, in seconds, starting from the 3618 time of store. 3620 value 3621 The data value itself, as described in Section 6.2. 3623 signature 3624 A signature as defined in Section 6.1. 3626 Each Resource-ID specifies a single location in the Overlay Instance. 3627 However, each location may contain multiple StoredData values 3628 distinguished by Kind-ID. The definition of a kind describes both 3629 the data values which may be stored and the data model of the data. 3630 Some data models allow multiple values to be stored under the same 3631 Kind-ID. Section Section 6.2 describes the available data models. 3632 Thus, for instance, a given Resource-ID might contain a single-value 3633 element stored under Kind-ID X and an array containing multiple 3634 values stored under Kind-ID Y. 3636 6.1. Data Signature Computation 3638 Each StoredData element is individually signed. However, the 3639 signature also must be self-contained and cover the Kind-ID and 3640 Resource-ID even though they are not present in the StoredData 3641 structure. The input to the signature algorithm is: 3643 resource_id || kind || storage_time || StoredDataValue || 3644 SignerIdentity 3646 Where || indicates concatenation. 3648 Where these values are: 3650 resource 3651 The resource ID where this data is stored. 3653 kind 3654 The Kind-ID for this data. 3656 storage_time 3658 The contents of the storage_time data value. 3659 StoredDataValue 3660 The contents of the stored data value, as described in the 3661 previous sections. 3663 SignerIdentity 3664 The signer identity as defined in Section 5.3.4. 3666 Once the signature has been computed, the signature is represented 3667 using a signature element, as described in Section 5.3.4. 3669 6.2. Data Models 3671 The protocol currently defines the following data models: 3673 o single value 3674 o array 3675 o dictionary 3677 These are represented with the StoredDataValue structure: 3679 enum { reserved(0), single_value(1), array(2), 3680 dictionary(3), (255)} DataModel; 3682 struct { 3683 Boolean exists; 3684 opaque value<0..2^32-1>; 3685 } DataValue; 3687 struct { 3688 select (DataModel) { 3689 case single_value: 3690 DataValue single_value_entry; 3692 case array: 3693 ArrayEntry array_entry; 3695 case dictionary: 3696 DictionaryEntry dictionary_entry; 3698 /* This structure may be extended */ 3699 } ; 3700 } StoredDataValue; 3702 We now discuss the properties of each data model in turn: 3704 6.2.1. Single Value 3706 A single-value element is a simple sequence of bytes. There may be 3707 only one single-value element for each Resource-ID, Kind-ID pair. 3709 A single value element is represented as a DataValue, which contains 3710 the following two elements: 3712 exists 3713 This value indicates whether the value exists at all. If it is 3714 set to False, it means that no value is present. If it is True, 3715 that means that a value is present. This gives the protocol a 3716 mechanism for indicating nonexistence as opposed to emptiness. 3718 value 3719 The stored data. 3721 6.2.2. Array 3723 An array is a set of opaque values addressed by an integer index. 3724 Arrays are zero based. Note that arrays can be sparse. For 3725 instance, a Store of "X" at index 2 in an empty array produces an 3726 array with the values [ NA, NA, "X"]. Future attempts to fetch 3727 elements at index 0 or 1 will return values with "exists" set to 3728 False. 3730 A array element is represented as an ArrayEntry: 3732 struct { 3733 uint32 index; 3734 DataValue value; 3735 } ArrayEntry; 3737 The contents of this structure are: 3739 index 3740 The index of the data element in the array. 3742 value 3743 The stored data. 3745 6.2.3. Dictionary 3747 A dictionary is a set of opaque values indexed by an opaque key with 3748 one value for each key. A single dictionary entry is represented as 3749 follows: 3751 A dictionary element is represented as a DictionaryEntry: 3753 typedef opaque DictionaryKey<0..2^16-1>; 3755 struct { 3756 DictionaryKey key; 3757 DataValue value; 3758 } DictionaryEntry; 3760 The contents of this structure are: 3762 key 3763 The dictionary key for this value. 3765 value 3766 The stored data. 3768 6.3. Access Control Policies 3770 Every kind which is storable in an overlay MUST be associated with an 3771 access control policy. This policy defines whether a request from a 3772 given node to operate on a given value should succeed or fail. It is 3773 anticipated that only a small number of generic access control 3774 policies are required. To that end, this section describes a small 3775 set of such policies and Section 13.4 establishes a registry for new 3776 policies if required. Each policy has a short string identifier 3777 which is used to reference it in the configuration document. 3779 6.3.1. USER-MATCH 3781 In the USER-MATCH policy, a given value MUST be written (or 3782 overwritten) if and only if the request is signed with a key 3783 associated with a certificate whose user name hashes (using the hash 3784 function for the overlay) to the Resource-ID for the resource. 3785 Recall that the certificate may, depending on the overlay 3786 configuration, be self-signed. 3788 6.3.2. NODE-MATCH 3790 In the NODE-MATCH policy, a given value MUST be written (or 3791 overwritten) if and only if the request is signed with a key 3792 associated with a certificate whose Node-ID hashes (using the hash 3793 function for the overlay) to the Resource-ID for the resource. 3795 6.3.3. USER-NODE-MATCH 3797 The USER-NODE-MATCH policy may only be used with dictionary types. 3798 In the USER-NODE-MATCH policy, a given value MUST be written (or 3799 overwritten) if and only if the request is signed with a key 3800 associated with a certificate whose user name hashes (using the hash 3801 function for the overlay) to the Resource-ID for the resource. In 3802 addition, the dictionary key MUST be equal to the Node-ID in the 3803 certificate. 3805 6.3.4. NODE-MULTIPLE 3807 In the NODE-MULTIPLE policy, a given value MUST be written (or 3808 overwritten) if and only if the request is signed with a key 3809 associated with a certificate containing a Node-ID such that 3810 H(Node-ID || i) is equal to the Resource-ID for some small integer 3811 value of i. When this policy is in use, the maximum value of i MUST 3812 be specified in the kind definition. 3814 Note that as i is not carried on the wire, the verifier MUST iterate 3815 through potential i values up to the maximum value in order to 3816 determine whether a store is acceptable. 3818 6.4. Data Storage Methods 3820 RELOAD provides several methods for storing and retrieving data: 3822 o Store values in the overlay 3823 o Fetch values from the overlay 3824 o Stat: get metadata about values in the overlay 3825 o Find the values stored at an individual peer 3827 These methods are each described in the following sections. 3829 6.4.1. Store 3831 The Store method is used to store data in the overlay. The format of 3832 the Store request depends on the data model which is determined by 3833 the kind. 3835 6.4.1.1. Request Definition 3837 A StoreReq message is a sequence of StoreKindData values, each of 3838 which represents a sequence of stored values for a given kind. The 3839 same Kind-ID MUST NOT be used twice in a given store request. Each 3840 value is then processed in turn. These operations MUST be atomic. 3841 If any operation fails, the state MUST be rolled back to before the 3842 request was received. 3844 The store request is defined by the StoreReq structure: 3846 struct { 3847 KindId kind; 3848 uint64 generation_counter; 3849 StoredData values<0..2^32-1>; 3850 } StoreKindData; 3852 struct { 3853 ResourceId resource; 3854 uint8 replica_number; 3855 StoreKindData kind_data<0..2^32-1>; 3856 } StoreReq; 3858 A single Store request stores data of a number of kinds to a single 3859 resource location. The contents of the structure are: 3861 resource 3862 The resource to store at. 3864 replica_number 3865 The number of this replica. When a storing peer saves replicas to 3866 other peers each peer is assigned a replica number starting from 1 3867 and sent in the Store message. This field is set to 0 when a node 3868 is storing its own data. This allows peers to distinguish replica 3869 writes from original writes. 3871 kind_data 3872 A series of elements, one for each kind of data to be stored. 3874 If the replica number is zero, then the peer MUST check that it is 3875 responsible for the resource and, if not, reject the request. If the 3876 replica number is nonzero, then the peer MUST check that it expects 3877 to be a replica for the resource and that the request sender is 3878 consistent with being the responsible node (i.e., that the receiving 3879 peer does not know of a better node) and, if not, reject the request. 3881 Each StoreKindData element represents the data to be stored for a 3882 single Kind-ID. The contents of the element are: 3884 kind 3885 The Kind-ID. Implementations MUST reject requests corresponding 3886 to unknown kinds. 3888 generation 3889 The expected current state of the generation counter 3890 (approximately the number of times this object has been written; 3891 see below for details). 3893 values 3894 The value or values to be stored. This may contain one or more 3895 stored_data values depending on the data model associated with 3896 each kind. 3898 The peer MUST perform the following checks: 3900 o The kind_id is known and supported. 3901 o The signatures over each individual data element (if any) are 3902 valid. If this check fails, the request MUST be rejected with an 3903 Error_Forbidden error. 3905 o Each element is signed by a credential which is authorized to 3906 write this kind at this Resource-ID. If this check fails, the 3907 request MUST be rejected with an Error_Forbidden error. 3908 o For original (non-replica) stores, the peer MUST check that if the 3909 generation-counter is non-zero, it equals the current value of the 3910 generation-counter for this kind. This feature allows the 3911 generation counter to be used in a way similar to the HTTP Etag 3912 feature. 3913 o For replica Stores, the peer MUST set the generation counter to 3914 match the generation_counter in the message, and MUST NOT check 3915 the generation counter against the current value. Replica Stores 3916 MUST NOT use a generation counter of 0. 3917 o The storage time values are greater than that of any value which 3918 would be replaced by this Store. 3919 o The size and number of the stored values is consistent with the 3920 limits specified in the overlay configuration. 3922 If all these checks succeed, the peer MUST attempt to store the data 3923 values. For non-replica stores, if the store succeeds and the data 3924 is changed, then the peer must increase the generation counter by at 3925 least one. If there are multiple stored values in a single 3926 StoreKindData, it is permissible for the peer to increase the 3927 generation counter by only 1 for the entire Kind-ID, or by 1 or more 3928 than one for each value. Accordingly, all stored data values must 3929 have a generation counter of 1 or greater. 0 is used in the Store 3930 request to indicate that the generation counter should be ignored for 3931 processing this request; however the responsible peer should increase 3932 the stored generation counter and should return the correct 3933 generation counter in the response. 3935 When a peer stores data previously stored by another node (e.g., for 3936 replicas or topology shifts) it MUST adjust the lifetime value 3937 downward to reflect the amount of time the value was stored at the 3938 peer. 3940 Unless otherwise specified by the usage, if a peer attempts to store 3941 data previously stored by another node (e.g., for replicas or 3942 topology shifts) and that store fails with either an 3943 Error_Generation_Counter_Too_Low or an Error_Data_Too old error, the 3944 peer MUST fetch the newer data from the peer generating the error and 3945 use that to replace its own copy. This rule allows resynchronization 3946 after partitions heal. 3948 The properties of stores for each data model are as follows: 3950 Single-value: 3951 A store of a new single-value element creates the element if it 3952 does not exist and overwrites any existing value with the new 3953 value. 3955 Array: 3956 A store of an array entry replaces (or inserts) the given value at 3957 the location specified by the index. Because arrays are sparse, a 3958 store past the end of the array extends it with nonexistent values 3959 (exists=False) as required. A store at index 0xffffffff places 3960 the new value at the end of the array regardless of the length of 3961 the array. The resulting StoredData has the correct index value 3962 when it is subsequently fetched. 3964 Dictionary: 3965 A store of a dictionary entry replaces (or inserts) the given 3966 value at the location specified by the dictionary key. 3968 The following figure shows the relationship between these structures 3969 for an example store which stores the following values at resource 3970 "1234" 3972 o The value "abc" in the single value location for kind X 3973 o The value "foo" at index 0 in the array for kind Y 3974 o The value "bar" at index 1 in the array for kind Y 3975 Store 3976 resource=1234 3977 replica_number = 0 3978 / \ 3979 / \ 3980 StoreKindData StoreKindData 3981 kind=X (Single-Value) kind=Y (Array) 3982 generation_counter = 99 generation_counter = 107 3983 | /\ 3984 | / \ 3985 StoredData / \ 3986 storage_time = xxxxxxx / \ 3987 lifetime = 86400 / \ 3988 signature = XXXX / \ 3989 | | | 3990 | StoredData StoredData 3991 | storage_time = storage_time = 3992 | yyyyyyyy zzzzzzz 3993 | lifetime = 86400 lifetime = 33200 3994 | signature = YYYY signature = ZZZZ 3995 | | | 3996 StoredDataValue | | 3997 value="abc" | | 3998 | | 3999 StoredDataValue StoredDataValue 4000 index=0 index=1 4001 value="foo" value="bar" 4003 6.4.1.2. Response Definition 4005 In response to a successful Store request the peer MUST return a 4006 StoreAns message containing a series of StoreKindResponse elements 4007 containing the current value of the generation counter for each 4008 Kind-ID, as well as a list of the peers where the data will be 4009 replicated by the node processing the request.. 4011 struct { 4012 KindId kind; 4013 uint64 generation_counter; 4014 NodeId replicas<0..2^16-1>; 4015 } StoreKindResponse; 4017 struct { 4018 StoreKindResponse kind_responses<0..2^16-1>; 4019 } StoreAns; 4021 The contents of each StoreKindResponse are: 4023 kind 4024 The Kind-ID being represented. 4026 generation 4027 The current value of the generation counter for that Kind-ID. 4029 replicas 4030 The list of other peers at which the data was/will be replicated. 4031 In overlays and applications where the responsible peer is 4032 intended to store redundant copies, this allows the storing peer 4033 to independently verify that the replicas have in fact been 4034 stored. It does this verification by using the Stat method. Note 4035 that the storing peer is not require to perform this verification. 4037 The response itself is just StoreKindResponse values packed end-to- 4038 end. 4040 If any of the generation counters in the request precede the 4041 corresponding stored generation counter, then the peer MUST fail the 4042 entire request and respond with an Error_Generation_Counter_Too_Low 4043 error. The error_info in the ErrorResponse MUST be a StoreAns 4044 response containing the correct generation counter for each kind and 4045 the replica list, which will be empty. For original (non-replica) 4046 stores, a node which receives such an error SHOULD attempt to fetch 4047 the data and, if the storage_time value is newer, replace its own 4048 data with that newer data. This rule improves data consistency in 4049 the case of partitions and merges. 4051 If the data being stored is too large for the allowed limit by the 4052 given usage, then the peer MUST fail the request and generate an 4053 Error_Data_Too_Large error. 4055 If any type of request tries to access a data kind that the node does 4056 not know about, an Error_Unknown_Kind MUST be generated. The 4057 error_info in the Error_Response is: 4059 KindId unknown_kinds<0..2^8-1>; 4061 which lists all the kinds that were unrecognized. 4063 6.4.1.3. Removing Values 4065 This version of RELOAD (unlike previous versions) does not have an 4066 explicit Remove operation. Rather, values are Removed by storing 4067 "nonexistent" values in their place. Each DataValue contains a 4068 boolean value called "exists" which indicates whether a value is 4069 present at that location. In order to effectively remove a value, 4070 the owner stores a new DataValue with: 4072 exists = false 4073 value = {} (0 length) 4075 Storing nodes MUST treat these nonexistent values the same way they 4076 treat any other stored value, including overwriting the existing 4077 value, replicating them, and aging them out as necessary when 4078 lifetime expires. When a stored nonexistent value's lifetime 4079 expires, it is simply removed from the storing node like any other 4080 stored value expiration. Note that in the case of arrays and 4081 dictionaries, this may create an implicit, unsigned "nonexistent" 4082 value to represent a gap in the data structure. However, this value 4083 isn't persistent nor is it replicated. It is simply synthesized by 4084 the storing node. 4086 6.4.2. Fetch 4088 The Fetch request retrieves one or more data elements stored at a 4089 given Resource-ID. A single Fetch request can retrieve multiple 4090 different kinds. 4092 6.4.2.1. Request Definition 4094 struct { 4095 int32 first; 4096 int32 last; 4097 } ArrayRange; 4099 struct { 4100 KindId kind; 4101 uint64 generation; 4102 uint16 length; 4104 select (model) { 4105 case single_value: ; /* Empty */ 4107 case array: 4108 ArrayRange indices<0..2^16-1>; 4110 case dictionary: 4111 DictionaryKey keys<0..2^16-1>; 4113 /* This structure may be extended */ 4115 } model_specifier; 4116 } StoredDataSpecifier; 4118 struct { 4119 ResourceId resource; 4120 StoredDataSpecifier specifiers<0..2^16-1>; 4121 } FetchReq; 4123 The contents of the Fetch requests are as follows: 4125 resource 4126 The resource ID to fetch from. 4128 specifiers 4129 A sequence of StoredDataSpecifier values, each specifying some of 4130 the data values to retrieve. 4132 Each StoredDataSpecifier specifies a single kind of data to retrieve 4133 and (if appropriate) the subset of values that are to be retrieved. 4134 The contents of the StoredDataSpecifier structure are as follows: 4136 kind 4137 The Kind-ID of the data being fetched. Implementations SHOULD 4138 reject requests corresponding to unknown kinds unless specifically 4139 configured otherwise. 4141 model 4142 The data model of the data. This must be checked against the 4143 Kind-ID. 4145 generation 4146 The last generation counter that the requesting node saw. This 4147 may be used to avoid unnecessary fetches or it may be set to zero. 4149 length 4150 The length of the rest of the structure, thus allowing 4151 extensibility. 4153 model_specifier 4154 A reference to the data value being requested within the data 4155 model specified for the kind. For instance, if the data model is 4156 "array", it might specify some subset of the values. 4158 The model_specifier is as follows: 4160 o If the data model is single value, the specifier is empty. 4161 o If the data model is array, the specifier contains a list of 4162 ArrayRange elements, each of which contains two integers. The 4163 first integer is the beginning of the range and the second is the 4164 end of the range. 0 is used to indicate the first element and 4165 0xffffffff is used to indicate the final element. The first 4166 integer must be less than the second. While multiple ranges MAY 4167 be specified, they MUST NOT overlap. 4168 o If the data model is dictionary then the specifier contains a list 4169 of the dictionary keys being requested. If no keys are specified, 4170 than this is a wildcard fetch and all key-value pairs are 4171 returned. 4173 The generation-counter is used to indicate the requester's expected 4174 state of the storing peer. If the generation-counter in the request 4175 matches the stored counter, then the storing peer returns a response 4176 with no StoredData values. 4178 Note that because the certificate for a user is typically stored at 4179 the same location as any data stored for that user, a requesting node 4180 that does not already have the user's certificate should request the 4181 certificate in the Fetch as an optimization. 4183 6.4.2.2. Response Definition 4185 The response to a successful Fetch request is a FetchAns message 4186 containing the data requested by the requester. 4188 struct { 4189 KindId kind; 4190 uint64 generation; 4191 StoredData values<0..2^32-1>; 4192 } FetchKindResponse; 4194 struct { 4195 FetchKindResponse kind_responses<0..2^32-1>; 4196 } FetchAns; 4198 The FetchAns structure contains a series of FetchKindResponse 4199 structures. There MUST be one FetchKindResponse element for each 4200 Kind-ID in the request. 4202 The contents of the FetchKindResponse structure are as follows: 4204 kind 4205 the kind that this structure is for. 4207 generation 4208 the generation counter for this kind. 4210 values 4211 the relevant values. If the generation counter in the request 4212 matches the generation-counter in the stored data, then no 4213 StoredData values are returned. Otherwise, all relevant data 4214 values MUST be returned. A nonexistent value is represented with 4215 "exists" set to False. 4217 There is one subtle point about signature computation on arrays. If 4218 the storing node uses the append feature (where the 4219 index=0xffffffff), then the index in the StoredData that is returned 4220 will not match that used by the storing node, which would break the 4221 signature. In order to avoid this issue, the index value in the 4222 array is set to zero before the signature is computed. This implies 4223 that malicious storing nodes can reorder array entries without being 4224 detected. 4226 6.4.3. Stat 4228 The Stat request is used to get metadata (length, generation counter, 4229 digest, etc.) for a stored element without retrieving the element 4230 itself. The name is from the UNIX stat(2) system call which performs 4231 a similar function for files in a file system. It also allows the 4232 requesting node to get a list of matching elements without requesting 4233 the entire element. 4235 6.4.3.1. Request Definition 4237 The Stat request is identical to the Fetch request. It simply 4238 specifies the elements to get metadata about. 4240 struct { 4241 ResourceId resource; 4242 StoredDataSpecifier specifiers<0..2^16-1>; 4243 } StatReq; 4245 6.4.3.2. Response Definition 4247 The Stat response contains the same sort of entries that a Fetch 4248 response would contain; however, instead of containing the element 4249 data it contains metadata. 4251 struct { 4252 Boolean exists; 4253 uint32 value_length; 4254 HashAlgorithm hash_algorithm; 4255 opaque hash_value<0..255>; 4256 } MetaData; 4258 struct { 4259 uint32 index; 4260 MetaData value; 4261 } ArrayEntryMeta; 4263 struct { 4264 DictionaryKey key; 4265 MetaData value; 4266 } DictionaryEntryMeta; 4268 struct { 4269 select (model) { 4270 case single_value: 4271 MetaData single_value_entry; 4273 case array: 4275 ArrayEntryMeta array_entry; 4277 case dictionary: 4278 DictionaryEntryMeta dictionary_entry; 4280 /* This structure may be extended */ 4281 } ; 4282 } MetaDataValue; 4284 struct { 4285 uint32 value_length; 4286 uint64 storage_time; 4287 uint32 lifetime; 4288 MetaDataValue metadata; 4289 } StoredMetaData; 4291 struct { 4292 KindId kind; 4293 uint64 generation; 4294 StoredMetaData values<0..2^32-1>; 4295 } StatKindResponse; 4297 struct { 4298 StatKindResponse kind_responses<0..2^32-1>; 4299 } StatAns; 4301 The structures used in StatAns parallel those used in FetchAns: a 4302 response consists of multiple StatKindResponse values, one for each 4303 kind that was in the request. The contents of the StatKindResponse 4304 are the same as those in the FetchKindResponse, except that the 4305 values list contains StoredMetaData entries instead of StoredData 4306 entries. 4308 The contents of the StoredMetaData structure are the same as the 4309 corresponding fields in StoredData except that there is no signature 4310 field and the value is a MetaDataValue rather than a StoredDataValue. 4312 A MetaDataValue is a variant structure, like a StoredDataValue, 4313 except for the types of each arm, which replace DataValue with 4314 MetaData. 4316 The only really new structure is MetaData, which has the following 4317 contents: 4319 exists 4320 Same as in DataValue 4322 value_length 4323 The length of the stored value. 4325 hash_algorithm 4326 The hash algorithm used to perform the digest of the value. 4328 hash_value 4329 A digest of the value using hash_algorithm. 4331 6.4.4. Find 4333 The Find request can be used to explore the Overlay Instance. A Find 4334 request for a Resource-ID R and a Kind-ID T retrieves the Resource-ID 4335 (if any) of the resource of kind T known to the target peer which is 4336 closest to R. This method can be used to walk the Overlay Instance by 4337 iteratively fetching R_n+1=nearest(1 + R_n). 4339 6.4.4.1. Request Definition 4341 The FindReq message contains a Resource-ID and a series of Kind-IDs 4342 identifying the resource the peer is interested in. 4344 struct { 4345 ResourceId resource; 4346 KindId kinds<0..2^8-1>; 4347 } FindReq; 4349 The request contains a list of Kind-IDs which the Find is for, as 4350 indicated below: 4352 resource 4353 The desired Resource-ID 4355 kinds 4356 The desired Kind-IDs. Each value MUST only appear once, and if 4357 not the request MUST be rejected with an error. 4359 6.4.4.2. Response Definition 4361 A response to a successful Find request is a FindAns message 4362 containing the closest Resource-ID on the peer for each kind 4363 specified in the request. 4365 struct { 4366 KindId kind; 4367 ResourceId closest; 4368 } FindKindData; 4370 struct { 4371 FindKindData results<0..2^16-1>; 4372 } FindAns; 4374 If the processing peer is not responsible for the specified 4375 Resource-ID, it SHOULD return an Error_Not_Found error code. 4377 For each Kind-ID in the request the response MUST contain a 4378 FindKindData indicating the closest Resource-ID for that Kind-ID, 4379 unless the kind is not allowed to be used with Find in which case a 4380 FindKindData for that Kind-ID MUST NOT be included in the response. 4381 If a Kind-ID is not known, then the corresponding Resource-ID MUST be 4382 0. Note that different Kind-IDs may have different closest Resource- 4383 IDs. 4385 The response is simply a series of FindKindData elements, one per 4386 kind, concatenated end-to-end. The contents of each element are: 4388 kind 4389 The Kind-ID. 4391 closest 4392 The closest resource ID to the specified resource ID. This is 0 4393 if no resource ID is known. 4395 Note that the response does not contain the contents of the data 4396 stored at these Resource-IDs. If the requester wants this, it must 4397 retrieve it using Fetch. 4399 6.4.5. Defining New Kinds 4401 There are two ways to define a new kind. The first is by writing a 4402 document and registering the kind-id with IANA. This is the 4403 preferred method for kinds which may be widely used and reused. The 4404 second method is to simply define the kind and its parameters in the 4405 configuration document using the section of kind-id space set aside 4406 for private use. This method MAY be used to define ad hoc kinds in 4407 new overlays. 4409 However a kind is defined, the definition must include: 4411 o The meaning of the data to be stored (in some textual form). 4412 o The Kind-ID. 4413 o The data model (single value, array, dictionary, etc). 4414 o The access control model. 4416 In addition, when kinds are registered with IANA, each kind is 4417 assigned a short string name which is used to refer to it in 4418 configuration documents. 4420 While each kind needs to define what data model is used for its data, 4421 that does not mean that it must define new data models. Where 4422 practical, kinds should use the existing data models. The intention 4423 is that the basic data model set be sufficient for most applications/ 4424 usages. 4426 7. Certificate Store Usage 4428 The Certificate Store usage allows a peer to store its certificate in 4429 the overlay, thus avoiding the need to send a certificate in each 4430 message - a reference may be sent instead. 4432 A user/peer MUST store its certificate at Resource-IDs derived from 4433 two Resource Names: 4435 o The user name in the certificate. 4436 o The Node-ID in the certificate. 4438 Note that in the second case the certificate is not stored at the 4439 peer's Node-ID but rather at a hash of the peer's Node-ID. The 4440 intention here (as is common throughout RELOAD) is to avoid making a 4441 peer responsible for its own data. 4443 A peer MUST ensure that the user's certificates are stored in the 4444 Overlay Instance. New certificates are stored at the end of the 4445 list. This structure allows users to store an old and a new 4446 certificate that both have the same Node-ID, which allows for 4447 migration of certificates when they are renewed. 4449 This usage defines the following kinds: 4451 Name: CERTIFICATE_BY_NODE 4452 Data Model: The data model for CERTIFICATE_BY_NODE data is array. 4454 Access Control: NODE-MATCH. 4456 Name: CERTIFICATE_BY_USER 4458 Data Model: The data model for CERTIFICATE_BY_USER data is array. 4460 Access Control: USER-MATCH. 4462 8. TURN Server Usage 4464 The TURN server usage allows a RELOAD peer to advertise that it is 4465 prepared to be a TURN server as defined in [RFC5766]. When a node 4466 starts up, it joins the overlay network and forms several connections 4467 in the process. If the ICE stage in any of these connections returns 4468 a reflexive address that is not the same as the peer's perceived 4469 address, then the peer is behind a NAT and not a candidate for a TURN 4470 server. Additionally, if the peer's IP address is in the private 4471 address space range, then it is also not a candidate for a TURN 4472 server. Otherwise, the peer SHOULD assume it is a potential TURN 4473 server and follow the procedures below. 4475 If the node is a candidate for a TURN server it will insert some 4476 pointers in the overlay so that other peers can find it. The overlay 4477 configuration file specifies a turn-density parameter that indicates 4478 how many times each TURN server should record itself in the overlay. 4479 Typically this should be set to the reciprocal of the estimate of 4480 what percentage of peers will act as TURN servers. If the turn- 4481 density is not set to zero, for each value, called d, between 1 and 4482 turn-density, the peer forms a Resource Name by concatenating its 4483 Node-ID and the value d. This Resource Name is hashed to form a 4484 Resource-ID. The address of the peer is stored at that Resource-ID 4485 using type TURN-SERVICE and the TurnServer object: 4487 struct { 4488 uint8 iteration; 4489 IpAddressAndPort server_address; 4490 } TurnServer; 4492 The contents of this structure are as follows: 4494 iteration 4495 the d value 4497 server_address 4498 the address at which the TURN server can be contacted. 4500 Note: Correct functioning of this algorithm depends on having turn- 4501 density be an reasonable estimate of the reciprocal of the 4502 proportion of nodes in the overlay that can act as TURN servers. 4503 If the turn-density value in the configuration file is too low, 4504 then the process of finding TURN servers becomes more expensive as 4505 multiple candidate Resource-IDs must be probed to find a TURN 4506 server. 4508 Peers that provide this service need to support the TURN extensions 4509 to STUN for media relay as defined in [RFC5766]. 4511 This usage defines the following kind to indicate that a peer is 4512 willing to act as a TURN server: 4514 Name TURN-SERVICE 4515 Data Model The TURN-SERVICE kind stores a single value for each 4516 Resource-ID. 4517 Access Control NODE-MULTIPLE, with maximum iteration counter 20. 4519 Peers can find other servers by selecting a random Resource-ID and 4520 then doing a Find request for the appropriate KindID with that 4521 Resource-ID. The Find request gets routed to a random peer based on 4522 the Resource-ID. If that peer knows of any servers, they will be 4523 returned. The returned response may be empty if the peer does not 4524 know of any servers, in which case the process gets repeated with 4525 some other random Resource-ID. As long as the ratio of servers 4526 relative to peers is not too low, this approach will result in 4527 finding a server relatively quickly. 4529 9. Chord Algorithm 4531 This algorithm is assigned the name chord-reload to indicate it is an 4532 adaptation of the basic Chord DHT algorithm. 4534 This algorithm differs from the originally presented Chord algorithm 4535 [Chord]. It has been updated based on more recent research results 4536 and implementation experiences, and to adapt it to the RELOAD 4537 protocol. A short list of differences: 4539 o The original Chord algorithm specified that a single predecessor 4540 and a successor list be stored. The chord-reload algorithm 4541 attempts to have more than one predecessor and successor. The 4542 predecessor sets help other neighbors learn their successor list. 4543 o The original Chord specification and analysis called for iterative 4544 routing. RELOAD specifies recursive routing. In addition to the 4545 performance implications, the cost of NAT traversal dictates 4546 recursive routing. 4547 o Finger table entries are indexed in opposite order. Original 4548 Chord specifies finger[0] as the immediate successor of the peer. 4549 chord-reload specifies finger[0] as the peer 180 degrees around 4550 the ring from the peer. This change was made to simplify 4551 discussion and implementation of variable sized finger tables. 4552 However, with either approach no more than O(log N) entries should 4553 typically be stored in a finger table. 4554 o The stabilize() and fix_fingers() algorithms in the original Chord 4555 algorithm are merged into a single periodic process. 4556 Stabilization is implemented slightly differently because of the 4557 larger neighborhood, and fix_fingers is not as aggressive to 4558 reduce load, nor does it search for optimal matches of the finger 4559 table entries. 4560 o RELOAD uses a 128 bit hash instead of a 160 bit hash, as RELOAD is 4561 not designed to be used in networks with close to or more than 4562 2^128 nodes (and it is hard to see how one would assemble such a 4563 network). 4564 o RELOAD uses randomized finger entries as described in 4565 Section 9.6.4.2. 4566 o This algorithm allows the use of either reactive or periodic 4567 recovery. The original Chord paper used periodic recovery. 4568 Reactive recovery provides better performance in small overlays, 4569 but is believed to be unstable in large (>1000) overlays with high 4570 levels of churn [handling-churn-usenix04]. The overlay 4571 configuration file specifies a "chord-reload-reactive" element 4572 that indicates whether reactive recovery should be used. 4574 9.1. Overview 4576 The algorithm described here is a modified version of the Chord 4577 algorithm. Each peer keeps track of a finger table and a neighbor 4578 table. The neighbor table contains at least the three peers before 4579 and after this peer in the DHT ring. There may not be three entries 4580 in all cases such as small rings or while the ring topology is 4581 changing. The first entry in the finger table contains the peer 4582 half-way around the ring from this peer; the second entry contains 4583 the peer that is 1/4 of the way around; the third entry contains the 4584 peer that is 1/8th of the way around, and so on. Fundamentally, the 4585 chord data structure can be thought of a doubly-linked list formed by 4586 knowing the successors and predecessor peers in the neighbor table, 4587 sorted by the Node-ID. As long as the successor peers are correct, 4588 the DHT will return the correct result. The pointers to the prior 4589 peers are kept to enable the insertion of new peers into the list 4590 structure. Keeping multiple predecessor and successor pointers makes 4591 it possible to maintain the integrity of the data structure even when 4592 consecutive peers simultaneously fail. The finger table forms a skip 4593 list, so that entries in the linked list can be found in O(log(N)) 4594 time instead of the typical O(N) time that a linked list would 4595 provide. 4597 A peer, n, is responsible for a particular Resource-ID k if k is less 4598 than or equal to n and k is greater than p, where p is the peer id of 4599 the previous peer in the neighbor table. Care must be taken when 4600 computing to note that all math is modulo 2^128. 4602 9.2. Routing 4604 The routing table is the union of the neighbor table and the finger 4605 table. 4607 If a peer is not responsible for a Resource-ID k, but is directly 4608 connected to a node with Node-ID k, then it routes the message to 4609 that node. Otherwise, it routes the request to the peer in the 4610 routing table that has the largest Node-ID that is in the interval 4611 between the peer and k. If no such node is found, it finds the 4612 smallest node id that is greater than k and routes the message to 4613 that node. 4615 9.3. Redundancy 4617 When a peer receives a Store request for Resource-ID k, and it is 4618 responsible for Resource-ID k, it stores the data and returns a 4619 success response. It then sends a Store request to its successor in 4620 the neighbor table and to that peer's successor. Note that these 4621 Store requests are addressed to those specific peers, even though the 4622 Resource-ID they are being asked to store is outside the range that 4623 they are responsible for. The peers receiving these check they came 4624 from an appropriate predecessor in their neighbor table and that they 4625 are in a range that this predecessor is responsible for, and then 4626 they store the data. They do not themselves perform further Stores 4627 because they can determine that they are not responsible for the 4628 Resource-ID. 4630 Managing replicas as the overlay changes is described in 4631 Section 9.6.3. 4633 The sequential replicas used in this overlay algorithm protect 4634 against peer failure but not against malicious peers. Additional 4635 replication from the Usage is required to protect resources from such 4636 attacks, as discussed in Section 12.5.4. 4638 9.4. Joining 4640 The join process for a joining party (JP) with Node-ID n is as 4641 follows. 4643 1. JP MUST connect to its chosen bootstrap node. 4644 2. JP SHOULD send an Attach request to the admitting peer (AP) for 4645 Node-ID n. The "send_update" flag should be used to acquire the 4646 routing table for AP. 4647 3. JP SHOULD send Attach requests to initiate connections to each of 4648 the peers in the neighbor table as well as to the desired finger 4649 table entries. Note that this does not populate their routing 4650 tables, but only their connection tables, so JP will not get 4651 messages that it is expected to route to other nodes. 4652 4. JP MUST enter all the peers it has contacted into its routing 4653 table. 4654 5. JP MUST send a Join to AP. The AP sends the response to the 4655 Join. 4656 6. AP MUST do a series of Store requests to JP to store the data 4657 that JP will be responsible for. 4658 7. AP MUST send JP an Update explicitly labeling JP as its 4659 predecessor. At this point, JP is part of the ring and 4660 responsible for a section of the overlay. AP can now forget any 4661 data which is assigned to JP and not AP. 4662 8. The AP MUST send an Update to all of its neighbors with the new 4663 values of its neighbor set (including JP). 4664 9. The JP MUST send Updates to all the peers in its neighbor table. 4666 If JP sends an Attach to AP with send_update, it immediately knows 4667 most of its expected neighbors from AP's routing table update and can 4668 directly connect to them. This is the RECOMMENDED procedure. 4670 If for some reason JP does not get AP's routing table, it can still 4671 populate its neighbor table incrementally. It sends a Ping directed 4672 at Resource-ID n+1 (directly after its own Resource-ID). This allows 4673 it to discover its own successor. Call that node p0. It then sends 4674 a ping to p0+1 to discover its successor (p1). This process can be 4675 repeated to discover as many successors as desired. The values for 4676 the two peers before p will be found at a later stage when n receives 4677 an Update. An alternate procedure is to send Attaches to those nodes 4678 rather than pings, which forms the connections immediately but may be 4679 slower if the nodes need to collect ICE candidates, thus reducing 4680 parallelism. 4682 In order to set up its finger table entry for peer i, JP simply sends 4683 an Attach to peer (n+2^(128-i). This will be routed to a peer in 4684 approximately the right location around the ring. 4686 The joining peer MUST NOT send any Update message placing itself in 4687 the overlay until it has successfully completed an Attach with each 4688 peer that should be in its neighbor table. 4690 9.5. Routing Attaches 4692 When a peer needs to Attach to a new peer in its neighbor table, it 4693 MUST source-route the Attach request through the peer from which it 4694 learned the new peer's Node-ID. Source-routing these requests allows 4695 the overlay to recover from instability. 4697 All other Attach requests, such as those for new finger table 4698 entries, are routed conventionally through the overlay. 4700 9.6. Updates 4702 A chord Update is defined as 4704 enum { reserved (0), 4705 peer_ready(1), neighbors(2), full(3), (255) } 4706 ChordUpdateType; 4708 struct { 4709 uint32 uptime; 4710 ChordUpdateType type; 4711 select(type){ 4712 case peer_ready: /* Empty */ 4713 ; 4715 case neighbors: 4716 NodeId predecessors<0..2^16-1>; 4717 NodeId successors<0..2^16-1>; 4719 case full: 4720 NodeId predecessors<0..2^16-1>; 4721 NodeId successors<0..2^16-1>; 4722 NodeId fingers<0..2^16-1>; 4723 }; 4724 } ChordUpdate; 4726 The "uptime" field contains the time this peer has been up in 4727 seconds. 4729 The "type" field contains the type of the update, which depends on 4730 the reason the update was sent. 4732 peer_ready: this peer is ready to receive messages. This message 4733 is used to indicate that a node which has Attached is a peer and 4734 can be routed through. It is also used as a connectivity check to 4735 non-neighbor peers. 4737 neighbors: this version is sent to members of the Chord neighbor 4738 table. 4740 full: this version is sent to peers which request an Update with a 4741 RouteQueryReq. 4743 If the message is of type "neighbors", then the contents of the 4744 message will be: 4746 predecessors 4747 The predecessor set of the Updating peer. 4749 successors 4750 The successor set of the Updating peer. 4752 If the message is of type "full", then the contents of the message 4753 will be: 4755 predecessors 4756 The predecessor set of the Updating peer. 4758 successors 4759 The successor set of the Updating peer. 4761 fingers 4762 The finger table of the Updating peer, in numerically ascending 4763 order. 4765 A peer MUST maintain an association (via Attach) to every member of 4766 its neighbor set. A peer MUST attempt to maintain at least three 4767 predecessors and three successors, even though this will not be 4768 possible if the ring is very small. It is RECOMMENDED that O(log(N)) 4769 predecessors and successors be maintained in the neighbor set. 4771 9.6.1. Handling Neighbor Failures 4773 Every time a connection to a peer in the neighbor table is lost (as 4774 determined by connectivity pings or the failure of some request), the 4775 peer MUST remove the entry from its neighbor table and replace it 4776 with the best match it has from the other peers in its routing table. 4777 If using reactive recovery, it then sends an immediate Update to all 4778 nodes in its Neighbor Table. The update will contain all the Node- 4779 IDs of the current entries of the table (after the failed one has 4780 been removed). Note that when replacing a successor the peer SHOULD 4781 delay the creation of new replicas for successor replacement hold- 4782 down time (30 seconds) after removing the failed entry from its 4783 neighbor table in order to allow a triggered update to inform it of a 4784 better match for its neighbor table. 4786 If the neighbor failure effects the peer's range of responsible IDs, 4787 then the Update MUST be sent to all nodes in its Connection Table. 4789 A peer MAY attempt to reestablish connectivity with a lost neighbor 4790 either by waiting additional time to see if connectivity returns or 4791 by actively routing a new Attach to the lost peer. Details for these 4792 procedures are beyond the scope of this document. In no event does 4793 an attempt to reestablish connectivity with a lost neighbor allow the 4794 peer to remain in the neighbor table. Such a peer is returned to the 4795 neighbor table once connectivity is reestablished. 4797 If connectivity is lost to all successor peers in the neighbor table, 4798 then this peer should behave as if it is joining the network and use 4799 Pings to find a peer and send it a Join. If connectivity is lost to 4800 all the peers in the finger table, this peer should assume that it 4801 has been disconnected from the rest of the network, and it should 4802 periodically try to join the DHT. 4804 9.6.2. Handling Finger Table Entry Failure 4806 If a finger table entry is found to have failed, all references to 4807 the failed peer are removed from the finger table and replaced with 4808 the closest preceding peer from the finger table or neighbor table. 4810 If using reactive recovery, the peer initiates a search for a new 4811 finger table entry as described below. 4813 9.6.3. Receiving Updates 4815 When a peer, N, receives an Update request, it examines the Node-IDs 4816 in the UpdateReq and at its neighbor table and decides if this 4817 UpdateReq would change its neighbor table. This is done by taking 4818 the set of peers currently in the neighbor table and comparing them 4819 to the peers in the update request. There are two major cases: 4821 o The UpdateReq contains peers that match N's neighbor table, so no 4822 change is needed to the neighbor set. 4823 o The UpdateReq contains peers N does not know about that should be 4824 in N's neighbor table, i.e. they are closer than entries in the 4825 neighbor table. 4827 In the first case, no change is needed. 4829 In the second case, N MUST attempt to Attach to the new peers and if 4830 it is successful it MUST adjust its neighbor set accordingly. Note 4831 that it can maintain the now inferior peers as neighbors, but it MUST 4832 remember the closer ones. 4834 After any Pings and Attaches are done, if the neighbor table changes 4835 and the peer is using reactive recovery, the peer sends an Update 4836 request to each member of its Connection Table. These Update 4837 requests are what end up filling in the predecessor/successor tables 4838 of peers that this peer is a neighbor to. A peer MUST NOT enter 4839 itself in its successor or predecessor table and instead should leave 4840 the entries empty. 4842 If peer N is responsible for a Resource-ID R, and N discovers that 4843 the replica set for R (the next two nodes in its successor set) has 4844 changed, it MUST send a Store for any data associated with R to any 4845 new node in the replica set. It SHOULD NOT delete data from peers 4846 which have left the replica set. 4848 When a peer N detects that it is no longer in the replica set for a 4849 resource R (i.e., there are three predecessors between N and R), it 4850 SHOULD delete all data associated with R from its local store. 4852 When a peer discovers that its range of responsible IDs have changed, 4853 it MUST send an Update to all entries in its connection table. 4855 9.6.4. Stabilization 4857 There are four components to stabilization: 4858 1. exchange Updates with all peers in its neighbor table to exchange 4859 state. 4860 2. search for better peers to place in its finger table. 4861 3. search to determine if the current finger table size is 4862 sufficiently large. 4863 4. search to determine if the overlay has partitioned and needs to 4864 recover. 4866 9.6.4.1. Updating neighbor table 4868 A peer MUST periodically send an Update request to every peer in its 4869 Connection Table. The purpose of this is to keep the predecessor and 4870 successor lists up to date and to detect failed peers. The default 4871 time is about every ten minutes, but the enrollment server SHOULD set 4872 this in the configuration document using the "chord-reload-update- 4873 interval" element (denominated in seconds.) A peer SHOULD randomly 4874 offset these Update requests so they do not occur all at once. 4876 9.6.4.2. Refreshing finger table 4878 A peer MUST periodically search for new peers to replace invalid 4879 (repeated) entries in the finger table. A finger table entry i is 4880 valid if it is in the range [n+2^(128-i), 4881 n+2^(128-(i-1))-2^(128-(i+1))]. Invalid entries occur in the finger 4882 table when a previous finger table entry has failed or when no peer 4883 has been found in that range. 4885 A peer SHOULD NOT send Ping requests looking for new finger table 4886 entries more often than the configuration element "chord-reload-ping- 4887 interval", which defaults to 3600 seconds (one per hour). 4889 Two possible methods for searching for new peers for the finger table 4890 entries are presented: 4892 Alternative 1: A peer selects one entry in the finger table from 4893 among the invalid entries. It pings for a new peer for that finger 4894 table entry. The selection SHOULD be exponentially weighted to 4895 attempt to replace earlier (lower i) entries in the finger table. A 4896 simple way to implement this selection is to search through the 4897 finger table entries from i=0 and each time an invalid entry is 4898 encountered, send a Ping to replace that entry with probability 0.5. 4900 Alternative 2: A peer monitors the Update messages received from its 4901 connections to observe when an Update indicates a peer that would be 4902 used to replace in invalid finger table entry, i, and flags that 4903 entry in the finger table. Every "chord-reload-ping-interval" 4904 seconds, the peer selects from among those flagged candidates using 4905 an exponentially weighted probability as above. 4907 When searching for a better entry, the peer SHOULD send the Ping to a 4908 Node-ID selected randomly from that range. Random selection is 4909 preferred over a search for strictly spaced entries to minimize the 4910 effect of churn on overlay routing [minimizing-churn-sigcomm06]. An 4911 implementation or subsequent specification MAY choose a method for 4912 selecting finger table entries other than choosing randomly within 4913 the range. Any such alternate methods SHOULD be employed only on 4914 finger table stabilization and not for the selection of initial 4915 finger table entries unless the alternative method is faster and 4916 imposes less overhead on the overlay. 4918 A peer MAY choose to keep connections to multiple peers that can act 4919 for a given finger table entry. 4921 9.6.4.3. Adjusting finger table size 4923 If the finger table has less than 16 entries, the node SHOULD attempt 4924 to discover more fingers to grow the size of the table to 16. The 4925 value 16 was chosen to ensure high odds of a node maintaining 4926 connectivity to the overlay even with strange network partitions. 4928 For many overlays, 16 finger table entries will be enough, but as an 4929 overlay grows very large, more than 16 entries may be required in the 4930 finger table for efficient routing. An implementation SHOULD be 4931 capable of increasing the number of entries in the finger table to 4932 128 entries. 4934 Note to implementers: Although log(N) entries are all that are 4935 required for optimal performance, careful implementation of 4936 stabilization will result in no additional traffic being generated 4937 when maintaining a finger table larger than log(N) entries. 4938 Implementers are encouraged to make use of RouteQuery and algorithms 4939 for determining where new finger table entries may be found. 4940 Complete details of possible implementations are outside the scope of 4941 this specification. 4943 A simple approach to sizing the finger table is to ensure the finger 4944 table is large enough to contain at least the final successor in the 4945 peer's neighbor table. 4947 9.6.4.4. Detecting partitioning 4949 To detect that a partitioning has occurred and to heal the overlay, a 4950 peer P MUST periodically repeat the discovery process used in the 4951 initial join for the overlay to locate an appropriate bootstrap node, 4952 B. P should then send a Ping for its own Node-ID routed through B. If 4953 a response is received from a peer S', which is not P's successor, 4954 then the overlay is partitioned and P should send an Attach to S' 4955 routed through B, followed by an Update sent to S'. (Note that S' 4956 may not be in P's neighbor table once the overlay is healed, but the 4957 connection will allow S' to discover appropriate neighbor entries for 4958 itself via its own stabilization.) 4960 Future specifications may describe alternative mechanisms for 4961 determining when to repeat the discovery process. 4963 9.7. Route query 4965 For this topology plugin, the RouteQueryReq contains no additional 4966 information. The RouteQueryAns contains the single node ID of the 4967 next peer to which the responding peer would have routed the request 4968 message in recursive routing: 4970 struct { 4971 NodeId next_peer; 4972 } ChordRouteQueryAns; 4974 The contents of this structure are as follows: 4976 next_peer 4977 The peer to which the responding peer would route the message in 4978 order to deliver it to the destination listed in the request. 4980 If the requester has set the send_update flag, the responder SHOULD 4981 initiate an Update immediately after sending the RouteQueryAns. 4983 9.8. Leaving 4985 To support extensions, such as [I-D.maenpaa-p2psip-self-tuning], 4986 Peers SHOULD send a Leave request to all members of their neighbor 4987 table prior to exiting the Overlay Instance. The 4988 overlay_specific_data field MUST contain the ChordLeaveData structure 4989 defined below: 4991 enum { reserved (0), 4992 from_succ(1), from_pred(2), (255) } 4993 ChordLeaveType; 4995 struct { 4996 ChordLeaveType type; 4998 select(type) { 4999 case from_succ: 5000 NodeId successors<0..2^16-1>; 5001 case from_pred: 5002 NodeId predecessors<0..2^16-1>; 5003 }; 5004 } ChordLeaveData; 5006 The 'type' field indicates whether the Leave request was sent by a 5007 predecessor or a successor of the recipient: 5009 from_succ 5010 The Leave request was sent by a successor. 5012 from_pred 5013 The Leave request was sent by a predecessor. 5015 If the type of the request is 'from_succ', the contents will be: 5017 successors 5018 The sender's successor list. 5020 If the type of the request is 'from_pred', the contents will be: 5022 predecessors 5023 The sender's predecessor list. 5025 Any peer which receives a Leave for a peer n in its neighbor set 5026 follows procedures as if it had detected a peer failure as described 5027 in Section 9.6.1. 5029 10. Enrollment and Bootstrap 5031 The section defines the format of the configuration data as well the 5032 process to join a new overlay. 5034 10.1. Overlay Configuration 5036 This specification defines a new content type "application/ 5037 p2p-overlay+xml" for an MIME entity that contains overlay 5038 information. An example document is shown below. 5040 5042 5045 5047 false 5048 5049 5050 5051 30 5052 TLS 5053 false 5054 10 5055 4000 5056 https://example.org 5057 foo 5058 300 5059 400 5060 false 5062 asecret 5063 chord 5064 16 5065 DATA GOES HERE 5066 5067 5068 5069 single 5070 user-match 5071 1 5072 100 5073 5074 5075 VGhpcyBpcyBub3QgcmlnaHQhCg== 5076 5077 5078 5079 5080 array 5081 node-multiple 5082 3 5083 22 5084 4 5085 1 5086 5087 5088 5089 VGhpcyBpcyBub3QgcmlnaHQhCg== 5090 5091 5092 5093 47112162e84c69ba 5094 6eba45d31a900c06 5095 6ebc45d31a900c06 5096 5097 VGhpcyBpcyBub3QgcmlnaHQhCg== 5098 5099 The file MUST be a well formed XML document and it SHOULD contain an 5100 encoding declaration in the XML declaration. If the charset 5101 parameter of the MIME content type declaration is present and it is 5102 different from the encoding declaration, the charset parameter takes 5103 precedence. Every application conforming to this specification MUST 5104 accept the UTF-8 character encoding to ensure minimal 5105 interoperability. The namespace for the elements defined in this 5106 specification is urn:ietf:params:xml:ns:p2p:config-base and 5107 urn:ietf:params:xml:ns:p2p:config-chord". 5109 The file can contain multiple "configuration" elements where each one 5110 contains the configuration information for a different overlay. Each 5111 "configuration" has the following attributes: 5113 instance-name: name of the overlay 5114 expiration: time in future at which this overlay configuration is no 5115 longer valid and needs to be retrieved again 5116 sequence: a monotonically increasing sequence number between 1 and 5117 2^32 5119 Inside each overlay element, the following elements can occur: 5121 topology-plugin This element has defines the overlay algorithm being 5122 used. 5123 node-id-length This element contains the length of a NodeId 5124 (NodeIdLength) in bytes. This value MUST be between 16 (128 bits) 5125 and 20 (160 bits). If this element is not present, the default of 5126 16 is used. 5127 root-cert This element contains a PEM encoded X.509v3 certificate 5128 that is a root trust anchor used to sign all certificates in this 5129 overlay. There can be more than one root-cert element. 5130 enrollment-server This element contains the URL at which the 5131 enrollment server can be reached in a "url" element. This URL 5132 MUST be of type "https:". More than one enrollment-server element 5133 may be present. 5134 self-signed-permitted This element indicates whether self-signed 5135 certificates are permitted. If it is set to "true", then self- 5136 signed certificates are allowed, in which case the enrollment- 5137 server and root-cert elements may be absent. Otherwise, it SHOULD 5138 be absent, but MAY be set to "false". This element also contains 5139 an attribute "digest" which indicates the digest to be used to 5140 compute the Node-ID. Valid values for this parameter are "SHA-1" 5141 and "SHA-256". Implementations MUST support both of these 5142 algorithms. 5144 direct-return-response-permitted This element indicates whether 5145 direct return routed responses as described in Section 5.3.2.4 are 5146 permitted. If it is set to "true", they MAY be used. Otherwise, 5147 it SHOULD be absent, but MAY be set to "false". Implementations 5148 MAY support direct return routed response. 5149 bootstrap-node This element represents the address of one of the 5150 bootstrap nodes. It has an attribute called "address" that 5151 represents the IP address (either IPv4 or IPv6, since they can be 5152 distinguished) and an attribute called "port" that represents the 5153 port. The IP address is in typical hexadecimal form using 5154 standard period and colon separators as specified in [RFC5952]. 5155 More than one bootstrap-peer element may be present. 5156 turn-density This element is a positive integer that represents the 5157 approximate reciprocal of density of nodes that can act as TURN 5158 servers. For example, if 5% of the nodes can act as TURN servers, 5159 this would be set to 20. If it is not present, the default value 5160 is 1. If there are no TURN servers in the overlay, it is set to 5161 zero. 5162 multicast-bootstrap This element represents the address of a 5163 multicast, broadcast, or anycast address and port that may be used 5164 for bootstrap. Nodes SHOULD listen on the address. It has an 5165 attributed called "address" that represents the IP address and an 5166 attribute called "port" that represents the port. More than one 5167 "multicast-bootstrap" element may be present. 5168 clients-permitted This element represents whether clients are 5169 permitted or whether all nodes must be peers. If it is set to 5170 "TRUE" or absent, this indicates that clients are permitted. If 5171 it is set to "FALSE" then nodes MUST join as peers. 5172 no-ice This element represents whether nodes are required to use 5173 the "No-ICE" Overlay Link protocols in this overlay. If it is 5174 absent, it is treated as if it were set to "FALSE". 5175 chord-update-interval The update frequency for the Chord-reload 5176 topology plugin (see Section 9). 5177 chord-ping-interval The ping frequency for the Chord-reload 5178 topology plugin (see Section 9). 5179 chord-reload-reactive Whether reactive recovery should be used for 5180 this overlay. (see Section 9). 5181 shared-secret If shared secret mode is used, this contains the 5182 shared secret. 5183 max-message-size Maximum size in bytes of any message in the 5184 overlay. If this value is not present, the default is 5000. 5185 initial-ttl Initial default TTL (time to live, see Section 5.3.2) 5186 for messages. If this value is not present, the default is 100. 5187 overlay-link-protocol Indicates a permissible overlay link protocol 5188 (see Section 5.6.1 for requirements for such protocols). An 5189 arbitrary number of these elements may appear. If none appear, 5190 then this implies the default value, "TLS", which refers to the 5191 use of TLS and DTLS. If one or more elements appear, then no 5192 default value applies. 5193 kind-signer This contains a single Node-ID in hexadecimal and 5194 indicates that the certificate with this Node-ID is allowed to 5195 sign kinds. Identifying kind-signer by Node-ID instead of 5196 certificate allows the use of short lived certificates without 5197 constantly having to provide an updated configuration file. 5198 bad-node This contains a single Node-ID in hexadecimal and 5199 indicates that the certificate with this Node-ID MUST NOT be 5200 considered valid. This allows certificate revocation. An 5201 arbitrary number of these elements can be provided. Note that 5202 because certificates may expire, bad-node entries need only be 5203 present for the lifetime of the certificate. Technically 5204 speaking, bad node-ids may be reused once their certificates have 5205 expired, the requirement for node-ids to be pseudo randomly 5206 generated gives this event a vanishing probability. 5208 Inside each overlay element, the required-kinds elements can also 5209 occur. This element indicates the kinds that members must support 5210 and contains multiple kind-block elements that each define a single 5211 kind that MUST be supported by nodes in the overlay. Each kind-block 5212 consists of a single kind element and a kind-signature. The kind 5213 element defines the kind. The kind-signature is the signature 5214 computed over the kind element. 5216 Each kind has either an ID attribute or a name attribute. The name 5217 attribute is a string representing the kind (the name registered to 5218 IANA) while the ID is an integer kind-id allocated out of private 5219 space. 5221 In addition, the kind element contains the following elements: 5222 max-count: the maximum number of values which members of the overlay 5223 must support. 5224 data-model: the data model to be used. 5225 max-size: the maximum size of individual values. 5226 access-control: the access control model to be used. 5227 max-node-multiple: This is optional and only used when the access 5228 control is NODE-MULTIPLE. This indicates the maximum value for 5229 the i counter. This is an integer greater than 0. 5231 All of the non optional values MUST be provided. If the kind is 5232 registered with IANA, the data-model and access-control attributes 5233 MUST match those in the kind registration. For instance, the example 5234 above indicates that members must support SIP-REGISTRATION with a 5235 maximum of 10 values of up to 1000 bytes each. Multiple required- 5236 kinds elements MAY be present. 5238 The kind-block element also MUST contain a "kind-signature" element. 5239 This signature is computed across the kind from the beginning of the 5240 first < of the kind to the end of the last > of the kind in the same 5241 way as the "signature element described later in this section. 5243 The configuration file is a binary file and cannot be changed - 5244 including whitespace changes - or the signature will break. The 5245 signature is computed by taking each configuration element and 5246 starting from, and including, the first < at the start of 5247 up to and including the > in and 5248 treating this as a binary blob that is signed using the standard 5249 SecurityBlock defined in Section 5.3.4. The SecurityBlock is base 64 5250 encoded using the base64 alphabet from RFC[RFC4648] and put in the 5251 signature element following the configuration object in the config 5252 file. 5254 When a node receives a new configuration file, it MUST change its 5255 configuration to meet the new requirements. This may require the 5256 node to exit the DHT and re-join. If a node is not capable of 5257 supporting the new requirements, it MUST exit the overlay. If some 5258 information about a particular kind changes from what the node 5259 previously knew about the kind (for example the max size), the new 5260 information in the configuration files overrides any previously 5261 learned information. If any kind data was signed by a node that is 5262 no longer allowed to sign kinds, that kind MUST be discarded along 5263 with any stored information of that kind. Note that forcing an 5264 avalanche restart of the overlay with a configuration change that 5265 requires re-joining the overlay may result in serious performance 5266 problems, including total collapse of the network if configuration 5267 parameters are not properly considered. Such an event may be 5268 necessary in case of a compromised CA or similar problem, but for 5269 large overlays should be avoided in almost all circumstances. 5271 10.1.1. Relax NG Grammar 5273 The grammar for the configuration data is: 5275 namespace chord = "urn:ietf:params:xml:ns:p2p:config-chord" 5276 namespace local = "" 5277 default namespace p2pcf = "urn:ietf:params:xml:ns:p2p:config-base" 5278 namespace rng = "http://relaxng.org/ns/structure/1.0" 5280 anything = 5281 (element * { anything } 5282 | attribute * { text } 5283 | text)* 5285 foreign-elements = element * - (p2pcf:* | local:* | chord:*) 5286 { anything }* 5287 foreign-attributes = attribute * - (p2pcf:*|local:*|chord:*) 5288 { text }* 5289 foreign-nodes = (foreign-attributes | foreign-elements)* 5291 start = 5292 element p2pcf:overlay { 5293 element configuration { 5294 attribute instance-name { text }, 5295 attribute expiration { xsd:dateTime }, 5296 attribute sequence { xsd:long }, 5297 parameter 5298 }, 5299 element signature { 5300 attribute algorithm { signature-algorithm-type }?, 5301 xsd:base64Binary 5302 }? 5303 } 5304 signature-algorithm-type |= "rsa-sha1" 5306 parameter &= element topology-plugin { topology-plugin-type } 5307 parameter &= element max-message-size { xsd:int }? 5308 parameter &= element initial-ttl { xsd:int }? 5309 parameter &= element root-cert { text }? 5310 parameter &= element required-kinds { kind-block* } 5311 parameter &= element enrollment-server { xsd:anyURI }? 5312 parameter &= element kind-signer { text }* 5313 parameter &= element bad-node { text }* 5314 parameter &= element no-ice { xsd:boolean }? 5315 parameter &= 5316 element direct-return-response-permitted { xsd:boolean }? 5317 parameter &= element shared-secret { xsd:string }? 5318 parameter &= element overlay-link-protocol { xsd:string }* 5319 parameter &= element clients-permitted { xsd:boolean }? 5320 parameter &= element turn-density { xsd:int }? 5321 parameter &= element node-id-length { xsd:int }? 5322 parameter &= foreign-elements* 5323 parameter &= 5324 element self-signed-permitted { 5325 attribute digest { self-signed-digest-type }, 5326 xsd:boolean 5327 }? 5328 self-signed-digest-type |= "sha1" 5329 parameter &= 5330 element bootstrap-node { 5331 attribute address { xsd:string }, 5332 attribute port { xsd:int } 5333 }+ 5334 hostPort = text 5335 parameter &= 5336 element multicast-bootstrap { hostPort 5337 }* 5339 kind-block = element kind-block { 5340 element kind { 5341 (attribute name { kind-names } 5342 | attribute id { xsd:int }), 5343 kind-paramter 5344 } & 5345 element kind-signature { 5346 attribute algorithm { signature-algorithm-type }?, 5347 xsd:base64Binary 5348 }? 5350 } 5352 kind-paramter &= element max-count { xsd:int } 5353 kind-paramter &= element max-size { xsd:int } 5354 kind-paramter &= element data-model { data-model-type } 5355 data-model-type |= "single" 5356 data-model-type |= "array" 5357 data-model-type |= "dictionary" 5358 kind-paramter &= element access-control { access-control-type } 5359 kind-paramter &= element max-node-multiple { xsd:int }? 5360 access-control-type |= "user-match" 5361 access-control-type |= "node-match" 5362 access-control-type |= "user-node-match" 5363 access-control-type |= "node-multiple" 5364 access-control-type |= "user-match-with-anon-create" 5365 kind-paramter &= foreign-elements* 5367 # Chord specific paramters 5368 topology-plugin-type |= "chord" 5369 kind-names |= "sip-registration" 5370 kind-names |= "turn-service" 5371 parameter &= element chord:chord-ping-interval { xsd:int }? 5372 parameter &= element chord:chord-update-interval { xsd:int }? 5374 10.2. Discovery Through Enrollment Server 5376 When a node first enrolls in a new overlay, it starts with a 5377 discovery process to find an enrollment server. 5379 The node first determines the overlay name. This value is provided 5380 by the user or some other out of band provisioning mechanism. The 5381 out of band mechanisms may also provide an optional URL for the 5382 enrollment server. If a URL for the enrollment server is not 5383 provided, the node MUST do a DNS SRV query using a Service name of 5384 "p2psip-enroll" and a protocol of TCP to find an enrollment server 5385 and form the URL by appending a path of "/.well-known/p2psip-enroll" 5386 to the overlay name. This uses the "well known URI" framework 5387 defined in [RFC5785]. For example, if the overlay name was 5388 example.com, the URL would be 5389 "https://example.com//.well-known/p2psip-enroll". 5391 Once an address and URL for the enrollment server is determined, the 5392 peer forms an HTTPS connection to that IP address. The certificate 5393 MUST match the overlay name as described in [RFC2818]. Then the node 5394 MUST fetch a new copy of the configuration file. To do this, the 5395 peer performs a GET to the URL. The result of the HTTP GET is an XML 5396 configuration file described above, which replaces any previously 5397 learned configuration file for this overlay. 5399 For overlays that do not use an enrollment server, nodes obtain the 5400 configuration information needed to join the overlay through some out 5401 of band approach such an XML configuration file sent over email. 5403 10.3. Credentials 5405 If the configuration document contains a enrollment-server element, 5406 credentials are required to join the Overlay Instance. A peer which 5407 does not yet have credentials MUST contact the enrollment server to 5408 acquire them. 5410 RELOAD defines its own trivial certificate request protocol. We 5411 would have liked to have used an existing protocol but were concerned 5412 about the implementation burden of even the simplest of those 5413 protocols, such as [RFC5272] and [RFC5273]. Our objective was to 5414 have a protocol which could be easily implemented in a Web server 5415 which the operator did not control (e.g., in a hosted service) and 5416 was compatible with the existing certificate handling tooling as used 5417 with the Web certificate infrastructure. This means accepting bare 5418 PKCS#10 requests and returning a single bare X.509 certificate. 5419 Although the MIME types for these objects are defined, none of the 5420 existing protocols support exactly this model. 5422 The certificate request protocol is performed over HTTPS. The 5423 request is an HTTP POST with the following properties: 5425 o If authentication is required, there is a URL parameter of 5426 "password" and "username" containing the user's name and password 5427 in the clear (hence the need for HTTPS) 5428 o The body is of content type "application/pkcs10", as defined in 5429 [RFC2311]. 5431 o The Accept header contains the type "application/pkix-cert", 5432 indicating the type that is expected in the response. 5434 The enrollment server MUST authenticate the request using the 5435 provided user name and password. If the authentication succeeds and 5436 the requested user name is acceptable, the server generates and 5437 returns a certificate. The SubjectAltName field in the certificate 5438 contains the following values: 5440 o One or more Node-IDs which MUST be cryptographically random 5441 [RFC4086]. Each MUST be chosen by the enrollment server in such a 5442 way that they are unpredictable to the requesting user. E.g., the 5443 user MUST NOT be informed of potential (random) Node-IDs prior to 5444 authenticating. Each is placed in the subjectAltName using the 5445 uniformResourceIdentifier type and MUST contain RELOAD URIs as 5446 described in Section 13.15 and MUST contain a Destination list 5447 with a single entry of type "node_id". 5448 o A single name this user is allowed to use in the overlay, using 5449 type rfc822Name. 5451 The certificate is returned as type "application/pkix-cert", with an 5452 HTTP status code of 200 OK. Certificate processing errors should be 5453 treated as HTTP errors and have appropriate HTTP status codes. 5455 The client MUST check that the certificate returned was signed by one 5456 of the certificates received in the "root-cert" list of the overlay 5457 configuration data. The node then reads the certificate to find the 5458 Node-IDs it can use. 5460 10.3.1. Self-Generated Credentials 5462 If the "self-signed-permitted" element is present in the 5463 configuration and set to "TRUE", then a node MUST generate its own 5464 self-signed certificate to join the overlay. The self-signed 5465 certificate MAY contain any user name of the users choice. 5467 The Node-ID MUST be computed by applying the digest specified in the 5468 self-signed-permitted element to the DER representation of the user's 5469 public key (more specifically the subjectPublicKeyInfo) and taking 5470 the high order bits. When accepting a self-signed certificate, nodes 5471 MUST check that the Node-ID and public keys match. This prevents 5472 Node-ID theft. 5474 Once the node has constructed a self-signed certificate, it MAY join 5475 the overlay. Before storing its certificate in the overlay 5476 (Section 7) it SHOULD look to see if the user name is already taken 5477 and if so choose another user name. Note that this only provides 5478 protection against accidental name collisions. Name theft is still 5479 possible. If protection against name theft is desired, then the 5480 enrollment service must be used. 5482 10.4. Searching for a Bootstrap Node 5484 If no cached bootstrap nodes are available and the config file has an 5485 multicast-bootstrap element, then the node SHOULD send a Ping request 5486 over UDP to the address and port found to each multicast-bootstrap 5487 element found in the configuration document. This MAY be a 5488 multicast, broadcast, or anycast address. The Ping should use the 5489 wildcard Node-ID as the destination Node-ID. 5491 The responder node that receives the Ping request SHOULD check that 5492 the overlay name is correct and that the requester peer sending the 5493 request has appropriate credentials for the overlay before responding 5494 to the Ping request even if the response is only an error. 5496 10.5. Contacting a Bootstrap Node 5498 In order to join the overlay, the joining node MUST contact a node in 5499 the overlay. Typically this means contacting the bootstrap nodes, 5500 since they are reachable by the local peer or have public IP 5501 addresses. If the joining node has cached a list of peers it has 5502 previously been connected with in this overlay, as an optimization it 5503 MAY attempt to use one or more of them as bootstrap nodes before 5504 falling back to the bootstrap nodes listed in the configuration file. 5506 When contacting a bootstrap node, the joining node first forms the 5507 DTLS or TLS connection to the bootstrap node and then sends an Attach 5508 request over this connection with the destination Node-ID set to the 5509 joining node's Node-ID. 5511 When the requester node finally does receive a response from some 5512 responding node, it can note the Node-ID in the response and use this 5513 Node-ID to start sending requests to join the Overlay Instance as 5514 described in Section 5.4. 5516 After a node has successfully joined the overlay network, it will 5517 have direct connections to several peers. Some MAY be added to the 5518 cached bootstrap nodes list and used in future boots. Peers that are 5519 not directly connected MUST NOT be cached. The suggested number of 5520 peers to cache is 10. Algorithms for determining which peers to 5521 cache are beyond the scope of this specification. 5523 11. Message Flow Example 5525 The following abbreviation are used in the message flow diagrams: JP 5526 = joining peer, AP = admitting peer, NP = next peer after the AP, NNP 5527 = next next peer which is the peer after NP, PP = previous peer 5528 before the AP, PPP = previous previous peer which is the peer before 5529 the PP, BP = bootstrap peer. 5531 The following abbreviation are used in the message flow diagrams: 5533 In the following example, we assume that JP has formed a connection 5534 to one of the bootstrap nodes. JP then sends an Attach through that 5535 peer to the admitting peer (AP) to initiate a connection. When AP 5536 responds, JP and AP use ICE to set up a connection and then set up 5537 TLS. Once AP has connected to JP, AP sends the JP an Update to 5538 populate its Routing Table. The following example shows the Update 5539 happening after the TLS connection is formed but it could also happen 5540 before in which case the Update would often be routed through other 5541 nodes. 5543 JP PPP PP AP NP NNP BP 5544 | | | | | | | 5545 | | | | | | | 5546 | | | | | | | 5547 |Attach Dest=JP | | | | | 5548 |---------------------------------------------------------->| 5549 | | | | | | | 5550 | | | | | | | 5551 | | |Attach Dest=JP | | | 5552 | | |<--------------------------------------| 5553 | | | | | | | 5554 | | | | | | | 5555 | | |Attach Dest=JP | | | 5556 | | |-------->| | | | 5557 | | | | | | | 5558 | | | | | | | 5559 | | |AttachAns | | | 5560 | | |<--------| | | | 5561 | | | | | | | 5562 | | | | | | | 5563 | | |AttachAns | | | 5564 | | |-------------------------------------->| 5565 | | | | | | | 5566 | | | | | | | 5567 |AttachAns | | | | | 5568 |<----------------------------------------------------------| 5569 | | | | | | | 5570 | | | | | | | 5571 |TLS | | | | | | 5572 |.............................| | | | 5573 | | | | | | | 5574 | | | | | | | 5575 | | | | | | | 5576 |Update | | | | | | 5577 |<----------------------------| | | | 5578 | | | | | | | 5579 | | | | | | | 5580 |UpdateAns| | | | | | 5581 |---------------------------->| | | | 5582 | | | | | | | 5583 | | | | | | | 5584 | | | | | | | 5586 The JP then forms connections to the appropriate neighbors, such as 5587 NP, by sending an Attach which gets routed via other nodes. When NP 5588 responds, JP and NP use ICE and TLS to set up a connection. 5590 JP PPP PP AP NP NNP BP 5591 | | | | | | | 5592 | | | | | | | 5593 | | | | | | | 5594 |Attach NP | | | | | 5595 |---------------------------->| | | | 5596 | | | | | | | 5597 | | | | | | | 5598 | | | |Attach NP| | | 5599 | | | |-------->| | | 5600 | | | | | | | 5601 | | | | | | | 5602 | | | |AttachAns| | | 5603 | | | |<--------| | | 5604 | | | | | | | 5605 | | | | | | | 5606 |AttachAns | | | | | 5607 |<----------------------------| | | | 5608 | | | | | | | 5609 | | | | | | | 5610 |Attach | | | | | | 5611 |-------------------------------------->| | | 5612 | | | | | | | 5613 | | | | | | | 5614 |TLS | | | | | | 5615 |.......................................| | | 5616 | | | | | | | 5617 | | | | | | | 5618 | | | | | | | 5619 | | | | | | | 5621 JP also needs to populate its finger table (for Chord). It issues an 5622 Attach to a variety of locations around the overlay. The diagram 5623 below shows it sending an Attach halfway around the Chord ring to the 5624 JP + 2^127. 5626 JP NP XX TP 5627 | | | | 5628 | | | | 5629 | | | | 5630 |Attach JP+2<<126 | | 5631 |-------->| | | 5632 | | | | 5633 | | | | 5634 | |Attach JP+2<<126 | 5635 | |-------->| | 5636 | | | | 5637 | | | | 5638 | | |Attach JP+2<<126 5639 | | |-------->| 5640 | | | | 5641 | | | | 5642 | | |AttachAns| 5643 | | |<--------| 5644 | | | | 5645 | | | | 5646 | |AttachAns| | 5647 | |<--------| | 5648 | | | | 5649 | | | | 5650 |AttachAns| | | 5651 |<--------| | | 5652 | | | | 5653 | | | | 5654 |TLS | | | 5655 |.............................| 5656 | | | | 5657 | | | | 5658 | | | | 5659 | | | | 5661 Once JP has a reasonable set of connections, it is ready to take its 5662 place in the DHT. It does this by sending a Join to AP. AP does a 5663 series of Store requests to JP to store the data that JP will be 5664 responsible for. AP then sends JP an Update explicitly labeling JP 5665 as its predecessor. At this point, JP is part of the ring and 5666 responsible for a section of the overlay. AP can now forget any data 5667 which is assigned to JP and not AP. 5669 JP PPP PP AP NP NNP BP 5670 | | | | | | | 5671 | | | | | | | 5672 | | | | | | | 5673 |JoinReq | | | | | | 5674 |---------------------------->| | | | 5675 | | | | | | | 5676 | | | | | | | 5677 |JoinAns | | | | | | 5678 |<----------------------------| | | | 5679 | | | | | | | 5680 | | | | | | | 5681 |StoreReq Data A | | | | | 5682 |<----------------------------| | | | 5683 | | | | | | | 5684 | | | | | | | 5685 |StoreAns | | | | | | 5686 |---------------------------->| | | | 5687 | | | | | | | 5688 | | | | | | | 5689 |StoreReq Data B | | | | | 5690 |<----------------------------| | | | 5691 | | | | | | | 5692 | | | | | | | 5693 |StoreAns | | | | | | 5694 |---------------------------->| | | | 5695 | | | | | | | 5696 | | | | | | | 5697 |UpdateReq| | | | | | 5698 |<----------------------------| | | | 5699 | | | | | | | 5700 | | | | | | | 5701 |UpdateAns| | | | | | 5702 |---------------------------->| | | | 5703 | | | | | | | 5704 | | | | | | | 5705 | | | | | | | 5706 | | | | | | | 5708 In Chord, JP's neighbor table needs to contain its own predecessors. 5709 It couldn't connect to them previously because it did not yet know 5710 their addresses. However, now that it has received an Update from 5711 AP, it has AP's predecessors, which are also its own, so it sends 5712 Attaches to them. Below it is shown connecting to AP's closest 5713 predecessor, PP. 5715 JP PPP PP AP NP NNP BP 5716 | | | | | | | 5717 | | | | | | | 5718 | | | | | | | 5719 |Attach Dest=PP | | | | | 5720 |---------------------------->| | | | 5721 | | | | | | | 5722 | | | | | | | 5723 | | |Attach Dest=PP | | | 5724 | | |<--------| | | | 5725 | | | | | | | 5726 | | | | | | | 5727 | | |AttachAns| | | | 5728 | | |-------->| | | | 5729 | | | | | | | 5730 | | | | | | | 5731 |AttachAns| | | | | | 5732 |<----------------------------| | | | 5733 | | | | | | | 5734 | | | | | | | 5735 |TLS | | | | | | 5736 |...................| | | | | 5737 | | | | | | | 5738 | | | | | | | 5739 |UpdateReq| | | | | | 5740 |------------------>| | | | | 5741 | | | | | | | 5742 | | | | | | | 5743 |UpdateAns| | | | | | 5744 |<------------------| | | | | 5745 | | | | | | | 5746 | | | | | | | 5747 |UpdateReq| | | | | | 5748 |---------------------------->| | | | 5749 | | | | | | | 5750 | | | | | | | 5751 |UpdateAns| | | | | | 5752 |<----------------------------| | | | 5753 | | | | | | | 5754 | | | | | | | 5755 |UpdateReq| | | | | | 5756 |-------------------------------------->| | | 5757 | | | | | | | 5758 | | | | | | | 5759 |UpdateAns| | | | | | 5760 |<--------------------------------------| | | 5761 | | | | | | | 5762 | | | | | | | 5764 Finally, now that JP has a copy of all the data and is ready to route 5765 messages and receive requests, it sends Updates to everyone in its 5766 Routing Table to tell them it is ready to go. Below, it is shown 5767 sending such an update to TP. 5769 JP NP XX TP 5770 | | | | 5771 | | | | 5772 | | | | 5773 |Update | | | 5774 |---------------------------->| 5775 | | | | 5776 | | | | 5777 |UpdateAns| | | 5778 |<----------------------------| 5779 | | | | 5780 | | | | 5781 | | | | 5782 | | | | 5784 12. Security Considerations 5786 12.1. Overview 5788 RELOAD provides a generic storage service, albeit one designed to be 5789 useful for P2PSIP. In this section we discuss security issues that 5790 are likely to be relevant to any usage of RELOAD. More background 5791 information can be found in [RFC5765]. 5793 In any Overlay Instance, any given user depends on a number of peers 5794 with which they have no well-defined relationship except that they 5795 are fellow members of the Overlay Instance. In practice, these other 5796 nodes may be friendly, lazy, curious, or outright malicious. No 5797 security system can provide complete protection in an environment 5798 where most nodes are malicious. The goal of security in RELOAD is to 5799 provide strong security guarantees of some properties even in the 5800 face of a large number of malicious nodes and to allow the overlay to 5801 function correctly in the face of a modest number of malicious nodes. 5803 P2PSIP deployments require the ability to authenticate both peers and 5804 resources (users) without the active presence of a trusted entity in 5805 the system. We describe two mechanisms. The first mechanism is 5806 based on public key certificates and is suitable for general 5807 deployments. The second is an admission control mechanism based on 5808 an overlay-wide shared symmetric key. 5810 12.2. Attacks on P2P Overlays 5812 The two basic functions provided by overlay nodes are storage and 5813 routing: some node is responsible for storing a peer's data and for 5814 allowing a third peer to fetch this stored data. Other nodes are 5815 responsible for routing messages to and from the storing nodes. Each 5816 of these issues is covered in the following sections. 5818 P2P overlays are subject to attacks by subversive nodes that may 5819 attempt to disrupt routing, corrupt or remove user registrations, or 5820 eavesdrop on signaling. The certificate-based security algorithms we 5821 describe in this specification are intended to protect overlay 5822 routing and user registration information in RELOAD messages. 5824 To protect the signaling from attackers pretending to be valid peers 5825 (or peers other than themselves), the first requirement is to ensure 5826 that all messages are received from authorized members of the 5827 overlay. For this reason, RELOAD transports all messages over a 5828 secure channel (TLS and DTLS are defined in this document) which 5829 provides message integrity and authentication of the directly 5830 communicating peer. In addition, messages and data are digitally 5831 signed with the sender's private key, providing end-to-end security 5832 for communications. 5834 12.3. Certificate-based Security 5836 This specification stores users' registrations and possibly other 5837 data in an overlay network. This requires a solution to securing 5838 this data as well as securing, as well as possible, the routing in 5839 the overlay. Both types of security are based on requiring that 5840 every entity in the system (whether user or peer) authenticate 5841 cryptographically using an asymmetric key pair tied to a certificate. 5843 When a user enrolls in the Overlay Instance, they request or are 5844 assigned a unique name, such as "alice@dht.example.net". These names 5845 are unique and are meant to be chosen and used by humans much like a 5846 SIP Address of Record (AOR) or an email address. The user is also 5847 assigned one or more Node-IDs by the central enrollment authority. 5848 Both the name and the Node-ID are placed in the certificate, along 5849 with the user's public key. 5851 Each certificate enables an entity to act in two sorts of roles: 5853 o As a user, storing data at specific Resource-IDs in the Overlay 5854 Instance corresponding to the user name. 5855 o As a overlay peer with the Node-ID(s) listed in the certificate. 5857 Note that since only users of this Overlay Instance need to validate 5858 a certificate, this usage does not require a global PKI. Instead, 5859 certificates are signed by a central enrollment authority which acts 5860 as the certificate authority for the Overlay Instance. This 5861 authority signs each peer's certificate. Because each peer possesses 5862 the CA's certificate (which they receive on enrollment) they can 5863 verify the certificates of the other entities in the overlay without 5864 further communication. Because the certificates contain the user/ 5865 peer's public key, communications from the user/peer can be verified 5866 in turn. 5868 If self-signed certificates are used, then the security provided is 5869 significantly decreased, since attackers can mount Sybil attacks. In 5870 addition, attackers cannot trust the user names in certificates 5871 (though they can trust the Node-IDs because they are 5872 cryptographically verifiable). This scheme may be appropriate for 5873 some small deployments, such as a small office or an ad hoc overlay 5874 set up among participants in a meeting where all hosts on the network 5875 are trusted. Some additional security can be provided by using the 5876 shared secret admission control scheme as well. 5878 Because all stored data is signed by the owner of the data the 5879 storing peer can verify that the storer is authorized to perform a 5880 store at that Resource-ID and also allow any consumer of the data to 5881 verify the provenance and integrity of the data when it retrieves it. 5883 Note that RELOAD does not itself provide a revocation/status 5884 mechanism (though certificates may of course include OCSP responder 5885 information). Thus, certificate lifetimes should be chosen to 5886 balance the compromise window versus the cost of certificate renewal. 5887 Because RELOAD is already designed to operate in the face of some 5888 fraction of malicious peers, this form of compromise is not fatal. 5890 All implementations MUST implement certificate-based security. 5892 12.4. Shared-Secret Security 5894 RELOAD also supports a shared secret admission control scheme that 5895 relies on a single key that is shared among all members of the 5896 overlay. It is appropriate for small groups that wish to form a 5897 private network without complexity. In shared secret mode, all the 5898 peers share a single symmetric key which is used to key TLS-PSK 5899 [RFC4279] or TLS-SRP [RFC5054] mode. A peer which does not know the 5900 key cannot form TLS connections with any other peer and therefore 5901 cannot join the overlay. 5903 One natural approach to a shared-secret scheme is to use a user- 5904 entered password as the key. The difficulty with this is that in 5905 TLS-PSK mode, such keys are very susceptible to dictionary attacks. 5907 If passwords are used as the source of shared-keys, then TLS-SRP is a 5908 superior choice because it is not subject to dictionary attacks. 5910 12.5. Storage Security 5912 When certificate-based security is used in RELOAD, any given 5913 Resource-ID/Kind-ID pair is bound to some small set of certificates. 5914 In order to write data, the writer must prove possession of the 5915 private key for one of those certificates. Moreover, all data is 5916 stored, signed with the same private key that was used to authorize 5917 the storage. This set of rules makes questions of authorization and 5918 data integrity - which have historically been thorny for overlays - 5919 relatively simple. 5921 12.5.1. Authorization 5923 When a client wants to store some value, it first digitally signs the 5924 value with its own private key. It then sends a Store request that 5925 contains both the value and the signature towards the storing peer 5926 (which is defined by the Resource Name construction algorithm for 5927 that particular kind of value). 5929 When the storing peer receives the request, it must determine whether 5930 the storing client is authorized to store at this Resource-ID/Kind-ID 5931 pair. Determining this requires comparing the user's identity to the 5932 requirements of the access control model (see Section 6.3). If it 5933 satisfies those requirements the user is authorized to write, pending 5934 quota checks as described in the next section. 5936 For example, consider the certificate with the following properties: 5938 User name: alice@dht.example.com 5939 Node-ID: 013456789abcdef 5940 Serial: 1234 5942 If Alice wishes to Store a value of the "SIP Location" kind, the 5943 Resource Name will be the SIP AOR "sip:alice@dht.example.com". The 5944 Resource-ID will be determined by hashing the Resource Name. Because 5945 SIP Location uses the USER-NODE-MATCH policy, it first verifies that 5946 the user name in the certificate hashes to the requested Resource-ID. 5947 It then verifies that the node-id in the certificate matches the 5948 dictionary key being used for the store. If both of these checks 5949 succeed, the Store is authorized. Note that because the access 5950 control model is different for different kinds, the exact set of 5951 checks will vary. 5953 12.5.2. Distributed Quota 5955 Being a peer in an Overlay Instance carries with it the 5956 responsibility to store data for a given region of the Overlay 5957 Instance. However, allowing clients to store unlimited amounts of 5958 data would create unacceptable burdens on peers and would also enable 5959 trivial denial of service attacks. RELOAD addresses this issue by 5960 requiring configurations to define maximum sizes for each kind of 5961 stored data. Attempts to store values exceeding this size MUST be 5962 rejected (if peers are inconsistent about this, then strange 5963 artifacts will happen when the zone of responsibility shifts and a 5964 different peer becomes responsible for overlarge data). Because each 5965 Resource-ID/Kind-ID pair is bound to a small set of certificates, 5966 these size restrictions also create a distributed quota mechanism, 5967 with the quotas administered by the central enrollment server. 5969 Allowing different kinds of data to have different size restrictions 5970 allows new usages the flexibility to define limits that fit their 5971 needs without requiring all usages to have expansive limits. 5973 12.5.3. Correctness 5975 Because each stored value is signed, it is trivial for any retrieving 5976 peer to verify the integrity of the stored value. Some more care 5977 needs to be taken to prevent version rollback attacks. Rollback 5978 attacks on storage are prevented by the use of store times and 5979 lifetime values in each store. A lifetime represents the latest time 5980 at which the data is valid and thus limits (though does not 5981 completely prevent) the ability of the storing node to perform a 5982 rollback attack on retrievers. In order to prevent a rollback attack 5983 at the time of the Store request, we require that storage times be 5984 monotonically increasing. Storing peers MUST reject Store requests 5985 with storage times smaller than or equal to those they are currently 5986 storing. In addition, a fetching node which receives a data value 5987 with a storage time older than the result of the previous fetch knows 5988 a rollback has occurred. 5990 12.5.4. Residual Attacks 5992 The mechanisms described here provides a high degree of security, but 5993 some attacks remain possible. Most simply, it is possible for 5994 storing nodes to refuse to store a value (i.e., reject any request). 5995 In addition, a storing node can deny knowledge of values which it has 5996 previously accepted. To some extent these attacks can be ameliorated 5997 by attempting to store to/retrieve from replicas, but a retrieving 5998 client does not know whether it should try this or not, since there 5999 is a cost to doing so. 6001 The certificate-based authentication scheme prevents a single peer 6002 from being able to forge data owned by other peers. Furthermore, 6003 although a subversive peer can refuse to return data resources for 6004 which it is responsible, it cannot return forged data because it 6005 cannot provide authentication for such registrations. Therefore 6006 parallel searches for redundant registrations can mitigate most of 6007 the effects of a compromised peer. The ultimate reliability of such 6008 an overlay is a statistical question based on the replication factor 6009 and the percentage of compromised peers. 6011 In addition, when a kind is multivalued (e.g., an array data model), 6012 the storing node can return only some subset of the values, thus 6013 biasing its responses. This can be countered by using single values 6014 rather than sets, but that makes coordination between multiple 6015 storing agents much more difficult. This is a trade off that must be 6016 made when designing any usage. 6018 12.6. Routing Security 6020 Because the storage security system guarantees (within limits) the 6021 integrity of the stored data, routing security focuses on stopping 6022 the attacker from performing a DOS attack that misroutes requests in 6023 the overlay. There are a few obvious observations to make about 6024 this. First, it is easy to ensure that an attacker is at least a 6025 valid peer in the Overlay Instance. Second, this is a DOS attack 6026 only. Third, if a large percentage of the peers on the Overlay 6027 Instance are controlled by the attacker, it is probably impossible to 6028 perfectly secure against this. 6030 12.6.1. Background 6032 In general, attacks on DHT routing are mounted by the attacker 6033 arranging to route traffic through one or two nodes it controls. In 6034 the Eclipse attack [Eclipse] the attacker tampers with messages to 6035 and from nodes for which it is on-path with respect to a given victim 6036 node. This allows it to pretend to be all the nodes that are 6037 reachable through it. In the Sybil attack [Sybil], the attacker 6038 registers a large number of nodes and is therefore able to capture a 6039 large amount of the traffic through the DHT. 6041 Both the Eclipse and Sybil attacks require the attacker to be able to 6042 exercise control over her Node-IDs. The Sybil attack requires the 6043 creation of a large number of peers. The Eclipse attack requires 6044 that the attacker be able to impersonate specific peers. In both 6045 cases, these attacks are limited by the use of centralized, 6046 certificate-based admission control. 6048 12.6.2. Admissions Control 6050 Admission to a RELOAD Overlay Instance is controlled by requiring 6051 that each peer have a certificate containing its Node-Id. The 6052 requirement to have a certificate is enforced by using certificate- 6053 based mutual authentication on each connection. (Note: the 6054 following only applies when self-signed certificates are not used.) 6055 Whenever a peer connects to another peer, each side automatically 6056 checks that the other has a suitable certificate. These Node-Ids are 6057 randomly assigned by the central enrollment server. This has two 6058 benefits: 6060 o It allows the enrollment server to limit the number of peer IDs 6061 issued to any individual user. 6062 o It prevents the attacker from choosing specific Node-Ids. 6064 The first property allows protection against Sybil attacks (provided 6065 the enrollment server uses strict rate limiting policies). The 6066 second property deters but does not completely prevent Eclipse 6067 attacks. Because an Eclipse attacker must impersonate peers on the 6068 other side of the attacker, he must have a certificate for suitable 6069 Node-Ids, which requires him to repeatedly query the enrollment 6070 server for new certificates, which will match only by chance. From 6071 the attacker's perspective, the difficulty is that if he only has a 6072 small number of certificates, the region of the Overlay Instance he 6073 is impersonating appears to be very sparsely populated by comparison 6074 to the victim's local region. 6076 12.6.3. Peer Identification and Authentication 6078 In general, whenever a peer engages in overlay activity that might 6079 affect the routing table it must establish its identity. This 6080 happens in two ways. First, whenever a peer establishes a direct 6081 connection to another peer it authenticates via certificate-based 6082 mutual authentication. All messages between peers are sent over this 6083 protected channel and therefore the peers can verify the data origin 6084 of the last hop peer for requests and responses without further 6085 cryptography. 6087 In some situations, however, it is desirable to be able to establish 6088 the identity of a peer with whom one is not directly connected. The 6089 most natural case is when a peer Updates its state. At this point, 6090 other peers may need to update their view of the overlay structure, 6091 but they need to verify that the Update message came from the actual 6092 peer rather than from an attacker. To prevent this, all overlay 6093 routing messages are signed by the peer that generated them. 6095 Replay is typically prevented for messages that impact the topology 6096 of the overlay by having the information come directly, or be 6097 verified by, the nodes that claimed to have generated the update. 6098 Data storage replay detection is done by signing time of the node 6099 that generated the signature on the store request thus providing a 6100 time based replay protection but the time synchronization is only 6101 needed between peers that can write to the same location. 6103 12.6.4. Protecting the Signaling 6105 The goal here is to stop an attacker from knowing who is signaling 6106 what to whom. An attacker is unlikely to be able to observe the 6107 activities of a specific individual given the randomization of IDs 6108 and routing based on the present peers discussed above. Furthermore, 6109 because messages can be routed using only the header information, the 6110 actual body of the RELOAD message can be encrypted during 6111 transmission. 6113 There are two lines of defense here. The first is the use of TLS or 6114 DTLS for each communications link between peers. This provides 6115 protection against attackers who are not members of the overlay. The 6116 second line of defense is to digitally sign each message. This 6117 prevents adversarial peers from modifying messages in flight, even if 6118 they are on the routing path. 6120 12.6.5. Residual Attacks 6122 The routing security mechanisms in RELOAD are designed to contain 6123 rather than eliminate attacks on routing. It is still possible for 6124 an attacker to mount a variety of attacks. In particular, if an 6125 attacker is able to take up a position on the overlay routing between 6126 A and B it can make it appear as if B does not exist or is 6127 disconnected. It can also advertise false network metrics in an 6128 attempt to reroute traffic. However, these are primarily DOS 6129 attacks. 6131 The certificate-based security scheme secures the namespace, but if 6132 an individual peer is compromised or if an attacker obtains a 6133 certificate from the CA, then a number of subversive peers can still 6134 appear in the overlay. While these peers cannot falsify responses to 6135 resource queries, they can respond with error messages, effecting a 6136 DoS attack on the resource registration. They can also subvert 6137 routing to other compromised peers. To defend against such attacks, 6138 a resource search must still consist of parallel searches for 6139 replicated registrations. 6141 13. IANA Considerations 6143 This section contains the new code points registered by this 6144 document. [NOTE TO IANA/RFC-EDITOR: Please replace RFC-AAAA with 6145 the RFC number for this specification in the following list.] 6147 13.1. Well-Known URI Registration 6149 IANA will make the following "Well Known URI" registration as 6150 described in [RFC5785]: 6152 [[Note to RFC Editor - this paragraph can be removed before 6153 publication. ]] A review request was sent to 6154 wellknown-uri-review@ietf.org on October 12, 2010. 6156 +----------------------------+----------------------+ 6157 | URI suffix: | p2psip-enroll | 6158 | Change controller: | IETF | 6159 | Specification document(s): | [RFC-AAAA] | 6160 | Related information: | None | 6161 +----------------------------+----------------------+ 6163 13.2. Port Registrations 6165 [[Note to RFC Editor - this paragraph can be removed before 6166 publication. ]] IANA has already allocated a TCP port for the main 6167 peer to peer protocol. This port has the name p2p-sip and the port 6168 number of 6084. IANA will update this registration to be defined for 6169 UDP as well as TCP. 6171 IANA will make the following port registration: 6173 +------------------------------+------------------------------------+ 6174 | Registration Technical | Cullen Jennings | 6175 | Contact | | 6176 | Registration Owner | IETF | 6177 | Transport Protocol | TCP | 6178 | Port Number | TBD | 6179 | Service Name | p2psip-enroll | 6180 | Description | Peer to Peer Infrastructure | 6181 | | Enrollment | 6182 | Reference | [RFC-AAAA] | 6183 +------------------------------+------------------------------------+ 6185 13.3. Overlay Algorithm Types 6187 IANA SHALL create a "RELOAD Overlay Algorithm Type" Registry. 6188 Entries in this registry are strings denoting the names of overlay 6189 algorithms. The registration policy for this registry is RFC 5226 6190 IETF Review. The initial contents of this registry are: 6192 +----------------+----------+ 6193 | Algorithm Name | RFC | 6194 +----------------+----------+ 6195 | chord-reload | RFC-AAAA | 6196 +----------------+----------+ 6198 13.4. Access Control Policies 6200 IANA SHALL create a "RELOAD Access Control Policy" Registry. Entries 6201 in this registry are strings denoting access control policies, as 6202 described in Section 6.3. New entries in this registry SHALL be 6203 registered via RFC 5226 Standards Action. The initial contents of 6204 this registry are: 6206 +-----------------+----------+ 6207 | Access Policy | RFC | 6208 +-----------------+----------+ 6209 | USER-MATCH | RFC-AAAA | 6210 | NODE-MATCH | RFC-AAAA | 6211 | USER-NODE-MATCH | RFC-AAAA | 6212 | NODE-MULTIPLE | RFC-AAAA | 6213 +-----------------+----------+ 6215 13.5. Application-ID 6217 IANA SHALL create a "RELOAD Application-ID" Registry. Entries in 6218 this registry are 16-bit integers denoting application kinds. Code 6219 points in the range 0x0001 to 0x7fff SHALL be registered via RFC 5226 6220 Standards Action. Code points in the range 0x8000 to 0xf000 SHALL be 6221 registered via RFC 5226 Expert Review. Code points in the range 6222 0xf001 to 0xfffe are reserved for private use. The initial contents 6223 of this registry are: 6225 +-------------+----------------+-------------------------------+ 6226 | Application | Application-ID | Specification | 6227 +-------------+----------------+-------------------------------+ 6228 | INVALID | 0 | RFC-AAAA | 6229 | SIP | 5060 | Reserved for use by SIP Usage | 6230 | SIP | 5061 | Reserved for use by SIP Usage | 6231 | Reserved | 0xffff | RFC-AAAA | 6232 +-------------+----------------+-------------------------------+ 6234 13.6. Data Kind-ID 6236 IANA SHALL create a "RELOAD Data Kind-ID" Registry. Entries in this 6237 registry are 32-bit integers denoting data kinds, as described in 6238 Section 4.2. Code points in the range 0x00000001 to 0x7fffffff SHALL 6239 be registered via RFC 5226 Standards Action. Code points in the 6240 range 0x8000000 to 0xf0000000 SHALL be registered via RFC 5226 Expert 6241 Review. Code points in the range 0xf0000001 to 0xfffffffe are 6242 reserved for private use via the kind description mechanism described 6243 in Section 10. The initial contents of this registry are: 6245 +---------------------+------------+----------+ 6246 | Kind | Kind-ID | RFC | 6247 +---------------------+------------+----------+ 6248 | INVALID | 0 | RFC-AAAA | 6249 | TURN_SERVICE | 2 | RFC-AAAA | 6250 | CERTIFICATE_BY_NODE | 3 | RFC-AAAA | 6251 | CERTIFICATE_BY_USER | 16 | RFC-AAAA | 6252 | Reserved | 0x7fffffff | RFC-AAAA | 6253 | Reserved | 0xfffffffe | RFC-AAAA | 6254 +---------------------+------------+----------+ 6256 13.7. Data Model 6258 IANA SHALL create a "RELOAD Data Model" Registry. Entries in this 6259 registry are 8-bit integers denoting data models, as described in 6260 Section 6.2. Code points in this registry SHALL be registered via 6261 RFC 5226 Standards Action. The initial contents of this registry 6262 are: 6264 +--------------+------+----------+ 6265 | Data Model | Code | RFC | 6266 +--------------+------+----------+ 6267 | INVALID | 0 | RFC-AAAA | 6268 | SINGLE_VALUE | 1 | RFC-AAAA | 6269 | ARRAY | 2 | RFC-AAAA | 6270 | DICTIONARY | 3 | RFC-AAAA | 6271 | RESERVED | 255 | RFC-AAAA | 6272 +--------------+------+----------+ 6274 13.8. Message Codes 6276 IANA SHALL create a "RELOAD Message Code" Registry. Entries in this 6277 registry are 16-bit integers denoting method codes as described in 6278 Section 5.3.3. These codes SHALL be registered via RFC 5226 6279 Standards Action. The initial contents of this registry are: 6281 +---------------------------------+----------------+----------+ 6282 | Message Code Name | Code Value | RFC | 6283 +---------------------------------+----------------+----------+ 6284 | invalid | 0 | RFC-AAAA | 6285 | probe_req | 1 | RFC-AAAA | 6286 | probe_ans | 2 | RFC-AAAA | 6287 | attach_req | 3 | RFC-AAAA | 6288 | attach_ans | 4 | RFC-AAAA | 6289 | unused | 5 | | 6290 | unused | 6 | | 6291 | store_req | 7 | RFC-AAAA | 6292 | store_ans | 8 | RFC-AAAA | 6293 | fetch_req | 9 | RFC-AAAA | 6294 | fetch_ans | 10 | RFC-AAAA | 6295 | unused (was remove_req) | 11 | RFC-AAAA | 6296 | unused (was remove_ans) | 12 | RFC-AAAA | 6297 | find_req | 13 | RFC-AAAA | 6298 | find_ans | 14 | RFC-AAAA | 6299 | join_req | 15 | RFC-AAAA | 6300 | join_ans | 16 | RFC-AAAA | 6301 | leave_req | 17 | RFC-AAAA | 6302 | leave_ans | 18 | RFC-AAAA | 6303 | update_req | 19 | RFC-AAAA | 6304 | update_ans | 20 | RFC-AAAA | 6305 | route_query_req | 21 | RFC-AAAA | 6306 | route_query_ans | 22 | RFC-AAAA | 6307 | ping_req | 23 | RFC-AAAA | 6308 | ping_ans | 24 | RFC-AAAA | 6309 | stat_req | 25 | RFC-AAAA | 6310 | stat_ans | 26 | RFC-AAAA | 6311 | unused (was attachlite_req) | 27 | RFC-AAAA | 6312 | unused (was attachlite_ans) | 28 | RFC-AAAA | 6313 | app_attach_req | 29 | RFC-AAAA | 6314 | app_attach_ans | 30 | RFC-AAAA | 6315 | unused (was app_attachlite_req) | 31 | RFC-AAAA | 6316 | unused (was app_attachlite_ans) | 32 | RFC-AAAA | 6317 | config_update_req | 33 | RFC-AAAA | 6318 | config_update_ans | 34 | RFC-AAAA | 6319 | reserved | 0x8000..0xfffe | RFC-AAAA | 6320 | error | 0xffff | RFC-AAAA | 6321 +---------------------------------+----------------+----------+ 6323 13.9. Error Codes 6325 IANA SHALL create a "RELOAD Error Code" Registry. Entries in this 6326 registry are 16-bit integers denoting error codes. New entries SHALL 6327 be defined via RFC 5226 Standards Action. The initial contents of 6328 this registry are: 6330 +-------------------------------------+----------------+----------+ 6331 | Error Code Name | Code Value | RFC | 6332 +-------------------------------------+----------------+----------+ 6333 | invalid | 0 | RFC-AAAA | 6334 | Unused | 1 | RFC-AAAA | 6335 | Error_Forbidden | 2 | RFC-AAAA | 6336 | Error_Not_Found | 3 | RFC-AAAA | 6337 | Error_Request_Timeout | 4 | RFC-AAAA | 6338 | Error_Generation_Counter_Too_Low | 5 | RFC-AAAA | 6339 | Error_Incompatible_with_Overlay | 6 | RFC-AAAA | 6340 | Error_Unsupported_Forwarding_Option | 7 | RFC-AAAA | 6341 | Error_Data_Too_Large | 8 | RFC-AAAA | 6342 | Error_Data_Too_Old | 9 | RFC-AAAA | 6343 | Error_TTL_Exceeded | 10 | RFC-AAAA | 6344 | Error_Message_Too_Large | 11 | RFC-AAAA | 6345 | Error_Unknown_Kind | 12 | RFC-AAAA | 6346 | Error_Unknown_Extension | 13 | RFC-AAAA | 6347 | Error_Response_Too_Large | 14 | RFC-AAAA | 6348 | Error_Config_Too_Old | 15 | RFC-AAAA | 6349 | Error_Config_Too_New | 16 | RFC-AAAA | 6350 | reserved | 0x8000..0xfffe | RFC-AAAA | 6351 +-------------------------------------+----------------+----------+ 6353 13.10. Overlay Link Types 6355 IANA shall create a "RELOAD Overlay Link." New entries SHALL be 6356 defined via RFC 5226 Standards Action. This registry SHALL be 6357 initially populated with the following values: 6359 +--------------------+------+---------------+ 6360 | Protocol | Code | Specification | 6361 +--------------------+------+---------------+ 6362 | reserved | 0 | RFC-AAAA | 6363 | DTLS-UDP-SR | 1 | RFC-AAAA | 6364 | DTLS-UDP-SR-NO-ICE | 3 | RFC-AAAA | 6365 | TLS-TCP-FH-NO-ICE | 4 | RFC-AAAA | 6366 | reserved | 255 | RFC-AAAA | 6367 +--------------------+------+---------------+ 6369 13.11. Overlay Link Protocols 6371 IANA shall create an "Overlay Link Protocol Registry". Entries in 6372 this registry SHALL be defined via RFC 5226 Standards Action. This 6373 registry SHALL be initially populated with the following value: 6374 "TLS". 6376 13.12. Forwarding Options 6378 IANA shall create a "Forwarding Option Registry". Entries in this 6379 registry between 1 and 127 SHALL be defined via RFC 5226 Standards 6380 Action. Entries in this registry between 128 and 254 SHALL be 6381 defined via RFC 5226 Specification Required. This registry SHALL be 6382 initially populated with the following values: 6384 +-------------------+------+---------------+ 6385 | Forwarding Option | Code | Specification | 6386 +-------------------+------+---------------+ 6387 | invalid | 0 | RFC-AAAA | 6388 | reserved | 255 | RFC-AAAA | 6389 +-------------------+------+---------------+ 6391 13.13. Probe Information Types 6393 IANA shall create a "RELOAD Probe Information Type Registry". 6394 Entries in this registry SHALL be defined via RFC 5226 Standards 6395 Action. This registry SHALL be initially populated with the 6396 following values: 6398 +-----------------+------+---------------+ 6399 | Probe Option | Code | Specification | 6400 +-----------------+------+---------------+ 6401 | invalid | 0 | RFC-AAAA | 6402 | responsible_set | 1 | RFC-AAAA | 6403 | num_resources | 2 | RFC-AAAA | 6404 | uptime | 3 | RFC-AAAA | 6405 | reserved | 255 | RFC-AAAA | 6406 +-----------------+------+---------------+ 6408 13.14. Message Extensions 6410 IANA shall create a "RELOAD Extensions Registry". Entries in this 6411 registry SHALL be defined via RFC 5226 Specification Required. This 6412 registry SHALL be initially populated with the following values: 6414 +-----------------+--------+---------------+ 6415 | Extensions Name | Code | Specification | 6416 +-----------------+--------+---------------+ 6417 | invalid | 0 | RFC-AAAA | 6418 | reserved | 0xFFFF | RFC-AAAA | 6419 +-----------------+--------+---------------+ 6421 13.15. reload URI Scheme 6423 This section describes the scheme for a reload URI, which can be used 6424 to refer to either: 6426 o A peer. 6427 o A resource inside a peer. 6429 The reload URI is defined using a subset of the URI schema specified 6430 in Appendix A of RFC 3986 [RFC3986] and the associated URI Guidelines 6431 [RFC4395] per the following ABNF syntax: 6433 RELOAD-URI = "reload://" destination "@" overlay "/" 6434 [specifier] 6436 destination = 1 * HEXDIG 6437 overlay = reg-name 6438 specifier = 1*HEXDIG 6440 The definitions of these productions are as follows: 6442 destination: a hex-encoded Destination List object. 6444 overlay: the name of the overlay. 6446 specifier : a hex-encoded StoredDataSpecifier indicating the data 6447 element. 6449 If no specifier is present then this URI addresses the peer which can 6450 be reached via the indicated destination list at the indicated 6451 overlay name. If a specifier is present, then the URI addresses the 6452 data value. 6454 13.15.1. URI Registration 6456 [[ Note to RFC Editor - please remove this paragraph before 6457 publication. ]] Review request was sent to uri-review@ietf.org on Oct 6458 7, 2010. 6460 The following summarizes the information necessary to register the 6461 reload URI. 6463 URI Scheme Name: reload 6464 Status: permanent 6465 URI Scheme Syntax: see Section 13.15 of RFC-AAAA 6466 URI Scheme Semantics: The reload URI is intended to be used as a 6467 reference to a RELOAD peer or resource. 6468 Encoding Considerations: The reload URI is not intended to be human- 6469 readable text, so it is encoded entirely in US-ASCII. 6470 Applications/protocols that use this URI scheme: The RELOAD protocol 6471 described in RFC-AAAA. 6472 Interoperability considerations: See RFC-AAAA. 6473 Security considerations: See RFC-AAAA 6474 Contact: Cullen Jennings 6475 Author/Change controller: IESG 6476 References: RFC-AAAA 6478 14. Acknowledgments 6480 This specification is a merge of the "REsource LOcation And Discovery 6481 (RELOAD)" draft by David A. Bryan, Marcia Zangrilli and Bruce B. 6482 Lowekamp, the "Address Settlement by Peer to Peer" draft by Cullen 6483 Jennings, Jonathan Rosenberg, and Eric Rescorla, the "Security 6484 Extensions for RELOAD" draft by Bruce B. Lowekamp and James Deverick, 6485 the "A Chord-based DHT for Resource Lookup in P2PSIP" by Marcia 6486 Zangrilli and David A. Bryan, and the Peer-to-Peer Protocol (P2PP) 6487 draft by Salman A. Baset, Henning Schulzrinne, and Marcin 6488 Matuszewski. Thanks to the authors of RFC 5389 for text included 6489 from that. Vidya Narayanan provided many comments and improvements. 6491 The ideas and text for the Chord specific extension data to the Leave 6492 mechanisms was provided by J. Maenpaa, G. Camarillo, and J. 6493 Hautakorpi. 6495 Thanks to the many people who contributed including Ted Hardie, 6496 Michael Chen, Dan York, Das Saumitra, Lyndsay Campbell, Brian Rosen, 6497 David Bryan, Dave Craig, and Julian Cain. Extensive working last 6498 call comments were provided by: Jouni Maenpaa, Roni Even, Ari 6499 Keranen, John Buford, Michael Chen, Frederic-Philippe Met, and David 6500 Bryan. 6502 15. References 6504 15.1. Normative References 6506 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 6507 Requirement Levels", BCP 14, RFC 2119, March 1997. 6509 [RFC5245] Rosenberg, J., "Interactive Connectivity Establishment 6510 (ICE): A Protocol for Network Address Translator (NAT) 6511 Traversal for Offer/Answer Protocols", RFC 5245, 6512 April 2010. 6514 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, 6515 "Session Traversal Utilities for NAT (STUN)", RFC 5389, 6516 October 2008. 6518 [RFC5766] Mahy, R., Matthews, P., and J. Rosenberg, "Traversal Using 6519 Relays around NAT (TURN): Relay Extensions to Session 6520 Traversal Utilities for NAT (STUN)", RFC 5766, April 2010. 6522 [RFC5273] Schaad, J. and M. Myers, "Certificate Management over CMS 6523 (CMC): Transport Protocols", RFC 5273, June 2008. 6525 [RFC5272] Schaad, J. and M. Myers, "Certificate Management over CMS 6526 (CMC)", RFC 5272, June 2008. 6528 [RFC4279] Eronen, P. and H. Tschofenig, "Pre-Shared Key Ciphersuites 6529 for Transport Layer Security (TLS)", RFC 4279, 6530 December 2005. 6532 [RFC5246] Dierks, T. and E. Rescorla, "The Transport Layer Security 6533 (TLS) Protocol Version 1.2", RFC 5246, August 2008. 6535 [RFC4347] Rescorla, E. and N. Modadugu, "Datagram Transport Layer 6536 Security", RFC 4347, April 2006. 6538 [RFC5348] Floyd, S., Handley, M., Padhye, J., and J. Widmer, "TCP 6539 Friendly Rate Control (TFRC): Protocol Specification", 6540 RFC 5348, September 2008. 6542 [RFC4648] Josefsson, S., "The Base16, Base32, and Base64 Data 6543 Encodings", RFC 4648, October 2006. 6545 [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission 6546 Timer", RFC 2988, November 2000. 6548 [RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform 6549 Resource Identifier (URI): Generic Syntax", STD 66, 6550 RFC 3986, January 2005. 6552 [RFC4395] Hansen, T., Hardie, T., and L. Masinter, "Guidelines and 6553 Registration Procedures for New URI Schemes", BCP 35, 6554 RFC 4395, February 2006. 6556 [RFC2818] Rescorla, E., "HTTP Over TLS", RFC 2818, May 2000. 6558 [RFC3447] Jonsson, J. and B. Kaliski, "Public-Key Cryptography 6559 Standards (PKCS) #1: RSA Cryptography Specifications 6560 Version 2.1", RFC 3447, February 2003. 6562 [RFC5952] Kawamura, S. and M. Kawashima, "A Recommendation for IPv6 6563 Address Text Representation", RFC 5952, August 2010. 6565 15.2. Informative References 6567 [UnixTime] 6568 Wikipedia, "Unix Time", . 6571 [I-D.ietf-mmusic-ice-tcp] 6572 Rosenberg, J., Keranen, A., Lowekamp, B., and A. Roach, 6573 "TCP Candidates with Interactive Connectivity 6574 Establishment (ICE)", draft-ietf-mmusic-ice-tcp-11 (work 6575 in progress), November 2010. 6577 [I-D.maenpaa-p2psip-self-tuning] 6578 Maenpaa, J., Camarillo, G., and J. Hautakorpi, "A Self- 6579 tuning Distributed Hash Table (DHT) for REsource LOcation 6580 And Discovery (RELOAD)", 6581 draft-maenpaa-p2psip-self-tuning-01 (work in progress), 6582 October 2009. 6584 [I-D.baset-tsvwg-tcp-over-udp] 6585 Baset, S. and H. Schulzrinne, "TCP-over-UDP", 6586 draft-baset-tsvwg-tcp-over-udp-01 (work in progress), 6587 June 2009. 6589 [RFC5201] Moskowitz, R., Nikander, P., Jokela, P., and T. Henderson, 6590 "Host Identity Protocol", RFC 5201, April 2008. 6592 [RFC4828] Floyd, S. and E. Kohler, "TCP Friendly Rate Control 6593 (TFRC): The Small-Packet (SP) Variant", RFC 4828, 6594 April 2007. 6596 [I-D.ietf-p2psip-concepts] 6597 Bryan, D., Matthews, P., Shim, E., Willis, D., and S. 6598 Dawkins, "Concepts and Terminology for Peer to Peer SIP", 6599 draft-ietf-p2psip-concepts-03 (work in progress), 6600 October 2010. 6602 [RFC1122] Braden, R., "Requirements for Internet Hosts - 6603 Communication Layers", STD 3, RFC 1122, October 1989. 6605 [RFC4145] Yon, D. and G. Camarillo, "TCP-Based Media Transport in 6606 the Session Description Protocol (SDP)", RFC 4145, 6607 September 2005. 6609 [RFC4086] Eastlake, D., Schiller, J., and S. Crocker, "Randomness 6610 Requirements for Security", BCP 106, RFC 4086, June 2005. 6612 [RFC5054] Taylor, D., Wu, T., Mavrogiannopoulos, N., and T. Perrin, 6613 "Using the Secure Remote Password (SRP) Protocol for TLS 6614 Authentication", RFC 5054, November 2007. 6616 [RFC5280] Cooper, D., Santesson, S., Farrell, S., Boeyen, S., 6617 Housley, R., and W. Polk, "Internet X.509 Public Key 6618 Infrastructure Certificate and Certificate Revocation List 6619 (CRL) Profile", RFC 5280, May 2008. 6621 [I-D.pascual-p2psip-clients] 6622 Pascual, V., Matuszewski, M., Shim, E., Zhang, H., and S. 6623 Yongchao, "P2PSIP Clients", 6624 draft-pascual-p2psip-clients-01 (work in progress), 6625 February 2008. 6627 [RFC4787] Audet, F. and C. Jennings, "Network Address Translation 6628 (NAT) Behavioral Requirements for Unicast UDP", BCP 127, 6629 RFC 4787, January 2007. 6631 [RFC2311] Dusse, S., Hoffman, P., Ramsdell, B., Lundblade, L., and 6632 L. Repka, "S/MIME Version 2 Message Specification", 6633 RFC 2311, March 1998. 6635 [I-D.jiang-p2psip-sep] 6636 Jiang, X. and H. Zhang, "Service Extensible P2P Peer 6637 Protocol", draft-jiang-p2psip-sep-01 (work in progress), 6638 February 2008. 6640 [RFC5785] Nottingham, M. and E. Hammer-Lahav, "Defining Well-Known 6641 Uniform Resource Identifiers (URIs)", RFC 5785, 6642 April 2010. 6644 [I-D.ietf-hip-reload-instance] 6645 Keranen, A., Camarillo, G., and J. Maenpaa, "Host Identity 6646 Protocol-Based Overlay Networking Environment (HIP BONE) 6647 Instance Specification for REsource LOcation And Discovery 6648 (RELOAD)", draft-ietf-hip-reload-instance-02 (work in 6649 progress), July 2010. 6651 [I-D.ietf-hip-bone] 6652 Camarillo, G., Nikander, P., Hautakorpi, J., Keranen, A., 6653 and A. Johnston, "HIP BONE: Host Identity Protocol (HIP) 6654 Based Overlay Networking Environment", 6655 draft-ietf-hip-bone-07 (work in progress), June 2010. 6657 [I-D.ietf-p2psip-sip] 6658 Jennings, C., Lowekamp, B., Rescorla, E., Baset, S., and 6659 H. Schulzrinne, "A SIP Usage for RELOAD", 6660 draft-ietf-p2psip-sip-05 (work in progress), July 2010. 6662 [Sybil] Douceur, J., "The Sybil Attack", IPTPS 02, March 2002. 6664 [Eclipse] Singh, A., Ngan, T., Druschel, T., and D. Wallach, 6665 "Eclipse Attacks on Overlay Networks: Threats and 6666 Defenses", INFOCOM 2006, April 2006. 6668 [non-transitive-dhts-worlds05] 6669 Freedman, M., Lakshminarayanan, K., Rhea, S., and I. 6670 Stoica, "Non-Transitive Connectivity and DHTs", 6671 WORLDS'05. 6673 [lookups-churn-p2p06] 6674 Wu, D., Tian, Y., and K. Ng, "Analytical Study on 6675 Improving DHT Lookup Performance under Churn", IEEE 6676 P2P'06. 6678 [bryan-design-hotp2p08] 6679 Bryan, D., Lowekamp, B., and M. Zangrilli, "The Design of 6680 a Versatile, Secure P2PSIP Communications Architecture for 6681 the Public Internet", Hot-P2P'08. 6683 [opendht-sigcomm05] 6684 Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., 6685 Ratnasamy, S., Shenker, S., Stoica, I., and H. Yu, 6686 "OpenDHT: A Public DHT and its Uses", SIGCOMM'05. 6688 [Chord] Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., 6689 Kaashoek, M., Dabek, F., and H. Balakrishnan, "Chord: A 6690 Scalable Peer-to-peer Lookup Protocol for Internet 6691 Applications", IEEE/ACM Transactions on Networking Volume 6692 11, Issue 1, 17-32, Feb 2003. 6694 [vulnerabilities-acsac04] 6695 Srivatsa, M. and L. Liu, "Vulnerabilities and Security 6696 Threats in Structured Peer-to-Peer Systems: A Quantitative 6697 Analysis", ACSAC 2004. 6699 [RFC5765] Schulzrinne, H., Marocco, E., and E. Ivov, "Security 6700 Issues and Solutions in Peer-to-Peer Systems for Realtime 6701 Communications", RFC 5765, February 2010. 6703 [handling-churn-usenix04] 6704 Rhea, S., Geels, D., Roscoe, T., and J. Kubiatowicz, 6705 "Handling Churn in a DHT", In Proc. of the USENIX Annual 6706 Technical Conference June 2004 USENIX 2004. 6708 [minimizing-churn-sigcomm06] 6709 Godfrey, P., Shenker, S., and I. Stoica, "Minimizing Churn 6710 in Distributed Systems", SIGCOMM 2006. 6712 Appendix A. Routing Alternatives 6714 Significant discussion has been focused on the selection of a routing 6715 algorithm for P2PSIP. This section discusses the motivations for 6716 selecting symmetric recursive routing for RELOAD and describes the 6717 extensions that would be required to support additional routing 6718 algorithms. 6720 A.1. Iterative vs Recursive 6722 Iterative routing has a number of advantages. It is easier to debug, 6723 consumes fewer resources on intermediate peers, and allows the 6724 querying peer to identify and route around misbehaving peers 6725 [non-transitive-dhts-worlds05]. However, in the presence of NATs, 6726 iterative routing is intolerably expensive because a new connection 6727 must be established for each hop (using ICE) [bryan-design-hotp2p08]. 6729 Iterative routing is supported through the Route_Query mechanism and 6730 is primarily intended for debugging. It also allows the querying 6731 peer to evaluate the routing decisions made by the peers at each hop, 6732 consider alternatives, and perhaps detect at what point the 6733 forwarding path fails. 6735 A.2. Symmetric vs Forward response 6737 An alternative to the symmetric recursive routing method used by 6738 RELOAD is Forward-Only routing, where the response is routed to the 6739 requester as if it were a new message initiated by the responder (in 6740 the previous example, Z sends the response to A as if it were sending 6741 a request). Forward-only routing requires no state in either the 6742 message or intermediate peers. 6744 The drawback of forward-only routing is that it does not work when 6745 the overlay is unstable. For example, if A is in the process of 6746 joining the overlay and is sending a Join request to Z, it is not yet 6747 reachable via forward routing. Even if it is established in the 6748 overlay, if network failures produce temporary instability, A may not 6749 be reachable (and may be trying to stabilize its network connectivity 6750 via Attach messages). 6752 Furthermore, forward-only responses are less likely to reach the 6753 querying peer than symmetric recursive ones are, because the forward 6754 path is more likely to have a failed peer than is the request path 6755 (which was just tested to route the request) 6756 [non-transitive-dhts-worlds05]. 6758 An extension to RELOAD that supports forward-only routing but relies 6759 on symmetric responses as a fallback would be possible, but due to 6760 the complexities of determining when to use forward-only and when to 6761 fallback to symmetric, we have chosen not to include it as an option 6762 at this point. 6764 A.3. Direct Response 6766 Another routing option is Direct Response routing, in which the 6767 response is returned directly to the querying node. In the previous 6768 example, if A encodes its IP address in the request, then Z can 6769 simply deliver the response directly to A. In the absence of NATs or 6770 other connectivity issues, this is the optimal routing technique. 6772 The challenge of implementing direct response is the presence of 6773 NATs. There are a number of complexities that must be addressed. In 6774 this discussion, we will continue our assumption that A issued the 6775 request and Z is generating the response. 6777 o The IP address listed by A may be unreachable, either due to NAT 6778 or firewall rules. Therefore, a direct response technique must 6779 fallback to symmetric response [non-transitive-dhts-worlds05]. 6780 The hop-by-hop ACKs used by RELOAD allow Z to determine when A has 6781 received the message (and the TLS negotiation will provide earlier 6782 confirmation that A is reachable), but this fallback requires a 6783 timeout that will increase the response latency whenever A is not 6784 reachable from Z. 6785 o Whenever A is behind a NAT it will have multiple candidate IP 6786 addresses, each of which must be advertised to ensure 6787 connectivity; therefore Z will need to attempt multiple 6788 connections to deliver the response. 6789 o One (or all) of A's candidate addresses may route from Z to a 6790 different device on the Internet. In the worst case these nodes 6791 may actually be running RELOAD on the same port. Therefore, it is 6792 absolutely necessary to establish a secure connection to 6793 authenticate A before delivering the response. This step 6794 diminishes the efficiency of direct response because multiple 6795 roundtrips are required before the message can be delivered. 6797 o If A is behind a NAT and does not have a connection already 6798 established with Z, there are only two ways the direct response 6799 will work. The first is that A and Z both be behind the same NAT, 6800 in which case the NAT is not involved. In the more common case, 6801 when Z is outside A's NAT, the response will only be received if 6802 A's NAT implements endpoint-independent filtering. As the choice 6803 of filtering mode conflates application transparency with security 6804 [RFC4787], and no clear recommendation is available, the 6805 prevalence of this feature in future devices remains unclear. 6807 An extension to RELOAD that supports direct response routing but 6808 relies on symmetric responses as a fallback would be possible, but 6809 due to the complexities of determining when to use direct response 6810 and when to fallback to symmetric, and the reduced performance for 6811 responses to peers behind restrictive NATs, we have chosen not to 6812 include it as an option at this point. 6814 A.4. Relay Peers 6816 SEP [I-D.jiang-p2psip-sep] has proposed implementing a form of direct 6817 response by having A identify a peer, Q, that will be directly 6818 reachable by any other peer. A uses Attach to establish a connection 6819 with Q and advertises Q's IP address in the request sent to Z. Z 6820 sends the response to Q, which relays it to A. This then reduces the 6821 latency to two hops, plus Z negotiating a secure connection to Q. 6823 This technique relies on the relative population of nodes such as A 6824 that require relay peers and peers such as Q that are capable of 6825 serving as a relay peer. It also requires nodes to be able to 6826 identify which category they are in. This identification problem has 6827 turned out to be hard to solve and is still an open area of 6828 exploration. 6830 An extension to RELOAD that supports relay peers is possible, but due 6831 to the complexities of implementing such an alternative, we have not 6832 added such a feature to RELOAD at this point. 6834 A concept similar to relay peers, essentially choosing a relay peer 6835 at random, has previously been suggested to solve problems of 6836 pairwise non-transitivity [non-transitive-dhts-worlds05], but 6837 deterministic filtering provided by NATs makes random relay peers no 6838 more likely to work than the responding peer. 6840 A.5. Symmetric Route Stability 6842 A common concern about symmetric recursive routing has been that one 6843 or more peers along the request path may fail before the response is 6844 received. The significance of this problem essentially depends on 6845 the response latency of the overlay. An overlay that produces slow 6846 responses will be vulnerable to churn, whereas responses that are 6847 delivered very quickly are vulnerable only to failures that occur 6848 over that small interval. 6850 The other aspect of this issue is whether the request itself can be 6851 successfully delivered. Assuming typical connection maintenance 6852 intervals, the time period between the last maintenance and the 6853 request being sent will be orders of magnitude greater than the delay 6854 between the request being forwarded and the response being received. 6855 Therefore, if the path was stable enough to be available to route the 6856 request, it is almost certainly going to remain available to route 6857 the response. 6859 An overlay that is unstable enough to suffer this type of failure 6860 frequently is unlikely to be able to support reliable functionality 6861 regardless of the routing mechanism. However, regardless of the 6862 stability of the return path, studies show that in the event of high 6863 churn, iterative routing is a better solution to ensure request 6864 completion [lookups-churn-p2p06] [non-transitive-dhts-worlds05] 6866 Finally, because RELOAD retries the end-to-end request, that retry 6867 will address the issues of churn that remain. 6869 Appendix B. Why Clients? 6871 There are a wide variety of reasons a node may act as a client rather 6872 than as a peer [I-D.pascual-p2psip-clients]. This section outlines 6873 some of those scenarios and how the client's behavior changes based 6874 on its capabilities. 6876 B.1. Why Not Only Peers? 6878 For a number of reasons, a particular node may be forced to act as a 6879 client even though it is willing to act as a peer. These include: 6881 o The node does not have appropriate network connectivity, typically 6882 because it has a low-bandwidth network connection. 6883 o The node may not have sufficient resources, such as computing 6884 power, storage space, or battery power. 6885 o The overlay algorithm may dictate specific requirements for peer 6886 selection. These may include participating in the overlay to 6887 determine trustworthiness; controlling the number of peers in the 6888 overlay to reduce overly-long routing paths; or ensuring minimum 6889 application uptime before a node can join as a peer. 6891 The ultimate criteria for a node to become a peer are determined by 6892 the overlay algorithm and specific deployment. A node acting as a 6893 client that has a full implementation of RELOAD and the appropriate 6894 overlay algorithm is capable of locating its responsible peer in the 6895 overlay and using Attach to establish a direct connection to that 6896 peer. In that way, it may elect to be reachable under either of the 6897 routing approaches listed above. Particularly for overlay algorithms 6898 that elect nodes to serve as peers based on trustworthiness or 6899 population, the overlay algorithm may require such a client to locate 6900 itself at a particular place in the overlay. 6902 B.2. Clients as Application-Level Agents 6904 SIP defines an extensive protocol for registration and security 6905 between a client and its registrar/proxy server(s). Any SIP device 6906 can act as a client of a RELOAD-based P2PSIP overlay if it contacts a 6907 peer that implements the server-side functionality required by the 6908 SIP protocol. In this case, the peer would be acting as if it were 6909 the user's peer, and would need the appropriate credentials for that 6910 user. 6912 Application-level support for clients is defined by a usage. A usage 6913 offering support for application-level clients should specify how the 6914 security of the system is maintained when the data is moved between 6915 the application and RELOAD layers. 6917 Authors' Addresses 6919 Cullen Jennings 6920 Cisco 6921 170 West Tasman Drive 6922 MS: SJC-21/2 6923 San Jose, CA 95134 6924 USA 6926 Phone: +1 408 421-9990 6927 Email: fluffy@cisco.com 6929 Bruce B. Lowekamp (editor) 6930 Skype 6931 Palo Alto, CA 6932 USA 6934 Email: bbl@lowekamp.net 6935 Eric Rescorla 6936 Skype 6937 8000 Marina Blvd 6938 Brisbane, CA 94005 6939 USA 6941 Phone: +1 650 678 2350 6942 Email: ekr@skype.net 6944 Salman A. Baset 6945 Columbia University 6946 1214 Amsterdam Avenue 6947 New York, NY 6948 USA 6950 Email: salman@cs.columbia.edu 6952 Henning Schulzrinne 6953 Columbia University 6954 1214 Amsterdam Avenue 6955 New York, NY 6956 USA 6958 Email: hgs@cs.columbia.edu