idnits 2.17.1 draft-ietf-p2psip-base-26.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- -- The document has examples using IPv4 documentation addresses according to RFC6890, but does not use any IPv6 documentation addresses. Maybe there should be IPv6 examples, too? Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 2125 has weird spacing: '...gOption opti...' == Line 2390 has weird spacing: '...ionType type;...' == Line 2646 has weird spacing: '...tyValue ide...' == Line 2986 has weird spacing: '...ionType typ...' == Line 2988 has weird spacing: '...ionData val...' == (3 more instances...) -- The document seems to contain a disclaimer for pre-RFC5378 work, and may have content which was first submitted before 10 November 2008. The disclaimer is necessary when there are original authors that you have been unable to contact, or if some do not wish to grant the BCP78 rights to the IETF Trust. If you are able to get all authors (current and original) to grant those rights, you can and should remove the disclaimer; otherwise, the disclaimer is needed and you can ignore this comment. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (February 24, 2013) is 4072 days in the past. Is this intentional? 
-- Found something which looks like a code comment -- if you have code sections in the document, please surround them with '' and '' lines. Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: 'A' is mentioned on line 1801, but not defined == Missing Reference: 'B' is mentioned on line 1801, but not defined == Missing Reference: 'C' is mentioned on line 1801, but not defined == Missing Reference: 'X' is mentioned on line 1785, but not defined == Missing Reference: 'D' is mentioned on line 1798, but not defined == Missing Reference: 'I' is mentioned on line 1798, but not defined == Missing Reference: 'NodeIdLength' is mentioned on line 2019, but not defined -- Looks like a reference, but probably isn't: '0' on line 5117 == Missing Reference: 'RFC-to-be' is mentioned on line 7394, but not defined ** Obsolete normative reference: RFC 2388 (Obsoleted by RFC 7578) ** Obsolete normative reference: RFC 2818 (Obsoleted by RFC 9110) ** Obsolete normative reference: RFC 3023 (Obsoleted by RFC 7303) ** Downref: Normative reference to an Informational RFC: RFC 3174 ** Obsolete normative reference: RFC 3447 (Obsoleted by RFC 8017) ** Obsolete normative reference: RFC 4395 (Obsoleted by RFC 7595) ** Obsolete normative reference: RFC 5245 (Obsoleted by RFC 8445, RFC 8839) ** Obsolete normative reference: RFC 5246 (Obsoleted by RFC 8446) ** Obsolete normative reference: RFC 5389 (Obsoleted by RFC 8489) ** Obsolete normative reference: RFC 5405 (Obsoleted by RFC 8085) ** Obsolete normative reference: RFC 5766 (Obsoleted by RFC 8656) ** Downref: Normative reference to an Informational RFC: RFC 6091 ** Downref: Normative reference to an Informational RFC: RFC 6234 ** Obsolete normative reference: RFC 6347 (Obsoleted by RFC 9147) == Outdated reference: A later version (-10) 
exists of draft-ietf-hip-reload-instance-06 == Outdated reference: A later version (-22) exists of draft-ietf-p2psip-diagnostics-09 == Outdated reference: A later version (-11) exists of draft-ietf-p2psip-rpr-03 == Outdated reference: A later version (-15) exists of draft-ietf-p2psip-self-tuning-06 == Outdated reference: A later version (-15) exists of draft-ietf-p2psip-service-discovery-06 == Outdated reference: A later version (-21) exists of draft-ietf-p2psip-sip-08 -- Obsolete informational reference (is this intentional?): RFC 4013 (Obsoleted by RFC 7613) -- Obsolete informational reference (is this intentional?): RFC 4960 (Obsoleted by RFC 9260) -- Obsolete informational reference (is this intentional?): RFC 5201 (Obsoleted by RFC 7401) -- Obsolete informational reference (is this intentional?): RFC 5785 (Obsoleted by RFC 8615) Summary: 14 errors (**), 0 flaws (~~), 21 warnings (==), 9 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 P2PSIP C. Jennings 3 Internet-Draft Cisco 4 Intended status: Standards Track B. Lowekamp, Ed. 5 Expires: August 28, 2013 Skype 6 E. Rescorla 7 RTFM, Inc. 8 S. Baset 9 H. Schulzrinne 10 Columbia University 11 February 24, 2013 13 REsource LOcation And Discovery (RELOAD) Base Protocol 14 draft-ietf-p2psip-base-26 16 Abstract 18 This specification defines REsource LOcation And Discovery (RELOAD), 19 a peer-to-peer (P2P) signaling protocol for use on the Internet. A 20 P2P signaling protocol provides its clients with an abstract storage 21 and messaging service between a set of cooperating peers that form 22 the overlay network. RELOAD is designed to support a P2P Session 23 Initiation Protocol (P2PSIP) network, but can be utilized by other 24 applications with similar requirements by defining new usages that 25 specify the kinds of data that need to be stored for a particular 26 application. 
RELOAD defines a security model based on a certificate 27 enrollment service that provides unique identities. NAT traversal is 28 a fundamental service of the protocol. RELOAD also allows access 29 from "client" nodes that do not need to route traffic or store data 30 for others. 32 Status of This Memo 34 This Internet-Draft is submitted in full conformance with the 35 provisions of BCP 78 and BCP 79. 37 Internet-Drafts are working documents of the Internet Engineering 38 Task Force (IETF). Note that other groups may also distribute 39 working documents as Internet-Drafts. The list of current Internet- 40 Drafts is at http://datatracker.ietf.org/drafts/current/. 42 Internet-Drafts are draft documents valid for a maximum of six months 43 and may be updated, replaced, or obsoleted by other documents at any 44 time. It is inappropriate to use Internet-Drafts as reference 45 material or to cite them other than as "work in progress." 47 This Internet-Draft will expire on August 28, 2013. 49 Copyright Notice 51 Copyright (c) 2013 IETF Trust and the persons identified as the 52 document authors. All rights reserved. 54 This document is subject to BCP 78 and the IETF Trust's Legal 55 Provisions Relating to IETF Documents 56 (http://trustee.ietf.org/license-info) in effect on the date of 57 publication of this document. Please review these documents 58 carefully, as they describe your rights and restrictions with respect 59 to this document. Code Components extracted from this document must 60 include Simplified BSD License text as described in Section 4.e of 61 the Trust Legal Provisions and are provided without warranty as 62 described in the Simplified BSD License. 64 This document may contain material from IETF Documents or IETF 65 Contributions published or made publicly available before November 66 10, 2008. 
The person(s) controlling the copyright in some of this 67 material may not have granted the IETF Trust the right to allow 68 modifications of such material outside the IETF Standards Process. 69 Without obtaining an adequate license from the person(s) controlling 70 the copyright in such materials, this document may not be modified 71 outside the IETF Standards Process, and derivative works of it may 72 not be created outside the IETF Standards Process, except to format 73 it for publication as an RFC or to translate it into languages other 74 than English. 76 Table of Contents 78 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 7 79 1.1. Basic Setting . . . . . . . . . . . . . . . . . . . . . 8 80 1.2. Architecture . . . . . . . . . . . . . . . . . . . . . . 10 81 1.2.1. Usage Layer . . . . . . . . . . . . . . . . . . . . 13 82 1.2.2. Message Transport . . . . . . . . . . . . . . . . . 13 83 1.2.3. Storage . . . . . . . . . . . . . . . . . . . . . . 14 84 1.2.4. Topology Plugin . . . . . . . . . . . . . . . . . . 15 85 1.2.5. Forwarding and Link Management Layer . . . . . . . . 15 86 1.3. Security . . . . . . . . . . . . . . . . . . . . . . . . 16 87 1.4. Structure of This Document . . . . . . . . . . . . . . . 17 88 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 18 89 3. Overlay Management Overview . . . . . . . . . . . . . . . . . 22 90 3.1. Security and Identification . . . . . . . . . . . . . . 22 91 3.1.1. Shared-Key Security . . . . . . . . . . . . . . . . 24 92 3.2. Clients . . . . . . . . . . . . . . . . . . . . . . . . 24 93 3.2.1. Client Routing . . . . . . . . . . . . . . . . . . . 25 94 3.2.2. Minimum Functionality Requirements for Clients . . . 26 95 3.3. Routing . . . . . . . . . . . . . . . . . . . . . . . . 26 96 3.4. Connectivity Management . . . . . . . . . . . . . . . . 30 97 3.5. Overlay Algorithm Support . . . . . . . . . . . . . . . 31 98 3.5.1. Support for Pluggable Overlay Algorithms . . . . . . 
31 99 3.5.2. Joining, Leaving, and Maintenance Overview . . . . . 31 100 3.6. First-Time Setup . . . . . . . . . . . . . . . . . . . . 33 101 3.6.1. Initial Configuration . . . . . . . . . . . . . . . 33 102 3.6.2. Enrollment . . . . . . . . . . . . . . . . . . . . . 33 103 3.6.3. Diagnostics . . . . . . . . . . . . . . . . . . . . 34 104 4. Application Support Overview . . . . . . . . . . . . . . . . 34 105 4.1. Data Storage . . . . . . . . . . . . . . . . . . . . . . 34 106 4.1.1. Storage Permissions . . . . . . . . . . . . . . . . 35 107 4.1.2. Replication . . . . . . . . . . . . . . . . . . . . 36 108 4.2. Usages . . . . . . . . . . . . . . . . . . . . . . . . . 37 109 4.3. Service Discovery . . . . . . . . . . . . . . . . . . . 37 110 4.4. Application Connectivity . . . . . . . . . . . . . . . . 38 111 5. RFC 2119 Terminology . . . . . . . . . . . . . . . . . . . . 38 112 6. Overlay Management Protocol . . . . . . . . . . . . . . . . . 38 113 6.1. Message Receipt and Forwarding . . . . . . . . . . . . . 38 114 6.1.1. Responsible ID . . . . . . . . . . . . . . . . . . . 39 115 6.1.2. Other ID . . . . . . . . . . . . . . . . . . . . . . 40 116 6.1.3. Opaque ID . . . . . . . . . . . . . . . . . . . . . 42 117 6.2. Symmetric Recursive Routing . . . . . . . . . . . . . . 42 118 6.2.1. Request Origination . . . . . . . . . . . . . . . . 42 119 6.2.2. Response Origination . . . . . . . . . . . . . . . . 43 120 6.3. Message Structure . . . . . . . . . . . . . . . . . . . 43 121 6.3.1. Presentation Language . . . . . . . . . . . . . . . 44 122 6.3.1.1. Common Definitions . . . . . . . . . . . . . . . 45 123 6.3.2. Forwarding Header . . . . . . . . . . . . . . . . . 48 124 6.3.2.1. Processing Configuration Sequence Numbers . . . . 51 125 6.3.2.2. Destination and Via Lists . . . . . . . . . . . . 51 126 6.3.2.3. Forwarding Option . . . . . . . . . . . . . . . . 54 127 6.3.3. Message Contents Format . . . . . . . . . . . . . . 55 128 6.3.3.1. 
Response Codes and Response Errors . . . . . . . 57 129 6.3.4. Security Block . . . . . . . . . . . . . . . . . . . 60 130 6.4. Overlay Topology . . . . . . . . . . . . . . . . . . . . 64 131 6.4.1. Topology Plugin Requirements . . . . . . . . . . . . 64 132 6.4.2. Methods and types for use by topology plugins . . . 65 133 6.4.2.1. Join . . . . . . . . . . . . . . . . . . . . . . 65 134 6.4.2.2. Leave . . . . . . . . . . . . . . . . . . . . . . 66 135 6.4.2.3. Update . . . . . . . . . . . . . . . . . . . . . 67 136 6.4.2.4. RouteQuery . . . . . . . . . . . . . . . . . . . 67 137 6.4.2.5. Probe . . . . . . . . . . . . . . . . . . . . . . 68 138 6.5. Forwarding and Link Management Layer . . . . . . . . . . 70 139 6.5.1. Attach . . . . . . . . . . . . . . . . . . . . . . . 71 140 6.5.1.1. Request Definition . . . . . . . . . . . . . . . 72 141 6.5.1.2. Response Definition . . . . . . . . . . . . . . . 74 142 6.5.1.3. Using ICE With RELOAD . . . . . . . . . . . . . . 75 143 6.5.1.4. Collecting STUN Servers . . . . . . . . . . . . . 76 144 6.5.1.5. Gathering Candidates . . . . . . . . . . . . . . 76 145 6.5.1.6. Prioritizing Candidates . . . . . . . . . . . . . 77 146 6.5.1.7. Encoding the Attach Message . . . . . . . . . . . 78 147 6.5.1.8. Verifying ICE Support . . . . . . . . . . . . . . 78 148 6.5.1.9. Role Determination . . . . . . . . . . . . . . . 78 149 6.5.1.10. Full ICE . . . . . . . . . . . . . . . . . . . . 79 150 6.5.1.11. No-ICE . . . . . . . . . . . . . . . . . . . . . 79 151 6.5.1.12. Subsequent Offers and Answers . . . . . . . . . . 79 152 6.5.1.13. Sending Media . . . . . . . . . . . . . . . . . . 79 153 6.5.1.14. Receiving Media . . . . . . . . . . . . . . . . . 80 154 6.5.2. AppAttach . . . . . . . . . . . . . . . . . . . . . 80 155 6.5.2.1. Request Definition . . . . . . . . . . . . . . . 80 156 6.5.2.2. Response Definition . . . . . . . . . . . . . . . 81 157 6.5.3. Ping . . . . . . . . . . . . . . . . . . . . . . . . 82 158 6.5.3.1. 
Request Definition . . . . . . . . . . . . . . . 82 159 6.5.3.2. Response Definition . . . . . . . . . . . . . . . 82 160 6.5.4. ConfigUpdate . . . . . . . . . . . . . . . . . . . . 83 161 6.5.4.1. Request Definition . . . . . . . . . . . . . . . 83 162 6.5.4.2. Response Definition . . . . . . . . . . . . . . . 84 163 6.6. Overlay Link Layer . . . . . . . . . . . . . . . . . . . 85 164 6.6.1. Future Overlay Link Protocols . . . . . . . . . . . 86 165 6.6.1.1. HIP . . . . . . . . . . . . . . . . . . . . . . . 86 166 6.6.1.2. ICE-TCP . . . . . . . . . . . . . . . . . . . . . 87 167 6.6.1.3. Message-oriented Transports . . . . . . . . . . . 87 168 6.6.1.4. Tunneled Transports . . . . . . . . . . . . . . . 87 169 6.6.2. Framing Header . . . . . . . . . . . . . . . . . . . 87 170 6.6.3. Simple Reliability . . . . . . . . . . . . . . . . . 89 171 6.6.3.1. Stop and Wait Sender Algorithm . . . . . . . . . 90 172 6.6.4. DTLS/UDP with SR . . . . . . . . . . . . . . . . . . 91 173 6.6.5. TLS/TCP with FH, No-ICE . . . . . . . . . . . . . . 91 174 6.6.6. DTLS/UDP with SR, No-ICE . . . . . . . . . . . . . . 92 175 6.7. Fragmentation and Reassembly . . . . . . . . . . . . . . 92 176 7. Data Storage Protocol . . . . . . . . . . . . . . . . . . . . 93 177 7.1. Data Signature Computation . . . . . . . . . . . . . . . 95 178 7.2. Data Models . . . . . . . . . . . . . . . . . . . . . . 96 179 7.2.1. Single Value . . . . . . . . . . . . . . . . . . . . 97 180 7.2.2. Array . . . . . . . . . . . . . . . . . . . . . . . 97 181 7.2.3. Dictionary . . . . . . . . . . . . . . . . . . . . . 98 182 7.3. Access Control Policies . . . . . . . . . . . . . . . . 99 183 7.3.1. USER-MATCH . . . . . . . . . . . . . . . . . . . . . 99 184 7.3.2. NODE-MATCH . . . . . . . . . . . . . . . . . . . . . 99 185 7.3.3. USER-NODE-MATCH . . . . . . . . . . . . . . . . . . 100 186 7.3.4. NODE-MULTIPLE . . . . . . . . . . . . . . . . . . . 100 187 7.4. Data Storage Methods . . . . . . . . . . . . . . . . . . 
100 188 7.4.1. Store . . . . . . . . . . . . . . . . . . . . . . . 100 189 7.4.1.1. Request Definition . . . . . . . . . . . . . . . 100 190 7.4.1.2. Response Definition . . . . . . . . . . . . . . . 105 191 7.4.1.3. Removing Values . . . . . . . . . . . . . . . . . 107 192 7.4.2. Fetch . . . . . . . . . . . . . . . . . . . . . . . 108 193 7.4.2.1. Request Definition . . . . . . . . . . . . . . . 108 194 7.4.2.2. Response Definition . . . . . . . . . . . . . . . 110 195 7.4.3. Stat . . . . . . . . . . . . . . . . . . . . . . . . 111 196 7.4.3.1. Request Definition . . . . . . . . . . . . . . . 112 197 7.4.3.2. Response Definition . . . . . . . . . . . . . . . 112 198 7.4.4. Find . . . . . . . . . . . . . . . . . . . . . . . . 114 199 7.4.4.1. Request Definition . . . . . . . . . . . . . . . 115 200 7.4.4.2. Response Definition . . . . . . . . . . . . . . . 115 201 7.4.5. Defining New Kinds . . . . . . . . . . . . . . . . . 116 202 8. Certificate Store Usage . . . . . . . . . . . . . . . . . . . 117 203 9. TURN Server Usage . . . . . . . . . . . . . . . . . . . . . . 118 204 10. Chord Algorithm . . . . . . . . . . . . . . . . . . . . . . . 119 205 10.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 120 206 10.2. Hash Function . . . . . . . . . . . . . . . . . . . . . 121 207 10.3. Routing . . . . . . . . . . . . . . . . . . . . . . . . 121 208 10.4. Redundancy . . . . . . . . . . . . . . . . . . . . . . . 122 209 10.5. Joining . . . . . . . . . . . . . . . . . . . . . . . . 122 210 10.6. Routing Attaches . . . . . . . . . . . . . . . . . . . . 124 211 10.7. Updates . . . . . . . . . . . . . . . . . . . . . . . . 124 212 10.7.1. Handling Neighbor Failures . . . . . . . . . . . . . 126 213 10.7.2. Handling Finger Table Entry Failure . . . . . . . . 126 214 10.7.3. Receiving Updates . . . . . . . . . . . . . . . . . 127 215 10.7.4. Stabilization . . . . . . . . . . . . . . . . . . . 128 216 10.7.4.1. Updating Neighbor Table . . . . . . . . . . . . . 
128 217 10.7.4.2. Refreshing Finger Table . . . . . . . . . . . . . 128 218 10.7.4.3. Adjusting Finger Table size . . . . . . . . . . . 129 219 10.7.4.4. Detecting partitioning . . . . . . . . . . . . . 130 220 10.8. Route query . . . . . . . . . . . . . . . . . . . . . . 130 221 10.9. Leaving . . . . . . . . . . . . . . . . . . . . . . . . 130 222 11. Enrollment and Bootstrap . . . . . . . . . . . . . . . . . . 131 223 11.1. Overlay Configuration . . . . . . . . . . . . . . . . . 132 224 11.1.1. RELAX NG Grammar . . . . . . . . . . . . . . . . . . 140 225 11.2. Discovery Through Configuration Server . . . . . . . . . 142 226 11.3. Credentials . . . . . . . . . . . . . . . . . . . . . . 143 227 11.3.1. Self-Generated Credentials . . . . . . . . . . . . . 145 228 11.4. Contacting a Bootstrap Node . . . . . . . . . . . . . . 146 229 12. Message Flow Example . . . . . . . . . . . . . . . . . . . . 146 230 13. Security Considerations . . . . . . . . . . . . . . . . . . . 152 231 13.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . 152 232 13.2. Attacks on P2P Overlays . . . . . . . . . . . . . . . . 153 233 13.3. Certificate-based Security . . . . . . . . . . . . . . . 153 234 13.4. Shared-Secret Security . . . . . . . . . . . . . . . . . 154 235 13.5. Storage Security . . . . . . . . . . . . . . . . . . . . 155 236 13.5.1. Authorization . . . . . . . . . . . . . . . . . . . 155 237 13.5.2. Distributed Quota . . . . . . . . . . . . . . . . . 156 238 13.5.3. Correctness . . . . . . . . . . . . . . . . . . . . 156 239 13.5.4. Residual Attacks . . . . . . . . . . . . . . . . . . 156 240 13.6. Routing Security . . . . . . . . . . . . . . . . . . . . 157 241 13.6.1. Background . . . . . . . . . . . . . . . . . . . . . 157 242 13.6.2. Admissions Control . . . . . . . . . . . . . . . . . 158 243 13.6.3. Peer Identification and Authentication . . . . . . . 158 244 13.6.4. Protecting the Signaling . . . . . . . . . . . . . . 159 245 13.6.5. 
Routing Loops and DoS Attacks . . . . . . . . . . . 159 246 13.6.6. Residual Attacks . . . . . . . . . . . . . . . . . . 160 247 14. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 160 248 14.1. Well-Known URI Registration . . . . . . . . . . . . . . 160 249 14.2. Port Registrations . . . . . . . . . . . . . . . . . . . 160 250 14.3. Overlay Algorithm Types . . . . . . . . . . . . . . . . 161 251 14.4. Access Control Policies . . . . . . . . . . . . . . . . 161 252 14.5. Application-ID . . . . . . . . . . . . . . . . . . . . . 162 253 14.6. Data Kind-ID . . . . . . . . . . . . . . . . . . . . . . 162 254 14.7. Data Model . . . . . . . . . . . . . . . . . . . . . . . 163 255 14.8. Message Codes . . . . . . . . . . . . . . . . . . . . . 163 256 14.9. Error Codes . . . . . . . . . . . . . . . . . . . . . . 165 257 14.10. Overlay Link Types . . . . . . . . . . . . . . . . . . . 165 258 14.11. Overlay Link Protocols . . . . . . . . . . . . . . . . . 166 259 14.12. Forwarding Options . . . . . . . . . . . . . . . . . . . 166 260 14.13. Probe Information Types . . . . . . . . . . . . . . . . 167 261 14.14. Message Extensions . . . . . . . . . . . . . . . . . . . 167 262 14.15. reload URI Scheme . . . . . . . . . . . . . . . . . . . 168 263 14.15.1. URI Registration . . . . . . . . . . . . . . . . . . 169 264 14.16. Media Type Registration . . . . . . . . . . . . . . . . 170 265 14.17. XML Name Space Registration . . . . . . . . . . . . . . 171 266 14.17.1. Config URL . . . . . . . . . . . . . . . . . . . . . 171 267 14.17.2. Config Chord URL . . . . . . . . . . . . . . . . . . 171 268 15. Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . 171 269 16. References . . . . . . . . . . . . . . . . . . . . . . . . . 172 270 16.1. Normative References . . . . . . . . . . . . . . . . . . 172 271 16.2. Informative References . . . . . . . . . . . . . . . . . 176 272 Appendix A. Routing Alternatives . . . . . . . . . . . . . . . . 180 273 A.1. 
Iterative vs Recursive . . . . . . . . . . . . . . . . . 181 274 A.2. Symmetric vs Forward response . . . . . . . . . . . . . 181 275 A.3. Direct Response . . . . . . . . . . . . . . . . . . . . 181 276 A.4. Relay Peers . . . . . . . . . . . . . . . . . . . . . . 183 277 A.5. Symmetric Route Stability . . . . . . . . . . . . . . . 183 278 Appendix B. Why Clients? . . . . . . . . . . . . . . . . . . . . 184 279 B.1. Why Not Only Peers? . . . . . . . . . . . . . . . . . . 184 280 B.2. Clients as Application-Level Agents . . . . . . . . . . 184 282 1. Introduction 284 This document defines REsource LOcation And Discovery (RELOAD), a 285 peer-to-peer (P2P) signaling protocol for use on the Internet. It 286 provides a generic, self-organizing overlay network service, allowing 287 nodes to route messages to other nodes and to store and retrieve data 288 in the overlay. RELOAD provides several features that are critical 289 for a successful P2P protocol for the Internet: 291 Security Framework: A P2P network will often be established among a 292 set of peers that do not trust each other. RELOAD leverages a 293 central enrollment server to provide credentials for each peer 294 which can then be used to authenticate each operation. This 295 greatly reduces the possible attack surface. 297 Usage Model: RELOAD is designed to support a variety of 298 applications, including P2P multimedia communications with the 299 Session Initiation Protocol [I-D.ietf-p2psip-sip]. RELOAD allows 300 the definition of new application usages, each of which can define 301 its own data types, along with the rules for their use. This 302 allows RELOAD to be used with new applications through a simple 303 documentation process that supplies the details for each 304 application. 306 NAT Traversal: RELOAD is designed to function in environments where 307 many if not most of the nodes are behind NATs or firewalls. 
308 Operations for NAT traversal are part of the base design, 309 including using Interactive Connectivity Establishment (ICE) 310 [RFC5245] to establish new RELOAD or application protocol 311 connections. 313 Optimized Routing: The very nature of overlay algorithms introduces 314 a requirement that peers participating in the P2P network route 315 requests on behalf of other peers in the network. This introduces 316 a load on those other peers, in the form of bandwidth and 317 processing power. RELOAD has been defined with a simple, 318 lightweight forwarding header, thus minimizing the amount of 319 effort for intermediate peers. 321 Pluggable Overlay Algorithms: RELOAD has been designed with an 322 abstract interface to the overlay layer to simplify implementing a 323 variety of structured (e.g., distributed hash tables) and 324 unstructured overlay algorithms. The idea here is that RELOAD 325 provides a generic structure that can fit most types of overlay 326 topologies (ring, hyperspace, etc.). To instantiate an actual 327 network, you combine RELOAD with a specific overlay algorithm, 328 which defines how to construct the overlay topology and route 329 messages efficiently within it. This specification also defines 330 how RELOAD is used with the Chord [Chord] based DHT algorithm, 331 which is mandatory to implement. Specifying a default "mandatory 332 to implement" overlay algorithm promotes interoperability, while 333 extensibility allows selection of overlay algorithms optimized for 334 a particular application. 336 Support for Clients: RELOAD clients differ from RELOAD peers 337 primarily in that they do not store information on behalf of other 338 nodes in the overlay, but only use the overlay to locate users and 339 resources as well as store information. 341 These properties were designed specifically to meet the requirements 342 for a P2P protocol to support SIP. 
This document defines the base 343 protocol for the distributed storage and location service, as well as 344 critical usage for NAT traversal. The SIP Usage itself is described 345 separately in [I-D.ietf-p2psip-sip]. RELOAD is not limited to usage 346 by SIP and could serve as a tool for supporting other P2P 347 applications with similar needs. 349 1.1. Basic Setting 351 In this section, we provide a brief overview of the operational 352 setting for RELOAD. A RELOAD Overlay Instance consists of a set of 353 nodes arranged in a partly connected graph. Each node in the overlay 354 is assigned a numeric Node-ID for the lifetime of the node which, 355 together with the specific overlay algorithm in use, determines its 356 position in the graph and the set of nodes it connects to. The 357 Node-ID is also tightly coupled to the certificate (see 358 Section 13.3). The figure below shows a trivial example which isn't 359 drawn from any particular overlay algorithm, but was chosen for 360 convenience of representation. 362 +--------+ +--------+ +--------+ 363 | Node 10|--------------| Node 20|--------------| Node 30| 364 +--------+ +--------+ +--------+ 365 | | | 366 | | | 367 +--------+ +--------+ +--------+ 368 | Node 40|--------------| Node 50|--------------| Node 60| 369 +--------+ +--------+ +--------+ 370 | | | 371 | | | 372 +--------+ +--------+ +--------+ 373 | Node 70|--------------| Node 80|--------------| Node 90| 374 +--------+ +--------+ +--------+ 375 | 376 | 377 +--------+ 378 | Node 85| 379 |(Client)| 380 +--------+ 382 Because the graph is not fully connected, when a node wants to send a 383 message to another node, it may need to route it through the network. 384 For instance, Node 10 can talk directly to nodes 20 and 40, but not 385 to Node 70. In order to send a message to Node 70, it would first 386 send it to Node 40 with instructions to pass it along to Node 70. 
387 Different overlay algorithms will have different connectivity graphs, 388 but the general idea behind all of them is to allow any node in the 389 graph to efficiently reach every other node within a small number of 390 hops. 392 The RELOAD network is not only a messaging network. It is also a 393 storage network, albeit one designed for small-scale transient 394 storage rather than for bulk storage of large objects. Records are 395 stored under numeric addresses, called Resource-IDs, which occupy the 396 same space as node identifiers. Peers are responsible for storing 397 the data associated with some set of addresses as determined by their 398 Node-ID. For instance, we might say that every peer is responsible 399 for storing any data value which has an address less than or equal to 400 its own Node-ID, but greater than the next lowest Node-ID. Thus, 401 Node 20 would be responsible for storing values 11-20. 403 RELOAD also supports clients. These are nodes which have Node-IDs 404 but do not participate in routing or storage. For instance, in the 405 figure above, Node 85 is a client. It can route to the rest of the 406 RELOAD network via Node 80, but no other node will route through it, 407 and Node 90 is still responsible for all addresses from 81 to 90. We 408 refer to non-client nodes as peers. 410 Other applications (for instance, SIP) can be defined on top of 411 RELOAD and use these two basic RELOAD services to provide their own 412 services. 414 1.2. Architecture 416 RELOAD is fundamentally an overlay network. The following figure 417 shows the layered RELOAD architecture. 419 Application 421 +-------+ +-------+ 422 | SIP | | XMPP | ... 
423 | Usage | | Usage | 424 +-------+ +-------+ 425 ------------------------------------ Messaging Service Boundary 426 +------------------+ +---------+ 427 | Message |<--->| Storage | 428 | Transport | +---------+ 429 +------------------+ ^ 430 ^ ^ | 431 | v v 432 | +-------------------+ 433 | | Topology | 434 | | Plugin | 435 | +-------------------+ 436 | ^ 437 v v 438 +------------------+ 439 | Forwarding & | 440 | Link Management | 441 +------------------+ 442 ------------------------------------ Overlay Link Service Boundary 443 +-------+ +-------+ 444 |TLS | |DTLS | ... 445 |Overlay| |Overlay| 446 |Link | |Link | 447 +-------+ +-------+ 449 The major components of RELOAD are: 451 Usage Layer: Each application defines a RELOAD usage: a set of data 452 Kinds and behaviors that describe how to use the services 453 provided by RELOAD. These usages all talk to RELOAD through a 454 common Message Transport Service. 456 Message Transport: Handles end-to-end reliability, manages request 457 state for the usages, and forwards Store and Fetch operations to 458 the Storage component. Delivers message responses to the 459 component initiating the request. 461 Storage: The Storage component is responsible for processing 462 messages relating to the storage and retrieval of data. It talks 463 directly to the Topology Plugin to manage data replication and 464 migration, and it talks to the Message Transport component to send 465 and receive messages. 467 Topology Plugin: The Topology Plugin is responsible for implementing 468 the specific overlay algorithm being used. It uses the Message 469 Transport component to send and receive overlay management 470 messages, the Storage component to manage data replication, and 471 the Forwarding Layer to control hop-by-hop message forwarding. 
472 This component superficially parallels conventional routing 473 algorithms, but is more tightly coupled to the Forwarding Layer 474 because there is no single "routing table" equivalent used by all 475 overlay algorithms. The Topology Plugin has two functions: 476 constructing the local forwarding instructions and selecting the 477 operational topology (i.e., creating links by sending overlay 478 management messages). 480 Forwarding and Link Management Layer: Stores and implements the 481 Routing Table by providing packet forwarding services between 482 nodes. It also handles establishing new links between nodes, 483 including setting up connections for overlay links across NATs 484 using ICE. 486 Overlay Link Layer: Responsible for actually transporting traffic 487 directly between nodes. TLS [RFC5246] and DTLS [RFC6347] are the 488 currently defined "overlay link layer" protocols used by RELOAD 489 for hop-by-hop communication. Each such protocol includes the 490 appropriate provisions for per-hop framing or hop-by-hop ACKs 491 needed by unreliable underlying transports. New protocols can be 492 defined, as described in Section 6.6.1 and Section 11.1. As this 493 document defines only TLS and DTLS, we use those terms throughout 494 the remainder of the document with the understanding that some 495 future specification may add new overlay link layers. 497 To further clarify the roles of the various layers, the following figure 498 parallels the architecture with each layer's role from an overlay 499 perspective and its implementation layer in the Internet: 501 Internet | Internet Model | 502 Model | Equivalent | RELOAD 503 | in Overlay | Architecture 504 -------------+-----------------+------------------------------------ 505 | | +-------+ +-------+ 506 | Application | | SIP | | XMPP | ... 
507 | | | Usage | | Usage | 508 | | +-------+ +-------+ 509 | | ---------------------------------- 510 | |+------------------+ +---------+ 511 | Transport || Message |<--->| Storage | 512 | || Transport | +---------+ 513 | |+------------------+ ^ 514 | | ^ ^ | 515 | | | v v 516 Application | | | +-------------------+ 517 | (Routing) | | | Topology | 518 | | | | Plugin | 519 | | | +-------------------+ 520 | | | ^ 521 | | v v 522 | Network | +------------------+ 523 | | | Forwarding & | 524 | | | Link Management | 525 | | +------------------+ 526 | | ---------------------------------- 527 Transport | Link | +-------+ +------+ 528 | | |TLS | |DTLS | ... 529 | | +-------+ +------+ 530 -------------+-----------------+------------------------------------ 531 Network | 532 | 533 Link | 535 In addition to the above components, nodes may communicate with a 536 central provisioning infrastructure (not shown) to get configuration 537 information, authentication credentials, and the initial set of nodes 538 to communicate with to join the overlay. 540 1.2.1. Usage Layer 542 The top layer, called the Usage Layer, has application usages, such 543 as the SIP Registration Usage [I-D.ietf-p2psip-sip], that use the 544 abstract Message Transport Service provided by RELOAD. The goal of 545 this layer is to implement application-specific usages of the generic 546 overlay services provided by RELOAD. The usage defines how a 547 specific application maps its data into something that can be stored 548 in the overlay, where to store the data, how to secure the data, and 549 finally how applications can retrieve and use the data. 551 The architecture diagram shows both a SIP usage and an XMPP usage. A 552 single application may require multiple usages; for example a 553 voicemail feature in a softphone application that stores links to the 554 messages in the overlay would require a different usage than the type 555 of rendezvous service of XMPP or SIP. 
A usage may define multiple 556 Kinds of data that are stored in the overlay and may also rely on 557 Kinds originally defined by other usages. 559 Because the security and storage policies for each Kind are dictated 560 by the usage defining the Kind, the usages may be coupled with the 561 Storage component to provide security policy enforcement and to 562 implement appropriate storage strategies according to the needs of 563 the usage. The exact implementation of such an interface is outside 564 the scope of this specification. 566 1.2.2. Message Transport 568 The Message Transport component provides a generic message routing 569 service for the overlay. The Message Transport layer is responsible 570 for end-to-end message transactions. Each peer is identified by its 571 location in the overlay as determined by its Node-ID. A component 572 that is a client of the Message Transport can perform two basic 573 functions: 575 o Send a message to a given peer specified by Node-ID or to the peer 576 responsible for a particular Resource-ID. 578 o Receive messages that other peers sent to a Node-ID or Resource-ID 579 for which the receiving peer is responsible. 581 All usages rely on the Message Transport component to send and 582 receive messages from peers. For instance, when a usage wants to 583 store data, it does so by sending Store requests. Note that the 584 Storage component and the Topology Plugin are themselves clients of 585 the Message Transport, because they need to send and receive messages 586 from other peers. 588 The Message Transport Service is responsible for end-to-end 589 reliability, accomplished by timer-based retransmissions. Unlike the 590 Internet transport layer, however, this layer does not provide 591 congestion control. 
RELOAD is a request-response protocol, with no 592 more than two pairs of request-response messages used in typical 593 transactions between pairs of nodes, therefore there are no 594 opportunities to observe and react to end-to-end congestion. As with 595 all Internet applications, implementers are strongly discouraged from 596 writing applications that react to loss by immediately retrying the 597 transaction. 599 The Message Transport Service is similar to those described as 600 providing "Key based routing" (KBR)[wikiKBR], although as RELOAD 601 supports different overlay algorithms (including non-DHT overlay 602 algorithms) that calculate keys (storage indices, not encryption 603 keys) in different ways, the actual interface needs to accept 604 Resource Names rather than actual keys. 606 Stability of the underlying network supporting the overlay (the 607 Internet) and congestion control between overlay neighbors, which 608 exchange routing updates and data replicas in addition to forwarding 609 end-to-end messages, is handled by the Forwarding and Link Management 610 layer described below. 612 Real-world experience has shown that a fixed timeout for the end-to- 613 end retransmission timer is sufficient for practical overlay 614 networks. This timer is adjustable via the overlay configuration. 615 As the overlay configuration can be rapidly updated, this value could 616 be dynamically adjusted at coarse time scales, although algorithms 617 for determining how to accomplish this are beyond the scope of this 618 specification. In many cases, however, more appropriate means of 619 improving network performance, such as the Topology Plugin removing 620 lossy links from use in overlay routing or reducing the overall hop- 621 count of end-to-end paths will be more effective than simply 622 increasing the retransmission timer. 624 1.2.3. 
Storage 626 One of the major functions of RELOAD is to allow nodes to store data 627 in the overlay and to retrieve data stored by other nodes or by 628 themselves. The Storage component is responsible for processing data 629 storage and retrieval messages. For instance, the Storage component 630 might receive a Store request for a given resource from the Message 631 Transport. It would then query the appropriate usage before storing 632 the data value(s) in its local data store and sending a response to 633 the Message Transport for delivery to the requesting node. 634 Typically, these messages will come from other nodes, but depending 635 on the overlay topology, a node might be responsible for storing data 636 for itself as well, especially if the overlay is small. 638 A peer's Node-ID determines the set of resources that it will be 639 responsible for storing. However, the exact mapping between these is 640 determined by the overlay algorithm in use. The Storage component 641 will only receive a Store request from the Message Transport if this 642 peer is responsible for that Resource-ID. The Storage component is 643 notified by the Topology Plugin when the Resource-IDs for which it is 644 responsible change, and the Storage component is then responsible for 645 migrating resources to other peers. 647 1.2.4. Topology Plugin 649 RELOAD is explicitly designed to work with a variety of overlay 650 algorithms. In order to facilitate this, the overlay algorithm 651 implementation is provided by a Topology Plugin so that each overlay 652 can select an appropriate overlay algorithm that relies on the common 653 RELOAD core protocols and code. 655 The Topology Plugin is responsible for maintaining the overlay 656 algorithm Routing Table, which is consulted by the Forwarding and 657 Link Management Layer before routing a message. 
When connections are 658 made or broken, the Forwarding and Link Management Layer notifies the 659 Topology Plugin, which adjusts the Routing Table as appropriate. The 660 Topology Plugin will also instruct the Forwarding and Link Management 661 Layer to form new connections as dictated by the requirements of the 662 overlay algorithm Topology. The Topology Plugin issues periodic 663 update requests through Message Transport to maintain and update its 664 Routing Table. 666 As peers enter and leave, resources may be stored on different peers, 667 so the Topology Plugin also keeps track of which peers are 668 responsible for which resources. As peers join and leave, the 669 Topology Plugin instructs the Storage component to issue resource 670 migration requests as appropriate, in order to ensure that other 671 peers have whatever resources they are now responsible for. The 672 Topology Plugin is also responsible for providing for redundant data 673 storage to protect against loss of information in the event of a peer 674 failure and to protect against compromised or subversive peers. 676 1.2.5. Forwarding and Link Management Layer 678 The Forwarding and Link Management Layer is responsible for getting a 679 message to the next peer, as determined by the Topology Plugin. This 680 Layer establishes and maintains the network connections as needed by 681 the Topology Plugin. This layer is also responsible for setting up 682 connections to other peers through NATs and firewalls using ICE, and 683 it can elect to forward traffic using relays for NAT and firewall 684 traversal. 686 Congestion control is implemented at this layer to protect the 687 Internet paths used to form the link in the overlay. Additionally, 688 retransmission is performed to improve the reliability of end-to-end 689 transactions. 
The relation of this layer to the Message Transport 690 Layer can be likened to the relation of the link-level congestion 691 control and retransmission in modern wireless networks to Internet 692 transport protocols. 694 This layer provides a generic interface that allows the topology 695 plugin to control the overlay and resource operations and messages. 696 Since each overlay algorithm is defined and functions differently, we 697 generically refer to the table of other peers that the overlay 698 algorithm maintains and uses to route requests (neighbors) as a 699 Routing Table. The Topology Plugin actually owns the Routing Table, 700 and forwarding decisions are made by querying the Topology Plugin for 701 the next hop for a particular Node-ID or Resource-ID. If this node 702 is the destination of the message, the message is delivered to the 703 Message Transport. 705 This layer also utilizes a framing header to encapsulate messages as 706 they are forwarded along each hop. This header aids reliability, 707 congestion control, flow control, etc. It has meaning only in the 708 context of that individual link. 710 The Forwarding and Link Management Layer sits on top of the Overlay 711 Link Layer protocols that carry the actual traffic. This 712 specification defines how to use the DTLS and TLS protocols to carry 713 RELOAD messages. 715 1.3. Security 717 RELOAD's security model is based on each node having one or more 718 public key certificates. In general, these certificates will be 719 assigned by a central server, which also assigns Node-IDs, although 720 self-signed certificates can be used in closed networks. These 721 credentials can be leveraged to provide communications security for 722 RELOAD messages. RELOAD provides communications security at three 723 levels: 725 Connection Level: Connections between nodes are secured with TLS, 726 DTLS, or potentially some future protocol yet to be defined. 728 Message Level: Each RELOAD message is signed.
730 Object Level: Stored objects are signed by the creating node. 732 These three levels of security work together to allow nodes to verify 733 the origin and correctness of data they receive from other nodes, 734 even in the face of malicious activity by other nodes in the overlay. 735 RELOAD also provides access control built on top of these 736 communications security features. Because the peer responsible for 737 storing a piece of data can validate the signature on the data being 738 stored, the responsible peer can determine whether a given operation 739 is permitted or not. 741 RELOAD also provides an optional admission control feature 742 based on shared secrets and TLS-PSK/TLS-SRP. In order 743 to form a TLS connection to any node in the overlay, a new node needs 744 to know the shared overlay key, thus restricting access to authorized 745 users only. This feature is used together with certificate-based 746 access control, not as a replacement for it. It is typically used 747 when self-signed certificates are in use but would generally not 748 be used when all the certificates are signed by an enrollment 749 server. 751 1.4. Structure of This Document 753 The remainder of this document is structured as follows. 755 o Section 2 provides definitions of terms used in this document. 757 o Section 3 provides an overview of the mechanisms used to establish 758 and maintain the overlay. 760 o Section 4 provides an overview of the mechanism RELOAD provides to 761 support other applications. 763 o Section 6 defines the protocol messages that RELOAD uses to 764 establish and maintain the overlay. 766 o Section 7 defines the protocol messages that are used to store and 767 retrieve data using RELOAD. 769 o Section 8 defines the Certificate Store Usages. 771 o Section 9 defines the TURN Server Usage needed to locate TURN 772 servers for NAT traversal. 774 o Section 10 defines a specific Topology Plugin using a Chord-based 775 algorithm.
777 o Section 11 defines the mechanisms that new RELOAD nodes use to 778 join the overlay for the first time. 780 o Section 12 provides an extended example. 782 2. Terminology 784 Terms in this document are defined inline when used and are also 785 defined below for reference. The definitions in this section use 786 terminology and concepts that are not explained until later in the 787 specification. 789 Admitting Peer: A Peer in the Overlay which helps the Joining Node 790 join the Overlay. 792 Bootstrap Node: A network node used by Joining Nodes to help locate 793 the Admitting Peer. 795 Client: A host that is able to store data in and retrieve data from 796 the overlay but which is not participating in routing or data 797 storage for the overlay. 799 Configuration Document: An XML document containing all the Overlay 800 Parameters for one overlay instance. 802 Connection Table: Contains connection information for the set of 803 nodes to which a node is directly connected, which includes nodes 804 that are not yet available for routing. 806 Destination List: A list of Node-IDs, Resource-IDs, and Opaque IDs 807 through which a message is to be routed, in strict order. A 808 single Node-ID, Resource-ID, or Opaque ID is a trivial form of 809 destination list. When multiple Node-IDs are specified, a 810 Destination List is a loose source route. The list is reduced 811 hop-by-hop; it does not include the source but does include the 812 destination. 814 DHT: A distributed hash table. A DHT is an abstract hash table 815 service realized by storing the contents of the hash table across 816 a set of peers. 818 ID: A generic term for any kind of identifier in an Overlay. This 819 document specifies an ID as being an Application-ID, Kind-ID, 820 Node-ID, Transaction ID, component ID, response ID, Resource-ID, 821 or an Opaque ID. 823 Joining Node: A node that is attempting to become a Peer in a 824 particular Overlay.
826 Kind: A Kind defines a particular type of data that can be stored in 827 the overlay. Applications define new Kinds to store the data they 828 use. Each Kind is identified with a unique integer called a 829 Kind-ID. 831 Kind-ID: A unique 32 bit value identifying a Kind. Kind-IDs are 832 either private or allocated by IANA (see Section 14.6). 834 Maximum Request Lifetime: The maximum time a request will wait for a 835 response. This value is equal to the overlay-reliability-timer 836 value defined in Section 11.1 multiplied by the number of 837 transmissions, as defined in Section 6.2.1, and so defaults to 15 838 seconds. 840 Node: The term "Node" is used to refer to a host that may be either 841 a Peer or a Client. Because RELOAD uses the same protocol for 842 both clients and peers, much of the text applies equally to both. 843 Therefore we use "Node" when the text applies to both Clients and 844 Peers and the more specific term (i.e., client or peer) when the 845 text applies only to Clients or only to Peers. 847 Node-ID: A value of fixed but configurable length that uniquely 848 identifies a node. Node-IDs of all 0s and all 1s are reserved; a 849 value of zero is not used in the wire protocol but can be used to 850 indicate an invalid node in implementations and APIs; the Node-ID 851 of all 1s is used on the wire protocol as a wildcard. 853 Overlay Algorithm: An overlay algorithm defines the rules for 854 determining which peers in an overlay store a particular piece of 855 data and for determining a topology of interconnections amongst 856 peers in order to find a piece of data. 858 Overlay Instance: A specific overlay algorithm and the collection of 859 peers that are collaborating to provide read and write access to 860 it. There can be any number of overlay instances running in an IP 861 network at a time, and each operates in isolation of the others. 863 Overlay Parameters: A set of values that are shared between all 864 nodes in an overlay. 
The overlay parameters are distributed in an 865 XML document called the Configuration Document. 867 Peer: A host that is participating in the overlay. Peers are 868 responsible for holding some portion of the data that has been 869 stored in the overlay and also route messages on behalf of other 870 hosts as needed by the Overlay Algorithm. 872 Peer Admission: The act of admitting a node (the "Joining Node") 873 into an Overlay. After the admission process is over, the joining 874 node is a fully-functional peer of the overlay. During the 875 admission process, the joining node may need to present 876 credentials to prove that it has sufficient authority to join the 877 overlay. 879 Resource: An object or group of objects stored in a P2P network. 881 Resource-ID: A value that identifies a resource and which is 882 used as a key for storing and retrieving the resource. Often this 883 is not human-readable. One way to generate a Resource-ID 884 is by applying a mapping function to some other unique name (e.g., 885 user name or service name) for the resource. The Resource-ID is 886 used by the distributed database algorithm to determine the peer 887 or peers that are responsible for storing the data for the 888 overlay. In structured P2P networks, Resource-IDs are generally 889 of fixed length and are formed by hashing the resource name. In 890 unstructured networks, resource names may be used directly as 891 Resource-IDs and may be of variable length. 893 Resource Name: The name by which a resource is identified. In 894 unstructured P2P networks, the resource name is sometimes used 895 directly as a Resource-ID. In structured P2P networks, the 896 resource name is typically mapped into a Resource-ID by using the 897 string as the input to a hash function. Structured and unstructured 898 P2P networks are described in [RFC5694]. A SIP resource, for 899 example, is often identified by its AOR, which is an example of a 900 Resource Name.
902 Responsible Peer: The peer that is responsible for a specific 903 resource, as defined by the topology plugin algorithm. 905 Routing Table: The set of directly connected peers which a node can 906 use to forward overlay messages. In normal operation, these peers 907 will all be on the Connection Table but not vice versa, because 908 some peers may not yet be available for routing. Peers may send 909 messages directly to peers that are in their Connection Tables but 910 may only forward messages to peers that are not in their 911 Connection Table through peers that are in the Routing Table. 913 Successor Replacement Hold-Down Time: The amount of time to wait 914 before starting replication when a new successor is found; it 915 defaults to 30 seconds. 917 Transaction ID: A randomly chosen identifier selected by the 918 originator of a request and used to correlate requests and 919 responses. 921 Usage: A usage is the definition of a set of data structures (data 922 Kinds) that an application wants to store in the overlay. A usage 923 may also define a set of network protocols (application IDs) that 924 can be tunneled over TLS or DTLS direct connections between nodes. 925 E.g., the SIP usage defines a SIP registration data Kind that 926 contains information on how to reach a SIP endpoint and two 927 application IDs corresponding to the SIP and SIPS protocols. 929 User: A user is a physical person identified by the certificates 930 assigned to them. 932 User Name: A name identifying a user of the overlay, typically used 933 as a Resource Name, or as a label on a Resource that identifies 934 the user owning the resource. 936 3. Overlay Management Overview 938 The most basic function of RELOAD is as a generic overlay network. 939 Nodes need to be able to join the overlay, form connections to other 940 nodes, and route messages through the overlay to nodes to which they 941 are not directly connected. 
This section provides an overview of the 942 mechanisms that perform these functions. 944 3.1. Security and Identification 946 The overlay parameters are specified in a configuration document. 947 Because the parameters include security critical information such as 948 the certificate signing trust anchors, the configuration document 949 needs to be retrieved securely. The initial configuration document 950 is either initially fetched over HTTPS or manually provisioned; 951 subsequent configuration document updates are received either by 952 periodically refreshing from the configuration server, or, more 953 commonly, by being flood filled through the overlay, which allows for 954 fast propagation once an update is pushed. In the latter case, 955 updates are via digital signatures tracing back to the initial 956 configuration document. 958 Every node in the RELOAD overlay is identified by a Node-ID. The 959 Node-ID is used for three major purposes: 961 o To address the node itself. 963 o To determine its position in the overlay topology (if the overlay 964 is structured; overlays do not need to be structured). 966 o To determine the set of resources for which the node is 967 responsible. 969 Each node has a certificate [RFC5280] containing this Node-ID in a 970 subjectAltName extension, which is unique within an overlay instance. 972 The certificate serves multiple purposes: 974 o It entitles the user to store data at specific locations in the 975 Overlay Instance. Each data Kind defines the specific rules for 976 determining which certificates can access each Resource-ID/Kind-ID 977 pair. For instance, some Kinds might allow anyone to write at a 978 given location, whereas others might restrict writes to the owner 979 of a single certificate. 981 o It entitles the user to operate a node that has a Node-ID found in 982 the certificate. 
When the node forms a connection to another 983 peer, it uses this certificate so that a node connecting to it 984 knows it is connected to the correct node (technically: a (D)TLS 985 association with client authentication is formed). In addition, 986 the node can sign messages, thus providing integrity and 987 authentication for messages which are sent from the node. 989 o It entitles the user to use the user name found in the 990 certificate. 992 If a user has more than one device, typically they would get one 993 certificate for each device. This allows each device to act as a 994 separate peer. 996 RELOAD supports multiple certificate issuance models. The first is 997 based on a central enrollment process which allocates a unique name 998 and Node-ID and puts them in a certificate for the user. All peers 999 in a particular Overlay Instance have the enrollment server as a 1000 trust anchor and so can verify any other peer's certificate. 1002 In some settings, a group of users wants to set up an overlay network 1003 but is not concerned about attacks by other users in the network. 1004 For instance, users on a LAN might want to set up a short-term ad hoc 1005 network without going to the trouble of setting up an enrollment 1006 server. For such settings, RELOAD supports the use of self-generated, self-signed 1007 certificates. When self-signed certificates are used, the node also 1008 generates its own Node-ID and user name. The Node-ID is computed as 1009 a digest of the public key, to prevent Node-ID theft. Note that the 1010 relevant cryptographic property for the digest is preimage 1011 resistance. Collision resistance is not needed since an attacker who 1012 can create two nodes with the same Node-ID but different public keys 1013 obtains no advantage. This model is still subject to a number of 1014 known attacks (most notably Sybil attacks [Sybil]) and can only be 1015 safely used in closed networks where users are mutually trusting.
1016 Another drawback of this approach is that user's data is then tied to 1017 their keys, so if a key is changed any data stored under their 1018 Node-ID needs to be re-stored. This is not an issue for centrally- 1019 issued Node-IDs provided that the CA re-issues the same Node-ID when 1020 a new certificate is generated. 1022 The general principle here is that the security mechanisms (TLS or 1023 DTLS at the data link layer and message signatures at the message 1024 transport layer) are always used, even if the certificates are self- 1025 signed. This allows for a single set of code paths in the systems 1026 with the only difference being whether certificate verification is 1027 used to chain to a single root of trust. 1029 3.1.1. Shared-Key Security 1031 RELOAD also provides an admission control system based on shared 1032 keys. In this model, the peers all share a single key which is used 1033 to authenticate the peer-to-peer connections via TLS-PSK [RFC4279] or 1034 TLS-SRP [RFC5054]. 1036 3.2. Clients 1038 RELOAD defines a single protocol that is used both as the peer 1039 protocol and as the client protocol for the overlay. This simplifies 1040 implementation, particularly for devices that may act in either role, 1041 and allows clients to inject messages directly into the overlay. 1043 We use the term "peer" to identify a node in the overlay that routes 1044 messages for nodes other than those to which it is directly 1045 connected. Peers also have storage responsibilities. We use the 1046 term "client" to refer to nodes that do not have routing or storage 1047 responsibilities. When text applies to both peers and clients, we 1048 will simply refer to such devices as "nodes." 1049 RELOAD's client support allows nodes that are not participating in 1050 the overlay as peers to utilize the same implementation and to 1051 benefit from the same security mechanisms as the peers. 
Clients 1052 possess and use certificates that authorize the user to store data at 1053 certain locations in the overlay. The Node-ID in the certificate is 1054 used to identify the particular client as a member of the overlay and 1055 to authenticate its messages. 1057 In RELOAD, unlike some other designs, clients are not a first-class 1058 entity. From the perspective of a peer, a client is a node that has 1059 connected to the overlay, but has not yet taken steps to insert 1060 itself into the overlay topology. It might never do so (if it's a 1061 client) or it might eventually do so (if it's just a node that's 1062 taking a long time to join). The routing and storage rules for 1063 RELOAD provide for correct behavior by peers regardless of whether 1064 other nodes attached to them are clients or peers. Of course, a 1065 client implementation needs to know that it intends to be a client, 1066 but this localizes complexity only to that node. 1068 For more discussion of the motivation for RELOAD's client support, 1069 see Appendix B. 1071 3.2.1. Client Routing 1073 Clients may insert themselves in the overlay in two ways: 1075 o Establish a connection to the peer responsible for the client's 1076 Node-ID in the overlay. Then requests may be sent from/to the 1077 client using its Node-ID in the same manner as if it were a peer, 1078 because the responsible peer in the overlay will handle the final 1079 step of routing to the client. This may require a TURN [RFC5766] 1080 relay in cases where NATs or firewalls prevent a client from 1081 forming a direct connection with its responsible peer. Note that 1082 clients that choose this option need to process Update 1083 (Section 6.4.2.3) messages from the peer. Those updates can 1084 indicate that the peer no longer is responsible for the Client's 1085 Node-ID. The client would then need to form a connection to the 1086 appropriate peer. Failure to do so will result in the client no 1087 longer receiving messages. 
1089 o Establish a connection with an arbitrary peer in the overlay 1090 (perhaps based on network proximity or an inability to establish a 1091 direct connection with the responsible peer). In this case, the 1092 client will rely on RELOAD's Destination List (Section 6.3.2.2) 1093 feature to ensure reachability. The client can initiate requests, 1094 and any node in the overlay that knows the Destination List to its 1095 current location can reach it, but the client is not directly 1096 reachable using only its Node-ID. If the client is to receive 1097 incoming requests from other members of the overlay, the 1098 Destination List needed to reach the client needs to be learnable 1099 via other mechanisms, such as being stored in the overlay by a 1100 usage. A client connected this way using a certificate with only 1101 a single Node-ID can proceed to use the connection without 1102 performing an Attach (Section 6.5.1). A client wishing to connect 1103 using this mechanism with a certificate with multiple Node-IDs can 1104 use a Ping (Section 6.5.3) to probe the Node-ID of the node to 1105 which it is connected before doing the Attach. 1107 3.2.2. Minimum Functionality Requirements for Clients 1109 A node may act as a client simply because it does not have the 1110 capacity, or even an implementation of the topology plugin defined in 1111 Section 6.4.1, needed to act as a peer in the overlay. In order to 1112 exchange RELOAD messages with a peer, a client needs to meet a 1113 minimum level of functionality. Such a client will: 1115 o Implement RELOAD's connection-management operations that are used 1116 to establish the connection with the peer. 1118 o Implement RELOAD's data retrieval methods (with client 1119 functionality). 1121 o Be able to calculate Resource-IDs used by the overlay. 1123 o Possess security credentials needed by the overlay it is 1124 implementing. 
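The Resource-ID calculation required of a client in the list above might be sketched as follows. This is only an illustration, assuming a structured overlay that forms Resource-IDs by hashing the Resource Name, as described in the Terminology section. The choice of SHA-1 truncated to 16 bytes, the function name resource_id, and the example Resource Name are assumptions made for the sketch; the actual mapping is dictated by the overlay algorithm in use.

```python
import hashlib


def resource_id(resource_name: str, length_bytes: int = 16) -> bytes:
    """Map a Resource Name to a fixed-length Resource-ID by hashing.

    Assumption for illustration: SHA-1 over the UTF-8 name, truncated
    to length_bytes. A real overlay fixes both the hash function and
    the length as part of its overlay algorithm.
    """
    digest = hashlib.sha1(resource_name.encode("utf-8")).digest()
    return digest[:length_bytes]


if __name__ == "__main__":
    # Hypothetical Resource Name (a SIP AOR) used only as an example.
    rid = resource_id("sip:alice@example.org")
    print(len(rid), rid.hex())
```

Because the mapping is deterministic, a client and the responsible peer arrive at the same Resource-ID from the same Resource Name without the client implementing the rest of the overlay algorithm.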
1126 A client speaks the same protocol as the peers, knows how to 1127 calculate Resource-IDs, and signs its requests in the same manner as 1128 peers. While a client does not necessarily require a full 1129 implementation of the overlay algorithm, calculating the Resource-ID 1130 requires an implementation of an appropriate algorithm for the 1131 overlay. 1133 3.3. Routing 1135 This section discusses the capabilities of RELOAD's routing layer, 1136 the protocol features used to implement them, and a brief overview of 1137 how they are used. Appendix A discusses some alternative designs and 1138 the tradeoffs that would be necessary to support them. 1140 RELOAD's routing provides the following capabilities: 1142 Resource-based routing: RELOAD supports routing messages based 1143 solely on the name of the resource. Such messages are delivered 1144 to a node that is responsible for that resource. Both structured 1145 and unstructured overlays are supported, so the route may not be 1146 deterministic for all Topology Plugins. 1148 Node-based routing: RELOAD supports routing messages to a specific 1149 node in the overlay. 1151 Clients: RELOAD supports requests from and to clients that do not 1152 participate in overlay routing, located via either of the 1153 mechanisms described above. 1155 NAT Traversal: RELOAD supports establishing and using connections 1156 between nodes separated by one or more NATs, including locating 1157 peers behind NATs for those overlays allowing/requiring it. 1159 Low state: RELOAD's routing algorithms do not require significant 1160 state (i.e., state linear or greater in the number of outstanding 1161 messages that have passed through it) to be stored on intermediate 1162 peers. 1164 Routability in unstable topologies: Overlay topology changes 1165 constantly in an overlay of moderate size due to the failure of 1166 individual nodes and links in the system. 
RELOAD's routing allows 1167 peers to re-route messages when a failure is detected, and replies 1168 can be returned to the requesting node as long as the peers that 1169 originally forwarded the successful request do not fail before the 1170 response is returned. 1172 RELOAD's routing utilizes three basic mechanisms: 1174 Destination Lists: While in principle it is possible to just 1175 inject a message into the overlay with a single Node-ID as the 1176 destination, RELOAD provides a source routing capability in the 1177 form of "Destination Lists". A Destination List provides a list 1178 of the nodes through which a message flows in order (i.e., it is 1179 loose source routed). The minimal destination list contains just 1180 a single value. 1182 Via Lists: In order to allow responses to follow the same path as 1183 requests, each message also contains a "Via List", which is 1184 appended to by each node a message traverses. This via list can 1185 then be inverted and used as a destination list for the response. 1187 RouteQuery: The RouteQuery method allows a node to query a peer 1188 for the next hop it will use to route a message. This method is 1189 useful for diagnostics and for iterative routing (see 1190 Section 6.4.2.4). 1192 The basic routing mechanism used by RELOAD is Symmetric Recursive. 1193 We will first describe symmetric recursive routing and then discuss 1194 its advantages in terms of the requirements discussed above. 1196 Symmetric recursive routing requires that a request message follow a 1197 path through the overlay to the destination: each peer forwards the 1198 message closer to its destination. The return path of the response 1199 is then the same path followed in reverse. Note that a failure on 1200 the reverse path caused by a topology change after the request was 1201 sent will be handled by the end-to-end retransmission of the response 1202 as described in Section 6.2.1. 
For example, a message following a 1203 route from A to Z through B and X:
1205     A         B         X         Z
1206     -------------------------------
1208     ---------->
1209     Dest=Z
1210               ---------->
1211               Via=A
1212               Dest=Z
1213                         ---------->
1214                         Via=A,B
1215                         Dest=Z
1217                         <----------
1218                         Dest=X,B,A
1219               <----------
1220               Dest=B,A
1221     <----------
1222     Dest=A
1224 Note that the preceding Figure does not indicate whether A is a 1225 client or peer: A forwards its request to B and the response is 1226 returned to A in the same manner regardless of A's role in the 1227 overlay. 1229 This figure shows use of full via lists by intermediate peers B and 1230 X. However, if B and/or X are willing to store state, then they may 1231 elect to truncate the lists, save that information internally (keyed 1232 by the transaction ID), and return the response message along the 1233 path from which it was received when the response is received. This 1234 option requires greater state to be stored on intermediate peers but 1235 saves a small amount of bandwidth and reduces the need for modifying 1236 the message en route. Selection of this mode of operation is a 1237 choice for the individual peer; the techniques are interoperable even 1238 on a single message. The figure below shows B using full via lists 1239 but X truncating them to X1 and saving the state internally.
1241     A         B         X         Z
1242     -------------------------------
1244     ---------->
1245     Dest=Z
1246               ---------->
1247               Via=A
1248               Dest=Z
1249                         ---------->
1250                         Via=X1
1251                         Dest=Z
1253                         <----------
1254                         Dest=X,X1
1255               <----------
1256               Dest=B,A
1257     <----------
1258     Dest=A
1260 As before, when B receives the message, B creates a via list 1261 consisting of [A]. However, instead of sending [A, B], X creates an 1262 opaque ID X1 which maps internally to [A, B] (perhaps by being an 1263 encryption of [A, B]) and forwards to Z with only X1 as the via list.
1264 When the response arrives at X, it maps X1 back to [A, B] and then 1265 inverts it to produce the new destination list [B, A] and routes it 1266 to B. 1268 RELOAD also supports a basic Iterative "routing" mode (where the 1269 intermediate peers merely return a response indicating the next hop, 1270 but do not actually forward the message to that next hop themselves). 1271 Iterative "routing" is implemented using the RouteQuery method (see 1272 Section 6.4.2.4), which requests this behavior. Note that iterative 1273 "routing" is selected only by the initiating node. 1275 3.4. Connectivity Management 1277 In order to provide efficient routing, a peer needs to maintain a set 1278 of direct connections to other peers in the Overlay Instance. Due to 1279 the presence of NATs, these connections often cannot be formed 1280 directly. Instead, we use the Attach request to establish a 1281 connection. Attach uses Interactive Connectivity Establishment (ICE) 1282 [RFC5245] to establish the connection. It is assumed that the reader 1283 is familiar with ICE. 1285 Say that peer A wishes to form a direct connection to peer B, either 1286 to join the overlay or to add more connections in its Routing Table. 1287 It gathers ICE candidates and packages them up in an Attach request 1288 which it sends to B through usual overlay routing procedures. B does 1289 its own candidate gathering and sends back a response with its 1290 candidates. A and B then do ICE connectivity checks on the candidate 1291 pairs. The result is a connection between A and B. At this point, A 1292 and B MAY send messages directly between themselves without going 1293 through other overlay peers. In other words, A and B are in each 1294 other's Connection Tables. They MAY then execute an Update process, 1295 resulting in additions to each other's Routing Tables, and become 1296 able to route messages through each other to other overlay nodes. 1298 There are two cases where Attach is not used.
The first is when a 1299 peer is joining the overlay and is not connected to any peers. In 1300 order to support this case, some small number of "bootstrap nodes" 1301 typically need to be publicly accessible so that new peers can 1302 directly connect to them. Section 11 contains more detail on this. 1303 The second case is when a client connects to a peer at an arbitrary 1304 IP address, rather than to its responsible peer, as described in the 1305 second bullet point of Section 3.2.1. 1307 In general, a peer needs to maintain connections to all of the peers 1308 near it in the Overlay Instance and to enough other peers to have 1309 efficient routing (the details, e.g., on what "enough" or "near" 1310 means, depend on the specific overlay). If a peer cannot form a 1311 connection to some other peer, this is not necessarily a disaster; 1312 overlays can route correctly even without fully connected links. 1313 However, a peer needs to try to maintain the specified Routing Table 1314 defined by the topology plugin algorithm and needs to form new 1315 connections if it detects that it has fewer direct connections than 1316 specified by the algorithm. This also implies that peers, in 1317 accordance with the topology plugin algorithm, need to periodically 1318 verify that the connected peers are still alive and if not try to 1319 reform the connection or form an alternate one. See Section 10.7.4.3 1320 for an example on how a specific overlay algorithm implements these 1321 constraints. 1323 3.5. Overlay Algorithm Support 1325 The Topology Plugin allows RELOAD to support a variety of overlay 1326 algorithms. This specification defines a DHT based on Chord, which 1327 is mandatory to implement, but the base RELOAD protocol is designed 1328 to support a variety of overlay algorithms. The information needed 1329 to implement this DHT is fully contained in this specification but it 1330 is easier to understand if you are familiar with Chord [Chord] based 1331 DHTs. 
A nice tutorial can be found at [wikiChord]. 1333 3.5.1. Support for Pluggable Overlay Algorithms 1335 RELOAD defines three methods for overlay maintenance: Join, Update, 1336 and Leave. However, the contents of those messages, when they are 1337 sent, and their precise semantics are specified by the actual overlay 1338 algorithm, which is specified by configuration for all nodes in the 1339 overlay, and thus known to nodes prior to their attempting to join 1340 the overlay. RELOAD merely provides a framework of commonly-needed 1341 methods that provides uniformity of notation (and ease of debugging) 1342 for a variety of overlay algorithms. 1344 3.5.2. Joining, Leaving, and Maintenance Overview 1346 When a new peer wishes to join the Overlay Instance, it will need a 1347 Node-ID that it is allowed to use and a set of credentials which 1348 match that Node-ID. When an enrollment server is used, the Node-ID 1349 used is the Node-ID found in the certificate received from the 1350 enrollment server. The details of the joining procedure are defined 1351 by the overlay algorithm, but the general steps for joining an 1352 Overlay Instance are: 1354 o Forming connections to some other peers. 1356 o Acquiring the data values this peer is responsible for storing. 1358 o Informing the other peers which were previously responsible for 1359 that data that this peer has taken over responsibility. 1361 The first thing the peer needs to do is to form a connection to some 1362 "bootstrap node". Because this is the first connection the peer 1363 makes, these nodes will need public IP addresses so that they can be 1364 connected to directly. Once a peer has connected to one or more 1365 bootstrap nodes, it can form connections in the usual way by routing 1366 Attach messages through the overlay to other nodes. 
Once a peer has 1367 connected to the overlay for the first time, it can cache the set of 1368 past adjacencies which have public IP addresses and attempt to use them 1369 as future bootstrap nodes. Note that this requires some notion of 1370 which addresses are likely to be public as discussed in Section 9. 1372 Once a peer has connected to a bootstrap node, it then needs to take 1373 up its appropriate place in the overlay. This requires two major 1374 operations: 1376 o Forming connections to other peers in the overlay to populate its 1377 Routing Table. 1379 o Getting a copy of the data it is now responsible for storing and 1380 assuming responsibility for that data. 1382 The second operation is performed by contacting the Admitting Peer 1383 (AP), the node which is currently responsible for that section of the 1384 overlay. 1386 The details of this operation depend mostly on the overlay algorithm 1387 involved, but a typical case would be: 1389 1. JN (Joining Node) sends a Join request to AP (Admitting Peer) 1390 announcing its intention to join. 1392 2. AP sends a Join response. 1394 3. AP does a sequence of Stores to JN to give it the data it will 1395 need. 1397 4. AP does Updates to JN and to other peers to tell them about its own 1398 Routing Table. At this point, both JN and AP consider JN 1399 responsible for some section of the Overlay Instance. 1401 5. JN makes its own connections to the appropriate peers in the 1402 Overlay Instance. 1404 After this process is completed, JN is a full member of the Overlay 1405 Instance and can process Store/Fetch requests. 1407 Note that the first node is a special case. When ordinary nodes 1408 cannot form connections to the bootstrap nodes, then they are not 1409 part of the overlay. However, the first node in the overlay 1410 obviously cannot connect to other nodes.
In order to support this case, 1411 potential first nodes (which can also serve as bootstrap nodes 1412 initially) need to somehow be instructed that they are the entire 1413 overlay, rather than not part of it (e.g., by comparing their IP 1414 address to the bootstrap IP addresses in the configuration file). 1415 Note that clients do not perform either of these operations. 1417 3.6. First-Time Setup 1419 Previous sections addressed how RELOAD works once a node has 1420 connected. This section provides an overview of how users get 1421 connected to the overlay for the first time. RELOAD is designed so 1422 that users can start with the name of the overlay they wish to join 1423 and perhaps an account name and password, and leverage that into 1424 having a working peer with minimal user intervention. This helps 1425 avoid the problems that have been experienced with conventional SIP 1426 clients where users need to manually configure a large number of 1427 settings. 1429 3.6.1. Initial Configuration 1431 In the first phase of the process, the user starts out with the name 1432 of the overlay and uses this to download an initial set of overlay 1433 configuration parameters. The node does a DNS SRV [RFC2782] lookup 1434 on the overlay name to get the address of a configuration server. It 1435 can then connect to this server with HTTPS [RFC2818] to download a 1436 configuration document which contains the basic overlay configuration 1437 parameters as well as a set of bootstrap nodes which can be used to 1438 join the overlay. The details of the relations between names in the 1439 HTTPS certificates and the overlay names are described in 1440 Section 11.2. 1442 If a node already has a valid configuration document that it 1443 received by some out-of-band method, this step can be skipped. Note 1444 that this out-of-band method needs to provide authentication and 1445 integrity, because the configuration document contains the trust 1446 anchors used by the overlay.
1448 3.6.2. Enrollment 1450 If the overlay is using centralized enrollment, then a user needs to 1451 acquire a certificate before joining the overlay. The certificate 1452 attests both to the user's name within the overlay and to the Node- 1453 IDs which they are permitted to operate. In that case, the 1454 configuration document will contain the address of an enrollment 1455 server which can be used to obtain such a certificate, and will also 1456 contain the trust anchor, so this document must be retrieved securely 1457 (see Section 11.2). The enrollment server may (and probably will) 1458 require some sort of account name for the user and password before 1459 issuing the certificate. The enrollment server's ability to ensure 1460 attackers can not get a large number of certificates for the overlay 1461 is one of the cornerstones of RELOAD's security. 1463 3.6.3. Diagnostics 1465 Significant advice around managing a RELOAD overlay and extensions 1466 for diagnostics are described in [I-D.ietf-p2psip-diagnostics]. 1468 4. Application Support Overview 1470 RELOAD is not intended to be used alone, but rather as a substrate 1471 for other applications. These applications can use RELOAD for a 1472 variety of purposes: 1474 o To store data in the overlay and retrieve data stored by other 1475 nodes. 1477 o As a discovery mechanism for services such as TURN. 1479 o To form direct connections which can be used to transmit 1480 application-level messages without using the overlay. 1482 This section provides an overview of these services. 1484 4.1. Data Storage 1486 RELOAD provides operations to Store and Fetch data. Each location in 1487 the Overlay Instance is referenced by a Resource-ID. However, each 1488 location may contain data elements corresponding to multiple Kinds 1489 (e.g., certificate, SIP registration). 
Similarly, there may be 1490 multiple elements of a given Kind, as shown below:
1492 +--------------------------------+
1493 |           Resource-ID          |
1494 |                                |
1495 | +------------+ +------------+  |
1496 | |   Kind 1   | |   Kind 2   |  |
1497 | |            | |            |  |
1498 | | +--------+ | | +--------+ |  |
1499 | | | Value  | | | | Value  | |  |
1500 | | +--------+ | | +--------+ |  |
1501 | |            | |            |  |
1502 | | +--------+ | | +--------+ |  |
1503 | | | Value  | | | | Value  | |  |
1504 | | +--------+ | | +--------+ |  |
1505 | |            | +------------+  |
1506 | | +--------+ |                 |
1507 | | | Value  | |                 |
1508 | | +--------+ |                 |
1509 | +------------+                 |
1510 +--------------------------------+
1512 Each Kind is identified by a Kind-ID, which is a code point either 1513 assigned by IANA or allocated out of a private range. As part of the 1514 Kind definition, protocol designers may define constraints, such as 1515 limits on size, on the values which may be stored. For many Kinds, 1516 the set may be restricted to a single value; some sets may be allowed 1517 to contain multiple identical items while others may only have unique 1518 items. Note that a Kind may be employed by multiple usages and new 1519 usages are encouraged to use previously defined Kinds where possible. 1520 We define the following data models in this document, though other 1521 usages can define their own structures: 1523 single value: There can be at most one item in the set and any value 1524 overwrites the previous item. 1526 array: Many values can be stored and addressed by a numeric index. 1528 dictionary: The values stored are indexed by a key. Often this key 1529 is one of the values from the certificate of the peer sending the 1530 Store request. 1532 In order to protect stored data from tampering by other nodes, each 1533 stored value is individually digitally signed by the node which 1534 created it. When a value is retrieved, the digital signature can be 1535 verified to detect tampering.
If the certificate used to verify the 1536 stored value signature expires, the value can no longer be retrieved 1537 (though it may not be immediately garbage collected by the storing node) 1538 and the creating node will need to store the value again if it 1539 desires that stored value to continue to be available. 1541 4.1.1. Storage Permissions 1543 A major issue in peer-to-peer storage networks is minimizing the 1544 burden of becoming a peer, and in particular minimizing the amount of 1545 data which any peer needs to store for other nodes. RELOAD 1546 addresses this issue by only allowing any given node to store data at 1547 a small number of locations in the overlay, with those locations 1548 being determined by the node's certificate. When a peer uses a Store 1549 request to place data at a location authorized by its certificate, it 1550 signs that data with the private key that corresponds to its 1551 certificate. Then the peer responsible for storing the data is able 1552 to verify that the peer issuing the request is authorized to make 1553 that request. Each data Kind defines the exact rules for determining 1554 what certificate is appropriate. 1556 The most natural rule is that a certificate authorizes a user to 1557 store data keyed with their user name X. Thus, only a user with a 1558 certificate for "alice@example.org" could write to that location in 1559 the overlay (see Section 11.3). However, other usages can define any 1560 rules they choose, including publicly writable values. 1562 The digital signature over the data serves two purposes. First, it 1563 allows the peer responsible for storing the data to verify that this 1564 Store is authorized. Second, it provides integrity for the data. 1565 The signature is saved along with the data value (or values) so that 1566 any reader can verify the integrity of the data. Of course, the 1567 responsible peer can "lose" the value but it cannot undetectably 1568 modify it.
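The "most natural rule" for storage permissions described above can be illustrated with a minimal Python sketch. It is not taken from the specification: the function names are hypothetical, the Resource-ID calculation (SHA-1 of the name, truncated to 128 bits) is an assumption made only for illustration, and the signature check that a real responsible peer would also perform is left out.

```python
import hashlib


def resource_id(resource_name: str) -> bytes:
    """Map a Resource Name to a Resource-ID.

    Assumption for this sketch: SHA-1 truncated to 128 bits. The real
    mapping is defined by the overlay algorithm in use.
    """
    return hashlib.sha1(resource_name.encode("utf-8")).digest()[:16]


def store_is_authorized(certificate_user_names, target_resource_id: bytes) -> bool:
    """Return True if one of the user names in the requester's certificate
    hashes to the location the Store request is trying to write.

    Hypothetical helper: it models only the "user name X may write to the
    location keyed by X" rule, not signature verification.
    """
    return any(resource_id(name) == target_resource_id
               for name in certificate_user_names)


# Usage: only a certificate for alice@example.org may write Alice's location.
alice_location = resource_id("alice@example.org")
alice_ok = store_is_authorized(["alice@example.org"], alice_location)
bob_ok = store_is_authorized(["bob@example.org"], alice_location)
```

In this sketch `alice_ok` is true and `bob_ok` is false. A Kind with a different access control rule, such as a publicly writable value, would replace `store_is_authorized` with its own check.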
1570 The size requirements of the data being stored in the overlay are 1571 variable. For instance, a SIP AOR and voicemail differ widely in the 1572 storage size. RELOAD leaves it to the Usage and overlay 1573 configuration to limit size imbalance of various Kinds. 1575 4.1.2. Replication 1577 Replication in P2P overlays can be used to provide: 1579 persistence: if the responsible peer crashes and/or if the storing 1580 peer leaves the overlay 1582 security: to guard against DoS attacks by the responsible peer or 1583 routing attacks to that responsible peer 1585 load balancing: to balance the load of queries for popular 1586 resources. 1588 A variety of schemes are used in P2P overlays to achieve some of 1589 these goals. Common techniques include replicating on neighbors of 1590 the responsible peer, randomly locating replicas around the overlay, 1591 or replicating along the path to the responsible peer. 1593 The core RELOAD specification does not specify a particular 1594 replication strategy. Instead, the first level of replication 1595 strategies are determined by the overlay algorithm, which can base 1596 the replication strategy on its particular topology. For example, 1597 Chord places replicas on successor peers, which will take over 1598 responsibility if the responsible peer fails [Chord]. 1600 If additional replication is needed, for example if data persistence 1601 is particularly important for a particular usage, then that usage may 1602 specify additional replication, such as implementing random 1603 replications by inserting a different well known constant into the 1604 Resource Name used to store each replicated copy of the resource. 1605 Such replication strategies can be added independent of the 1606 underlying algorithm, and their usage can be determined based on the 1607 needs of the particular usage. 1609 4.2. Usages 1611 By itself, the distributed storage layer just provides infrastructure 1612 on which applications are built. 
In order to do anything useful, a 1613 usage needs to be defined. Each Usage needs to specify several 1614 things: 1616 o Register Kind-ID code points for any Kinds that the Usage defines 1617 (Section 14.6). 1619 o Define the data structure for each of the Kinds (the value member 1620 in Section 7.2). If the data structure contains character strings, 1621 conversion rules between characters and the binary storage need to 1622 be specified. 1624 o Define access control rules for each of the Kinds (Section 7.3). 1626 o Define how the Resource Name is used to form the Resource-ID where 1627 each Kind is stored. 1629 o Describe how values will be merged when a network partition is 1630 being healed. 1632 The Kinds defined by a usage may also be applied to other usages. 1633 However, a need for different parameters, such as a different access 1634 control model, would imply the need to create a new Kind. 1636 4.3. Service Discovery 1638 RELOAD does not currently define a generic service discovery 1639 algorithm as part of the base protocol, although a simplistic TURN- 1640 specific discovery mechanism is provided. A variety of service 1641 discovery algorithms can be implemented as extensions to the base 1642 protocol, such as the service discovery algorithm ReDIR 1643 [opendht-sigcomm05] or [I-D.ietf-p2psip-service-discovery]. 1645 4.4. Application Connectivity 1647 There is no requirement that a RELOAD usage use RELOAD's 1648 primitives for establishing its own communication if it already 1649 possesses its own means of establishing connections. For example, 1650 one could design a RELOAD-based resource discovery protocol which 1651 used HTTP to retrieve the actual data. 1653 For more common situations, however, it is the overlay itself - 1654 rather than an external authority such as DNS - which is used to 1655 establish a connection. RELOAD provides connectivity to applications 1656 using the AppAttach method.
For example, if a P2PSIP node wishes to 1657 establish a SIP dialog with another P2PSIP node, it will use 1658 AppAttach to establish a direct connection with the other node. This 1659 new connection is separate from the peer protocol connection. It is 1660 a dedicated DTLS or TLS flow used only for the SIP dialog. 1662 5. RFC 2119 Terminology 1664 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 1665 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 1666 document are to be interpreted as described in RFC 2119 [RFC2119]. 1668 6. Overlay Management Protocol 1670 This section defines the basic protocols used to create, maintain, 1671 and use the RELOAD overlay network. We start by defining the basic 1672 concept of how message destinations are interpreted when routing 1673 messages. We then describe the symmetric recursive routing model, 1674 which is RELOAD's default routing algorithm. We then define the 1675 message structure and then finally define the messages used to join 1676 and maintain the overlay. 1678 6.1. Message Receipt and Forwarding 1680 When a node receives a message, it first examines the overlay, 1681 version, and other header fields to determine whether the message is 1682 one it can process. If any of these are incorrect, as defined in 1683 Section 6.3.2, it is an error and the message MUST be discarded. The 1684 peer SHOULD generate an appropriate error but local policy can 1685 override this and cause the messages to be silently dropped. 1687 Once the peer has determined that the message is correctly formatted 1688 (note that this does not include signature checking on intermediate 1689 nodes as the message may be fragmented) it examines the first entry 1690 on the destination list. There are three possible cases here: 1692 o The first entry on the destination list is an ID for which the 1693 peer is responsible. A peer is always responsible for the 1694 wildcard Node-ID. 
Handling of this case is described in 1695 Section 6.1.1. 1697 o The first entry on the destination list is an ID for which another 1698 peer is responsible. Handling of this case is described in 1699 Section 6.1.2. 1701 o The first entry on the destination list is an opaque ID that is 1702 being used for destination list compression. Handling of this 1703 case is described in Section 6.1.3. Note that opaque IDs can be 1704 distinguished from Node-IDs and Resource-IDs on the wire as 1705 described in Section 6.3.2.2. 1707 These cases are handled as discussed below. 1709 6.1.1. Responsible ID 1711 If the first entry on the destination list is an ID for which the 1712 peer is responsible, there are several (mutually exclusive) sub-cases 1713 to consider. 1715 o If the entry is a Resource-ID, then it MUST be the only entry on 1716 the destination list. If there are other entries, the message 1717 MUST be silently dropped. Otherwise, the message is destined for 1718 this node so it MUST verify the signature as described in 1719 Section 7.1 and MUST pass it up to the upper layers. "Upper 1720 layers" is used here to mean the components above the "Overlay 1721 Link Service Boundary" line in the figure in Section 1.2. 1723 o If the entry is a Node-ID which equals this node's Node-ID, then 1724 the message is destined for this node. If this is the only entry 1725 on the destination list, the message is destined for this node and 1726 so the node passes it up to the upper layers. Otherwise the node 1727 removes the entry from the destination list and repeats the 1728 routing process with the next entry on the destination list. If 1729 the message is a response and list compression was used, then the 1730 node first modifies the destination list to reinsert the saved 1731 state, e.g., by unpacking any opaque IDs. 1733 o If the entry is the wildcard Node-ID (all "1"s), the message is 1734 destined for this node and it passes it up to the upper layers. 
A 1735 message with a wildcard Node-ID as first entry is never forwarded 1736 and is consumed locally. 1738 o If the entry is a Node-ID which is not equal to this node, then 1739 the node MUST drop the message silently unless the Node-ID 1740 corresponds to a node which is directly connected to this node 1741 (i.e., a client). In the latter case, it MUST forward the message 1742 to the destination node as described in the next section. 1744 Note that this implies that in order to address a message to "the 1745 peer that controls region X", a sender sends to Resource-ID X, not 1746 Node-ID X. 1748 6.1.2. Other ID 1750 If the first entry in the destination list is neither an opaque ID 1751 nor an ID the peer is responsible for, then the peer MUST forward the 1752 message towards this entry. This means that it MUST select one of 1753 the peers to which it is connected and which is most likely to be 1754 responsible (according to the topology plugin) for the first entry on 1755 the destination list. For the CHORD-RELOAD topology, the routing to 1756 the most likely responsible node is explained in Section 10.3. If 1757 the first entry on the destination list is in the peer's Connection 1758 Table, then it MUST forward the message to that peer directly. 1759 Otherwise, the peer consults the Routing Table to forward the 1760 message. 1762 Any intermediate peer which forwards a RELOAD request MUST ensure 1763 that if it receives a response to that message the response can be 1764 routed back through the set of nodes through which the request 1765 passed. The peer selects one of these approaches: 1767 o The peer can add an entry to the via list in the forwarding header 1768 that will enable it to determine the correct node. This is done 1769 by appending to the via list the Node-ID of the node that sent the 1770 request to this node. 1772 o The peer can keep per-transaction state which will allow it to 1773 determine the correct node. 
1775 As an example of the first strategy, consider an example with nodes 1776 A, B, C, D and E. If node D receives a message from node C with via 1777 list [A, B], then D would forward to the next node E with via list 1778 [A, B, C]. Now, if E wants to respond to the message, it reverses 1779 the via list to produce the destination list, resulting in [D, C, B, 1780 A]. When D forwards the response to C, the destination list will 1781 contain [C, B, A]. 1783 As an example of the second strategy, if node D receives a message 1784 from node C with transaction ID X (as assigned by A) and via list [A, 1785 B], it could store [X, C] in its state database and forward the 1786 message with the via list unchanged. When D receives the response, 1787 it consults its state database for transaction ID X, determines that 1788 the request came from C, and forwards the response to C. 1790 Intermediate peers which modify the via list are not required to 1791 simply add entries. The only requirement is that the peer MUST be 1792 able to reconstruct the correct destination list on the return route. 1793 RELOAD provides explicit support for this functionality in the form 1794 of opaque IDs, which can replace any number of via list entries. 1796 For instance, in the above example, Node D might send E a via list 1797 containing only the opaque ID I. E would then use the destination 1798 list [D, I] to send its return message. When D processes this 1799 destination list, it would detect that I is an opaque ID, recover the 1800 via list [A, B, C], and reverse that to produce the correct 1801 destination list [C, B, A] before sending it to C. This feature is 1802 called List Compression. Possibilities for an opaque ID include a 1803 compressed version of the original via list or an index into a state 1804 database containing the original via list, but the details are a 1805 local matter. 
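The via list examples above (nodes A through E, and the opaque ID I) can be traced with a short Python sketch. This is an illustration under stated assumptions, not the wire format: Node-IDs are modeled as plain strings, opaque IDs as tuples, and the names `forward_with_via`, `response_destination`, and `CompressingPeer` are hypothetical. How a real peer builds its opaque ID is a local matter, as noted above; an index into a state table is used here.

```python
import itertools


def forward_with_via(via_list, previous_hop):
    """First strategy: append the Node-ID of the node that sent the request."""
    return via_list + [previous_hop]


def response_destination(via_list, previous_hop):
    """Destination list for a response: the node the request came from,
    followed by the received via list in reverse order."""
    return [previous_hop] + via_list[::-1]


class CompressingPeer:
    """Peer that replaces the via list with one opaque ID (List Compression)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self._saved = {}                 # opaque ID -> full via list
        self._next = itertools.count(1)  # source of fresh table indexes

    def forward_compressed(self, via_list, previous_hop):
        """Save the full via list locally and forward only an opaque ID."""
        opaque_id = ("opaque", self.node_id, next(self._next))
        self._saved[opaque_id] = via_list + [previous_hop]
        return [opaque_id]

    def route_response(self, destination_list):
        """Strip this peer's own entry and expand an opaque ID if one follows."""
        if destination_list[0] != self.node_id:
            raise ValueError("response is not addressed to this peer")
        rest = destination_list[1:]
        if rest and rest[0] in self._saved:
            rest = self._saved.pop(rest[0])[::-1] + rest[1:]
        return rest


# D receives a request from C with via list [A, B] and forwards it to E.
via_at_e = forward_with_via(["A", "B"], "C")           # [A, B, C]
dest_from_e = response_destination(via_at_e, "D")      # [D, C, B, A]

# Same exchange, but D compresses the via list into an opaque ID.
d = CompressingPeer("D")
compressed_via = d.forward_compressed(["A", "B"], "C")
dest_compressed = response_destination(compressed_via, "D")
dest_after_d = d.route_response(dest_compressed)       # [C, B, A]
```

Both variants leave D with the destination list [C, B, A] for the response; they differ only in whether the path state travels in the message or stays in D's table.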
1807 No matter what mechanism for storing via list state is used, if an 1808 intermediate peer exits the overlay, then on the return trip the 1809 message cannot be forwarded and will be dropped. The ordinary 1810 timeout and retransmission mechanisms provide stability over this 1811 type of failure. 1813 Note that if an intermediate peer retains per-transaction state 1814 instead of modifying the via list, it needs some mechanism for timing 1815 out that state; otherwise its state database will grow without bound. 1816 Whatever algorithm is used, unless a FORWARD_CRITICAL forwarding 1817 option (Section 6.3.2.3) or overlay configuration option explicitly 1818 indicates this state is not needed, the state MUST be maintained for 1819 at least the value of the overlay-reliability-timer configuration 1820 parameter and MAY be kept longer. Future extensions, such as 1821 [I-D.ietf-p2psip-rpr], may define mechanisms for determining when 1822 this state does not need to be retained. 1824 There is no requirement to ensure that a request issued after the 1825 receipt of a response follows the same path as the response. As a 1826 consequence, there is no requirement to use either of the mechanisms 1827 described above (via list or state retention) when processing a 1828 response message. 1830 An intermediate node receiving a request from another node MUST 1831 return a response to this request with a destination list equal to 1832 the concatenation of the Node-ID of the node that sent the request 1833 with the via list in the request. The intermediate node normally 1834 learns the Node-ID the other node is using via an Attach, but a node 1835 using a certificate with a single Node-ID MAY elect to not send an 1836 Attach (see Section 3.2.1 bullet 2).
If a node with a certificate 1837 with multiple Node-IDs attempts to route a message other than a Ping 1838 or Attach through a node without performing an Attach, the receiving 1839 node MUST reject the request with an Error_Forbidden error. The node 1840 MUST implement support for returning responses to a Ping or Attach 1841 request made by a joining node Attaching to its responsible peer. 1843 6.1.3. Opaque ID 1845 If the first entry in the destination list is an opaque ID (e.g., a 1846 compressed via list), the peer MUST replace that entry with the 1847 original via list that it replaced and then re-examine the 1848 destination list to determine which of the three cases in Section 6.1 1849 now applies. 1851 6.2. Symmetric Recursive Routing 1853 This Section defines RELOAD's Symmetric Recursive Routing (SRR) 1854 algorithm, which is the default algorithm used by nodes to route 1855 messages through the overlay. All implementations MUST implement 1856 this routing algorithm. An overlay MAY be configured to use 1857 alternative routing algorithms, and alternative routing algorithms 1858 MAY be selected on a per-message basis. I.e., a node in an overlay 1859 which supports SRR and some other routing algorithm called XXX might 1860 use SRR some of the time and XXX some of the time. 1862 6.2.1. Request Origination 1864 In order to originate a message to a given Node-ID or Resource-ID, a 1865 node MUST construct an appropriate destination list. The simplest 1866 such destination list is a single entry containing the Node-ID or 1867 Resource-ID. The resulting message MUST use the normal overlay 1868 routing mechanisms to forward the message to that destination. The 1869 node MAY also construct a more complicated destination list for 1870 source routing. 1872 Once the message is constructed, the node sends the message to some 1873 adjacent peer. 
If the first entry on the destination list is 1874 directly connected, then the message MUST be routed down that 1875 connection. Otherwise, the topology plugin MUST be consulted to 1876 determine the appropriate next hop. 1878 Parallel requests for a resource are a common solution to improve 1879 reliability in the face of churn or of subversive peers. Parallel 1880 searches for usage-specified replicas are managed by the usage layer, 1881 for instance by having the usage store data at multiple Resource-IDs 1882 with the requesting node sending requests to each of those Resource- 1883 IDs. However, a single request MAY also be routed through multiple 1884 adjacent peers, even when known to be sub-optimal, to improve 1885 reliability [vulnerabilities-acsac04]. Such parallel searches MAY be 1886 specified by the topology plugin, in which case it would return 1887 multiple next hops and the request would be routed to all of them. 1889 Because messages can be lost in transit through the overlay, RELOAD 1890 incorporates an end-to-end reliability mechanism. When an 1891 originating node transmits a request, it MUST set a timer to the 1892 current overlay-reliability-timer. If a response has not been 1893 received when the timer fires, the request MUST be retransmitted with 1894 the same transaction identifier. The request MAY be retransmitted up 1895 to 4 times (for a total of 5 messages). After the timer for the 1896 fifth transmission fires, the message MUST be considered to have 1897 failed. Although the originating node will be doing both end-to-end 1898 and hop-by-hop retransmissions, the end-to-end retransmission 1899 procedure is not followed by intermediate nodes. They follow the 1900 hop-by-hop reliability procedure described in Section 6.6.3. 1902 The above algorithm can result in multiple requests being delivered 1903 to a node.
Receiving nodes MUST generate semantically equivalent 1904 responses to retransmissions of the same request (this can be 1905 determined by transaction ID) if the request is received within the 1906 maximum request lifetime (15 seconds). For some requests (e.g., 1907 Fetch), this can be accomplished merely by processing the request 1908 again. For other requests (e.g., Store), it may be necessary to 1909 maintain state for the duration of the request lifetime. 1911 6.2.2. Response Origination 1913 When a peer sends a response to a request using this routing 1914 algorithm, it MUST construct the destination list by reversing the 1915 order of the entries on the via list. This has the result that the 1916 response traverses the same peers as the request traversed, except in 1917 reverse order (symmetric routing). 1919 6.3. Message Structure 1921 RELOAD is a message-oriented request/response protocol. The messages 1922 are encoded using binary fields. All integers are represented in 1923 network byte order. The general philosophy behind the design was to 1924 use Type, Length, Value fields to allow for extensibility. However, 1925 for the parts of a structure that were required in all messages, we 1926 just define these in a fixed position, as adding a type and length 1927 for them is unnecessary and would simply increase bandwidth and 1928 introduce new potential for interoperability issues. 1930 Each message has three parts, concatenated as shown below: 1932 +-------------------------+ 1933 | Forwarding Header | 1934 +-------------------------+ 1935 | Message Contents | 1936 +-------------------------+ 1937 | Security Block | 1938 +-------------------------+ 1940 The contents of these parts are as follows: 1942 Forwarding Header: Each message has a generic header which is used 1943 to forward the message between peers and to its final destination.
1944 This header is the only information that an intermediate peer 1945 (i.e., one that is not the target of a message) needs to examine. 1946 Section 6.3.2 describes the format of this part. 1948 Message Contents: The message being delivered between the peers. 1949 From the perspective of the forwarding layer, the contents are 1950 opaque, however, they are interpreted by the higher layers. 1951 Section 6.3.3 describes the format of this part. 1953 Security Block: A security block containing certificates and a 1954 digital signature over the "Message Contents" section. Note that 1955 this signature can be computed without parsing the message 1956 contents. All messages MUST be signed by their originator. 1957 Section 6.3.4 describes the format of this part. 1959 6.3.1. Presentation Language 1961 The structures defined in this document are defined using a C-like 1962 syntax based on the presentation language used to define TLS 1963 [RFC5246]. Advantages of this style include: 1965 o It is familiar enough looking that most readers can grasp it 1966 quickly. 1968 o The ability to define nested structures allows a separation 1969 between high-level and low-level message structures. 1971 o It has a straightforward wire encoding that allows quick 1972 implementation, but the structures can be comprehended without 1973 knowing the encoding. 1975 o The ability to mechanically compile encoders and decoders. 1977 Several idiosyncrasies of this language are worth noting. 1979 o All lengths are denoted in bytes, not objects. 1981 o Variable length values are denoted like arrays with angle 1982 brackets. 1984 o "select" is used to indicate variant structures. 1986 For instance, "uint16 array<0..2^8-2>;" represents up to 254 bytes 1987 which corresponds to up to 127 values of two bytes (16 bits) each. 1989 A repetitive structure member shares a common notation with a member 1990 containing a variable length block of data. 
The latter always starts 1991 with "opaque" whereas the former does not. For instance, the 1992 following denotes a variable block of data: 1994 opaque data<0..2^32-1>; 1996 whereas the following denotes a list of 0, 1 or more instances of the 1997 Name element: 1999 Name names<0..2^32-1>; 2001 6.3.1.1. Common Definitions 2003 This section provides an introduction to the presentation language 2004 used throughout RELOAD. 2006 An enum represents an enumerated type. The values associated with 2007 each possibility are represented in parentheses and the maximum value 2008 is represented as a nameless value, for purposes of describing the 2009 width of the containing integral type. For instance, Boolean 2010 represents a true or false: 2012 enum { false(0), true(1), (255) } Boolean; 2014 A boolean value is either a 1 or a 0. The max value of 255 indicates 2015 this is represented as a single byte on the wire. 2017 The NodeId, shown below, represents a single Node-ID. 2019 typedef opaque NodeId[NodeIdLength]; 2021 A NodeId is a fixed-length structure represented as a series of 2022 bytes, with the most significant byte first. The length is set on a 2023 per-overlay basis within the range of 16-20 bytes (128 to 160 bits). 2024 (See Section 11.1 for how NodeIdLength is set.) Note: the use of 2025 "typedef" here is an extension to the TLS language, but its meaning 2026 should be relatively obvious. Note that the [ size ] syntax defines a 2027 fixed length element that does not include the length of the element 2028 in the on-the-wire encoding. 2030 A ResourceId, shown below, represents a single Resource-ID. 2032 typedef opaque ResourceId<0..2^8-1>; 2034 Like a NodeId, a ResourceId is an opaque string of bytes, but unlike 2035 NodeIds, ResourceIds are variable length, up to 255 bytes (2040 bits) 2036 in length. On the wire, each ResourceId is preceded by a single 2037 length byte (allowing lengths up to 255). Thus, the 3-byte value 2038 "FOO" would be encoded as: 03 46 4f 4f.
Note the < range > syntax 2039 defines a variable length element that does include the length of the 2040 element in the on the wire encoding. The number of bytes to encode 2041 the length on the wire is derived by range; i.e., it is the minimum 2042 number of bytes which can encode the largest range value. 2044 A more complicated example is IpAddressPort, which represents a 2045 network address and can be used to carry either an IPv6 or IPv4 2046 address: 2048 enum { invalidAddressType(0), ipv4_address(1), ipv6_address(2), 2049 (255) } AddressType; 2051 struct { 2052 uint32 addr; 2053 uint16 port; 2054 } IPv4AddrPort; 2056 struct { 2057 uint128 addr; 2058 uint16 port; 2059 } IPv6AddrPort; 2061 struct { 2062 AddressType type; 2063 uint8 length; 2065 select (type) { 2066 case ipv4_address: 2067 IPv4AddrPort v4addr_port; 2069 case ipv6_address: 2070 IPv6AddrPort v6addr_port; 2072 /* This structure can be extended */ 2073 }; 2074 } IpAddressPort; 2076 The first two fields in the structure are the same no matter what 2077 kind of address is being represented: 2079 type: the type of address (v4 or v6). 2081 length: the length of the rest of the structure. 2083 By having the type and the length appear at the beginning of the 2084 structure regardless of the kind of address being represented, an 2085 implementation which does not understand new address type X can still 2086 parse the IpAddressPort field and then discard it if it is not 2087 needed. 2089 The rest of the IpAddressPort structure is either an IPv4AddrPort or 2090 an IPv6AddrPort. Both of these simply consist of an address 2091 represented as an integer and a 16-bit port. As an example, here is 2092 the wire representation of the IPv4 address "192.0.2.1" with port 2093 "6084". 
2095 01 ; type = IPv4 2096 06 ; length = 6 2097 c0 00 02 01 ; address = 192.0.2.1 2098 17 c4 ; port = 6084 2100 Unless a given structure that uses a select explicitly allows for 2101 unknown types in the select, any unknown type SHOULD be treated as a 2102 parsing error and the whole message discarded with no response. 2104 6.3.2. Forwarding Header 2106 The forwarding header is defined as a ForwardingHeader structure, as 2107 shown below. 2109 struct { 2110 uint32 relo_token; 2111 uint32 overlay; 2112 uint16 configuration_sequence; 2113 uint8 version; 2114 uint8 ttl; 2115 uint32 fragment; 2116 uint32 length; 2117 uint64 transaction_id; 2118 uint32 max_response_length; 2119 uint16 via_list_length; 2120 uint16 destination_list_length; 2121 uint16 options_length; 2122 Destination via_list[via_list_length]; 2123 Destination destination_list 2124 [destination_list_length]; 2125 ForwardingOption options[options_length]; 2126 } ForwardingHeader; 2128 The contents of the structure are: 2130 relo_token: The first four bytes identify this message as a RELOAD 2131 message. This field MUST contain the value 0xd2454c4f (the string 2132 'RELO' with the high bit of the first byte set). 2134 overlay: The 32 bit checksum/hash of the overlay being used. This 2135 MUST be formed by taking the lower 32 bits of the SHA-1 [RFC3174] 2136 hash of the overlay name. The purpose of this field is to allow 2137 nodes to participate in multiple overlays and to detect accidental 2138 misconfiguration. This is not a security critical function. The 2139 overlay name MUST consist of a sequence of characters that would 2140 be allowable as a DNS name. Specifically, as it is used in a DNS 2141 lookup, it will need to be compliant with the grammar for the 2142 domain as specified in section 2.3.1 of [RFC1035]. 2144 configuration_sequence: The sequence number of the configuration 2145 file. See Section 6.3.2.1 for details. 2147 version: The version of the RELOAD protocol being used. This is a
This is a 2148 fixed point integer between 0.1 and 25.4. This document describes 2149 version 1.0, with a value of 0x0a. [Note: Pre-RFC versions used 2150 version number 0.1]. Nodes MUST reject messages with other 2151 versions. 2153 ttl: An 8 bit field indicating the number of iterations, or hops, a 2154 message can experience before it is discarded. The TTL value MUST 2155 be decremented by one at every hop along the route the message 2156 traverses just before transmission. If a received message has a 2157 TTL of 0, and the message is not destined for the receiving node, 2158 then the message MUST NOT be propagated further and a 2159 "Error_TTL_Exceeded" error should be generated. The initial value 2160 of the TTL SHOULD be 100 and MUST NOT exceed 100 unless defined 2161 otherwise by the overlay configuration. Implementations which 2162 receive message with a TTL greater than the current value of 2163 initial-ttl (or the 100 default) MUST discard the message and send 2164 an "Error_TTL_Exceeded" error. 2166 fragment: This field is used to handle fragmentation. The high bit 2167 (0x80000000) MUST be set for historical reasons. If the next bit 2168 (0x40000000) is set to 1, it indicates that this is the last (or 2169 only) fragment. The next six bits (0x20000000 to 0x01000000) are 2170 reserved and SHOULD be set to zero. The remainder of the field is 2171 used to indicate the fragment offset; see Section 6.7. 2173 length: The count in bytes of the size of the message including the 2174 header, after the eventual fragmentation. 2176 transaction_id: A unique 64 bit number that identifies this 2177 transaction and also allows receivers to disambiguate transactions 2178 which are otherwise identical. In order to provide a high 2179 probability that transaction IDs are unique, they MUST be randomly 2180 generated. Responses use the same transaction ID as the request 2181 they correspond to. Transaction IDs are also used for fragment 2182 reassembly. 
See Section 6.7 for details. 2184 max_response_length: The maximum size in bytes of a response. Used 2185 by requesting nodes to avoid receiving (unexpected) very large 2186 responses. If this value is non-zero, responding peers MUST check 2187 that any response would not exceed it and if so generate an 2188 "Error_Incompatible_with_Overlay" value. This value SHOULD be set 2189 to zero for responses. 2191 via_list_length: The length of the via list in bytes. Note that in 2192 this field and the following two length fields we depart from the 2193 usual variable-length convention of having the length immediately 2194 precede the value in order to make it easier for hardware decoding 2195 engines to quickly determine the length of the header. 2197 destination_list_length: The length of the destination list in 2198 bytes. 2200 options_length: The length of the header options in bytes. 2202 via_list: The via_list contains the sequence of destinations through 2203 which the message has passed. The via_list starts out empty and 2204 grows as the message traverses each peer. In stateless cases, the 2205 previous hop that the message is from is appended to the via list 2206 as specified in Section 6.1.2. 2208 destination_list: The destination_list contains a sequence of 2209 destinations which the message should pass through. The 2210 destination list is constructed by the message originator. The 2211 first element in the destination list is where the message goes 2212 next. The list shrinks as the message traverses each listed peer. 2214 options: Contains a series of ForwardingOption entries. See 2215 Section 6.3.2.3. 2217 6.3.2.1. Processing Configuration Sequence Numbers 2219 In order to be part of the overlay, a node MUST have a copy of the 2220 overlay configuration document. In order to allow for configuration 2221 document changes, each version of the configuration document MUST 2222 contain a sequence number which MUST be monotonically increasing mod 2223 65535. 
Because the sequence number may in principle wrap, greater 2224 than or less than are interpreted by modulo arithmetic as in TCP. 2226 When a destination node receives a request, it MUST check that the 2227 configuration_sequence field is equal to its own configuration 2228 sequence number. If they do not match, it MUST generate an error, 2229 either Error_Config_Too_Old or Error_Config_Too_New. In addition, if 2230 the configuration file in the request is too old, it MUST generate a 2231 ConfigUpdate message to update the requesting node. This allows new 2232 configuration documents to propagate quickly throughout the system. 2233 The one exception to this rule is that if the configuration_sequence 2234 field is equal to 65535, and the message type is ConfigUpdate, then 2235 the message MUST be accepted regardless of the receiving node's 2236 configuration sequence number. Since 65535 is a special value, peers 2237 sending a new configuration when the configuration sequence is 2238 currently 65534 MUST set the configuration sequence number to 0 when 2239 they send out a new configuration. 2241 6.3.2.2. 
Destination and Via Lists 2243 The destination list and via list are sequences of Destination 2244 values: 2246 enum { invalidDestinationType(0), node(1), resource(2), 2247 opaque_id_type(3), /* 128-255 not allowed */ (255) } 2248 DestinationType; 2250 select (destination_type) { 2251 case node: 2252 NodeId node_id; 2254 case resource: 2255 ResourceId resource_id; 2257 case opaque_id_type: 2258 opaque opaque_id<0..2^8-1>; 2260 /* This structure may be extended with new types */ 2261 } DestinationData; 2263 struct { 2264 DestinationType type; 2265 uint8 length; 2266 DestinationData destination_data; 2267 } Destination; 2269 struct { 2270 uint16 opaque_id; /* top bit MUST be 1 */ 2271 } Destination; 2273 If the destination structure is a 16 bit integer, then the first bit 2274 MUST be set to 1 and it MUST be treated as if it were a full 2275 structure with a DestinationType of opaque_id_type and an opaque_id 2276 that was 2 bytes long with the value of the 16 bit integer. If the 2277 destination structure is starting with DestinationType, then the 2278 first bit MUST be set to 0 and it is using the TLV structure with the 2279 following contents: 2281 type 2282 The type of the DestinationData Payload Data Unit (PDU). This may 2283 be one of "node", "resource", or "opaque_id_type". 2285 length 2286 The length of the destination_data. 2288 destination_data 2289 The destination value itself, which is an encoded DestinationData 2290 structure, depending on the value of "type". 2292 Note: This structure encodes a type, length, value. The length 2293 field specifies the length of the DestinationData values, which 2294 allows the addition of new DestinationTypes. This allows an 2295 implementation which does not understand a given DestinationType 2296 to skip over it. 2298 A DestinationData can be one of three types: 2300 node 2301 A Node-ID. 2303 opaque 2304 A compressed list of Node-IDs and an eventual Resource-ID. 
2305 Because this value was compressed by one of the peers, it is only 2306 meaningful to that peer and cannot be decoded by other peers. 2307 Thus, it is represented as an opaque string. 2309 resource 2310 The Resource-ID of the resource which is desired. This type MUST 2311 only appear in the final location of a destination list and MUST 2312 NOT appear in a via list. It is meaningless to try to route 2313 through a resource. 2315 One possible encoding of the 16 bit integer version as an opaque 2316 identifier is to encode an index into a Connection Table. To avoid 2317 misrouting responses in the event a response is delayed and the 2318 Connection Table entry has changed, the identifier SHOULD be split 2319 between an index and a generation counter for that index. At 2320 startup, the generation counters SHOULD be initialized to random 2321 values. An implementation MAY use 12 bits for the Connection Table 2322 index and 3 bits for the generation counter. (Note that this does 2323 not suggest a 4096 entry Connection Table for every peer, only the 2324 ability to encode for a larger Connection Table.) When a Connection 2325 Table slot is used for a new connection, the generation counter is 2326 incremented (with wrapping). Connection Table slots are used on a 2327 rotating basis to maximize the time interval between uses of the same 2328 slot for different connections. When routing a message to an entry 2329 in the destination list encoding a Connection Table entry, the peer 2330 MUST confirm that the generation counter matches the current 2331 generation counter of that index before forwarding the message. If 2332 it does not match, the message MUST be silently dropped. 2334 6.3.2.3. 
Forwarding Option 2336 The Forwarding header can be extended with forwarding header options, 2337 which are a series of ForwardingOption structures: 2339 enum { invalidForwardingOptionType(0), (255) } 2340 ForwardingOptionType; 2342 struct { 2343 ForwardingOptionType type; 2344 uint8 flags; 2345 uint16 length; 2346 select (type) { 2347 /* This type may be extended */ 2348 }; 2349 } ForwardingOption; 2351 Each ForwardingOption consists of the following values: 2353 type 2354 The type of the option. This structure allows for unknown option 2355 types. 2357 flags 2358 Three flags are defined: FORWARD_CRITICAL(0x01), 2359 DESTINATION_CRITICAL(0x02), and RESPONSE_COPY(0x04). These flags 2360 MUST NOT be set in a response. If the FORWARD_CRITICAL flag is 2361 set, any peer that would forward the message but does not 2362 understand this option MUST reject the request with an 2363 Error_Unsupported_Forwarding_Option error response. If the 2364 DESTINATION_CRITICAL flag is set, any node that generates a 2365 response to the message but does not understand the forwarding 2366 option MUST reject the request with an 2367 Error_Unsupported_Forwarding_Option error response. If the 2368 RESPONSE_COPY flag is set, any node generating a response MUST 2369 copy the option from the request to the response except that the 2370 RESPONSE_COPY, FORWARD_CRITICAL and DESTINATION_CRITICAL flags 2371 MUST be cleared. 2373 length 2374 The length of the rest of the structure. Note that a 0 length may 2375 be reasonable if the mere presence of the option is meaningful and 2376 no value is required. 2378 option 2379 The option value. 2381 6.3.3.
Message Contents Format 2383 The second major part of a RELOAD message is the contents part, which 2384 is defined by MessageContents: 2386 enum { invalidMessageExtensionType(0), 2387 (2^16-1) } MessageExtensionType; 2389 struct { 2390 MessageExtensionType type; 2391 Boolean critical; 2392 opaque extension_contents<0..2^32-1>; 2393 } MessageExtension; 2395 struct { 2396 uint16 message_code; 2397 opaque message_body<0..2^32-1>; 2398 MessageExtension extensions<0..2^32-1>; 2399 } MessageContents; 2401 The contents of this structure are as follows: 2403 message_code 2404 This indicates the message that is being sent. The code space is 2405 broken up as follows. 2407 0 Reserved 2409 1 .. 0x7fff Requests and responses. These code points are always 2410 paired, with requests being odd and the corresponding response 2411 being the request code plus 1. Thus, "probe_request" (the 2412 Probe request) has value 1 and "probe_answer" (the Probe 2413 response) has value 2. 2415 0x8000 .. 0xfffe Reserved 2417 0xffff Error 2419 The message codes are defined in Section 14.8. 2421 message_body 2422 The message body itself, represented as a variable-length string 2423 of bytes. The bytes themselves are dependent on the code value. 2424 See the sections describing the various RELOAD methods (Join, 2425 Update, Attach, Store, Fetch, etc.) for the definitions of the 2426 payload contents. 2428 extensions 2429 Extensions to the message. Currently no extensions are defined, 2430 but new extensions can be defined by the process described in 2431 Section 14.14. 2433 All extensions have the following form: 2435 type 2436 The extension type. 2438 critical 2439 Whether this extension needs to be understood in order to process 2440 the message. If critical = True and the recipient does not 2441 understand the extension, it MUST generate an 2442 Error_Unknown_Extension error. If critical = False, the recipient 2443 MAY choose to process the message even if it does not understand 2444 the extension.
2446 extension_contents 2447 The contents of the extension (extension-dependent). 2449 The subsections in Section 6.4.2, Section 6.5 and Section 7 describe 2450 structures that are inserted inside the message_body member, 2451 depending on the value of the message_code value. For example a 2452 message_code value of join_req means that the structure named JoinReq 2453 is inserted inside message_body. This document does not contain a 2454 mapping between message_code values and structure names as the 2455 conversion between the two is obvious. 2457 Similarly this document uses the name of the structure without the 2458 "Req" or "Ans" suffix to mean the execution of a transaction 2459 comprised of the matching request and answer. For example when the 2460 text says "perform an Attach", it must be understood as performing a 2461 transaction composed of an AttachReq and an AttachAns. 2463 6.3.3.1. Response Codes and Response Errors 2465 A node processing a request MUST return its status in the 2466 message_code field. If the request was a success, then the message 2467 code MUST be set to the response code that matches the request (i.e., 2468 the next code up). The response payload is then as defined in the 2469 request/response descriptions. 2471 If the request has failed, then the message code MUST be set to 2472 0xffff (error) and the payload MUST be an error_response message, as 2473 shown below. 2475 When the message code is 0xffff, the payload MUST be an 2476 ErrorResponse. 2478 public struct { 2479 uint16 error_code; 2480 opaque error_info<0..2^16-1>; 2481 } ErrorResponse; 2483 The contents of this structure are as follows: 2485 error_code 2486 A numeric error code indicating the error that occurred. 2488 error_info 2489 An optional arbitrary byte string. Unless otherwise specified, 2490 this will be a UTF-8 text string providing further information 2491 about what went wrong. 
Developers are encouraged to put enough 2492 diagnostic information in error_info to be useful. The specific 2493 text to be used and any relevant language or encoding thereof is 2494 left to the implementation. 2496 The following error code values are defined. The numeric values for 2497 these are defined in Section 14.9. 2499 Error_Forbidden: The requesting node does not have permission to 2500 make this request. 2502 Error_Not_Found: The resource or node cannot be found or does not 2503 exist. 2505 Error_Request_Timeout: A response to the request has not been 2506 received in a suitable amount of time. The requesting node MAY 2507 resend the request at a later time. 2509 Error_Data_Too_Old: A store cannot be completed because the 2510 storage_time precedes the existing value. 2512 Error_Data_Too_Large: A store cannot be completed because the 2513 requested object exceeds the size limits for that Kind. 2515 Error_Generation_Counter_Too_Low: A store cannot be completed 2516 because the generation counter precedes the existing value. 2518 Error_Incompatible_with_Overlay: A peer receiving the request is 2519 using a different overlay, overlay algorithm, or hash algorithm, 2520 or some other parameter that is inconsistent with the overlay 2521 configuration. 2523 Error_Unsupported_Forwarding_Option: A node received a request 2524 with a forwarding option flagged as critical, but the node does 2525 not support this option. See Section 6.3.2.3. 2527 Error_TTL_Exceeded: A peer received a request whose TTL had been 2528 decremented to zero. See Section 6.3.2. 2530 Error_Message_Too_Large: A peer received a request that was too 2531 large. See Section 6.6. 2533 Error_Response_Too_Large: A node would have generated a response 2534 that is too large per the max_response_length field. 2536 Error_Config_Too_Old: A destination node received a request with a 2537 configuration sequence that's too old. See Section 6.3.2.1.
2539 Error_Config_Too_New: A destination node received a request with a 2540 configuration sequence that's too new. See Section 6.3.2.1. 2542 Error_Unknown_Kind: A destination peer received a request with an 2543 unknown Kind-ID. See Section 7.4.1.2. 2545 Error_In_Progress: An Attach is already in progress to this peer. 2546 See Section 6.5.1.2. 2548 Error_Unknown_Extension: A destination node received a request with 2549 an unknown extension. 2551 Error_Invalid_Message: Something about this message is invalid but 2552 it doesn't fit the other error codes. When this message is sent, 2553 implementations SHOULD provide some meaningful description in 2554 error_info to aid in debugging. 2556 Error_Exp_A: For the purposes of experimentation. Not meant for 2557 vendor specific use of any sort and MUST NOT be used for 2558 operational deployments. 2560 Error_Exp_B: For the purposes of experimentation. Not meant for 2561 vendor specific use of any sort and MUST NOT be used for 2562 operational deployments. 2564 6.3.4. Security Block 2566 The third part of a RELOAD message is the security block. The 2567 security block is represented by a SecurityBlock structure: 2569 struct { 2570 CertificateType type; 2571 opaque certificate<0..2^16-1>; 2572 } GenericCertificate; 2574 struct { 2575 GenericCertificate certificates<0..2^16-1>; 2576 Signature signature; 2577 } SecurityBlock; 2579 The contents of this structure are: 2581 certificates 2582 A bucket of certificates. 2584 signature 2585 A signature. 2587 The certificates bucket SHOULD contain all the certificates necessary 2588 to verify every signature in both the message and the internal 2589 message objects, except for those certificates in a root-cert element 2590 of the current configuration file. This is the only location in the 2591 message which contains certificates, thus allowing for only a single 2592 copy of each certificate to be sent. 
In systems that have an 2593 alternative certificate distribution mechanism, some certificates MAY 2594 be omitted. However, unless an alternative mechanism for immediately 2595 generating certificates, such as shared secret security 2596 (Section 13.4) is used, implementors MUST include all referenced 2597 certificates. 2599 NOTE TO IMPLEMENTERS: This requirement implies that a peer storing 2600 data is obligated to retain certificates for the data it holds. 2602 Each certificate is represented by a GenericCertificate structure, 2603 which has the following contents: 2605 type 2606 The type of the certificate, as defined in [RFC6091]. Only the 2607 use of X.509 certificates is defined in this document. 2609 certificate 2610 The encoded version of the certificate. For X.509 certificates, 2611 it is the DER form. 2613 The signature is computed over the payload and parts of the 2614 forwarding header. In case of a Store the payload MUST contain an 2615 additional signature computed as described in Section 7.1. All 2616 signatures MUST be formatted using the Signature element. This 2617 element is also used in other contexts where signatures are needed. 2618 The input structure to the signature computation MAY vary depending 2619 on the data element being signed. 
2621 enum { invalidSignerIdentityType(0), 2622 cert_hash(1), cert_hash_node_id(2), 2623 none(3), 2624 (255) } SignerIdentityType; 2626 struct { 2627 select (identity_type) { 2629 case cert_hash: 2630 HashAlgorithm hash_alg; // From TLS 2631 opaque certificate_hash<0..2^8-1>; 2633 case cert_hash_node_id: 2634 HashAlgorithm hash_alg; // From TLS 2635 opaque certificate_node_id_hash<0..2^8-1>; 2637 case none: 2638 /* empty */ 2639 /* This structure may be extended with new types if necessary */ 2640 }; 2641 } SignerIdentityValue; 2643 struct { 2644 SignerIdentityType identity_type; 2645 uint16 length; 2646 SignerIdentityValue identity[SignerIdentity.length]; 2647 } SignerIdentity; 2649 struct { 2650 SignatureAndHashAlgorithm algorithm; // From TLS 2651 SignerIdentity identity; 2652 opaque signature_value<0..2^16-1>; 2653 } Signature; 2655 The Signature construct contains the following values: 2657 algorithm 2658 The signature algorithm in use. The algorithm definitions are 2659 found in the IANA TLS SignatureAlgorithm and HashAlgorithm 2660 Registries. All implementations MUST support RSASSA-PKCS1-v1_5 2661 [RFC3447] signatures with SHA-256 hashes. 2663 identity 2664 The identity, as defined in the two paragraphs following this 2665 list, used to form the signature. 2667 signature_value 2668 The value of the signature. 2670 Note that storage operations allow for special values of algorithm 2671 and identity. See Store Request Definition (Section 7.4.1.1) and 2672 Fetch Response Definition (Section 7.4.2.2). 2674 There are two permitted identity formats, one for a certificate with 2675 only one Node-ID and one for a certificate with multiple Node-IDs. 2676 In the first case, the cert_hash type MUST be used. The hash_alg 2677 field is used to indicate the algorithm used to produce the hash. 2678 The certificate_hash contains the hash of the certificate object 2679 (i.e., the DER-encoded certificate). 2681 In the second case, the cert_hash_node_id type MUST be used.
The 2682 hash_alg is as in cert_hash but the certificate_node_id_hash is computed 2683 over the NodeId used to sign concatenated with the certificate. 2684 I.e., H(NodeId || certificate). The NodeId is represented without 2685 any framing or length fields, as simple raw bytes. This is safe 2686 because NodeIds are fixed-length for a given overlay. 2688 For signatures over messages the input to the signature is computed 2689 over: 2691 overlay || transaction_id || MessageContents || SignerIdentity 2693 where overlay and transaction_id come from the forwarding header and 2694 || indicates concatenation. 2696 The input to signatures over data values is different, and is 2697 described in Section 7.1. 2699 All RELOAD messages MUST be signed. Intermediate nodes do not verify 2700 signatures. Upon receipt (and fragment reassembly if needed) the 2701 destination node MUST verify the signature and the authorizing 2702 certificate. If signature verification fails, the implementation SHOULD 2703 simply drop the message and MUST NOT process it. This check provides 2704 a minimal level of assurance that the sending node is a valid part of 2705 the overlay as well as cryptographic authentication of the sending 2706 node. In addition, responses MUST be checked as follows by the 2707 requesting node: 2709 1. The response to a message sent to a Node-ID MUST have been sent 2710 by that Node-ID, unless the request was sent to the wildcard 2711 Node-ID. 2713 2. The response to a message sent to a Resource-ID MUST have been 2714 sent by a Node-ID which is as close to or closer to the target 2715 Resource-ID than any node in the requesting node's Neighbor 2716 Table. 2718 The second condition serves as a primitive check for responses from 2719 wildly wrong nodes but is not a complete check. Note that in periods 2720 of churn, it is possible for the requesting node to obtain a closer 2721 neighbor while the request is outstanding.
This will cause the 2722 response to be rejected and the request to be retransmitted. 2724 In addition, some methods (especially Store) have additional 2725 authentication requirements, which are described in the sections 2726 covering those methods. 2728 6.4. Overlay Topology 2730 As discussed in previous sections RELOAD defines a default overlay 2731 topology (CHORD-RELOAD) but allows for other topologies through the 2732 use of Topology Plugins. This section describes the requirements for 2733 new topology plugins and the methods that RELOAD provides for overlay 2734 topology maintenance. 2736 6.4.1. Topology Plugin Requirements 2738 When specifying a new overlay algorithm, at least the following MUST 2739 be described: 2741 o Joining procedures, including the contents of the Join message. 2743 o Stabilization procedures, including the contents of the Update 2744 message, the frequency of topology probes and keepalives, and the 2745 mechanism used to detect when peers have disconnected. 2747 o Exit procedures, including the contents of the Leave message. 2749 o The length of the Resource-IDs. For DHTs, the hash algorithm to 2750 compute the hash of an identifier. 2752 o The procedures that peers use to route messages. 2754 o The replication strategy used to ensure data redundancy. 2756 All overlay algorithms MUST specify maintenance procedures that send 2757 Updates to clients and peers that have established connections to the 2758 peer responsible for a particular ID when the responsibility for that 2759 ID changes. Because tracking this information is difficult, overlay 2760 algorithms MAY simply specify that an Update is sent to all members 2761 of the Connection Table whenever the range of IDs for which the peer 2762 is responsible changes. 2764 6.4.2. Methods and types for use by topology plugins 2766 This section describes the methods that topology plugins use to join, 2767 leave, and maintain the overlay. 2769 6.4.2.1. 
Join 2771 A new peer (but one that already has credentials) uses the JoinReq 2772 message to join the overlay. The JoinReq is sent to the responsible 2773 peer depending on the routing mechanism described in the topology 2774 plugin. This notifies the responsible peer that the new peer is 2775 taking over some of the overlay and it needs to synchronize its 2776 state. 2778 struct { 2779 NodeId joining_peer_id; 2780 opaque overlay_specific_data<0..2^16-1>; 2781 } JoinReq; 2783 The minimal JoinReq contains only the Node-ID which the sending peer 2784 wishes to assume. Overlay algorithms MAY specify other data to 2785 appear in this request. Receivers of the JoinReq MUST verify that 2786 the joining_peer_id field matches the Node-ID used to sign the 2787 message and if not MUST reject the message with an Error_Forbidden 2788 error. 2790 Because joins may only be executed between nodes which are directly 2791 adjacent, receiving peers MUST verify that any JoinReq they receive 2792 arrives from a transport channel that is bound to the Node-ID to be 2793 assumed by the joining node. This also prevents replay attacks 2794 provided that DTLS anti-replay is used. 2796 If the request succeeds, the responding peer responds with a JoinAns 2797 message, as defined below: 2799 struct { 2800 opaque overlay_specific_data<0..2^16-1>; 2801 } JoinAns; 2803 If the request succeeds, the responding peer MUST follow up by 2804 executing the right sequence of Stores and Updates to transfer the 2805 appropriate section of the overlay space to the joining node. In 2806 addition, overlay algorithms MAY define data to appear in the 2807 response payload that provides additional info. 2809 Joining nodes MUST verify that the signature on the JoinAns message 2810 matches the expected target (i.e., the adjacency over which they are 2811 joining.) If not, they MUST discard the message. 2813 In general, nodes which cannot form connections SHOULD report an 2814 error to the user. 
However, implementations MUST provide some 2815 mechanism whereby nodes can determine that they are potentially the 2816 first node and take responsibility for the overlay (the idea is to 2817 avoid having ordinary nodes try to become responsible for the entire 2818 overlay during a partition.) This specification does not mandate any 2819 particular mechanism, but a configuration flag or setting seems 2820 appropriate. 2822 6.4.2.2. Leave 2824 The LeaveReq message is used to indicate that a node is exiting the 2825 overlay. A node SHOULD send this message to each peer with which it 2826 is directly connected prior to exiting the overlay. 2828 struct { 2829 NodeId leaving_peer_id; 2830 opaque overlay_specific_data<0..2^16-1>; 2831 } LeaveReq; 2833 LeaveReq contains only the Node-ID of the leaving peer. Overlay 2834 algorithms MAY specify other data to appear in this request. 2835 Receivers of the LeaveReq MUST verify that the leaving_peer_id field 2836 matches the Node-ID used to sign the message and if not MUST reject 2837 the message with an Error_Forbidden error. 2839 Because leaves may only be executed between nodes which are directly 2840 adjacent, receiving peers MUST verify that any LeaveReq they receive 2841 arrives from a transport channel that is bound to the Node-ID to be 2842 assumed by the leaving peer. This also prevents replay attacks 2843 provided that DTLS anti-replay is used. 2845 Upon receiving a Leave request, a peer MUST update its own Routing 2846 Table, and send the appropriate Store/Update sequences to re- 2847 stabilize the overlay. 2849 6.4.2.3. Update 2851 Update is the primary overlay-specific maintenance message. It is 2852 used by the sender to notify the recipient of the sender's view of 2853 the current state of the overlay (its routing state), and it is up to 2854 the recipient to take whatever actions are appropriate to deal with 2855 the state change. 
In general, peers send Update messages to all 2856 their adjacencies whenever they detect a topology shift. 2858 When a peer receives an Attach request with the send_update flag set 2859 to True (Section 6.4.2.4.1), it MUST send an Update message back to 2860 the sender of the Attach request after the completion of the 2861 corresponding ICE check and TLS connection. Note that the sender of 2862 such an Attach request may not have joined the overlay yet. 2864 When a peer detects through an Update that it is no longer 2865 responsible for any data value it is storing, it MUST attempt to 2866 Store a copy to the correct node unless it knows the newly 2867 responsible node already has a copy of the data. This prevents data 2868 loss during large-scale topology shifts such as the merging of 2869 partitioned overlays. 2871 The contents of the UpdateReq message are completely overlay- 2872 specific. The UpdateAns response is expected to be either success or 2873 an error. 2875 6.4.2.4. RouteQuery 2877 The RouteQuery request allows the sender to ask a peer where it 2878 would route a message directed to a given destination. In other 2879 words, a RouteQuery for a destination X requests the Node-ID for the 2880 node that the receiving peer would next route to in order to get to 2881 X. A RouteQuery can also request that the receiving peer initiate an 2882 Update request to transfer the receiving peer's Routing Table. 2884 One important use of the RouteQuery request is to support iterative 2885 routing. The sender selects one of the peers in its Routing Table 2886 and sends it a RouteQuery message with the destination field set to 2887 the Node-ID or Resource-ID it wishes to route to. The receiving peer 2888 responds with information about the peers to which the request would 2889 be routed. The sending peer MAY then use the Attach method to attach 2890 to those peer(s), and repeat the RouteQuery.
Eventually, the sender 2891 gets a response from a peer that is closest to the identifier in the 2892 destination field as determined by the topology plugin. At that 2893 point, the sender can send messages directly to that peer. 2895 6.4.2.4.1. Request Definition 2897 A RouteQueryReq message indicates the peer or resource that the 2898 requesting node is interested in. It also contains a "send_update" 2899 option allowing the requesting node to request a full copy of the 2900 other peer's Routing Table. 2902 struct { 2903 Boolean send_update; 2904 Destination destination; 2905 opaque overlay_specific_data<0..2^16-1>; 2906 } RouteQueryReq; 2908 The contents of the RouteQueryReq message are as follows: 2910 send_update 2911 A single byte. This may be set to True to indicate that the 2912 requester wishes the responder to initiate an Update request 2913 immediately. Otherwise, this value MUST be set to False. 2915 destination 2916 The destination which the requester is interested in. This may be 2917 any valid destination object, including a Node-ID, opaque ID, or 2918 Resource-ID. 2920 overlay_specific_data 2921 Other data as appropriate for the overlay. 2923 6.4.2.4.2. Response Definition 2925 A response to a successful RouteQueryReq request is a RouteQueryAns 2926 message. This is completely overlay specific. 2928 6.4.2.5. Probe 2930 Probe provides primitive "exploration" services: it allows a node to 2931 determine which resources another node is responsible for. A probe 2932 can be addressed to a specific Node-ID, or the peer controlling a 2933 given location (by using a Resource-ID). In either case, the target 2934 node responds with a simple response containing some status 2935 information. 2937 6.4.2.5.1. Request Definition 2939 The ProbeReq message contains a list (potentially empty) of the 2940 pieces of status information that the requester would like the 2941 responder to provide. 
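The Probe exchange in this section can be sketched as follows. This is a minimal illustration, assuming the ProbeInformationType code points and the ProbeInformation layout given in Sections 6.4.2.5.1 and 6.4.2.5.2 (one-byte type, one-byte length, 32-bit unsigned value) and standard TLS presentation-language length prefixes. The function names and the `known` mapping are hypothetical.

```python
import struct

# ProbeInformationType code points from the enum in this section.
RESPONSIBLE_SET, NUM_RESOURCES, UPTIME = 1, 2, 3

def encode_probe_req(requested) -> bytes:
    """Encode ProbeReq: requested_info<0..2^8-1>, one byte per type."""
    if len(requested) > 255:
        raise ValueError("too many requested information types")
    return bytes([len(requested)]) + bytes(requested)

def encode_probe_ans(requested, known) -> bytes:
    """Encode ProbeAns for the requested types.

    `known` maps each type the responder recognizes to its 32-bit value.
    Values are returned in the requested order; types that were not
    requested or are not recognized are left out.
    """
    body = b""
    for info_type in requested:
        if info_type in known:
            # type (1 byte), length of value (1 byte), uint32 value
            body += struct.pack("!BBI", info_type, 4, known[info_type])
    return struct.pack("!H", len(body)) + body
```

For example, a responder that has been up for 60 seconds and is responsible for 500 parts per billion of the overlay would answer a request for [uptime, responsible_set] with two six-byte ProbeInformation entries in that order.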
2943 enum { invalidProbeInformationType(0), responsible_set(1), 2944 num_resources(2), uptime(3), (255) } 2945 ProbeInformationType; 2947 struct { 2948 ProbeInformationType requested_info<0..2^8-1>; 2949 } ProbeReq; 2951 The currently defined values for ProbeInformationType are: 2953 responsible_set 2954 indicates that the peer should Respond with the fraction of the 2955 overlay for which the responding peer is responsible. 2957 num_resources 2958 indicates that the peer should Respond with the number of 2959 resources currently being stored by the peer. 2961 uptime 2962 indicates that the peer should Respond with how long the peer has 2963 been up in seconds. 2965 6.4.2.5.2. Response Definition 2967 A successful ProbeAns response contains the information elements 2968 requested by the peer. 2970 struct { 2971 select (type) { 2972 case responsible_set: 2973 uint32 responsible_ppb; 2975 case num_resources: 2976 uint32 num_resources; 2978 case uptime: 2979 uint32 uptime; 2981 /* This type may be extended */ 2982 }; 2983 } ProbeInformationData; 2985 struct { 2986 ProbeInformationType type; 2987 uint8 length; 2988 ProbeInformationData value; 2989 } ProbeInformation; 2991 struct { 2992 ProbeInformation probe_info<0..2^16-1>; 2993 } ProbeAns; 2995 A ProbeAns message contains a sequence of ProbeInformation 2996 structures. Each has a "length" indicating the length of the 2997 following value field. This structure allows for unknown option 2998 types. 3000 Each of the current possible Probe information types is a 32-bit 3001 unsigned integer. For type "responsible_ppb", it is the fraction of 3002 the overlay for which the peer is responsible in parts per billion. 3003 For type "num_resources", it is the number of resources the peer is 3004 storing. For the type "uptime" it is the number of seconds the peer 3005 has been up. 3007 The responding peer SHOULD include any values that the requesting 3008 node requested and that it recognizes. 
They SHOULD be returned in 3009 the requested order. Any other values MUST NOT be returned. 3011 6.5. Forwarding and Link Management Layer 3013 Each node maintains connections to a set of other nodes defined by 3014 the topology plugin. This section defines the methods RELOAD uses to 3015 form and maintain connections between nodes in the overlay. Three 3016 methods are defined: 3018 Attach: used to form RELOAD connections between nodes using ICE 3019 for NAT traversal. When node A wants to connect to node B, it 3020 sends an Attach message to node B through the overlay. The Attach 3021 contains A's ICE parameters. B responds with its ICE parameters 3022 and the two nodes perform ICE to form connection. Attach also 3023 allows two nodes to connect via No-ICE instead of full ICE. 3025 AppAttach: used to form application layer connections between 3026 nodes. 3028 Ping: is a simple request/response which is used to verify 3029 connectivity of the target peer. 3031 6.5.1. Attach 3033 A node sends an Attach request when it wishes to establish a direct 3034 Overlay Link connection to another node for the purpose of sending 3035 RELOAD messages. A client that can establish a connection directly 3036 need not send an Attach as described in the second bullet of 3037 Section 3.2.1 3039 As described in Section 6.1, an Attach may be routed to either a 3040 Node-ID or to a Resource-ID. An Attach routed to a specific Node-ID 3041 will fail if that node is not reached. An Attach routed to a 3042 Resource-ID will establish a connection with the peer currently 3043 responsible for that Resource-ID, which may be useful in establishing 3044 a direct connection to the responsible peer for use with frequent or 3045 large resource updates. 3047 An Attach in and of itself does not result in updating the Routing 3048 Table of either node. That function is performed by Updates. 
If 3049 node A has Attached to node B, but has not received any Updates from B, 3050 it MAY route messages which are directly addressed to B through that 3051 channel but MUST NOT route messages through B to other peers via that 3052 channel. The process of Attaching is separate from the process of 3053 becoming a peer (using Join and Update), to prevent half-open states 3054 where a node has started to form connections but is not really ready 3055 to act as a peer. Thus, clients (unlike peers) can simply Attach 3056 without sending Join or Update. 3058 6.5.1.1. Request Definition 3060 An Attach request message contains the requesting node's ICE connection 3061 parameters formatted into a binary structure. 3063 enum { invalidOverlayLinkType(0), DTLS-UDP-SR(1), 3064 DTLS-UDP-SR-NO-ICE(3), TLS-TCP-FH-NO-ICE(4), 3065 (255) } OverlayLinkType; 3067 enum { invalidCandType(0), 3068 host(1), srflx(2), prflx(3), relay(4), 3069 (255) } CandType; 3071 struct { 3072 opaque name<0..2^16-1>; 3073 opaque value<0..2^16-1>; 3074 } IceExtension; 3076 struct { 3077 IpAddressPort addr_port; 3078 OverlayLinkType overlay_link; 3079 opaque foundation<0..255>; 3080 uint32 priority; 3081 CandType type; 3082 select (type) { 3083 case host: 3084 ; /* Empty */ 3085 case srflx: 3086 case prflx: 3087 case relay: 3088 IpAddressPort rel_addr_port; 3089 }; 3090 IceExtension extensions<0..2^16-1>; 3091 } IceCandidate; 3093 struct { 3094 opaque ufrag<0..2^8-1>; 3095 opaque password<0..2^8-1>; 3096 opaque role<0..2^8-1>; 3097 IceCandidate candidates<0..2^16-1>; 3098 Boolean send_update; 3099 } AttachReqAns; 3101 The values contained in AttachReqAns are: 3103 ufrag 3104 The username fragment (from ICE). 3106 password 3107 The ICE password. 3109 role 3110 An active/passive/actpass attribute from RFC 4145 [RFC4145]. This 3111 value MUST be 'passive' for the offerer (the peer sending the 3112 Attach request) and 'active' for the answerer (the peer sending 3113 the Attach response).
3115 candidates 3116 One or more ICE candidate values, as described below. 3118 send_update 3119 Has the same meaning as the send_update field in RouteQueryReq. 3121 Each ICE candidate is represented as an IceCandidate structure, which 3122 is a direct translation of the information from the ICE string 3123 structures, with the exception of the component ID. Since there is 3124 only one component, it is always 1, and thus left out of the 3125 structure. The remaining values are specified as follows: 3127 addr_port 3128 corresponds to the ICE connection-address and port productions. 3130 overlay_link 3131 corresponds to the ICE transport production, Overlay Link 3132 protocols used with No-ICE MUST specify "No-ICE" in their 3133 description. Future overlay link values can be added by defining 3134 new OverlayLinkType values in the IANA registry in Section 14.10. 3135 Future extensions to the encapsulation or framing, that provide 3136 for backward compatibility with the previously specified 3137 encapsulation or framing, values MUST use that same 3138 OverlayLinkType value that was previously defined. 3140 OverlayLinkType protocols are defined in Section 6.6 3142 A single AttachReqAns MUST NOT include both candidates whose 3143 OverlayLinkType protocols use ICE (the default) and candidates 3144 that specify "No-ICE". 3146 foundation 3147 corresponds to the ICE foundation production. 3149 priority 3150 corresponds to the ICE priority production. 3152 type 3153 corresponds to the ICE cand-type production. 3155 rel_addr_port 3156 corresponds to the ICE rel-addr and rel-port productions. Only 3157 present for types "relay", "srflx" and "prflx". 3159 extensions 3160 ICE extensions. The name and value fields correspond to binary 3161 translations of the equivalent fields in the ICE extensions. 3163 These values should be generated using the procedures described in 3164 Section 6.5.1.3. 3166 6.5.1.2. 
Response Definition 3168 If a peer receives an Attach request, it MUST determine how to 3169 process the request as follows: 3171 o If it has not initiated an Attach request to the originating peer 3172 of this Attach request, it MUST process this request and SHOULD 3173 generate its own response with an AttachReqAns. It should then 3174 begin ICE checks. 3176 o If it has already sent an Attach request to and received the 3177 response from the originating peer of this Attach request, and as 3178 a result, an ICE check and TLS connection is in progress, then it 3179 SHOULD generate an Error_In_Progress error instead of an 3180 AttachReqAns. 3182 o If it has already sent an Attach request to but not yet received 3183 the response from the originating peer of this Attach request, it 3184 SHOULD apply the following tie-breaker heuristic to determine how 3185 to handle this Attach request and the incomplete Attach request it 3186 has sent out: 3188 * If the peer's own Node-ID is smaller when compared as big- 3189 endian unsigned integers, it MUST cancel retransmission of its 3190 own incomplete Attach request. It MUST then process this 3191 Attach request, generate an AttachReqAns response, and proceed 3192 with the corresponding ICE check. 3194 * If the peer's own Node-ID is larger when compared as big-endian 3195 unsigned integers, it MUST generate an Error_In_Progress error 3196 to this Attach request, then proceed to wait for and complete 3197 the Attach and the corresponding ICE check it has originated. 3199 o If the peer is overloaded or detects some other kind of error, it 3200 MAY generate an error instead of an AttachReqAns. 3202 When a peer receives an Attach response, it SHOULD parse the response 3203 and begin its own ICE checks. 3205 6.5.1.3. Using ICE With RELOAD 3207 This section describes the profile of ICE that is used with RELOAD. 3208 RELOAD implementations MUST implement full ICE. 
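Before turning to the ICE profile in detail, the Attach request handling rules of Section 6.5.1.2 can be summarized as a small decision function. This is a sketch only: the names are hypothetical, the overload/error case is left out, and the treatment of equal Node-IDs (which should not occur between distinct nodes) is an assumption.

```python
from enum import Enum

class AttachAction(Enum):
    ANSWER = "answer"                                # send AttachReqAns, begin ICE
    ERROR_IN_PROGRESS = "error_in_progress"          # reply Error_In_Progress
    CANCEL_OWN_AND_ANSWER = "cancel_own_and_answer"  # stop own Attach, then answer

def resolve_attach(own_id: bytes, remote_id: bytes,
                   sent_request: bool, got_response: bool) -> AttachAction:
    """Decide how to handle an incoming Attach request from remote_id."""
    if not sent_request:
        # No Attach of our own towards this peer: just answer.
        return AttachAction.ANSWER
    if got_response:
        # Our own Attach was answered, so ICE and TLS are already under way.
        return AttachAction.ERROR_IN_PROGRESS
    # Both sides have an Attach outstanding: compare Node-IDs as
    # big-endian unsigned integers; the smaller Node-ID yields.
    if int.from_bytes(own_id, "big") < int.from_bytes(remote_id, "big"):
        return AttachAction.CANCEL_OWN_AND_ANSWER
    # Larger Node-ID (or, by assumption, equal) keeps its own Attach.
    return AttachAction.ERROR_IN_PROGRESS
```

Because both peers apply the same comparison, exactly one of the two crossing Attach requests survives.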
3210 In ICE as defined by [RFC5245], SDP is used to carry the ICE 3211 parameters. In RELOAD, this function is performed by a binary 3212 encoding in the Attach method. This encoding is more restricted than 3213 the SDP encoding because the RELOAD environment is simpler: 3215 o Only a single media stream is supported. 3217 o In this case, the "stream" refers not to RTP or other types of 3218 media, but rather to a connection for RELOAD itself or other 3219 application-layer protocols such as SIP. 3221 o RELOAD only allows for a single offer/answer exchange. Unlike the 3222 usage of ICE within SIP, there is never a need to send a 3223 subsequent offer to update the default candidates to match the 3224 ones selected by ICE. 3226 An agent follows the ICE specification as described in [RFC5245] with 3227 the changes and additional procedures described in the subsections 3228 below. 3230 6.5.1.4. Collecting STUN Servers 3232 ICE relies on the node having one or more STUN servers to use. In 3233 conventional ICE, it is assumed that nodes are configured with one or 3234 more STUN servers through some out-of-band mechanism. This is still 3235 possible in RELOAD but RELOAD also learns STUN servers as it connects 3236 to other peers. Because all RELOAD peers implement ICE and use STUN 3237 keepalives, every peer is capable of responding to STUN Binding 3238 requests [RFC5389]. Accordingly, any peer that a node knows about 3239 can be used like a STUN server -- though of course it may be behind a 3240 NAT. 3242 A peer on a well-provisioned wide-area overlay will be configured 3243 with one or more bootstrap nodes. These nodes form the initial list 3244 of STUN servers. However, as the peer forms connections with 3245 additional peers, it learns of more peers it can use like STUN servers. 3247 Because complicated NAT topologies are possible, a peer may need more 3248 than one STUN server.
Specifically, a peer that is behind a single 3249 NAT will typically observe only two IP addresses in its STUN checks: 3250 its local address and its server reflexive address from a STUN server 3251 outside its NAT. However, if there are more NATs involved, it may 3252 learn additional server reflexive addresses (which vary based on 3253 where in the topology the STUN server is). To maximize the chance of 3254 achieving a direct connection, a peer SHOULD group other peers by the 3255 peer-reflexive addresses it discovers through them. It SHOULD then 3256 select one peer from each group to use as a STUN server for future 3257 connections. 3259 Only peers to which the peer currently has connections may be used. 3260 If the connection to such a peer is lost, it MUST be removed from the 3261 list of STUN servers and a new server from the same group MUST be 3262 selected unless there are no other servers in the group, in which 3263 case some other peer MAY be used. 3265 6.5.1.5. Gathering Candidates 3267 When a node wishes to establish a connection for the purposes of 3268 RELOAD signaling or application signaling, it follows the process of 3269 gathering candidates as described in Section 4 of ICE [RFC5245]. 3270 RELOAD utilizes a single component. Consequently, gathering for 3271 these "streams" requires a single component. In the case where a 3272 node has not yet found a TURN server, the agent would not include a 3273 relayed candidate. 3275 The ICE specification assumes that an ICE agent is configured with, 3276 or somehow knows of, TURN and STUN servers. RELOAD provides a way 3277 for an agent to learn these by querying the overlay, as described in 3278 Section 6.5.1.4 and Section 9. 3280 The default candidate selection described in Section 4.1.4 of ICE is 3281 ignored; defaults are not signaled or utilized by RELOAD.
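The STUN server bookkeeping of Section 6.5.1.4, which supplies the servers used during candidate gathering, might be organized as in the following sketch. It assumes the node records, for each connected peer, the reflexive address learned through that peer. All names and the dictionary-based data model are hypothetical.

```python
from collections import defaultdict

def group_by_reflexive(observations: dict) -> dict:
    """Group connected peers by the reflexive address seen through them.

    `observations` maps a peer identifier to the address that peer
    reported back in a STUN Binding response.
    """
    groups = defaultdict(list)
    for peer, address in observations.items():
        groups[address].append(peer)
    return dict(groups)

def pick_stun_servers(groups: dict) -> dict:
    """Select one peer from each non-empty group to act as a STUN server."""
    return {addr: peers[0] for addr, peers in groups.items() if peers}

def drop_peer(groups: dict, servers: dict, lost) -> None:
    """Handle a lost connection: remove the peer and, if it was serving a
    group, select a replacement from the same group when one exists."""
    for addr, peers in groups.items():
        if lost in peers:
            peers.remove(lost)
        if servers.get(addr) == lost:
            if peers:
                servers[addr] = peers[0]
            else:
                # No other server in this group; the caller MAY pick
                # some other connected peer instead.
                del servers[addr]
```

Choosing the first peer of each group is an arbitrary policy for the sketch; the text above does not say which member of a group to prefer.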
3283 An alternative to using the full ICE supported by the Attach request 3284 is to use the No-ICE mechanism by providing candidates with "No-ICE" 3285 Overlay Link protocols. Configuration for the overlay indicates 3286 whether or not these Overlay Link protocols can be used. An overlay 3287 MUST be either all ICE or all No-ICE. 3289 No-ICE will not work in all of the scenarios where ICE would work, 3290 but in some cases, particularly those with no NATs or firewalls, it 3291 will work. 3293 6.5.1.6. Prioritizing Candidates 3295 Standardization of additional protocols for use with ICE is 3296 expected, including TCP [RFC6544] and protocols such as SCTP 3297 [RFC4960] and DCCP [RFC4340]. UDP encapsulations for SCTP and DCCP 3298 would expand the Overlay Link protocols available for 3299 RELOAD. When additional protocols are available, the following 3300 prioritization is RECOMMENDED: 3302 o Highest priority is assigned to protocols that offer well- 3303 understood congestion and flow control without head of line 3304 blocking. For example, SCTP without message ordering, DCCP, or 3305 those protocols encapsulated using UDP. 3307 o Second highest priority is assigned to protocols that offer well- 3308 understood congestion and flow control but have head of line 3309 blocking such as TCP. 3311 o Lowest priority is assigned to protocols encapsulated over UDP 3312 that do not implement well-established congestion control 3313 algorithms. The DTLS/UDP with SR overlay link protocol is an 3314 example of such a protocol. 3316 Head of line blocking is undesirable in an Overlay Link protocol 3317 because the messages carried on a RELOAD link are independent, rather 3318 than stream-oriented. Therefore, if message N on a link is lost, 3319 delaying message N+1 on that same link until N is successfully 3320 retransmitted does nothing other than increase the latency for the 3321 transaction of message N+1 as they are unrelated to each other.
3322 Therefore, while the high quality, performance, and availability of 3323 modern TCP implementations makes them very attractive, their 3324 performance as an Overlay Link protocol is not optimal. 3326 Note that none of the protocols defined in this document meets these 3327 conditions, but it is expected that new Overlay link protocols 3328 defined in the future will fill this gap. 3330 6.5.1.7. Encoding the Attach Message 3332 Section 4.3 of ICE describes procedures for encoding the SDP for 3333 conveying RELOAD candidates. Instead of actually encoding an SDP 3334 message, the candidate information (IP address and port and transport 3335 protocol, priority, foundation, type and related address) is carried 3336 within the attributes of the Attach request or its response. 3337 Similarly, the username fragment and password are carried in the 3338 Attach message or its response. Section 6.5.1 describes the detailed 3339 attribute encoding for Attach. The Attach request and its response 3340 do not contain any default candidates or the ice-lite attribute, as 3341 these features of ICE are not used by RELOAD. 3343 Since the Attach request contains the candidate information and short 3344 term credentials, it is considered as an offer for a single media 3345 stream that happens to be encoded in a format different than SDP, but 3346 is otherwise considered a valid offer for the purposes of following 3347 the ICE specification. Similarly, the Attach response is considered 3348 a valid answer for the purposes of following the ICE specification. 3350 6.5.1.8. Verifying ICE Support 3352 An agent MUST skip the verification procedures in Section 5.1 and 6.1 3353 of ICE. Since RELOAD requires full ICE from all agents, this check 3354 is not required. 3356 6.5.1.9. Role Determination 3358 The roles of controlling and controlled as described in Section 5.2 3359 of ICE are still utilized with RELOAD. 
However, the offerer (the 3360 entity sending the Attach request) will always be controlling, and 3361 the answerer (the entity sending the Attach response) will always be 3362 controlled. The connectivity checks MUST still contain the ICE- 3363 CONTROLLED and ICE-CONTROLLING attributes, however, even though the 3364 role reversal capability for which they are defined will never be 3365 needed with RELOAD. This is to allow for a common codebase between 3366 ICE for RELOAD and ICE for SDP. 3368 6.5.1.10. Full ICE 3370 When the overlay uses ICE, connectivity checks and nominations are 3371 used as in regular ICE. 3373 6.5.1.10.1. Connectivity Checks 3375 The processes of forming check lists in Section 5.7 of ICE, 3376 scheduling checks in Section 5.8, and checking connectivity checks in 3377 Section 7 are used with RELOAD without change. 3379 6.5.1.10.2. Concluding ICE 3381 The procedures in Section 8 of ICE are followed to conclude ICE, with 3382 the following exceptions: 3384 o The controlling agent MUST NOT attempt to send an updated offer 3385 once the state of its single media stream reaches Completed. 3387 o Once the state of ICE reaches Completed, the agent can immediately 3388 free all unused candidates. This is because RELOAD does not have 3389 the concept of forking, and thus the three second delay in Section 3390 8.3 of ICE does not apply. 3392 6.5.1.10.3. Media Keepalives 3394 STUN MUST be utilized for the keepalives described in Section 10 of 3395 ICE. 3397 6.5.1.11. No-ICE 3399 No-ICE is selected when either side has provided "no ICE" Overlay 3400 Link candidates. STUN is not used for connectivity checks when doing 3401 No-ICE; instead the DTLS or TLS handshake (or similar security layer 3402 of future overlay link protocols) forms the connectivity check. The 3403 certificate exchanged during the (D)TLS handshake MUST match the node 3404 that sent the AttachReqAns and if it does not, the connection MUST be 3405 closed. 3407 6.5.1.12. 
Subsequent Offers and Answers 3409 An agent MUST NOT send a subsequent offer or answer. Thus, the 3410 procedures in Section 9 of ICE MUST be ignored. 3412 6.5.1.13. Sending Media 3414 The procedures of Section 11 of ICE apply to RELOAD as well. 3415 However, in this case, the "media" takes the form of application 3416 layer protocols (e.g., RELOAD) over TLS or DTLS. Consequently, once 3417 ICE processing completes, the agent will begin TLS or DTLS procedures 3418 to establish a secure connection. The node which sent the Attach 3419 request MUST be the TLS server. The other node MUST be the TLS 3420 client. The server MUST request TLS client authentication. The 3421 nodes MUST verify that the certificate presented in the handshake 3422 matches the identity of the other peer as found in the Attach 3423 message. Once the TLS or DTLS signaling is complete, the application 3424 protocol is free to use the connection. 3426 The concept of a previous selected pair for a component does not 3427 apply to RELOAD, since ICE restarts are not possible with RELOAD. 3429 6.5.1.14. Receiving Media 3431 An agent MUST be prepared to receive packets for the application 3432 protocol (TLS or DTLS carrying RELOAD) at any time. The jitter and 3433 RTP considerations in Section 11 of ICE do not apply to RELOAD. 3435 6.5.2. AppAttach 3437 A node sends an AppAttach request when it wishes to establish a 3438 direct connection to another node for the purposes of sending 3439 application layer messages. AppAttach is nearly identical to Attach, 3440 except for the purpose of the connection: it is used to transport 3441 non-RELOAD "media". A separate request is used to avoid implementor 3442 confusion between the two methods (this was found to be a real 3443 problem with initial implementations). The AppAttach request and its 3444 response contain an application attribute, which indicates what 3445 protocol is to be run over the connection. 3447 6.5.2.1. 
Request Definition 3449 An AppAttachReq message contains the requesting node's ICE connection 3450 parameters formatted into a binary structure. 3452 struct { 3453 opaque ufrag<0..2^8-1>; 3454 opaque password<0..2^8-1>; 3455 uint16 application; 3456 opaque role<0..2^8-1>; 3457 IceCandidate candidates<0..2^16-1>; 3458 } AppAttachReq; 3460 The values contained in AppAttachReq and AppAttachAns are: 3462 ufrag 3463 The username fragment (from ICE). 3465 password 3466 The ICE password. 3468 application 3469 A 16-bit application-id as defined in Section 14.5. This 3470 number represents the IANA-registered application that is going to 3471 send data on this connection. 3473 role 3474 An active/passive/actpass attribute from RFC 4145 [RFC4145]. 3476 candidates 3477 One or more ICE candidate values. 3479 The application using the connection set up with this request is 3480 responsible for providing traffic frequent enough to keep NAT 3481 and firewall bindings alive and for deciding when to close the 3482 connection. 3484 6.5.2.2. Response Definition 3486 If a peer receives an AppAttach request, it SHOULD process the 3487 request and generate its own response with an AppAttachAns. It should 3488 then begin ICE checks. When a peer receives an AppAttach response, 3489 it SHOULD parse the response and begin its own ICE checks. If the 3490 application ID is not supported, the peer MUST reply with an 3491 Error_Not_Found error. 3493 struct { 3494 opaque ufrag<0..2^8-1>; 3495 opaque password<0..2^8-1>; 3496 uint16 application; 3497 opaque role<0..2^8-1>; 3498 IceCandidate candidates<0..2^16-1>; 3499 } AppAttachAns; 3501 The meaning of the fields is the same as in the AppAttachReq. 3503 6.5.3. Ping 3505 Ping is used to test connectivity along a path. A ping can be 3506 addressed to a specific Node-ID, to the peer controlling a given 3507 location (by using a Resource-ID), or to the wildcard Node-ID. 3509 6.5.3.1.
Request Definition 3511 struct { 3512 opaque<0..2^16-1> padding; 3513 } PingReq; 3515 The Ping request is empty of meaningful contents. However, it may 3516 contain up to 65535 bytes of padding to facilitate the discovery of 3517 overlay maximum packet sizes. 3519 6.5.3.2. Response Definition 3521 A successful PingAns response contains the information elements 3522 requested by the peer. 3524 struct { 3525 uint64 response_id; 3526 uint64 time; 3527 } PingAns; 3529 A PingAns message contains the following elements: 3531 response_id 3532 A randomly generated 64-bit response ID. This is used to 3533 distinguish Ping responses. 3535 time 3536 The time when the Ping response was created represented in the 3537 same way as storage_time defined in Section 7. 3539 6.5.4. ConfigUpdate 3541 The ConfigUpdate method is used to push updated configuration data 3542 across the overlay. Whenever a node detects that another node has 3543 old configuration data, it MUST generate a ConfigUpdate request. The 3544 ConfigUpdate request allows updating of two kinds of data: the 3545 configuration data (Section 6.3.2.1) and the Kind information 3546 (Section 7.4.1.1). 3548 6.5.4.1. Request Definition 3550 enum { invalidConfigUpdateType(0), config(1), kind(2), (255) } 3551 ConfigUpdateType; 3553 typedef uint32 KindId; 3554 typedef opaque KindDescription<0..2^16-1>; 3556 struct { 3557 ConfigUpdateType type; 3558 uint32 length; 3560 select (type) { 3561 case config: 3562 opaque config_data<0..2^24-1>; 3564 case kind: 3565 KindDescription kinds<0..2^24-1>; 3567 /* This structure may be extended with new types*/ 3568 }; 3569 } ConfigUpdateReq; 3571 The ConfigUpdateReq message contains the following elements: 3573 type 3574 The type of the contents of the message. This structure allows 3575 for unknown content types. 3577 length 3578 The length of the remainder of the message. 
This is included to 3579 preserve backward compatibility and is 32 bits instead of 24 to 3580 facilitate easy conversion between network and host byte order. 3582 config_data (type==config) 3583 The contents of the configuration document. 3585 kinds (type==kind) 3586 One or more XML kind-block productions (see Section 11.1). These 3587 MUST be encoded with UTF-8 and assume a default namespace of 3588 "urn:ietf:params:xml:ns:p2p:config-base". 3590 6.5.4.2. Response Definition 3592 struct { 3593 } ConfigUpdateAns; 3595 If the ConfigUpdateReq is of type "config" it MUST only be processed 3596 if all the following are true: 3598 o The sequence number in the document is greater than the current 3599 configuration sequence number. 3601 o The configuration document is correctly digitally signed (see 3602 Section 11 for details on signatures.) 3604 Otherwise appropriate errors MUST be generated. 3606 If the ConfigUpdateReq is of type "kind" it MUST only be processed if 3607 it is correctly digitally signed by an acceptable Kind signer (i.e., 3608 one listed in the current configuration file). Details on kind- 3609 signer field in the configuration file are described in Section 11.1. 3610 In addition, if the Kind update conflicts with an existing known Kind 3611 (i.e., it is signed by a different signer), then it should be 3612 rejected with "Error_Forbidden". This should not happen in correctly 3613 functioning overlays. 3615 If the update is acceptable, then the node MUST reconfigure itself to 3616 match the new information. This may include adding permissions for 3617 new Kinds, deleting old Kinds, or even, in extreme circumstances, 3618 exiting and reentering the overlay, if, for instance, the DHT 3619 algorithm has changed. 3621 If an implementation misses enough ConfigUpdates which include key 3622 changes, it is possible that it will no longer be able to verify new 3623 valid ConfigUpdates. 
In that case, the only available recovery 3624 mechanism is to attempt to retrieve a new configuration document, 3625 typically by the mechanisms it would use for initial bootstrapping. 3626 It is up to implementors to decide whether and how to employ this sort 3627 of recovery mechanism. 3629 The response for ConfigUpdate is empty. 3631 6.6. Overlay Link Layer 3633 RELOAD can use multiple Overlay Link protocols to send its messages. 3634 Because ICE is used to establish connections (see Section 6.5.1.3), 3635 RELOAD nodes are able to detect which Overlay Link protocols are 3636 offered by other nodes and establish connections between them. Any 3637 link protocol needs to be able to establish a secure, authenticated 3638 connection and to provide data origin authentication and message 3639 integrity for individual data elements. RELOAD currently supports 3640 three Overlay Link protocols: 3642 o DTLS [RFC6347] over UDP with Simple Reliability (SR) 3643 (OverlayLinkType=DTLS-UDP-SR) 3645 o TLS [RFC5246] over TCP with Framing Header, No-ICE 3646 (OverlayLinkType=TLS-TCP-FH-NO-ICE) 3648 o DTLS [RFC6347] over UDP with SR, No-ICE (OverlayLinkType=DTLS-UDP- 3649 SR-NO-ICE) 3651 Note that although UDP does not properly have "connections", both TLS 3652 and DTLS have a handshake which establishes a similar, stateful 3653 association, and we simply refer to these as "connections" for the 3654 purposes of this document. 3656 If a peer receives a message that is larger than the value of max- 3657 message-size defined in the overlay configuration, the peer SHOULD 3658 send an Error_Message_Too_Large error and then close the TLS or DTLS 3659 session from which the message was received. Note that this error 3660 can be sent and the session closed before receiving the complete 3661 message. If the forwarding header is larger than the max-message- 3662 size, the receiver SHOULD close the TLS or DTLS session without 3663 sending an error.
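As an illustration of the max-message-size handling described above, the following Python sketch (not part of the specification) maps the two size checks to the recommended action. The function name, the Action enumeration, and the parameter names are hypothetical. Only the two comparisons and the resulting behaviour are taken from the text.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the three outcomes described in the text."""
    ACCEPT = "accept"
    ERROR_THEN_CLOSE = "send Error_Message_Too_Large, then close the session"
    CLOSE_SILENTLY = "close the session without sending an error"


def oversize_action(message_length, forwarding_header_length, max_message_size):
    """Return the recommended action for an incoming message.

    All sizes are in bytes. message_length is the total length announced
    for the message, which is why the decision can be taken before the
    complete message has arrived.
    """
    # A forwarding header that alone exceeds the limit: close, no error.
    if forwarding_header_length > max_message_size:
        return Action.CLOSE_SILENTLY
    # Header fits but the whole message is too large: error, then close.
    if message_length > max_message_size:
        return Action.ERROR_THEN_CLOSE
    return Action.ACCEPT
```

For example, with max-message-size set to 5000, a 6000-byte message with a 200-byte forwarding header yields ERROR_THEN_CLOSE, while a forwarding header of 5001 bytes yields CLOSE_SILENTLY.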
3665 The RELOAD mechanism requires that failed links are quickly removed 3666 from the routing table so end-to-end retransmission can handle lost 3667 messages. Overlay link protocols MUST be designed with a mechanism 3668 that quickly signals a likely failure and implementations SHOULD 3669 quickly act to remove it from the routing table when receiving this 3670 signal. The entry can be restored if it proves to resume 3671 functioning, or replaced at some point in the future if necessary. 3672 Section 10.7.2 contains more details specific to the CHORD-RELOAD 3673 topology plugin. 3675 The Framing Header (FH) is used to frame messages and provide timing 3676 when used on a reliable stream-based transport protocol. Simple 3677 Reliability (SR) makes use of the FH to provide congestion control 3678 and semi-reliability when using unreliable message-oriented transport 3679 protocols. We will first define each of these algorithms in 3680 Section 6.6.2 and Section 6.6.3, then define overlay link protocols 3681 that use them in Section 6.6.4, Section 6.6.5 and Section 6.6.6. 3683 Note: We expect future Overlay Link protocols to define replacements 3684 for all components of these protocols, including the framing header. 3685 These three protocols have been chosen for simplicity of 3686 implementation and reasonable performance. 3688 6.6.1. Future Overlay Link Protocols 3690 It is possible to define new link-layer protocols and apply them to a 3691 new overlay using the "overlay-link-protocol" configuration directive 3692 (see Section 11.1.). However, any new protocols MUST meet the 3693 following requirements. 3695 Endpoint authentication When a node forms an association with 3696 another endpoint, it MUST be possible to cryptographically verify 3697 that the endpoint has a given Node-ID. 
3699 Traffic origin authentication and integrity When a node receives 3700 traffic from another endpoint, it MUST be possible to 3701 cryptographically verify that the traffic came from a given 3702 association and that it has not been modified in transit from the 3703 other endpoint in the association. The overlay link protocol MUST 3704 also provide replay prevention/detection. 3706 Traffic confidentiality When a node sends traffic to another 3707 endpoint, it MUST NOT be possible for a third party not involved 3708 in the association to determine the contents of that traffic. 3710 Any new overlay protocol MUST be defined via RFC 5226 Standards 3711 Action; see Section 14.11. 3713 6.6.1.1. HIP 3715 In a Host Identity Protocol Based Overlay Networking Environment (HIP 3716 BONE) [RFC6079], HIP [RFC5201] provides connection management (e.g., 3717 NAT traversal and mobility) and security for the overlay network. 3718 The P2PSIP Working Group has expressed interest in supporting a HIP- 3719 based link protocol. Such support would require specifying such 3720 details as: 3722 o How to issue certificates that provide identities meaningful to 3723 the HIP base exchange. We anticipate that this would require a 3724 mapping between ORCHIDs and NodeIds. 3726 o How to carry the HIP I1 and I2 messages. 3728 o How to carry RELOAD messages over HIP. 3730 [I-D.ietf-hip-reload-instance] documents work in progress on using 3731 RELOAD with the HIP BONE. 3733 6.6.1.2. ICE-TCP 3735 The ICE-TCP RFC [RFC6544] allows TCP to be supported as an Overlay 3736 Link protocol that can be added using ICE. 3738 6.6.1.3. Message-oriented Transports 3740 Modern message-oriented transports offer high performance and good 3741 congestion control, and they avoid head-of-line blocking in case of lost 3742 data. These characteristics make them preferable as underlying 3743 transport protocols for RELOAD links. SCTP without message ordering 3744 and DCCP are two examples of such protocols.
However, currently they 3745 are not well supported by commonly available NATs, and specifications 3746 for ICE session establishment are not available. 3748 6.6.1.4. Tunneled Transports 3750 As of the time of this writing, there is significant interest in the 3751 IETF community in tunneling other transports over UDP, motivated by 3752 the fact that UDP is well supported by modern NAT hardware and that 3753 performance similar to a native implementation can be achieved. 3754 Currently SCTP, DCCP, and a generic tunneling extension are being 3755 proposed for message-oriented protocols. Once ICE traversal has been 3756 specified for these tunneled protocols, they should be 3757 straightforward to support as overlay link protocols. 3759 6.6.2. Framing Header 3761 In order to support unreliable links and to allow for quick detection 3762 of link failures when using reliable end-to-end transports, each 3763 message is wrapped in a very simple framing layer (FramedMessage) 3764 which is only used for each hop. This layer contains a sequence 3765 number which can then be used for ACKs. The same header is used for 3766 both reliable and unreliable transports for simplicity of 3767 implementation. 3769 The definition of FramedMessage is: 3771 enum { data(128), ack(129), (255) } FramedMessageType; 3773 struct { 3774 FramedMessageType type; 3776 select (type) { 3777 case data: 3778 uint32 sequence; 3779 opaque message<0..2^24-1>; 3781 case ack: 3782 uint32 ack_sequence; 3783 uint32 received; 3784 }; 3785 } FramedMessage; 3787 The type field of the PDU is set to indicate whether the message is 3788 data or an acknowledgement. 3790 If the message is of type "data", then the remainder of the PDU is as 3791 follows: 3793 sequence 3794 The sequence number. This increments by 1 for each framed message 3795 sent over this transport session. 3797 message 3798 The message that is being transmitted. 3800 Each connection has its own sequence number space.
Initially the 3801 value is zero and it increments by exactly one for each message sent 3802 over that connection. 3804 When the receiver receives a message, it SHOULD immediately send an 3805 ACK message. The receiver MUST keep track of the 32 most recent 3806 sequence numbers received on this association in order to generate 3807 the appropriate ACK. 3809 If the PDU is of type "ack", the contents are as follows: 3811 ack_sequence 3812 The sequence number of the message being acknowledged. 3814 received 3815 A bitmask indicating whether each of the previous 32 sequence numbers 3816 before this packet has been among the 32 packets most recently 3817 received on this connection. When a packet is received with a 3818 sequence number N, the receiver looks at the sequence numbers of 3819 the 32 packets previously received on this connection. Call the 3820 sequence number of each such packet M. For each of the previous 32 3821 packets, if the sequence number M is less than N but greater than 3822 N-32, the N-M bit of the received bitmask is set to one; otherwise 3823 it is zero. Note that a bit being set to one indicates positively 3824 that a particular packet was received, but a bit being set to zero 3825 means only that it is unknown whether or not the packet has been 3826 received, because it might have been received before the 32 most 3827 recently received packets. 3829 The received field bits in the ACK provide a high degree of 3830 redundancy so that the sender can figure out which packets the 3831 receiver has received and can then estimate packet loss rates. If 3832 the sender also keeps track of the time at which recent sequence 3833 numbers have been sent, the RTT can be estimated. 3835 Note that because retransmissions receive new sequence numbers, 3836 multiple ACKs may be received for the same message.
This approach 3837 provides more information than traditional TCP sequence numbers, but 3838 care must be taken when applying algorithms designed based on TCP's 3839 stream-oriented sequence number. 3841 6.6.3. Simple Reliability 3843 When RELOAD is carried over DTLS or another unreliable link protocol, 3844 it needs to be used with a reliability and congestion control 3845 mechanism, which is provided on a hop-by-hop basis. The basic 3846 principle is that each message, regardless of whether or not it 3847 carries a request or response, will get an ACK and be reliably 3848 retransmitted. The receiver's job is very simple, limited to just 3849 sending ACKs. All the complexity is at the sender side. This allows 3850 the sending implementation to trade off performance versus 3851 implementation complexity without affecting the wire protocol. 3853 Because the receiver's role is limited to providing packet 3854 acknowledgements, a wide variety of congestion control algorithms can 3855 be implemented on the sender side while using the same basic wire 3856 protocol. The sender algorithm used MUST meet the requirements of 3857 [RFC5405]. 3859 6.6.3.1. Stop and Wait Sender Algorithm 3861 This section describes one possible implementation of a sender 3862 algorithm for Simple Reliability. It is adequate for overlays 3863 running on underlying networks with low latency and loss (LANs) or 3864 low-traffic overlays on the Internet. 3866 A node MUST NOT have more than one unacknowledged message on the DTLS 3867 connection at a time. Note that because retransmissions of the same 3868 message are given new sequence numbers, there may be multiple 3869 unacknowledged sequence numbers in use. 3871 The RTO ("Retransmission TimeOut") is based on an estimate of the 3872 round-trip time (RTT). The value for RTO is calculated separately 3873 for each DTLS session. Implementations can use a static value for 3874 RTO or a dynamic estimate which will result in better performance. 
3875 For implementations that use a static value, the default value for 3876 RTO is 500 ms. Nodes MAY use smaller values of RTO if it is known 3877 that all nodes are within the local network. The default RTO MAY be 3878 chosen larger, and this is RECOMMENDED if it is known in advance 3879 (such as on high-latency access links) that the round-trip time is 3880 larger. 3882 Implementations that use a dynamic estimate to compute the RTO MUST 3883 use the algorithm described in RFC 6298 [RFC6298], with the exception 3884 that the value of RTO SHOULD NOT be rounded up to the nearest second 3885 but instead rounded up to the nearest millisecond. The RTT of a 3886 successful STUN transaction from the ICE stage is used as the initial 3887 measurement for formula 2.2 of RFC 6298. The sender keeps track of 3888 the time each message was sent for all recently sent messages. Any 3889 time an ACK is received, the sender can compute the RTT for that 3890 message by looking at the time the ACK was received and the time when 3891 the message was sent. This is used as a subsequent RTT measurement 3892 for formula 2.3 of RFC 6298 to update the RTO estimate. (Note that 3893 because retransmissions receive new sequence numbers, all received 3894 ACKs are used.) 3896 An initiating node SHOULD retransmit a message if it has not received 3897 an ACK after an interval of RTO (transit nodes do not retransmit at 3898 this layer). The node MUST double the time to wait after each 3899 retransmission. For each retransmission, the sequence number MUST be 3900 incremented. 3902 Retransmissions continue until a response is received, until a 3903 total of 5 requests have been sent, or until there has been a hard ICMP 3904 error [RFC1122] or a TLS alert. The sender knows a response was 3905 received when it receives an ACK with a sequence number that 3906 indicates it is a response to one of the transmissions of this 3907 message.
For example, assuming an RTO of 500 ms, requests would be 3908 sent at times 0 ms, 500 ms, 1500 ms, 3500 ms, and 7500 ms. If all 3909 retransmissions for a message fail, then the sending node SHOULD 3910 close the connection routing the message. 3912 To determine when a link might be failing without waiting for the 3913 final timeout, observe when no ACKs have been received for an entire 3914 RTO interval, and then wait for three retransmissions to occur beyond 3915 that point. If no ACKs have been received by the time the third 3916 retransmission occurs, it is RECOMMENDED that the link be removed 3917 from the Routing Table. The link MAY be restored to the Routing 3918 Table if ACKs resume before the connection is closed, as described 3919 above. 3921 A sender MUST wait 10ms between receipt of an ACK and transmission of 3922 the next message. 3924 6.6.4. DTLS/UDP with SR 3926 This overlay link protocol consists of DTLS over UDP while 3927 implementing the Simple Reliability protocol. STUN Connectivity 3928 checks and keepalives are used. Any compliant sender algorithm may 3929 be used. 3931 6.6.5. TLS/TCP with FH, No-ICE 3933 This overlay link protocol consists of TLS over TCP with the framing 3934 header. Because ICE is not used, STUN connectivity checks are not 3935 used upon establishing the TCP connection, nor are they used for 3936 keepalives. 3938 Because the TCP layer's application-level timeout is too slow to be 3939 useful for overlay routing, the Overlay Link implementation MUST use 3940 the framing header to measure the RTT of the connection and calculate 3941 an RTO as specified in Section 2 of [RFC6298]. The resulting RTO is 3942 not used for retransmissions, but as a timeout to indicate when the 3943 link SHOULD be removed from the Routing Table. It is RECOMMENDED 3944 that such a connection be retained for 30s to determine if the 3945 failure was transient before concluding the link has failed 3946 permanently. 
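The RTO computation referenced above can be sketched as follows. This Python fragment is an illustration only, not part of the specification: the class and method names are hypothetical, and the clock granularity of 1 ms is an assumption. The smoothing constants (alpha = 1/8, beta = 1/4, K = 4), the 1-second initial RTO, and the 1-second floor are those of RFC 6298. The min_rto_ms parameter is an assumed knob that lets a Simple Reliability sender drop the floor and round to the millisecond instead, as Section 6.6.3.1 recommends.

```python
import math


class RtoEstimator:
    """RTT/RTO estimator following Section 2 of RFC 6298 (illustrative)."""

    ALPHA = 1 / 8            # SRTT gain (RFC 6298)
    BETA = 1 / 4             # RTTVAR gain (RFC 6298)
    K = 4                    # variance multiplier (RFC 6298)
    INITIAL_RTO_MS = 1000.0  # RTO before any RTT sample (RFC 6298, rule 2.1)

    def __init__(self, granularity_ms=1.0, min_rto_ms=1000.0):
        self.granularity_ms = granularity_ms  # assumed clock granularity G
        self.min_rto_ms = min_rto_ms          # 0 gives millisecond rounding only
        self.srtt = None
        self.rttvar = None

    def update(self, rtt_ms):
        """Feed one RTT sample in milliseconds and return the new RTO."""
        if self.srtt is None:
            # First measurement (rule 2.2).
            self.srtt = rtt_ms
            self.rttvar = rtt_ms / 2
        else:
            # Subsequent measurements (rule 2.3); RTTVAR is updated first.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt_ms))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_ms
        return self.rto_ms()

    def rto_ms(self):
        """Current RTO, rounded up to a whole millisecond."""
        if self.srtt is None:
            return math.ceil(self.INITIAL_RTO_MS)
        rto = self.srtt + max(self.granularity_ms, self.K * self.rttvar)
        return math.ceil(max(rto, self.min_rto_ms))

    def link_timed_out(self, last_ack_ms, now_ms):
        """True if no ACK has arrived for longer than the current RTO."""
        return now_ms - last_ack_ms > self.rto_ms()
```

With the floor disabled (min_rto_ms=0), two consecutive 100 ms samples give an RTO of 300 ms and then 250 ms. With the default floor, both results are raised to 1000 ms. On a TLS/TCP link the result is used only as the timeout after which the link is considered for removal from the Routing Table, not to trigger retransmissions.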
3948 When sending candidates for TLS/TCP with FH, No-ICE, a passive 3949 candidate MUST be provided. 3951 6.6.6. DTLS/UDP with SR, No-ICE 3953 This overlay link protocol consists of DTLS over UDP while 3954 implementing the Simple Reliability protocol. Because ICE is not 3955 used, no STUN connectivity checks or keepalives are used. 3957 6.7. Fragmentation and Reassembly 3959 In order to allow transmission over datagram protocols such as DTLS, 3960 RELOAD messages may be fragmented. 3962 Any node along the path can fragment the message but only the final 3963 destination reassembles the fragments. When a node takes a packet 3964 and fragments it, each fragment has a full copy of the Forwarding 3965 Header but the data after the Forwarding Header is broken up in 3966 appropriate sized chunks. The size of the payload chunks needs to 3967 take into account space to allow the via and destination lists to 3968 grow. Each fragment MUST contain a full copy of the via list, 3969 destination list, and ForwardingOptions and MUST contain at least 256 3970 bytes of the message body. If these elements cannot fit within the 3971 MTU of the underlying datagram protocol, RELOAD fragmentation is not 3972 performed and IP-layer fragmentation is allowed to occur. The length 3973 field MUST contain the size of the message after fragmentation. When 3974 a message MUST be fragmented, it SHOULD be split into equal-sized 3975 fragments that are no larger than the PMTU of the next overlay link 3976 minus 32 bytes. This is to allow the via list to grow before further 3977 fragmentation is required. 3979 Note that this fragmentation is not optimal for the end-to-end path - 3980 a message may be refragmented multiple times as it traverses the 3981 overlay but is only assembled at the final destination. 
This option 3982 has been chosen as it is far easier to implement than e2e PMTU 3983 discovery across an ever-changing overlay, and it effectively 3984 addresses the reliability issues of relying on IP-layer 3985 fragmentation. However, Ping can be used to allow e2e PMTU discovery 3986 to be implemented if desired. 3988 Upon receipt of a fragmented message by the intended peer, the peer 3989 holds the fragments in a holding buffer until the entire message has 3990 been received. The message is then reassembled into a single message 3991 and processed. In order to mitigate denial of service attacks, 3992 receivers SHOULD time out incomplete fragments after maximum request 3993 lifetime (15 seconds). Note this time was derived from looking at 3994 the end-to-end retransmission time and saving fragments long enough 3995 for the full end-to-end retransmissions to take place. Ideally the 3996 receiver would have enough buffer space to deal with as many 3997 fragments as can arrive in the maximum request lifetime. However, if 3998 the receiver runs out of buffer space to reassemble the messages it 3999 MUST drop the message. 4001 The fragment field of the forwarding header is used to encode 4002 fragmentation information. The offset is the number of bytes between 4003 the end of the forwarding header and the start of the data. The 4004 first fragment therefore has an offset of 0. The last fragment 4005 indicator MUST be appropriately set. If the message is not 4006 fragmented, it is simply treated as if it is the only fragment: the 4007 last fragment bit is set and the offset is 0 resulting in a fragment 4008 value of 0xC0000000. 4010 Note: the reason for this definition of the fragment field is that 4011 originally the high bit was defined in part of the specification as 4012 "is fragmented" and so there was some specification ambiguity about 4013 how to encode messages with only one fragment. 
This ambiguity was 4014 resolved in favor of always encoding as the "last" fragment with 4015 offset 0, thus simplifying the receiver code path, but resulting in 4016 the high bit being redundant. Because messages MUST be sent with the 4017 high bit set to 1, implementations SHOULD discard any message with it 4018 set to 0. Implementations (presumably legacy ones) which choose to 4019 accept such messages MUST either ignore the remaining bits or ensure 4020 that they are 0. They MUST NOT try to interpret messages with the 4021 high bit set to 0 as fragmented messages. 4023 7. Data Storage Protocol 4025 RELOAD provides a set of generic mechanisms for storing and 4026 retrieving data in the Overlay Instance. These mechanisms can be 4027 used for new applications simply by defining new code points and a 4028 small set of rules. No new protocol mechanisms are required. 4030 The basic unit of stored data is a single StoredData structure: 4032 struct { 4033 uint32 length; 4034 uint64 storage_time; 4035 uint32 lifetime; 4036 StoredDataValue value; 4037 Signature signature; 4038 } StoredData; 4040 The contents of this structure are as follows: 4042 length 4043 The size of the StoredData structure in bytes, excluding the size 4044 of length itself. 4046 storage_time 4047 The time when the data was stored, represented as the number of 4048 milliseconds elapsed since midnight Jan 1, 1970 UTC, not counting 4049 leap seconds. This will have the same values for seconds as 4050 standard UNIX time or POSIX time. More information can be found 4051 at [UnixTime]. Any attempt to store a data value with a storage 4052 time before that of a value already stored at this location MUST 4053 generate an Error_Data_Too_Old error. This prevents rollback 4054 attacks.
The node SHOULD make a best-effort attempt to use a 4055 correct clock to determine this number, however, the protocol does 4056 not require synchronized clocks: the receiving peer uses the 4057 storage time in the previous store, not its own clock. Clock 4058 values are used so that when clocks are generally synchronized, 4059 data may be stored in a single transaction, rather than querying 4060 for the value of a counter before the actual store. 4062 If a node attempting to store new data in response to a user 4063 request (rather than as an overlay maintenance operation such as 4064 occurs when healing the overlay from a partition) is rejected with 4065 an Error_Data_Too_Old error, the node MAY elect to perform its 4066 store using a storage_time that increments the value used with the 4067 previous store. This situation may occur when the clocks of nodes 4068 storing to this location are not properly synchronized. 4070 lifetime 4071 The validity period for the data, in seconds, starting from the 4072 time the peer receives the StoreReq. 4074 value 4075 The data value itself, as described in Section 7.2. 4077 signature 4078 A signature as defined in Section 7.1. 4080 Each Resource-ID specifies a single location in the Overlay Instance. 4081 However, each location may contain multiple StoredData values 4082 distinguished by Kind-ID. The definition of a Kind describes both 4083 the data values which may be stored and the data model of the data. 4084 Some data models allow multiple values to be stored under the same 4085 Kind-ID. Section 7.2 describes the available data models. Thus, for 4086 instance, a given Resource-ID might contain a single-value element 4087 stored under Kind-ID X and an array containing multiple values stored 4088 under Kind-ID Y. 4090 7.1. Data Signature Computation 4092 Each StoredData element is individually signed. 
However, the 4093 signature also must be self-contained and cover the Kind-ID and 4094 Resource-ID even though they are not present in the StoredData 4095 structure. The input to the signature algorithm is: 4097 resource_id || kind || storage_time || StoredDataValue || 4098 SignerIdentity 4100 Where || indicates concatenation. 4102 Where these values are: 4104 resource_id 4105 The Resource-ID where this data is stored. 4107 kind 4108 The Kind-ID for this data. 4110 storage_time 4111 The contents of the storage_time data value. 4113 StoredDataValue 4114 The contents of the stored data value, as described in the 4115 previous sections. 4117 SignerIdentity 4118 The signer identity as defined in Section 6.3.4. 4120 Once the signature has been computed, the signature is represented 4121 using a signature element, as described in Section 6.3.4. 4123 Note that there is no necessary relationship between the validity 4124 window of a certificate and the expiry of the data it is 4125 authenticating. When signatures are verified, the current time MUST 4126 be compared to the certificate validity period. Stored data MAY be 4127 set to expire after the signing certificate's validity period. Such 4128 signatures are not considered valid after the signing certificate 4129 expires. Implementations may garbage collect such data at their 4130 convenience, either purging it automatically (perhaps by setting the 4131 upper bound on data storage to the lifetime of the signing 4132 certificate) or by simply leaving it in-place until it expires 4133 naturally and relying on users of that data to notice the expired 4134 signing certificate. 4136 7.2. Data Models 4138 The protocol currently defines the following data models: 4140 o single value 4142 o array 4144 o dictionary 4146 These are represented with the StoredDataValue structure. The actual 4147 data model is known from the Kind being stored. 
4149 struct { 4150 Boolean exists; 4151 opaque value<0..2^32-1>; 4152 } DataValue; 4154 struct { 4155 select (DataModel) { 4156 case single_value: 4157 DataValue single_value_entry; 4159 case array: 4160 ArrayEntry array_entry; 4162 case dictionary: 4163 DictionaryEntry dictionary_entry; 4165 /* This structure may be extended */ 4166 }; 4167 } StoredDataValue; 4169 We now discuss the properties of each data model in turn: 4171 7.2.1. Single Value 4173 A single-value element is a simple sequence of bytes. There may be 4174 only one single-value element for each Resource-ID, Kind-ID pair. 4176 A single-value element is represented as a DataValue, which contains 4177 the following two elements: 4179 exists 4180 This value indicates whether the value exists at all. If it is 4181 set to False, it means that no value is present. If it is True, 4182 that means that a value is present. This gives the protocol a 4183 mechanism for indicating nonexistence as opposed to emptiness. 4185 value 4186 The stored data. 4188 7.2.2. Array 4190 An array is a set of opaque values addressed by an integer index. 4191 Arrays are zero-based. Note that arrays can be sparse. For 4192 instance, a Store of "X" at index 2 in an empty array produces an 4193 array with the values [NA, NA, "X"]. Future attempts to fetch 4194 elements at index 0 or 1 will return values with "exists" set to 4195 False. 4197 An array element is represented as an ArrayEntry: 4199 struct { 4200 uint32 index; 4201 DataValue value; 4202 } ArrayEntry; 4204 The contents of this structure are: 4206 index 4207 The index of the data element in the array. 4209 value 4210 The stored data. 4212 7.2.3. Dictionary 4214 A dictionary is a set of opaque values indexed by an opaque key with 4215 one value for each key.
4218 A single dictionary entry is represented as a DictionaryEntry: 4220 typedef opaque DictionaryKey<0..2^16-1>; 4222 struct { 4223 DictionaryKey key; 4224 DataValue value; 4225 } DictionaryEntry; 4227 The contents of this structure are: 4229 key 4230 The dictionary key for this value. 4232 value 4233 The stored data. 4235 7.3. Access Control Policies 4237 Every Kind which is storable in an overlay MUST be associated with an 4238 access control policy. This policy defines whether a request from a 4239 given node to operate on a given value should succeed or fail. It is 4240 anticipated that only a small number of generic access control 4241 policies are required. To that end, this section describes a small 4242 set of such policies and Section 14.4 establishes a registry for new 4243 policies if required. Each policy has a short string identifier 4244 which is used to reference it in the configuration document. 4246 In the following policies, the term "signer" refers to the signer of 4247 the StoredValue object and, in the case of non-replica stores, to the 4248 signer of the StoreReq message. That is, in a non-replica store, both 4249 the signer of the StoredValue and the signer of the StoreReq MUST 4250 conform to the policy. In the case of a replica store, the signer of 4251 the StoredValue MUST conform to the policy and the StoreReq itself 4252 MUST be checked as described in Section 7.4.1.1. 4254 7.3.1. USER-MATCH 4256 In the USER-MATCH policy, a given value MUST be written (or 4257 overwritten) if and only if the signer's certificate has a user name 4258 which hashes (using the hash function for the overlay) to the 4259 Resource-ID for the resource. Recall that the certificate may, 4260 depending on the overlay configuration, be self-signed. 4262 7.3.2.
NODE-MATCH 4264 In the NODE-MATCH policy, a given value MUST be written (or 4265 overwritten) if and only if the signer's certificate has a specified 4266 Node-ID which hashes (using the hash function for the overlay) to the 4267 Resource-ID for the resource and that Node-ID is the one indicated in 4268 the SignerIdentity value cert_hash. 4270 7.3.3. USER-NODE-MATCH 4272 The USER-NODE-MATCH policy may only be used with dictionary types. 4273 In the USER-NODE-MATCH policy, a given value MUST be written (or 4274 overwritten) if and only if the signer's certificate has a user name 4275 which hashes (using the hash function for the overlay) to the 4276 Resource-ID for the resource. In addition, the dictionary key MUST 4277 be equal to the Node-ID in the certificate and that Node-ID MUST be 4278 the one indicated in the SignerIdentity value cert_hash. 4280 7.3.4. NODE-MULTIPLE 4282 In the NODE-MULTIPLE policy, a given value MUST be written (or 4283 overwritten) if and only if the signer's certificate contains a 4284 Node-ID such that H(Node-ID || i) is equal to the Resource-ID for 4285 some small integer value of i and that Node-ID is the one indicated 4286 in the SignerIdentity value cert_hash. When this policy is in use, 4287 the maximum value of i MUST be specified in the Kind definition. 4289 Note that as i is not carried on the wire, the verifier MUST iterate 4290 through potential i values up to the maximum value in order to 4291 determine whether a store is acceptable. 4293 7.4. Data Storage Methods 4295 RELOAD provides several methods for storing and retrieving data: 4297 o Store values in the overlay 4299 o Fetch values from the overlay 4301 o Stat: get metadata about values in the overlay 4303 o Find the values stored at an individual peer 4305 These methods are each described in the following sections. 4307 7.4.1. Store 4309 The Store method is used to store data in the overlay. 
The format of 4310 the Store request depends on the data model which is determined by 4311 the Kind. 4313 7.4.1.1. Request Definition 4315 A StoreReq message is a sequence of StoreKindData values, each of 4316 which represents a sequence of stored values for a given Kind. The 4317 same Kind-ID MUST NOT be used twice in a given store request. Each 4318 value is then processed in turn. These operations MUST be atomic. 4319 If any operation fails, the state MUST be rolled back to before the 4320 request was received. 4322 The store request is defined by the StoreReq structure: 4324 struct { 4325 KindId kind; 4326 uint64 generation_counter; 4327 StoredData values<0..2^32-1>; 4328 } StoreKindData; 4330 struct { 4331 ResourceId resource; 4332 uint8 replica_number; 4333 StoreKindData kind_data<0..2^32-1>; 4334 } StoreReq; 4336 A single Store request stores data of a number of Kinds to a single 4337 resource location. The contents of the structure are: 4339 resource 4340 The resource to store at. 4342 replica_number 4343 The number of this replica. When a storing peer saves replicas to 4344 other peers each peer is assigned a replica number starting from 1 4345 and sent in the Store message. This field is set to 0 when a node 4346 is storing its own data. This allows peers to distinguish replica 4347 writes from original writes. Different topologies may choose to 4348 allocate or interpret the replica number differently (see 4349 Section 10.4). 4351 kind_data 4352 A series of elements, one for each Kind of data to be stored. 4354 If the replica number is zero, then the peer MUST check that it is 4355 responsible for the resource and, if not, reject the request. If the 4356 replica number is nonzero, then the peer MUST check that it expects 4357 to be a replica for the resource and that the request sender is 4358 consistent with being the responsible node (i.e., that the receiving 4359 peer does not know of a better node) and, if not, reject the request. 
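The request-level checks described above (no repeated Kind-ID, and the replica_number rules) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not part of the protocol definition: the class layout, the function name precheck_store, and the three boolean inputs (is_responsible, expects_replica, sender_plausible) are hypothetical. How a peer actually determines those three facts is overlay specific and is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StoreKindData:
    kind: int                 # Kind-ID
    generation_counter: int
    values: List[bytes] = field(default_factory=list)  # stand-in for StoredData


@dataclass
class StoreReq:
    resource: bytes           # Resource-ID
    replica_number: int       # 0 = original write, nonzero = replica write
    kind_data: List[StoreKindData] = field(default_factory=list)


def precheck_store(req: StoreReq, is_responsible: bool,
                   expects_replica: bool, sender_plausible: bool) -> Optional[str]:
    """Return None if the request passes these checks, else a reason string.

    The three booleans are hypothetical inputs supplied by the overlay
    algorithm; they are assumptions of this sketch.
    """
    # The same Kind-ID must not appear twice in one store request.
    kinds = [kd.kind for kd in req.kind_data]
    if len(kinds) != len(set(kinds)):
        return "duplicate Kind-ID in StoreReq"

    if req.replica_number == 0:
        # Original write: the receiving peer must be responsible.
        if not is_responsible:
            return "not responsible for resource"
    else:
        # Replica write: the peer must expect to hold a replica, and the
        # sender must be consistent with being the responsible node.
        if not expects_replica:
            return "not an expected replica for resource"
        if not sender_plausible:
            return "sender is not consistent with being the responsible node"
    return None
```

A caller would reject the whole request when the function returns a reason, which matches the requirement that a StoreReq be processed atomically.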
4361 Each StoreKindData element represents the data to be stored for a 4362 single Kind-ID. The contents of the element are: 4364 kind 4365 The Kind-ID. Implementations MUST reject requests corresponding 4366 to unknown Kinds. 4368 generation_counter 4369 The expected current state of the generation counter 4370 (approximately the number of times this object has been written; 4371 see below for details). 4373 values 4374 The value or values to be stored. This may contain one or more 4375 stored_data values depending on the data model associated with 4376 each Kind. 4378 The peer MUST perform the following checks: 4380 o The Kind-ID is known and supported. 4382 o The signatures over each individual data element (if any) are 4383 valid. If this check fails, the request MUST be rejected with an 4384 Error_Forbidden error. 4386 o Each element is signed by a credential which is authorized to 4387 write this Kind at this Resource-ID. If this check fails, the 4388 request MUST be rejected with an Error_Forbidden error. 4390 o For original (non-replica) stores, the StoreReq is signed by a 4391 credential which is authorized to write this Kind at this 4392 Resource-ID. If this check fails, the request MUST be rejected 4393 with an Error_Forbidden error. 4395 o For replica stores, the StoreReq is signed by a Node-ID which is a 4396 plausible node to either have originally stored the value or in 4397 the replica set. What this means is overlay specific, but in the 4398 case of the Chord based DHT defined in this specification, replica 4399 StoreReqs MUST come from nodes which are either in the known 4400 replica set for a given resource or which are closer than some 4401 node in the replica set. If this check fails, the request MUST be 4402 rejected with an Error_Forbidden error. 4404 o For original (non-replica) stores, the peer MUST check that if the 4405 generation counter is non-zero, it equals the current value of the 4406 generation counter for this Kind. 
This feature allows the 4407 generation counter to be used in a way similar to the HTTP ETag 4408 feature. 4410 o For replica Stores, the peer MUST set the generation counter to 4411 match the generation counter in the message, and MUST NOT check 4412 the generation counter against the current value. Replica Stores 4413 MUST NOT use a generation counter of 0. 4415 o The storage time values are greater than that of any value which 4416 would be replaced by this Store. 4418 o The size and number of the stored values are consistent with the 4419 limits specified in the overlay configuration. 4421 o If the data is signed with identity_type set to "none" and/or 4422 SignatureAndHashAlgorithm values set to {0, 0} ("anonymous" and 4423 "none"), the StoreReq MUST be rejected with an Error_Forbidden 4424 error. Only synthesized data returned by the storage can use 4425 these values (see Section 7.4.2.2). 4427 If all these checks succeed, the peer MUST attempt to store the data 4428 values. For non-replica stores, if the store succeeds and the data 4429 is changed, then the peer MUST increase the generation counter by at 4430 least one. If there are multiple stored values in a single 4431 StoreKindData, it is permissible for the peer to increase the 4432 generation counter by only 1 for the entire Kind-ID, or by 1 or more 4433 for each value. Accordingly, all stored data values MUST 4434 have a generation counter of 1 or greater. A value of 0 is used in the Store 4435 request to indicate that the generation counter should be ignored for 4436 processing this request; however, the responsible peer should increase 4437 the stored generation counter and should return the correct 4438 generation counter in the response. 4440 When a peer stores data previously stored by another node (e.g., for 4441 replicas or topology shifts), it MUST adjust the lifetime value 4442 downward to reflect the amount of time the value was stored at the 4443 peer.
The adjustment SHOULD be implemented by an algorithm 4444 equivalent to the following: at the time the peer initially receives 4445 the StoreReq it notes the local time T. When it then attempts to do a 4446 StoreReq to another node it should decrement the lifetime value by 4447 the difference between the current local time and T. 4449 Unless otherwise specified by the usage, if a peer attempts to store 4450 data previously stored by another node (e.g., for replicas or 4451 topology shifts) and that store fails with either an 4452 Error_Generation_Counter_Too_Low or an Error_Data_Too_Old error, the 4453 peer MUST fetch the newer data from the peer generating the error and 4454 use that to replace its own copy. This rule allows resynchronization 4455 after partitions heal. 4457 When a network partition is being healed and unless otherwise 4458 specified, the default merging rule is to act as if all the values 4459 that need to be merged were stored and as if the order they were 4460 stored in corresponds to the stored time values associated with (and 4461 carried in) their values. Because the stored time values are those 4462 associated with the peer which did the writing, clock skew is 4463 generally not an issue. If two nodes are on different partitions, 4464 write to the same location, and have clock skew, this can create 4465 merge conflicts. However because RELOAD deliberately segregates 4466 storage so that data from different users and peers is stored in 4467 different locations, and a single peer will typically only be in a 4468 single network partition, this case will generally not arise. 4470 The properties of stores for each data model are as follows: 4472 Single-value: 4473 A store of a new single-value element creates the element if it 4474 does not exist and overwrites any existing value with the new 4475 value. 4477 Array: 4478 A store of an array entry replaces (or inserts) the given value at 4479 the location specified by the index. 
Because arrays are sparse, a 4480 store past the end of the array extends it with nonexistent values 4481 (exists = False) as required. A store at index 0xffffffff places 4482 the new value at the end of the array regardless of the length of 4483 the array. The resulting StoredData has the correct index value 4484 when it is subsequently fetched. 4486 Dictionary: 4487 A store of a dictionary entry replaces (or inserts) the given 4488 value at the location specified by the dictionary key. 4490 The following figure shows the relationship between these structures 4491 for an example store which stores the following values at resource 4492 "1234" 4494 o The value "abc" in the single value location for Kind X 4496 o The value "foo" at index 0 in the array for Kind Y 4498 o The value "bar" at index 1 in the array for Kind Y 4500 Store 4501 resource=1234 4502 replica_number = 0 4503 / \ 4504 / \ 4505 StoreKindData StoreKindData 4506 kind=X (Single-Value) kind=Y (Array) 4507 generation_counter = 99 generation_counter = 107 4508 | /\ 4509 | / \ 4510 StoredData / \ 4511 storage_time = xxxxxxx / \ 4512 lifetime = 86400 / \ 4513 signature = XXXX / \ 4514 | | | 4515 | StoredData StoredData 4516 | storage_time = storage_time = 4517 | yyyyyyyy zzzzzzz 4518 | lifetime = 86400 lifetime = 33200 4519 | signature = YYYY signature = ZZZZ 4520 | | | 4521 StoredDataValue | | 4522 value="abc" | | 4523 | | 4524 StoredDataValue StoredDataValue 4525 index=0 index=1 4526 value="foo" value="bar" 4528 7.4.1.2. Response Definition 4530 In response to a successful Store request the peer MUST return a 4531 StoreAns message containing a series of StoreKindResponse elements 4532 containing the current value of the generation counter for each 4533 Kind-ID, as well as a list of the peers where the data will be 4534 replicated by the node processing the request. 
4536 struct { 4537 KindId kind; 4538 uint64 generation_counter; 4539 NodeId replicas<0..2^16-1>; 4540 } StoreKindResponse; 4542 struct { 4543 StoreKindResponse kind_responses<0..2^16-1>; 4544 } StoreAns; 4546 The contents of each StoreKindResponse are: 4548 kind 4549 The Kind-ID being represented. 4551 generation_counter 4552 The current value of the generation counter for that Kind-ID. 4554 replicas 4555 The list of other peers at which the data was/will be replicated. 4556 In overlays and applications where the responsible peer is 4557 intended to store redundant copies, this allows the storing node 4558 to independently verify that the replicas have in fact been 4559 stored. It does this verification by using the Stat method (see 4560 Section 7.4.3). Note that the storing node is not required to 4561 perform this verification. 4563 The response itself is just StoreKindResponse values packed end-to- 4564 end. 4566 If any of the generation counters in the request precede the 4567 corresponding stored generation counter, then the peer MUST fail the 4568 entire request and respond with an Error_Generation_Counter_Too_Low 4569 error. The error_info in the ErrorResponse MUST be a StoreAns 4570 response containing the correct generation counter for each Kind and 4571 the replica list, which will be empty. For original (non-replica) 4572 stores, a node which receives such an error SHOULD attempt to fetch 4573 the data and, if the storage_time value is newer, replace its own 4574 data with that newer data. This rule improves data consistency in 4575 the case of partitions and merges. 4577 If the data being stored is too large for the allowed limit by the 4578 given usage, then the peer MUST fail the request and generate an 4579 Error_Data_Too_Large error. 4581 If any type of request tries to access a data Kind that the peer does 4582 not know about, an Error_Unknown_Kind MUST be generated. 
The 4583 error_info in the Error_Response is: 4585 KindId unknown_kinds<0..2^8-1>; 4587 which lists all the Kinds that were unrecognized. A node which 4588 receives this error MUST generate a ConfigUpdate message which 4589 contains the appropriate Kind definition (assuming that in fact a 4590 Kind was used which was defined in the configuration document). 4592 7.4.1.3. Removing Values 4594 RELOAD does not have an explicit Remove operation. Rather, values 4595 are Removed by storing "nonexistent" values in their place. Each 4596 DataValue contains a boolean value called "exists" which indicates 4597 whether a value is present at that location. In order to effectively 4598 remove a value, the owner stores a new DataValue with "exists" set to 4599 False: 4601 exists = False 4603 value = {} (0 length) 4605 The owner SHOULD use a lifetime for the nonexistent value at least as 4606 long as the remainder of the lifetime of the value it is replacing; 4607 otherwise it is possible for the original value to be accidentally or 4608 maliciously re-stored after the storing node has expired it. Note 4609 that there is still a window of vulnerability for replay attack after 4610 the original lifetime has expired (as with any store). This attack 4611 can be mitigated by doing a nonexistent store with a very long 4612 lifetime. 4614 Storing nodes MUST treat these nonexistent values the same way they 4615 treat any other stored value, including overwriting the existing 4616 value, replicating them, and aging them out as necessary when 4617 lifetime expires. When a stored nonexistent value's lifetime 4618 expires, it is simply removed from the storing node like any other 4619 stored value expiration. 4621 Note that in the case of arrays and dictionaries, expiration may 4622 create an implicit, unsigned "nonexistent" value to represent a gap 4623 in the data structure, as might happen when any value is aged out. 4624 However, this value isn't persistent nor is it replicated. 
It is 4625 simply synthesized by the storing node. 4627 7.4.2. Fetch 4629 The Fetch request retrieves one or more data elements stored at a 4630 given Resource-ID. A single Fetch request can retrieve multiple 4631 different Kinds. 4633 7.4.2.1. Request Definition 4635 struct { 4636 int32 first; 4637 int32 last; 4638 } ArrayRange; 4640 struct { 4641 KindId kind; 4642 uint64 generation; 4643 uint16 length; 4645 select (DataModel) { 4646 case single_value: ; /* Empty */ 4648 case array: 4649 ArrayRange indices<0..2^16-1>; 4651 case dictionary: 4652 DictionaryKey keys<0..2^16-1>; 4654 /* This structure may be extended */ 4656 } model_specifier; 4657 } StoredDataSpecifier; 4659 struct { 4660 ResourceId resource; 4661 StoredDataSpecifier specifiers<0..2^16-1>; 4662 } FetchReq; 4664 The contents of the Fetch request are as follows: 4666 resource 4667 The Resource-ID to fetch from. 4669 specifiers 4670 A sequence of StoredDataSpecifier values, each specifying some of 4671 the data values to retrieve. 4673 Each StoredDataSpecifier specifies a single Kind of data to retrieve 4674 and (if appropriate) the subset of values that are to be retrieved. 4675 The contents of the StoredDataSpecifier structure are as follows: 4677 kind 4678 The Kind-ID of the data being fetched. Implementations SHOULD 4679 reject requests corresponding to unknown Kinds unless specifically 4680 configured otherwise. 4682 DataModel 4683 The data model of the data. This is not transmitted on the wire 4684 but comes from the definition of the Kind. 4686 generation 4687 The last generation counter that the requesting node saw. This 4688 may be used to avoid unnecessary fetches or it may be set to zero. 4690 length 4691 The length of the rest of the structure, thus allowing 4692 extensibility. 4694 model_specifier 4695 A reference to the data value being requested within the data 4696 model specified for the Kind. For instance, if the data model is 4697 "array", it might specify some subset of the values.
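As an illustration of how the length field frames the model-specific part of a StoredDataSpecifier, the following Python sketch serializes a specifier for a Kind whose data model is array. Several points are assumptions of the sketch rather than statements from this section: KindId is taken to be a 32-bit unsigned integer, all integers are in network byte order, and a variable-length vector carries a byte-count prefix sized for its ceiling (two octets for <0..2^16-1>), in the manner of the TLS presentation language. Range bounds are packed as unsigned 32-bit values. The function name encode_array_specifier is hypothetical.

```python
import struct


def encode_array_specifier(kind, generation, ranges):
    """Serialize a StoredDataSpecifier whose data model is array.

    `ranges` is a list of (first, last) pairs. Field widths follow the
    assumptions stated in the lead-in; this is a sketch, not a reference
    encoder.
    """
    # ArrayRange indices<0..2^16-1>: two-octet byte count, then the ranges.
    packed = b"".join(struct.pack("!II", first, last) for first, last in ranges)
    model_specifier = struct.pack("!H", len(packed)) + packed

    # kind (assumed uint32), generation (uint64), then length (uint16),
    # which counts the octets in the rest of the structure.
    header = struct.pack("!IQH", kind, generation, len(model_specifier))
    return header + model_specifier
```

Because length covers everything after it, a receiver that does not understand a future model_specifier layout can still skip over it and continue with the next specifier.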
4699 The model_specifier is as follows: 4701 o If the data model is single value, the specifier is empty. 4703 o If the data model is array, the specifier contains a list of 4704 ArrayRange elements, each of which contains two integers. The 4705 first integer is the beginning of the range and the second is the 4706 end of the range. 0 is used to indicate the first element and 4707 0xffffffff is used to indicate the final element. The first 4708 integer MUST be less than the second. While multiple ranges MAY 4709 be specified, they MUST NOT overlap. 4711 o If the data model is dictionary, then the specifier contains a list 4712 of the dictionary keys being requested. If no keys are specified, 4713 then this is a wildcard fetch and all key-value pairs are 4714 returned. 4716 The generation counter is used to indicate the requester's expected 4717 state of the storing peer. If the generation counter in the request 4718 matches the stored counter, then the storing peer returns a response 4719 with no StoredData values. 4721 7.4.2.2. Response Definition 4723 The response to a successful Fetch request is a FetchAns message 4724 containing the requested data. 4726 struct { 4727 KindId kind; 4728 uint64 generation; 4729 StoredData values<0..2^32-1>; 4730 } FetchKindResponse; 4732 struct { 4733 FetchKindResponse kind_responses<0..2^32-1>; 4734 } FetchAns; 4736 The FetchAns structure contains a series of FetchKindResponse 4737 structures. There MUST be one FetchKindResponse element for each 4738 Kind-ID in the request. 4740 The contents of the FetchKindResponse structure are as follows: 4742 kind 4743 The Kind that this structure is for. 4745 generation 4746 The generation counter for this Kind. 4748 values 4749 The relevant values. If the generation counter in the request 4750 matches the generation counter in the stored data, then no 4751 StoredData values are returned. Otherwise, all relevant data 4752 values MUST be returned.
A nonexistent value (i.e., one which the 4753 node has no knowledge of) is represented by a synthetic value with 4754 "exists" set to False and has an empty signature. Specifically, 4755 the identity_type is set to "none", the SignatureAndHashAlgorithm 4756 values are set to {0, 0} ("anonymous" and "none" respectively), 4757 and the signature value is of zero length. This removes the need 4758 for the responding node to do signatures for values which do not 4759 exist. These signatures are unnecessary as the entire response is 4760 signed by that node. Note that entries which have been removed by 4761 the procedure of Section 7.4.1.3 and have not yet expired also 4762 have exists = False but have valid signatures from the node which 4763 did the store. 4765 Upon receipt of a FetchAns message, nodes MUST verify the signatures 4766 on all the received values. Any values with invalid signatures 4767 (including expired certificates) MUST be discarded. Note that this 4768 implies that implementations which wish to store data for long 4769 periods of time must have certificates with appropriate expiry dates 4770 or re-store periodically. Implementations MAY return the subset of 4771 values with valid signatures, but in that case SHOULD somehow signal 4772 to the application that a partial response was received. 4774 There is one subtle point about signature computation on arrays. If 4775 the storing node uses the append feature (where the 4776 index=0xffffffff), then the index in the StoredData that is returned 4777 will not match that used by the storing node, which would break the 4778 signature. In order to avoid this issue, the index value in the 4779 array is set to zero before the signature is computed. This implies 4780 that malicious storing nodes can reorder array entries without being 4781 detected. 4783 7.4.3. Stat 4785 The Stat request is used to get metadata (length, generation counter, 4786 digest, etc.) 
for a stored element without retrieving the element 4787 itself. The name is from the UNIX stat(2) system call which performs 4788 a similar function for files in a file system. It also allows the 4789 requesting node to get a list of matching elements without requesting 4790 the entire element. 4792 7.4.3.1. Request Definition 4794 The Stat request is identical to the Fetch request. It simply 4795 specifies the elements to get metadata about. 4797 struct { 4798 ResourceId resource; 4799 StoredDataSpecifier specifiers<0..2^16-1>; 4800 } StatReq; 4802 7.4.3.2. Response Definition 4804 The Stat response contains the same sort of entries that a Fetch 4805 response would contain; however, instead of containing the element 4806 data it contains metadata. 4808 struct { 4809 Boolean exists; 4810 uint32 value_length; 4811 HashAlgorithm hash_algorithm; 4812 opaque hash_value<0..255>; 4813 } MetaData; 4815 struct { 4816 uint32 index; 4817 MetaData value; 4818 } ArrayEntryMeta; 4820 struct { 4821 DictionaryKey key; 4822 MetaData value; 4823 } DictionaryEntryMeta; 4825 struct { 4826 select (DataModel) { 4827 case single_value: 4828 MetaData single_value_entry; 4830 case array: 4831 ArrayEntryMeta array_entry; 4833 case dictionary: 4834 DictionaryEntryMeta dictionary_entry; 4836 /* This structure may be extended */ 4837 }; 4838 } MetaDataValue; 4840 struct { 4841 uint32 value_length; 4842 uint64 storage_time; 4843 uint32 lifetime; 4844 MetaDataValue metadata; 4845 } StoredMetaData; 4847 struct { 4848 KindId kind; 4849 uint64 generation; 4850 StoredMetaData values<0..2^32-1>; 4851 } StatKindResponse; 4853 struct { 4854 StatKindResponse kind_responses<0..2^32-1>; 4855 } StatAns; 4857 The structures used in StatAns parallel those used in FetchAns: a 4858 response consists of multiple StatKindResponse values, one for each 4859 Kind that was in the request. 
The contents of the StatKindResponse 4860 are the same as those in the FetchKindResponse, except that the 4861 values list contains StoredMetaData entries instead of StoredData 4862 entries. 4864 The contents of the StoredMetaData structure are the same as the 4865 corresponding fields in StoredData except that there is no signature 4866 field and the value is a MetaDataValue rather than a StoredDataValue. 4868 A MetaDataValue is a variant structure, like a StoredDataValue, 4869 except for the types of each arm, which replace DataValue with 4870 MetaData. 4872 The only really new structure is MetaData, which has the following 4873 contents: 4875 exists 4876 Same as in DataValue 4878 value_length 4879 The length of the stored value. 4881 hash_algorithm 4882 The hash algorithm used to perform the digest of the value. 4884 hash_value 4885 A digest using hash_algorithm on the value field of the DataValue 4886 including its 4 leading length bytes. 4888 7.4.4. Find 4890 The Find request can be used to explore the Overlay Instance. A Find 4891 request for a Resource-ID R and a Kind-ID T retrieves the Resource-ID 4892 (if any) of the resource of Kind T known to the target peer which is 4893 closest to R. This method can be used to walk the Overlay Instance by 4894 iteratively fetching R_n+1=nearest(1 + R_n). 4896 7.4.4.1. Request Definition 4898 The FindReq message contains a Resource-ID and a series of Kind-IDs 4899 identifying the resource the peer is interested in. 4901 struct { 4902 ResourceId resource; 4903 KindId kinds<0..2^8-1>; 4904 } FindReq; 4906 The request contains a list of Kind-IDs which the Find is for, as 4907 indicated below: 4909 resource 4910 The desired Resource-ID 4912 kinds 4913 The desired Kind-IDs. Each value MUST only appear once, and if 4914 not the request MUST be rejected with an error. 4916 7.4.4.2. 
Response Definition 4918 A response to a successful Find request is a FindAns message 4919 containing the closest Resource-ID on the peer for each Kind 4920 specified in the request. 4922 struct { 4923 KindId kind; 4924 ResourceId closest; 4925 } FindKindData; 4927 struct { 4928 FindKindData results<0..2^16-1>; 4929 } FindAns; 4931 If the processing peer is not responsible for the specified 4932 Resource-ID, it SHOULD return an Error_Not_Found error code. 4934 For each Kind-ID in the request the response MUST contain a 4935 FindKindData indicating the closest Resource-ID for that Kind-ID, 4936 unless the Kind is not allowed to be used with Find in which case a 4937 FindKindData for that Kind-ID MUST NOT be included in the response. 4939 If a Kind-ID is not known, then the corresponding Resource-ID MUST be 4940 0. Note that different Kind-IDs may have different closest Resource- 4941 IDs. 4943 The response is simply a series of FindKindData elements, one per 4944 Kind, concatenated end-to-end. The contents of each element are: 4946 kind 4947 The Kind-ID. 4949 closest 4950 The closest Resource-ID to the specified Resource-ID. This is 0 4951 if no Resource-ID is known. 4953 Note that the response does not contain the contents of the data 4954 stored at these Resource-IDs. If the requester wants this, it must 4955 retrieve it using Fetch. 4957 7.4.5. Defining New Kinds 4959 There are two ways to define a new Kind. The first is by writing a 4960 document and registering the Kind-ID with IANA. This is the 4961 preferred method for Kinds which may be widely used and reused. The 4962 second method is to simply define the Kind and its parameters in the 4963 configuration document using the section of Kind-ID space set aside 4964 for private use. This method MAY be used to define ad hoc Kinds in 4965 new overlays. 4967 However a Kind is defined, the definition MUST include: 4969 o The meaning of the data to be stored (in some textual form). 4971 o The Kind-ID. 
4973 o The data model (single value, array, dictionary, etc). 4975 o The access control model. 4977 In addition, when Kinds are registered with IANA, each Kind is 4978 assigned a short string name which is used to refer to it in 4979 configuration documents. 4981 While each Kind needs to define what data model is used for its data, 4982 that does not mean that it must define new data models. Where 4983 practical, Kinds should use the existing data models. The intention 4984 is that the basic data model set be sufficient for most applications/ 4985 usages. 4987 8. Certificate Store Usage 4989 The Certificate Store usage allows a node to store its certificate in 4990 the overlay. 4992 A user/node MUST store its certificate at Resource-IDs derived from 4993 two Resource Names: 4995 o The user name in the certificate. 4997 o The Node-ID in the certificate. 4999 Note that in the second case the certificate for a peer is not stored 5000 at its Node-ID but rather at a hash of its Node-ID. The intention 5001 here (as is common throughout RELOAD) is to avoid making a peer 5002 responsible for its own data. 5004 New certificates are stored at the end of the list. This structure 5005 allows users to store an old and a new certificate that both have the 5006 same Node-ID, which allows for migration of certificates when they 5007 are renewed. 5009 This usage defines the following Kinds: 5011 Name: CERTIFICATE_BY_NODE 5013 Data Model: The data model for CERTIFICATE_BY_NODE data is array. 5015 Access Control: NODE-MATCH. 5017 Name: CERTIFICATE_BY_USER 5018 Data Model: The data model for CERTIFICATE_BY_USER data is array. 5020 Access Control: USER-MATCH. 5022 9. TURN Server Usage 5024 The TURN server usage allows a RELOAD peer to advertise that it is 5025 prepared to be a TURN server as defined in [RFC5766]. When a node 5026 starts up, it joins the overlay network and forms several connections 5027 in the process. 
   If the ICE stage in any of these connections returns a reflexive
   address that is not the same as the peer's perceived address, then
   the peer is behind a NAT and SHOULD NOT be a candidate for a TURN
   server.  Additionally, if the peer's IP address is in the private
   address space range as defined by [RFC1918], then it also SHOULD NOT
   be a candidate for a TURN server.  Otherwise, the peer SHOULD assume
   it is a potential TURN server and follow the procedures below.

   If the node is a candidate for a TURN server, it will insert some
   pointers in the overlay so that other peers can find it.  The overlay
   configuration file specifies a turn-density parameter that indicates
   how many times each TURN server SHOULD record itself in the overlay.
   Typically this should be set to the reciprocal of the estimated
   fraction of peers that will act as TURN servers.  If the turn-density
   is not set to zero, then for each value d between 1 and turn-density,
   the peer forms a Resource Name by concatenating its Node-ID and the
   value d.  This Resource Name is hashed to form a Resource-ID.  The
   address of the peer is stored at that Resource-ID using type
   TURN-SERVICE and the TurnServer object:

         struct {
           uint8                   iteration;
           IpAddressPort           server_address;
         } TurnServer;

   The contents of this structure are as follows:

   iteration
      the d value

   server_address
      the address at which the TURN server can be contacted.

   Note:  Correct functioning of this algorithm depends on having
      turn-density be a reasonable estimate of the reciprocal of the
      proportion of nodes in the overlay that can act as TURN servers.
      If the turn-density value in the configuration file is too low,
      then the process of finding TURN servers becomes more expensive as
      multiple candidate Resource-IDs must be probed to find a TURN
      server.

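   The registration loop described above can be sketched in Python as
   follows.  This is a minimal illustration under stated assumptions,
   not a normative encoding: the function names and the sample Node-ID
   are hypothetical, and appending d to the Node-ID as a single
   unsigned byte is an assumption, since the text says only that the
   two values are concatenated.  The hash follows Section 10.2 (SHA-1
   truncated to the most significant 128 bits).

```python
import hashlib
from typing import List


def resource_id(resource_name: bytes) -> bytes:
    """SHA-1 of the Resource Name, truncated to the most significant 128 bits."""
    return hashlib.sha1(resource_name).digest()[:16]


def turn_resource_ids(node_id: bytes, turn_density: int) -> List[bytes]:
    """Resource-IDs at which a TURN candidate stores its TurnServer entries.

    Assumption: d is appended as one unsigned byte, mirroring the uint8
    'iteration' field; the specification text does not fix the encoding.
    """
    if turn_density == 0:
        return []  # zero means there are no TURN servers in this overlay
    return [resource_id(node_id + bytes([d])) for d in range(1, turn_density + 1)]


node_id = bytes(range(16))            # made-up 128-bit Node-ID
ids = turn_resource_ids(node_id, 20)
print(len(ids), len(ids[0]))          # number of entries, bytes per Resource-ID
```

   Each entry would carry the matching d in its iteration field, so a
   peer with turn-density 20 appears at 20 pseudo-random points on the
   ring.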
   Peers that provide this service need to support the TURN extensions
   to STUN for media relay as defined in [RFC5766].

   This usage defines the following Kind to indicate that a peer is
   willing to act as a TURN server:

   Name:  TURN-SERVICE

   Data Model:  The TURN-SERVICE Kind stores a single value for each
      Resource-ID.

   Access Control:  NODE-MULTIPLE, with maximum iteration counter 20.

   Peers MAY find other servers by selecting a random Resource-ID and
   then doing a Find request for the appropriate Kind-ID with that
   Resource-ID.  The Find request gets routed to a random peer based on
   the Resource-ID.  If that peer knows of any servers, they will be
   returned.  The returned response may be empty if the peer does not
   know of any servers, in which case the process gets repeated with
   some other random Resource-ID.  As long as the ratio of servers
   relative to peers is not too low, this approach will result in
   finding a server relatively quickly.

   Note to implementers: The certificates used by TurnServer entries
   need to be retained as described in Section 6.3.4.

10.  Chord Algorithm

   This algorithm is assigned the name CHORD-RELOAD to indicate it is an
   adaptation of the basic Chord-based DHT algorithm.

   This algorithm differs from the originally presented Chord algorithm
   [Chord].  It has been updated based on more recent research results
   and implementation experiences, and to adapt it to the RELOAD
   protocol.  A short list of differences:

   o  The original Chord algorithm specified that a single predecessor
      and a successor list be stored.  The CHORD-RELOAD algorithm
      attempts to have more than one predecessor and successor.  The
      predecessor sets help other neighbors learn their successor list.

   o  The original Chord specification and analysis called for iterative
      routing.  RELOAD specifies recursive routing.  In addition to the
      performance implications, the cost of NAT traversal dictates
      recursive routing.

   o  Finger Table entries are indexed in opposite order.  Original
      Chord specifies finger[0] as the immediate successor of the peer.
      CHORD-RELOAD specifies finger[0] as the peer 180 degrees around
      the ring from the peer.  This change was made to simplify
      discussion and implementation of variable sized Finger Tables.
      However, with either approach no more than O(log N) entries should
      typically be stored in a Finger Table.

   o  The stabilize() and fix_fingers() algorithms in the original Chord
      algorithm are merged into a single periodic process.
      Stabilization is implemented slightly differently because of the
      larger neighborhood.  To reduce load, fix_fingers is less
      aggressive, and it does not search for optimal matches of the
      Finger Table entries.

   o  RELOAD allows for a 128 bit hash instead of a 160 bit hash, as
      RELOAD is not designed to be used in networks with close to or
      more than 2^128 nodes or objects (and it is hard to see how one
      would assemble such a network).

   o  RELOAD uses randomized finger entries as described in
      Section 10.7.4.2.

   o  This algorithm allows the use of either reactive or periodic
      recovery.  The original Chord paper used periodic recovery.
      Reactive recovery provides better performance in small overlays,
      but is believed to be unstable in large (>1000) overlays with high
      levels of churn [handling-churn-usenix04].  The overlay
      configuration file specifies a "chord-reactive" element that
      indicates whether reactive recovery should be used.

10.1.  Overview

   The algorithm described here, CHORD-RELOAD, is a modified version of
   the Chord algorithm.
   In Chord (and in the algorithm described here), nodes are arranged
   in a ring with node n being adjacent to nodes n-1 and n+1, with all
   arithmetic being done modulo 2^k, where k is the length of the
   Node-ID in bits, so that node 2^k - 1 is directly before node 0.

   Each peer keeps track of a Finger Table and a Neighbor Table.  The
   Neighbor Table contains at least the three peers before and after
   this peer in the DHT ring.  There may be fewer than three entries in
   some cases, such as in small rings or while the ring topology is
   changing.  The first entry in the Finger Table contains the peer
   half-way around the ring from this peer; the second entry contains
   the peer that is 1/4 of the way around; the third entry contains the
   peer that is 1/8th of the way around, and so on.  Fundamentally, the
   Chord DHT can be thought of as a doubly-linked list formed by knowing
   the successor and predecessor peers in the Neighbor Table, sorted by
   the Node-ID.  As long as the successor peers are correct, the DHT
   will return the correct result.  The pointers to the prior peers are
   kept to enable the insertion of new peers into the list structure.
   Keeping multiple predecessor and successor pointers makes it possible
   to maintain the integrity of the data structure even when consecutive
   peers simultaneously fail.  The Finger Table forms a skip list
   [wikiSkiplist], so that entries in the linked list can be found in
   O(log(N)) time instead of the typical O(N) time that a linked list
   would provide, where N represents the number of nodes in the DHT.

   The Neighbor Table and Finger Table entries contain logical Node-IDs
   as values, but the actual mapping to the IP-level addressing
   information needed to reach that Node-ID is kept in the Connection
   Table.

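   Because every comparison on the ring wraps at 2^k, ordinary "less
   than" tests give wrong answers near zero.  The sketch below shows
   one way to express the wrap-around arithmetic; the function names
   are hypothetical and k is fixed at 128 bits purely for illustration.
   The interval test uses the half-open form (p, x], which is the shape
   of comparison that a responsibility check on this ring needs.

```python
K = 128            # Node-ID length in bits (illustrative choice)
RING = 1 << K      # all arithmetic is modulo 2^K


def clockwise_distance(a: int, b: int) -> int:
    """Distance from a to b travelling in the direction of increasing IDs."""
    return (b - a) % RING


def in_interval(k: int, p: int, x: int) -> bool:
    """True if k lies in the ring interval (p, x], wrapping past zero.

    Note: the degenerate single-peer ring (p == x) is not handled here.
    """
    return 0 < clockwise_distance(p, k) <= clockwise_distance(p, x)


# Node 2^K - 1 is directly before node 0.
print(clockwise_distance(RING - 1, 0))     # 1
# An interval that wraps past zero still contains small IDs.
print(in_interval(5, RING - 10, 20))       # True
print(in_interval(30, RING - 10, 20))      # False
```

   Measuring everything as a clockwise distance from one reference
   point avoids special-casing the wrap at zero.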
   A peer, x, is responsible for a particular Resource-ID k if k is less
   than or equal to x and k is greater than p, where p is the Node-ID of
   the previous peer in the Neighbor Table.  Care must be taken when
   computing this to note that all math is modulo 2^128.

10.2.  Hash Function

   For this Chord-based topology plugin, the size of the Resource-ID is
   128 bits.  The hash of a Resource-ID MUST be computed using SHA-1
   [RFC3174], and the SHA-1 result MUST then be truncated to the most
   significant 128 bits.

10.3.  Routing

   The Routing Table is conceptually the union of the Neighbor Table and
   the Finger Table.

   If a peer is not responsible for a Resource-ID k, but is directly
   connected to a node with Node-ID k, then it MUST route the message to
   that node.  Otherwise, it MUST route the request to the peer in the
   Routing Table that has the largest Node-ID that is in the interval
   between the peer and k.  If no such node is found, it finds the
   smallest Node-ID that is greater than k and MUST route the message to
   that node.

10.4.  Redundancy

   When a peer receives a Store request for Resource-ID k, and it is
   responsible for Resource-ID k, it MUST store the data and return a
   success response.  It MUST then send a Store request to its successor
   in the Neighbor Table and to that peer's successor, incrementing the
   replica number for each successor.  Note that these Store requests
   are addressed to those specific peers, even though the Resource-ID
   they are being asked to store is outside the range that they are
   responsible for.  The peers receiving these SHOULD check they came
   from an appropriate predecessor in their Neighbor Table and that they
   are in a range that this predecessor is responsible for, and then
   they MUST store the data.
   They do not themselves perform further Stores because they can
   determine that they are not responsible for the Resource-ID.

   Note that this topology plugin does not use the replica number for
   any purpose other than distinguishing a replica from a non-replica.

   Managing replicas as the overlay changes is described in
   Section 10.7.3.

   The sequential replicas used in this overlay algorithm protect
   against peer failure but not against malicious peers.  Additional
   replication from the Usage is required to protect resources from such
   attacks, as discussed in Section 13.5.4.

10.5.  Joining

   The join process for a Joining Node (JN) with Node-ID n is as
   follows.

   1.  JN MUST connect to its chosen bootstrap node as specified in
       Section 11.4.

   2.  JN SHOULD send an Attach request to the admitting peer (AP) for
       Resource-ID n+1.  The "send_update" flag can be used to acquire
       the routing table of AP.

   3.  JN SHOULD send Attach requests to initiate connections to each of
       the peers in the Neighbor Table as well as to the desired Finger
       Table entries.  Note that this does not populate their Routing
       Tables, but only their Connection Tables, so JN will not get
       messages that it is expected to route to other nodes.

   4.  JN MUST enter all the peers it has successfully contacted into
       its Routing Table.

   5.  JN MUST send a Join to AP.  The AP MUST send the response to the
       Join.

   6.  AP MUST do a series of Store requests to JN to store the data
       that JN will be responsible for.

   7.  AP MUST send JN an Update explicitly labeling JN as its
       predecessor.  At this point, JN is part of the ring and
       responsible for a section of the overlay.  AP MAY now forget any
       data which is assigned to JN and not AP.  AP SHOULD NOT forget
       any data where AP is in the replica set for the data.

   8.  The AP MUST send an Update to all of its neighbors with the new
       values of its neighbor set (including JN).

   9.  The JN MUST send Updates to all the peers in its Neighbor Table.

   If JN sends an Attach to AP with send_update, it immediately knows
   most of its expected neighbors from AP's Routing Table update and MAY
   directly connect to them.  This is the RECOMMENDED procedure.

   If for some reason JN does not get AP's Routing Table, it MAY still
   populate its Neighbor Table incrementally.  It SHOULD send a Ping
   directed at Resource-ID n+1 (directly after its own Resource-ID).
   This allows it to discover its own successor.  Call that node p0.  It
   then SHOULD send a Ping to p0+1 to discover its successor (p1).  This
   process MAY be repeated to discover as many successors as desired.
   The values for the two peers before p will be found at a later stage
   when n receives an Update.  An alternate procedure is to send
   Attaches to those nodes rather than Pings, which forms the
   connections immediately but may be slower if the nodes need to
   collect ICE candidates, thus reducing parallelism.

   In order to set up its i'th Finger Table entry, JN MUST send an
   Attach to peer n+2^(128-i).  This will be routed to a peer in
   approximately the right location around the ring.  (Note that the
   first entry in the Finger Table has i=1 and not i=0 in this
   formulation.)

   The joining node MUST NOT send any Update message placing itself in
   the overlay until it has successfully completed an Attach with each
   peer that should be in its Neighbor Table.

10.6.  Routing Attaches

   When a peer needs to Attach to a new peer in its Neighbor Table, it
   MUST source-route the Attach request through the peer from which it
   learned the new peer's Node-ID.  Source-routing these requests allows
   the overlay to recover from instability.

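   For the Finger Table Attach targets defined in Section 10.5, the
   following sketch computes n+2^(128-i) with the required wrap-around.
   The function name and the sample Node-ID are illustrative only; the
   formula and the convention that the first entry has i=1 come from
   the text.

```python
RING_BITS = 128
RING = 1 << RING_BITS


def finger_target(n: int, i: int) -> int:
    """Resource-ID to which the Attach for the i'th finger is addressed (i >= 1)."""
    if not 1 <= i <= RING_BITS:
        raise ValueError("finger index must be in 1..128")
    return (n + (1 << (RING_BITS - i))) % RING


n = 0xF0 << 120                      # made-up Node-ID near the top of the ring
for i in (1, 2, 3):
    offset = (finger_target(n, i) - n) % RING
    print(i, offset == RING >> i)    # True: 1/2, 1/4, 1/8 of the way around
```

   Because these Attaches are addressed to a Resource-ID rather than to
   a known peer, whichever peer is responsible for that position answers
   them.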
   All other Attach requests, such as those for new Finger Table
   entries, are routed conventionally through the overlay.

10.7.  Updates

   An Update for this DHT is defined as

         enum { invalidChordUpdateType(0),
                peer_ready(1), neighbors(2), full(3), (255) }
              ChordUpdateType;

         struct {
            uint32                 uptime;
            ChordUpdateType        type;
            select (type){
             case peer_ready:                   /* Empty */
               ;

             case neighbors:
               NodeId              predecessors<0..2^16-1>;
               NodeId              successors<0..2^16-1>;

             case full:
               NodeId              predecessors<0..2^16-1>;
               NodeId              successors<0..2^16-1>;
               NodeId              fingers<0..2^16-1>;
            };
         } ChordUpdate;

   The "uptime" field contains the time this peer has been up in
   seconds.

   The "type" field contains the type of the update, which depends on
   the reason the update was sent.

   peer_ready:  this peer is ready to receive messages.  This message
      is used to indicate that a node which has Attached is a peer and
      can be routed through.  It is also used as a connectivity check to
      non-neighbor peers.

   neighbors:  this version is sent to members of the Chord Neighbor
      Table.

   full:  this version is sent to peers which request an Update with a
      RouteQueryReq.

   If the message is of type "neighbors", then the contents of the
   message will be:

   predecessors
      The predecessor set of the Updating peer.

   successors
      The successor set of the Updating peer.

   If the message is of type "full", then the contents of the message
   will be:

   predecessors
      The predecessor set of the Updating peer.

   successors
      The successor set of the Updating peer.

   fingers
      The Finger Table of the Updating peer, in numerically ascending
      order.

   A peer MUST maintain an association (via Attach) to every member of
   its neighbor set.
   A peer MUST attempt to maintain at least three predecessors and three
   successors, even though this will not be possible if the ring is very
   small.  It is RECOMMENDED that O(log(N)) predecessors and successors
   be maintained in the neighbor set.  There are many ways to estimate
   N, some of which are discussed in [I-D.ietf-p2psip-self-tuning].

10.7.1.  Handling Neighbor Failures

   Every time a connection to a peer in the Neighbor Table is lost (as
   determined by connectivity pings or the failure of some request), the
   peer MUST remove the entry from its Neighbor Table and replace it
   with the best match it has from the other peers in its Routing Table.
   If using reactive recovery, it MUST send an immediate Update to all
   nodes in its Neighbor Table.  The Update will contain all the
   Node-IDs of the current entries of the table (after the failed one
   has been removed).  Note that when replacing a successor, the peer
   SHOULD delay the creation of new replicas for the successor
   replacement hold-down time (30 seconds) after removing the failed
   entry from its Neighbor Table, in order to allow a triggered Update
   to inform it of a better match for its Neighbor Table.

   If the neighbor failure affects the peer's range of responsible IDs,
   then the Update MUST be sent to all nodes in its Connection Table.

   A peer MAY attempt to reestablish connectivity with a lost neighbor
   either by waiting additional time to see if connectivity returns or
   by actively routing a new Attach to the lost peer.  Details for these
   procedures are beyond the scope of this document.  Even when such an
   attempt is made, the lost peer MUST be removed from the Neighbor
   Table.  Such a peer is returned to the Neighbor Table once
   connectivity is reestablished.

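   The replacement step above can be sketched as follows.  This is only
   an illustration: the function name and the sample Node-IDs are
   hypothetical, and reading "best match" as "smallest clockwise
   distance from this peer" is an assumption rather than something the
   text defines.  Only the successor side is shown, and the 30-second
   hold-down before creating replicas is not modeled.

```python
RING = 1 << 128


def replace_failed_successor(self_id, successors, routing_table, failed, size=3):
    """Drop 'failed' and refill the successor list with the closest known peers.

    Assumption: "best match" means smallest clockwise distance from self_id.
    """
    candidates = (set(successors) | set(routing_table)) - {failed, self_id}
    ranked = sorted(candidates, key=lambda peer: (peer - self_id) % RING)
    return ranked[:size]


me = 100
succ = [110, 120, 130]
table = [110, 120, 130, 150, 400, 90]    # neighbors plus fingers (made-up IDs)
print(replace_failed_successor(me, succ, table, failed=120))   # [110, 130, 150]
```

   With reactive recovery, the resulting list is what the immediate
   Update to the Neighbor Table would carry.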
   If connectivity is lost to all successor peers in the Neighbor Table,
   then this peer SHOULD behave as if it is joining the network and MUST
   use Pings to find a peer and send it a Join.  If connectivity is lost
   to all the peers in the Finger Table, this peer SHOULD assume that it
   has been disconnected from the rest of the network, and it SHOULD
   periodically try to join the DHT.

10.7.2.  Handling Finger Table Entry Failure

   If a Finger Table entry is found to have failed (as determined by
   connectivity pings or the failure of some request), all references to
   the failed peer MUST be removed from the Finger Table and replaced
   with the closest preceding peer from the Finger Table or Neighbor
   Table.

   If using reactive recovery, the peer MUST initiate a search for a new
   Finger Table entry as described below.

10.7.3.  Receiving Updates

   When a peer, x, receives an Update request, it examines the Node-IDs
   in the UpdateReq and in its Neighbor Table and decides if this
   UpdateReq would change its Neighbor Table.  This is done by taking
   the set of peers currently in the Neighbor Table and comparing them
   to the peers in the Update request.  There are two major cases:

   o  The UpdateReq contains peers that match x's Neighbor Table, so no
      change is needed to the neighbor set.

   o  The UpdateReq contains peers x does not know about that should be
      in x's Neighbor Table, i.e., they are closer than entries in the
      Neighbor Table.

   In the first case, no change is needed.

   In the second case, x MUST attempt to Attach to the new peers, and if
   it is successful, it MUST adjust its neighbor set accordingly.  Note
   that it can maintain the now inferior peers as neighbors, but it MUST
   remember the closer ones.

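   The comparison between an UpdateReq and the current Neighbor Table
   can be sketched as below.  The function name and Node-IDs are
   hypothetical, only the successor side is shown (predecessors would
   use the distance measured in the opposite direction), and a target
   size of three is assumed from the minimum neighbor count stated
   earlier.  An empty result corresponds to the first case; a non-empty
   result lists the peers x would try to Attach to in the second case.

```python
RING = 1 << 128


def closer_peers(self_id, successors, advertised, size=3):
    """Peers from an Update that would displace or extend the current successors."""
    def dist(peer):
        return (peer - self_id) % RING

    known = set(successors)
    merged = sorted((known | set(advertised)) - {self_id}, key=dist)[:size]
    return [p for p in merged if p not in known]


me = 100
print(closer_peers(me, [110, 130, 150], [110, 130]))         # [] (first case)
print(closer_peers(me, [110, 130, 150], [105, 120, 500]))    # [105, 120] (second case)
```

   The neighbor set is adjusted only after the Attach to each returned
   peer succeeds, as the text requires.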
   After any Pings and Attaches are done, if the Neighbor Table changes
   and the peer is using reactive recovery, the peer MUST send an Update
   request to each member of its Connection Table.  These Update
   requests are what end up filling in the predecessor/successor tables
   of peers that this peer is a neighbor to.  A peer MUST NOT enter
   itself in its successor or predecessor table and instead should leave
   the entries empty.

   If peer x is responsible for a Resource-ID R, and x discovers that
   the replica set for R (the next two nodes in its successor set) has
   changed, it MUST send a Store for any data associated with R to any
   new node in the replica set.  It SHOULD NOT delete data from peers
   which have left the replica set.

   When a peer x detects that it is no longer in the replica set for a
   resource R (i.e., there are three predecessors between x and R), it
   SHOULD delete all data associated with R from its local store.

   When a peer discovers that its range of responsible IDs has changed,
   it MUST send an Update to all entries in its Connection Table.

10.7.4.  Stabilization

   There are four components to stabilization:

   1.  exchange Updates with all peers in its Neighbor Table to exchange
       state.

   2.  search for better peers to place in its Finger Table.

   3.  search to determine if the current Finger Table size is
       sufficiently large.

   4.  search to determine if the overlay has partitioned and needs to
       recover.

10.7.4.1.  Updating Neighbor Table

   A peer MUST periodically send an Update request to every peer in its
   Neighbor Table.  The purpose of this is to keep the predecessor and
   successor lists up to date and to detect failed peers.  The default
   time is about every ten minutes, but the configuration server SHOULD
   set this in the configuration document using the
   "chord-update-interval" element (denominated in seconds).
   A peer SHOULD randomly offset these Update requests so they do not
   occur all at once.

10.7.4.2.  Refreshing Finger Table

   A peer MUST periodically search for new peers to replace invalid
   entries in the Finger Table.  For peer x, the i'th Finger Table entry
   is valid if it is in the range [x+2^(128-i), x+2^(128-(i-1))-1].
   Invalid entries occur in the Finger Table when a previous Finger
   Table entry has failed or when no peer has been found in that range.

   Two possible methods for searching for new peers for the Finger Table
   entries are presented:

   Alternative 1: A peer selects one entry in the Finger Table from
   among the invalid entries.  It pings for a new peer for that Finger
   Table entry.  The selection SHOULD be exponentially weighted to
   attempt to replace earlier (lower i) entries in the Finger Table.  A
   simple way to implement this selection is to search through the
   Finger Table entries from i=1 and each time an invalid entry is
   encountered, send a Ping to replace that entry with probability 0.5.

   Alternative 2: A peer monitors the Update messages received from its
   connections to observe when an Update indicates a peer that would be
   used to replace an invalid Finger Table entry, i, and flags that
   entry in the Finger Table.  Every "chord-ping-interval" seconds, the
   peer selects from among those flagged candidates using an
   exponentially weighted probability as above.

   When searching for a better entry, the peer SHOULD send the Ping to a
   Node-ID selected randomly from that range.  Random selection is
   preferred over a search for strictly spaced entries to minimize the
   effect of churn on overlay routing [minimizing-churn-sigcomm06].  An
   implementation or subsequent specification MAY choose a method for
   selecting Finger Table entries other than choosing randomly within
   the range.
   Any such alternate methods SHOULD be employed only on Finger Table
   stabilization and not for the selection of initial Finger Table
   entries unless the alternative method is faster and imposes less
   overhead on the overlay.

   A peer SHOULD NOT send Ping requests looking for new Finger Table
   entries more often than once every "chord-ping-interval" seconds;
   this configuration element defaults to 3600 seconds (once per hour).

   A peer MAY choose to keep connections to multiple peers that can act
   for a given Finger Table entry.

10.7.4.3.  Adjusting Finger Table size

   If the Finger Table has fewer than 16 entries, the node SHOULD
   attempt to discover more fingers to grow the size of the table to 16.
   The value 16 was chosen to ensure high odds of a node maintaining
   connectivity to the overlay even with strange network partitions.

   For many overlays, 16 Finger Table entries will be enough, but as an
   overlay grows very large, more than 16 entries may be required in the
   Finger Table for efficient routing.  An implementation SHOULD be
   capable of increasing the number of entries in the Finger Table to
   128 entries.

   Although log(N) entries are all that are required for optimal
   performance, careful implementation of stabilization will result in
   no additional traffic being generated when maintaining a Finger Table
   larger than log(N) entries.  Implementers are encouraged to make use
   of RouteQuery and algorithms for determining where new Finger Table
   entries may be found.  Complete details of possible implementations
   are outside the scope of this specification.

   A simple approach to sizing the Finger Table is to ensure the Finger
   Table is large enough to contain at least the final successor in the
   peer's Neighbor Table.

10.7.4.4.  Detecting partitioning

   To detect that a partitioning has occurred and to heal the overlay, a
   peer P MUST periodically repeat the discovery process used in the
   initial join for the overlay to locate an appropriate bootstrap node,
   B.  P SHOULD then send a Ping for its own Node-ID routed through B.
   If a response is received from a peer S', which is not P's successor,
   then the overlay is partitioned and P SHOULD send an Attach to S'
   routed through B, followed by an Update sent to S'.  (Note that S'
   may not be in P's Neighbor Table once the overlay is healed, but the
   connection will allow S' to discover appropriate neighbor entries for
   itself via its own stabilization.)

   Future specifications may describe alternative mechanisms for
   determining when to repeat the discovery process.

10.8.  Route query

   For CHORD-RELOAD, the RouteQueryReq contains no additional
   information.  The RouteQueryAns contains the single Node-ID of the
   next peer to which the responding peer would have routed the request
   message in recursive routing:

         struct {
            NodeId                  next_peer;
         } ChordRouteQueryAns;

   The contents of this structure are as follows:

   next_peer
      The peer to which the responding peer would route the message in
      order to deliver it to the destination listed in the request.

   If the requester has set the send_update flag, the responder SHOULD
   initiate an Update immediately after sending the RouteQueryAns.

10.9.  Leaving

   To support extensions, such as [I-D.ietf-p2psip-self-tuning], Peers
   SHOULD send a Leave request to all members of their Neighbor Table
   prior to exiting the Overlay Instance.
   The overlay_specific_data field MUST contain the ChordLeaveData
   structure defined below:

         enum { invalidChordLeaveType(0),
                 from_succ(1), from_pred(2), (255) }
               ChordLeaveType;

         struct {
           ChordLeaveType         type;

           select (type) {
             case from_succ:
               NodeId              successors<0..2^16-1>;

             case from_pred:
               NodeId              predecessors<0..2^16-1>;
           };
         } ChordLeaveData;

   The 'type' field indicates whether the Leave request was sent by a
   predecessor or a successor of the recipient:

   from_succ
      The Leave request was sent by a successor.

   from_pred
      The Leave request was sent by a predecessor.

   If the type of the request is 'from_succ', the contents will be:

   successors
      The sender's successor list.

   If the type of the request is 'from_pred', the contents will be:

   predecessors
      The sender's predecessor list.

   Any peer which receives a Leave for a peer n in its neighbor set MUST
   follow procedures as if it had detected a peer failure as described
   in Section 10.7.1.

11.  Enrollment and Bootstrap

   This section defines the format of the configuration data as well as
   the process to join a new overlay.

11.1.  Overlay Configuration

   This specification defines a new content type
   "application/p2p-overlay+xml" for a MIME entity that contains overlay
   information.  An example document is shown below.

5652 5653 5656 5658 CHORD-RELOAD 5659 16 5660 5661 MIIDJDCCAo2gAwIBAgIBADANBgkqhkiG9w0BAQUFADBwMQswCQYDVQQGEwJVUzET 5662 MBEGA1UECBMKQ2FsaWZvcm5pYTERMA8GA1UEBxMIU2FuIEpvc2UxDjAMBgNVBAoT 5663 BXNpcGl0MSkwJwYDVQQLEyBTaXBpdCBUZXN0IENlcnRpZmljYXRlIEF1dGhvcml0 5664 eTAeFw0wMzA3MTgxMjIxNTJaFw0xMzA3MTUxMjIxNTJaMHAxCzAJBgNVBAYTAlVT 5665 MRMwEQYDVQQIEwpDYWxpZm9ybmlhMREwDwYDVQQHEwhTYW4gSm9zZTEOMAwGA1UE 5666 ChMFc2lwaXQxKTAnBgNVBAsTIFNpcGl0IFRlc3QgQ2VydGlmaWNhdGUgQXV0aG9y 5667 aXR5MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDDIh6DkcUDLDyK9BEUxkud 5668 +nJ4xrCVGKfgjHm6XaSuHiEtnfELHM+9WymzkBNzZpJu30yzsxwfKoIKugdNUrD4 5669 N3viCicwcN35LgP/KnbN34cavXHr4ZlqxH+OdKB3hQTpQa38A7YXdaoz6goW2ft5 5670 Mi74z03GNKP/G9BoKOGd5QIDAQABo4HNMIHKMB0GA1UdDgQWBBRrRhcU6pR2JYBU 5671 bhNU2qHjVBShtjCBmgYDVR0jBIGSMIGPgBRrRhcU6pR2JYBUbhNU2qHjVBShtqF0 5672 pHIwcDELMAkGA1UEBhMCVVMxEzARBgNVBAgTCkNhbGlmb3JuaWExETAPBgNVBAcT 5673 CFNhbiBKb3NlMQ4wDAYDVQQKEwVzaXBpdDEpMCcGA1UECxMgU2lwaXQgVGVzdCBD 5674 ZXJ0aWZpY2F0ZSBBdXRob3JpdHmCAQAwDAYDVR0TBAUwAwEB/zANBgkqhkiG9w0B 5675 AQUFAAOBgQCWbRvv1ZGTRXxbH8/EqkdSCzSoUPrs+rQqR0xdQac9wNY/nlZbkR3O 5676 qAezG6Sfmklvf+DOg5RxQq/+Y6I03LRepc7KeVDpaplMFGnpfKsibETMipwzayNQ 5677 QgUf4cKBiF+65Ue7hZuDJa2EMv8qW4twEhGDYclpFU9YozyS1OhvUg== 5678 5679 YmFkIGNlcnQK 5680 https://example.org 5681 https://example.net 5682 false 5684 5685 5686 5687 20 5688 false 5689 false 5690 5691 400 5692 30 5693 true 5694 password 5695 4000 5696 30 5697 3000 5698 TLS 5699 47112162e84c69ba 5700 47112162e84c69ba 5701 6eba45d31a900c06 5702 6ebc45d31a900c06 5703 6ebc45d31a900ca6 5705 foo 5707 5708 urn:ietf:params:xml:ns:p2p:config-ext1 5709 5711 5712 5713 5714 SINGLE 5715 USER-MATCH 5716 1 5717 100 5718 5719 5720 VGhpcyBpcyBub3QgcmlnaHQhCg== 5721 5722 5723 5724 5725 ARRAY 5726 NODE-MULTIPLE 5727 3 5728 22 5729 4 5730 1 5731 5732 5733 5734 VGhpcyBpcyBub3QgcmlnaHQhCg== 5735 5736 5737 5738 5739 VGhpcyBpcyBub3QgcmlnaHQhCg== 5741 5742 5743 VGhpcyBpcyBub3QgcmlnaHQhCg== 5745 5747 The file MUST be a well formed XML document and 
   it SHOULD contain an encoding declaration in the XML declaration.
   The file MUST use the UTF-8 character encoding.  The namespaces for
   the elements defined in this specification are
   "urn:ietf:params:xml:ns:p2p:config-base" and
   "urn:ietf:params:xml:ns:p2p:config-chord".

   Note that elements or attributes that are defined as type xsd:boolean
   in the RELAX NG schema (Section 11.1.1) have two lexical
   representations, "1" or "true" for the concept true and "0" or
   "false" for the concept false.  Whitespace and case processing
   follows the rules of [OASIS.relax_ng] and XML Schema Datatypes
   [W3C.REC-xmlschema-2-20041028].

   The file MAY contain multiple "configuration" elements where each one
   contains the configuration information for a different overlay.  Each
   configuration element MAY be followed by signature elements that
   provide a signature over the preceding configuration element.  Each
   configuration element has the following attributes:

   instance-name:  the name of the overlay (referred to as "overlay
      name" in this specification)

   expiration:  time in the future at which this overlay configuration
      is no longer valid.  The node SHOULD retrieve a new copy of the
      configuration at a randomly selected time that is before the
      expiration time.  Note that if the certificates expire before a
      new configuration is retrieved, the node will not be able to
      validate the configuration file.  All times MUST conform to the
      Internet Date/Time Format defined in [RFC3339] and be specified
      using Coordinated Universal Time (UTC).

   sequence:  a monotonically increasing sequence number between 0 and
      2^16-2

   Inside each overlay element, the following elements can occur:

   topology-plugin  This element defines the overlay algorithm being
      used.  If missing, the default is "CHORD-RELOAD".

   node-id-length  This element contains the length of a NodeId
      (NodeIdLength) in bytes.  This value MUST be between 16 (128 bits)
      and 20 (160 bits).  If this element is not present, the default of
      16 is used.

   root-cert  This element contains a base-64 encoded X.509v3
      certificate that is a root trust anchor used to sign all
      certificates in this overlay.  There can be more than one
      root-cert element.

   enrollment-server  This element contains the URL at which the
      enrollment server can be reached in a "url" element.  This URL
      MUST be of type "https:".  More than one enrollment-server element
      MAY be present.  Note that there is no necessary relationship
      between the overlay name/configuration server name and the
      enrollment server name.

   self-signed-permitted  This element indicates whether self-signed
      certificates are permitted.  If it is set to "true", then
      self-signed certificates are allowed, in which case the
      enrollment-server and root-cert elements MAY be absent.
      Otherwise, it SHOULD be absent, but MAY be set to "false".  This
      element also contains an attribute "digest" which indicates the
      digest to be used to compute the Node-ID.  Valid values for this
      parameter are "sha1" and "sha256" representing SHA-1 [RFC3174] and
      SHA-256 [RFC6234] respectively.  Implementations MUST support both
      of these algorithms.

   bootstrap-node  This element represents the address of one of the
      bootstrap nodes.  It has an attribute called "address" that
      represents the IP address (either IPv4 or IPv6, since they can be
      distinguished) and an optional attribute called "port" that
      represents the port and defaults to 6084.  The IPv6 address is in
      typical hexadecimal form using standard period and colon
      separators as specified in [RFC5952].  More than one
      bootstrap-node element MAY be present.

   turn-density  This element is a positive integer that represents the
      approximate reciprocal of the density of nodes that can act as
      TURN servers.  For example, if 5% of the nodes can act as TURN
      servers, this would be set to 20.  If it is not present, the
      default value is 1.  If there are no TURN servers in the overlay,
      it is set to zero.

   clients-permitted  This element represents whether clients are
      permitted or whether all nodes must be peers.  If clients are
      permitted, the element MUST be set to "true" or absent.  If the
      nodes are not allowed to remain clients after the initial join,
      the element MUST be set to "false".  There is currently no way for
      the overlay to enforce this.

   no-ice  This element represents whether nodes are REQUIRED to use
      the "No-ICE" Overlay Link protocols in this overlay.  If it is
      absent, it is treated as if it were set to "false".

   chord-update-interval  The update frequency for the CHORD-RELOAD
      topology plugin (see Section 10).

   chord-ping-interval  The ping frequency for the CHORD-RELOAD
      topology plugin (see Section 10).

   chord-reactive  Whether reactive recovery SHOULD be used for this
      overlay.  Set to "true" or "false".  Default if missing is "true"
      (see Section 10).

   shared-secret  If shared secret mode is used, this contains the
      shared secret.  The security guarantee here is that any agent
      which is able to access the configuration document (presumably
      protected by some sort of HTTP access control or network topology)
      is able to recover the shared secret and hence join the overlay.

   max-message-size  Maximum size in bytes of any message in the
      overlay.  If this value is not present, the default is 5000.

   initial-ttl  Initial default TTL (time to live, see Section 6.3.2)
      for messages.  If this value is not present, the default is 100.

   overlay-reliability-timer  Default value for the end-to-end
      retransmission timer for messages, in milliseconds.  If not
      present, the default value is 3000.  The value MUST be at least
      200 milliseconds, which means the minimum time delay before
      dropping a link is 1000 milliseconds.

   overlay-link-protocol  Indicates a permissible overlay link protocol
      (see Section 6.6.1 for requirements for such protocols).  An
      arbitrary number of these elements may appear.  If none appear,
      then this implies the default value, "TLS", which refers to the
      use of TLS and DTLS.  If one or more elements appear, then no
      default value applies.

   kind-signer  This contains a single Node-ID in hexadecimal and
      indicates that the certificate with this Node-ID is allowed to
      sign Kinds.  Identifying kind-signer by Node-ID instead of
      certificate allows the use of short lived certificates without
      constantly having to provide an updated configuration file.

   configuration-signer  This contains a single Node-ID in hexadecimal
      and indicates that the certificate with this Node-ID is allowed to
      sign configurations for this instance-name.  Identifying the
      signer by Node-ID instead of certificate allows the use of short
      lived certificates without constantly having to provide an updated
      configuration file.

   bad-node  This contains a single Node-ID in hexadecimal and
      indicates that the certificate with this Node-ID MUST NOT be
      considered valid.  This allows certificate revocation.  An
      arbitrary number of these elements can be provided.  Note that
      because certificates may expire, bad-node entries need only be
      present for the lifetime of the certificate.  Technically
      speaking, bad Node-IDs may be reused once their certificates have
      expired, but the requirement for Node-IDs to be pseudo-randomly
      generated gives this event a vanishing probability.
5899 mandatory-extension This element contains the name of an XML 5900 namespace that a node joining the overlay MUST support. The 5901 presence of a mandatory-extension element does not require the 5902 extension to be used in the current configuration file, but can 5903 indicate that it may be used in the future. Note that the 5904 namespace is case-sensitive, as specified in [w3c-xml-namespaces] 5905 Section 2.3. More than one mandatory-extension element MAY be 5906 present. 5908 Inside each configuration element, the required-kinds element MAY 5909 also occur. This element indicates the Kinds that members MUST 5910 support and contains multiple kind-block elements that each define a 5911 single Kind that MUST be supported by nodes in the overlay. Each 5912 kind-block consists of a single kind element and a kind-signature. 5913 The kind element defines the Kind. The kind-signature is the 5914 signature computed over the kind element. 5916 Each kind element has either an id attribute or a name attribute. 5917 The name attribute is a string representing the Kind (the name 5918 registered to IANA) while the id is an integer Kind-ID allocated out 5919 of private space. 5921 In addition, the kind element MUST contain the following elements: 5923 max-count: the maximum number of values which members of the overlay 5924 must support. 5926 data-model: the data model to be used. 5928 max-size: the maximum size of individual values. 5930 access-control: the access control model to be used. 5932 The kind element MAY also contain the following element: 5934 max-node-multiple: if the access control is NODE-MULTIPLE, this 5935 element MUST be included. This indicates the maximum value for 5936 the i counter. It MUST be an integer greater than 0. 5938 All of the non optional values MUST be provided. 
If the Kind is 5939 registered with IANA, the data-model and access-control elements MUST 5940 match those in the Kind registration, and clients MUST ignore them in 5941 favor of the IANA versions. Multiple kind-block elements MAY be 5942 present. 5944 The kind-block element also MUST contain a "kind-signature" element. 5945 This signature is computed across the kind element from the beginning 5946 of the first < of the kind element to the end of the last > of the 5947 kind element in the same way as the signature element described later 5948 in this section. kind-block elements MUST be signed by a node listed 5949 in the kind-signers block of the current configuration. Receivers 5950 MUST verify the signature prior to accepting a kind-block. 5952 The configuration element MUST be treated as a binary blob that 5953 cannot be changed - including any whitespace changes - or the 5954 signature will break. The signature MUST be computed by taking each 5955 configuration element and starting from, and including, the first < 5956 at the start of <configuration> up to and including the > in 5957 </configuration> and treating this as a binary blob that MUST be 5958 signed using the standard SecurityBlock defined in Section 6.3.4. 5959 The SecurityBlock MUST be base 64 encoded using the base64 alphabet 5960 from [RFC4648] and MUST be put in the signature element following the 5961 configuration object in the configuration file. Any configuration 5962 file MUST be signed by one of the configuration-signer elements from 5963 the previous extant configuration. Recipients MUST verify the 5964 signature prior to accepting the configuration file. 5966 When a node receives a new configuration file, it MUST change its 5967 configuration to meet the new requirements. This may require the 5968 node to exit the DHT and re-join. If a node is not capable of 5969 supporting the new requirements, it MUST exit the overlay.
If some 5970 information about a particular Kind changes from what the node 5971 previously knew about the Kind (for example the max size), the new 5972 information in the configuration files overrides any previously 5973 learned information. If any Kind data was signed by a node that is 5974 no longer allowed to sign Kinds, that Kind MUST be discarded along 5975 with any stored information of that Kind. Note that forcing an 5976 avalanche restart of the overlay with a configuration change that 5977 requires re-joining the overlay may result in serious performance 5978 problems, including total collapse of the network if configuration 5979 parameters are not properly considered. Such an event may be 5980 necessary in case of a compromised CA or similar problem, but for 5981 large overlays should be avoided in almost all circumstances. 5983 11.1.1. RELAX NG Grammar 5985 The grammar for the configuration data is: 5987 namespace chord = "urn:ietf:params:xml:ns:p2p:config-chord" 5988 namespace local = "" 5989 default namespace p2pcf = "urn:ietf:params:xml:ns:p2p:config-base" 5990 namespace rng = "http://relaxng.org/ns/structure/1.0" 5992 anything = 5993 (element * { anything } 5994 | attribute * { text } 5995 | text)* 5997 foreign-elements = element * - (p2pcf:* | local:* | chord:*) 5998 { anything }* 5999 foreign-attributes = attribute * - (p2pcf:*|local:*|chord:*) 6000 { text }* 6001 foreign-nodes = (foreign-attributes | foreign-elements)* 6003 start = element p2pcf:overlay { 6004 overlay-element 6005 } 6007 overlay-element &= element configuration { 6008 attribute instance-name { xsd:string }, 6009 attribute expiration { xsd:dateTime }?, 6010 attribute sequence { xsd:long }?, 6011 foreign-attributes*, 6012 parameter 6013 }+ 6014 overlay-element &= element signature { 6015 attribute algorithm { signature-algorithm-type }?, 6016 xsd:base64Binary 6017 }* 6019 signature-algorithm-type |= "rsa-sha1" 6020 signature-algorithm-type |= xsd:string # signature alg extensions 
6022 parameter &= element topology-plugin { topology-plugin-type }? 6023 topology-plugin-type |= xsd:string # topo plugin extensions 6024 parameter &= element max-message-size { xsd:unsignedInt }? 6025 parameter &= element initial-ttl { xsd:int }? 6026 parameter &= element root-cert { xsd:base64Binary }* 6027 parameter &= element required-kinds { kind-block* }? 6028 parameter &= element enrollment-server { xsd:anyURI }* 6029 parameter &= element kind-signer { xsd:string }* 6030 parameter &= element configuration-signer { xsd:string }* 6031 parameter &= element bad-node { xsd:string }* 6032 parameter &= element no-ice { xsd:boolean }? 6033 parameter &= element shared-secret { xsd:string }? 6034 parameter &= element overlay-link-protocol { xsd:string }* 6035 parameter &= element clients-permitted { xsd:boolean }? 6036 parameter &= element turn-density { xsd:unsignedByte }? 6037 parameter &= element node-id-length { xsd:int }? 6038 parameter &= element mandatory-extension { xsd:string }* 6039 parameter &= foreign-elements* 6041 parameter &= 6042 element self-signed-permitted { 6043 attribute digest { self-signed-digest-type }, 6044 xsd:boolean 6045 }? 6046 self-signed-digest-type |= "sha1" 6047 self-signed-digest-type |= xsd:string # signature digest extensions 6049 parameter &= element bootstrap-node { 6050 attribute address { xsd:string }, 6051 attribute port { xsd:int }? 6052 }* 6054 kind-block = element kind-block { 6055 element kind { 6056 ( attribute name { kind-names } 6057 | attribute id { xsd:unsignedInt } ), 6058 kind-parameter 6059 } & 6060 element kind-signature { 6061 attribute algorithm { signature-algorithm-type }?, 6062 xsd:base64Binary 6063 }? 6064 } 6066 kind-parameter &= element max-count { xsd:int } 6067 kind-parameter &= element max-size { xsd:int } 6068 kind-parameter &= element max-node-multiple { xsd:int }? 
6070 kind-parameter &= element data-model { data-model-type } 6071 data-model-type |= "SINGLE" 6072 data-model-type |= "ARRAY" 6073 data-model-type |= "DICTIONARY" 6074 data-model-type |= xsd:string # data model extensions 6076 kind-parameter &= element access-control { access-control-type } 6077 access-control-type |= "USER-MATCH" 6078 access-control-type |= "NODE-MATCH" 6079 access-control-type |= "USER-NODE-MATCH" 6080 access-control-type |= "NODE-MULTIPLE" 6081 access-control-type |= xsd:string # access control extensions 6083 kind-parameter &= foreign-elements* 6085 kind-names |= "TURN-SERVICE" 6086 kind-names |= "CERTIFICATE_BY_NODE" 6087 kind-names |= "CERTIFICATE_BY_USER" 6088 kind-names |= xsd:string # kind extensions 6090 # Chord specific parameters 6091 topology-plugin-type |= "CHORD-RELOAD" 6092 parameter &= element chord:chord-ping-interval { xsd:int }? 6093 parameter &= element chord:chord-update-interval { xsd:int }? 6094 parameter &= element chord:chord-reactive { xsd:boolean }? 6096 11.2. Discovery Through Configuration Server 6098 When a node first enrolls in a new overlay, it starts with a 6099 discovery process to find a configuration server. 6101 The node MAY start by determining the overlay name. This value MUST 6102 be provided by the user or some other out of band provisioning 6103 mechanism. The out of band mechanisms MAY also provide an optional 6104 URL for the configuration server. If a URL for the configuration 6105 server is not provided, the node MUST do a DNS SRV query using a 6106 Service name of "reload-config" and a protocol of TCP to find a 6107 configuration server and form the URL by appending a path of "/.well- 6108 known/reload-config" to the overlay name. This uses the "well known 6109 URI" framework defined in [RFC5785]. For example, if the overlay 6110 name was example.com, the URL would be 6111 "https://example.com/.well-known/reload-config". 
6113 Once an address and URL for the configuration server are determined, 6114 the peer MUST form an HTTPS connection to that IP address. If an 6115 optional URL for the configuration server was provided, the 6116 certificate MUST match the domain name from the URL as described in 6117 [RFC2818]; otherwise the certificate MUST match the overlay name as 6118 described in [RFC2818]. If the HTTPS certificate passes the name 6119 matching, the node MUST fetch a new copy of the configuration file. 6120 To do this, the peer performs a GET to the URL. The result of the 6121 HTTP GET is an XML configuration file as described above. If the XML is 6122 not valid, or the instance-name attribute of the overlay-element in 6123 the XML does not match the overlay name, this configuration file 6124 SHOULD be discarded. Otherwise, the new configuration MUST replace 6125 any previously learned configuration file for this overlay. 6127 For overlays that do not use a configuration server, nodes MUST 6128 obtain the configuration information needed to join the overlay 6129 through some out of band approach such as an XML configuration file 6130 sent over email. 6132 11.3. Credentials 6134 If the configuration document contains an enrollment-server element, 6135 credentials are REQUIRED to join the Overlay Instance. A peer which 6136 does not yet have credentials MUST contact the enrollment server to 6137 acquire them. 6139 RELOAD defines its own trivial certificate request protocol. We 6140 would have liked to have used an existing protocol but were concerned 6141 about the implementation burden of even the simplest of those 6142 protocols, such as [RFC5272] and [RFC5273]. The objective was to 6143 have a protocol which could be easily implemented in a Web server 6144 which the operator did not control (e.g., in a hosted service) and 6145 was compatible with the existing certificate handling tooling as used 6146 with the Web certificate infrastructure.
This means accepting bare 6147 PKCS#10 requests and returning a single bare X.509 certificate. 6148 Although the MIME types for these objects are defined, none of the 6149 existing protocols support exactly this model. 6151 The certificate request protocol MUST be performed over HTTPS. The 6152 server certificate MUST match the overlay name as described in 6153 [RFC2818]. The request MUST be an HTTP POST with the parameters 6154 encoded as described in [RFC2388] and the following properties: 6156 o If authentication is required, there MUST be form parameters of 6157 "password" and "username" containing the user's account name and 6158 password in the clear (hence the need for HTTPS). The username 6159 and password strings MUST be UTF-8 strings compared as binary 6160 objects. Applications using RELOAD SHOULD define any needed 6161 string preparation as per [RFC4013] or its successor documents. 6163 o If more than one Node-ID is required, there MUST be a form 6164 parameter of "nodeids" containing the number of Node-IDs required. 6166 o There MUST be a form parameter of "csr" with a content type of 6167 "application/pkcs10", as defined in [RFC2311] that contains the 6168 certificate signing request (CSR). 6170 o The Accept header MUST contain the type "application/pkix-cert", 6171 indicating the type that is expected in the response. 6173 The enrollment server MUST authenticate the request using the 6174 provided account name and password. The reason for using the RFC 6175 2388 "multipart/form-data" encoding is so that the password parameter 6176 will not be encoded in the URL to reduce the chance of accidental 6177 leakage of the password. If the authentication succeeds and the 6178 requested user name in the CSR is acceptable, the server MUST 6179 generate and return a certificate for the CSR in the "csr" parameter 6180 of the request. 
The SubjectAltName field in the certificate MUST 6181 contain the following values: 6183 o One or more Node-IDs which MUST be cryptographically random 6184 [RFC4086]. Each MUST be chosen by the enrollment server in such a 6185 way that they are unpredictable to the requesting user. E.g., the 6186 user MUST NOT be informed of potential (random) Node-IDs prior to 6187 authenticating. Each is placed in the subjectAltName using the 6188 uniformResourceIdentifier type and MUST contain RELOAD URIs as 6189 described in Section 14.15 and MUST contain a Destination list 6190 with a single entry of type "node_id". The enrollment server 6191 SHOULD maintain a mapping of users to Node-IDs and if the same 6192 user returns (e.g., to have their certificate re-issued) return 6193 the same Node-IDs, thus avoiding the need for implementations to 6194 re-store all their data when their certificates expire. 6196 o A single name (the "user name") that this user is allowed to use 6197 in the overlay, using type rfc822Name. Enrollment servers SHOULD 6198 take care to only allow legal characters in the name (e.g., no 6199 embedded NULs), rather than simply accepting any name provided by 6200 the user. In some usages, the right-hand-side of the user name 6201 will match the overlay name, but there is no requirement for this 6202 match in this specification. Applications using this 6203 specification MAY define such a requirement, or MAY otherwise 6204 limit the allowed range of allowed user names. 6206 The SubjectAltName field in the certificate MUST NOT contain any 6207 other identities than listed above. The subject distinguished name 6208 in the certificate MUST be empty. 6210 The certificate MUST be returned as type "application/pkix-cert" as 6211 defined in [RFC2585], with an HTTP status code of 200 OK. 
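As a rough illustration of the request side of this exchange, the following Python sketch assembles the form parameters and the Accept header that the certificate request is described as carrying. The function name, the tuple layout, and the "text/plain" content types chosen for the text fields are assumptions of this sketch; the actual multipart/form-data encoding and the HTTPS POST are left to whatever HTTP library is in use.

```python
def build_enrollment_form(csr_der, username=None, password=None, node_ids=1):
    """Return (headers, parts) for the certificate request POST.

    Each part is (form-parameter name, content type, body bytes).
    Hypothetical helper: it does not perform any network I/O.
    """
    if (username is None) != (password is None):
        raise ValueError("username and password must be given together")
    if node_ids < 1:
        raise ValueError("at least one Node-ID must be requested")

    parts = []
    if username is not None:
        # Sent in the clear inside the form body, hence the need for HTTPS.
        # Both strings are UTF-8 and compared as binary objects.
        parts.append(("username", "text/plain; charset=utf-8",
                      username.encode("utf-8")))
        parts.append(("password", "text/plain; charset=utf-8",
                      password.encode("utf-8")))
    if node_ids > 1:
        # Only present when more than one Node-ID is required.
        parts.append(("nodeids", "text/plain", str(node_ids).encode("ascii")))
    # The bare PKCS#10 certificate signing request.
    parts.append(("csr", "application/pkcs10", csr_der))

    # The response is expected to be a single bare X.509 certificate.
    headers = {"Accept": "application/pkix-cert"}
    return headers, parts


if __name__ == "__main__":
    # b"\x30\x00" is a placeholder, not a real CSR.
    hdrs, form = build_enrollment_form(b"\x30\x00", "alice", "secret", node_ids=2)
    print(hdrs, [name for name, _, _ in form])
```

Keeping the credentials in the form body rather than in the URL matches the stated reason for using the "multipart/form-data" encoding.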
6213 Certificate processing errors SHOULD result in an HTTP return code of 6214 403 "Forbidden" along with a body of type "text/plain" that 6215 consists of one of the tokens defined in the following list: 6217 failed_authentication The account name and password combination used 6218 in the HTTPS request was not valid. 6220 username_not_available The requested user name in the CSR was not 6221 acceptable. 6223 Node-IDs_not_available The number of Node-IDs requested was not 6224 acceptable. 6226 bad_CSR There was some other problem with the CSR. 6228 If the client receives an unknown token in the body, it SHOULD treat 6229 it as a failure for an unknown reason. 6231 The client MUST check that the certificate returned chains back to 6232 one of the certificates received in the "root-cert" list of the 6233 overlay configuration data (including PKIX BasicConstraints checks). 6234 The node then reads the certificate to find the Node-ID it can use. 6236 11.3.1. Self-Generated Credentials 6238 If the "self-signed-permitted" element is present in the 6239 configuration and set to "true", then a node MUST generate its own 6240 self-signed certificate to join the overlay. The self-signed 6241 certificate MAY contain any user name of the user's choice. 6243 For a self-signed certificate containing only one Node-ID, the Node-ID 6244 MUST be computed by applying the digest specified in the self-signed- 6245 permitted element to the DER representation of the user's public key 6246 (more specifically the subjectPublicKeyInfo) and taking the high 6247 order bits. For self-signed certificates containing multiple Node- 6248 IDs, the index of the Node-ID (from 1 to the number of Node-IDs 6249 needed) must be prepended as a 4-byte big-endian integer to the DER 6250 representation of the user's public key before the digest is applied and the high order 6251 bits are taken. When accepting a self-signed certificate, nodes MUST check 6252 that the Node-ID and public keys match. This prevents Node-ID theft.
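The Node-ID derivation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and parameters are hypothetical, "taking the high order bits" is read as keeping the leading bytes of the digest output, and in the multiple Node-ID case the digest is assumed to be applied to the index followed by the subjectPublicKeyInfo.

```python
import hashlib


def self_signed_node_id(spki_der, digest="sha1", node_id_length=16, index=None):
    """Derive a Node-ID from a DER-encoded subjectPublicKeyInfo.

    digest          -- "sha1" or "sha256" (the "digest" attribute of
                       self-signed-permitted)
    node_id_length  -- NodeIdLength in bytes, between 16 and 20
    index           -- None for a certificate with a single Node-ID,
                       otherwise the 1-based index of the Node-ID
    """
    if digest not in ("sha1", "sha256"):
        raise ValueError("unsupported digest")
    if not 16 <= node_id_length <= 20:
        raise ValueError("node-id-length must be between 16 and 20")

    data = spki_der
    if index is not None:
        if index < 1:
            raise ValueError("Node-ID index starts at 1")
        # Prepend the index as a 4-byte big-endian integer.
        data = index.to_bytes(4, "big") + spki_der

    # Assumption: the high-order bits are the leading bytes of the digest.
    return hashlib.new(digest, data).digest()[:node_id_length]


if __name__ == "__main__":
    # Placeholder bytes standing in for a real subjectPublicKeyInfo.
    fake_spki = bytes(range(64))
    print(self_signed_node_id(fake_spki).hex())
    print(self_signed_node_id(fake_spki, "sha256", 20, index=2).hex())
```

A node accepting a self-signed certificate could run the same function over the certificate's public key and compare the result with the Node-ID in the certificate, which is the check that prevents Node-ID theft.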
6254 Once the node has constructed a self-signed certificate, it MAY join 6255 the overlay. It MUST store its certificate in the overlay 6256 (Section 8) but SHOULD look to see if the user name is already taken 6257 before and if so choose another user name. Note that this only 6258 provides protection against accidental name collisions. Name theft 6259 is still possible. If protection against name theft is desired, then 6260 the enrollment service MUST be used. 6262 11.4. Contacting a Bootstrap Node 6264 In order to join the overlay, the joining node MUST contact a node in 6265 the overlay. Typically this means contacting the bootstrap nodes, 6266 since they are reachable by the local peer or have public IP 6267 addresses. If the joining node has cached a list of peers it has 6268 previously been connected with in this overlay, as an optimization it 6269 MAY attempt to use one or more of them as bootstrap nodes before 6270 falling back to the bootstrap nodes listed in the configuration file. 6272 When contacting a bootstrap node, the joining node MUST first form 6273 the DTLS or TLS connection to the bootstrap node and then send an 6274 Attach request over this connection with the destination Resource-ID 6275 set to the joining node's Node-ID plus 1. 6277 When the requester node finally does receive a response from some 6278 responding node, it MUST use the Node-ID in the response to start 6279 sending requests to join the Overlay Instance as described in 6280 Section 6.4. 6282 After a node has successfully joined the overlay network, it will 6283 have direct connections to several peers. Some MAY be added to the 6284 cached bootstrap nodes list and used in future boots. Peers that are 6285 not directly connected MUST NOT be cached. The suggested number of 6286 peers to cache is 10. Algorithms for determining which peers to 6287 cache are beyond the scope of this specification. 6289 12. 
Message Flow Example 6291 The following abbreviations are used in the message flow diagrams: 6292 JN = joining node, AP = admitting peer, NP = next peer after the AP, 6293 NNP = next next peer which is the peer after NP, PP = previous peer 6294 before the AP, PPP = previous previous peer which is the peer before 6295 the PP, BP = bootstrap peer. 6297 In the following example, we assume that JN has formed a connection 6298 to one of the bootstrap nodes. JN then sends an Attach through that 6299 peer to a Resource-ID of itself plus 1 (JN+1). It gets routed to the 6300 admitting peer (AP) because JN is not yet part of the overlay. When 6301 AP responds, JN and AP use ICE to set up a connection and then set up 6302 DTLS. Once AP has connected to JN, AP sends to JN an Update to 6303 populate its Routing Table. The following example shows the Update 6304 happening after the DTLS connection is formed but it could also 6305 happen before in which case the Update would often be routed through 6306 other nodes. 
6308 JN PPP PP AP NP NNP BP 6309 | | | | | | | 6310 | | | | | | | 6311 | | | | | | | 6312 |AttachReq Dest=JN+1| | | | | 6313 |---------------------------------------------------------->| 6314 | | | | | | | 6315 | | | | | | | 6316 | | | |AttachReq Dest=JN+1| | 6317 | | | |<----------------------------| 6318 | | | | | | | 6319 | | | | | | | 6320 | | | |AttachAns | | 6321 | | | |---------------------------->| 6322 | | | | | | | 6323 | | | | | | | 6324 |AttachAns | | | | | 6325 |<----------------------------------------------------------| 6326 | | | | | | | 6327 |ICE | | | | | | 6328 |<===========================>| | | | 6329 | | | | | | | 6330 |TLS | | | | | | 6331 |<...........................>| | | | 6332 | | | | | | | 6333 | | | | | | | 6334 | | | | | | | 6335 |UpdateReq| | | | | | 6336 |<----------------------------| | | | 6337 | | | | | | | 6338 | | | | | | | 6339 |UpdateAns| | | | | | 6340 |---------------------------->| | | | 6341 | | | | | | | 6342 | | | | | | | 6343 | | | | | | | 6345 Figure 1 6347 The JN then forms connections to the appropriate neighbors, such as 6348 NP, by sending an Attach which gets routed via other nodes. When NP 6349 responds, JN and NP use ICE and DTLS to set up a connection. 
6351 JN PPP PP AP NP NNP BP 6352 | | | | | | | 6353 | | | | | | | 6354 | | | | | | | 6355 |AttachReq NP | | | | | 6356 |---------------------------->| | | | 6357 | | | | | | | 6358 | | | | | | | 6359 | | | |AttachReq NP | | 6360 | | | |-------->| | | 6361 | | | | | | | 6362 | | | | | | | 6363 | | | |AttachAns| | | 6364 | | | |<--------| | | 6365 | | | | | | | 6366 | | | | | | | 6367 |AttachAns| | | | | | 6368 |<----------------------------| | | | 6369 | | | | | | | 6370 | | | | | | | 6371 |ICE | | | | | | 6372 |<=====================================>| | | 6373 | | | | | | | 6374 | | | | | | | 6375 |TLS | | | | | | 6376 |<.....................................>| | | 6377 | | | | | | | 6378 | | | | | | | 6379 | | | | | | | 6380 | | | | | | | 6382 Figure 2 6384 JN also needs to populate its Finger Table (for the Chord based DHT). 6385 It issues an Attach to a variety of locations around the overlay. 6386 The diagram below shows it sending an Attach halfway around the Chord 6387 ring to the JN + 2^127. 6389 JN NP XX TP 6390 | | | | 6391 | | | | 6392 | | | | 6393 |AttachReq JN+2<<126| | 6394 |-------->| | | 6395 | | | | 6396 | | | | 6397 | |AttachReq JN+2<<126| 6398 | |-------->| | 6399 | | | | 6400 | | | | 6401 | | |AttachReq JN+2<<126 6402 | | |-------->| 6403 | | | | 6404 | | | | 6405 | | |AttachAns| 6406 | | |<--------| 6407 | | | | 6408 | | | | 6409 | |AttachAns| | 6410 | |<--------| | 6411 | | | | 6412 | | | | 6413 |AttachAns| | | 6414 |<--------| | | 6415 | | | | 6416 |ICE | | | 6417 |<===========================>| 6418 | | | | 6419 |TLS | | | 6420 |<...........................>| 6421 | | | | 6422 | | | | 6424 Figure 3 6426 Once JN has a reasonable set of connections, it is ready to take its 6427 place in the DHT. It does this by sending a Join to AP. AP does a 6428 series of Store requests to JN to store the data that JN will be 6429 responsible for. AP then sends JN an Update explicitly labeling JN 6430 as its predecessor. 
At this point, JN is part of the ring and 6431 responsible for a section of the overlay. AP can now forget any data 6432 which is assigned to JN and not AP. 6434 JN PPP PP AP NP NNP BP 6435 | | | | | | | 6436 | | | | | | | 6437 | | | | | | | 6438 |JoinReq | | | | | | 6439 |---------------------------->| | | | 6440 | | | | | | | 6441 | | | | | | | 6442 |JoinAns | | | | | | 6443 |<----------------------------| | | | 6444 | | | | | | | 6445 | | | | | | | 6446 |StoreReq Data A | | | | | 6447 |<----------------------------| | | | 6448 | | | | | | | 6449 | | | | | | | 6450 |StoreAns | | | | | | 6451 |---------------------------->| | | | 6452 | | | | | | | 6453 | | | | | | | 6454 |StoreReq Data B | | | | | 6455 |<----------------------------| | | | 6456 | | | | | | | 6457 | | | | | | | 6458 |StoreAns | | | | | | 6459 |---------------------------->| | | | 6460 | | | | | | | 6461 | | | | | | | 6462 |UpdateReq| | | | | | 6463 |<----------------------------| | | | 6464 | | | | | | | 6465 | | | | | | | 6466 |UpdateAns| | | | | | 6467 |---------------------------->| | | | 6468 | | | | | | | 6469 | | | | | | | 6470 | | | | | | | 6471 | | | | | | | 6473 Figure 4 6475 In Chord, JN's Neighbor Table needs to contain its own predecessors. 6476 It couldn't connect to them previously because it did not yet know 6477 their addresses. However, now that it has received an Update from 6478 AP, as in the previous diagram, it has AP's predecessors, which are 6479 also its own, so it sends Attaches to them. Below it is shown 6480 connecting only to AP's closest predecessor, PP. 
6482 JN PPP PP AP NP NNP BP 6483 | | | | | | | 6484 | | | | | | | 6485 | | | | | | | 6486 |AttachReq Dest=PP | | | | | 6487 |---------------------------->| | | | 6488 | | | | | | | 6489 | | | | | | | 6490 | | |AttachReq Dest=PP | | | 6491 | | |<--------| | | | 6492 | | | | | | | 6493 | | | | | | | 6494 | | |AttachAns| | | | 6495 | | |-------->| | | | 6496 | | | | | | | 6497 | | | | | | | 6498 |AttachAns| | | | | | 6499 |<----------------------------| | | | 6500 | | | | | | | 6501 | | | | | | | 6502 |TLS | | | | | | 6503 |...................| | | | | 6504 | | | | | | | 6505 | | | | | | | 6506 |UpdateReq| | | | | | 6507 |------------------>| | | | | 6508 | | | | | | | 6509 | | | | | | | 6510 |UpdateAns| | | | | | 6511 |<------------------| | | | | 6512 | | | | | | | 6513 | | | | | | | 6514 |UpdateReq| | | | | | 6515 |---------------------------->| | | | 6516 | | | | | | | 6517 | | | | | | | 6518 |UpdateAns| | | | | | 6519 |<----------------------------| | | | 6520 | | | | | | | 6521 | | | | | | | 6522 |UpdateReq| | | | | | 6523 |-------------------------------------->| | | 6524 | | | | | | | 6525 | | | | | | | 6526 |UpdateAns| | | | | | 6527 |<--------------------------------------| | | 6528 | | | | | | | 6529 | | | | | | | 6530 Figure 5 6532 Finally, now that JN has a copy of all the data and is ready to route 6533 messages and receive requests, it sends Updates to everyone in its 6534 Routing Table to tell them it is ready to go. Below, it is shown 6535 sending such an update to TP. 6537 JN NP XX TP 6538 | | | | 6539 | | | | 6540 | | | | 6541 |UpdateReq| | | 6542 |---------------------------->| 6543 | | | | 6544 | | | | 6545 |UpdateAns| | | 6546 |<----------------------------| 6547 | | | | 6548 | | | | 6549 | | | | 6550 | | | | 6552 Figure 6 6554 13. Security Considerations 6556 13.1. Overview 6558 RELOAD provides a generic storage service, albeit one designed to be 6559 useful for P2PSIP. 
In this section we discuss security issues that 6560 are likely to be relevant to any usage of RELOAD. More background 6561 information can be found in [RFC5765]. 6563 In any Overlay Instance, any given user depends on a number of peers 6564 with which they have no well-defined relationship except that they 6565 are fellow members of the Overlay Instance. In practice, these other 6566 nodes may be friendly, lazy, curious, or outright malicious. No 6567 security system can provide complete protection in an environment 6568 where most nodes are malicious. The goal of security in RELOAD is to 6569 provide strong security guarantees of some properties even in the 6570 face of a large number of malicious nodes and to allow the overlay to 6571 function correctly in the face of a modest number of malicious nodes. 6573 P2PSIP deployments require the ability to authenticate both peers and 6574 resources (users) without the active presence of a trusted entity in 6575 the system. We describe two mechanisms. The first mechanism is 6576 based on public key certificates and is suitable for general 6577 deployments. The second is an admission control mechanism based on 6578 an overlay-wide shared symmetric key. 6580 13.2. Attacks on P2P Overlays 6582 The two basic functions provided by overlay nodes are storage and 6583 routing: some peer is responsible for storing a node's data and for 6584 allowing a third node to fetch this stored data. Other peers are 6585 responsible for routing messages to and from the storing nodes. Each 6586 of these issues is covered in the following sections. 6588 P2P overlays are subject to attacks by subversive nodes that may 6589 attempt to disrupt routing, corrupt or remove user registrations, or 6590 eavesdrop on signaling. The certificate-based security algorithms we 6591 describe in this specification are intended to protect overlay 6592 routing and user registration information in RELOAD messages. 
6594 To protect the signaling from attackers pretending to be valid nodes 6595 (or nodes other than themselves), the first requirement is to ensure 6596 that all messages are received from authorized members of the 6597 overlay. For this reason, RELOAD MUST transport all messages over a 6598 secure channel (TLS and DTLS are defined in this document) which 6599 provides message integrity and authentication of the directly 6600 communicating peer. In addition, messages and data MUST be digitally 6601 signed with the sender's private key, providing end-to-end security 6602 for communications. 6604 13.3. Certificate-based Security 6606 This specification stores users' registrations and possibly other 6607 data in an overlay network. This requires a way to secure 6608 this data and, as far as possible, the routing in 6609 the overlay. Both types of security are based on requiring that 6610 every entity in the system (whether user or peer) authenticate 6611 cryptographically using an asymmetric key pair tied to a certificate. 6613 When a user enrolls in the Overlay Instance, they request or are 6614 assigned a unique name, such as "alice@dht.example.net". These names 6615 MUST be unique and are meant to be chosen and used by humans much 6616 like a SIP Address of Record (AOR) or an email address. The user 6617 MUST also be assigned one or more Node-IDs by the central enrollment 6618 authority. Both the name and the Node-IDs are placed in the 6619 certificate, along with the user's public key. 6621 Each certificate enables an entity to act in two sorts of roles: 6623 o As a user, storing data at specific Resource-IDs in the Overlay 6624 Instance corresponding to the user name. 6626 o As an overlay peer with the Node-ID(s) listed in the certificate. 6628 Note that since only users of this Overlay Instance need to validate 6629 a certificate, this usage does not require a global PKI.
Instead, 6630 certificates MUST be signed by a central enrollment authority which 6631 acts as the certificate authority for the Overlay Instance. This 6632 authority signs each node's certificate. Because each node possesses 6633 the CA's certificate (which they receive on enrollment) they can 6634 verify the certificates of the other entities in the overlay without 6635 further communication. Because the certificates contain the user/node's 6636 public key, communications from the user/node can be verified 6637 in turn. 6639 If self-signed certificates are used, then the security provided is 6640 significantly decreased, since attackers can mount Sybil attacks. In 6641 addition, peers cannot trust the user names in certificates 6642 (though they can trust the Node-IDs because they are 6643 cryptographically verifiable). This scheme may be appropriate for 6644 some small deployments, such as a small office or an ad hoc overlay 6645 set up among participants in a meeting where all hosts on the network 6646 are trusted. Some additional security can be provided by using the 6647 shared secret admission control scheme as well. 6649 Because all stored data is signed by the owner of the data, the 6650 storing node can verify that the storer is authorized to perform a 6651 store at that Resource-ID, and any consumer of the data can 6652 verify the provenance and integrity of the data when it retrieves it. 6654 Note that RELOAD does not itself provide a revocation/status 6655 mechanism (though certificates may of course include OCSP responder 6656 information). Thus, certificate lifetimes SHOULD be chosen to 6657 balance the compromise window versus the cost of certificate renewal. 6658 Because RELOAD is already designed to operate in the face of some 6659 fraction of malicious nodes, this form of compromise is not fatal. 6661 All implementations MUST implement certificate-based security. 6663 13.4.
Shared-Secret Security 6665 RELOAD also supports a shared secret admission control scheme that 6666 relies on a single key that is shared among all members of the 6667 overlay. It is appropriate for small groups that wish to form a 6668 private network without complexity. In shared secret mode, all the 6669 peers MUST share a single symmetric key which is used to key TLS-PSK 6670 or TLS-SRP mode. A peer which does not know the key cannot form TLS 6671 connections with any other peer and therefore cannot join the 6672 overlay. 6674 One natural approach to a shared-secret scheme is to use a user- 6675 entered password as the key. The difficulty with this is that in 6676 TLS-PSK mode, such keys are very susceptible to dictionary attacks. 6677 If passwords are used as the source of shared-keys, then TLS-SRP is a 6678 superior choice because it is not subject to dictionary attacks. 6680 13.5. Storage Security 6682 When certificate-based security is used in RELOAD, any given 6683 Resource-ID/Kind-ID pair is bound to some small set of certificates. 6684 In order to write data, the writer must prove possession of the 6685 private key for one of those certificates. Moreover, all data is 6686 stored, signed with the same private key that was used to authorize 6687 the storage. This set of rules makes questions of authorization and 6688 data integrity - which have historically been thorny for overlays - 6689 relatively simple. 6691 13.5.1. Authorization 6693 When a node wants to store some value, it MUST first digitally sign 6694 the value with its own private key. It then sends a Store request 6695 that contains both the value and the signature towards the storing 6696 peer (which is defined by the Resource Name construction algorithm 6697 for that particular Kind of value). 6699 When the storing peer receives the request, it MUST determine whether 6700 the storing node is authorized to store at this Resource-ID/Kind-ID 6701 pair. 
Determining this requires comparing the user's identity to the 6702 requirements of the access control model (see Section 7.3). If it 6703 satisfies those requirements the user is authorized to write, pending 6704 quota checks as described in the next section. 6706 For example, consider the certificate with the following properties: 6708 User name: alice@dht.example.com 6709 Node-ID: 013456789abcdef 6710 Serial: 1234 6712 If Alice wishes to Store a value of the "SIP Location" Kind, the 6713 Resource Name will be the SIP AOR "sip:alice@dht.example.com". The 6714 Resource-ID will be determined by hashing the Resource Name. Because 6715 SIP Location uses the USER-NODE-MATCH policy, it first verifies that 6716 the user name in the certificate hashes to the requested Resource-ID. 6717 It then verifies that the Node-ID in the certificate matches the 6718 dictionary key being used for the store. If both of these checks 6719 succeed, the Store is authorized. Note that because the access 6720 control model is different for different Kinds, the exact set of 6721 checks will vary. 6723 13.5.2. Distributed Quota 6725 Being a peer in an Overlay Instance carries with it the 6726 responsibility to store data for a given region of the Overlay 6727 Instance. However, allowing nodes to store unlimited amounts of data 6728 would create unacceptable burdens on peers and would also enable 6729 trivial denial of service attacks. RELOAD addresses this issue by 6730 requiring configurations to define maximum sizes for each Kind of 6731 stored data. Attempts to store values exceeding this size MUST be 6732 rejected (if peers are inconsistent about this, then strange 6733 artifacts will happen when the zone of responsibility shifts and a 6734 different peer becomes responsible for overlarge data). 
Because each 6735 Resource-ID/Kind-ID pair is bound to a small set of certificates, 6736 these size restrictions also create a distributed quota mechanism, 6737 with the quotas administered by the central configuration server. 6739 Allowing different Kinds of data to have different size restrictions 6740 allows new usages the flexibility to define limits that fit their 6741 needs without requiring all usages to have expansive limits. 6743 13.5.3. Correctness 6745 Because each stored value is signed, it is trivial for any retrieving 6746 node to verify the integrity of the stored value. Some more care 6747 needs to be taken to prevent version rollback attacks. Rollback 6748 attacks on storage are prevented by the use of storage times and 6749 lifetime values in each store. A lifetime represents the latest time 6750 at which the data is valid and thus limits (though does not 6751 completely prevent) the ability of the storing node to perform a 6752 rollback attack on retrievers. In order to prevent a rollback attack 6753 at the time of the Store request, it is REQUIRED that storage times 6754 be monotonically increasing. Storing peers MUST reject Store 6755 requests with storage times smaller than or equal to those they are 6756 currently storing. In addition, a fetching node which receives a 6757 data value with a storage time older than the result of the previous 6758 fetch knows a rollback has occurred. 6760 13.5.4. Residual Attacks 6762 The mechanisms described here provide a high degree of security, but 6763 some attacks remain possible. Most simply, it is possible for 6764 storing peers to refuse to store a value (i.e., reject any request). 6765 In addition, a storing peer can deny knowledge of values which it has 6766 previously accepted. To some extent these attacks can be ameliorated 6767 by attempting to store to/retrieve from replicas, but a retrieving 6768 node does not know whether it should try this or not, since there is 6769 a cost to doing so.
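The storage-time rule of Section 13.5.3 can be illustrated with a minimal sketch. It is not taken from the specification: the StoringPeer class, its in-memory dictionary, the use of plain integers for storage times, and the rollback_detected helper are all assumptions made for illustration, and a real peer would also verify signatures, lifetimes, and access control before accepting a Store.

```python
from dataclasses import dataclass, field


@dataclass
class StoringPeer:
    """Hypothetical in-memory store keyed by (resource_id, kind_id)."""

    # Maps (resource_id, kind_id) -> storage time of the value currently held.
    stored_times: dict = field(default_factory=dict)

    def accept_store(self, resource_id: bytes, kind_id: int, storage_time: int) -> bool:
        """Return True if the Store is accepted, False if it must be rejected."""
        key = (resource_id, kind_id)
        current = self.stored_times.get(key)
        # Reject storage times smaller than or equal to the one already stored.
        if current is not None and storage_time <= current:
            return False
        self.stored_times[key] = storage_time
        return True


def rollback_detected(previous_fetch_time: int, fetched_time: int) -> bool:
    """A fetcher sees a rollback when the new value is older than the last one."""
    return fetched_time < previous_fetch_time
```

In this sketch a repeated Store with the same storage time is rejected just like an older one, which matches the "smaller than or equal to" wording above.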
6771 The certificate-based authentication scheme prevents a single peer 6772 from being able to forge data owned by other peers. Furthermore, 6773 although a subversive peer can refuse to return data resources for 6774 which it is responsible, it cannot return forged data because it 6775 cannot provide authentication for such registrations. Therefore, 6776 parallel searches for redundant registrations can mitigate most of 6777 the effects of a compromised peer. The ultimate reliability of such 6778 an overlay is a statistical question based on the replication factor 6779 and the percentage of compromised peers. 6781 In addition, when a Kind is multivalued (e.g., an array data model), 6782 the storing peer can return only some subset of the values, thus 6783 biasing its responses. This can be countered by using single values 6784 rather than sets, but that makes coordination between multiple 6785 storing agents much more difficult. This is a trade-off that must be 6786 made when designing any usage. 6788 13.6. Routing Security 6790 Because the storage security system guarantees (within limits) the 6791 integrity of the stored data, routing security focuses on stopping 6792 the attacker from performing a DoS attack that misroutes requests in 6793 the overlay. There are a few obvious observations to make about 6794 this. First, it is easy to ensure that an attacker is at least a 6795 valid node in the Overlay Instance. Second, this is a DoS attack 6796 only. Third, if a large percentage of the nodes on the Overlay 6797 Instance are controlled by the attacker, it is probably impossible to 6798 perfectly secure against this. 6800 13.6.1. Background 6802 In general, attacks on DHT routing are mounted by the attacker 6803 arranging to route traffic through one or two nodes it controls. In 6804 the Eclipse attack [Eclipse] the attacker tampers with messages to 6805 and from nodes for which it is on-path with respect to a given victim 6806 node.
This allows it to pretend to be all the nodes that are 6807 reachable through it. In the Sybil attack [Sybil], the attacker 6808 registers a large number of nodes and is therefore able to capture a 6809 large amount of the traffic through the DHT. 6811 Both the Eclipse and Sybil attacks require the attacker to be able to 6812 exercise control over her Node-IDs. The Sybil attack requires the 6813 creation of a large number of peers. The Eclipse attack requires 6814 that the attacker be able to impersonate specific peers. In both 6815 cases, these attacks are limited by the use of centralized, 6816 certificate-based admission control. 6818 13.6.2. Admissions Control 6820 Admission to a RELOAD Overlay Instance is controlled by requiring 6821 that each peer have a certificate containing its Node-ID. The 6822 requirement to have a certificate is enforced by using certificate- 6823 based mutual authentication on each connection. (Note: the 6824 following only applies when self-signed certificates are not used.) 6825 Whenever a peer connects to another peer, each side automatically 6826 checks that the other has a suitable certificate. These Node-IDs 6827 MUST be randomly assigned by the central enrollment server. This has 6828 two benefits: 6830 o It allows the enrollment server to limit the number of Node-IDs 6831 issued to any individual user. 6833 o It prevents the attacker from choosing specific Node-IDs. 6835 The first property allows protection against Sybil attacks (provided 6836 the enrollment server uses strict rate limiting policies). The 6837 second property deters but does not completely prevent Eclipse 6838 attacks. Because an Eclipse attacker must impersonate peers on the 6839 other side of the attacker, the attacker must have a certificate for 6840 suitable Node-IDs, which requires him to repeatedly query the 6841 enrollment server for new certificates, which will match only by 6842 chance. 
From the attacker's perspective, the difficulty is that if 6843 the attacker only has a small number of certificates, the region of 6844 the Overlay Instance he is impersonating appears to be very sparsely 6845 populated by comparison to the victim's local region. 6847 13.6.3. Peer Identification and Authentication 6849 In general, whenever a peer engages in overlay activity that might 6850 affect the Routing Table it must establish its identity. This 6851 happens in two ways. First, whenever a peer establishes a direct 6852 connection to another peer it authenticates via certificate-based 6853 mutual authentication. All messages between peers are sent over this 6854 protected channel and therefore the peers can verify the data origin 6855 of the last hop peer for requests and responses without further 6856 cryptography. 6858 In some situations, however, it is desirable to be able to establish 6859 the identity of a peer with whom one is not directly connected. The 6860 most natural case is when a peer Updates its state. At this point, 6861 other peers may need to update their view of the overlay structure, 6862 but they need to verify that the Update message came from the actual 6863 peer rather than from an attacker. To prevent such forgery, all overlay 6864 routing messages are signed by the peer that generated them. 6866 Replay is typically prevented for messages that impact the topology 6867 of the overlay by having the information come directly from, or be 6868 verified by, the nodes that claimed to have generated the update. 6869 Data storage replay detection is done by signing the time of the node 6870 that generated the signature on the Store request, thus providing 6871 time-based replay protection; time synchronization is only 6872 needed between peers that can write to the same location. 6874 13.6.4. Protecting the Signaling 6876 The goal here is to stop an attacker from knowing who is signaling 6877 what to whom.
An attacker is unlikely to be able to observe the 6878 activities of a specific individual given the randomization of IDs 6879 and routing based on the present peers discussed above. Furthermore, 6880 because messages can be routed using only the header information, the 6881 actual body of the RELOAD message can be encrypted during 6882 transmission. 6884 There are two lines of defense here. The first is the use of TLS or 6885 DTLS for each communications link between peers. This provides 6886 protection against attackers who are not members of the overlay. The 6887 second line of defense is to digitally sign each message. This 6888 prevents adversarial peers from modifying messages in flight, even if 6889 they are on the routing path. 6891 13.6.5. Routing Loops and DoS Attacks 6893 Source routing mechanisms are known to create the possibility for DoS 6894 amplification, especially by the induction of routing loops 6895 [RFC5095]. In order to limit amplification, the initial-ttl value in 6896 the configuration file SHOULD be set to a value slightly larger than 6897 the longest expected path through the network. For Chord, experience 6898 has shown that log2(N) + 5, where N is the number of nodes in the network, is a 6899 safe bound. Because nodes are required to enforce the initial-ttl as 6900 the maximum value, an attacker cannot achieve an amplification factor 6901 greater than initial-ttl, thus limiting the additional capabilities 6902 provided by source routing. 6904 In order to prevent the use of loops for targeted implementation 6905 attacks, implementations SHOULD check the destination list for 6906 duplicate entries and discard such records with an 6907 "Error_Invalid_Message" error. This does not completely prevent 6908 loops but does require that at least one attacker node be part of the 6909 loop. 6911 13.6.6. Residual Attacks 6913 The routing security mechanisms in RELOAD are designed to contain 6914 rather than eliminate attacks on routing.
It is still possible for 6915 an attacker to mount a variety of attacks. In particular, if an 6916 attacker is able to take up a position on the overlay routing path between 6917 A and B it can make it appear as if B does not exist or is 6918 disconnected. It can also advertise false network metrics in an 6919 attempt to reroute traffic. However, these are primarily DoS 6920 attacks. 6922 The certificate-based security scheme secures the namespace, but if 6923 an individual peer is compromised or if an attacker obtains a 6924 certificate from the CA, then a number of subversive peers can still 6925 appear in the overlay. While these peers cannot falsify responses to 6926 resource queries, they can respond with error messages, effecting a 6927 DoS attack on the resource registration. They can also subvert 6928 routing to other compromised peers. To defend against such attacks, 6929 a resource search must still consist of parallel searches for 6930 replicated registrations. 6932 14. IANA Considerations 6934 This section contains the new code points registered by this 6935 document. [NOTE TO IANA/RFC-EDITOR: Please replace RFC-to-be with 6936 the RFC number for this specification in the following list.] 6938 14.1. Well-Known URI Registration 6940 IANA SHALL make the following "Well Known URI" registration as 6941 described in [RFC5785]: 6943 [[Note to RFC Editor - this paragraph can be removed before 6944 publication. ]] A review request was sent to 6945 wellknown-uri-review@ietf.org on October 12, 2010.
6947 +----------------------------+----------------------+
6948 | URI suffix:                | reload-config        |
6949 | Change controller:         | IETF                 |
6950 | Specification document(s): | [RFC-to-be]          |
6951 | Related information:       | None                 |
6952 +----------------------------+----------------------+
6954 14.2. Port Registrations 6956 [[Note to RFC Editor - this paragraph can be removed before 6957 publication. ]] IANA has already allocated a TCP port for the main 6958 peer to peer protocol.
This port has the name p2psip-enroll and the 6959 port number of 6084. IANA needs to update this registration to 6960 change the service name to reload-config and to define it for UDP as 6961 well as TCP. 6963 IANA SHALL make the following port registration:
6965 +-----------------------------+-------------------------------------+
6966 | Registration Technical      | Cullen Jennings                     |
6967 | Contact                     |                                     |
6968 | Registration Owner          | IETF                                |
6969 | Transport Protocol          | TCP & UDP                           |
6970 | Port Number                 | 6084                                |
6971 | Service Name                | reload-config                       |
6972 | Description                 | Peer to Peer Infrastructure         |
6973 |                             | Configuration                       |
6974 +-----------------------------+-------------------------------------+
6976 14.3. Overlay Algorithm Types 6978 IANA SHALL create a "RELOAD Overlay Algorithm Type" Registry. 6979 Entries in this registry are strings denoting the names of overlay 6980 algorithms as described in Section 11.1 of [RFC-to-be]. The 6981 registration policy for this registry is RFC 5226 IETF Review. The 6982 initial contents of this registry are:
6984 +----------------+-----------+
6985 | Algorithm Name | RFC       |
6986 +----------------+-----------+
6987 | CHORD-RELOAD   | RFC-to-be |
6988 | EXP-OVERLAY    | RFC-to-be |
6989 +----------------+-----------+
6991 The value EXP-OVERLAY has been made available for the purposes of 6992 experimentation. This value is not meant for vendor specific use of 6993 any sort and it MUST NOT be used for operational deployments. 6995 14.4. Access Control Policies 6997 IANA SHALL create a "RELOAD Access Control Policy" Registry. Entries 6998 in this registry are strings denoting access control policies, as 6999 described in Section 7.3 of [RFC-to-be]. New entries in this 7000 registry SHALL be registered via RFC 5226 Standards Action.
The 7001 initial contents of this registry are:
7003 +-----------------+-----------+
7004 | Access Policy   | RFC       |
7005 +-----------------+-----------+
7006 | USER-MATCH      | RFC-to-be |
7007 | NODE-MATCH      | RFC-to-be |
7008 | USER-NODE-MATCH | RFC-to-be |
7009 | NODE-MULTIPLE   | RFC-to-be |
7010 | EXP-MATCH       | RFC-to-be |
7011 +-----------------+-----------+
7013 The value EXP-MATCH has been made available for the purposes of 7014 experimentation. This value is not meant for vendor specific use of 7015 any sort and it MUST NOT be used for operational deployments. 7017 14.5. Application-ID 7019 IANA SHALL create a "RELOAD Application-ID" Registry. Entries in 7020 this registry are 16-bit integers denoting application-ids as 7021 described in Section 6.5.2 of [RFC-to-be]. Code points in the range 7022 0x0001 to 0x7fff SHALL be registered via RFC 5226 Standards Action. 7023 Code points in the range 0x8000 to 0xf000 SHALL be registered via RFC 7024 5226 Expert Review. Code points in the range 0xf001 to 0xfffe are 7025 reserved for private use. The initial contents of this registry are:
7027 +-------------+----------------+-------------------------------+
7028 | Application | Application-ID | Specification                 |
7029 +-------------+----------------+-------------------------------+
7030 | INVALID     | 0              | RFC-to-be                     |
7031 | SIP         | 5060           | Reserved for use by SIP Usage |
7032 | SIP         | 5061           | Reserved for use by SIP Usage |
7033 | Reserved    | 0xffff         | RFC-to-be                     |
7034 +-------------+----------------+-------------------------------+
7036 14.6. Data Kind-ID 7038 IANA SHALL create a "RELOAD Data Kind-ID" Registry. Entries in this 7039 registry are 32-bit integers denoting data Kinds, as described in 7040 Section 4.2 of [RFC-to-be]. Code points in the range 0x00000001 to 7041 0x7fffffff SHALL be registered via RFC 5226 Standards Action. Code 7042 points in the range 0x80000000 to 0xf0000000 SHALL be registered via 7043 RFC 5226 Expert Review.
Code points in the range 0xf0000001 to 7044 0xfffffffe are reserved for private use via the Kind description 7045 mechanism described in Section 11 of [RFC-to-be]. The initial 7046 contents of this registry are:
7048 +---------------------+------------+-----------+
7049 | Kind                | Kind-ID    | RFC       |
7050 +---------------------+------------+-----------+
7051 | INVALID             | 0          | RFC-to-be |
7052 | TURN-SERVICE        | 2          | RFC-to-be |
7053 | CERTIFICATE_BY_NODE | 3          | RFC-to-be |
7054 | CERTIFICATE_BY_USER | 16         | RFC-to-be |
7055 | Reserved            | 0x7fffffff | RFC-to-be |
7056 | Reserved            | 0xfffffffe | RFC-to-be |
7057 +---------------------+------------+-----------+
7059 14.7. Data Model 7061 IANA SHALL create a "RELOAD Data Model" Registry. Entries in this 7062 registry are strings denoting data models, as described in 7063 Section 7.2 of [RFC-to-be]. New entries in this registry SHALL be 7064 registered via RFC 5226 Standards Action. The initial contents of 7065 this registry are:
7067 +------------+-----------+
7068 | Data Model | RFC       |
7069 +------------+-----------+
7070 | INVALID    | RFC-to-be |
7071 | SINGLE     | RFC-to-be |
7072 | ARRAY      | RFC-to-be |
7073 | DICTIONARY | RFC-to-be |
7074 | EXP-DATA   | RFC-to-be |
7075 | RESERVED   | RFC-to-be |
7076 +------------+-----------+
7078 The value EXP-DATA has been made available for the purposes of 7079 experimentation. This value is not meant for vendor specific use of 7080 any sort and it MUST NOT be used for operational deployments. 7082 14.8. Message Codes 7084 IANA SHALL create a "RELOAD Message Code" Registry. Entries in this 7085 registry are 16-bit integers denoting method codes as described in 7086 Section 6.3.3 of [RFC-to-be]. These codes SHALL be registered via 7087 RFC 5226 Standards Action.
The initial contents of this registry 7088 are:
7090 +---------------------------------+----------------+-----------+
7091 | Message Code Name               | Code Value     | RFC       |
7092 +---------------------------------+----------------+-----------+
7093 | invalidMessageCode              | 0              | RFC-to-be |
7094 | probe_req                       | 1              | RFC-to-be |
7095 | probe_ans                       | 2              | RFC-to-be |
7096 | attach_req                      | 3              | RFC-to-be |
7097 | attach_ans                      | 4              | RFC-to-be |
7098 | unused                          | 5              |           |
7099 | unused                          | 6              |           |
7100 | store_req                       | 7              | RFC-to-be |
7101 | store_ans                       | 8              | RFC-to-be |
7102 | fetch_req                       | 9              | RFC-to-be |
7103 | fetch_ans                       | 10             | RFC-to-be |
7104 | unused (was remove_req)         | 11             | RFC-to-be |
7105 | unused (was remove_ans)         | 12             | RFC-to-be |
7106 | find_req                        | 13             | RFC-to-be |
7107 | find_ans                        | 14             | RFC-to-be |
7108 | join_req                        | 15             | RFC-to-be |
7109 | join_ans                        | 16             | RFC-to-be |
7110 | leave_req                       | 17             | RFC-to-be |
7111 | leave_ans                       | 18             | RFC-to-be |
7112 | update_req                      | 19             | RFC-to-be |
7113 | update_ans                      | 20             | RFC-to-be |
7114 | route_query_req                 | 21             | RFC-to-be |
7115 | route_query_ans                 | 22             | RFC-to-be |
7116 | ping_req                        | 23             | RFC-to-be |
7117 | ping_ans                        | 24             | RFC-to-be |
7118 | stat_req                        | 25             | RFC-to-be |
7119 | stat_ans                        | 26             | RFC-to-be |
7120 | unused (was attachlite_req)     | 27             | RFC-to-be |
7121 | unused (was attachlite_ans)     | 28             | RFC-to-be |
7122 | app_attach_req                  | 29             | RFC-to-be |
7123 | app_attach_ans                  | 30             | RFC-to-be |
7124 | unused (was app_attachlite_req) | 31             | RFC-to-be |
7125 | unused (was app_attachlite_ans) | 32             | RFC-to-be |
7126 | config_update_req               | 33             | RFC-to-be |
7127 | config_update_ans               | 34             | RFC-to-be |
7128 | exp_a_req                       | 35             | RFC-to-be |
7129 | exp_a_ans                       | 36             | RFC-to-be |
7130 | exp_b_req                       | 37             | RFC-to-be |
7131 | exp_b_ans                       | 38             | RFC-to-be |
7132 | reserved                        | 0x8000..0xfffe | RFC-to-be |
7133 | error                           | 0xffff         | RFC-to-be |
7134 +---------------------------------+----------------+-----------+
7136 The values exp_a_req, exp_a_ans, exp_b_req, and exp_b_ans have been 7137 made available
for the purposes of experimentation. These values are 7138 not meant for vendor specific use of any sort and MUST NOT be used 7139 for operational deployments. 7141 14.9. Error Codes 7143 IANA SHALL create a "RELOAD Error Code" Registry. Entries in this 7144 registry are 16-bit integers denoting error codes as described in 7145 Section 6.3.3.1 of [RFC-to-be]. New entries SHALL be defined via RFC 7146 5226 Standards Action. The initial contents of this registry are:
7148 +-------------------------------------+----------------+-----------+
7149 | Error Code Name                     | Code Value     | RFC       |
7150 +-------------------------------------+----------------+-----------+
7151 | invalidErrorCode                    | 0              | RFC-to-be |
7152 | Unused                              | 1              | RFC-to-be |
7153 | Error_Forbidden                     | 2              | RFC-to-be |
7154 | Error_Not_Found                     | 3              | RFC-to-be |
7155 | Error_Request_Timeout               | 4              | RFC-to-be |
7156 | Error_Generation_Counter_Too_Low    | 5              | RFC-to-be |
7157 | Error_Incompatible_with_Overlay     | 6              | RFC-to-be |
7158 | Error_Unsupported_Forwarding_Option | 7              | RFC-to-be |
7159 | Error_Data_Too_Large                | 8              | RFC-to-be |
7160 | Error_Data_Too_Old                  | 9              | RFC-to-be |
7161 | Error_TTL_Exceeded                  | 10             | RFC-to-be |
7162 | Error_Message_Too_Large             | 11             | RFC-to-be |
7163 | Error_Unknown_Kind                  | 12             | RFC-to-be |
7164 | Error_Unknown_Extension             | 13             | RFC-to-be |
7165 | Error_Response_Too_Large            | 14             | RFC-to-be |
7166 | Error_Config_Too_Old                | 15             | RFC-to-be |
7167 | Error_Config_Too_New                | 16             | RFC-to-be |
7168 | Error_In_Progress                   | 17             | RFC-to-be |
7169 | Error_Exp_A                         | 18             | RFC-to-be |
7170 | Error_Exp_B                         | 19             | RFC-to-be |
7171 | Error_Invalid_Message               | 20             | RFC-to-be |
7172 | reserved                            | 0x8000..0xfffe | RFC-to-be |
7173 +-------------------------------------+----------------+-----------+
7175 The values Error_Exp_A and Error_Exp_B have been made available for 7176 the purposes of experimentation. These values are not meant for 7177 vendor specific use of any sort and MUST NOT be used for operational 7178 deployments.
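An implementation might mirror the initial contents of the error code registry as an enumeration. The following is only a sketch: the member names and values are copied from the table above, while the ErrorCode class, the EXPERIMENTAL set, and the describe helper are hypothetical names introduced for illustration. Code 1 is listed as Unused and is therefore not modeled.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    """Initial RELOAD error code values, copied from the registry table."""

    invalidErrorCode = 0
    Error_Forbidden = 2
    Error_Not_Found = 3
    Error_Request_Timeout = 4
    Error_Generation_Counter_Too_Low = 5
    Error_Incompatible_with_Overlay = 6
    Error_Unsupported_Forwarding_Option = 7
    Error_Data_Too_Large = 8
    Error_Data_Too_Old = 9
    Error_TTL_Exceeded = 10
    Error_Message_Too_Large = 11
    Error_Unknown_Kind = 12
    Error_Unknown_Extension = 13
    Error_Response_Too_Large = 14
    Error_Config_Too_Old = 15
    Error_Config_Too_New = 16
    Error_In_Progress = 17
    Error_Exp_A = 18
    Error_Exp_B = 19
    Error_Invalid_Message = 20


# Values set aside for experimentation; not for operational deployments.
EXPERIMENTAL = {ErrorCode.Error_Exp_A, ErrorCode.Error_Exp_B}

# The reserved block 0x8000..0xfffe (inclusive) from the last table row.
RESERVED_RANGE = range(0x8000, 0xFFFF)


def describe(code: int) -> str:
    """Return a short label for a 16-bit error code value."""
    if code in RESERVED_RANGE:
        return "reserved"
    try:
        member = ErrorCode(code)
    except ValueError:
        return "unassigned"
    if member in EXPERIMENTAL:
        return f"{member.name} (experimental)"
    return member.name
```

A receiver could use such a lookup to log unknown or experimental codes distinctly instead of treating every unrecognized value the same way.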
7180 14.10. Overlay Link Types 7182 IANA SHALL create a "RELOAD Overlay Link Registry". Entries in this 7183 registry are 8-bit integers as described in Section 6.5.1.1 of [RFC-to-be]. 7184 For more information on the link types defined here, see 7185 Section 6.6 of [RFC-to-be]. New entries SHALL be defined via RFC 7186 5226 Standards Action. This registry SHALL be initially populated 7187 with the following values:
7189 +--------------------+------+---------------+
7190 | Protocol           | Code | Specification |
7191 +--------------------+------+---------------+
7192 | INVALID-PROTOCOL   | 0    | RFC-to-be     |
7193 | DTLS-UDP-SR        | 1    | RFC-to-be     |
7194 | DTLS-UDP-SR-NO-ICE | 3    | RFC-to-be     |
7195 | TLS-TCP-FH-NO-ICE  | 4    | RFC-to-be     |
7196 | EXP-LINK           | 5    | RFC-to-be     |
7197 | reserved           | 255  | RFC-to-be     |
7198 +--------------------+------+---------------+
7200 The value EXP-LINK has been made available for the purposes of 7201 experimentation. This value is not meant for vendor specific use of 7202 any sort and it MUST NOT be used for operational deployments. 7204 14.11. Overlay Link Protocols 7206 IANA SHALL create an "Overlay Link Protocol Registry". Entries in 7207 this registry are strings denoting protocols as described in 7208 Section 11.1 of [RFC-to-be] and SHALL be defined via RFC 5226 7209 Standards Action. This registry SHALL be initially populated with 7210 the following values:
7212 +---------------+---------------+
7213 | Link Protocol | Specification |
7214 +---------------+---------------+
7215 | TLS           | RFC-to-be     |
7216 | EXP-PROTOCOL  | RFC-to-be     |
7217 +---------------+---------------+
7219 The value EXP-PROTOCOL has been made available for the purposes of 7220 experimentation. This value is not meant for vendor specific use of 7221 any sort and it MUST NOT be used for operational deployments. 7223 14.12. Forwarding Options 7225 IANA SHALL create a "Forwarding Option Registry".
Entries in this 7226 registry are 8-bit integers denoting options as described in 7227 Section 6.3.2.3 of [RFC-to-be]. Values between 1 and 127 SHALL be 7228 defined via RFC 5226 Standards Action. Entries in this registry 7229 between 128 and 254 SHALL be defined via RFC 5226 Specification 7230 Required. This registry SHALL be initially populated with the 7231 following values:
7233 +-------------------------+------+---------------+
7234 | Forwarding Option       | Code | Specification |
7235 +-------------------------+------+---------------+
7236 | invalidForwardingOption | 0    | RFC-to-be     |
7237 | exp-forward             | 1    | RFC-to-be     |
7238 | reserved                | 255  | RFC-to-be     |
7239 +-------------------------+------+---------------+
7241 The value exp-forward has been made available for the purposes of 7242 experimentation. This value is not meant for vendor specific use of 7243 any sort and it MUST NOT be used for operational deployments. 7245 14.13. Probe Information Types 7247 IANA SHALL create a "RELOAD Probe Information Type Registry". 7248 Entries are 8-bit integers denoting types as described in 7249 Section 6.4.2.5.1 of [RFC-to-be] and SHALL be defined via RFC 5226 7250 Standards Action. This registry SHALL be initially populated with 7251 the following values:
7253 +--------------------+------+---------------+
7254 | Probe Option       | Code | Specification |
7255 +--------------------+------+---------------+
7256 | invalidProbeOption | 0    | RFC-to-be     |
7257 | responsible_set    | 1    | RFC-to-be     |
7258 | num_resources      | 2    | RFC-to-be     |
7259 | uptime             | 3    | RFC-to-be     |
7260 | exp-probe          | 4    | RFC-to-be     |
7261 | reserved           | 255  | RFC-to-be     |
7262 +--------------------+------+---------------+
7264 The value exp-probe has been made available for the purposes of 7265 experimentation. This value is not meant for vendor specific use of 7266 any sort and it MUST NOT be used for operational deployments. 7268 14.14. Message Extensions 7270 IANA SHALL create a "RELOAD Extensions Registry".
Entries in this 7271 registry are 16-bit integers denoting extensions as described in 7272 Section 6.3.3 of [RFC-to-be] and SHALL be defined via RFC 5226 7273 Specification Required. This registry SHALL be initially populated 7274 with the following values:
7276 +-----------------------------+--------+---------------+
7277 | Extensions Name             | Code   | Specification |
7278 +-----------------------------+--------+---------------+
7279 | invalidMessageExtensionType | 0      | RFC-to-be     |
7280 | exp-ext                     | 1      | RFC-to-be     |
7281 | reserved                    | 0xFFFF | RFC-to-be     |
7282 +-----------------------------+--------+---------------+
7284 The value exp-ext has been made available for the purposes of 7285 experimentation. This value is not meant for vendor-specific use of 7286 any sort and it MUST NOT be used for operational deployments. 7288 14.15. reload URI Scheme 7290 This section describes the scheme for a reload URI, which can be used 7291 to refer to either: 7293 o A peer, e.g., as used in a certificate (see Section 11.3 of [RFC- 7294 to-be]). 7296 o A resource inside a peer. 7298 The reload URI is defined using a subset of the URI schema specified 7299 in Appendix A of RFC 3986 [RFC3986] and the associated URI Guidelines 7300 [RFC4395] per the following ABNF syntax:
7302 RELOAD-URI = "reload://" destination "@" overlay "/"
7303              [specifier]
7305 destination = 1*HEXDIG
7306 overlay = reg-name
7307 specifier = 1*HEXDIG
7309 The definitions of these productions are as follows: 7311 destination: a hex-encoded Destination List object (i.e., multiple 7312 concatenated Destination objects with no length prefix prior to 7313 the object as a whole). 7315 overlay: the name of the overlay. 7317 specifier: a hex-encoded StoredDataSpecifier indicating the data 7318 element. 7320 If no specifier is present then this URI addresses the peer which can 7321 be reached via the indicated destination list at the indicated 7322 overlay name.
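As a non-normative illustration, the ABNF above can be sketched in Python. The names parse_reload_uri and ReloadUri are hypothetical and are not defined by this specification. The sketch makes several assumptions: the overlay pattern approximates the reg-name production of RFC 3986, each hex field carries an even number of digits so that it decodes to whole octets, and the Destination List and StoredDataSpecifier are returned as undecoded bytes.

```python
import re
from typing import NamedTuple, Optional

# Assumption: reg-name is approximated as unreserved / pct-encoded /
# sub-delims characters (RFC 3986). ABNF literals and HEXDIG are
# case-insensitive, hence re.IGNORECASE.
_RELOAD_URI = re.compile(
    r"reload://(?P<destination>[0-9a-f]+)"
    r"@(?P<overlay>(?:[a-z0-9\-._~!$&'()*+,;=]|%[0-9a-f]{2})*)"
    r"/(?P<specifier>[0-9a-f]+)?",
    re.IGNORECASE,
)


class ReloadUri(NamedTuple):
    """Hypothetical container for the three parts of a reload URI."""

    destination: bytes          # hex-decoded Destination List, not parsed further
    overlay: str                # overlay name
    specifier: Optional[bytes]  # hex-decoded StoredDataSpecifier, or None

    @property
    def addresses_peer(self) -> bool:
        # With no specifier the URI addresses a peer; otherwise a data value.
        return self.specifier is None


def parse_reload_uri(uri: str) -> ReloadUri:
    """Split a reload URI into its parts; raise ValueError if malformed."""
    match = _RELOAD_URI.fullmatch(uri)
    if match is None:
        raise ValueError(f"not a reload URI: {uri!r}")
    specifier = match.group("specifier")
    # bytes.fromhex raises ValueError on an odd number of hex digits,
    # which this sketch treats as malformed (an assumption, see lead-in).
    return ReloadUri(
        destination=bytes.fromhex(match.group("destination")),
        overlay=match.group("overlay"),
        specifier=bytes.fromhex(specifier) if specifier is not None else None,
    )
```

For example, parse_reload_uri("reload://0a0b@example.org/") returns a value whose specifier is None and whose addresses_peer property is True. The hex string and overlay name are placeholders, not a valid Destination List.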
If a specifier is present, then the URI addresses the 7323 data value. 7325 14.15.1. URI Registration 7327 [[ Note to RFC Editor - please remove this paragraph before 7328 publication. ]] A review request was sent to uri-review@ietf.org on 7329 Oct 7, 2010. 7331 The following summarizes the information necessary to register the 7332 reload URI. 7334 URI Scheme Name: reload 7336 Status: permanent 7338 URI Scheme Syntax: see Section 14.15 of [RFC-to-be] 7340 URI Scheme Semantics: The reload URI is intended to be used as a 7341 reference to a RELOAD peer or resource. 7343 Encoding Considerations: The reload URI is not intended to be human- 7344 readable text, so it is encoded entirely in US-ASCII. 7346 Applications/protocols that use this URI scheme: The RELOAD protocol 7347 described in RFC-to-be. 7349 Interoperability considerations: See RFC-to-be. 7351 Security considerations: See RFC-to-be. 7353 Contact: Cullen Jennings 7355 Author/Change controller: IESG 7357 References: RFC-to-be 7359 14.16. Media Type Registration 7361 [[ Note to RFC Editor - please remove this paragraph before 7362 publication. ]] A review request was sent to ietf-types@iana.org on 7363 May 27, 2011. 7365 Type name: application 7367 Subtype name: p2p-overlay+xml 7369 Required parameters: none 7371 Optional parameters: none 7373 Encoding considerations: Must be binary encoded. 7375 Security considerations: This media type is typically not used to 7376 transport information that needs to be kept confidential; however, 7377 there are cases where the integrity of the information is 7378 important. For these cases, using a digital signature is RECOMMENDED. 7379 One way of doing this is specified in RFC-to-be. When 7380 the media includes a "shared-secret" element, the contents of 7381 the file MUST be kept confidential; otherwise, anyone who can see the 7382 shared-secret can affect the RELOAD overlay network.
7384 Interoperability considerations: No known interoperability 7385 considerations beyond those identified for application/xml in 7386 [RFC3023]. 7388 Published specification: RFC-to-be 7390 Applications that use this media type: The type is used to configure 7391 the peer-to-peer overlay networks defined in RFC-to-be. 7393 Additional information: The syntax for this media type is specified 7394 in Section 11.1 of [RFC-to-be]. The contents MUST be valid XML 7395 compliant with the RELAX NG grammar specified in RFC-to-be and use 7396 the UTF-8 [RFC3629] character encoding. 7398 Magic number(s): none 7400 File extension(s): relo 7402 Macintosh file type code(s): none 7404 Person & email address to contact for further information: Cullen 7405 Jennings 7407 Intended usage: COMMON 7409 Restrictions on usage: None 7411 Author: Cullen Jennings 7413 Change controller: IESG 7415 14.17. XML Name Space Registration 7417 This document registers two URIs for the config and config-chord XML 7418 namespaces in the IETF XML registry defined in [RFC3688]. 7420 14.17.1. Config URL 7422 URI: urn:ietf:params:xml:ns:p2p:config-base 7424 Registrant Contact: The IESG. 7426 XML: N/A, the requested URIs are XML namespaces 7428 14.17.2. Config Chord URL 7430 URI: urn:ietf:params:xml:ns:p2p:config-chord 7432 Registrant Contact: The IESG. 7434 XML: N/A, the requested URIs are XML namespaces 7436 15. Acknowledgments 7438 This specification is a merge of the "REsource LOcation And Discovery 7439 (RELOAD)" draft by David A. Bryan, Marcia Zangrilli and Bruce B. 7440 Lowekamp, the "Address Settlement by Peer to Peer" draft by Cullen 7441 Jennings, Jonathan Rosenberg, and Eric Rescorla, the "Security 7442 Extensions for RELOAD" draft by Bruce B. Lowekamp and James Deverick, 7443 the "A Chord-based DHT for Resource Lookup in P2PSIP" by Marcia 7444 Zangrilli and David A. Bryan, and the Peer-to-Peer Protocol (P2PP) 7445 draft by Salman A. Baset, Henning Schulzrinne, and Marcin 7446 Matuszewski.
Thanks to the authors of RFC 5389 for text included 7447 from that document. Vidya Narayanan provided many comments and improvements. 7449 The ideas and text for the Chord-specific extension data to the Leave 7450 mechanisms were provided by Jouni Maenpaa, Gonzalo Camarillo, and Jani 7451 Hautakorpi. 7453 Thanks to the many people who contributed, including Ted Hardie, 7454 Michael Chen, Dan York, Das Saumitra, Lyndsay Campbell, Brian Rosen, 7455 David Bryan, Dave Craig, and Julian Cain. Extensive last call 7456 comments were provided by: Jouni Maenpaa, Roni Even, Gonzalo 7457 Camarillo, Ari Keranen, John Buford, Michael Chen, Frederic-Philippe 7458 Met, Mary Barnes, Roland Bless, David Bryan and Polina Goltsman. 7459 Special thanks to Marc Petit-Huguenin, who provided an amazing amount 7460 of detailed review. 7462 Dean Willis and Marc Petit-Huguenin helped resolve and provided text 7463 to fix many comments received during IESG review. 7465 16. References 7467 16.1. Normative References 7469 [OASIS.relax_ng] Bray, T. and M. Murata, "RELAX 7470 NG Specification", 7471 December 2001. 7473 [RFC1918] Rekhter, Y., Moskowitz, R., 7474 Karrenberg, D., Groot, G., and 7475 E. Lear, "Address Allocation for 7476 Private Internets", BCP 5, 7477 RFC 1918, February 1996. 7479 [RFC2119] Bradner, S., "Key words for use 7480 in RFCs to Indicate Requirement 7481 Levels", BCP 14, RFC 2119, 7482 March 1997. 7484 [RFC2388] Masinter, L., "Returning Values 7485 from Forms: multipart/ 7486 form-data", RFC 2388, 7487 August 1998. 7489 [RFC2585] Housley, R. and P. Hoffman, 7490 "Internet X.509 Public Key 7491 Infrastructure Operational 7492 Protocols: FTP and HTTP", 7493 RFC 2585, May 1999. 7495 [RFC2782] Gulbrandsen, A., Vixie, P., and 7496 L. Esibov, "A DNS RR for 7497 specifying the location of 7498 services (DNS SRV)", RFC 2782, 7499 February 2000. 7501 [RFC2818] Rescorla, E., "HTTP Over TLS", 7502 RFC 2818, May 2000. 7504 [RFC3023] Murata, M., St. Laurent, S., and 7505 D.
Kohn, "XML Media Types", 7506 RFC 3023, January 2001. 7508 [RFC3174] Eastlake, D. and P. Jones, "US 7509 Secure Hash Algorithm 1 (SHA1)", 7510 RFC 3174, September 2001. 7512 [RFC3339] Klyne, G., Ed. and C. Newman, 7513 "Date and Time on the Internet: 7514 Timestamps", RFC 3339, 7515 July 2002. 7517 [RFC3447] Jonsson, J. and B. Kaliski, 7518 "Public-Key Cryptography 7519 Standards (PKCS) #1: RSA 7520 Cryptography Specifications 7521 Version 2.1", RFC 3447, 7522 February 2003. 7524 [RFC3629] Yergeau, F., "UTF-8, a 7525 transformation format of ISO 7526 10646", STD 63, RFC 3629, 7527 November 2003. 7529 [RFC3986] Berners-Lee, T., Fielding, R., 7530 and L. Masinter, "Uniform 7531 Resource Identifier (URI): 7532 Generic Syntax", STD 66, 7533 RFC 3986, January 2005. 7535 [RFC4279] Eronen, P. and H. Tschofenig, 7536 "Pre-Shared Key Ciphersuites for 7537 Transport Layer Security (TLS)", 7538 RFC 4279, December 2005. 7540 [RFC4395] Hansen, T., Hardie, T., and L. 7541 Masinter, "Guidelines and 7542 Registration Procedures for New 7543 URI Schemes", BCP 35, RFC 4395, 7544 February 2006. 7546 [RFC4648] Josefsson, S., "The Base16, 7547 Base32, and Base64 Data 7548 Encodings", RFC 4648, 7549 October 2006. 7551 [RFC5245] Rosenberg, J., "Interactive 7552 Connectivity Establishment 7553 (ICE): A Protocol for Network 7554 Address Translator (NAT) 7555 Traversal for Offer/Answer 7556 Protocols", RFC 5245, 7557 April 2010. 7559 [RFC5246] Dierks, T. and E. Rescorla, "The 7560 Transport Layer Security (TLS) 7561 Protocol Version 1.2", RFC 5246, 7562 August 2008. 7564 [RFC5272] Schaad, J. and M. Myers, 7565 "Certificate Management over CMS 7566 (CMC)", RFC 5272, June 2008. 7568 [RFC5273] Schaad, J. and M. Myers, 7569 "Certificate Management over CMS 7570 (CMC): Transport Protocols", 7571 RFC 5273, June 2008. 7573 [RFC5389] Rosenberg, J., Mahy, R., 7574 Matthews, P., and D. Wing, 7575 "Session Traversal Utilities for 7576 NAT (STUN)", RFC 5389, 7577 October 2008. 7579 [RFC5405] Eggert, L. 
and G. Fairhurst, 7580 "Unicast UDP Usage Guidelines 7581 for Application Designers", 7582 BCP 145, RFC 5405, 7583 November 2008. 7585 [RFC5766] Mahy, R., Matthews, P., and J. 7586 Rosenberg, "Traversal Using 7587 Relays around NAT (TURN): Relay 7588 Extensions to Session Traversal 7589 Utilities for NAT (STUN)", 7590 RFC 5766, April 2010. 7592 [RFC5952] Kawamura, S. and M. Kawashima, 7593 "A Recommendation for IPv6 7594 Address Text Representation", 7595 RFC 5952, August 2010. 7597 [RFC6091] Mavrogiannopoulos, N. and D. 7598 Gillmor, "Using OpenPGP Keys for 7599 Transport Layer Security (TLS) 7600 Authentication", RFC 6091, 7601 February 2011. 7603 [RFC6234] Eastlake, D. and T. Hansen, "US 7604 Secure Hash Algorithms (SHA and 7605 SHA-based HMAC and HKDF)", 7606 RFC 6234, May 2011. 7608 [RFC6298] Paxson, V., Allman, M., Chu, J., 7609 and M. Sargent, "Computing TCP's 7610 Retransmission Timer", RFC 6298, 7611 June 2011. 7613 [RFC6347] Rescorla, E. and N. Modadugu, 7614 "Datagram Transport Layer 7615 Security Version 1.2", RFC 6347, 7616 January 2012. 7618 [W3C.REC-xmlschema-2-20041028] Malhotra, A. and P. Biron, "XML 7619 Schema Part 2: Datatypes Second 7620 Edition", World Wide Web 7621 Consortium Recommendation REC- 7622 xmlschema-2-20041028, 7623 October 2004. 7627 [w3c-xml-namespaces] Bray, T., Hollander, D., Layman, 7628 A., Tobin, R., and H. Thompson, 7629 "Namespaces in XML 1.0 (Third 7630 Edition)", December 2008. 7632 16.2. Informative References 7634 [Chord] Stoica, I., Morris, R., Liben- 7635 Nowell, D., Karger, D., 7636 Kaashoek, M., Dabek, F., and H. 7637 Balakrishnan, "Chord: A Scalable 7638 Peer-to-peer Lookup Protocol for 7639 Internet Applications", IEEE/ACM 7640 Transactions on 7641 Networking Volume 11, Issue 1, 7642 17-32, Feb 2003, 2001. 7644 [Eclipse] Singh, A., Ngan, T., Druschel, 7645 T., and D. Wallach, "Eclipse 7646 Attacks on Overlay Networks: 7647 Threats and Defenses", 7648 INFOCOM 2006, April 2006.
7650 [I-D.ietf-hip-reload-instance] Keranen, A., Camarillo, G., and 7651 J. Maenpaa, "Host Identity 7652 Protocol-Based Overlay 7653 Networking Environment (HIP 7654 BONE) Instance Specification for 7655 REsource LOcation And Discovery 7656 (RELOAD)", draft-ietf-hip- 7657 reload-instance-06 (work in 7658 progress), November 2012. 7660 [I-D.ietf-p2psip-diagnostics] Song, H., Jiang, X., Even, R., 7661 and D. Bryan, "P2PSIP Overlay 7662 Diagnostics", 7663 draft-ietf-p2psip-diagnostics-09 7664 (work in progress), August 2012. 7666 [I-D.ietf-p2psip-rpr] Zong, N., Jiang, X., Even, R., 7667 and Y. Zhang, "An extension to 7668 RELOAD to support Relay Peer 7669 Routing", 7670 draft-ietf-p2psip-rpr-03 (work 7671 in progress), October 2012. 7673 [I-D.ietf-p2psip-self-tuning] Maenpaa, J., Camarillo, G., and 7674 J. Hautakorpi, "A Self-tuning 7675 Distributed Hash Table (DHT) for 7676 REsource LOcation And Discovery 7677 (RELOAD)", 7678 draft-ietf-p2psip-self-tuning-06 7679 (work in progress), July 2012. 7681 [I-D.ietf-p2psip-service-discovery] Maenpaa, J. and G. Camarillo, 7682 "Service Discovery Usage for 7683 REsource LOcation And Discovery 7684 (RELOAD)", draft-ietf-p2psip- 7685 service-discovery-06 (work in 7686 progress), October 2012. 7688 [I-D.ietf-p2psip-sip] Jennings, C., Lowekamp, B., 7689 Rescorla, E., Baset, S., 7690 Schulzrinne, H., and T. Schmidt, 7691 "A SIP Usage for RELOAD", 7692 draft-ietf-p2psip-sip-08 (work 7693 in progress), December 2012. 7695 [RFC1035] Mockapetris, P., "Domain names - 7696 implementation and 7697 specification", STD 13, 7698 RFC 1035, November 1987. 7700 [RFC1122] Braden, R., "Requirements for 7701 Internet Hosts - Communication 7702 Layers", STD 3, RFC 1122, 7703 October 1989. 7705 [RFC2311] Dusse, S., Hoffman, P., 7706 Ramsdell, B., Lundblade, L., and 7707 L. Repka, "S/MIME Version 2 7708 Message Specification", 7709 RFC 2311, March 1998. 7711 [RFC3688] Mealling, M., "The IETF XML 7712 Registry", BCP 81, RFC 3688, 7713 January 2004. 
7715 [RFC4013] Zeilenga, K., "SASLprep: 7716 Stringprep Profile for User 7717 Names and Passwords", RFC 4013, 7718 February 2005. 7720 [RFC4086] Eastlake, D., Schiller, J., and 7721 S. Crocker, "Randomness 7722 Requirements for Security", 7723 BCP 106, RFC 4086, June 2005. 7725 [RFC4145] Yon, D. and G. Camarillo, "TCP- 7726 Based Media Transport in the 7727 Session Description Protocol 7728 (SDP)", RFC 4145, 7729 September 2005. 7731 [RFC4340] Kohler, E., Handley, M., and S. 7732 Floyd, "Datagram Congestion 7733 Control Protocol (DCCP)", 7734 RFC 4340, March 2006. 7736 [RFC4787] Audet, F. and C. Jennings, 7737 "Network Address Translation 7738 (NAT) Behavioral Requirements 7739 for Unicast UDP", BCP 127, 7740 RFC 4787, January 2007. 7742 [RFC4960] Stewart, R., "Stream Control 7743 Transmission Protocol", 7744 RFC 4960, September 2007. 7746 [RFC5054] Taylor, D., Wu, T., 7747 Mavrogiannopoulos, N., and T. 7748 Perrin, "Using the Secure Remote 7749 Password (SRP) Protocol for TLS 7750 Authentication", RFC 5054, 7751 November 2007. 7753 [RFC5095] Abley, J., Savola, P., and G. 7754 Neville-Neil, "Deprecation of 7755 Type 0 Routing Headers in IPv6", 7756 RFC 5095, December 2007. 7758 [RFC5201] Moskowitz, R., Nikander, P., 7759 Jokela, P., and T. Henderson, 7760 "Host Identity Protocol", 7761 RFC 5201, April 2008. 7763 [RFC5280] Cooper, D., Santesson, S., 7764 Farrell, S., Boeyen, S., 7765 Housley, R., and W. Polk, 7766 "Internet X.509 Public Key 7767 Infrastructure Certificate and 7768 Certificate Revocation List 7769 (CRL) Profile", RFC 5280, 7770 May 2008. 7772 [RFC5694] Camarillo, G. and IAB, "Peer-to- 7773 Peer (P2P) Architecture: 7774 Definition, Taxonomies, 7775 Examples, and Applicability", 7776 RFC 5694, November 2009. 7778 [RFC5765] Schulzrinne, H., Marocco, E., 7779 and E. Ivov, "Security Issues 7780 and Solutions in Peer-to-Peer 7781 Systems for Realtime 7782 Communications", RFC 5765, 7783 February 2010. 7785 [RFC5785] Nottingham, M. and E. 
Hammer- 7786 Lahav, "Defining Well-Known 7787 Uniform Resource Identifiers 7788 (URIs)", RFC 5785, April 2010. 7790 [RFC6079] Camarillo, G., Nikander, P., 7791 Hautakorpi, J., Keranen, A., and 7792 A. Johnston, "HIP BONE: Host 7793 Identity Protocol (HIP) Based 7794 Overlay Networking Environment 7795 (BONE)", RFC 6079, January 2011. 7797 [RFC6544] Rosenberg, J., Keranen, A., 7798 Lowekamp, B., and A. Roach, "TCP 7799 Candidates with Interactive 7800 Connectivity Establishment 7801 (ICE)", RFC 6544, March 2012. 7803 [Sybil] Douceur, J., "The Sybil Attack", 7804 IPTPS 02, March 2002. 7806 [UnixTime] Wikipedia, "Unix Time", 2013. 7810 [bryan-design-hotp2p08] Bryan, D., Lowekamp, B., and M. 7811 Zangrilli, "The Design of a 7812 Versatile, Secure P2PSIP 7813 Communications Architecture for 7814 the Public Internet", Hot- 7815 P2P'08, 2008. 7817 [handling-churn-usenix04] Rhea, S., Geels, D., Roscoe, T., 7818 and J. Kubiatowicz, "Handling 7819 Churn in a DHT", In Proc. of the 7820 USENIX Annual Technical 7821 Conference June 2004 USENIX 7822 2004, 2004. 7824 [lookups-churn-p2p06] Wu, D., Tian, Y., and K. Ng, 7825 "Analytical Study on Improving 7826 DHT Lookup Performance under 7827 Churn", IEEE P2P'06, 2006. 7829 [minimizing-churn-sigcomm06] Godfrey, P., Shenker, S., and I. 7830 Stoica, "Minimizing Churn in 7831 Distributed Systems", SIGCOMM 7832 2006, 2006. 7834 [non-transitive-dhts-worlds05] Freedman, M., Lakshminarayanan, 7835 K., Rhea, S., and I. Stoica, 7836 "Non-Transitive Connectivity and 7837 DHTs", WORLDS'05, 2005. 7839 [opendht-sigcomm05] Rhea, S., Godfrey, B., Karp, B., 7840 Kubiatowicz, J., Ratnasamy, S., 7841 Shenker, S., Stoica, I., and H. 7842 Yu, "OpenDHT: A Public DHT and 7843 its Uses", SIGCOMM'05, 2005. 7845 [vulnerabilities-acsac04] Srivatsa, M. and L. Liu, 7846 "Vulnerabilities and Security 7847 Threats in Structured Peer-to- 7848 Peer Systems: A Quantitative 7849 Analysis", ACSAC 2004, 2004.
7851 [wikiChord] Wikipedia, "Chord (peer-to- 7852 peer)", 2013. 7856 [wikiKBR] Wikipedia, "Key-based routing", 7857 2013. 7860 [wikiSkiplist] Wikipedia, "Skip list", 2013. 7864 Appendix A. Routing Alternatives 7866 Significant discussion has been focused on the selection of a routing 7867 algorithm for P2PSIP. This section discusses the motivations for 7868 selecting symmetric recursive routing for RELOAD and describes the 7869 extensions that would be required to support additional routing 7870 algorithms. 7872 A.1. Iterative vs Recursive 7874 Iterative routing has a number of advantages. It is easier to debug, 7875 consumes fewer resources on intermediate peers, and allows the 7876 querying peer to identify and route around misbehaving peers 7877 [non-transitive-dhts-worlds05]. However, in the presence of NATs, 7878 iterative routing is intolerably expensive because a new connection 7879 must be established for each hop (using ICE) [bryan-design-hotp2p08]. 7881 Iterative routing is supported through the RouteQuery mechanism and 7882 is primarily intended for debugging. It also allows the querying 7883 peer to evaluate the routing decisions made by the peers at each hop, 7884 consider alternatives, and perhaps detect at what point the 7885 forwarding path fails. 7887 A.2. Symmetric vs Forward Response 7889 An alternative to the symmetric recursive routing method used by 7890 RELOAD is Forward-Only routing, where the response is routed to the 7891 requester as if it were a new message initiated by the responder (in 7892 the previous example, Z sends the response to A as if it were sending 7893 a request). Forward-only routing requires no state in either the 7894 message or intermediate peers. 7896 The drawback of forward-only routing is that it does not work when 7897 the overlay is unstable. For example, if A is in the process of 7898 joining the overlay and is sending a Join request to Z, it is not yet 7899 reachable via forward-only routing.
Even if A is established in the 7900 overlay, if network failures produce temporary instability, A may not 7901 be reachable (and may be trying to stabilize its network connectivity 7902 via Attach messages). 7904 Furthermore, forward-only responses are less likely to reach the 7905 querying peer than symmetric recursive ones are, because the forward 7906 path is more likely to have a failed peer than is the request path 7907 (which was just tested to route the request) 7908 [non-transitive-dhts-worlds05]. 7910 An extension to RELOAD that supports forward-only routing but relies 7911 on symmetric responses as a fallback would be possible, but due to 7912 the complexities of determining when to use forward-only and when to 7913 fall back to symmetric, we have chosen not to include it as an option 7914 at this point. 7916 A.3. Direct Response 7918 Another routing option is Direct Response routing, in which the 7919 response is returned directly to the querying node. In the previous 7920 example, if A encodes its IP address in the request, then Z can 7921 simply deliver the response directly to A. In the absence of NATs or 7922 other connectivity issues, this is the optimal routing technique. 7924 The challenge of implementing direct response is the presence of 7925 NATs. There are a number of complexities that must be addressed. In 7926 this discussion, we will continue our assumption that A issued the 7927 request and Z is generating the response. 7929 o The IP address listed by A may be unreachable, either due to NAT 7930 or firewall rules. Therefore, a direct response technique must 7931 fall back to symmetric response [non-transitive-dhts-worlds05]. 7932 The hop-by-hop ACKs used by RELOAD allow Z to determine when A has 7933 received the message (and the TLS negotiation will provide earlier 7934 confirmation that A is reachable), but this fallback requires a 7935 timeout that will increase the response latency whenever A is not 7936 reachable from Z.
7938 o Whenever A is behind a NAT it will have multiple candidate IP 7939 addresses, each of which must be advertised to ensure 7940 connectivity; therefore Z will need to attempt multiple 7941 connections to deliver the response. 7943 o One (or all) of A's candidate addresses may route from Z to a 7944 different device on the Internet. In the worst case these nodes 7945 may actually be running RELOAD on the same port. Therefore, it is 7946 absolutely necessary to establish a secure connection to 7947 authenticate A before delivering the response. This step 7948 diminishes the efficiency of direct response because multiple 7949 round trips are required before the message can be delivered. 7951 o If A is behind a NAT and does not have a connection already 7952 established with Z, there are only two ways the direct response 7953 will work. The first is that A and Z are both behind the same NAT, 7954 in which case the NAT is not involved. In the more common case, 7955 when Z is outside A's NAT, the response will only be received if 7956 A's NAT implements endpoint-independent filtering. As the choice 7957 of filtering mode conflates application transparency with security 7958 [RFC4787], and no clear recommendation is available, the 7959 prevalence of this feature in future devices remains unclear. 7961 An extension to RELOAD that supports direct response routing but 7962 relies on symmetric responses as a fallback would be possible, but 7963 due to the complexities of determining when to use direct response 7964 and when to fall back to symmetric, and the reduced performance for 7965 responses to peers behind restrictive NATs, we have chosen not to 7966 include it as an option at this point. 7968 A.4. Relay Peers 7970 [I-D.ietf-p2psip-rpr] has proposed implementing a form of direct 7971 response by having A identify a peer, Q, that will be directly 7972 reachable by any other peer.
A uses Attach to establish a connection 7973 with Q and advertises Q's IP address in the request sent to Z. Z 7974 sends the response to Q, which relays it to A. This then reduces the 7975 latency to two hops, plus Z negotiating a secure connection to Q. 7977 This technique relies on the relative population of nodes such as A 7978 that require relay peers and peers such as Q that are capable of 7979 serving as a relay peer. It also requires nodes to be able to 7980 identify which category they are in. This identification problem has 7981 turned out to be hard to solve and is still an open area of 7982 exploration. 7984 An extension to RELOAD that supports relay peers is possible, but due 7985 to the complexities of implementing such an alternative, we have not 7986 added such a feature to RELOAD at this point. 7988 A concept similar to relay peers, essentially choosing a relay peer 7989 at random, has previously been suggested to solve problems of 7990 pairwise non-transitivity [non-transitive-dhts-worlds05], but 7991 deterministic filtering provided by NATs makes random relay peers no 7992 more likely to work than the responding peer. 7994 A.5. Symmetric Route Stability 7996 A common concern about symmetric recursive routing has been that one 7997 or more peers along the request path may fail before the response is 7998 received. The significance of this problem essentially depends on 7999 the response latency of the overlay. An overlay that produces slow 8000 responses will be vulnerable to churn, whereas responses that are 8001 delivered very quickly are vulnerable only to failures that occur 8002 over that small interval. 8004 The other aspect of this issue is whether the request itself can be 8005 successfully delivered. 
Assuming typical connection maintenance 8006 intervals, the time period between the last maintenance and the 8007 request being sent will be orders of magnitude greater than the delay 8008 between the request being forwarded and the response being received. 8009 Therefore, if the path was stable enough to be available to route the 8010 request, it is almost certainly going to remain available to route 8011 the response. 8013 An overlay that is unstable enough to suffer this type of failure 8014 frequently is unlikely to be able to support reliable functionality 8015 regardless of the routing mechanism. However, regardless of the 8016 stability of the return path, studies show that in the event of high 8017 churn, iterative routing is a better solution to ensure request 8018 completion [lookups-churn-p2p06] [non-transitive-dhts-worlds05]. 8020 Finally, because RELOAD retries the end-to-end request, that retry 8021 will address the issues of churn that remain. 8023 Appendix B. Why Clients? 8025 There is a wide variety of reasons a node may act as a client rather 8026 than as a peer. This section outlines some of those scenarios and 8027 how the client's behavior changes based on its capabilities. 8029 B.1. Why Not Only Peers? 8031 For a number of reasons, a particular node may be forced to act as a 8032 client even though it is willing to act as a peer. These include: 8034 o The node does not have appropriate network connectivity, typically 8035 because it has a low-bandwidth network connection. 8037 o The node may not have sufficient resources, such as computing 8038 power, storage space, or battery power. 8040 o The overlay algorithm may dictate specific requirements for peer 8041 selection. These may include participating in the overlay to 8042 determine trustworthiness; controlling the number of peers in the 8043 overlay to reduce overly long routing paths; or ensuring minimum 8044 application uptime before a node can join as a peer.
8046 The ultimate criteria for a node to become a peer are determined by 8047 the overlay algorithm and specific deployment. A node acting as a 8048 client that has a full implementation of RELOAD and the appropriate 8049 overlay algorithm is capable of locating its responsible peer in the 8050 overlay and using Attach to establish a direct connection to that 8051 peer. In that way, it may elect to be reachable under either of the 8052 routing approaches listed above. Particularly for overlay algorithms 8053 that elect nodes to serve as peers based on trustworthiness or 8054 population, the overlay algorithm may require such a client to locate 8055 itself at a particular place in the overlay. 8057 B.2. Clients as Application-Level Agents 8059 SIP defines an extensive protocol for registration and security 8060 between a client and its registrar/proxy server(s). Any SIP device 8061 can act as a client of a RELOAD-based P2PSIP overlay if it contacts a 8062 peer that implements the server-side functionality required by the 8063 SIP protocol. In this case, the peer would be acting as if it were 8064 the user's peer, and would need the appropriate credentials for that 8065 user. 8067 Application-level support for clients is defined by a usage. A usage 8068 offering support for application-level clients should specify how the 8069 security of the system is maintained when the data is moved between 8070 the application and RELOAD layers. 8072 Authors' Addresses 8074 Cullen Jennings 8075 Cisco 8076 400 3rd Avenue SW, Suite 350 8077 Calgary 8078 Canada 8080 EMail: fluffy@cisco.com 8082 Bruce B. Lowekamp (editor) 8083 Skype 8084 Palo Alto, CA 8085 USA 8087 EMail: bbl@lowekamp.net 8089 Eric Rescorla 8090 RTFM, Inc. 8091 2064 Edgewood Drive 8092 Palo Alto, CA 94303 8093 USA 8095 Phone: +1 650 678 2350 8096 EMail: ekr@rtfm.com 8098 Salman A. 
Baset 8099 Columbia University 8100 1214 Amsterdam Avenue 8101 New York, NY 8102 USA 8104 EMail: salman@cs.columbia.edu 8105 Henning Schulzrinne 8106 Columbia University 8107 1214 Amsterdam Avenue 8108 New York, NY 8109 USA 8111 EMail: hgs@cs.columbia.edu