TAPS Working Group                                     A. Brunstrom, Ed.
Internet-Draft                                       Karlstad University
Intended status: Informational                             T. Pauly, Ed.
Expires: September 6, 2018                                    Apple Inc.
                                                             T. Enghardt
                                                               TU Berlin
                                                           K-J. Grinnemo
                                                     Karlstad University
                                                                T. Jones
                                                  University of Aberdeen
                                                               P. Tiesel
                                                               TU Berlin
                                                              C. Perkins
                                                   University of Glasgow
                                                                M. Welzl
                                                      University of Oslo
                                                          March 05, 2018

             Implementing Interfaces to Transport Services
                      draft-brunstrom-taps-impl-00

Abstract

   The Transport Services architecture [I-D.pauly-taps-arch] defines a
   system that allows applications to use transport networking protocols
   flexibly. This document is an implementation guide for building such
   a system.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 6, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Implementing Basic Objects
   3.  Implementing Pre-Establishment
     3.1.  Configuration-time errors
     3.2.  Role of system policy
   4.  Implementing Connection Establishment
     4.1.  Candidate Gathering
       4.1.1.  Structuring Options as a Tree
       4.1.2.  Branch Types
     4.2.  Branching Order-of-Operations
     4.3.  Sorting Branches
     4.4.  Candidate Racing
       4.4.1.  Delayed Racing
       4.4.2.  Failover
     4.5.  Completing Establishment
       4.5.1.  Determining Successful Establishment
     4.6.  Establishing multiplexed connections
     4.7.  Handling racing with "unconnected" protocols
     4.8.  Implementing listeners
       4.8.1.  Implementing listeners for Connected Protocols
       4.8.2.  Implementing listeners for Unconnected Protocols
       4.8.3.  Implementing listeners for Multiplexed Protocols
   5.  Implementing Data Transfer
     5.1.  Data transfer for streams, datagrams, and frames
       5.1.1.  Sending Messages
       5.1.2.  Receiving Messages
     5.2.  Handling of data for fast-open protocols
   6.  Implementing Maintenance
     6.1.  Changing Protocol Properties
     6.2.  Handling Path Changes
   7.  Implementing Termination
   8.  Cached State
     8.1.  Protocol state caches
     8.2.  Performance caches
   9.  Specific Transport Protocol Considerations
     9.1.  TCP
     9.2.  UDP
     9.3.  SCTP
     9.4.  TLS
     9.5.  HTTP
     9.6.  QUIC
     9.7.  HTTP/2 transport
   10. Rendezvous and Environment Discovery
   11. IANA Considerations
   12. Security Considerations
     12.1.  Considerations for Candidate Gathering
     12.2.  Considerations for Candidate Racing
   13. Acknowledgements
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Additional Properties
     A.1.  Properties Affecting Sorting of Branches
     A.2.  Send Parameters
   Authors' Addresses

1.  Introduction

   The Transport Services architecture [I-D.pauly-taps-arch] defines a
   system that allows applications to use transport networking protocols
   flexibly. The interface such a system exposes to applications is
   defined as the Transport Services API [I-D.trammell-taps-interface].
   This API is designed to be generic across multiple transport
   protocols and sets of protocol features.

   This document is a guide to implementing a system that provides a
   Transport Services API. It is the job of an implementation of a
   Transport Services system to turn the requests of an application
   into decisions on how to establish connections, and how to transfer
   data over those connections once established. The terminology used
   in this document is based on the Architecture [I-D.pauly-taps-arch].

2.  Implementing Basic Objects

   The basic objects that are exposed to applications for Transport
   Services are the Preconnection, the bundle of properties that
   describes the application constraints on the transport; the
   Connection, the basic object that represents a flow of data in either
   direction between the Local and Remote Endpoints; and the Listener, a
   passive waiting object that delivers new Connections.

   Preconnection objects should be implemented as bundles of properties
   that an application can both read and write. Once a Preconnection
   has been used to create an outbound Connection or a Listener, the
   implementation should ensure that the copy of the properties held by
   the Connection or Listener is immutable. This may involve performing
   a deep-copy if the application is still able to modify properties on
   the original Preconnection object.

   Connection objects represent the interface between the application
   and the implementation to manage transport state, and conduct data
   transfer. During the process of establishment (Section 4), the
   Connection will not be bound to a specific transport flow, since
   there may be multiple candidate Protocol Stacks being raced. Once
   the Connection is established, the object should be considered
   mapped to a specific Protocol Stack. The notion of a Connection maps
   to many different protocols, depending on the Protocol Stack. For
   example, the Connection may ultimately represent the interface into
   a TCP connection, a TLS session over TCP, a UDP flow with
   fully-specified local and remote endpoints, a DTLS session, an SCTP
   stream, a QUIC stream, or an HTTP/2 stream.

   Listener objects are created with a Preconnection, at which point
   their configuration should be considered immutable by the
   implementation. The process of listening is described in
   Section 4.8.

3.  Implementing Pre-Establishment

   During pre-establishment the application specifies the Endpoints to
   be used for communication as well as its preferences regarding
   Protocol and Path Selection. The implementation stores these objects
   and properties as part of the Preconnection object for use during
   connection establishment. For Protocol and Path Selection Properties
   that are not provided by the application, the implementation must use
   the default values specified in the Transport Services API
   ([I-D.trammell-taps-interface]).

3.1.  Configuration-time errors

   The transport system should have a list of supported protocols
   available, each of which has transport features reflecting the
   capabilities of the protocol. Once an application specifies its
   Transport Parameters, the transport system should match the required
   and prohibited properties against the transport features of the
   available protocols.
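
   The matching step described above can be sketched as a set
   comparison. This is only an illustration under stated assumptions:
   the Protocol class, the feature table, and the function names are
   hypothetical, and the features assigned to each protocol are chosen
   for the example rather than taken from the Transport Services API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Protocol:
    """A supported protocol and the transport features it offers."""
    name: str
    features: FrozenSet[str]


# Illustrative feature table (an assumption made for this sketch).
TCP = Protocol("TCP", frozenset({"Reliable Data Transfer"}))
UDP = Protocol("UDP", frozenset())
SCTP = Protocol("SCTP", frozenset({"Reliable Data Transfer",
                                   "Configure Reliability per Message"}))


def eligible_protocols(available: Iterable[Protocol],
                       required: Iterable[str],
                       prohibited: Iterable[str]) -> List[Protocol]:
    """Keep protocols that offer every required feature and none of the
    prohibited ones."""
    required, prohibited = frozenset(required), frozenset(prohibited)
    return [p for p in available
            if required <= p.features and not (prohibited & p.features)]


def check_configuration(available, required, prohibited):
    """Fail during pre-establishment if no protocol can satisfy the
    requested properties; otherwise return the candidates."""
    candidates = eligible_protocols(available, required, prohibited)
    if not candidates:
        raise ValueError(
            "no available protocol satisfies the requested properties")
    return candidates
```

   Raising the error from the configuration check, before any endpoint
   resolution starts, keeps the failure cheap for the application.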

   In the following cases, failure should be detected during
   pre-establishment:

   o  The application requested Protocol Properties that include
      requirements or prohibitions that cannot be satisfied by any of
      the available protocols. For example, if an application requires
      "Configure Reliability per Message", but no such protocol is
      available on the host running the transport system, e.g., because
      SCTP is not supported by the operating system, this should result
      in an error.

   o  The application requested Protocol Properties that are in conflict
      with each other, i.e., the required and prohibited properties
      cannot be satisfied by the same protocol. For example, if an
      application prohibits "Reliable Data Transfer" but then requires
      "Configure Reliability per Message", this mismatch should result
      in an error.

   It is important to fail as early as possible in such cases in order
   to avoid allocating resources, e.g., to endpoint resolution, only to
   find out later that there is no protocol that satisfies the
   requirements.

3.2.  Role of system policy

   The properties specified during pre-establishment have a close
   connection to system policy. The implementation is responsible for
   combining and reconciling several different sources of preferences
   when establishing Connections. These include, but are not limited
   to:

   1.  Application preferences, i.e., preferences specified during
       pre-establishment such as Local Endpoint, Remote Endpoint, Path
       Selection Properties, and Protocol Selection Properties.

   2.  Dynamic system policy, i.e., policy compiled from internally and
       externally acquired information about available network
       interfaces, supported transport protocols, and current/previous
       Connections. Examples of ways to externally retrieve
       policy-support information are through OS-specific statistics/
       measurement tools and tools that reside on middleboxes and
       routers.

   3.  Default implementation policy, i.e., policy predefined by the OS
       or application.

   In general, any protocol or path used for a connection must conform
   to all three sources of constraints. A violation of any one of them
   should cause a protocol or path to be considered ineligible for use.
   As an example of application preferences leading to constraints, an
   application may prohibit the use of metered network interfaces for a
   given Connection to avoid user cost. Similarly, the system policy at
   a given time may prohibit the use of such a metered network
   interface from the application's process. Lastly, the implementation
   itself may default to disallowing certain network interfaces unless
   explicitly requested by the application and allowed by the system.

   It is expected that the database of system policies and the method of
   looking up these policies will vary across platforms. An
   implementation should attempt to look up the relevant policies for
   the system in a dynamic way to make sure it is reflecting an accurate
   version of the system policy, since the system's policy regarding the
   application's traffic may change over time due to user or
   administrative changes.

4.  Implementing Connection Establishment

   The process of establishing a network connection begins when an
   application expresses intent to communicate with a remote endpoint by
   calling Initiate. (At this point, any constraints or requirements
   the application may have on the connection are available from
   pre-establishment.) The process can be considered complete once
   there is at least one Protocol Stack that has completed any required
   setup to the point that it can transmit and receive the
   application's data.

   Connection establishment is divided into two top-level steps:
   Candidate Gathering, to identify the paths, protocols, and endpoints
   to use, and Candidate Racing, in which the necessary protocol
   handshakes are conducted in order to select which set to use.

   The simplest example of this process might involve identifying the
   single IP address to which the implementation wishes to connect,
   using the system's current default interface or path, and starting a
   TCP handshake to establish a stream to the specified IP address.
   However, each step may also vary depending on the requirements of the
   connection: if the endpoint is defined as a hostname and port, then
   there may be multiple resolved addresses that are available; there
   may also be multiple interfaces or paths available, other than the
   default system interface; and some protocols may not need any
   transport handshake to be considered "established" (such as UDP),
   while other connections may utilize layered protocol handshakes, such
   as TLS over TCP.

   Whenever an implementation has multiple options for connection
   establishment, it can view the set of all individual connection
   establishment options as a single, aggregate connection
   establishment. The aggregate set conceptually includes every valid
   combination of endpoints, paths, and protocols. As an example,
   consider an implementation that initiates a TCP connection to a
   hostname + port endpoint, and has two valid interfaces available
   (Wi-Fi and LTE). The hostname resolves to a single IPv4 address on
   the Wi-Fi network, and resolves to the same IPv4 address on the LTE
   network, as well as a single IPv6 address. The aggregate set of
   connection establishment options can be viewed as follows:

   Aggregate [Endpoint: www.example.com:80] [Interface: Any] [Protocol: TCP]
   |-> [Endpoint: 192.0.2.1:80]       [Interface: Wi-Fi] [Protocol: TCP]
   |-> [Endpoint: 192.0.2.1:80]       [Interface: LTE]   [Protocol: TCP]
   |-> [Endpoint: 2001:DB8::1.80]     [Interface: LTE]   [Protocol: TCP]

   Any one of these sub-entries on the aggregate connection attempt
   would satisfy the original application intent. The concern of this
   section is the algorithm defining which of these options to try,
   when, and in what order.

4.1.  Candidate Gathering

   The step of gathering candidates involves identifying which paths,
   protocols, and endpoints may be used for a given Connection. This
   list is determined by the requirements, prohibitions, and preferences
   of the application as specified in the Path Selection Properties and
   Protocol Selection Properties.

4.1.1.  Structuring Options as a Tree

   When an implementation responsible for connection establishment needs
   to consider multiple options, it should logically structure these
   options as a hierarchical tree. Each leaf node of the tree
   represents a single, coherent connection attempt, with an Endpoint, a
   Path, and a set of protocols that can directly negotiate and send
   data on the network. Each node in the tree that is not a leaf
   represents a connection attempt that is either underspecified, or
   else includes multiple distinct options. For example, when
   connecting on an IP network, a connection attempt to a hostname and
   port is underspecified, because the connection attempt requires a
   resolved IP address as its remote endpoint. In this case, the node
   represented by the connection attempt to the hostname is a parent
   node, with child nodes for each IP address. Similarly, an
   implementation that is allowed to connect using multiple interfaces
   will have a parent node of the tree for the decision between the
   paths, with a branch for each interface.

   The example aggregate connection attempt above can be drawn as a tree
   by grouping the addresses resolved on the same interface into
   branches:

                                 ||
                     +==========================+
                     |  www.example.com:80/Any  |
                     +==========================+
                       //                    \\
+==========================+       +==========================+
| www.example.com:80/Wi-Fi |       |  www.example.com:80/LTE  |
+==========================+       +==========================+
             ||                      //                    \\
+====================+  +====================+  +======================+
| 192.0.2.1:80/Wi-Fi |  |  192.0.2.1:80/LTE  |  |  2001:DB8::1.80/LTE  |
+====================+  +====================+  +======================+

   The rest of this section will use a notation scheme to represent this
   tree. The parent (or trunk) node of the tree will be represented by
   a single integer, such as "1". Each child of that node will have an
   integer that identifies it, from 1 to the number of children. That
   child node will be uniquely identified by concatenating its integer
   to its parent's identifier with a dot in between, such as "1.1" and
   "1.2". Each node will be summarized by a tuple of three elements:
   Endpoint, Path, and Protocol. The above example can now be written
   more succinctly as:

   1 [www.example.com:80, Any, TCP]
     1.1 [www.example.com:80, Wi-Fi, TCP]
       1.1.1 [192.0.2.1:80, Wi-Fi, TCP]
     1.2 [www.example.com:80, LTE, TCP]
       1.2.1 [192.0.2.1:80, LTE, TCP]
       1.2.2 [2001:DB8::1.80, LTE, TCP]

   When an implementation views this aggregate set of connection
   attempts as a single connection establishment, it will use only one
   of the leaf nodes to transfer data. Thus, when a single leaf node
   becomes ready to use, then the entire connection attempt is ready to
   use by the application. Another way to represent this is that every
   leaf node updates the state of its parent node when it becomes ready,
   until the trunk node of the tree is ready, which then notifies the
   application that the connection as a whole is ready to use.

   A connection establishment tree may be degenerate, and only have a
   single leaf node, such as a connection attempt to an IP address over
   a single interface with a single protocol.

   1 [192.0.2.1:80, Wi-Fi, TCP]

   A parent node may also have only one child (or leaf) node, such as
   when a hostname resolves to only a single IP address.

   1 [www.example.com:80, Wi-Fi, TCP]
     1.1 [192.0.2.1:80, Wi-Fi, TCP]

4.1.2.  Branch Types

   There are three types of branching from a parent node into one or
   more child nodes. Any parent node of the tree must use only one type
   of branching.

4.1.2.1.  Derived Endpoints

   If a connection originally targets a single endpoint, there may be
   multiple endpoints of different types that can be derived from the
   original. The connection library should order the derived endpoints
   according to application preference, system policy and expected
   performance.

   DNS hostname-to-address resolution is the most common method of
   endpoint derivation. When trying to connect to a hostname endpoint
   on a traditional IP network, the implementation should send DNS
   queries for both A (IPv4) and AAAA (IPv6) records if both are
   supported on the local link. The algorithm for ordering and racing
   these addresses should follow the recommendations in Happy Eyeballs
   [RFC8305].
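
   As a rough illustration of the ordering step, the sketch below
   interleaves resolved addresses by address family, starting with the
   preferred family. It is a simplification under stated assumptions:
   the function name and its parameters are hypothetical, the input is
   assumed to be already sorted by preference within each family, and
   the full address sorting and timing rules of [RFC8305] are not
   reproduced.

```python
import ipaddress
from itertools import zip_longest
from typing import List


def interleave_addresses(addresses: List[str],
                         prefer_ipv6: bool = True,
                         first_family_count: int = 1) -> List[str]:
    """Order addresses so that the preferred family goes first and the
    two families then alternate (hypothetical helper)."""
    parsed = [ipaddress.ip_address(a) for a in addresses]
    v6 = [str(a) for a in parsed if a.version == 6]
    v4 = [str(a) for a in parsed if a.version == 4]
    preferred, other = (v6, v4) if prefer_ipv6 else (v4, v6)

    # Start with the first address(es) of the preferred family.
    ordered = preferred[:first_family_count]
    remaining = preferred[first_family_count:]

    # Then alternate: one from the other family, one from the preferred.
    for alt, pref in zip_longest(other, remaining):
        if alt is not None:
            ordered.append(alt)
        if pref is not None:
            ordered.append(pref)
    return ordered
```

   Each address in the returned list would become one child branch of
   the hostname node, numbered in list order.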

   1 [www.example.com:80, Wi-Fi, TCP]
     1.1 [2001:DB8::1.80, Wi-Fi, TCP]
     1.2 [192.0.2.1:80, Wi-Fi, TCP]
     1.3 [2001:DB8::2.80, Wi-Fi, TCP]
     1.4 [2001:DB8::3.80, Wi-Fi, TCP]

   DNS-Based Service Discovery can also provide an endpoint derivation
   step. When trying to connect to a named service, the client may
   discover one or more hostname and port pairs on the local network
   using multicast DNS. These hostnames should each be treated as a
   branch which can be attempted independently from other hostnames.
   Each of these hostnames may also resolve to one or more addresses,
   thus creating multiple layers of branching.

   1 [term-printer._ipp._tcp.meeting.ietf.org, Wi-Fi, TCP]
     1.1 [term-printer.meeting.ietf.org:631, Wi-Fi, TCP]
       1.1.1 [31.133.160.18.631, Wi-Fi, TCP]

4.1.2.2.  Alternate Paths

   If a client has multiple network interfaces available to it, such as
   a mobile client with both Wi-Fi and Cellular connectivity, it can
   attempt a connection over either interface. This represents a branch
   point in the connection establishment. As with derived endpoints,
   the interfaces should be ranked based on preference, system policy,
   and performance. Attempts should be started on one interface, and
   then on other interfaces successively after delays based on expected
   round-trip-time or other available metrics.

   1 [192.0.2.1:80, Any, TCP]
     1.1 [192.0.2.1:80, Wi-Fi, TCP]
     1.2 [192.0.2.1:80, LTE, TCP]

   This same approach applies to any situation in which the client is
   aware of multiple links or views of the network. Multiple Paths,
   each with a coherent set of addresses, routes, DNS server, and more,
   may share a single interface. A path may also represent a virtual
   interface service such as a Virtual Private Network (VPN).

   The list of available paths should be constrained by any requirements
   or prohibitions the application sets, as well as system policy.

4.1.2.3.  Protocol Options

   Differences in possible protocol compositions and options can also
   provide a branching point in connection establishment. This allows
   clients to be resilient to situations in which a certain protocol is
   not functioning on a server or network.

   This approach is commonly used for connections with optional proxy
   server configurations. A single connection may be allowed to use an
   HTTP-based proxy, a SOCKS-based proxy, or connect directly. These
   options should be ranked and attempted in succession.

   1 [www.example.com:80, Any, HTTP/TCP]
     1.1 [192.0.2.8:80, Any, HTTP/HTTP Proxy/TCP]
     1.2 [192.0.2.7:10234, Any, HTTP/SOCKS/TCP]
     1.3 [www.example.com:80, Any, HTTP/TCP]
       1.3.1 [192.0.2.1:80, Any, HTTP/TCP]

   This approach also allows a client to attempt different sets of
   application and transport protocols that may provide preferable
   characteristics when available. For example, the protocol options
   could involve QUIC [I-D.ietf-quic-transport] over UDP on one branch,
   and HTTP/2 [RFC7540] over TLS over TCP on the other:

   1 [www.example.com:443, Any, Any HTTP]
     1.1 [www.example.com:443, Any, QUIC/UDP]
       1.1.1 [192.0.2.1:443, Any, QUIC/UDP]
     1.2 [www.example.com:443, Any, HTTP2/TLS/TCP]
       1.2.1 [192.0.2.1:443, Any, HTTP2/TLS/TCP]

   Another example is racing SCTP with TCP:

   1 [www.example.com:80, Any, Any Stream]
     1.1 [www.example.com:80, Any, SCTP]
       1.1.1 [192.0.2.1:80, Any, SCTP]
     1.2 [www.example.com:80, Any, TCP]
       1.2.1 [192.0.2.1:80, Any, TCP]

   Implementations that support racing protocols and protocol options
   should maintain a history of which protocols and protocol options
   successfully established, on a per-network basis (see Section 8.2).
   This information can influence future racing decisions to prioritize
   or prune branches.

4.2.  Branching Order-of-Operations

   Branch types must occur in a specific order relative to one another
   to avoid creating leaf nodes with invalid or incompatible settings.
   In the example above, it would be invalid to branch for derived
   endpoints (the DNS results for www.example.com) before branching
   between interface paths, since usable DNS results on one network may
   not necessarily be the same as DNS results on another network due to
   local network entities, supported address families, or enterprise
   network configurations. Implementations must be careful to branch in
   an order that results in usable leaf nodes whenever there are
   multiple branch types that could be used from a single node.

   The order of operations for branching, where lower numbers are acted
   upon first, should be:

   1.  Alternate Paths

   2.  Protocol Options

   3.  Derived Endpoints

   Branching between paths is the first in the list because results
   across multiple interfaces are likely not related to one another:
   endpoint resolution may return different results, especially when
   using locally resolved host and service names, and which protocols
   are supported and preferred may differ across interfaces. Thus, if
   multiple paths are attempted, the overall connection can be seen as a
   race between the available paths or interfaces.

   Protocol options are checked next in order. Whether or not a set of
   protocols, or protocol-specific options, can successfully connect is
   generally not dependent on which specific IP address is used.
   Furthermore, the protocol stacks being attempted may influence or
   altogether change the endpoints being used. Adding a proxy to a
   connection's branch will change the endpoint to the proxy's IP
   address or hostname. Choosing an alternate protocol may also modify
   the ports that should be selected.
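
   The ordering listed above can be sketched as a small tree builder
   that branches over paths first, then protocol stacks, and resolves
   endpoints last, per path. The Node class, build_tree, and the
   resolve callback are hypothetical names introduced for this
   illustration. Ports, proxies that replace the endpoint, and
   multi-layer derivation are left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One node of the establishment tree: [Endpoint, Path, Protocol]."""
    ident: str
    endpoint: str
    path: str
    protocol: str
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List["Node"]:
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def build_tree(host: str,
               paths: List[str],
               protocols: List[str],
               resolve: Callable[[str, str], List[str]]) -> Node:
    """Branch in the order: paths, protocol options, derived endpoints.

    resolve(host, path) is an assumed callback returning the addresses
    that the host resolves to on the given path.
    """
    root = Node("1", host, "Any", "Any")
    for i, path in enumerate(paths, start=1):
        path_node = Node(f"1.{i}", host, path, "Any")
        root.children.append(path_node)
        for j, proto in enumerate(protocols, start=1):
            proto_node = Node(f"{path_node.ident}.{j}", host, path, proto)
            path_node.children.append(proto_node)
            # Resolution happens last and per path, because results on
            # one network may differ from results on another. A fuller
            # version would mark a branch with no addresses as failed.
            for k, addr in enumerate(resolve(host, path), start=1):
                proto_node.children.append(
                    Node(f"{proto_node.ident}.{k}", addr, path, proto))
    return root
```

   Because resolution is invoked inside the path loop, an address that
   is only valid on one interface never ends up in a leaf for another.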

   Branching for derived endpoints is the final step, and may have
   multiple layers of derivation or resolution, such as DNS service
   resolution and DNS hostname resolution.

4.3.  Sorting Branches

   Implementations should sort the branches of the tree of connection
   options in order of their preference rank. Leaf nodes on branches
   with higher rankings represent connection attempts that will be raced
   first. Implementations should order the branches to reflect the
   preferences expressed by the application for its new connection,
   including Protocol and Path Selection Properties, which are specified
   in [I-D.trammell-taps-interface]. In addition to the properties
   provided by the application, an implementation may include additional
   criteria in the ranking, such as cached performance estimates (see
   Section 8.2) or system policy (see Section 3.2). Two examples of how
   the Protocol and Path Selection Properties may be used to sort
   branches are provided below:

   o  Interface Type: If the application specifies an interface type to
      be preferred or avoided, implementations should rank paths
      accordingly. If the application specifies an interface type to be
      required or prohibited, implementations are expected not to
      include the non-conforming paths in the tree.

   o  Capacity Profile: An implementation may use the Capacity Profile
      to prefer paths optimized for the application's expected traffic
      pattern according to cached performance estimates, see
      Section 8.2:

      *  Interactive/Low Latency: Prefer paths with the lowest expected
         Round Trip Time

      *  Constant Rate: Prefer paths that can satisfy the requested
         Stream Send or Stream Receive Bitrate, based on observed
         maximum throughput

      *  Scavenger/Bulk: Prefer paths with the highest expected
         available bandwidth, based on observed maximum throughput

   [Note: See Appendix A.1 for additional examples related to Properties
   under discussion.]

4.4.  Candidate Racing

   The primary goal of the Candidate Racing process is to successfully
   negotiate a protocol stack to an endpoint over an interface--to
   connect a single leaf node of the tree--with as little delay and as
   few unnecessary connection attempts as possible. Optimizing these
   two factors improves the user experience, while minimizing network
   load.

   This section covers the dynamic aspect of connection establishment.
   While the tree described above is a useful conceptual and
   architectural model, an implementation does not know what the full
   tree may become up front, nor will many of the possible branches be
   used in the common case.

   There are three different approaches to racing the attempts for
   different nodes of the connection establishment tree:

   1.  Immediate

   2.  Delayed

   3.  Failover

   Each approach is appropriate in different use-cases and branch types.
   However, to avoid consuming unnecessary network resources,
   implementations should not use immediate racing as a default
   approach.

   The timing algorithms for racing should remain independent across
   branches of the tree. Any timers or racing logic are isolated to a
   given parent node, and are not ordered precisely with regard to the
   children of other nodes.

4.4.1.  Delayed Racing

   Delayed racing can be used whenever a single node of the tree has
   multiple child nodes. Based on the order determined when building
   the tree, the first child node will be initiated immediately,
   followed by the next child node after some delay. Once that second
   child node is initiated, the third child node (if present) will begin
   after another delay, and so on until all child nodes have been
   initiated, or one of the child nodes successfully completes its
   negotiation.

   Delayed racing attempts occur in parallel. Implementations should
   not terminate an earlier child connection attempt upon starting a
   secondary child.

   The delay between starting child nodes should be based on the
   properties of the previously started child node. For example, if the
   first child represents an IP address with a known route, and the
   second child represents another IP address, the delay between
   starting the first and second IP addresses can be based on the
   expected retransmission cadence for the first child's connection
   (derived from historical round-trip-time). Alternatively, if the
   first child represents a branch on a Wi-Fi interface, and the second
   child represents a branch on an LTE interface, the delay should be
   based on the expected time in which the branch for the first
   interface would be able to establish a connection, based on link
   quality and historical round-trip-time.

   Any delay should have a defined minimum and maximum value based on
   the branch type. Generally, branches between paths and protocols
   should have longer delays than branches between derived endpoints.
   The maximum delay should be considered with regard to how long a
   user is expected to wait for the connection to complete.
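
   The delay computation described above can be sketched as a clamped
   estimate. Everything numeric here is an assumption chosen for
   illustration: the bounds in DELAY_BOUNDS, the multiplier applied to
   the historical round-trip-time, and the function names are not
   taken from this document or from any specification.

```python
from typing import Dict, List, Tuple

# Assumed (minimum, maximum) delays in seconds per branch type. Paths
# and protocols get longer delays than derived endpoints.
DELAY_BOUNDS: Dict[str, Tuple[float, float]] = {
    "derived_endpoint": (0.100, 2.0),
    "protocol": (0.250, 5.0),
    "path": (0.250, 5.0),
}


def next_child_delay(branch_type: str,
                     expected_rtt: float,
                     multiplier: float = 2.0) -> float:
    """Delay before starting the next child, based on the previously
    started child's expected round-trip-time and clamped to the bounds
    for the branch type."""
    minimum, maximum = DELAY_BOUNDS[branch_type]
    return min(max(expected_rtt * multiplier, minimum), maximum)


def start_offsets(branch_type: str, child_rtts: List[float]) -> List[float]:
    """Start time of each child relative to the first child, assuming
    no child completes or fails before its successor is started."""
    offsets = [0.0] if child_rtts else []
    for rtt in child_rtts[:-1]:
        offsets.append(offsets[-1] + next_child_delay(branch_type, rtt))
    return offsets
```

   For instance, with the assumed bounds, three path branches whose
   first two expected round-trip-times are 200 ms and 50 ms would start
   at roughly 0 s, 0.4 s, and 0.65 s, since the second delay is raised
   to the 250 ms minimum.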

   If a child node fails to connect before the delay timer has fired for
   the next child, the next child should be started immediately.

4.4.2.  Failover

   If an implementation or application has a strong preference for one
   branch over another, the branching node may choose to wait until one
   child has failed before starting the next.  Failure of a leaf node is
   determined by its protocol negotiation failing or timing out; failure
   of a parent branching node is determined by all of its children
   failing.

   An example in which failover is recommended is a race between a
   protocol stack that uses a proxy and a protocol stack that bypasses
   the proxy.  Failover is useful in case the proxy is down or
   misconfigured, but any more aggressive type of racing may end up
   unnecessarily avoiding a proxy that was preferred by policy.

4.5.  Completing Establishment

   The process of connection establishment completes when one leaf node
   of the tree has completed negotiation with the remote endpoint
   successfully, or else all nodes of the tree have failed to connect.
   The first leaf node to complete its connection is then used by the
   application to send and receive data.

   It is useful to process success and failure throughout the tree by
   child nodes reporting to their parent nodes (towards the trunk of the
   tree).  For example, in the following case, if 1.1.1 fails to
   connect, it reports the failure to 1.1.  Since 1.1 has no other child
   nodes, it also has failed and reports that failure to 1.  Because 1.2
   has not yet failed, 1 is not considered to have failed.  Since 1.2
   has not yet started, it is started and the process continues.
   Similarly, if 1.1.1 successfully connects, then it marks 1.1 as
   connected, which propagates to the trunk node 1.
   At this point, the connection as a whole is considered to be
   successfully connected and ready to process application data.

   1 [www.example.com:80, Any, TCP]
     1.1 [www.example.com:80, Wi-Fi, TCP]
       1.1.1 [192.0.2.1:80, Wi-Fi, TCP]
     1.2 [www.example.com:80, LTE, TCP]
       ...

   If a leaf node has successfully completed its connection, all other
   attempts should be made ineligible for use by the application for the
   original request.  New connection attempts that involve transmitting
   data on the network should not be started after another leaf node has
   completed successfully, as the connection as a whole has been
   established.  An implementation may choose to let certain handshakes
   and negotiations complete in order to gather metrics to influence
   future connections.  Similarly, an implementation may choose to hold
   onto fully established leaf nodes that were not the first to
   establish for use in future connections, but this approach is not
   recommended since those attempts were slower to connect and may
   exhibit less desirable properties.

4.5.1.  Determining Successful Establishment

   Implementations may select the criteria by which a leaf node is
   considered to be successfully connected differently on a per-protocol
   basis.  If the only protocol being used is a transport protocol with
   a clear handshake, like TCP, then the obvious choice is to declare
   that node "connected" when the last packet of the three-way handshake
   has been received.  If the only protocol being used is an
   "unconnected" protocol, like UDP, the implementation may consider the
   node fully "connected" the moment it determines a route is present,
   before sending any packets on the network, see further Section 4.7.

   For protocol stacks with multiple handshakes, the decision becomes
   more nuanced.
   If the protocol stack involves both TLS and TCP, an
   implementation could determine that a leaf node is connected after
   the TCP handshake is complete, or it can wait for the TLS handshake
   to complete as well.  The benefit of declaring completion when the
   TCP handshake finishes, and thus stopping the race for other branches
   of the tree, is that there will be less burden on the network from
   other connection attempts.  On the other hand, by waiting until the
   TLS handshake is complete, an implementation avoids the scenario in
   which a TCP handshake completes quickly, but TLS negotiation is
   either very slow or fails altogether in particular network conditions
   or to a particular endpoint.  To avoid the issue of TLS possibly
   failing, the implementation should not generate a Ready event for the
   Connection until TLS is established.

   If all of the leaf nodes fail to connect during racing, i.e., none of
   the configurations that satisfy all requirements given in the
   Transport Parameters actually work over the available paths, then the
   transport system should notify the application with an InitiateError
   event.  An InitiateError event should also be generated in case the
   transport system finds no usable candidates to race.

4.6.  Establishing multiplexed connections

   Multiplexing several Connections over a single underlying transport
   connection requires that the Connections to be multiplexed belong to
   the same Connection Group (as is indicated by the application using
   the Clone call).  When the underlying transport connection supports
   multi-streaming, the Transport System can map each Connection in the
   Connection Group to a different stream.
   Thus, when the Connections
   that are offered to an application by the Transport System are
   multiplexed, the Transport System may implement the establishment of
   a new Connection by simply beginning to use a new stream of an
   already established transport connection and there is no need for a
   connection establishment procedure.  This, then, also means that
   there may not be any "establishment" message (like a TCP SYN), but
   the application can simply start sending or receiving.  Therefore,
   when the Initiate action of a Transport System is called without
   Messages being handed over, it cannot be guaranteed that the other
   endpoint will have any way to know about this, and hence a passive
   endpoint's ConnectionReceived event may not be called upon an active
   endpoint's Initiate.  Instead, calling the ConnectionReceived event
   may be delayed until the first Message arrives.

4.7.  Handling racing with "unconnected" protocols

   While protocols that use an explicit handshake to validate a
   Connection to a peer can be used for racing multiple establishment
   attempts in parallel, "unconnected" protocols such as raw UDP do not
   offer a way to validate the presence of a peer or the usability of a
   Connection without application feedback.  An implementation should
   consider such a protocol stack to be established as soon as a local
   route to the peer endpoint is confirmed.

   However, if a peer is not reachable over the network using the
   unconnected protocol, or data cannot be exchanged for any other
   reason, the application may want to attempt using another candidate
   Protocol Stack.  The implementation should maintain the list of other
   candidate Protocol Stacks that were eligible to use.
   In the case
   that the application signals that the initial Protocol Stack is
   failing for some reason and that another option should be attempted,
   the Connection can be updated to point to the next candidate Protocol
   Stack.  This can be viewed as an application-driven form of Protocol
   Stack racing.

4.8.  Implementing listeners

   When an implementation is asked to Listen, it registers with the
   system to wait for incoming traffic to the Local Endpoint.  If no
   Local Endpoint is specified, the implementation should either use an
   ephemeral port or generate an error.

   If the Path Selection Properties do not require a single network
   interface or path, but allow the use of multiple paths, the Listener
   object should register for incoming traffic on all of the network
   interfaces or paths that conform to the Path Selection Properties.
   The set of available paths can change over time, so the
   implementation should monitor network path changes and register and
   de-register the Listener across all usable paths.  When using
   multiple paths, the Listener is generally expected to use the same
   port for listening on each.

   If the Protocol Selection Properties allow multiple protocols to be
   used for listening, and the implementation supports it, the Listener
   object should register across the eligible protocols for each path.
   This means that inbound Connections delivered by the implementation
   may have heterogeneous protocol stacks.

4.8.1.  Implementing listeners for Connected Protocols

   Connected protocols such as TCP and TLS-over-TCP have a strong
   mapping between the Local and Remote Endpoints (five-tuple) and their
   protocol connection state.  These map well into Connection objects.
   Whenever a new inbound handshake is being started, the Listener
   should generate a new Connection object and pass it to the
   application.

4.8.2.  Implementing listeners for Unconnected Protocols

   Unconnected protocols such as UDP and UDP-lite generally do not
   provide the same mechanisms that connected protocols do to offer
   Connection objects.  Implementations should wait for incoming packets
   for unconnected protocols on a listening port and should perform
   five-tuple matching of packets to either existing Connection objects
   or the creation of new Connection objects.  On platforms with
   facilities to create a "virtual connection" for unconnected protocols
   implementations should use these mechanisms to minimise the handling
   of datagrams intended for already created Connection objects.

4.8.3.  Implementing listeners for Multiplexed Protocols

   Protocols that provide multiplexing of streams into a single
   five-tuple can listen both for entirely new connections (a new HTTP/2
   stream on a new TCP connection, for example) and for new
   sub-connections (a new HTTP/2 stream on an existing connection).  If
   the abstraction of Connection presented to the application is mapped
   to the multiplexed stream, then the Listener should deliver new
   Connection objects in the same way for either case.  The
   implementation should allow the application to introspect the
   Connection Group marked on the Connections to determine the grouping
   of the multiplexing.

5.  Implementing Data Transfer

5.1.  Data transfer for streams, datagrams, and frames

   The most basic mapping for sending a Message is an abstraction of
   datagrams, in which the transport protocol naturally deals in
   discrete packets.  Each Message here corresponds to a single
   datagram.  Generally, these will be short enough that sending and
   receiving will always use a complete Message.

   For protocols that expose byte-streams, the only delineation provided
   by the protocol is the end of the stream in a given direction.
   Each
   Message in this case corresponds to the entire stream of bytes in a
   direction.  These Messages may be quite long, in which case they can
   be sent in multiple parts.

   Protocols that provide framing (such as length-value protocols, or
   protocols that use delimiters) provide data boundaries that may be
   longer than a traditional packet datagram.  Each Message for framing
   protocols corresponds to a single frame, which may be sent either as
   a complete Message, or in multiple parts.

5.1.1.  Sending Messages

   The effect of the application sending a Message is determined by the
   top-level protocol in the established Protocol Stack.  That is, if
   the top-level protocol provides an abstraction of framed messages
   over a connection, the receiving application will be able to obtain
   multiple Messages on that connection, even if the framing protocol is
   built on a byte-stream protocol like TCP.

5.1.1.1.  Send Parameters

   o  Lifetime: this should be implemented by removing the Message from
      its queue of pending Messages after the Lifetime has expired.  A
      queue of pending Messages within the transport system
      implementation that have yet to be handed to the Protocol Stack
      can always support this property, but once a Message has been sent
      into the send buffer of a protocol, only certain protocols may
      support de-queueing a message.  For example, TCP cannot remove
      bytes from its send buffer, while in case of SCTP, such control
      over the SCTP send buffer can be exercised using the partial
      reliability extension [RFC8303].  When there is no standing queue
      of Messages within the system, and the Protocol Stack does not
      support removing a Message from its buffer, this property may be
      ignored.

   o  Niceness: this represents the ability to de-prioritize a Message
      in favor of other Messages.
      This can be implemented by the system
      re-ordering Messages that have yet to be handed to the Protocol
      Stack, or by giving relative priority hints to protocols that
      support priorities per Message.  For example, an implementation of
      HTTP/2 could choose to send Messages of different niceness on
      streams of different priority.

   o  Ordered: when this is false, it disables the requirement of
      in-order-delivery for protocols that support configurable
      ordering.

   o  Idempotent: when this is true, it means that the Message can be
      used by mechanisms that might transfer it multiple times - e.g.,
      as a result of racing multiple transports or as part of TCP Fast
      Open.

   o  Corruption Protection Length: when this is set to any value other
      than -1, it limits the required checksum in protocols that allow
      limiting the checksum length (e.g. UDP-Lite).

   o  Immediate Acknowledgement: this informs the implementation that
      the sender intends to execute tight control over the send buffer,
      and therefore wants to avoid delayed acknowledgements.  In case of
      SCTP, a request to immediately send acknowledgements can be
      implemented using the "sack-immediately flag" described in
      Section 4.2 of [RFC8303] for the SEND.SCTP primitive.

   o  Instantaneous Capacity Profile: when this is set to
      "Interactive/Low Latency", the Message should be sent immediately,
      even when this comes at the cost of using the network capacity
      less efficiently.  For example, small messages can sometimes be
      bundled to fit into a single data packet for the sake of reducing
      header overhead; such bundling should not be used.  For example,
      in case of TCP, the Nagle algorithm should be disabled when
      Interactive/Low Latency is selected as the capacity profile.
      Scavenger/Bulk
      can translate into usage of a congestion control mechanism such as
      LEDBAT, and/or the capacity profile can lead to a choice of a DSCP
      value as described in [I-D.ietf-taps-minset].

   [Note: See also Appendix A.2 for additional Send Parameters under
   discussion.]

5.1.1.2.  Send Completion

   The application should be notified whenever a Message or partial
   Message has been consumed by the Protocol Stack, or has failed to
   send.  The meaning of the Message being consumed by the stack may
   vary depending on the protocol.  For a basic datagram protocol like
   UDP, this may correspond to the time when the packet is sent into the
   interface driver.  For a protocol that buffers data in queues, like
   TCP, this may correspond to when the data has entered the send
   buffer.

5.1.2.  Receiving Messages

   Similar to sending, Receiving a Message is determined by the
   top-level protocol in the established Protocol Stack.  The main
   difference with Receiving is that the size and boundaries of the
   Message are not known beforehand.  The application can communicate in
   its Receive action the parameters for the Message, which can help the
   implementation know how much data to deliver and when.  For example,
   if the application only wants to receive a complete Message, the
   implementation should wait until an entire Message (datagram, stream,
   or frame) is read before delivering any Message content to the
   application.
   This requires the implementation to understand where
   messages end, either via a supplied deframer or because the top-level
   protocol in the established Protocol Stack preserves message
   boundaries; if, on the other hand, the top-level protocol only
   supports a byte-stream and no deframer was supplied, the
   application must specify the minimum number of bytes of Message
   content it wants to receive (which may be just a single byte) to
   control the flow of received data.

   If a Connection becomes finished before a requested Receive action
   can be satisfied, the implementation should deliver any partial
   Message content outstanding, or if none is available, an indication
   that there will be no more received Messages.

5.2.  Handling of data for fast-open protocols

   Several protocols allow sending higher-level protocol or application
   data within the first packet of their protocol establishment, such as
   TCP Fast Open [RFC7413] and TLS 1.3 [I-D.ietf-tls-tls13].  This
   approach is referred to as sending Zero-RTT (0-RTT) data.  This is a
   desirable property, but poses challenges to an implementation that
   uses racing during connection establishment.

   If the application has 0-RTT data to send in any protocol handshakes,
   it needs to provide this data before the handshakes have begun.  When
   racing, this means that the data should be provided before the
   process of connection establishment has begun.  If the application
   wants to send 0-RTT data, it must indicate this to the implementation
   by setting the Idempotent send parameter to true when sending the
   data.  In general, 0-RTT data may be replayed (for example, if a TCP
   SYN contains data, and the SYN is retransmitted, the data will be
   retransmitted as well), but racing means that different leaf nodes
   have the opportunity to send the same data independently.  If data is
   truly idempotent, this should be permissible.
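
   One way to picture the rule above: early data is attached to a leaf
   node's handshake only when the application marked the Message as
   Idempotent and the leaf's Protocol Stack offers a 0-RTT mechanism.
   The Python sketch below is a minimal illustration under those
   assumptions; the Message class and the early_data_for_leaf function
   are hypothetical names, not part of any defined interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for a Message handed over before Initiate."""
    data: bytes
    idempotent: bool = False  # mirrors the Idempotent send parameter


def early_data_for_leaf(message: Optional[Message],
                        leaf_supports_0rtt: bool) -> Optional[bytes]:
    """Return the bytes a leaf node may send as 0-RTT data, or None.

    Data is released only if the application marked it Idempotent,
    because racing may send the same bytes on several leaf nodes.
    """
    if message is None or not message.idempotent:
        return None
    if not leaf_supports_0rtt:
        return None
    # bytes objects are immutable, so every leaf can share this value.
    return message.data


if __name__ == "__main__":
    request = Message(b"GET / HTTP/1.1\r\n\r\n", idempotent=True)
    print(early_data_for_leaf(request, leaf_supports_0rtt=True))
```

   Data that is not marked Idempotent would instead be queued and sent
   on whichever leaf node wins the race, after its handshake completes.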

   Once the application has provided its 0-RTT data, an implementation
   should keep a copy of this data and provide it to each new leaf node
   that is started and for which a 0-RTT protocol is being used.

   It is also possible that protocol stacks within a particular leaf
   node use 0-RTT handshakes without any idempotent application data.
   For example, TCP Fast Open could use a Client Hello from TLS as its
   0-RTT data, shortening the cumulative handshake time.

   0-RTT handshakes often rely on previous state, such as TCP Fast Open
   cookies, previously established TLS tickets, or out-of-band
   distributed pre-shared keys (PSKs).  Implementations should be aware
   of security concerns around using these tokens across multiple
   addresses or paths when racing.  In the case of TLS, any given ticket
   or PSK should only be used on one leaf node.  If implementations have
   multiple tickets available from a previous connection, each leaf node
   attempt must use a different ticket.  In effect, each leaf node will
   send the same early application data, yet encoded (encrypted)
   differently on the wire.

6.  Implementing Maintenance

   Maintenance encompasses changes that the application can request to a
   Connection, or that a Connection can react to based on system and
   network changes.

6.1.  Changing Protocol Properties

   Appendix A.1 of [I-D.ietf-taps-minset] explains, using primitives
   that are described in [RFC8303] and [RFC8304], how to implement
   changing the following protocol properties of an established
   connection with TCP and UDP.  Below, we amend this description for
   other protocols (if applicable):

   o  Relative niceness: for SCTP, this can be done using the primitive
      CONFIGURE_STREAM_SCHEDULER.SCTP described in section 4 of
      [RFC8303].

   o  Timeout for aborting Connection: for SCTP, this can be done using
      the primitive CHANGE_TIMEOUT.SCTP described in section 4 of
      [RFC8303].

   o  Abort timeout to suggest to the Remote Endpoint: for TCP, this can
      be done using the primitive CHANGE_TIMEOUT.TCP described in
      section 4 of [RFC8303].

   o  Retransmission threshold before excessive retransmission
      notification: for TCP, this can be done using ERROR.TCP described
      in section 4 of [RFC8303].

   o  Required minimum coverage of the checksum for receiving: for
      UDP-Lite, this can be done using the primitive
      SET_MIN_CHECKSUM_COVERAGE.UDP-Lite described in section 4 of
      [RFC8303].

   o  Connection group transmission scheduler: for SCTP, this can be
      done using the primitive SET_STREAM_SCHEDULER.SCTP described in
      section 4 of [RFC8303].

   It may happen that the application attempts to set a Protocol
   Property which does not apply to the actually chosen protocol.  In
   this case, the implementation should fail gracefully, i.e., it may
   give a warning to the application, but it should not terminate the
   Connection.

6.2.  Handling Path Changes

   When a path change occurs, the Transport Services implementation is
   responsible for notifying Protocol Instances in the Protocol Stack.
   If the Protocol Stack includes a transport protocol that supports
   multipath connectivity, an update to the available paths should
   inform the Protocol Instance of the new set of paths that are
   permissible based on the Path Selection Properties passed by the
   application.  A multipath protocol can establish new subflows over
   new paths, and should tear down subflows over paths that are no
   longer available.  If the Protocol Stack includes a transport
   protocol that does not support multipath, but supports migrating
   between paths, the update to available paths can be used as the
   trigger for migrating the connection.
   For protocols that do not
   support multipath or migration, the Protocol Instances may be
   informed of the path change, but should not be forcibly disconnected
   if the previously used path becomes unavailable.  An exception to
   this case is if the System Policy changes to prohibit traffic from
   the Connection based on its properties, in which case the Protocol
   Stack should be disconnected.

7.  Implementing Termination

   With TCP, when an application closes a connection, this means that it
   has no more data to send (but expects all data that has been handed
   over to be reliably delivered).  However, with TCP only, "close" does
   not mean that the application will stop receiving data.  This is
   related to TCP's ability to support half-closed connections.

   SCTP is an example of a protocol that does not support such
   half-closed connections.  Hence, with SCTP, the meaning of "close" is
   stricter: an application has no more data to send (but expects all
   data that has been handed over to be reliably delivered), and will
   also not receive any more data.

   Implementing a protocol-independent transport system means that the
   exposed semantics must be the strictest subset of the semantics of
   all supported protocols.  Hence, as is common with all reliable
   transport protocols, after a Close action, the application can expect
   to have its reliability requirements honored regarding the data it
   has given to the Transport System, but it cannot expect to be able to
   read any more data after calling Close.

   Abort differs from Close only in that no guarantees are given
   regarding data that the application has handed over to the Transport
   System before calling Abort.

   As explained in Section 4.6, when a new stream is multiplexed
   on an already existing connection of a Transport Protocol Instance,
   there is no need for a connection establishment procedure.
   Because
   the Connections that are offered by the Transport System can be
   implemented as streams that are multiplexed on a transport protocol's
   connection, it can therefore not be guaranteed that one Endpoint's
   Initiate action provokes a ConnectionReceived event at its peer.

   For Close (provoking a Finished event) and Abort (provoking a
   ConnectionError event), the same logic applies: while it is desirable
   to be informed when a peer closes or aborts a Connection, whether
   this is possible depends on the underlying protocol, and no
   guarantees can be given.  With SCTP, the transport system can use the
   stream reset procedure to cause a Finished event upon a Close action
   from the peer [NEAT-flow-mapping].

8.  Cached State

   Beyond a single Connection's lifetime, it is useful for an
   implementation to keep state and history.  This cached state can help
   improve future Connection establishment due to re-using results and
   credentials, and favoring paths and protocols that performed well in
   the past.

   Cached state may be associated with different Endpoints for the same
   Connection, depending on the protocol generating the cached content.
   For example, session tickets for TLS are associated with specific
   endpoints, and thus should be cached based on a Connection's hostname
   Endpoint (if applicable).  On the other hand, performance
   characteristics of a path are more likely tied to the IP address and
   subnet being used.

8.1.  Protocol state caches

   Some protocols will have long-term state to be cached in association
   with Endpoints.  This state often has some time after which it is
   expired, so the implementation should allow each protocol to specify
   an expiration for cached content.
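
   A protocol state cache with per-entry expiration, as described above,
   could be sketched in Python as follows.  This is only an
   illustration: the ProtocolStateCache class, its method names, and the
   choice to key entries by a (protocol, endpoint) pair are assumptions
   of the sketch rather than anything defined by this document.

```python
import time


class ProtocolStateCache:
    """Cache keyed by (protocol, endpoint) with per-entry expiration."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so that expiry can be tested.
        self._clock = clock
        self._entries = {}  # (protocol, endpoint) -> (value, expires_at)

    def put(self, protocol, endpoint, value, lifetime_s):
        """Store a value; the protocol chooses its own lifetime."""
        expires_at = self._clock() + lifetime_s
        self._entries[(protocol, endpoint)] = (value, expires_at)

    def get(self, protocol, endpoint):
        """Return the cached value, or None if absent or expired."""
        key = (protocol, endpoint)
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop expired state lazily
            return None
        return value

    def flush(self):
        """Discard all cached protocol state."""
        self._entries.clear()


if __name__ == "__main__":
    cache = ProtocolStateCache()
    # Hypothetical entry: a TLS session ticket cached by hostname.
    cache.put("tls", "www.example.com", b"ticket-bytes", lifetime_s=3600)
    print(cache.get("tls", "www.example.com"))
```

   Keying by protocol as well as endpoint lets TLS state be stored
   under a hostname while other protocols use an IP address, matching
   the observation that different protocols associate their state with
   different Endpoints.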

   Examples of cached protocol state include:

   o  The DNS protocol can cache resolution answers (A and AAAA queries,
      for example), associated with a Time To Live (TTL) to be used for
      future hostname resolutions without asking the DNS resolver again.

   o  TLS caches session state and tickets based on a hostname, which
      can be used for resuming sessions with a server.

   o  TCP can cache cookies for use in TCP Fast Open.

   Cached protocol state is primarily used during Connection
   establishment for a single Protocol Stack, but may be used to
   influence an implementation's preference between several candidate
   Protocol Stacks.  For example, if two IP address Endpoints are
   otherwise equally preferred, an implementation may choose to attempt
   a connection to an address for which it has a TCP Fast Open cookie.

   Applications must have a way to flush protocol cache state if
   desired.  This may be necessary, for example, if application-layer
   identifiers rotate and clients wish to avoid linkability via
   trackable TLS tickets or TFO cookies.

8.2.  Performance caches

   In addition to protocol state, Protocol Instances should provide data
   into a performance-oriented cache to help guide future protocol and
   path selection.  Some performance information can be gathered
   generically across several protocols to allow predictive comparisons
   between protocols on given paths:

   o  Observed Round Trip Time

   o  Connection Establishment latency

   o  Connection Establishment success rate

   These items can be cached on a per-address and per-subnet
   granularity, and averaged between different values.  The information
   should be cached on a per-network basis, since it is expected that
   different network attachments will have different performance
   characteristics.
   Besides Protocol Instances, other system entities
   may also provide data into performance-oriented caches.  This could
   for instance be signal strength information reported by radio modems
   like Wi-Fi and mobile broadband or information about the
   battery-level of the device.  Furthermore, the system may cache the
   observed maximum throughput on a path as an estimate of the available
   bandwidth.

   An implementation should use this information, when possible, to
   determine preference between candidate paths, endpoints, and protocol
   options.  Eligible options that historically had significantly better
   performance than others should be selected first when gathering
   candidates (see Section 4.1) to ensure better performance for the
   application.

   The reasonable lifetime for cached performance values will vary
   depending on the nature of the value.  Certain information, like the
   connection establishment success rate to a Remote Endpoint using a
   given protocol stack, can be stored for a long period of time (hours
   or longer), since it is expected that the capabilities of the Remote
   Endpoint are not changing very quickly.  On the other hand, Round
   Trip Time observed by TCP over a particular network path may vary
   over a relatively short time interval.  For such values, the
   implementation should remove them from the cache more quickly, or
   treat older values with less confidence/weight.

9.  Specific Transport Protocol Considerations

9.1.  TCP

   Connection lifetime for TCP translates fairly simply into the
   abstraction presented to an application.  When the TCP three-way
   handshake is complete, its layer of the Protocol Stack can be
   considered Ready (established).
   This event will cause racing of
   Protocol Stack options to complete if TCP is the top-level protocol,
   at which point the application can be notified that the Connection is
   Ready to send and receive.

   If the application sends a Close, that can translate to a graceful
   termination of the TCP connection, which is performed by sending a
   FIN to the remote endpoint.  If the application sends an Abort, then
   the TCP state can be closed abruptly, leading to a RST being sent to
   the peer.

   Without a layer of framing (a top-level protocol in the established
   Protocol Stack that preserves message boundaries, or an
   application-supplied deframer) on top of TCP, the receiver side of
   the transport system implementation can only treat the incoming
   stream of bytes as a single Message, terminated by a FIN when the
   Remote Endpoint closes the Connection.

9.2.  UDP

   UDP as a direct transport does not provide any handshake or
   connectivity state, so the notion of the transport protocol becoming
   Ready or established is degenerate.  Once the system has validated
   that there is a route on which to send and receive UDP datagrams, the
   protocol is considered Ready.  Similarly, a Close or Abort has no
   meaning to the on-the-wire protocol, but simply leads to the local
   state being torn down.

   When sending and receiving messages over UDP, each Message should
   correspond to a single UDP datagram.  The Message can contain
   metadata about the packet, such as the ECN bits applied to the
   packet.

9.3.  SCTP

   To support sender-side stream schedulers (which are implemented on
   the sender side), a receiver-side Transport System should always
   support message interleaving [RFC8260].

   SCTP messages can be very large.
   To allow the reception of large
   messages in pieces, a "partial flag" can be used to inform a (native
   SCTP) receiving application that a message is incomplete.  After
   receiving the "partial flag", this application would know that the
   next receive calls will only deliver remaining parts of the same
   message (i.e., no messages or partial messages will arrive on other
   streams until the message is complete) (see Section 8.1.20 in
   [RFC6458]).  The "partial flag" can therefore facilitate the
   implementation of the receiver buffer in the receiving application,
   at the cost of limiting multiplexing and temporarily creating
   head-of-line blocking delay at the receiver.

   When a Transport System transfers a Message, it seems natural to map
   the Message object to SCTP messages in order to support properties
   such as "Ordered" or "Lifetime" (which maps onto partially reliable
   delivery with a SCTP_PR_SCTP_TTL policy [RFC6458]).  However, since
   multiplexing of Connections onto SCTP streams may happen, and would
   be hidden from the application, the Transport System requires a
   per-stream receiver buffer anyway, so this potential benefit is lost
   and the "partial flag" becomes unnecessary for the system.

   The problem of long messages either requiring large receiver-side
   buffers or getting in the way of multiplexing is addressed by message
   interleaving [RFC8260], which is yet another reason why a
   receiver-side transport system supporting SCTP should implement this
   mechanism.

9.4.  TLS

   The mapping of a TLS stream abstraction into the application is
   equivalent to the contract provided by TCP (see Section 9.1).  The
   Ready state should be determined by the completion of the TLS
   handshake, which involves potentially several more round trips beyond
   the TCP handshake.
   The application should not be notified that the Connection is Ready
   until TLS is established.

9.5. HTTP

   HTTP requests and responses map naturally into Messages, since they
   are delineated chunks of data with metadata that can be sent over a
   transport. To that end, HTTP can be seen as the most prevalent
   framing protocol that runs on top of streams like TCP, TLS, etc.

   In order to use a transport Connection that provides HTTP Message
   support, the establishment and closing of the connection can be
   treated as it would without the framing protocol. Sending and
   receiving of Messages, however, changes to treat each Message as a
   well-delineated HTTP request or response, with the content of the
   Message representing the body, and the Headers being provided in
   Message metadata.

9.6. QUIC

   QUIC provides a multi-streaming interface to an encrypted transport.
   Each stream can be viewed as equivalent to a TLS stream over TCP, so
   a natural mapping is to present each QUIC stream as an individual
   Connection. The protocol for the stream will be considered Ready
   whenever the underlying QUIC connection is established to the point
   that this stream's data can be sent. For streams after the first
   stream, this will likely be an immediate operation.

   Closing a single QUIC stream, presented to the application as a
   Connection, does not imply closing the underlying QUIC connection
   itself. Rather, the implementation may choose to close the QUIC
   connection once all streams have been closed (possibly after some
   timeout), or after an individual stream Connection sends an Abort.

   Messages over a direct QUIC stream should be represented similarly
   to the TCP stream (one Message per direction, see Section 9.1),
   unless a framing mapping is used on top of QUIC.

9.7. HTTP/2 transport

   Similar to QUIC (Section 9.6), HTTP/2 provides a multi-streaming
   interface. This will generally use HTTP as the unit of Messages over
   the streams, in which each stream can be represented as a transport
   Connection. The lifetime of streams and the HTTP/2 connection should
   be managed as described for QUIC.

   It is possible to treat each HTTP/2 stream as a raw byte-stream
   instead of a carrier for HTTP messages, in which case the Messages
   over the streams can be represented similarly to the TCP stream (one
   Message per direction, see Section 9.1).

10. Rendezvous and Environment Discovery

   The connection establishment process outlined in Section 4 is
   appropriate for client-server connections, but needs to be expanded
   in peer-to-peer Rendezvous scenarios, as follows:

   o  Gathering Local Endpoint candidates

      The set of possible Local Endpoints is gathered. In the simple
      case, this merely enumerates the local interfaces and protocols,
      and allocates ephemeral source ports. For example, a system that
      has WiFi and Ethernet and supports IPv4 and IPv6 might gather
      four candidate locals (IPv4 on Ethernet, IPv6 on Ethernet, IPv4
      on WiFi, and IPv6 on WiFi) that can form the source for a
      transient.

      If NAT traversal is required, the process of gathering Local
      Endpoints becomes broadly equivalent to the ICE candidate
      gathering phase [RFC5245]. The endpoint determines its server
      reflexive Local Endpoints (i.e., the translated address of a
      local, on the other side of a NAT) and relayed locals (e.g., via
      a TURN server or other relay), for each interface and network
      protocol. These are added to the set of candidate Local Endpoints
      for this connection.
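
      As a minimal sketch of the enumeration described above, the
      following Python fragment builds one candidate per interface and
      address family, and optionally adds placeholder server reflexive
      and relayed candidates. The names LocalCandidate and
      gather_candidates are hypothetical and not part of any TAPS
      interface; no STUN or TURN exchange is actually performed.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LocalCandidate:
    """Hypothetical record for one candidate Local Endpoint."""
    interface: str       # e.g. "Ethernet" or "WiFi"
    family: str          # "IPv4" or "IPv6"
    kind: str = "host"   # "host", "server-reflexive", or "relayed"


def gather_candidates(interfaces, families, nat_traversal=False):
    """Enumerate candidate locals for every interface/family pair.

    With nat_traversal=True, a placeholder server reflexive and a
    placeholder relayed candidate are added per host candidate. A real
    system would derive these from STUN and TURN exchanges; that step
    is assumed away here.
    """
    hosts = [LocalCandidate(i, f) for i, f in product(interfaces, families)]
    if not nat_traversal:
        return hosts
    extra = [
        LocalCandidate(h.interface, h.family, kind)
        for h in hosts
        for kind in ("server-reflexive", "relayed")
    ]
    return hosts + extra


if __name__ == "__main__":
    # The WiFi/Ethernet, IPv4/IPv6 example yields four host candidates.
    for cand in gather_candidates(["Ethernet", "WiFi"], ["IPv4", "IPv6"]):
        print(cand)
```

      Keeping the candidate kind on each record lets a later pairing
      step rank host candidates ahead of relayed ones, if the
      implementation chooses to do so.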

      Gathering locals is primarily an endpoint local operation,
      although it might involve exchanges with a STUN server to derive
      server reflexive locals, or with a TURN server or other relay to
      derive relayed locals. It does not involve communication with the
      Remote Endpoint.

   o  Gathering Remote Endpoint Candidates

      The Remote Endpoint is typically a name that needs to be resolved
      into a set of possible addresses that can be used for
      communication. Resolving the Remote Endpoint is the process of
      recursively performing such name lookups, until fully resolved,
      to return the set of candidates for the remote of this
      connection.

      How this is done will depend on the type of the Remote Endpoint,
      and can also be specific to each Local Endpoint. A common case is
      when the Remote Endpoint is a DNS name, in which case it is
      resolved to give a set of IPv4 and IPv6 addresses representing
      that name. Some types of remote might require more complex
      resolution. Resolving the Remote Endpoint for a peer-to-peer
      connection might involve communication with a rendezvous server,
      which in turn contacts the peer to gain consent to communicate
      and retrieve its set of candidate locals, which are returned and
      form the candidate remote addresses for contacting that peer.

      Resolving the remote is _not_ a local operation. It will involve
      a directory service, and can require communication with the
      remote to rendezvous and exchange peer addresses. This can expose
      some or all of the candidate locals to the remote.

   o  Establishing Connections

      The set of candidate Local Endpoints and the set of candidate
      Remote Endpoints are paired, to derive a priority ordered set of
      Candidate Paths that can potentially be used to establish a
      Connection.

      Then, communication is attempted over each candidate path, in
      priority order.
      If there are multiple candidates with the same priority, then
      connection establishment proceeds simultaneously and uses the
      transient that wins the race to be established. Otherwise,
      connection establishment is sequential, paced at a rate that
      should not congest the network. Depending on the chosen
      transport, this phase might involve racing TCP connections to a
      server over IPv4 and IPv6 [RFC8305], or it could involve a STUN
      exchange to establish peer-to-peer UDP connectivity [RFC5245], or
      some other means.

   o  Confirming and Maintaining Connections

      Once connectivity has been established, unused resources can be
      released and the chosen path can be confirmed. This is primarily
      required when establishing peer-to-peer connectivity, where
      connections supporting relayed locals that were not required can
      be closed, and where an associated signalling operation might be
      needed to inform middleboxes and proxies of the chosen path.
      Keep-alive messages may also be sent, as appropriate, to ensure
      NAT and firewall state is maintained, so the Connection remains
      operational.

   To support ICE, or similar protocols, that involve an out-of-band
   indirect signalling exchange to exchange candidates with the Remote
   Endpoint, it's important to be able to query the set of candidate
   Local Endpoints, and give the protocol stack a set of candidate
   Remote Endpoints, before it attempts to establish connections.

   (TO-DO: It is expected that a single abstract algorithm can be
   identified that supports both the peer-to-peer and client-server
   connection racing, allowing this text to be merged with Section 4)

11. IANA Considerations

   RFC-EDITOR: Please remove this section before publication.

   This document has no actions for IANA.

12. Security Considerations

12.1. Considerations for Candidate Gathering

   Implementations should avoid downgrade attacks that allow network
   interference to cause the implementation to select less secure, or
   entirely insecure, combinations of paths and protocols.

12.2. Considerations for Candidate Racing

   See Section 5.2 for security considerations around racing with 0-RTT
   data.

   An attacker that knows a particular device is racing several options
   during connection establishment may be able to block packets for the
   first connection attempt, thus inducing the device to fall back to a
   secondary attempt. This is a problem if the secondary attempts have
   worse security properties that enable further attacks.
   Implementations should ensure that all options have equivalent
   security properties to avoid incentivizing attacks.

   Since results from the network can determine how a connection
   attempt tree is built, such as when DNS returns a list of resolved
   endpoints, it is possible for the network to cause an implementation
   to consume significant on-device resources. Implementations should
   limit the maximum amount of state allowed for any given node,
   including the number of child nodes, especially when the state is
   based on results from the network.

13. Acknowledgements

   This work has received funding from the European Union's Horizon
   2020 research and innovation programme under grant agreement No.
   644334 (NEAT).

   This work has been supported by Leibniz Prize project funds of DFG -
   German Research Foundation: Gottfried Wilhelm Leibniz-Preis 2011
   (FKZ FE 570/4-1).

   This work has been supported by the UK Engineering and Physical
   Sciences Research Council under grant EP/R04144X/1.

   Thanks to Stuart Cheshire, Josh Graessley, David Schinazi, and Eric
   Kinnear for their implementation and design efforts, including Happy
   Eyeballs, that heavily influenced this work.

14. References

14.1. Normative References

   [I-D.ietf-taps-minset]
              Welzl, M. and S. Gjessing, "A Minimal Set of Transport
              Services for TAPS Systems", draft-ietf-taps-minset-02
              (work in progress), February 2018.

   [I-D.pauly-taps-arch]
              Pauly, T., Trammell, B., Brunstrom, A., Fairhurst, G.,
              Perkins, C., Tiesel, P., and C. Wood, "An Architecture
              for Transport Services", draft-pauly-taps-arch-00 (work
              in progress), February 2018.

   [I-D.trammell-taps-interface]
              Trammell, B., Welzl, M., Enghardt, T., Fairhurst, G.,
              Kuehlewind, M., Perkins, C., Tiesel, P., and C. Wood, "An
              Abstract Application Layer Interface to Transport
              Services", draft-trammell-taps-interface-00 (work in
              progress), March 2018.

   [RFC6458]  Stewart, R., Tuexen, M., Poon, K., Lei, P., and V.
              Yasevich, "Sockets API Extensions for the Stream Control
              Transmission Protocol (SCTP)", RFC 6458,
              DOI 10.17487/RFC6458, December 2011.

   [RFC7413]  Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP
              Fast Open", RFC 7413, DOI 10.17487/RFC7413, December
              2014.

   [RFC7540]  Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext
              Transfer Protocol Version 2 (HTTP/2)", RFC 7540,
              DOI 10.17487/RFC7540, May 2015.

   [RFC8260]  Stewart, R., Tuexen, M., Loreto, S., and R. Seggelmann,
              "Stream Schedulers and User Message Interleaving for the
              Stream Control Transmission Protocol", RFC 8260,
              DOI 10.17487/RFC8260, November 2017.

   [RFC8303]  Welzl, M., Tuexen, M., and N. Khademi, "On the Usage of
              Transport Features Provided by IETF Transport Protocols",
              RFC 8303, DOI 10.17487/RFC8303, February 2018.

   [RFC8304]  Fairhurst, G. and T. Jones, "Transport Features of the
              User Datagram Protocol (UDP) and Lightweight UDP
              (UDP-Lite)", RFC 8304, DOI 10.17487/RFC8304, February
              2018.

   [RFC8305]  Schinazi, D. and T.
              Pauly, "Happy Eyeballs Version 2: Better Connectivity
              Using Concurrency", RFC 8305, DOI 10.17487/RFC8305,
              December 2017.

14.2. Informative References

   [I-D.ietf-quic-transport]
              Iyengar, J. and M. Thomson, "QUIC: A UDP-Based
              Multiplexed and Secure Transport",
              draft-ietf-quic-transport-10 (work in progress), March
              2018.

   [I-D.ietf-tls-tls13]
              Rescorla, E., "The Transport Layer Security (TLS)
              Protocol Version 1.3", draft-ietf-tls-tls13-26 (work in
              progress), March 2018.

   [NEAT-flow-mapping]
              Weinrank, F. and M. Tuexen, "Transparent Flow Mapping for
              NEAT (in Workshop on Future of Internet Transport (FIT
              2017))", June 2017.

   [RFC5245]  Rosenberg, J., "Interactive Connectivity Establishment
              (ICE): A Protocol for Network Address Translator (NAT)
              Traversal for Offer/Answer Protocols", RFC 5245,
              DOI 10.17487/RFC5245, April 2010.

   [Trickle]  Ghobadi, M., Cheng, Y., Jain, A., and M. Mathis, "Trickle
              - Rate Limiting YouTube Video Streaming (ATC 2012)", June
              2012.

Appendix A. Additional Properties

   This appendix discusses implementation considerations for additional
   parameters and properties that could be used to enhance transport
   protocol and/or path selection, or the transmission of messages
   given a Protocol Stack that implements them. These are not part of
   the interface, and may be removed from the final document, but are
   presented here to support discussion within the TAPS working group
   as to whether they should be added to a future revision of the base
   specification.

A.1. Properties Affecting Sorting of Branches

   In addition to the Protocol and Path Selection Properties discussed
   in Section 4.3, the following properties under discussion can
   influence branch sorting:

   o  Size to be Sent or Received: An implementation may use the Size
      to be Sent or Received in combination with cached performance
      estimates (see Section 8.2), e.g., the observed Round Trip Time
      and the observed maximum throughput, to compute an estimate of
      the completion time of a transfer over different available paths.
      It may then prefer the path with the shorter expected completion
      time. This property may be used instead of the Capacity Profile,
      as the application does not always know whether its transfer will
      be latency-bound or bandwidth-bound, and thus may not be able to
      specify a Capacity Profile. However, the application may know the
      Size to be Sent or Received from metadata, e.g., in adaptive HTTP
      streaming such as MPEG-DASH, or in operating system upgrades. A
      related paper is currently under submission.

   o  Send / Receive Bitrate: If the application indicates an expected
      send or receive bitrate, an implementation may prefer a path that
      can likely provide the desired bandwidth, based on cached maximum
      throughput (see Section 8.2). The application may know the Send
      or Receive Bitrate from metadata in adaptive HTTP streaming, such
      as MPEG-DASH.

   o  Cost Preferences: If the application indicates a preference to
      avoid expensive paths, and some paths are associated with a
      monetary cost, an implementation should decrease the ranking of
      such paths. If the application indicates that it prohibits using
      expensive paths, paths that are associated with a cost should be
      purged from the decision tree.

A.2. Send Parameters

   In addition to the Send Parameters listed in Section 5.1.1.1, the
   following Send Parameters are under discussion:

   o  Send Bitrate: If an application indicates a certain bitrate it
      wants to send on the connection, the implementation may limit the
      bitrate of the outgoing communication to that rate, for example
      by setting an upper bound for the TCP congestion window of a
      connection calculated from the Send Bitrate and the Round Trip
      Time. This helps to avoid bursty traffic patterns on video
      streaming servers, see [Trickle].

Authors' Addresses

   Anna Brunstrom (editor)
   Karlstad University
   Universitetsgatan 2
   651 88 Karlstad
   Sweden

   Email: anna.brunstrom@kau.se

   Tommy Pauly (editor)
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: tpauly@apple.com

   Theresa Enghardt
   TU Berlin
   Marchstrasse 23
   10587 Berlin
   Germany

   Email: theresa@inet.tu-berlin.de

   Karl-Johan Grinnemo
   Karlstad University
   Universitetsgatan 2
   651 88 Karlstad
   Sweden

   Email: karl-johan.grinnemo@kau.se

   Tom Jones
   University of Aberdeen
   Fraser Noble Building
   Aberdeen, AB24 3UE
   UK

   Email: tom@erg.abdn.ac.uk

   Philipp S. Tiesel
   TU Berlin
   Marchstrasse 23
   10587 Berlin
   Germany

   Email: philipp@inet.tu-berlin.de

   Colin Perkins
   University of Glasgow
   School of Computing Science
   Glasgow G12 8QQ
   United Kingdom

   Email: csp@csperkins.org

   Michael Welzl
   University of Oslo
   PO Box 1080 Blindern
   0316 Oslo
   Norway

   Email: michawe@ifi.uio.no