TAPS Working Group                                     A. Brunstrom, Ed.
Internet-Draft                                       Karlstad University
Intended status: Informational                             T. Pauly, Ed.
Expires: September 12, 2019                                   Apple Inc.
                                                             T. Enghardt
                                                               TU Berlin
                                                           K-J. Grinnemo
                                                     Karlstad University
                                                                T. Jones
                                                  University of Aberdeen
                                                               P. Tiesel
                                                               TU Berlin
                                                              C. Perkins
                                                   University of Glasgow
                                                                M. Welzl
                                                      University of Oslo
                                                          March 11, 2019

             Implementing Interfaces to Transport Services
                        draft-ietf-taps-impl-03

Abstract

   The Transport Services architecture [I-D.ietf-taps-arch] defines a
   system that allows applications to use transport networking protocols
   flexibly. This document serves as a guide to implementation on how
   to build such a system.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 12, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Implementing Basic Objects
   3.  Implementing Pre-Establishment
     3.1.  Configuration-time errors
     3.2.  Role of system policy
   4.  Implementing Connection Establishment
     4.1.  Candidate Gathering
       4.1.1.  Structuring Options as a Tree
       4.1.2.  Branch Types
     4.2.  Branching Order-of-Operations
     4.3.  Sorting Branches
     4.4.  Candidate Racing
       4.4.1.  Delayed
       4.4.2.  Failover
     4.5.  Completing Establishment
       4.5.1.  Determining Successful Establishment
     4.6.  Establishing multiplexed connections
     4.7.  Handling racing with "unconnected" protocols
     4.8.  Implementing listeners
       4.8.1.  Implementing listeners for Connected Protocols
       4.8.2.  Implementing listeners for Unconnected Protocols
       4.8.3.  Implementing listeners for Multiplexed Protocols
   5.  Implementing Data Transfer
     5.1.  Data transfer for streams, datagrams, and frames
       5.1.1.  Sending Messages
       5.1.2.  Receiving Messages
     5.2.  Handling of data for fast-open protocols
   6.  Implementing Maintenance
     6.1.  Managing Connections
     6.2.  Handling Path Changes
   7.  Implementing Termination
   8.  Cached State
     8.1.  Protocol state caches
     8.2.  Performance caches
   9.  Specific Transport Protocol Considerations
     9.1.  TCP
     9.2.  UDP
     9.3.  SCTP
     9.4.  TLS
     9.5.  HTTP
     9.6.  QUIC
     9.7.  HTTP/2 transport
   10. Rendezvous and Environment Discovery
   11. IANA Considerations
   12. Security Considerations
     12.1.  Considerations for Candidate Gathering
     12.2.  Considerations for Candidate Racing
   13. Acknowledgements
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Additional Properties
     A.1.  Properties Affecting Sorting of Branches
   Authors' Addresses

1.  Introduction

   The Transport Services architecture [I-D.ietf-taps-arch] defines a
   system that allows applications to use transport networking protocols
   flexibly. The interface such a system exposes to applications is
   defined as the Transport Services API [I-D.ietf-taps-interface].
   This API is designed to be generic across multiple transport
   protocols and sets of protocol features.

   This document serves as a guide to implementation on how to build a
   system that provides a Transport Services API. It is the job of an
   implementation of a Transport Services system to turn the requests of
   an application into decisions on how to establish connections, and
   how to transfer data over those connections once established. The
   terminology used in this document is based on the Architecture
   [I-D.ietf-taps-arch].

2.  Implementing Basic Objects

   The basic objects that are exposed to applications for Transport
   Services are the Preconnection, the bundle of properties that
   describes the application constraints on the transport; the
   Connection, the basic object that represents a flow of data in either
   direction between the Local and Remote Endpoints; and the Listener, a
   passive waiting object that delivers new Connections.

   Preconnection objects should be implemented as bundles of properties
   that an application can both read and write. Once a Preconnection
   has been used to create an outbound Connection or a Listener, the
   implementation should ensure that the copy of the properties held by
   the Connection or Listener is immutable. This may involve performing
   a deep-copy if the application is still able to modify properties on
   the original Preconnection object.

   Connection objects represent the interface between the application
   and the implementation to manage transport state, and conduct data
   transfer.
   During the process of establishment (Section 4), the
   Connection will not yet be bound to a specific transport flow, since
   there may be multiple candidate Protocol Stacks being raced. Once
   the Connection is established, the object should be considered
   mapped to a specific Protocol Stack. The notion of a Connection maps
   to many different protocols, depending on the Protocol Stack. For
   example, the Connection may ultimately represent the interface into a
   TCP connection, a TLS session over TCP, a UDP flow with
   fully-specified local and remote endpoints, a DTLS session, an SCTP
   stream, a QUIC stream, or an HTTP/2 stream.

   Listener objects are created with a Preconnection, at which point
   their configuration should be considered immutable by the
   implementation. The process of listening is described in
   Section 4.8.

3.  Implementing Pre-Establishment

   During pre-establishment the application specifies the Endpoints to
   be used for communication as well as its preferences via Selection
   Properties and, if desired, also Connection Properties. Generally,
   Connection Properties should be configured as early as possible, as
   they may serve as input to decisions that are made by the
   implementation (the Capacity Profile may guide usage of a protocol
   offering scavenger-type congestion control, for example). In the
   remainder of this document, we only refer to Selection Properties
   because they are the more typical case and have to be handled by all
   implementations.

   The implementation stores these objects and properties as part of the
   Preconnection object for use during connection establishment. For
   Selection Properties that are not provided by the application, the
   implementation must use the default values specified in the Transport
   Services API ([I-D.ietf-taps-interface]).

3.1.  Configuration-time errors

   The transport system should have a list of supported protocols
   available, each of which has transport features reflecting the
   capabilities of the protocol. Once an application specifies its
   Transport Parameters, the transport system should match the required
   and prohibited properties against the transport features of the
   available protocols.

   In the following cases, failure should be detected during pre-
   establishment:

   o  The application requested Protocol Properties that include
      requirements or prohibitions that cannot be satisfied by any of
      the available protocols. For example, if an application requires
      "Configure Reliability per Message", but no such protocol is
      available on the host running the transport system, e.g., because
      SCTP is not supported by the operating system, this should result
      in an error.

   o  The application requested Protocol Properties that are in conflict
      with each other, i.e., the required and prohibited properties
      cannot be satisfied by the same protocol. For example, if an
      application prohibits "Reliable Data Transfer" but then requires
      "Configure Reliability per Message", this mismatch should result
      in an error.

   It is important to fail as early as possible in such cases in order
   to avoid allocating resources, e.g., to endpoint resolution, only to
   find out later that there is no protocol that satisfies the
   requirements.

3.2.  Role of system policy

   The properties specified during pre-establishment have a close
   connection to system policy. The implementation is responsible for
   combining and reconciling several different sources of preferences
   when establishing Connections. These include, but are not limited
   to:

   1.  Application preferences, i.e., preferences specified during
       pre-establishment via Selection Properties.

   2.  Dynamic system policy, i.e., policy compiled from internally and
       externally acquired information about available network
       interfaces, supported transport protocols, and current/previous
       Connections. Examples of ways to externally retrieve policy-
       support information are through OS-specific statistics/
       measurement tools and tools that reside on middleboxes and
       routers.

   3.  Default implementation policy, i.e., predefined policy by OS or
       application.

   In general, any protocol or path used for a connection must conform
   to all three sources of constraints. Any violation of any of the
   layers should cause a protocol or path to be considered ineligible
   for use. As an example of application preferences leading to
   constraints, an application may prohibit the use of metered network
   interfaces for a given Connection to avoid user cost. Similarly, the
   system policy at a given time may prohibit the use of such a metered
   network interface from the application's process. Lastly, the
   implementation itself may default to disallowing certain network
   interfaces unless explicitly requested by the application and allowed
   by the system.

   It is expected that the database of system policies and the method of
   looking up these policies will vary across platforms. An
   implementation should attempt to look up the relevant policies for
   the system in a dynamic way to make sure it is reflecting an accurate
   version of the system policy, since the system's policy regarding the
   application's traffic may change over time due to user or
   administrative changes.

4.  Implementing Connection Establishment

   The process of establishing a network connection begins when an
   application expresses intent to communicate with a remote endpoint by
   calling Initiate.
   (At this point, any constraints or requirements
   the application may have on the connection are available from pre-
   establishment.) The process can be considered complete once there is
   at least one Protocol Stack that has completed any required setup to
   the point that it can transmit and receive the application's data.

   Connection establishment is divided into two top-level steps:
   Candidate Gathering, to identify the paths, protocols, and endpoints
   to use, and Candidate Racing, in which the necessary protocol
   handshakes are conducted so that the transport system can select
   which set to use.

   The simplest example of this process might involve identifying the
   single IP address to which the implementation wishes to connect,
   using the system's current default interface or path, and starting a
   TCP handshake to establish a stream to the specified IP address.
   However, each step may also vary depending on the requirements of the
   connection: if the endpoint is defined as a hostname and port, then
   there may be multiple resolved addresses that are available; there
   may also be multiple interfaces or paths available, other than the
   default system interface; and some protocols may not need any
   transport handshake to be considered "established" (such as UDP),
   while other connections may utilize layered protocol handshakes, such
   as TLS over TCP.

   Whenever an implementation has multiple options for connection
   establishment, it can view the set of all individual connection
   establishment options as a single, aggregate connection
   establishment. The aggregate set conceptually includes every valid
   combination of endpoints, paths, and protocols. As an example,
   consider an implementation that initiates a TCP connection to a
   hostname + port endpoint, and has two valid interfaces available
   (Wi-Fi and LTE).
   The hostname resolves to a single IPv4 address on the
   Wi-Fi network, and resolves to the same IPv4 address on the LTE
   network, as well as a single IPv6 address. The aggregate set of
   connection establishment options can be viewed as follows:

   Aggregate [Endpoint: www.example.com:80] [Interface: Any] [Protocol: TCP]
   |-> [Endpoint: 192.0.2.1:80]       [Interface: Wi-Fi] [Protocol: TCP]
   |-> [Endpoint: 192.0.2.1:80]       [Interface: LTE]   [Protocol: TCP]
   |-> [Endpoint: 2001:DB8::1.80]     [Interface: LTE]   [Protocol: TCP]

   Any one of these sub-entries on the aggregate connection attempt
   would satisfy the original application intent. The concern of this
   section is the algorithm defining which of these options to try,
   when, and in what order.

4.1.  Candidate Gathering

   The step of gathering candidates involves identifying which paths,
   protocols, and endpoints may be used for a given Connection. This
   list is determined by the requirements, prohibitions, and preferences
   of the application as specified in the Selection Properties.

4.1.1.  Structuring Options as a Tree

   When an implementation responsible for connection establishment needs
   to consider multiple options, it should logically structure these
   options as a hierarchical tree. Each leaf node of the tree
   represents a single, coherent connection attempt, with an Endpoint, a
   Path, and a set of protocols that can directly negotiate and send
   data on the network. Each node in the tree that is not a leaf
   represents a connection attempt that is either underspecified, or
   else includes multiple distinct options. For example, when
   connecting on an IP network, a connection attempt to a hostname and
   port is underspecified, because the connection attempt requires a
   resolved IP address as its remote endpoint.
   In this case, the node
   represented by the connection attempt to the hostname is a parent
   node, with child nodes for each IP address. Similarly, an
   implementation that is allowed to connect using multiple interfaces
   will have a parent node of the tree for the decision between the
   paths, with a branch for each interface.

   The example aggregate connection attempt above can be drawn as a tree
   by grouping the addresses resolved on the same interface into
   branches:

                             ||
                +==========================+
                |  www.example.com:80/Any  |
                +==========================+
                  //                    \\
+==========================+       +==========================+
| www.example.com:80/Wi-Fi |       |  www.example.com:80/LTE  |
+==========================+       +==========================+
             ||                      //                    \\
 +====================+ +====================+ +======================+
 | 192.0.2.1:80/Wi-Fi | |  192.0.2.1:80/LTE  | |  2001:DB8::1.80/LTE  |
 +====================+ +====================+ +======================+

   The rest of this section will use a notation scheme to represent this
   tree. The parent (or trunk) node of the tree will be represented by
   a single integer, such as "1". Each child of that node will have an
   integer that identifies it, from 1 to the number of children. That
   child node will be uniquely identified by concatenating its integer
   to its parent's identifier with a dot in between, such as "1.1" and
   "1.2". Each node will be summarized by a tuple of three elements:
   Endpoint, Path, and Protocol.
   The above example can now be written
   more succinctly as:

   1 [www.example.com:80, Any, TCP]
     1.1 [www.example.com:80, Wi-Fi, TCP]
       1.1.1 [192.0.2.1:80, Wi-Fi, TCP]
     1.2 [www.example.com:80, LTE, TCP]
       1.2.1 [192.0.2.1:80, LTE, TCP]
       1.2.2 [2001:DB8::1.80, LTE, TCP]

   When an implementation views this aggregate set of connection
   attempts as a single connection establishment, it will use only one
   of the leaf nodes to transfer data. Thus, when a single leaf node
   becomes ready to use, the entire connection attempt is ready to
   use by the application. Another way to represent this is that every
   leaf node updates the state of its parent node when it becomes ready,
   until the trunk node of the tree is ready, which then notifies the
   application that the connection as a whole is ready to use.

   A connection establishment tree may be degenerate, and only have a
   single leaf node, such as a connection attempt to an IP address over
   a single interface with a single protocol.

   1 [192.0.2.1:80, Wi-Fi, TCP]

   A parent node may also only have one child (or leaf) node, such as
   when a hostname resolves to only a single IP address.

   1 [www.example.com:80, Wi-Fi, TCP]
     1.1 [192.0.2.1:80, Wi-Fi, TCP]

4.1.2.  Branch Types

   There are three types of branching from a parent node into one or
   more child nodes. Any parent node of the tree must only use one type
   of branching.

4.1.2.1.  Derived Endpoints

   If a connection originally targets a single endpoint, there may be
   multiple endpoints of different types that can be derived from the
   original. The connection library should order the derived endpoints
   according to application preference, system policy and expected
   performance.

   DNS hostname-to-address resolution is the most common method of
   endpoint derivation.
   When trying to connect to a hostname endpoint
   on a traditional IP network, the implementation should send DNS
   queries for both A (IPv4) and AAAA (IPv6) records if both are
   supported on the local link. The algorithm for ordering and racing
   these addresses should follow the recommendations in Happy Eyeballs
   [RFC8305].

   1 [www.example.com:80, Wi-Fi, TCP]
     1.1 [2001:DB8::1.80, Wi-Fi, TCP]
     1.2 [192.0.2.1:80, Wi-Fi, TCP]
     1.3 [2001:DB8::2.80, Wi-Fi, TCP]
     1.4 [2001:DB8::3.80, Wi-Fi, TCP]

   DNS-Based Service Discovery can also provide an endpoint derivation
   step. When trying to connect to a named service, the client may
   discover one or more hostname and port pairs on the local network
   using multicast DNS. These hostnames should each be treated as a
   branch which can be attempted independently from other hostnames.

   Each of these hostnames may also resolve to one or more addresses,
   thus creating multiple layers of branching.

   1 [term-printer._ipp._tcp.meeting.ietf.org, Wi-Fi, TCP]
     1.1 [term-printer.meeting.ietf.org:631, Wi-Fi, TCP]
       1.1.1 [31.133.160.18.631, Wi-Fi, TCP]

4.1.2.2.  Alternate Paths

   If a client has multiple network interfaces available to it, such as
   a mobile client with both Wi-Fi and Cellular connectivity, it can
   attempt a connection over either interface. This represents a branch
   point in the connection establishment. As with derived endpoints,
   the interfaces should be ranked based on preference, system policy,
   and performance. Attempts should be started on one interface, and
   then on other interfaces successively after delays based on expected
   round-trip-time or other available metrics.

   1 [192.0.2.1:80, Any, TCP]
     1.1 [192.0.2.1:80, Wi-Fi, TCP]
     1.2 [192.0.2.1:80, LTE, TCP]

   This same approach applies to any situation in which the client is
   aware of multiple links or views of the network.
   Multiple Paths,
   each with a coherent set of addresses, routes, DNS server, and more,
   may share a single interface. A path may also represent a virtual
   interface service such as a Virtual Private Network (VPN).

   The list of available paths should be constrained by any requirements
   or prohibitions the application sets, as well as system policy.

4.1.2.3.  Protocol Options

   Differences in possible protocol compositions and options can also
   provide a branching point in connection establishment. This allows
   clients to be resilient to situations in which a certain protocol is
   not functioning on a server or network.

   This approach is commonly used for connections with optional proxy
   server configurations. A single connection may be allowed to use an
   HTTP-based proxy, a SOCKS-based proxy, or connect directly. These
   options should be ranked and attempted in succession.

   1 [www.example.com:80, Any, HTTP/TCP]
     1.1 [192.0.2.8:80, Any, HTTP/HTTP Proxy/TCP]
     1.2 [192.0.2.7:10234, Any, HTTP/SOCKS/TCP]
     1.3 [www.example.com:80, Any, HTTP/TCP]
       1.3.1 [192.0.2.1:80, Any, HTTP/TCP]

   This approach also allows a client to attempt different sets of
   application and transport protocols that may provide preferable
   characteristics when available.
   For example, the protocol options
   could involve QUIC [I-D.ietf-quic-transport] over UDP on one branch,
   and HTTP/2 [RFC7540] over TLS over TCP on the other:

   1 [www.example.com:443, Any, Any HTTP]
     1.1 [www.example.com:443, Any, QUIC/UDP]
       1.1.1 [192.0.2.1:443, Any, QUIC/UDP]
     1.2 [www.example.com:443, Any, HTTP2/TLS/TCP]
       1.2.1 [192.0.2.1:443, Any, HTTP2/TLS/TCP]

   Another example is racing SCTP with TCP:

   1 [www.example.com:80, Any, Any Stream]
     1.1 [www.example.com:80, Any, SCTP]
       1.1.1 [192.0.2.1:80, Any, SCTP]
     1.2 [www.example.com:80, Any, TCP]
       1.2.1 [192.0.2.1:80, Any, TCP]

   Implementations that support racing protocols and protocol options
   should maintain a history of which protocols and protocol options
   successfully established, on a per-network basis (see Section 8.2).
   This information can influence future racing decisions to prioritize
   or prune branches.

4.2.  Branching Order-of-Operations

   Branch types must occur in a specific order relative to one another
   to avoid creating leaf nodes with invalid or incompatible settings.
   In the example above, it would be invalid to branch for derived
   endpoints (the DNS results for www.example.com) before branching
   between interface paths, since usable DNS results on one network may
   not necessarily be the same as DNS results on another network due to
   local network entities, supported address families, or enterprise
   network configurations. Implementations must be careful to branch in
   an order that results in usable leaf nodes whenever there are
   multiple branch types that could be used from a single node.

   The order of operations for branching, where lower numbers are acted
   upon first, should be:

   1.  Alternate Paths

   2.  Protocol Options

   3.  Derived Endpoints

   Branching between paths is the first in the list because results
   across multiple interfaces are likely not related to one another:
   endpoint resolution may return different results, especially when
   using locally resolved host and service names, and which protocols
   are supported and preferred may differ across interfaces. Thus, if
   multiple paths are attempted, the overall connection can be seen as a
   race between the available paths or interfaces.

   Protocol options are checked next in order. Whether or not a set of
   protocols, or protocol-specific options, can successfully connect is
   generally not dependent on which specific IP address is used.
   Furthermore, the protocol stacks being attempted may influence or
   altogether change the endpoints being used. Adding a proxy to a
   connection's branch will change the endpoint to the proxy's IP
   address or hostname. Choosing an alternate protocol may also modify
   the ports that should be selected.

   Branching for derived endpoints is the final step, and may have
   multiple layers of derivation or resolution, such as DNS service
   resolution and DNS hostname resolution.

   For example, if the application has indicated both a preference for
   Wi-Fi over LTE and for a feature only available in SCTP, branches
   will first be sorted according to path selection, with Wi-Fi at the
   top. Then, branches with SCTP will be sorted to the top within their
   subtree according to the properties influencing protocol selection.
   However, if the implementation has cached the information that SCTP
   is not available on the path over Wi-Fi, there is no SCTP node in the
   Wi-Fi subtree. Here, the path over Wi-Fi will be tried first, and, if
   connection establishment succeeds, TCP will be used. So the
   Selection Property of preferring Wi-Fi takes precedence over the
   Property that led to a preference for SCTP.

   1 [www.example.com:80, Any, Any Stream]
     1.1 [192.0.2.1:80, Wi-Fi, Any Stream]
       1.1.1 [192.0.2.1:80, Wi-Fi, TCP]
     1.2 [192.0.3.1:80, LTE, Any Stream]
       1.2.1 [192.0.3.1:80, LTE, SCTP]
       1.2.2 [192.0.3.1:80, LTE, TCP]

4.3.  Sorting Branches

   Implementations should sort the branches of the tree of connection
   options in order of their preference rank. Leaf nodes on branches
   with higher rankings represent connection attempts that will be raced
   first. Implementations should order the branches to reflect the
   preferences expressed by the application for its new connection,
   including Selection Properties, which are specified in
   [I-D.ietf-taps-interface].

   In addition to the properties provided by the application, an
   implementation may include additional criteria such as cached
   performance estimates, see Section 8.2, or system policy, see
   Section 3.2, in the ranking. Two examples of how Selection and
   Connection Properties may be used to sort branches are provided
   below:

   o  "Interface Instance or Type": If the application specifies an
      interface type to be preferred or avoided, implementations should
      rank paths accordingly. If the application specifies an interface
      type to be required or prohibited, we expect an implementation to
      not include the non-conforming paths in the tree.
589 o "Capacity Profile": An implementation may use the Capacity Profile 590 to prefer paths optimized for the application's expected traffic 591 pattern according to cached performance estimates, see 592 Section 8.2: 594 * Scavenger: Prefer paths with the highest expected available 595 bandwidth, based on observed maximum throughput 597 * Low Latency/Interactive: Prefer paths with the lowest expected 598 Round Trip Time 600 * Constant-Rate Streaming: Prefer paths that can satisfy the 601 requested Stream Send or Stream Receive Bitrate, based on 602 observed maximum throughput 604 Implementations should process properties in the following order: 605 Prohibit, Require, Prefer, Avoid. If Selection Properties contain 606 any prohibited properties, the implementation should first purge 607 branches containing nodes with these properties. For required 608 properties, it should only keep branches that satisfy these 609 requirements. Finally, it should order branches according to 610 preferred properties, and finally use avoided properties as a 611 tiebreaker. 613 4.4. Candidate Racing 615 The primary goal of the Candidate Racing process is to successfully 616 negotiate a protocol stack to an endpoint over an interface--to 617 connect a single leaf node of the tree--with as little delay and as 618 few unnecessary connections attempts as possible. Optimizing these 619 two factors improves the user experience, while minimizing network 620 load. 622 This section covers the dynamic aspect of connection establishment. 623 While the tree described above is a useful conceptual and 624 architectural model, an implementation does not know what the full 625 tree may become up front, nor will many of the possible branches be 626 used in the common case. 628 There are three different approaches to racing the attempts for 629 different nodes of the connection establishment tree: 631 1. Immediate 633 2. Delayed 635 3. 
Failover 637 Each approach is appropriate in different use-cases and branch types. 638 However, to avoid consuming unnecessary network resources, 639 implementations should not use immediate racing as a default 640 approach. 642 The timing algorithms for racing should remain independent across 643 branches of the tree. Any timers or racing logic is isolated to a 644 given parent node, and is not ordered precisely with regards to other 645 children of other nodes. 647 4.4.1. Delayed 649 Delayed racing can be used whenever a single node of the tree has 650 multiple child nodes. Based on the order determined when building 651 the tree, the first child node will be initiated immediately, 652 followed by the next child node after some delay. Once that second 653 child node is initiated, the third child node (if present) will begin 654 after another delay, and so on until all child nodes have been 655 initiated, or one of the child nodes successfully completes its 656 negotiation. 658 Delayed racing attempts occur in parallel. Implementations should 659 not terminate an earlier child connection attempt upon starting a 660 secondary child. 662 The delay between starting child nodes should be based on the 663 properties of the previously started child node. For example, if the 664 first child represents an IP address with a known route, and the 665 second child represents another IP address, the delay between 666 starting the first and second IP addresses can be based on the 667 expected retransmission cadence for the first child's connection 668 (derived from historical round-trip-time). Alternatively, if the 669 first child represents a branch on a Wi-Fi interface, and the second 670 child represents a branch on an LTE interface, the delay should be 671 based on the expected time in which the branch for the first 672 interface would be able to establish a connection, based on link 673 quality and historical round-trip-time. 
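As an illustration of the delayed racing described in Section 4.4.1, the following minimal sketch derives the stagger delay from the historical round-trip time of the previously started child. The function names, the multiplier of 2.0, and the 10 ms and 2 s bounds are assumptions chosen for illustration; they do not come from this document.

```python
def stagger_delay(historical_rtt_s, min_delay_s=0.01, max_delay_s=2.0,
                  rtt_multiplier=2.0):
    """Delay (seconds) to wait after starting a child before starting the next.

    historical_rtt_s is the cached RTT of the child that was just started,
    or None when no history exists. All default values are hypothetical.
    """
    if historical_rtt_s is None:
        # No history: be conservative and wait the maximum delay.
        return max_delay_s
    # Scale the RTT, then clamp the result into [min_delay_s, max_delay_s].
    return max(min_delay_s, min(max_delay_s, historical_rtt_s * rtt_multiplier))


def start_schedule(child_rtts, **bounds):
    """Start offsets (seconds from t=0) for children listed in rank order.

    Earlier attempts are not cancelled when later ones start; this only
    computes when each attempt would begin if no attempt completes or
    fails early.
    """
    offsets = []
    t = 0.0
    for rtt in child_rtts:
        offsets.append(t)
        t += stagger_delay(rtt, **bounds)
    return offsets
```

For example, start_schedule([0.05, 0.1, 0.2]) starts the first child at once, the second after roughly 0.1 s, and the third after roughly 0.3 s. A real implementation would also start the next child immediately when an earlier child fails.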
675 Any delay should have a defined minimum and maximum value based on 676 the branch type. Generally, branches between paths and protocols 677 should have longer delays than branches between derived endpoints. 678 The maximum delay should be considered with regards to how long a 679 user is expected to wait for the connection to complete. 681 If a child node fails to connect before the delay timer has fired for 682 the next child, the next child should be started immediately. 684 4.4.2. Failover 686 If an implementation or application has a strong preference for one 687 branch over another, the branching node may choose to wait until one 688 child has failed before starting the next. Failure of a leaf node is 689 determined by its protocol negotiation failing or timing out; failure 690 of a parent branching node is determined by all of its children 691 failing. 693 An example in which failover is recommended is a race between a 694 protocol stack that uses a proxy and a protocol stack that bypasses 695 the proxy. Failover is useful in case the proxy is down or 696 misconfigured, but any more aggressive type of racing may end up 697 unnecessarily avoiding a proxy that was preferred by policy. 699 4.5. Completing Establishment 701 The process of connection establishment completes when one leaf node 702 of the tree has completed negotiation with the remote endpoint 703 successfully, or else all nodes of the tree have failed to connect. 704 The first leaf node to complete its connection is then used by the 705 application to send and receive data. 707 It is useful to process success and failure throughout the tree by 708 child nodes reporting to their parent nodes (towards the trunk of the 709 tree). For example, in the following case, if 1.1.1 fails to 710 connect, it reports the failure to 1.1. Since 1.1 has no other child 711 nodes, it also has failed and reports that failure to 1. Because 1.2 712 has not yet failed, 1 is not considered to have failed. 
Since 1.2 713 has not yet started, it is started and the process continues. 714 Similarly, if 1.1.1 successfully connects, then it marks 1.1 as 715 connected, which propagates to the trunk node 1. At this point, the 716 connection as a whole is considered to be successfully connected and 717 ready to process application data 718 1 [www.example.com:80, Any, TCP] 719 1.1 [www.example.com:80, Wi-Fi, TCP] 720 1.1.1 [192.0.2.1:80, Wi-Fi, TCP] 721 1.2 [www.example.com:80, LTE, TCP] 722 ... 724 If a leaf node has successfully completed its connection, all other 725 attempts should be made ineligible for use by the application for the 726 original request. New connection attempts that involve transmitting 727 data on the network should not be started after another leaf node has 728 completed successfully, as the connection as a whole has been 729 established. An implementation may choose to let certain handshakes 730 and negotiations complete in order to gather metrics to influence 731 future connections. Similarly, an implementation may choose to hold 732 onto fully established leaf nodes that were not the first to 733 establish for use in future connections, but this approach is not 734 recommended since those attempts were slower to connect and may 735 exhibit less desirable properties. 737 4.5.1. Determining Successful Establishment 739 Implementations may select the criteria by which a leaf node is 740 considered to be successfully connected differently on a per-protocol 741 basis. If the only protocol being used is a transport protocol with 742 a clear handshake, like TCP, then the obvious choice is to declare 743 that node "connected" when the last packet of the three-way handshake 744 has been received. If the only protocol being used is an 745 "unconnected" protocol, like UDP, the implementation may consider the 746 node fully "connected" the moment it determines a route is present, 747 before sending any packets on the network, see further Section 4.7. 
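The child-to-parent reporting described in Section 4.5 can be sketched as follows. This is a simplified model under stated assumptions: the Node class, its state names, and its method names are hypothetical, and timers, cancellation of losing attempts, and starting of sibling branches are left out.

```python
class Node:
    """One node of the connection establishment tree (illustrative only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.state = "pending"  # one of "pending", "failed", "connected"
        if parent is not None:
            parent.children.append(self)

    def report_failure(self):
        # A node that fails tells its parent; the parent fails only when
        # every one of its children has failed.
        self.state = "failed"
        parent = self.parent
        if parent is not None and all(c.state == "failed"
                                      for c in parent.children):
            parent.report_failure()

    def report_success(self):
        # Success propagates unconditionally towards the trunk.
        self.state = "connected"
        if self.parent is not None:
            self.parent.report_success()
```

With the example tree from Section 4.5, a failure of 1.1.1 marks 1.1 as failed but leaves node 1 pending because 1.2 has not failed; a later success below 1.2 marks node 1 as connected.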
749 For protocol stacks with multiple handshakes, the decision becomes 750 more nuanced. If the protocol stack involves both TLS and TCP, an 751 implementation could determine that a leaf node is connected after 752 the TCP handshake is complete, or it can wait for the TLS handshake 753 to complete as well. The benefit of declaring completion when the 754 TCP handshake finishes, and thus stopping the race for other branches 755 of the tree, is that there will be less burden on the network from 756 other connection attempts. On the other hand, by waiting until the 757 TLS handshake is complete, an implementation avoids the scenario in 758 which a TCP handshake completes quickly, but TLS negotiation is 759 either very slow or fails altogether in particular network conditions 760 or to a particular endpoint. To avoid the issue of TLS possibly 761 failing, the implementation should not generate a Ready event for the 762 Connection until TLS is established. 764 If all of the leaf nodes fail to connect during racing, i.e. none of 765 the configurations that satisfy all requirements given in the 766 Transport Parameters actually work over the available paths, then the 767 transport system should notify the application with an InitiateError 768 event. An InitiateError event should also be generated in case the 769 transport system finds no usable candidates to race. 771 4.6. Establishing multiplexed connections 773 Multiplexing several Connections over a single underlying transport 774 connection requires that the Connections to be multiplexed belong to 775 the same Connection Group (as is indicated by the application using 776 the Clone call). When the underlying transport connection supports 777 multi-streaming, the Transport System can map each Connection in the 778 Connection Group to a different stream. 
Thus, when the Connections 779 that are offered to an application by the Transport System are 780 multiplexed, the Transport System may implement the establishment of 781 a new Connection by simply beginning to use a new stream of an 782 already established transport connection and there is no need for a 783 connection establishment procedure. This, then, also means that 784 there may not be any "establishment" message (like a TCP SYN), but 785 the application can simply start sending or receiving. Therefore, 786 when the Initiate action of a Transport System is called without 787 Messages being handed over, it cannot be guaranteed that the other 788 endpoint will have any way to know about this, and hence a passive 789 endpoint's ConnectionReceived event may not be called upon an active 790 endpoint's Initiate. Instead, calling the ConnectionReceived event 791 may be delayed until the first Message arrives. 793 4.7. Handling racing with "unconnected" protocols 795 While protocols that use an explicit handshake to validate a 796 Connection to a peer can be used for racing multiple establishment 797 attempts in parallel, "unconnected" protocols such as raw UDP do not 798 offer a way to validate the presence of a peer or the usability of a 799 Connection without application feedback. An implementation should 800 consider such a protocol stack to be established as soon as a local 801 route to the peer endpoint is confirmed. 803 However, if a peer is not reachable over the network using the 804 unconnected protocol, or data cannot be exchanged for any other 805 reason, the application may want to attempt using another candidate 806 Protocol Stack. The implementation should maintain the list of other 807 candidate Protocol Stacks that were eligible to use.
In the case 808 that the application signals that the initial Protocol Stack is 809 failing for some reason and that another option should be attempted, 810 the Connection can be updated to point to the next candidate Protocol 811 Stack. This can be viewed as an application-driven form of Protocol 812 Stack racing. 814 4.8. Implementing listeners 816 When an implementation is asked to Listen, it registers with the 817 system to wait for incoming traffic to the Local Endpoint. If no 818 Local Endpoint is specified, the implementation should either use an 819 ephemeral port or generate an error. 821 If the Selection Properties do not require a single network interface 822 or path, but allow the use of multiple paths, the Listener object 823 should register for incoming traffic on all of the network interfaces 824 or paths that conform to the Properties. The set of available paths 825 can change over time, so the implementation should monitor network 826 path changes and register and de-register the Listener across all 827 usable paths. When using multiple paths, the Listener is generally 828 expected to use the same port for listening on each. 830 If the Selection Properties allow multiple protocols to be used for 831 listening, and the implementation supports it, the Listener object 832 should register across the eligible protocols for each path. This 833 means that inbound Connections delivered by the implementation may 834 have heterogeneous protocol stacks. 836 4.8.1. Implementing listeners for Connected Protocols 838 Connected protocols such as TCP and TLS-over-TCP have a strong 839 mapping between the Local and Remote Endpoints (five-tuple) and their 840 protocol connection state. These map well into Connection objects. 841 Whenever a new inbound handshake is being started, the Listener 842 should generate a new Connection object and pass it to the 843 application. 845 4.8.2.
Implementing listeners for Unconnected Protocols 847 Unconnected protocols such as UDP and UDP-lite generally do not 848 provide the same mechanisms that connected protocols do to offer 849 Connection objects. Implementations should wait for incoming packets 850 for unconnected protocols on a listening port and should perform 851 five-tuple matching of packets to either existing Connection objects 852 or the creation of new Connection objects. On platforms with 853 facilities to create a "virtual connection" for unconnected protocols 854 implementations should use these mechanisms to minimise the handling 855 of datagrams intended for already created Connection objects. 857 4.8.3. Implementing listeners for Multiplexed Protocols 859 Protocols that provide multiplexing of streams into a single five- 860 tuple can listen both for entirely new connections (a new HTTP/2 861 stream on a new TCP connection, for example) and for new sub- 862 connections (a new HTTP/2 stream on an existing connection). If the 863 abstraction of Connection presented to the application is mapped to 864 the multiplexed stream, then the Listener should deliver new 865 Connection objects in the same way for either case. The 866 implementation should allow the application to introspect the 867 Connection Group marked on the Connections to determine the grouping 868 of the multiplexing. 870 5. Implementing Data Transfer 872 5.1. Data transfer for streams, datagrams, and frames 874 The most basic mapping for sending a Message is an abstraction of 875 datagrams, in which the transport protocol naturally deals in 876 discrete packets. Each Message here corresponds to a single 877 datagram. Generally, these will be short enough that sending and 878 receiving will always use a complete Message. 880 For protocols that expose byte-streams, the only delineation provided 881 by the protocol is the end of the stream in a given direction. 
Each 882 Message in this case corresponds to the entire stream of bytes in a 883 direction. These Messages may be quite long, in which case they can 884 be sent in multiple parts. 886 Protocols that provide the framing (such as length-value protocols, 887 or protocols that use delimiters) provide data boundaries that may be 888 longer than a traditional packet datagram. Each Message for framing 889 protocols corresponds to a single frame, which may be sent either as 890 a complete Message, or in multiple parts. 892 5.1.1. Sending Messages 894 The effect of the application sending a Message is determined by the 895 top-level protocol in the established Protocol Stack. That is, if 896 the top-level protocol provides an abstraction of framed messages 897 over a connection, the receiving application will be able to obtain 898 multiple Messages on that connection, even if the framing protocol is 899 built on a byte-stream protocol like TCP. 901 5.1.1.1. Message Properties 903 o Lifetime: this should be implemented by removing the Message from 904 its queue of pending Messages after the Lifetime has expired. A 905 queue of pending Messages within the transport system 906 implementation that have yet to be handed to the Protocol Stack 907 can always support this property, but once a Message has been sent 908 into the send buffer of a protocol, only certain protocols may 909 support de-queueing a message. For example, TCP cannot remove 910 bytes from its send buffer, while in case of SCTP, such control 911 over the SCTP send buffer can be exercised using the partial 912 reliability extension [RFC8303]. When there is no standing queue 913 of Messages within the system, and the Protocol Stack does not 914 support removing a Message from its buffer, this property may be 915 ignored. 917 o Priority: this represents the ability to prioritize a Message over 918 other Messages. 
This can be implemented by the system re-ordering 919 Messages that have yet to be handed to the Protocol Stack, or by 920 giving relative priority hints to protocols that support 921 priorities per Message. For example, an implementation of HTTP/2 922 could choose to send Messages of different Priority on streams of 923 different priority. 925 o Ordered: when this is false, it disables the requirement of in- 926 order-delivery for protocols that support configurable ordering. 928 o Idempotent: when this is true, it means that the Message can be 929 used by mechanisms that might transfer it multiple times - e.g., 930 as a result of racing multiple transports or as part of TCP Fast 931 Open. 933 o Final: when this is true, it means that a transport connection can 934 be closed immediately after its transmission. 936 o Corruption Protection Length: when this is set to any value other 937 than -1, it limits the required checksum in protocols that allow 938 limiting the checksum length (e.g. UDP-Lite). 940 o Transmission Profile: TBD - because it's not final in the API yet. 941 Old text follows: when this is set to "Interactive/Low Latency", 942 the Message should be sent immediately, even when this comes at 943 the cost of using the network capacity less efficiently. For 944 example, small messages can sometimes be bundled to fit into a 945 single data packet for the sake of reducing header overhead; such 946 bundling should not be used. For example, in case of TCP, the 947 Nagle algorithm should be disabled when Interactive/Low Latency is 948 selected as the capacity profile. Scavenger/Bulk can translate 949 into usage of a congestion control mechanism such as LEDBAT, and/ 950 or the capacity profile can lead to a choice of a DSCP value as 951 described in [I-D.ietf-taps-minset]. 953 o Singular Transmission: when this is true, the application requests 954 to avoid transport-layer segmentation or network-layer 955 fragmentation.
Some transports implement network-layer 956 fragmentation avoidance (Path MTU Discovery) without exposing this 957 functionality to the application; in this case, only transport- 958 layer segmentation should be avoided, by fitting the message into 959 a single transport-layer segment or otherwise failing. Otherwise, 960 network-layer fragmentation should be avoided--e.g. by requesting 961 the IP Don't Fragment bit to be set in case of UDP(-Lite) and IPv4 962 (SET_DF in [RFC8304]). 964 5.1.1.2. Send Completion 966 The application should be notified whenever a Message or partial 967 Message has been consumed by the Protocol Stack, or has failed to 968 send. The meaning of the Message being consumed by the stack may 969 vary depending on the protocol. For a basic datagram protocol like 970 UDP, this may correspond to the time when the packet is sent into the 971 interface driver. For a protocol that buffers data in queues, like 972 TCP, this may correspond to when the data has entered the send 973 buffer. 975 5.1.1.3. Batching Sends 977 Since sending a Message may involve a context switch between the 978 application and the transport system, sending patterns that involve 979 multiple small Messages can incur high overhead if each needs to be 980 enqueued separately. To avoid this, the application should have a 981 way to indicate a batch of Send actions, during which time the 982 implementation will hold off on processing Messages until the batch 983 is complete. This can also help reduce context switches when enqueuing data 984 in the interface driver if the operation can be batched. 986 5.1.2. Receiving Messages 988 Similar to sending, Receiving a Message is determined by the top- 989 level protocol in the established Protocol Stack. The main 990 difference with Receiving is that the size and boundaries of the 991 Message are not known beforehand.
The application can communicate in 992 its Receive action the parameters for the Message, which can help the 993 implementation know how much data to deliver and when. For example, 994 if the application only wants to receive a complete Message, the 995 implementation should wait until an entire Message (datagram, stream, 996 or frame) is read before delivering any Message content to the 997 application. This requires the implementation to understand where 998 messages end, either via a supplied deframer or because the top-level 999 protocol in the established Protocol Stack preserves message 1000 boundaries; if, on the other hand, the top-level protocol only 1001 supports a byte-stream and no deframers were supported, the 1002 application must specify the minimum number of bytes of Message 1003 content it wants to receive (which may be just a single byte) to 1004 control the flow of received data. 1006 If a Connection becomes finished before a requested Receive action 1007 can be satisfied, the implementation should deliver any partial 1008 Message content outstanding, or if none is available, an indication 1009 that there will be no more received Messages. 1011 5.2. Handling of data for fast-open protocols 1013 Several protocols allow sending higher-level protocol or application 1014 data within the first packet of their protocol establishment, such as 1015 TCP Fast Open [RFC7413] and TLS 1.3 [RFC8446]. This approach is 1016 referred to as sending Zero-RTT (0-RTT) data. This is a desirable 1017 property, but poses challenges to an implementation that uses racing 1018 during connection establishment. 1020 If the application has 0-RTT data to send in any protocol handshakes, 1021 it needs to provide this data before the handshakes have begun. When 1022 racing, this means that the data should be provided before the 1023 process of connection establishment has begun. 
If the application 1024 wants to send 0-RTT data, it must indicate this to the implementation 1025 by setting the Idempotent send parameter to true when sending the 1026 data. In general, 0-RTT data may be replayed (for example, if a TCP 1027 SYN contains data, and the SYN is retransmitted, the data will be 1028 retransmitted as well), but racing means that different leaf nodes 1029 have the opportunity to send the same data independently. If data is 1030 truly idempotent, this should be permissible. 1032 Once the application has provided its 0-RTT data, an implementation 1033 should keep a copy of this data and provide it to each new leaf node 1034 that is started and for which a 0-RTT protocol is being used. 1036 It is also possible that protocol stacks within a particular leaf 1037 node use 0-RTT handshakes without any idempotent application data. 1038 For example, TCP Fast Open could use a Client Hello from TLS as its 1039 0-RTT data, shortening the cumulative handshake time. 1041 0-RTT handshakes often rely on previous state, such as TCP Fast Open 1042 cookies, previously established TLS tickets, or out-of-band 1043 distributed pre-shared keys (PSKs). Implementations should be aware 1044 of security concerns around using these tokens across multiple 1045 addresses or paths when racing. In the case of TLS, any given ticket 1046 or PSK should only be used on one leaf node. If implementations have 1047 multiple tickets available from a previous connection, each leaf node 1048 attempt must use a different ticket. In effect, each leaf node will 1049 send the same early application data, yet encoded (encrypted) 1050 differently on the wire. 1052 6. Implementing Maintenance 1054 Maintenance encompasses changes that the application can request to a 1055 Connection, or that a Connection can react to based on system and 1056 network changes. 1058 6.1. 
Managing Connections 1060 Appendix A.1 of [I-D.ietf-taps-minset] explains, using primitives 1061 from [RFC8303] and [RFC8304], how to implement changing some of the 1062 following protocol properties of an established connection with TCP 1063 and UDP. Below, we amend this description for other protocols (if 1064 applicable) and extend it with Connection Properties that are not 1065 contained in [I-D.ietf-taps-minset]. 1067 o Notification of excessive retransmissions: TODO 1069 o Retransmission threshold before excessive retransmission 1070 notification: TODO; for TCP, this can be done using ERROR.TCP 1071 described in section 4 of [RFC8303]. 1073 o Notification of ICMP soft error message arrival: TODO 1075 o Required minimum coverage of the checksum for receiving: for UDP- 1076 Lite, this can be done using the primitive 1077 SET_MIN_CHECKSUM_COVERAGE.UDP-Lite described in section 4 of 1078 [RFC8303]. 1080 o Priority (Connection): TODO; for SCTP, this can be done using the 1081 primitive CONFIGURE_STREAM_SCHEDULER.SCTP described in section 4 1082 of [RFC8303]. 1084 o Timeout for aborting Connection: for SCTP, this can be done using 1085 the primitive CHANGE_TIMEOUT.SCTP described in section 4 of 1086 [RFC8303]. 1088 o Connection group transmission scheduler: for SCTP, this can be 1089 done using the primitive SET_STREAM_SCHEDULER.SCTP described in 1090 section 4 of [RFC8303]. 1092 o Maximum message size concurrent with Connection establishment: 1093 TODO 1095 o Maximum Message size before fragmentation or segmentation: TODO 1097 o Maximum Message size on send: TODO 1099 o Maximum Message size on receive: TODO 1100 o Capacity Profile: TODO 1102 o Bounds on Send or Receive Rate: TODO 1104 o TCP-specific Property: User Timeout: for TCP, this can be 1105 configured using the primitive CHANGE_TIMEOUT.TCP described in 1106 section 4 of [RFC8303]. 
1108 It may happen that the application attempts to set a Protocol 1109 Property which does not apply to the actually chosen protocol. In 1110 this case, the implementation should fail gracefully, i.e., it may 1111 give a warning to the application, but it should not terminate the 1112 Connection. 1114 6.2. Handling Path Changes 1116 When a path change occurs, the Transport Services implementation is 1117 responsible for notifying Protocol Instances in the Protocol Stack. 1118 If the Protocol Stack includes a transport protocol that supports 1119 multipath connectivity, an update to the available paths should 1120 inform the Protocol Instance of the new set of paths that are 1121 permissible based on the Selection Properties passed by the 1122 application. A multipath protocol can establish new subflows over 1123 new paths, and should tear down subflows over paths that are no 1124 longer available. If the Protocol Stack includes a transport 1125 protocol that does not support multipath, but supports migrating 1126 between paths, the update to available paths can be used as the 1127 trigger for migrating the connection. For protocols that do not 1128 support multipath or migration, the Protocol Instances may be 1129 informed of the path change, but should not be forcibly disconnected 1130 if the previously used path becomes unavailable. An exception to 1131 this case is if the System Policy changes to prohibit traffic from 1132 the Connection based on its properties, in which case the Protocol 1133 Stack should be disconnected. 1135 7. Implementing Termination 1137 With TCP, when an application closes a connection, this means that it 1138 has no more data to send (but expects all data that has been handed 1139 over to be reliably delivered). However, with TCP only, "close" does 1140 not mean that the application will stop receiving data. This is 1141 related to TCP's ability to support half-closed connections.
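The TCP half-close behavior mentioned above can be observed with the standard sockets API. The following sketch is only an illustration using Python's socket module over the loopback interface; the function name half_close_demo and the payloads are made up for this example and are not part of any Transport Services API.

```python
import socket


def half_close_demo():
    """Show that a TCP endpoint can still receive after it stops sending."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5)
    server, _ = listener.accept()
    server.settimeout(5)
    try:
        client.sendall(b"request")
        # Close only the sending direction; this causes a FIN to be sent.
        client.shutdown(socket.SHUT_WR)
        received = server.recv(1024)  # data sent before the half-close
        eof = server.recv(1024)       # empty bytes: peer finished sending
        # The reverse direction is still open, so the server can reply.
        server.sendall(b"late reply")
        reply = client.recv(1024)
        return received, eof, reply
    finally:
        for sock in (client, server, listener):
            sock.close()
```

The client receives the reply even though it has already signalled that it will send no more data, which is exactly the semantics that a protocol-independent Close cannot rely on.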
1143 SCTP is an example of a protocol that does not support such half- 1144 closed connections. Hence, with SCTP, the meaning of "close" is 1145 stricter: an application has no more data to send (but expects all 1146 data that has been handed over to be reliably delivered), and will 1147 also not receive any more data. 1149 Implementing a protocol independent transport system means that the 1150 exposed semantics must be the strictest subset of the semantics of 1151 all supported protocols. Hence, as is common with all reliable 1152 transport protocols, after a Close action, the application can expect 1153 to have its reliability requirements honored regarding the data it 1154 has given to the Transport System, but it cannot expect to be able to 1155 read any more data after calling Close. 1157 Abort differs from Close only in that no guarantees are given 1158 regarding data that the application has handed over to the Transport 1159 System before calling Abort. 1161 As explained in Section 4.6, when a new stream is multiplexed on an 1162 already existing connection of a Transport Protocol Instance, there 1163 is no need for a connection establishment procedure. Because the 1164 Connections that are offered by the Transport System can be 1165 implemented as streams that are multiplexed on a transport protocol's 1166 connection, it can therefore not be guaranteed that one Endpoint's 1167 Initiate action provokes a ConnectionReceived event at its peer. 1169 For Close (provoking a Finished event) and Abort (provoking a 1170 ConnectionError event), the same logic applies: while it is desirable 1171 to be informed when a peer closes or aborts a Connection, whether 1172 this is possible depends on the underlying protocol, and no 1173 guarantees can be given. With SCTP, the transport system can use the 1174 stream reset procedure to cause a Finished event upon a Close action 1175 from the peer [NEAT-flow-mapping]. 1177 8.
Cached State 1179 Beyond a single Connection's lifetime, it is useful for an 1180 implementation to keep state and history. This cached state can help 1181 improve future Connection establishment due to re-using results and 1182 credentials, and favoring paths and protocols that performed well in 1183 the past. 1185 Cached state may be associated with different Endpoints for the same 1186 Connection, depending on the protocol generating the cached content. 1187 For example, session tickets for TLS are associated with specific 1188 endpoints, and thus should be cached based on a Connection's hostname 1189 Endpoint (if applicable). On the other hand, performance 1190 characteristics of a path are more likely tied to the IP address and 1191 subnet being used. 1193 8.1. Protocol state caches 1195 Some protocols will have long-term state to be cached in association 1196 with Endpoints. This state often has some time after which it is 1197 expired, so the implementation should allow each protocol to specify 1198 an expiration for cached content. 1200 Examples of cached protocol state include: 1202 o The DNS protocol can cache resolution answers (A and AAAA queries, 1203 for example), associated with a Time To Live (TTL) to be used for 1204 future hostname resolutions without requiring asking the DNS 1205 resolver again. 1207 o TLS caches session state and tickets based on a hostname, which 1208 can be used for resuming sessions with a server. 1210 o TCP can cache cookies for use in TCP Fast Open. 1212 Cached protocol state is primarily used during Connection 1213 establishment for a single Protocol Stack, but may be used to 1214 influence an implementation's preference between several candidate 1215 Protocol Stacks. For example, if two IP address Endpoints are 1216 otherwise equally preferred, an implementation may choose to attempt 1217 a connection to an address for which it has a TCP Fast Open cookie. 
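A protocol state cache with per-entry expiration, as described in Section 8.1, might be sketched as follows. The class and method names, the keying by (protocol, endpoint), and the injectable clock are assumptions made for illustration; a real implementation would also need persistence and locking.

```python
import time


class ProtocolStateCache:
    """Illustrative cache of protocol state keyed by (protocol, endpoint)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so that expiry can be tested
        self._entries = {}   # (protocol, endpoint) -> (value, expires_at)

    def put(self, protocol, endpoint, value, lifetime_s):
        # Each protocol chooses its own lifetime, e.g. a DNS TTL.
        expires_at = self._clock() + lifetime_s
        self._entries[(protocol, endpoint)] = (value, expires_at)

    def get(self, protocol, endpoint):
        key = (protocol, endpoint)
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired state is dropped lazily, on lookup.
            del self._entries[key]
            return None
        return value

    def flush(self):
        # Lets the application discard all cached protocol state.
        self._entries.clear()
```

For instance, a resolver could store addresses with put("dns", "www.example.com", ["192.0.2.1"], 300), and TCP could store a Fast Open cookie under ("tfo", "192.0.2.1") with its own lifetime.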
1219 Applications must have a way to flush protocol cache state if 1220 desired. This may be necessary, for example, if application-layer 1221 identifiers rotate and clients wish to avoid linkability via 1222 trackable TLS tickets or TFO cookies. 1224 8.2. Performance caches 1226 In addition to protocol state, Protocol Instances should provide data 1227 into a performance-oriented cache to help guide future protocol and 1228 path selection. Some performance information can be gathered 1229 generically across several protocols to allow predictive comparisons 1230 between protocols on given paths: 1232 o Observed Round Trip Time 1234 o Connection Establishment latency 1236 o Connection Establishment success rate 1238 These items can be cached on a per-address and per-subnet 1239 granularity, and averaged between different values. The information 1240 should be cached on a per-network basis, since it is expected that 1241 different network attachments will have different performance 1242 characteristics. Besides Protocol Instances, other system entities 1243 may also provide data into performance-oriented caches. This could 1244 for instance be signal strength information reported by radio modems 1245 like Wi-Fi and mobile broadband or information about the battery- 1246 level of the device. Furthermore, the system may cache the observed 1247 maximum throughput on a path as an estimate of the available 1248 bandwidth. 1250 An implementation should use this information, when possible, to 1251 determine preference between candidate paths, endpoints, and protocol 1252 options. Eligible options that historically had significantly better 1253 performance than others should be selected first when gathering 1254 candidates (see Section 4.1) to ensure better performance for the 1255 application. 1257 The reasonable lifetime for cached performance values will vary 1258 depending on the nature of the value. 
Certain information, like the 1259 connection establishment success rate to a Remote Endpoint using a 1260 given protocol stack, can be stored for a long period of time (hours 1261 or longer), since it is expected that the capabilities of the Remote 1262 Endpoint are not changing very quickly. On the other hand, Round 1263 Trip Time observed by TCP over a particular network path may vary 1264 over a relatively short time interval. For such values, the 1265 implementation should remove them from the cache more quickly, or 1266 treat older values with less confidence/weight. 1268 9. Specific Transport Protocol Considerations 1270 9.1. TCP 1272 Connection lifetime for TCP translates fairly simply into the 1273 abstraction presented to an application. When the TCP three-way 1274 handshake is complete, its layer of the Protocol Stack can be 1275 considered Ready (established). This event will cause racing of 1276 Protocol Stack options to complete if TCP is the top-level protocol, 1277 at which point the application can be notified that the Connection is 1278 Ready to send and receive. 1280 If the application sends a Close, that can translate to a graceful 1281 termination of the TCP connection, which is performed by sending a 1282 FIN to the remote endpoint. If the application sends an Abort, then 1283 the TCP state can be closed abruptly, leading to a RST being sent to 1284 the peer. 1286 Without a layer of framing (a top-level protocol in the established 1287 Protocol Stack that preserves message boundaries, or an application- 1288 supplied deframer) on top of TCP, the receiver side of the transport 1289 system implementation can only treat the incoming stream of bytes as 1290 a single Message, terminated by a FIN when the Remote Endpoint closes 1291 the Connection. 1293 9.2. UDP 1295 UDP as a direct transport does not provide any handshake or 1296 connectivity state, so the notion of the transport protocol becoming 1297 Ready or established is degenerate. 
Once the system has validated 1298 that there is a route on which to send and receive UDP datagrams, the 1299 protocol is considered Ready. Similarly, a Close or Abort has no 1300 meaning to the on-the-wire protocol, but simply leads to the local 1301 state being torn down. 1303 When sending and receiving messages over UDP, each Message should 1304 correspond to a single UDP datagram. The Message can contain 1305 metadata about the packet, such as the ECN bits applied to the 1306 packet. 1308 9.3. SCTP 1310 To support sender-side stream schedulers (which are implemented on 1311 the sender side), a receiver-side Transport System should always 1312 support message interleaving [RFC8260]. 1314 SCTP messages can be very large. To allow the reception of large 1315 messages in pieces, a "partial flag" can be used to inform a (native 1316 SCTP) receiving application that a message is incomplete. After 1317 receiving the "partial flag", this application would know that the 1318 next receive calls will only deliver remaining parts of the same 1319 message (i.e., no messages or partial messages will arrive on other 1320 streams until the message is complete) (see Section 8.1.20 in 1321 [RFC6458]). The "partial flag" can therefore facilitate the 1322 implementation of the receiver buffer in the receiving application, 1323 at the cost of limiting multiplexing and temporarily creating head- 1324 of-line blocking delay at the receiver. 1326 When a Transport System transfers a Message, it seems natural to map 1327 the Message object to SCTP messages in order to support properties 1328 such as "Ordered" or "Lifetime" (which maps onto partially reliable 1329 delivery with a SCTP_PR_SCTP_TTL policy [RFC6458]). 
However, since 1330 multiplexing of Connections onto SCTP streams may happen, and would 1331 be hidden from the application, the Transport System requires a per- 1332 stream receiver buffer anyway, so this potential benefit is lost and 1333 the "partial flag" becomes unnecessary for the system. 1335 The problem of long messages either requiring large receiver-side 1336 buffers or getting in the way of multiplexing is addressed by message 1337 interleaving [RFC8260], which is yet another reason why a receiver- 1338 side transport system supporting SCTP should implement this 1339 mechanism. 1341 9.4. TLS 1343 The mapping of a TLS stream abstraction into the application is 1344 equivalent to the contract provided by TCP (see Section 9.1). The 1345 Ready state should be determined by the completion of the TLS 1346 handshake, which involves potentially several more round trips beyond 1347 the TCP handshake. The application should not be notified that the 1348 Connection is Ready until TLS is established. 1350 9.5. HTTP 1352 HTTP requests and responses map naturally into Messages, since they 1353 are delineated chunks of data with metadata that can be sent over a 1354 transport. To that end, HTTP can be seen as the most prevalent 1355 framing protocol that runs on top of streams like TCP, TLS, etc. 1357 In order to use a transport Connection that provides HTTP Message 1358 support, the establishment and closing of the connection can be 1359 treated as it would without the framing protocol. Sending and 1360 receiving of Messages, however, change to treat each Message as a 1361 well-delineated HTTP request or response, with the content of the 1362 Message representing the body, and the Headers being provided in 1363 Message metadata. 1365 9.6. QUIC 1367 QUIC provides a multi-streaming interface to an encrypted transport. 
1368 Each stream can be viewed as equivalent to a TLS stream over TCP, so 1369 a natural mapping is to present each QUIC stream as an individual 1370 Connection. The protocol for the stream will be considered Ready 1371 whenever the underlying QUIC connection is established to the point 1372 that this stream's data can be sent. For streams after the first 1373 stream, this will likely be an immediate operation. 1375 Closing a single QUIC stream, presented to the application as a 1376 Connection, does not imply closing the underlying QUIC connection 1377 itself. Rather, the implementation may choose to close the QUIC 1378 connection once all streams have been closed (possibly after some 1379 timeout), or after an individual stream Connection sends an Abort. 1381 Messages over a direct QUIC stream should be represented similarly to 1382 the TCP stream (one Message per direction, see Section 9.1), unless a 1383 framing mapping is used on top of QUIC. 1385 9.7. HTTP/2 transport 1387 Similar to QUIC (Section 9.6), HTTP/2 provides a multi-streaming 1388 interface. This will generally use HTTP as the unit of Messages over 1389 the streams, in which each stream can be represented as a transport 1390 Connection. The lifetime of streams and the HTTP/2 connection should 1391 be managed as described for QUIC. 1393 It is possible to treat each HTTP/2 stream as a raw byte-stream 1394 instead of a carrier for HTTP messages, in which case the Messages 1395 over the streams can be represented similarly to the TCP stream (one 1396 Message per direction, see Section 9.1). 1398 10. Rendezvous and Environment Discovery 1400 The connection establishment process outlined in Section 4 is 1401 appropriate for client-server connections, but needs to be expanded 1402 in peer-to-peer Rendezvous scenarios, as follows: 1404 o Gathering Local Endpoint candidates 1406 The set of possible Local Endpoints is gathered. 
In the simple 1407 case, this merely enumerates the local interfaces and protocols, 1408 and allocates ephemeral source ports. For example, a system that has 1409 WiFi and Ethernet and supports IPv4 and IPv6 might gather four 1410 candidate locals (IPv4 on Ethernet, IPv6 on Ethernet, IPv4 on 1411 WiFi, and IPv6 on WiFi) that can form the source for a transient. 1413 If NAT traversal is required, the process of gathering Local 1414 Endpoints becomes broadly equivalent to the ICE candidate 1415 gathering phase [RFC5245]. The endpoint determines its server 1416 reflexive Local Endpoints (i.e., the translated address of a 1417 local, on the other side of a NAT) and relayed locals (e.g., via a 1418 TURN server or other relay), for each interface and network 1419 protocol. These are added to the set of candidate Local Endpoints 1420 for this connection. 1422 Gathering Local Endpoints is primarily a local operation, although 1423 it might involve exchanges with a STUN server to derive server 1424 reflexive locals, or with a TURN server or other relay to derive 1425 relayed locals. It does not involve communication with the Remote 1426 Endpoint. 1428 o Gathering Remote Endpoint Candidates 1430 The Remote Endpoint is typically a name that needs to be resolved 1431 into a set of possible addresses that can be used for 1432 communication. Resolving the Remote Endpoint is the process of 1433 recursively performing such name lookups, until fully resolved, to 1434 return the set of candidates for the remote of this connection. 1436 How this is done will depend on the type of the Remote Endpoint, 1437 and can also be specific to each Local Endpoint. A common case is 1438 when the Remote Endpoint is a DNS name, in which case it is 1439 resolved to give a set of IPv4 and IPv6 addresses representing 1440 that name. Some types of remote might require more complex 1441 resolution. 
Resolving the Remote Endpoint for a peer-to-peer 1442 connection might involve communication with a rendezvous server, 1443 which in turn contacts the peer to gain consent to communicate and 1444 retrieve its set of candidate locals, which are returned and form 1445 the candidate remote addresses for contacting that peer. 1447 Resolving the remote is _not_ a local operation. It will involve 1448 a directory service, and can require communication with the remote 1449 to rendezvous and exchange peer addresses. This can expose some 1450 or all of the candidate locals to the remote. 1452 o Establishing Connections 1454 The set of candidate Local Endpoints and the set of candidate 1455 Remote Endpoints are paired, to derive a priority ordered set of 1456 Candidate Paths that can potentially be used to establish a 1457 Connection. 1459 Then, communication is attempted over each candidate path, in 1460 priority order. If there are multiple candidates with the same 1461 priority, then connection establishment proceeds simultaneously 1462 and uses the transient that wins the race to be established. 1463 Otherwise, connection establishment is sequential, paced at a rate 1464 that should not congest the network. Depending on the chosen 1465 transport, this phase might involve racing TCP connections to a 1466 server over IPv4 and IPv6 [RFC8305], or it could involve a STUN 1467 exchange to establish peer-to-peer UDP connectivity [RFC5245], or 1468 some other means. 1470 o Confirming and Maintaining Connections 1472 Once connectivity has been established, unused resources can be 1473 released and the chosen path can be confirmed. This is primarily 1474 required when establishing peer-to-peer connectivity, where 1475 connections supporting relayed locals that were not required can 1476 be closed, and where an associated signalling operation might be 1477 needed to inform middleboxes and proxies of the chosen path. 
1478 Keep-alive messages may also be sent, as appropriate, to ensure 1479 NAT and firewall state is maintained, so the Connection remains 1480 operational. 1482 To support ICE, or similar protocols, that involve an out-of-band 1483 indirect signalling exchange to exchange candidates with the Remote 1484 Endpoint, it's important to be able to query the set of candidate 1485 Local Endpoints, and give the protocol stack a set of candidate 1486 Remote Endpoints, before it attempts to establish connections. 1488 (TO-DO: It is expected that a single abstract algorithm can be 1489 identified that supports both the peer-to-peer and client-server 1490 connection racing, allowing this text to be merged with Section 4) 1492 11. IANA Considerations 1494 RFC-EDITOR: Please remove this section before publication. 1496 This document has no actions for IANA. 1498 12. Security Considerations 1500 12.1. Considerations for Candidate Gathering 1502 Implementations should avoid downgrade attacks that allow network 1503 interference to cause the implementation to select less secure, or 1504 entirely insecure, combinations of paths and protocols. 1506 12.2. Considerations for Candidate Racing 1508 See Section 5.2 for security considerations around racing with 0-RTT 1509 data. 1511 An attacker that knows a particular device is racing several options 1512 during connection establishment may be able to block packets for the 1513 first connection attempt, thus inducing the device to fall back to a 1514 secondary attempt. This is a problem if the secondary attempts have 1515 worse security properties that enable further attacks. 1516 Implementations should ensure that all options have equivalent 1517 security properties to avoid incentivizing attacks. 
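One possible way to keep raced options equivalent in their security properties is to filter the candidate set before racing begins. The Python below is a minimal sketch under stated assumptions: the Candidate type and its integer security_level ranking are hypothetical, and keeping only the strongest level present is just one policy an implementation might choose.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    # Hypothetical ranking assigned by the implementation; higher is stronger.
    security_level: int


def raceable(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that share the strongest security level present.

    Blocking the first attempt then cannot push the connection onto a
    weaker option, because no weaker option takes part in the race.
    """
    if not candidates:
        return []
    strongest = max(c.security_level for c in candidates)
    return [c for c in candidates if c.security_level == strongest]
```

The relative order of the remaining candidates is preserved, so any preference ordering computed earlier still applies to the race.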
1519 Since results from the network can determine how a connection attempt 1520 tree is built, such as when DNS returns a list of resolved endpoints, 1521 it is possible for the network to cause an implementation to consume 1522 significant on-device resources. Implementations should limit the 1523 maximum amount of state allowed for any given node, including the 1524 number of child nodes, especially when the state is based on results 1525 from the network. 1527 13. Acknowledgements 1529 This work has received funding from the European Union's Horizon 2020 1530 research and innovation programme under grant agreement No. 644334 1531 (NEAT). 1533 This work has been supported by Leibniz Prize project funds of DFG - 1534 German Research Foundation: Gottfried Wilhelm Leibniz-Preis 2011 (FKZ 1535 FE 570/4-1). 1537 This work has been supported by the UK Engineering and Physical 1538 Sciences Research Council under grant EP/R04144X/1. 1540 Thanks to Stuart Cheshire, Josh Graessley, David Schinazi, and Eric 1541 Kinnear for their implementation and design efforts, including Happy 1542 Eyeballs, that heavily influenced this work. 1544 14. References 1546 14.1. Normative References 1548 [I-D.ietf-taps-arch] 1549 Pauly, T., Trammell, B., Brunstrom, A., Fairhurst, G., 1550 Perkins, C., Tiesel, P., and C. Wood, "An Architecture for 1551 Transport Services", draft-ietf-taps-arch-02 (work in 1552 progress), October 2018. 1554 [I-D.ietf-taps-interface] 1555 Trammell, B., Welzl, M., Enghardt, T., Fairhurst, G., 1556 Kuehlewind, M., Perkins, C., Tiesel, P., and C. Wood, "An 1557 Abstract Application Layer Interface to Transport 1558 Services", draft-ietf-taps-interface-02 (work in 1559 progress), October 2018. 1561 [I-D.ietf-taps-minset] 1562 Welzl, M. and S. Gjessing, "A Minimal Set of Transport 1563 Services for End Systems", draft-ietf-taps-minset-11 (work 1564 in progress), September 2018. 1566 [RFC6458] Stewart, R., Tuexen, M., Poon, K., Lei, P., and V. 
1567 Yasevich, "Sockets API Extensions for the Stream Control 1568 Transmission Protocol (SCTP)", RFC 6458, 1569 DOI 10.17487/RFC6458, December 2011, 1570 . 1572 [RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP 1573 Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014, 1574 . 1576 [RFC7540] Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext 1577 Transfer Protocol Version 2 (HTTP/2)", RFC 7540, 1578 DOI 10.17487/RFC7540, May 2015, 1579 . 1581 [RFC8260] Stewart, R., Tuexen, M., Loreto, S., and R. Seggelmann, 1582 "Stream Schedulers and User Message Interleaving for the 1583 Stream Control Transmission Protocol", RFC 8260, 1584 DOI 10.17487/RFC8260, November 2017, 1585 . 1587 [RFC8303] Welzl, M., Tuexen, M., and N. Khademi, "On the Usage of 1588 Transport Features Provided by IETF Transport Protocols", 1589 RFC 8303, DOI 10.17487/RFC8303, February 2018, 1590 . 1592 [RFC8304] Fairhurst, G. and T. Jones, "Transport Features of the 1593 User Datagram Protocol (UDP) and Lightweight UDP (UDP- 1594 Lite)", RFC 8304, DOI 10.17487/RFC8304, February 2018, 1595 . 1597 [RFC8305] Schinazi, D. and T. Pauly, "Happy Eyeballs Version 2: 1598 Better Connectivity Using Concurrency", RFC 8305, 1599 DOI 10.17487/RFC8305, December 2017, 1600 . 1602 [RFC8446] Rescorla, E., "The Transport Layer Security (TLS) Protocol 1603 Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018, 1604 . 1606 14.2. Informative References 1608 [I-D.ietf-quic-transport] 1609 Iyengar, J. and M. Thomson, "QUIC: A UDP-Based Multiplexed 1610 and Secure Transport", draft-ietf-quic-transport-18 (work 1611 in progress), January 2019. 1613 [NEAT-flow-mapping] 1614 "Transparent Flow Mapping for NEAT (in Workshop on Future 1615 of Internet Transport (FIT 2017))", n.d.. 
1617 [RFC5245] Rosenberg, J., "Interactive Connectivity Establishment 1618 (ICE): A Protocol for Network Address Translator (NAT) 1619 Traversal for Offer/Answer Protocols", RFC 5245, 1620 DOI 10.17487/RFC5245, April 2010, 1621 . 1623 [Trickle] "Trickle - Rate Limiting YouTube Video Streaming (ATC 1624 2012)", n.d.. 1626 Appendix A. Additional Properties 1628 This appendix discusses implementation considerations for additional 1629 parameters and properties that could be used to enhance transport 1630 protocol and/or path selection, or the transmission of messages given 1631 a Protocol Stack that implements them. These are not part of the 1632 interface, and may be removed from the final document, but are 1633 presented here to support discussion within the TAPS working group as 1634 to whether they should be added to a future revision of the base 1635 specification. 1637 A.1. Properties Affecting Sorting of Branches 1639 In addition to the Protocol and Path Selection Properties discussed 1640 in Section 4.3, the following properties under discussion can 1641 influence branch sorting: 1643 o Bounds on Send or Receive Rate: If the application indicates a 1644 bound on the expected Send or Receive bitrate, an implementation 1645 may prefer a path that can likely provide the desired bandwidth, 1646 based on cached maximum throughput, see Section 8.2. The 1647 application may know the Send or Receive Bitrate from metadata in 1648 adaptive HTTP streaming, such as MPEG-DASH. 1650 o Cost Preferences: If the application indicates a preference to 1651 avoid expensive paths, and some paths are associated with a 1652 monetary cost, an implementation should decrease the ranking of 1653 such paths. If the application indicates that it prohibits using 1654 expensive paths, paths that are associated with a cost should be 1655 purged from the decision tree. 
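The Cost Preferences behaviour above can be sketched in a few lines of Python. The CostPreference enumeration, the Path type with its boolean expensive flag, and the function sort_branches are hypothetical names introduced for illustration only; they do not appear in the interface specification.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class CostPreference(Enum):
    NONE = "none"          # application expressed no cost preference
    AVOID = "avoid"        # prefer to avoid expensive paths
    PROHIBIT = "prohibit"  # never use expensive paths


@dataclass(frozen=True)
class Path:
    name: str
    expensive: bool  # True if the path is associated with a monetary cost


def sort_branches(paths: List[Path], preference: CostPreference) -> List[Path]:
    if preference is CostPreference.PROHIBIT:
        # Purge paths with a cost from the decision tree.
        return [p for p in paths if not p.expensive]
    if preference is CostPreference.AVOID:
        # Stable sort: expensive paths move to the end, other order is kept.
        return sorted(paths, key=lambda p: p.expensive)
    return list(paths)
```

Because the sort is stable, an ordering already established by other properties survives among paths of equal cost status.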
1657 Authors' Addresses 1659 Anna Brunstrom (editor) 1660 Karlstad University 1661 Universitetsgatan 2 1662 651 88 Karlstad 1663 Sweden 1665 Email: anna.brunstrom@kau.se 1666 Tommy Pauly (editor) 1667 Apple Inc. 1668 One Apple Park Way 1669 Cupertino, California 95014 1670 United States of America 1672 Email: tpauly@apple.com 1674 Theresa Enghardt 1675 TU Berlin 1676 Marchstrasse 23 1677 10587 Berlin 1678 Germany 1680 Email: theresa@inet.tu-berlin.de 1682 Karl-Johan Grinnemo 1683 Karlstad University 1684 Universitetsgatan 2 1685 651 88 Karlstad 1686 Sweden 1688 Email: karl-johan.grinnemo@kau.se 1690 Tom Jones 1691 University of Aberdeen 1692 Fraser Noble Building 1693 Aberdeen, AB24 3UE 1694 UK 1696 Email: tom@erg.abdn.ac.uk 1698 Philipp S. Tiesel 1699 TU Berlin 1700 Marchstrasse 23 1701 10587 Berlin 1702 Germany 1704 Email: philipp@inet.tu-berlin.de 1705 Colin Perkins 1706 University of Glasgow 1707 School of Computing Science 1708 Glasgow G12 8QQ 1709 United Kingdom 1711 Email: csp@csperkins.org 1713 Michael Welzl 1714 University of Oslo 1715 PO Box 1080 Blindern 1716 0316 Oslo 1717 Norway 1719 Email: michawe@ifi.uio.no