TAPS Working Group                                             P. Tiesel
Internet-Draft                                               T. Enghardt
Intended status: Informational                                 TU Berlin
Expires: July 7, 2018                                   January 03, 2018


 A Socket Intents Prototype for the BSD Socket API - Experiences, Lessons
                       Learned and Considerations
             draft-tiesel-taps-socketintents-bsdsockets-01

Abstract

   This document describes a prototype implementation of Socket Intents
   [I-D.tiesel-taps-socketintents] for the BSD Socket API as an
   illustrative example of how Socket Intents could be implemented.  It
   describes the experience gained with the prototype and the lessons
   learned from trying to extend the BSD Socket API.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).
   Note that other groups may also distribute working documents as
   Internet-Drafts.  The list of current Internet-Drafts is at
   https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on July 7, 2018.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Prototype Architecture
   3.  Multiple Access Manager
     3.1.  Policy
     3.2.  Path characteristics data collectors
   4.  Socket Intents Representation
   5.  The Socket Intents API Variants
     5.1.  Classic API / muacc_context
       5.1.1.  muacc_getaddrinfo()
       5.1.2.  muacc_socket()
       5.1.3.  muacc_setsockopt()
       5.1.4.  muacc_connect()
       5.1.5.  muacc_close()
     5.2.  Classic API / getaddrinfo
     5.3.  Socketconnect API
   6.  API Implementation Experiences & Lessons Learned
     6.1.  The Missing Link to Name Resolution
     6.2.  File Descriptors Considered Harmful
     6.3.  Asynchronous API Anarchy
     6.4.  Here Be Dragons hiding in Shadow Structures
   7.  Conclusion
   8.  Acknowledgments
   9.  References
     9.1.  Informative References
     9.2.  URIs
   Appendix A.  API Usage Examples
     A.1.  Usage Example of the Classic / muacc_context API
     A.2.  Usage Example of the Classic / getaddrinfo API
     A.3.  Usage Example of the Socketconnect API
   Appendix B.  Changes
     B.1.  Since -00
   Authors' Addresses

1.  Introduction

   With the proliferation of devices that have multiple paths to the
   Internet and an increasing number of transport protocols available,
   the number of transport options to serve a communication unit
   explodes.  Implementing a heuristic or strategy for choosing from
   this overwhelming set of transport options in each application puts a
   huge burden on the application developer.  Thus, the decisions
   regarding all transport options mentioned so far should be supported
   and, if requested by the application, automated within the transport
   layer.

   Socket Intents [I-D.tiesel-taps-socketintents] allow an application
   to express what it knows, assumes, expects or wants to prioritize
   regarding its own network communication.  This information can then
   be used by the OS to perform destination selection, path selection
   and transport protocol stack instance selection.

   Our Socket Intents prototype for the BSD Socket API is a first
   attempt to automate transport option selection within the OS.  It is
   primarily targeted at path and destination address selection and
   tries to be as close as possible to the semantics of the BSD Socket
   API.  The prototype mostly excludes the problem of transport protocol
   stack instance selection, which is discussed in more detail in
   [I-D.tiesel-taps-communitgrany].

   We implemented the prototype as a wrapper for the BSD Socket API that
   communicates with a central Multiple Access Manager, which makes the
   actual decisions and can optimize across applications.  The whole
   implementation comprises about 15k lines of C code.  The code is
   available on GitHub [1] under a BSD License.

   This document describes our Socket Intents prototype for the BSD
   Socket API.  It details important aspects of the implementation and
   the API variants we developed over time based on lessons learned.
   Finally, it summarizes these lessons and points out why the BSD
   Socket API is not particularly well suited to integrate automated
   transport protocol stack instance selection.  Furthermore, it
   describes the limitations for destination address and path selection
   within the BSD Socket API.

2.  Prototype Architecture

   The Socket Intents prototype consists of the following components,
   also shown in Figure 1:

   o  The Socket Intents API, a BSD Socket API wrapper for applications
      to use, including a representation of the actual Socket Intents.

   o  The Socket Intents Library, which implements the Socket Intents
      API.  It sends requests to the Multiple Access Manager, e.g.,
      before establishing a connection, and gets back a response
      regarding what interface to use.

   o  The Multiple Access Manager (MAM), a daemon which gets informed
      about all application requests and has knowledge of the available
      network interfaces.

   o  The Policy, a dynamically loaded library hosted by the MAM.  It
      chooses which of the available interfaces to use based on the
      available knowledge about them and the Socket Intents.

   o  Data collectors that reside inside the MAM and provide
      information such as bandwidth usage, smoothed RTT estimates and
      RSSI for wireless links to the policy.

   +------------------------+
   |      Application       |
   |                        |                   +-------------------+
   +-{ Socket Intents API }-+   (MAM Request)   |  Multiple Access  |
   |                        | ----------------> |      Manager      |
   |     Socket Intents     |  (MAM Response)   | +---------------+ |
   |        Library         | <---------------- | |    Policy     | |
   +------------------------+                   | +---------------+ |
   |      BSD Sockets       |                   | |Data Collectors| |
   +------------------------+                   +-+---------------+-+

           Figure 1: Components of the Socket Intents Prototype

3.  Multiple Access Manager

   The Multiple Access Manager (MAM) is the central transport option
   selection instance on a host.  It is realized as a daemon that runs
   in userspace and receives requests from each application that uses
   the Socket Intents Library.

   The MAM hosts the Policy, which is the actual decision-making
   component, e.g., deciding which source address and therefore which
   source interface to use.
   Upon events, such as an application requesting to resolve a name or
   to connect a socket (see Section 5 for details), the Socket Intents
   Library issues a MAM request and the MAM invokes a callback to the
   policy (see Section 3.1 for details).  The policy can either
   communicate its decision right away or defer it, e.g., when it has
   to wait for the results of name resolution.  The results and
   decisions are communicated back to the Socket Intents Library
   through the MAM response, where they are applied to the actual
   socket, see also Figure 1.

   To support the policy, the MAM maintains a list of IP prefixes that
   are configured on the local interfaces and available for outgoing
   communications.  As destination address selection and path selection
   are highly dependent on each other, the MAM integrates DNS resolution
   and maintains separate resolver configurations per prefix (see
   [ANRW17-MH] for further discussion on multiple PvDs and DNS
   resolution).  Furthermore, the MAM includes data collectors which
   periodically gather statistics on the available paths, see
   Section 3.2 for details.

3.1.  Policy

   In the Socket Intents prototype, the Policy to select among the
   available transport options is hosted by the MAM, see Figure 1.  We
   implement different interchangeable policies as dynamically loaded
   libraries.  In our current implementation, only one policy can be
   active at a given time.  When launching the MAM, the user has to
   choose a policy and supply a policy configuration, which can contain
   arbitrary data.

   Examples of policy configuration include:

   o  A list of IP prefixes configured on local interfaces to consider
      as source for the communication

   o  Name server(s) to use for each of the IP prefixes

   o  Preferences to instrument the policy

   The policy is initialized with this configuration and then waits for
   the callback of an incoming MAM request.

   Upon a callback, the policy can use information from the MAM request,
   such as Socket Intents, and information available within the MAM,
   such as recently measured path characteristics (see Section 3.2), to
   make decisions.

   Policy decisions can include:

   o  The source address(es) used for name resolution

   o  How to order the results of name resolution (i.e., preferring
      certain IP addresses over others)

   o  Picking an IP protocol version

   o  Picking a transport protocol

   o  Setting socket options (e.g., disabling the TCP Nagle algorithm)

   o  Choosing a source address for the outgoing communication

   o  Reusing a socket from a given socket set (only for the API variant
      described in Section 5.3)

   Note that in our current implementation, the policy is a piece of
   code which can in principle execute arbitrary instructions.  We
   assume this is acceptable for an experimental platform but would
   prefer an abstract description like a domain-specific language for a
   production system.

3.2.  Path characteristics data collectors

   The data collectors are implemented as a component of the MAM, within
   a callback that is executed periodically, e.g., every 100 ms.  When
   this callback is invoked, the MAM passively gathers statistics about
   the current usage and properties of the available local interfaces
   and stores them in per-interface or per-network-prefix data
   structures.
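
   A periodic collector callback of this kind might be sketched as
   follows.  This is not the prototype's code: struct iface_stats,
   iface_stats_update() and the smoothing weight alpha are hypothetical
   names, and an exponentially weighted moving average is assumed
   purely for illustration.  The caller is assumed to read the
   interface's cumulative receive byte counter from the operating
   system and to pass the length of the callback period.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-interface record; all field names are illustrative. */
struct iface_stats {
    int      initialized;      /* 0 until the first counter sample is seen */
    uint64_t last_rx_bytes;    /* cumulative counter at the previous callback */
    double   rx_rate;          /* bytes per second within the last period */
    double   rx_rate_smoothed; /* smoothed bytes per second (assumed EWMA) */
};

/*
 * Update the record from a new cumulative byte counter sample.
 * period_s is the time since the previous callback in seconds (e.g. 0.1);
 * alpha in (0, 1] is an assumed smoothing weight for the newest sample.
 */
void iface_stats_update(struct iface_stats *s, uint64_t rx_bytes_now,
                        double period_s, double alpha)
{
    assert(s != NULL && period_s > 0.0 && alpha > 0.0 && alpha <= 1.0);

    if (!s->initialized) {
        /* First sample: only remember the counter, no rate is known yet. */
        s->last_rx_bytes = rx_bytes_now;
        s->initialized = 1;
        return;
    }

    /* Unsigned subtraction stays correct if the 64-bit counter wraps. */
    uint64_t delta = rx_bytes_now - s->last_rx_bytes;
    s->last_rx_bytes = rx_bytes_now;

    s->rx_rate = (double)delta / period_s;
    s->rx_rate_smoothed = alpha * s->rx_rate
                        + (1.0 - alpha) * s->rx_rate_smoothed;
}
```

   Because the function only reads counters that the operating system
   maintains anyway, it stays passive in the sense used above.  A
   policy callback would then read the stored fields instead of
   measuring anything itself.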

   Measured properties include:

   o  Minimum Smoothed Round Trip Time (SRTT) of current TCP connections
      using a network prefix, as an estimate for last-mile latency

   o  Transmitted and received bytes per second over an interface within
      the last callback period, as an estimate for current utilization

   o  Smoothed transmitted and received bytes per second over an
      interface, as an estimate for recent utilization

   o  Maximum transmitted and received bytes per second over an
      interface within the last 5 minutes, as an estimate for maximum
      available bandwidth

   o  On 802.11 interfaces, the Received Signal Strength Indicator
      (RSSI) of the last received frame on that interface, as an
      estimate for reception strength

   o  On 802.11 interfaces, the modulation rate of the last received and
      the last transmitted unicast data frame on that interface, as an
      estimate for the available data transmission rate on the first hop

   When a policy callback is invoked, the policy can use the latest
   measured properties to guide its decisions, see Section 3.1.

   Note that we do not perform active measurements from within the MAM
   to avoid overhead.

4.  Socket Intents Representation

   As described in [I-D.tiesel-taps-socketintents], Socket Intents are
   pieces of information about upcoming traffic.  An application can
   share the information that it has available through the Socket
   Intents API.

   In our implementation, Socket Intents are represented as socket
   options for get/setsockopt on their own socket option level
   (SOL_INTENTS).

   For some of the API variants, we had to introduce socket option
   lists, i.e., data structures that can hold multiple socket options
   and therefore multiple Socket Intents.

   Which of these representations is actually used depends on the API
   variant, see Section 5.

5.  The Socket Intents API Variants

   The Socket Intents API is a wrapper around the BSD Socket API.  It
   sends requests to the Multiple Access Manager (MAM) at certain
   events, e.g., before a connection is established, and applies the
   suggestions that it gets from the MAM, e.g., to bind to a certain
   local interface or to set a certain socket option.

   There exist different variants of this API that try to fit different
   concepts:

   o  The Classic API with muacc_context, see Section 5.1, attempts to
      stick as closely as possible to the call sequence of BSD Sockets.

   o  The second variant of the classic API does all transport option
      selection in "getaddrinfo", see Section 5.2.  This variant tries
      to simplify the implementation without deviating too much from
      the usage of BSD Sockets.  It minimizes the changes to the BSD
      Socket API, but adds additional overhead to the application.

   o  The "socketconnect" API, see Section 5.3, tries to automate as
      much functionality as possible and adds support for automating
      connection caching.  It replaces the usual sequence of BSD Socket
      API calls with a single call.

5.1.  Classic API / muacc_context

   In the first variant, we add a parameter called "muacc_context" to
   the BSD Socket API calls and to getaddrinfo.  This parameter holds
   properties provided by the socket calls and retains them across
   function calls to enable automation of the connection properties by
   our Socket Intents Prototype.  The shadow data structures behind the
   "muacc_context" parameter are initialized by the API wrapper at the
   time of the first call (which we assume to be muacc_getaddrinfo most
   of the time) with most of their fields empty.  Each subsequent call
   to our modified Socket API then fills in more data.

   Properties include:

   o  Socket file descriptor

   o  API calls that were already performed on this context

   o  domain, type, and protocol of the socket

   o  remote hostname

   o  remote address

   o  hints for resolving the remote address

   o  local address to bind to that the application requested

   o  local address to bind to that the MAM suggested

   o  current socket options that were set

   o  socket options suggested by MAM

5.1.1.  muacc_getaddrinfo()

   This function resolves a host name or service to an addrinfo data
   structure, usually containing an IP address and port.  Internally,
   the Socket Intents prototype sends a "getaddrinfo" request to the
   MAM, which should do the name resolution.  It can, e.g., resolve the
   name over multiple available interfaces at the same time, and then
   order the results according to a policy decision, or only return
   results obtained over a specific interface.

   SIGNATURE:

   int muacc_getaddrinfo(muacc_context_t *ctx, const char *hostname,
   const char *servname, const struct addrinfo *hints, struct addrinfo
   **res)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection
      and retains them across function calls.  This function is mostly
      called with an empty context, which is then filled within the
      function.

   hostname:  Remote host name to be resolved

   servname:  Remote service to be resolved

   hints:  Hints for resolving the name

   res:  Data structure for result of name resolution

   RETURN VALUE:

   Returns 0 on success, or an error code as provided by getaddrinfo().

5.1.2.  muacc_socket()

   This function creates a socket file descriptor just like the regular
   socket call.

   SIGNATURE:

   int muacc_socket(muacc_context_t *ctx, int domain, int type, int
   protocol)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection
      and retains them across function calls.  This function is mostly
      called after muacc_getaddrinfo(), since domain, type, and protocol
      can depend on the type of resolved address.

   domain:  Domain of the socket

   type:  Type of the socket

   protocol:  Protocol of the socket

   RETURN VALUE:

   Returns a file descriptor of the new socket on success, or -1 on
   failure.

5.1.3.  muacc_setsockopt()

   This call sets socket options (including Socket Intents).  For
   Socket Intents, this function can be called with a valid
   "muacc_context" and an invalid file descriptor (-1) to provide
   additional hints to "muacc_getaddrinfo()".

   SIGNATURE:

   int muacc_setsockopt(muacc_context_t *ctx, int socket, int level, int
   option_name, const void *option_value, socklen_t option_len)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection
      and retains them across function calls.  This function is mostly
      called to set Intents as socket options within the context.

   socket:  Socket file descriptor

   level:  Level of the socket option to set

   option_name:  Name of the socket option to set

   option_value:  Value of the socket option to set

   option_len:  Length of the socket option to set

   RETURN VALUE:

   Returns 0 on success, or -1 on failure.

5.1.4.  muacc_connect()

   Like the regular connect call, but also binds to the source address
   selected by the Socket Intents Policy and applies socket options
   suggested by the Socket Intents Policy.

   SIGNATURE:

   int muacc_connect(muacc_context_t *ctx, int socket, const struct
   sockaddr *address, socklen_t address_len)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection
      and retains them across function calls.  This function is mostly
      called after all Socket Intents for this connection have been set
      via muacc_setsockopt().

   socket:  Socket file descriptor

   address:  Remote address to connect to

   address_len:  Length of the remote address

   RETURN VALUE:

   Returns 0 on success, or -1 on failure.

5.1.5.  muacc_close()

   Like the regular close call, but also cleans up the state held in
   the shadow structures behind "muacc_context".

   SIGNATURE:

   int muacc_close(muacc_context_t *ctx, int socket)

   ARGUMENTS:

   ctx:  Context that can contain properties of this socket/connection
      and retains them across function calls.  This function
      deinitializes and releases the context.

   socket:  Socket file descriptor

   RETURN VALUE:

   Returns 0 on success, or -1 on failure.

5.2.  Classic API / getaddrinfo

   In this variant, Socket Intents are passed directly to
   "getaddrinfo()" as part of the "hints" parameter.  The name
   resolution is done by the MAM, which makes all decisions and stores
   them in the "result" data structure as a list of options ordered by
   preference.  Subsequently, applications can use this information for
   calls to the unmodified BSD Socket API or other APIs.  We provide
   helpers to apply all socket options from the "result" data structure.

   All relevant information is stored in our addrinfo struct (see
   Figure 2).

   SIGNATURE:

   int muacc_ai_getaddrinfo(const char *hostname, const char *service,
   const struct muacc_addrinfo *hints, struct muacc_addrinfo **result)

   ARGUMENTS:

   hostname:  Remote host name to be resolved

   service:  Remote service to be resolved

   hints:  Hints for resolving the name.  Contents include family,
      socket type, protocol, socket options (including Socket Intents
      for this socket/connection), and the local address to bind to.

   result:  Data structure for result of name resolution

   RETURN VALUE:

   Returns 0 on success, or an error code as provided by getaddrinfo().

   /** Extended version of the standard library's struct addrinfo
    *
    * This is used both as hint and as result from the
    * muacc_ai_getaddrinfo function.  This structure
    * differs from struct addrinfo only in the three members
    * ai_sockopts, ai_bindaddrlen and ai_bindaddr.
    */
   struct muacc_addrinfo {
       int ai_flags;
       int ai_family;
       int ai_socktype;
       int ai_protocol;

       /** Not included in struct addrinfo.  Purpose:
        *  1. If the structure is given to muacc_ai_getaddrinfo
        *     as hints, you set socket intents that influence MAM's
        *     source and destination as well as transport protocol
        *     selection.
        *  2. The socket options recommended by the MAM are returned
        *     through this attribute.
        */
       struct socketopt *ai_sockopts;

       int ai_addrlen;
       struct sockaddr *ai_addr;
       char *ai_canonname;

       /** Not included in struct addrinfo.
        *  Length of ai_bindaddr.
        */
       int ai_bindaddrlen;
       /** Not included in struct addrinfo.
        *  Contains the address that the MAM recommends us to bind to.
        */
       struct sockaddr *ai_bindaddr;

       struct muacc_addrinfo *ai_next;
   };

            Figure 2: Definition of the muacc_addrinfo struct

   Appendix A.2 shows an example usage of the classic API with most
   functionality in getaddrinfo.

5.3.  Socketconnect API

   In this API variant, we move the functionality of resolving a
   hostname and connecting to the resulting address into one function
   called "socketconnect()".  This API makes it possible to call
   "socketconnect()" not only for each connection, but also to
   multiplex messages across multiple existing sockets.

   This function returns a file descriptor of a connected socket for the
   application to use.  This socket can either be a newly created one or
   a socket that existed previously and is now being reused.
   Furthermore, a socket can belong to a socket set, i.e., a set of
   sockets with a common destination and service.
   These sockets may, e.g., be bound to different local addresses, but
   are treated as interchangeable by the API implementation.  So if the
   application passes a socket file descriptor to this function, it may
   get back a different file descriptor to a socket from the same set,
   e.g., to use the connection over a different local interface for its
   following communication.

   SIGNATURE:

   int socketconnect(int *socket, const char *host, size_t hostlen,
   const char *serv, size_t servlen, struct socketopt *sockopts, int
   domain, int type, int proto)

   ARGUMENTS:

   socket:  Existing socket file descriptor as representative of a
      socket set, "-1" to create a new socket, or "0" to automatically
      try to find a suitable socket set

   host:  Remote hostname to be resolved

   hostlen:  Length of remote hostname

   serv:  Remote service or port

   servlen:  Length of remote service

   sockopts:  List of socket options, including Socket Intents

   domain:  Domain of the socket

   type:  Type of the socket

   proto:  Protocol of the socket

   RETURN VALUE:

   Returns 0 on success if the socket is from an existing socket set, 1
   on success if the socket was newly created, or -1 on failure.

   Appendix A.3 shows an example usage of the Socketconnect API.

6.  API Implementation Experiences & Lessons Learned

   While designing and implementing the different parts of the system as
   described in this document, we faced several challenges.  In the
   Multiple Access Manager, discovering the currently available paths
   and statistics about their performance turned out to be quite complex
   and had to be implemented in a partially platform-dependent way.
   However, the most challenging parts were the Socket Intents API and
   Library, on which we focus in the following sections.

6.1.  The Missing Link to Name Resolution

   Transport option selection is most useful if crucial information,
   such as Socket Intents or other socket options, is available as early
   as possible, i.e., for name resolution.  The primary problem here is
   the order of the function calls that are involved in name resolution,
   destination selection, protocol selection, and path selection, and
   how they are linked.

   In the classic BSD Socket API, most functions either take a socket
   file descriptor as argument or return it, and thus link different
   function calls to the same flow.  However, "getaddrinfo()" is not
   linked to a socket file descriptor, and it is typically called before
   the socket is created.  At this point, it is not yet possible to set
   a socket option, because the socket does not exist yet.

   Consequently, across BSD Socket API calls, several choices are being
   made before it is possible to set a Socket Intent: A call to
   "getaddrinfo()" returns a linked list of "addrinfo" structs, where
   each entry contains an "ai_family" (IP version), the pair of
   "ai_socktype" and "ai_protocol" (transport protocol), and a
   "sockaddr" struct containing an IP address and port to connect to.
   Then a socket of the given family, type, and protocol is created.
   Only after this has been done can socket options be set on the
   socket, but at this point destination, IP version, and transport
   protocol are already fixed.  Before calling "connect()", only the
   path to be used (i.e., the local address to bind to) can still be
   chosen, but the available paths and which one to prefer may be
   constrained by the choice of destination.

   The three variants described in Section 5 work around this problem in
   different ways:

   o  The approach in Section 5.2 places the whole automation of
      transport option selection into the "getaddrinfo()" function.
      The results are returned in an extended "addrinfo" struct and have
      to be applied manually by the application, including binding to a
      source address representing the selected path and applying all
      socket options provided in a list, for each connection attempt.

   o  The approach in Section 5.1 adds a context to all socket- and name
      resolution-related API calls.

   o  The approach in Section 5.3 puts all functionality into one call.

   All of these approaches add the missing link between name resolution
   and the other parts of the API, but add a lot of state keeping either
   to the API, which the application developer has to manage, or to the
   Socket Intents library.

6.2.  File Descriptors Considered Harmful

   When using BSD sockets, file descriptors are the abstraction for
   network flows.  Depending on the transport protocol used, their
   semantics change and these file handles represent streams
   (SOCK_STREAM), associations (SOCK_DGRAM) or network interfaces
   (SOCK_RAW).  This does not provide a unified API, but is merely an
   artifact of squeezing networking into the "Everything is a file" UNIX
   philosophy.

   File descriptors are not a good abstraction for automated protocol
   stack instance selection, as applications have to adapt to changed
   semantics, e.g., whether message boundaries are preserved, depending
   on the transport protocol chosen.

   File descriptors are not a good abstraction for destination instance
   selection and path selection either.  Once a socket has been created,
   its protocol stack instance is fixed, so selecting a path by binding
   to a local address and connecting to a destination instance is now
   only possible using this protocol stack instance.  If such a
   connection attempt fails, it is possible to retry using another path
   and destination, but changing the protocol stack instance requires
   creating a new socket with a different file descriptor.
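
   The need for a fresh file descriptor per protocol stack instance can
   be illustrated with plain BSD Socket calls.  The following is a
   minimal sketch, not code from the prototype: the function name
   connect_any() is hypothetical, and only the standard getaddrinfo(),
   socket(), connect() and close() calls are used.  Each candidate
   returned by name resolution may differ in family, type or protocol,
   so the loop has to create a new socket for every attempt.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: try every candidate returned by getaddrinfo()
 * until one connects.  Returns a connected file descriptor, or -1 if
 * resolution fails or no candidate could be connected.
 */
int connect_any(const char *host, const char *service)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    assert(host != NULL && service != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* allow both IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM; /* stream semantics only */

    if (getaddrinfo(host, service, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        /* Family, type and protocol are fixed at socket creation time,
         * so every candidate needs its own socket. */
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                   /* connected: keep this descriptor */
        close(fd);                   /* failed: discard and try the next */
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;
}
```

   A caller would use it as "int fd = connect_any(host, service);" and
   must not assume anything about the descriptor's number or address
   family, because both depend on which candidate succeeded.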
719 For further discussion of other asynchronous I/O weirdness with file 720 descriptors see end of Section 6.3. 722 6.3. Asynchronous API Anarchy 724 Network I/O is asynchronous, but asynchronous I/O within the POSIX 725 filesystem API is hard to use. There are at least three different 726 asynchronous I/O APIs for each operating system. 728 To implement asynchronous I/O for our Socket Intents prototype, we 729 wrapped one of the asynchronous I/O APIs that is available on most 730 platforms: "select()". To make Socket Intents accessible to more 731 applications and on more platforms, a production-grade system would 732 need to wrap all asynchronous I/O APIs and implement most of the 733 socket creation logic, path selection and connection logic within 734 these wrappers. However, mixing asynchronous I/O and multithreading 735 may lead to unintuitive behavior, e.g., calling our prototype's 736 select() from different threads could lead to anything from deadlocks 737 to busy waiting. 739 Another issue is that we use Unix domain sockets to communicate 740 between our Multiple Access Manager and the Socket Intents API 741 library called by the application, so we need to make sure that the 742 application does not block on communication with the Multiple Access 743 Manager. 745 Also the problems with using file descriptors get even worse. If a 746 Socket API call should return immediately, it needs to provide the 747 application with a reference to a flow that has not yet been fully 748 set up, i.e., a reference to a "future" socket. An implementation of 749 such an asynchronous API has to return an unconnected socket file 750 descriptor, on which the application then calls, e.g., "select()", 751 and starts using it once it becomes readable and writable. If the 752 destination, path and transport protocol have not been chosen yet at 753 this point, the file descriptor returned by the implementation might 754 not yet have the final family and transport protocol. 
When the implementation later creates the final socket of the right type, it can rebind it to the number of the originally returned file descriptor using "dup2". This procedure can easily lead to time-of-check to time-of-use confusion. To make things even worse, the application can copy the "future" file descriptor using "dup", which is rarely useful for sockets, but in combination with file descriptors used as "futures", it leads to unexpected behavior.

6.4. Here Be Dragons hiding in Shadow Structures

The API variants described in Section 5.3 and Section 5.1 need to keep a lot of state in shadow structures that cannot be passed between the Socket API calls otherwise. This state needs to be cleaned up when the last copy of the file descriptor is closed or the last socket held for reuse has timed out. In addition, access to these shadow structures has to be thread-safe.

Implementing both cleanup and thread safety has turned out to be extremely error-prone, and there is a large amount of unspecified behavior and platform-dependent extensions in the system library. These issues guarantee that an implementation of transport option selection that nicely integrates with BSD Sockets will come with lots of limitations and will not be portable across POSIX-compliant operating systems.

7. Conclusion

Adding transport option selection to BSD Sockets is hard, as the API calls are not designed to defer making and applying choices until the moment when all information needed for transport option selection is available.

If transport option selection is limited to the granularity BSD Sockets typically provide today (TCP connections and UDP associations), the API variant described in Section 5.2 seems to be a good compromise, even though it forces the application to try all candidates itself (either sequentially or partially in parallel).
This option is easily deployable, but does not include automation of techniques like connection caching or HTTP pipelining.

The most versatile API variant, described in Section 5.3, implements connection caching on the transport layer. This comes at the cost of heavily modifying existing applications. If feasible, given the unnecessary complexity of the file I/O integration of BSD sockets, it seems easier to move to a totally different system like [I-D.trammell-taps-post-sockets].

8. Acknowledgments

The API variant described in Section 5.2 was originally drafted and implemented by Tobias Kaiser <mail@tb-kaiser.de> [2] as part of his BA thesis.

This work has been supported by Leibniz Prize project funds of DFG - German Research Foundation: Gottfried Wilhelm Leibniz-Preis 2011 (FKZ FE 570/4-1).

9. References

9.1. Informative References

[ANRW17-MH] Tiesel, P., May, B., and A. Feldmann, "Multi-Homed on a Single Link", Proceedings of the 2016 workshop on Applied Networking Research Workshop - ANRW 16, DOI 10.1145/2959424.2959434, 2016.

[I-D.tiesel-taps-communitgrany] Tiesel, P. and T. Enghardt, "Communication Units Granularity Considerations for Multi-Path Aware Transport Selection", draft-tiesel-taps-communitgrany-01 (work in progress), October 2017.

[I-D.tiesel-taps-socketintents] Tiesel, P., Enghardt, T., and A. Feldmann, "Socket Intents", draft-tiesel-taps-socketintents-01 (work in progress), October 2017.

[I-D.trammell-taps-post-sockets] Trammell, B., Perkins, C., Pauly, T., Kuehlewind, M., and C. Wood, "Post Sockets, An Abstract Programming Interface for the Transport Layer", draft-trammell-taps-post-sockets-03 (work in progress), October 2017.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997.

[RFC6824] Ford, A., Raiciu, C., Handley, M., and O. Bonaventure, "TCP Extensions for Multipath Operation with Multiple Addresses", RFC 6824, DOI 10.17487/RFC6824, January 2013.

[RFC7413] Cheng, Y., Chu, J., Radhakrishnan, S., and A. Jain, "TCP Fast Open", RFC 7413, DOI 10.17487/RFC7413, December 2014.

[RFC7556] Anipko, D., Ed., "Multiple Provisioning Domain Architecture", RFC 7556, DOI 10.17487/RFC7556, June 2015.

9.2. URIs

[1] https://github.com/fg-inet/socket-intents/

[2] mailto:mail@tb-kaiser.de

Appendix A. API Usage Examples

A.1. Usage Example of the Classic / muacc_context API

In this example, a client application sets up a connection to a remote host and sends data to it. It specifies two Socket Intents for this connection: the Category of Bulk Transfer and the File Size of 1 MB.

    #define LENGTH_OF_DATA 1048576

    // Create and initialize a context to retain information across function
    // calls
    muacc_context_t ctx;
    muacc_init_context(&ctx);

    int socket = -1;

    struct addrinfo *result = NULL;

    // Initialize a buffer of data to send later.
    char buf[LENGTH_OF_DATA];
    memset(buf, 0, LENGTH_OF_DATA);

    // Set Socket Intents for this connection. Note that the "socket" is
    // still invalid, but it does not yet need to exist at this time. The
    // Socket Intents prototype just sets the Intent within the
    // muacc_context data structure.
    enum intent_category category = INTENT_BULKTRANSFER;
    muacc_setsockopt(&ctx, socket, SOL_INTENTS,
        INTENT_CATEGORY, &category, sizeof(enum intent_category));

    int filesize = LENGTH_OF_DATA;
    muacc_setsockopt(&ctx, socket, SOL_INTENTS,
        INTENT_FILESIZE, &filesize, sizeof(int));

    // Resolve a host name.
    // This involves a request to the MAM, which can
    // automatically choose a suitable local interface or other parameters
    // for the DNS request and set other parameters, such as preferred
    // address family or transport protocol.
    muacc_getaddrinfo(&ctx, "example.org", NULL, NULL, &result);

    // Create the socket with the address family, type, and protocol
    // obtained by getaddrinfo.
    socket = muacc_socket(&ctx, result->ai_family, result->ai_socktype,
        result->ai_protocol);

    // Connect the socket to the remote endpoint as determined by
    // getaddrinfo. This involves another request to the MAM, which may at
    // this point, e.g., choose to bind the socket to a local IP address
    // before connecting it.
    muacc_connect(&ctx, socket, result->ai_addr, result->ai_addrlen);

    // Send data to the remote host over the socket.
    write(socket, buf, LENGTH_OF_DATA);

    // Close the socket. This de-initializes any data that was stored within
    // the muacc_context.
    muacc_close(&ctx, socket);

A.2. Usage Example of the Classic / getaddrinfo API

As in Appendix A.1, the application sets the Intents "Category" and "File Size".

    #define LENGTH_OF_DATA 1048576

    // Define Intents to be set later
    enum intent_category category = INTENT_BULKTRANSFER;
    int filesize = LENGTH_OF_DATA;

    struct socketopt intents = { .level = SOL_INTENTS,
        .optname = INTENT_CATEGORY, .optval = &category, .next = NULL};
    struct socketopt filesize_intent = { .level = SOL_INTENTS,
        .optname = INTENT_FILESIZE, .optval = &filesize, .next = NULL};

    intents.next = &filesize_intent;

    // Initialize a buffer of data to send later.
    char buf[LENGTH_OF_DATA];
    memset(buf, 0, LENGTH_OF_DATA);

    struct muacc_addrinfo intent_hints = { .ai_flags = 0,
        .ai_family = AF_INET, .ai_socktype = SOCK_STREAM, .ai_protocol = 0,
        .ai_sockopts = &intents, .ai_addr = NULL, .ai_addrlen = 0,
        .ai_bindaddr = NULL, .ai_bindaddrlen = 0, .ai_next = NULL };

    struct muacc_addrinfo *result = NULL;

    muacc_ai_getaddrinfo("example.org", NULL, &intent_hints,
        &result);

    // Create and connect the socket, using the information obtained through
    // getaddrinfo
    int fd;
    fd = socket(result->ai_family, result->ai_socktype,
        result->ai_protocol);
    muacc_ai_simple_connect(fd, result);

    // Send data to the remote host over the socket, then close it.
    write(fd, buf, LENGTH_OF_DATA);
    close(fd);

    muacc_ai_freeaddrinfo(result);

A.3. Usage Example of the Socketconnect API

As in Appendix A.1, the application sets the Intents "Category" and "File Size". As we provide "-1" as the socket, we do not reuse existing connections.

    #define LENGTH_OF_DATA 1048576

    // Define Intents to be set later
    enum intent_category category = INTENT_BULKTRANSFER;
    int filesize = LENGTH_OF_DATA;

    struct socketopt intents = { .level = SOL_INTENTS,
        .optname = INTENT_CATEGORY, .optval = &category, .next = NULL};
    struct socketopt filesize_intent = { .level = SOL_INTENTS,
        .optname = INTENT_FILESIZE, .optval = &filesize, .next = NULL};

    intents.next = &filesize_intent;

    // Initialize a buffer of data to send later.
    char buf[LENGTH_OF_DATA];
    memset(buf, 0, LENGTH_OF_DATA);

    int socket = -1;

    // Get a socket that is connected to the given host and service,
    // with the given Intents
    socketconnect(&socket, "example.org", 11, "80", 2, &intents, AF_INET,
        SOCK_STREAM, 0);

    // Send data to the remote host over the socket.
    write(socket, buf, LENGTH_OF_DATA);

    // Close the socket and tear down the data structure kept for it
    // in the library
    socketclose(socket);

Appendix B. Changes

B.1. Since -00

o Fixed authors' affiliations and funding

o Fixed acknowledgments

Authors' Addresses

Philipp S. Tiesel
TU Berlin
Marchstr. 23
Berlin
Germany

Email: philipp@inet.tu-berlin.de

Theresa Enghardt
TU Berlin
Marchstr. 23
Berlin
Germany

Email: theresa@inet.tu-berlin.de