Network Working Group                                         J. Iyengar
Internet-Draft                             Franklin and Marshall College
Intended status: Standards Track                             S. Cheshire
Expires: January 15, 2014                                   J. Graessley
                                                                   Apple
                                                           July 14, 2013


                         Minion - Wire Protocol
                    draft-iyengar-minion-protocol-01

Abstract

   Minion uses TCP-format packets on-the-wire, for compatibility with
   existing NATs, Firewalls, and similar middleboxes, but provides a
   richer set of facilities to the application, as described in the
   Minion Service Model document. This document specifies the details
   of the on-the-wire protocol used to provide those services.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on January 15, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

1. Conventions and Terminology Used in this Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   "Key words for use in RFCs to Indicate Requirement Levels" [RFC2119].

   This document uses terminology like "kernel" and "user-level", as
   those terms pertain to many of today's Unix-like operating systems.
   Equivalent concepts apply to software that is built using a different
   architectural model that may not include such an obvious kernel/user
   split.

2. Introduction

   Minion uses TCP-format packets on-the-wire, to provide full
   compatibility with existing NATs, Firewalls, and similar middleboxes,
   but provides a richer set of facilities to the application, described
   in the Minion Service Model and Conceptual API document [minserv].
   This document specifies the details of the on-the-wire protocol used
   to provide those services. Before reading this protocol
   specification document, familiarity with the Minion Service Model
   [minserv] is strongly recommended. That information is not repeated
   here.

   Minion runs over a standard TCP connection. Therefore, IP addresses
   and TCP ports are used just as they are with TCP [RFC0793].

   Minion is also designed to be able to use a modified TCP connection
   which supports out-of-order delivery, giving better low-latency
   performance on lossy networks, for use by the kinds of application
   that today would use UDP [RFC0768] to achieve low-latency delivery.
   The goal of providing low-latency delivery -- and consequently the
   need to be able to handle a data stream that may have gaps -- is
   reflected in various aspects of the Minion protocol design, such as
   the use of DTLS instead of TLS, and the use of Consistent Overhead
   Byte Stuffing [COBS] for reliably extracting messages from an
   incomplete data stream.
   Minion is able to take advantage of out-of-order delivery where the
   network stack offers that, but Minion does not require it. Minion
   still works correctly when the performance benefits of out-of-order
   delivery are not available.

   Minion supports messages of arbitrary size. Large messages are
   broken into chunks a little under 16 kilobytes each (the DTLS maximum
   record size, minus a few bytes for the Minion header). At the
   receiving end the Minion chunks are reassembled into Minion messages
   and delivered to the client application. Small messages are sent in
   a single Minion chunk.

   Normally messages are sent by the client as a single atomic unit, and
   delivered to the receiving client as a single atomic unit. For
   messages too large to fit conveniently in memory, the message may be
   built incrementally by the sender, and delivered to the receiving
   client incrementally, a chunk at a time.

   When a Minion message is complete, or has at least one maximum Minion
   chunk size of data accumulated, then if it is eligible to be sent
   according to the message ordering facilities offered by the Minion
   Service Model [minserv] (Sender Ordering, Receiver Ordering, and
   Chaining) a Minion chunk is generated.

   Each Minion chunk contains a Minion chunk header followed by the
   client's message data, as described in Section 3 "Minion Chunk
   Format".

   Each Minion chunk is encrypted using DTLS [RFC6347].

   Each encrypted DTLS payload is then framed using RECOBS, as described
   in Section 4 "Recursively Embeddable COBS", so that it begins with a
   00 byte and ends with an FF byte.

   The framed, encrypted chunk is then enqueued for transmission.

   If the kernel networking code supports multiple priorities, then the
   framed, encrypted chunk is placed in the transmission queue for the
   stated priority level.
   Any time the TCP congestion window and/or receive window rules allow
   more data to be sent, data is drawn from the highest-priority
   non-empty transmit buffer, assigned the next block of unused TCP
   sequence numbers, formed into a TCP segment, and transmitted on the
   wire. This just-in-time TCP sequencing mechanism has the effect of
   causing higher-priority data to be inserted right at the front of the
   conceptual combined transmit buffer, at the earliest possible byte
   boundary, unconstrained by message or chunk boundaries in the
   lower-priority messages. This is possible because the RECOBS framing
   is robust to pre-emption at any arbitrary byte boundary.

   Note that, when priorities are supported, chunks above the lowest
   priority MUST be delivered to the kernel in such a way that they are
   sent completely before the kernel resumes sending the lower-priority
   traffic. The RECOBS framing supports interrupting a lower-priority
   stream with a higher-priority chunk, but not alternating back and
   forth between two priority levels. Once a higher-priority chunk
   interrupts lower-priority traffic, the higher-priority chunk must be
   completed before the lower-priority traffic resumes. Typically this
   is easily achieved by delivering the chunk to the kernel atomically
   in a single write call.

2.1. Comparison of TCP and UDP NAT Traversal

   When connecting to a server with a globally routable address, TCP is
   generally preferable to UDP. TCP includes the SYN and FIN bits which
   tell a NAT gateway when a connection starts and ends. In particular,
   the FIN bit tells the NAT gateway when it can discard state related
   to that mapping.
   UDP has no defined connection start/end indicators, which means that
   unused UDP mappings are much more likely to accumulate, which means
   that NAT gateways tend to be more aggressive about timing out UDP
   mappings [Study], which means that clients using UDP need to be more
   aggressive about sending keepalive traffic, which is bad both for
   network efficiency and for battery life. Port Control Protocol (PCP)
   [RFC6887] offers some future hope of alleviating this problem by
   allowing clients to explicitly negotiate for longer mapping
   lifetimes, but PCP is not yet widely deployed. In the meantime, if
   use of UDP increases, NAT gateways are likely to be accumulating
   mappings even more rapidly, with no way to differentiate which are
   still required and which may be safely discarded, with the result
   that UDP mappings may have to be discarded even more aggressively.
   While a discarded UDP mapping can be recreated by another outgoing
   UDP packet, in the time between when the UDP mapping is discarded and
   then recreated, the client is cut off and unable to receive inbound
   communication from the server or peer at the other end. Therefore,
   we believe that it is preferable to use TCP where possible.

   However, when connecting to a peer which is itself also behind a NAT
   gateway, in the absence of PCP support [RFC6887], techniques like
   Interactive Connectivity Establishment (ICE) [RFC5245] are used, and
   research has shown that there are cases where ICE works for UDP but
   not for TCP [RFC5128].

   To accommodate both usage scenarios, Minion is generally used with
   standard TCP-format packets, but for peer-to-peer scenarios where TCP
   ICE is found not to work, Minion can be used encapsulated inside UDP
   [TCPoUDP] instead.

3. Minion Chunk Format

   A Minion Chunk begins with an eight-byte header, followed by the
   client's message data:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |C|    Code     |Pri|            This Minion Chunk ID           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Reserved   |RCP|         Referenced Minion Chunk ID        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                                                               :
   :                       Minion Chunk Data                       :
   :                                                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 1: Minion Chunk Format

   If the Complete ('C') bit is zero, this message is incomplete; the
   receiver should expect to receive additional continuation chunks for
   this message. If the Complete bit is one, this message is complete;
   there will be no subsequent continuation chunks for this message.

   The seven-bit chunk code identifies what type of chunk this is, as
   described below.

   The two-bit priority field indicates the priority level for this
   message, with 0 being the highest priority and 3 being the default
   (lowest-level) priority.

   Every Minion chunk has a Chunk ID. This is a 22-bit value assigned
   from a monotonically increasing 22-bit cyclic counter. This means
   that Chunk IDs are reused every 2^22 chunks. At any given moment in
   time though, only a small portion of the 22-bit ID space is actively
   in use, so Chunk IDs are not ambiguous. Each of the four priority
   levels has its own 22-bit Chunk ID space, i.e., Priority 1 Chunk 7
   and Priority 2 Chunk 7 are different chunks. Also, the Chunk ID
   spaces in opposite directions on a connection are separate. Each
   sender is responsible for selecting the Chunk IDs for the chunks it
   sends.

   In some cases it is useful to refer to messages by ID, and the terms
   "Message ID" and "Chunk ID" are sometimes used interchangeably.
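
   As an illustrative sketch only (not part of the protocol definition),
   the eight-byte header of Figure 1 could be packed and parsed as
   follows in Python. The sketch assumes that the two 32-bit words are
   carried in network byte order, which this document does not state
   explicitly. The names ChunkHeader, pack, unpack, and next_chunk_id
   are hypothetical.

```python
import struct
from dataclasses import dataclass

ID_MASK = (1 << 22) - 1  # Chunk IDs are 22-bit values


@dataclass
class ChunkHeader:
    complete: bool         # 'C' bit
    code: int              # seven-bit chunk code
    priority: int          # two-bit priority, 0 = highest, 3 = default
    chunk_id: int          # This Minion Chunk ID (22 bits)
    ref_priority: int = 0  # RCP: priority space of the referenced chunk
    ref_chunk_id: int = 0  # Referenced Minion Chunk ID (22 bits)

    def pack(self) -> bytes:
        """Build the eight-byte header; the Reserved field is sent as zero."""
        if not 0 <= self.code < 128:
            raise ValueError("chunk code must fit in seven bits")
        if not (0 <= self.priority < 4 and 0 <= self.ref_priority < 4):
            raise ValueError("priority must fit in two bits")
        word1 = (int(self.complete) << 31) | (self.code << 24) \
            | (self.priority << 22) | (self.chunk_id & ID_MASK)
        word2 = (self.ref_priority << 22) | (self.ref_chunk_id & ID_MASK)
        # ASSUMPTION: network (big-endian) byte order.
        return struct.pack("!II", word1, word2)

    @classmethod
    def unpack(cls, data: bytes) -> "ChunkHeader":
        """Parse the first eight bytes; the Reserved field is ignored."""
        word1, word2 = struct.unpack("!II", data[:8])
        return cls(
            complete=bool(word1 >> 31),
            code=(word1 >> 24) & 0x7F,
            priority=(word1 >> 22) & 0x3,
            chunk_id=word1 & ID_MASK,
            ref_priority=(word2 >> 22) & 0x3,
            ref_chunk_id=word2 & ID_MASK,
        )


def next_chunk_id(current: int) -> int:
    """Advance the 22-bit cyclic counter kept per priority and direction."""
    return (current + 1) & ID_MASK
```

   Because each priority level and each direction has its own Chunk ID
   space, a sender following this sketch would keep four such counters.
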
   For a message that is sent using a single chunk, the Message ID is
   the same as the Chunk ID. For a message that is sent using multiple
   chunks, the Message ID is the Chunk ID of the *final* chunk of the
   message. One implication of this is that a message's ID is undefined
   until the message is complete.

   Because Chunk IDs are eventually reused, issues of ID lifetime must
   be carefully considered in the Minion protocol design. For example,
   since a remote peer could, in principle, wait an arbitrarily long
   length of time before replying to a message, the Message ID of a
   request that is awaiting a response MUST NOT be reused until the
   response has been received, and the client has disposed of the
   request message. Otherwise, a reply could be ambiguous, if there
   were two outstanding request messages both using the same Message ID
   at the same time. Likewise, the last Chunk ID of an incomplete
   message MUST NOT be reused until some subsequent chunk has been added
   to that message, referencing the previous Chunk ID.

   The Reserved field MUST be set to zero on transmission, and MUST be
   ignored on reception.

   For chunk types that need to refer to some other chunk, the
   Referenced Minion Chunk Priority (RCP) and Referenced Minion Chunk ID
   fields identify the referenced chunk. Note that some chunk types
   refer to chunks going in the same direction (e.g., a continuation
   chunk) and some chunk types refer to chunks going in the reverse
   direction (e.g., a reply chunk). For chunk types that do not refer
   to any other chunk, these two fields MUST be set to zero on
   transmission, and MUST be ignored on reception.

   The Minion Chunk payload data follows the Minion Chunk Header.

   There is no explicit length field in the Minion Chunk Header, because
   the chunk length is determined implicitly in the RECOBS decoding
   step.

3.1. Minion Chunk Codes

   The seven-bit chunk code identifies what kind of chunk this is.
   There are 128 chunk codes available. The following nine chunk codes
   are currently defined:

   00 Continuation. This is a continuation of a previously incomplete
      message. The Referenced Minion Chunk ID identifies what
      previous chunk this is adding to. (If the Complete bit is one
      then this chunk is the final chunk and completes the message; no
      further chunks for this message will be arriving.)

   01 Cancellation. This is a cancellation of a previously incomplete
      message. The Referenced Minion Chunk ID identifies what
      previous chunk this is cancelling. In this case the Complete bit
      is unused; the Complete bit MUST be set to zero on transmission,
      and MUST be ignored on reception.

   02 Unordered Message. This chunk begins a new unordered message.
      The Referenced Minion Chunk ID is unused, and MUST be set to
      zero on transmission, and MUST be ignored on reception.

   03 Sender Ordered Message. This chunk begins a new Sender Ordered
      message. If received out-of-order, it should nonetheless be
      delivered immediately to the receiving client. The Referenced
      Minion Chunk ID is used to deduce the Sender Ordering that
      should be applied if the receiving client generates a reply to
      this message. If the received message identified by the
      Referenced Minion Chunk ID generated a reply A', then a reply to
      this message should have an automatic Sender Ordering dependency
      that it follow message A'.

   04 Receiver Ordered Message. This chunk begins a new Receiver
      Ordered message. This message is subject to Receiver Ordering;
      it MUST NOT be delivered to the receiving client until the
      message indicated by the Referenced Minion Chunk ID field has
      been delivered.
      If the receiving client generates a reply to this message, then
      the reply should have an automatic Receiver Ordering dependency
      that it follow the reply to the message indicated by the
      Referenced Minion Chunk ID field.

   05 Chained Message. This chunk begins a new message that chains on
      after a preceding message. The Referenced Minion Chunk ID
      identifies the preceding message. This message MUST NOT be
      delivered to the receiving client until the previous message of
      the chain has been delivered to the receiving client, and this
      message MUST be delivered to the receiving client in a manner
      that indicates to the client that it is related to the previous
      message.

   06 Reply/Acknowledge. This chunk begins a new message which is an
      explicit reply to a previously received message. The Referenced
      Minion Chunk ID identifies the received message to which this is
      a reply. A reply may be empty, in which case it serves as a
      simple acknowledgement that the request was received and
      accepted, or it may contain data. It is anticipated that future
      Minion protocol development will create additional Minion chunk
      codes to negotiate future protocol features. For these
      capability negotiation messages, an empty reply referencing the
      request serves as an acknowledgement that the requested protocol
      feature is supported.

   07 Reject. A Minion Reject code indicates that the referenced
      received message had an error or was not accepted for some other
      reason. A Reject Message may be empty, or may contain data
      giving information concerning the reason for the rejection. It
      is possible to reject an incomplete message that is still
      arriving, by sending a Reject referencing the most recent Chunk
      ID for that partial message. The sender will respond by sending
      a Cancellation for that message, confirming that no further
      chunks will be sent.
      When used for Minion protocol capability negotiation, a Reject
      message referencing the request indicates that the requested
      protocol feature is not supported.

   08 End Minion. It is anticipated that there will be existing
      application protocols that initially add Minion as an optional
      feature, which they use only when the remote peer indicates it
      also has Minion support, and otherwise they will communicate
      using the existing protocol without the Minion features. Such
      application protocols typically will first connect using their
      existing protocol, and then negotiate an "upgrade" to Minion
      framing. For symmetry, it would be good if such an "upgrade"
      were not an irreversible one-way path. We would like to offer
      the ability for applications to connect over raw TCP, switch to
      Minion for some message exchanges, and then drop back to raw TCP
      for some subsequent communication. This Minion chunk code
      exists to signal, "This is the final Minion-format message you
      will receive in this particular Minion session; after this
      you're on your own."

4. Recursively Embeddable COBS

   Consistent Overhead Byte Stuffing [COBS] allows complete messages to
   be reliably located within an incomplete data stream that may contain
   gaps.

   COBS works by transforming the payload data to eliminate all
   occurrences of zero bytes. This is like PPP byte stuffing, but more
   efficient; COBS has a worst-case data size overhead below 0.5%.
   Having created a zero-free payload, the payloads can then be
   concatenated into a single byte stream, separated by single zero
   bytes, and the zero bytes unambiguously mark the boundaries between
   payloads, because we know the payloads themselves no longer contain
   any zero bytes. At the receiving end the transformation is reversed
   to recreate the original payload data.

   The transformation process [COBS] is, in effect, a simple run length
   encoding.
   An extremely simplified summary of the original 1997 COBS encoding is
   as follows:

   o  If the payload begins with three nonzero bytes followed by a zero,
      then the output is the byte value 4 (the run length) followed by
      the three nonzero bytes, and the subsequent zero is skipped.

   o  If that is followed by fifty nonzero bytes followed by a zero,
      then the output is the byte value 51 (the run length) followed by
      the fifty nonzero bytes, and the subsequent zero is skipped.

   o  This process is repeated until the entire payload has been
      replaced by its zero-free equivalent.

   Recursively Embeddable COBS (RECOBS) is a derivative of the original
   1997 COBS encoding. RECOBS code bytes have the following meanings:

   00  New payload begins
   01  Represents a single zero byte
   02  Two bytes: a single nonzero byte, followed by a single zero byte
   03  Three bytes: two nonzero bytes, followed by a single zero byte
   n   Represents n bytes: n-1 nonzero bytes, followed by a zero byte
   FD  253 bytes: 252 nonzero bytes, followed by a single zero byte
   FE  253 bytes: 253 nonzero bytes, with *no* following zero byte
   FF  Payload ends

   This has the effect that, after encoding, every payload has
   unambiguous bookends; every payload begins with a single 00, and ends
   with a single FF. Using this encoding, recursive embedding becomes
   possible. At *any* point in the encoded byte stream it is now
   possible to interrupt the byte stream, insert a new RECOBS-encoded
   payload, and then resume the previous byte stream.

   At the receiving end, the decoder is part-way through decoding a
   payload when the interruption occurs. The decoder sees a 00, which
   is not legal in RECOBS-encoded data, so the decoder knows a new
   payload is beginning. Because the decoder has not yet seen the FF
   end-marker for the previous payload, it knows that payload is
   incomplete, so it saves its decoding state for later resumption.
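
   A minimal Python sketch of the code table and of the interruption
   handling described in this section follows. It rests on assumptions
   that this document does not spell out: that, as in the original COBS,
   the encoder appends a phantom zero to the payload which the decoder
   strips again, and that FF marks the end of a payload only where a
   code byte is expected, since FF can also occur as an ordinary nonzero
   data byte. The names recobs_encode and RecobsDecoder are
   hypothetical; the depth cap reflects the four-level embedding limit
   that Minion sets.

```python
MAX_DEPTH = 4  # base level plus three levels of nested interruption


def recobs_encode(payload: bytes) -> bytes:
    """Frame one payload as 00 <zero-free blocks> FF."""
    out = bytearray([0x00])           # 00: new payload begins
    run = bytearray()                 # nonzero bytes awaiting a code byte
    for byte in payload + b"\x00":    # ASSUMPTION: phantom trailing zero
        if byte == 0:
            out.append(len(run) + 1)  # codes 01..FD: run, then one zero
            out += run
            run.clear()
        else:
            run.append(byte)
            if len(run) == 253:
                out.append(0xFE)      # FE: 253 nonzero bytes, no zero
                out += run
                run.clear()
    out.append(0xFF)                  # FF: payload ends
    return bytes(out)


class RecobsDecoder:
    """Incremental decoder keeping one saved state per nesting level."""

    def __init__(self) -> None:
        # Each entry: [decoded output, data bytes left in block, zero follows]
        self._stack = []

    def feed(self, stream: bytes) -> list:
        """Consume bytes; return the payloads completed by this call."""
        done = []
        for byte in stream:
            if byte == 0x00:          # never legal inside a payload
                if len(self._stack) == MAX_DEPTH:
                    raise ValueError("RECOBS nesting deeper than four levels")
                self._stack.append([bytearray(), 0, False])
                continue
            if not self._stack:
                continue              # outside any payload: wait for a 00
            state = self._stack[-1]
            if state[1] > 0:          # inside a block: ordinary data byte
                state[0].append(byte)
                state[1] -= 1
                if state[1] == 0 and state[2]:
                    state[0].append(0)   # restore the zero the block implies
            elif byte == 0xFF:        # end marker, seen in code position
                self._stack.pop()     # the interrupted payload resumes
                done.append(bytes(state[0][:-1]))  # strip the phantom zero
            elif byte == 0xFE:
                state[1], state[2] = 253, False
            else:                     # codes 01..FD
                state[1], state[2] = byte - 1, True
                if byte == 0x01:
                    state[0].append(0)   # 01 stands for a lone zero byte
        return done
```

   With this sketch, feeding outer[:cut] + inner + outer[cut:] to a
   decoder, for any cut point inside the outer frame, yields the inner
   payload first and the outer payload once its own FF arrives.
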
   The decoder then proceeds to decode the embedded payload. When the
   decoder sees the FF end-marker for the embedded payload, it delivers
   that fully decoded payload to the waiting client, and then resumes
   its decoding of the previously interrupted payload.

   In principle this recursive embedding could be nested arbitrarily
   deeply, limited only by the amount of storage the decoder has
   available for partially-received payloads and their associated
   decoding state.

   In practice, Minion limits RECOBS embedding to four levels (the base
   level plus three levels of nested interruption) to establish a
   defined upper bound on the amount of storage required by a decoder.

5. Flow Control

   TCP [RFC0793] implements flow control in the form of the advertised
   receive window. This is to prevent a faster sender from overwhelming
   a slower receiver. Minion requires similar protection to prevent a
   slower receiver from running out of memory trying to buffer messages
   arriving faster than it can handle them.

   For a pure user-level library implementation of Minion, this is
   achieved by having the library set an upper bound on the amount of
   memory it will use for storing received messages that have not yet
   been handled by the client. Once this limit is met, the library
   ceases reading TCP data from the kernel, which causes the TCP receive
   window to fill up, which causes the sender to stop sending. Once the
   client consumes some messages, the library then reads more data from
   the kernel, the TCP receive window opens up, and the sender is
   permitted to send more data.

   However, this means that there is some duplication of buffering --
   the TCP receive window in the kernel and additional buffering in the
   user-level library.
   For this reason a kernel extension is proposed where a client (the
   Minion library in this case) can read data from the connection
   *without* raising the TCP receive window. In a sense it is reading
   the data "secretly", without admitting to the sender at the other end
   that it has been read. Those bytes, even though read into user
   space, are still counted against the TCP receive window. Later,
   after the client application has actually consumed the message,
   another kernel call is made to acknowledge consumption of those
   bytes, and the TCP receive window is raised.

   This mechanism integrates message-level flow control with TCP's
   byte-level flow control, rather than having two independent flow
   control mechanisms happening concurrently at different levels, in
   ways that might interact badly with each other.

   Note that the Minion protocol design will have to consider possible
   deadlock situations. For example, suppose one Minion host is
   refusing to consume any more Minion Chunks because it wishes to send
   a Reject message for them, but it cannot, because the peer's receive
   window is closed. Suppose also that the reason the peer's receive
   window is closed is because the peer also is sitting on a pile of
   unwanted Minion Chunks that it refuses to consume until it can send a
   Reject message for them. Possible deadlocks such as these need to be
   considered, and mechanisms to avoid them created.

6. Retransmission Policy

   One of the main arguments that is often presented to justify why a
   particular application protocol is built on UDP instead of TCP is
   that, "UDP is better for 'real time' applications." The supporting
   reasoning for this is often that, "TCP insists on continuing to
   retransmit data long after the client doesn't need any more."
   In truth the real problem is not retransmission; it is that the
   conventional TCP APIs don't allow received data to be delivered out
   of order. Suppose a TCP sender has 50 packets in flight at any given
   time (e.g., the bandwidth x delay product is 75 kB) then the loss of
   a single packet causes all 49 following packets to stall at the
   receiver because the API doesn't allow for them to be delivered to
   the client until the missing packet has been received.

   Minion solves this problem by allowing data to be delivered as it
   arrives, even if there are gaps. But the argument still remains that
   even after removing the ordering requirement at the receiver, it may
   still be a waste of bandwidth to retransmit data that will arrive too
   late to be useful. And indeed, it is possible with TCP to
   fraudulently acknowledge segments that were in fact not received, and
   this will cause the sender to not retransmit those segments.

   However, we chose not to use fraudulent acknowledgements to suppress
   retransmissions, because certain NATs, Firewalls and other
   middleboxes may block traffic if they observe implausible protocol
   actions which they find suspicious. One of the important goals of
   Minion is 100% compatibility with today's existing Internet devices,
   not 99% compatibility.

   We expect packet loss to be about 1% (at most a few percent) in a
   functioning network, and the cost of retransmitting those lost
   packets, even in the extreme case where *all* the retransmissions
   turn out to be unnecessary, is an overhead of about 1%. We argue
   that an overhead of about 1% is an acceptable price to pay in
   exchange for 100% compatibility with existing NATs, Firewalls and
   other middleboxes.

7. Optional Kernel Extensions

   While Minion can be implemented entirely as a user-level library
   built on top of existing standard networking APIs like BSD sockets,
   it can also benefit from some optional kernel extensions:

   Send Priorities
      Normal TCP APIs transmit data strictly in the order it is given to
      the kernel. The addition of priority support allows a sendmsg()
      call to be used in conjunction with cmsg ancillary data to
      indicate the priority level of the data. For normal applications
      this capability would be of little use because it would most
      likely result in corruption of the data stream, but it is useful
      with Minion because the RECOBS encoding is robust against message
      insertion at arbitrary byte boundaries. An alternative way to
      achieve a similar effect is, instead of buffering data in the
      kernel, to keep the data in the user-space library for as long as
      possible. When the TCP congestion window and/or receive window
      rules allow more data to be sent, the kernel generates some kind
      of upcall (e.g., a kevent notification) to the user-space library
      informing it of the ability to transmit, and the user-space
      library responds by selecting which particular block of data to
      hand to the kernel next.

   Just-In-Time Data Generation
      Through operational experience, we have learned (not that this was
      any great surprise) that excessive buffering in the kernel leads
      to poor behaviors. For example, two messages at the same priority
      level are not interleaved effectively if the first message is
      swallowed whole by the kernel, and held in kernel buffers, before
      the second message is even created. When that happens, the result
      is that the first message is sent in its entirety, followed by the
      second message in its entirety, with no interleaving.

      To prevent this unintended serialization, we need to avoid
      irrevocably handing off data to the kernel prematurely.
      We want to give the kernel enough data to keep the pipeline full
      (an amount equal to the connection's Bandwidth Delay Product) but
      no more.

      To this end, rather than having the kernel indicate that a socket
      is writable any time the kernel has space available to buffer more
      data, we'd like the kernel to indicate that a socket is writable
      only when TCP (according to its protocol rules, such as receive
      window, congestion window, and Nagle's Algorithm) would be willing
      to send data, but has no data available to send. When this
      situation occurs, the socket becomes writable, and the client (the
      user-level Minion library) is able to perform a just-in-time
      determination of what data ought to be sent next.

      This just-in-time data generation could be achieved in the BSD
      sockets API by adding a new socket option. When using this new
      socket option, a socket will only be writable when TCP is actively
      waiting for new data. If the context-switching latency or
      software overhead is such that it takes the user-level code a
      little too long to generate data strictly on demand, then a middle
      ground can be achieved by modifying the new socket option such
      that a socket will only be writable when the socket has less data
      buffered than it expects to need imminently. For example, a TCP
      connection in slow start expects it will need four TCP segments
      when the next ack arrives. When used this way, if an incoming ACK
      allows TCP to send out four segments then those four segments are
      already buffered and ready in the kernel, and the socket then
      becomes writable again to allow the user-level code to generate
      the next four segments, so that they will be ready and waiting the
      next time TCP is able to transmit additional segments.

      We are currently experimenting with just-in-time data generation.
      If it proves to be as effective as we hope, it might even work
      well enough to provide effective priority support too, eliminating
      the need for the "Send Priorities" kernel extension.

   Immediate Receive
      Normal TCP APIs deliver data only in TCP sequence number order.
      The addition of support for new cmsg ancillary data in the
      recvmsg() call allows the user-space library to request *any*
      available data, not only in-order data. The cmsg ancillary data
      returned from the recvmsg() call indicates to the user-space
      library where in the TCP sequence space this particular block of
      data lies. A setsockopt() option (or equivalent) is also required
      to put the socket into this "Immediate Receive" mode, to inform
      the kernel that the client will accept out-of-order data on this
      socket, and therefore the client should be notified (via select(),
      kevent(), etc.), not only when there is in-order data available to
      be read, but also when there is out-of-order data available to be
      read.

   Integrated Receive Window
      Normal TCP APIs raise the receive window any time data is read out
      of the kernel into user space. The addition of new cmsg ancillary
      data in the recvmsg() call allows the user-space library to
      request that the kernel return received data *without* reflecting
      this in its receive window calculation. After the client
      application has consumed the message data from the user-space
      Minion library, the Minion library makes a subsequent recvmsg()
      call with appropriate cmsg ancillary data to inform the kernel how
      many bytes to add back into its receive window. In essence, the
      receive window boundary is stretched outside the kernel to account
      for data held by *both* the kernel *and* the user-space Minion
      library.

   These optional kernel extensions are a key part of what makes Minion
   compelling.
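
   Several of the extensions listed above rely on cmsg ancillary data
   passed through sendmsg() or recvmsg(). As an illustration only, the
   following C sketch shows how a user-space library might attach a
   priority level to outgoing data using the standard CMSG macros. The
   names MINION_CMSG_LEVEL, MINION_CMSG_PRIORITY, minion_ctl,
   minion_prepare_send() and minion_get_priority() are hypothetical:
   this document does not define any cmsg level, type, or value, and no
   existing kernel is assumed to understand them.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* HYPOTHETICAL constants: not defined by this document or any kernel. */
#define MINION_CMSG_LEVEL    SOL_SOCKET  /* assumed cmsg level */
#define MINION_CMSG_PRIORITY 0x4d01      /* assumed cmsg type  */

/* Control buffer for one int-sized cmsg, aligned for struct cmsghdr. */
union minion_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Fill in msg so it carries `data` plus one cmsg holding `priority`. */
static void minion_prepare_send(struct msghdr *msg, struct iovec *iov,
                                union minion_ctl *ctl,
                                void *data, size_t len, int priority)
{
    struct cmsghdr *cm;

    memset(msg, 0, sizeof(*msg));
    memset(ctl, 0, sizeof(*ctl));

    iov->iov_base = data;
    iov->iov_len = len;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    msg->msg_control = ctl->buf;
    msg->msg_controllen = sizeof(ctl->buf);

    cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = MINION_CMSG_LEVEL;
    cm->cmsg_type = MINION_CMSG_PRIORITY;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &priority, sizeof(priority));
}

/* Read the priority back from a prepared msghdr; -1 if none present. */
static int minion_get_priority(struct msghdr *msg)
{
    struct cmsghdr *cm;

    for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
        if (cm->cmsg_level == MINION_CMSG_LEVEL &&
            cm->cmsg_type == MINION_CMSG_PRIORITY) {
            int p;
            memcpy(&p, CMSG_DATA(cm), sizeof(p));
            return p;
        }
    }
    return -1;
}
```

   A caller would pass the prepared msghdr to sendmsg(). A kernel
   without the extension would be expected to reject or ignore the
   unknown cmsg type, so a library using this approach would need a
   fallback path that sends the data without ancillary data.
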
   Minion can be adopted today by any application, using Minion as a
   purely user-space library. Such an application performs as well as
   any application can when it is built on top of standard TCP.
   However, unlike an application built on top of standard TCP, Minion
   offers the promise of future kernel support for even better
   performance. Any given application with its own application-specific
   protocol is unlikely to receive special kernel support to make just
   that one application work better. But when many applications all use
   the Minion protocol, it then becomes reasonable to add kernel support
   to improve all of those applications.

8. TCP Deviations

   When implemented entirely as a user-level library, Minion naturally
   adheres to the TCP specifications (insofar as the underlying
   operating system adheres to the TCP specifications) because Minion is
   merely using the operating system's networking APIs.

   When optional kernel extensions are in use, they may allow Minion to
   deviate from classical TCP protocol rules. One such instance of this
   deviation has already been identified. The TCP protocol rules allow
   a sender to send a FIN to end a connection, and then follow it with
   additional data bytes (with higher TCP sequence numbers, so that they
   fall later in the data stream) which the receiver is expected to
   discard because it recognizes that they fall after the FIN in the
   data stream. When out-of-order delivery is enabled, it's possible
   that if the TCP segment containing the FIN is lost or delayed, then
   subsequent TCP segments containing data bytes could be incorrectly
   delivered to the client application, when the TCP protocol rules
   dictate that they should have been discarded. The ability to send
   data following the FIN that the receiver is expected to discard is
   incompatible with out-of-order delivery.
   Note that this is referring to data that follows the FIN in TCP
   sequence number space, not data that follows the FIN in transmission
   order. If, after the FIN has been sent, previously transmitted data
   is lost and needs to be retransmitted, then this does not cause any
   problems; the bytes in such retransmitted TCP segments fall *before*
   the FIN in TCP sequence number space, not after. As a result of this
   observation, TCP's protocol rules, when used with Minion traffic, are
   effectively modified as follows:

   o  A client using Minion MUST NOT send new data on a connection after
      that connection has been closed (i.e., a FIN indication has been
      sequenced and sent).

   In reality we do not expect this to be a major burden to TCP
   implementations. We are not aware of TCP implementations that send
   data after a connection is closed and then rely on the receiver to
   discard that data.

9. IANA Considerations

   No IANA actions are required by this document.

10. Security Considerations

   We take security seriously. As this work develops, this section will
   contain details of any known security issues and possible
   mitigations.

11. Acknowledgements

   Many thanks to Bryan Ford, Padma Bhooma and Anumita Biswas for their
   contributions to the development of Minion.

   Thanks to Joe Touch for pointing out that Minion restricts TCP's
   ability to send data, after a connection is closed, that will then be
   ignored by the receiver.

12. References

12.1. Normative References

   [COBS]     Cheshire, S. and M. Baker, "Consistent Overhead Byte
              Stuffing", September 1997.

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC6347]  Rescorla, E. and N.
              Modadugu, "Datagram Transport Layer Security Version
              1.2", RFC 6347, January 2012.

   [minserv]  Iyengar, J., "Minion - Service Model and Conceptual API",
              draft-iyengar-minion-concept-00 (work in progress),
              June 2013.

12.2. Informative References

   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
              August 1980.

   [RFC5128]  Srisuresh, P., Ford, B., and D. Kegel, "State of Peer-to-
              Peer (P2P) Communication across Network Address
              Translators (NATs)", RFC 5128, March 2008.

   [RFC5245]  Rosenberg, J., "Interactive Connectivity Establishment
              (ICE): A Protocol for Network Address Translator (NAT)
              Traversal for Offer/Answer Protocols", RFC 5245,
              April 2010.

   [RFC6887]  Wing, D., Cheshire, S., Boucadair, M., Penno, R., and P.
              Selkirk, "Port Control Protocol (PCP)", RFC 6887,
              April 2013.

   [Study]    Hatonen, S., Nyrhinen, A., Eggert, L., Strowes, S.,
              Sarolahti, P., and M. Kojo, "An Experimental Study of Home
              Gateway Characteristics", September 1997.

   [TCPoUDP]  Cheshire, S., Graessley, J., and S. Cheshire,
              "Encapsulation of TCP and other Transport Protocols over
              UDP", draft-cheshire-tcp-over-udp-00 (work in progress),
              June 2013.

Authors' Addresses

   Janardhan Iyengar
   Franklin and Marshall College
   Mathematics and Computer Science
   PO Box 3003
   Lancaster, Pennsylvania 17604-3003
   USA

   Phone: +1 717 358 4774
   Email: janardhan.iyengar@fandm.edu

   Stuart Cheshire
   Apple Inc.
   1 Infinite Loop
   Cupertino, California 95014
   USA

   Phone: +1 408 974 3207
   Email: cheshire@apple.com

   Josh Graessley
   Apple Inc.
   1 Infinite Loop
   Cupertino, California 95014
   USA

   Phone: +1 408 974 5710
   Email: jgraessley@apple.com