idnits 2.17.1 

draft-iyengar-minion-protocol-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  ** The document is more than 15 pages and seems to lack a Table of Contents.


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (July 14, 2013) is 3936 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Possible downref: Non-RFC (?) normative reference: ref. 'COBS'

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  ** Obsolete normative reference: RFC 6347 (Obsoleted by RFC 9147)

  -- Obsolete informational reference (is this intentional?): RFC 5245
     (Obsoleted by RFC 8445, RFC 8839)


     Summary: 3 errors (**), 0 flaws (~~), 1 warning (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                         J. Iyengar
3	Internet-Draft                             Franklin and Marshall College
4	Intended status: Standards Track                             S. Cheshire
5	Expires: January 15, 2014                                   J. Graessley
6	                                                                   Apple
7	                                                           July 14, 2013

9	                         Minion - Wire Protocol
10	                    draft-iyengar-minion-protocol-01

12	Abstract

14	   Minion uses TCP-format packets on-the-wire, for compatibility with
15	   existing NATs, Firewalls, and similar middleboxes, but provides a
16	   richer set of facilities to the application, as described in the
17	   Minion Service Model document.  This document specifies the details
18	   of the on-the-wire protocol used to provide those services.

20	Status of this Memo

22	   This Internet-Draft is submitted in full conformance with the
23	   provisions of BCP 78 and BCP 79.

25	   Internet-Drafts are working documents of the Internet Engineering
26	   Task Force (IETF).  Note that other groups may also distribute
27	   working documents as Internet-Drafts.  The list of current Internet-
28	   Drafts is at http://datatracker.ietf.org/drafts/current/.

30	   Internet-Drafts are draft documents valid for a maximum of six months
31	   and may be updated, replaced, or obsoleted by other documents at any
32	   time.  It is inappropriate to use Internet-Drafts as reference
33	   material or to cite them other than as "work in progress."

35	   This Internet-Draft will expire on January 15, 2014.

37	Copyright Notice

39	   Copyright (c) 2013 IETF Trust and the persons identified as the
40	   document authors.  All rights reserved.

42	   This document is subject to BCP 78 and the IETF Trust's Legal
43	   Provisions Relating to IETF Documents
44	   (http://trustee.ietf.org/license-info) in effect on the date of
45	   publication of this document.  Please review these documents
46	   carefully, as they describe your rights and restrictions with respect
47	   to this document.  Code Components extracted from this document must
48	   include Simplified BSD License text as described in Section 4.e of
49	   the Trust Legal Provisions and are provided without warranty as
50	   described in the Simplified BSD License.

52	1.  Conventions and Terminology Used in this Document

54	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
55	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
56	   "OPTIONAL" in this document are to be interpreted as described in
57	   "Key words for use in RFCs to Indicate Requirement Levels" [RFC2119].

59	   This document uses terminology like "kernel" and "user-level", as
60	   those terms pertain to many of today's Unix-like operating systems.
61	   Equivalent concepts apply to software that is built using a different
62	   architectural model that may not include such an obvious kernel/user
63	   split.

65	2.  Introduction

67	   Minion uses TCP-format packets on-the-wire, to provide full
68	   compatibility with existing NATs, Firewalls, and similar middleboxes,
69	   but provides a richer set of facilities to the application, described
70	   in the Minion Service Model and Conceptual API document [minserv].
71	   This document specifies the details of the on-the-wire protocol used
72	   to provide those services.  Before reading this protocol
73	   specification document, familiarity with the Minion Service Model
74	   [minserv] is strongly recommended.  That information is not repeated
75	   here.

77	   Minion runs over a standard TCP connection.  Therefore, IP addresses
78	   and TCP ports are used just as they are with TCP [RFC0793].

80	   Minion is also designed to be able to use a modified TCP connection
81	   which supports out-of-order delivery, giving better low-latency
82	   performance on lossy networks, for use by the kinds of application
83	   that today would use UDP [RFC0768] to achieve low-latency delivery.
84	   The goal of providing low-latency delivery -- and consequently the
85	   need to be able to handle a data stream that may have gaps -- is
86	   reflected in various aspects of the Minion protocol design, such as
87	   the use of DTLS instead of TLS, and the use of Consistent Overhead
88	   Byte Stuffing [COBS] for reliably extracting messages from an
89	   incomplete data stream.  Minion is able to take advantage of out-of-
90	   order delivery where the network stack offers that, but Minion does
91	   not require it.  Minion still works correctly when the performance
92	   benefits of out-of-order delivery are not available.

94	   Minion supports messages of arbitrary size.  Large messages are
95	   broken into chunks a little under 16 kilobytes each (the DTLS maximum
96	   record size, minus a few bytes for Minion header).  At the receiving
97	   end the Minion chunks are reassembled into Minion messages and
98	   delivered to the client application.  Small messages are sent in a
99	   single Minion chunk.
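
   The chunking rule above can be illustrated with a minimal Python
   sketch.  It is not part of the protocol.  It assumes that the DTLS
   record limit is 2^14 bytes of plaintext and that exactly the
   eight-byte chunk header of Section 3 is subtracted from it; the
   names split_message and MAX_CHUNK_DATA are hypothetical.

```python
from typing import Iterator, Tuple

DTLS_MAX_RECORD = 2 ** 14   # assumed DTLS plaintext record limit, in bytes
MINION_HEADER_LEN = 8       # eight-byte Minion chunk header (Section 3)
MAX_CHUNK_DATA = DTLS_MAX_RECORD - MINION_HEADER_LEN  # assumption: 16376


def split_message(message: bytes) -> Iterator[Tuple[bool, bytes]]:
    """Yield (complete, data) pairs; complete is True only on the last chunk."""
    if not message:
        # An empty message still occupies one (complete) chunk.
        yield True, b""
        return
    for start in range(0, len(message), MAX_CHUNK_DATA):
        data = message[start:start + MAX_CHUNK_DATA]
        yield start + len(data) == len(message), data
```

   Under these assumptions a 40000-byte message yields two full chunks
   with the Complete bit clear, followed by a shorter final chunk with
   the Complete bit set.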

101	   Normally messages are sent by the client as a single atomic unit, and
102	   delivered to the receiving client as a single atomic unit.  For
103	   messages too large to fit conveniently in memory, the message may be
104	   built incrementally by the sender, and delivered to the receiving
105	   client incrementally, a chunk at a time.

107	   A Minion chunk is generated when a message is complete, or has
108	   accumulated at least one maximum-size chunk of data, provided the
109	   message is eligible to be sent under the message ordering facilities
110	   of the Minion Service Model [minserv] (Sender Ordering, Receiver
111	   Ordering, and Chaining).

113	   Each Minion chunk contains a Minion chunk header followed by the
114	   client's message data, as described in Section 3 "Minion Chunk
115	   Format".

117	   Each Minion chunk is encrypted using DTLS [RFC6347].

119	   Each encrypted DTLS payload is then framed using RECOBS, as described
120	   in Section 4 "Recursively Embeddable COBS", so that it begins with a
121	   00 byte and ends with an FF byte.

123	   The framed, encrypted chunk is then enqueued for transmission.

125	   If the kernel networking code supports multiple priorities, then the
126	   framed, encrypted chunk is placed in the transmission queue for the
127	   stated priority level.  Any time the TCP congestion window and/or
128	   receive window rules allow more data to be sent, data is drawn from
129	   the highest-priority non-empty transmit buffer, assigned the next
130	   block of unused TCP sequence numbers, formed into a TCP segment, and
131	   transmitted on the wire.  This just-in-time TCP sequencing mechanism
132	   has the effect of causing higher-priority data to be inserted right
133	   at the front of the conceptual combined transmit buffer, at the
134	   earliest possible byte boundary, unconstrained by message or chunk
135	   boundaries in the lower-priority messages.  This is possible because
136	   the RECOBS framing is robust to pre-emption at any arbitrary byte
137	   boundary.

139	   Note that, when priorities are supported, chunks above the lowest
140	   priority MUST be delivered to the kernel in such a way that they are
141	   sent completely before the kernel resumes sending the lower-priority
142	   traffic.  The RECOBS framing supports interrupting a lower priority
143	   stream with a higher-priority chunk, but not alternating back and
144	   forth between two priority levels.  Once a higher-priority chunk
145	   interrupts lower-priority traffic, the higher-priority chunk must be
146	   completed before the lower-priority traffic resumes.  Typically this
147	   is easily achieved by delivering the chunk to the kernel atomically
148	   in a single write call.
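
   The just-in-time sequencing described above can be modelled with a
   short Python sketch.  This is a user-space illustration only, not
   the kernel mechanism itself; the class PrioritySendBuffers and its
   methods are hypothetical names.  Each framed chunk is enqueued
   atomically, and each segment is filled from the highest-priority
   non-empty buffer first.

```python
from collections import deque

NUM_PRIORITIES = 4  # priority 0 is highest, 3 is lowest (Section 3)


class PrioritySendBuffers:
    """Toy model of per-priority transmit buffers (hypothetical API)."""

    def __init__(self) -> None:
        self._queues = [deque() for _ in range(NUM_PRIORITIES)]

    def enqueue(self, priority: int, framed_chunk: bytes) -> None:
        """Queue one complete framed chunk atomically."""
        self._queues[priority].append(bytearray(framed_chunk))

    def next_segment(self, max_bytes: int) -> bytes:
        """Draw up to max_bytes, highest-priority non-empty buffer first."""
        segment = bytearray()
        while len(segment) < max_bytes:
            queue = next((q for q in self._queues if q), None)
            if queue is None:
                break  # nothing left to send at any priority
            head = queue[0]
            take = max_bytes - len(segment)
            segment += head[:take]
            del head[:take]
            if not head:
                queue.popleft()  # chunk fully sent
        return bytes(segment)
```

   Because a higher-priority chunk is always drained completely before
   a lower-priority buffer is consulted again, the model interrupts
   lower-priority data at an arbitrary byte boundary but never
   alternates between two priority levels within one chunk.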

150	2.1.  Comparison of TCP and UDP NAT Traversal

152	   When connecting to a server with a globally routable address, TCP is
153	   generally preferable to UDP.  TCP includes the SYN and FIN bits which
154	   tell a NAT gateway when a connection starts and ends.  In particular,
155	   the FIN bit tells the NAT gateway when it can discard state related
156	   to that mapping.  UDP has no defined connection start/end indicators,
157	   which means that unused UDP mappings are much more likely to
158	   accumulate, which means that NAT gateways tend to be more aggressive
159	   about timing out UDP mappings [Study], which means that clients using
160	   UDP need to be more aggressive about sending keepalive traffic, which
161	   is bad both for network efficiency and for battery life.  Port
162	   Control Protocol (PCP) [RFC6887] offers some future hope of
163	   alleviating this problem by allowing clients to explicitly negotiate
164	   for longer mapping lifetimes, but PCP is not yet widely deployed.  In
165	   the meantime, if use of UDP increases, NAT gateways are likely to be
166	   accumulating mappings even more rapidly, with no way to differentiate
167	   which are still required and which may be safely discarded, with the
168	   result that UDP mappings may have to be discarded even more
169	   aggressively.  While a discarded UDP mapping can be recreated by
170	   another outgoing UDP packet, in the time between when the UDP mapping
171	   is discarded and then recreated, the client is cut off and unable to
172	   receive inbound communication from server or peer at the other end.
173	   Therefore, we believe that it is preferable to use TCP where
174	   possible.

176	   However, when connecting to a peer which is itself also behind a NAT
177	   gateway, in the absence of PCP support [RFC6887], techniques like
178	   Interactive Connectivity Establishment (ICE) [RFC5245] are used, and
179	   research has shown that there are cases where ICE works for UDP but
180	   not for TCP [RFC5128].

182	   To accommodate both usage scenarios, Minion is generally used with
183	   standard TCP format packets, but for peer-to-peer scenarios where TCP
184	   ICE is found not to work, Minion can be used encapsulated inside UDP
185	   [TCPoUDP] instead.

187	3.  Minion Chunk Format

189	   A Minion Chunk begins with an eight-byte header, followed by the
190	   client's message data:

192	      0                   1                   2                   3
193	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
194	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
195	     |C|    Code     |Pri|     This Minion Chunk ID                  |
196	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
197	     | Reserved      |RCP|     Referenced Minion Chunk ID            |
198	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
199	     :                                                               :
200	     :                     Minion Chunk Data                         :
201	     :                                                               :
202	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

204	                       Figure 1: Minion Chunk Format

206	   If the Complete ('C') bit is zero, this message is incomplete; the
207	   receiver should expect to receive additional continuation chunks for
208	   this message.  If the Complete bit is one, this message is complete;
209	   there will be no subsequent continuation chunks for this message.

211	   The seven-bit chunk code identifies what type of chunk this is, as
212	   described below.

214	   The two-bit priority field indicates the priority level for this
215	   message, with 0 being the highest priority and 3 being the default
216	   (lowest-level) priority.

218	   Every Minion chunk has a Chunk ID.  This is a 22-bit value assigned
219	   from a monotonically increasing 22-bit cyclic counter.  This means
220	   that Chunk IDs are reused every 2^22 chunks.  At any given moment in
221	   time though, only a small portion of the 22-bit ID space is actively
222	   in use, so Chunk IDs are not ambiguous.  Each of the four priority
223	   levels has its own 22-bit Chunk ID space, i.e., Priority 1 Chunk 7
224	   and Priority 2 Chunk 7 are different chunks.  Also, the Chunk ID
225	   spaces in opposite directions on a connection are separate.  Each
226	   sender is responsible for selecting the Chunk IDs for the chunks it
227	   sends.

229	   In some cases it is useful to refer to messages by ID, and the terms
230	   "Message ID" and "Chunk ID" are sometimes used interchangeably.  For
231	   a message that is sent using a single chunk, the Message ID is the
232	   same as the Chunk ID.  For a message that is sent using multiple
233	   chunks, the Message ID is the Chunk ID of the *final* chunk of the
234	   message.  One implication of this is that a message's ID is undefined
235	   until the message is complete.

237	   Because Chunk IDs are eventually reused, issues of ID lifetime must
238	   be carefully considered in the Minion protocol design.  For example,
239	   since a remote peer could, in principle, wait an arbitrarily long
240	   length of time before replying to a message, the Message ID of a
241	   request that is awaiting a response MUST NOT be reused until the
242	   response has been received, and the client has disposed of the
243	   request message.  Otherwise, a reply could be ambiguous, if there
244	   were two outstanding request messages both using the same Message ID
245	   at the same time.  Likewise, the last Chunk ID of an incomplete
246	   message MUST NOT be reused until some subsequent chunk has been added
247	   to that message, referencing the previous Chunk ID.
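
   One way to honour these reuse rules is sketched below in Python.
   This is an illustration under stated assumptions, not a required
   design: the counter is assumed to start at zero, and the names
   ChunkIdAllocator, allocate, and release are hypothetical.  IDs that
   must not yet be reused (a request awaiting its reply, or the last
   chunk of an incomplete message) are held and skipped when the
   22-bit counter wraps around.

```python
CHUNK_ID_BITS = 22
CHUNK_ID_MODULUS = 1 << CHUNK_ID_BITS
NUM_PRIORITIES = 4


class ChunkIdAllocator:
    """Per-priority cyclic Chunk ID counters for one sending direction."""

    def __init__(self) -> None:
        self._next = [0] * NUM_PRIORITIES  # assumption: counters start at 0
        self._in_use = [set() for _ in range(NUM_PRIORITIES)]

    def allocate(self, priority: int, hold: bool = False) -> int:
        """Return the next free ID; hold=True marks it as not yet reusable."""
        in_use = self._in_use[priority]
        if len(in_use) == CHUNK_ID_MODULUS:
            raise RuntimeError("no free Chunk IDs at this priority")
        candidate = self._next[priority]
        while candidate in in_use:
            candidate = (candidate + 1) % CHUNK_ID_MODULUS
        self._next[priority] = (candidate + 1) % CHUNK_ID_MODULUS
        if hold:
            in_use.add(candidate)
        return candidate

    def release(self, priority: int, chunk_id: int) -> None:
        """Allow chunk_id to be reused (e.g., once the reply was handled)."""
        self._in_use[priority].discard(chunk_id)
```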

249	   The Reserved field MUST be set to zero on transmission, and MUST be
250	   ignored on reception.

252	   For chunk types that need to refer to some other chunk, the
253	   Referenced Minion Chunk Priority (RCP) and Referenced Minion Chunk ID
254	   fields identify the referenced chunk.  Note that some chunk types
255	   refer to chunks going in the same direction (e.g., a continuation
256	   chunk) and some chunk types refer to chunks going in the reverse
257	   direction (e.g., a reply chunk).  For chunk types that do not
258	   refer to any other chunk, these two fields MUST be set to zero on
259	   transmission, and MUST be ignored on reception.

261	   The Minion Chunk payload data follows the Minion Chunk Header.

263	   There is no explicit length field in the Minion Chunk Header, because
264	   the chunk length is determined implicitly in the RECOBS decoding
265	   step.
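
   The header layout of Figure 1 can be expressed as a minimal Python
   sketch.  It assumes network byte order with bit 0 of the figure as
   the most significant bit, which is the usual convention for such
   diagrams but is not stated explicitly here.  The names ChunkHeader,
   pack_header, and unpack_header are hypothetical.

```python
import struct
from typing import NamedTuple


class ChunkHeader(NamedTuple):
    complete: bool
    code: int              # 7 bits
    priority: int          # 2 bits
    chunk_id: int          # 22 bits
    ref_priority: int = 0  # RCP, 2 bits
    ref_chunk_id: int = 0  # 22 bits


def pack_header(h: ChunkHeader) -> bytes:
    """Build the eight-byte header; the Reserved field is sent as zero."""
    if not (0 <= h.code < 1 << 7 and 0 <= h.priority < 4
            and 0 <= h.ref_priority < 4
            and 0 <= h.chunk_id < 1 << 22
            and 0 <= h.ref_chunk_id < 1 << 22):
        raise ValueError("header field out of range")
    word0 = ((int(h.complete) << 31) | (h.code << 24)
             | (h.priority << 22) | h.chunk_id)
    word1 = (h.ref_priority << 22) | h.ref_chunk_id
    return struct.pack("!II", word0, word1)


def unpack_header(data: bytes) -> ChunkHeader:
    """Parse the first eight bytes; the Reserved field is ignored."""
    word0, word1 = struct.unpack("!II", data[:8])
    return ChunkHeader(
        complete=bool(word0 >> 31),
        code=(word0 >> 24) & 0x7F,
        priority=(word0 >> 22) & 0x3,
        chunk_id=word0 & 0x3FFFFF,
        ref_priority=(word1 >> 22) & 0x3,
        ref_chunk_id=word1 & 0x3FFFFF,
    )
```

   Everything after the first eight bytes of a decoded chunk is chunk
   data; its length comes from the RECOBS decoding step rather than
   from the header.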

267	3.1.  Minion Chunk Codes

269	   The seven-bit chunk code identifies what kind of chunk this is.
270	   There are 128 chunk codes available.  The following nine chunk codes
271	   are currently defined:

273	     00 Continuation.  This is a continuation of a previously incomplete
274	        message.  The Referenced Minion Chunk ID identifies what
275	        previous chunk this is adding to.  (If the Complete bit is one
276	        then this chunk is the final chunk and completes the message; no
277	        further chunks for this message will be arriving.)

279	     01 Cancellation.  This is a cancellation of a previously incomplete
280	        message.  The Referenced Minion Chunk ID identifies what
281	        previous chunk this is cancelling.  In this case the Complete
282	        bit is unused; the Complete bit MUST be set to zero on
283	        transmission, and MUST be ignored on reception.

285	     02 Unordered Message.  This chunk begins a new unordered message.
286	        The Referenced Minion Chunk ID is unused, and MUST be set to
287	        zero on transmission, and MUST be ignored on reception.

289	     03 Sender Ordered Message.  This chunk begins a new Sender Ordered
290	        message.  If received out-of-order, it should nonetheless be
291	        delivered immediately to the receiving client.  The Referenced
292	        Minion Chunk ID is used to deduce the Sender Ordering that
293	        should be applied if the receiving client generates a reply to
294	        this message.  If the received message identified by the
295	        Referenced Minion Chunk ID generated a reply A', then a reply to
296	        this message should have an automatic Sender Ordering dependency
297	        that it follow message A'.

299	     04 Receiver Ordered Message.  This chunk begins a new Receiver
300	        Ordered message.  This message is subject to Receiver Ordering;
301	        it MUST NOT be delivered to the receiving client until the
302	        message indicated by the Referenced Minion Chunk ID field has
303	        been delivered.  If the receiving client generates a reply to
304	        this message, then the reply should have an automatic Receiver
305	        Ordering dependency that it follow the reply to the message
306	        indicated by the Referenced Minion Chunk ID field.

308	     05 Chained Message.  This chunk begins a new message that chains on
309	        after a preceding message.  The Referenced Minion Chunk ID
310	        identifies the preceding message.  This message MUST NOT be
311	        delivered to the receiving client until the previous message of
312	        the chain has been delivered to the receiving client, and this
313	        message MUST be delivered to the receiving client in a manner
314	        that indicates to the client that it is related to the previous
315	        message.

317	     06 Reply/Acknowledge.  This chunk begins a new message which is an
318	        explicit reply to a previously received message.  The Referenced
319	        Minion Chunk ID identifies the received message to which this is
320	        a reply.  A reply may be empty, in which case it serves as a
321	        simple acknowledgement that the request was received and
322	        accepted, or it may contain data.  It is anticipated that future
323	        Minion protocol development will create additional Minion chunk
324	        codes to negotiate future protocol features.  For these
325	        capability negotiation messages, an empty reply referencing the
326	        request serves as an acknowledgement that the requested protocol
327	        feature is supported.

329	     07 Reject.  A Minion Reject code indicates that the referenced
330	        received message had an error or was not accepted for some other
331	        reason.  A Reject Message may be empty, or may contain data
332	        giving information concerning the reason for the rejection.  It
333	        is possible to reject an incomplete message that is still
334	        arriving, by sending a Reject referencing the most recent Chunk
335	        ID for that partial message.  The sender will respond by sending
336	        a Cancellation for that message, confirming that no further
337	        chunks will be sent.  When used for Minion protocol capability
338	        negotiation, a Reject message referencing the request indicates
339	        that the requested protocol feature is not supported.

341	     08 End Minion.  It is anticipated that there will be existing
342	        application protocols that initially add Minion as an optional
343	        feature, which they use only when the remote peer indicates it
344	        also has Minion support, and otherwise they will communicate
345	        using the existing protocol without the Minion features.  Such
346	        application protocols typically will first connect using their
347	        existing protocol, and then negotiate an "upgrade" to Minion
348	        framing.  For symmetry, it would be good if such an "upgrade"
349	        were not an irreversible one-way path.  We would like to offer
350	        the ability for applications to connect over raw TCP, switch to
351	        Minion for some message exchanges, and then drop back to raw TCP
352	        for some subsequent communication.  This Minion chunk code
353	        exists to signal, "This is the final Minion-format message you
354	        will receive in this particular Minion session; after this
355	        you're on your own."
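
   For reference, the chunk codes listed above can be captured as a
   Python enumeration.  This is only a sketch; the identifier names
   are hypothetical, and returning None for an undefined code is an
   assumption, since this document does not say how a receiver should
   treat chunk codes it does not recognise.

```python
from enum import IntEnum
from typing import Optional


class ChunkCode(IntEnum):
    CONTINUATION = 0x00
    CANCELLATION = 0x01
    UNORDERED_MESSAGE = 0x02
    SENDER_ORDERED_MESSAGE = 0x03
    RECEIVER_ORDERED_MESSAGE = 0x04
    CHAINED_MESSAGE = 0x05
    REPLY_ACKNOWLEDGE = 0x06
    REJECT = 0x07
    END_MINION = 0x08


def classify(code: int) -> Optional[ChunkCode]:
    """Return the ChunkCode for a seven-bit value, or None if undefined."""
    if not 0 <= code < 128:
        raise ValueError("chunk code must fit in seven bits")
    try:
        return ChunkCode(code)
    except ValueError:
        return None  # assumption: caller decides how to handle unknown codes
```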

357	4.  Recursively Embeddable COBS

359	   Consistent Overhead Byte Stuffing [COBS] allows complete messages to
360	   be reliably located within an incomplete data stream that may contain
361	   gaps.

363	   COBS works by transforming the payload data to eliminate all
364	   occurrences of zero bytes.  This is like PPP byte stuffing, but more
365	   efficient; COBS has a worst-case data size overhead below 0.5%.
366	   Having created a zero-free payload, the payloads can then be
367	   concatenated into a single byte stream, separated by single zero
368	   bytes, and the zero bytes unambiguously mark the boundaries between
369	   payloads, because we know the payloads themselves no longer contain
370	   any zero bytes.  At the receiving end the transformation is reversed
371	   to recreate the original payload data.

373	   The transformation process [COBS] is, in effect, a simple run length
374	   encoding.  An extremely simplified summary of the original 1997 COBS
375	   encoding is as follows:

377	   o  If the payload begins with three nonzero bytes followed by a zero,
378	      then the output is the byte value 4 (the run length) followed by
379	      the three nonzero bytes, and the subsequent zero is skipped.

381	   o  If that is followed by fifty nonzero bytes followed by a zero,
382	      then the output is the byte value 51 (the run length) followed by
383	      the fifty nonzero bytes, and the subsequent zero is skipped.

385	   o  This process is repeated until the entire payload has been
386	      replaced by its zero-free equivalent.

388	   Recursively Embeddable COBS (RECOBS) is a derivative of the original
389	   1997 COBS encoding.  RECOBS code bytes have the following meanings:

391	     00 New payload begins
392	     01 Represents a single zero byte
393	     02 Two bytes: a single nonzero byte, followed by a single zero byte
394	     03 Three bytes: two nonzero bytes, followed by a single zero byte
395	      n Represents n bytes: n-1 nonzero bytes, followed by a zero byte
396	     FD 253 bytes: 252 nonzero bytes, followed by a single zero byte
397	     FE 253 bytes: 253 nonzero bytes, with *no* following zero byte
398	     FF Payload ends

400	   This has the effect that, after encoding, every payload has
401	   unambiguous bookends; every payload begins with a single 00, and ends
402	   with a single FF.  Using this encoding, recursive embedding becomes
403	   possible.  At *any* point in the encoded byte stream it is now
404	   possible to interrupt the byte stream, insert a new RECOBS-encoded
405	   payload, and then resume the previous byte stream.
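
   A minimal Python sketch of an encoder for the code bytes listed
   above follows.  The summary here is deliberately simplified and
   does not say how the end of a payload is handled; the sketch
   assumes the classic COBS convention of appending one implicit zero
   byte to the payload before encoding, which the decoder later
   removes.  The function name recobs_encode is hypothetical.

```python
def recobs_encode(payload: bytes) -> bytes:
    """Frame payload as 00 <zero-free blocks> FF (sketch, see assumptions)."""
    out = bytearray([0x00])            # 00: new payload begins
    run = bytearray()                  # nonzero bytes of the current block
    for byte in payload + b"\x00":     # assumption: implicit trailing zero
        if byte == 0:
            out.append(len(run) + 1)   # 01..FD: run followed by a zero
            out += run
            run.clear()
        else:
            run.append(byte)
            if len(run) == 253:
                out.append(0xFE)       # FE: 253 nonzero bytes, no zero
                out += run
                run.clear()
    out.append(0xFF)                   # FF: payload ends
    return bytes(out)
```

   With this convention an empty payload encodes as 00 01 FF, and the
   only 00 byte in any encoded payload is the leading start marker,
   which is what lets a decoder recognise an embedded payload at any
   byte position.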

407	   At the receiving end, the decoder is part-way through decoding a
408	   payload when the interruption occurs.  The decoder sees a 00, which
409	   is not legal in RECOBS-encoded data, so the decoder knows a new
410	   payload is beginning.  Because the decoder has not yet seen the FF
411	   end-marker for the previous payload, it knows that payload is
412	   incomplete, so it saves its decoding state for later resumption.  The
413	   decoder then proceeds to decode the embedded payload.  When the
414	   decoder sees the FF end-marker for the embedded payload, it delivers
415	   that fully decoded payload to the waiting client, and then resumes
416	   its decoding of the previously interrupted payload.

418	   In principle this recursive embedding could be nested arbitrarily
419	   deeply, limited only by the amount of storage the decoder has
420	   available for partially-received payloads and their associated
421	   decoding state.

423	   In practice, Minion limits RECOBS embedding to four levels (the base
424	   level plus three levels of nested interruption) to establish a
425	   defined upper bound on the amount of storage required by a decoder.
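
   The decoding procedure, including saved state for interrupted
   payloads and the four-level limit, can be sketched in Python as
   follows.  The sketch assumes that the encoder appended one implicit
   zero byte to each payload (the classic COBS convention, which this
   document does not spell out) and that bytes seen outside any
   payload are skipped.  The names recobs_decode and MAX_DEPTH are
   hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

MAX_DEPTH = 4  # base level plus three levels of nested interruption


@dataclass
class _Frame:
    out: bytearray = field(default_factory=bytearray)
    remaining: int = 0        # data bytes still expected in this block
    zero_after: bool = False  # does this block end with an implied zero?


def recobs_decode(stream: bytes) -> List[bytes]:
    """Return payloads in the order their FF end markers are seen."""
    delivered: List[bytes] = []
    stack: List[_Frame] = []  # saved state of interrupted payloads
    for byte in stream:
        if byte == 0x00:      # never legal inside a payload: new payload
            if len(stack) == MAX_DEPTH:
                raise ValueError("RECOBS nesting deeper than four levels")
            stack.append(_Frame())
            continue
        if not stack:
            continue          # assumption: skip until the next 00 marker
        frame = stack[-1]
        if frame.remaining:   # literal data byte of the current block
            frame.out.append(byte)
            frame.remaining -= 1
            if frame.remaining == 0 and frame.zero_after:
                frame.out.append(0)
        elif byte == 0xFF:    # payload ends; resume the interrupted one
            stack.pop()
            if not frame.out or frame.out[-1] != 0:
                raise ValueError("payload lacks its implicit final zero")
            delivered.append(bytes(frame.out[:-1]))
        elif byte == 0xFE:    # 253 nonzero bytes, no implied zero
            frame.remaining, frame.zero_after = 253, False
        else:                 # 01..FD: byte-1 data bytes, then a zero
            frame.remaining, frame.zero_after = byte - 1, True
            if frame.remaining == 0:
                frame.out.append(0)
    return delivered
```

   Note that FF is treated as an end marker only where a code byte is
   expected; inside a block it is ordinary data.  An embedded payload
   is delivered before the payload it interrupted, matching the
   description above.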

427	5.  Flow Control

429	   TCP [RFC0793] implements flow control in the form of the advertised
430	   receive window.  This is to prevent a faster sender from overwhelming
431	   a slower receiver.  Minion requires similar protection to prevent a
432	   slower receiver running out of memory trying to buffer messages
433	   arriving faster than it can handle them.

435	   For a pure user-level library implementation of Minion, this is
436	   achieved by having the library set an upper bound on the amount of
437	   memory it will use for storing received messages that have not yet
438	   been handled by the client.  Once this limit is met, the library
439	   ceases reading TCP data from the kernel, which causes the TCP receive
440	   window to fill up, which causes the sender to stop sending.  Once the
441	   client consumes some messages, the library then reads more data from
442	   the kernel, the TCP receive window opens up, and the sender is
443	   permitted to send more data.

445	   However, this means that there is some duplication of buffering --
446	   the TCP receive window in the kernel and additional buffering in the
447	   user-level library.  For this reason a kernel extension is proposed
448	   where a client (the Minion library in this case) can read data from
449	   the connection *without* raising the TCP receive window.  In a sense
450	   it is reading the data "secretly", without admitting to the sender at
451	   the other end that it has been read.  Those bytes, even though read
452	   into user space, are still counted against the TCP receive window.
453	   Later, after the client application has actually consumed the
454	   message, another kernel call is made to acknowledge consumption of
455	   those bytes, and the TCP receive window is raised.

457	   This mechanism integrates message-level flow control with TCP's byte-
458	   level flow control, rather than having two independent flow control
459	   mechanisms happening concurrently at different levels, in ways that
460	   might interact badly with each other.
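
   The accounting behind the proposed kernel extension can be modelled
   with a small Python sketch.  This is a toy model, not an existing
   kernel interface; the class DeferredReceiveWindow and its method
   names are hypothetical.  Bytes read by the library continue to
   count against the advertised window until the client application
   has consumed them.

```python
class DeferredReceiveWindow:
    """Toy model: reading does not reopen the window; consuming does."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffered_in_kernel = 0  # received, not yet read by the library
        self.read_not_consumed = 0   # read by library, not yet consumed

    def advertised_window(self) -> int:
        return (self.capacity - self.buffered_in_kernel
                - self.read_not_consumed)

    def on_segment_arrival(self, nbytes: int) -> None:
        if nbytes > self.advertised_window():
            raise ValueError("sender exceeded the advertised window")
        self.buffered_in_kernel += nbytes

    def read_without_opening_window(self, nbytes: int) -> int:
        """Library reads data; the advertised window is left unchanged."""
        nbytes = min(nbytes, self.buffered_in_kernel)
        self.buffered_in_kernel -= nbytes
        self.read_not_consumed += nbytes
        return nbytes

    def acknowledge_consumed(self, nbytes: int) -> None:
        """Client consumed a message; only now does the window reopen."""
        nbytes = min(nbytes, self.read_not_consumed)
        self.read_not_consumed -= nbytes
```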

462	   Note that the Minion protocol design will have to consider possible
463	   deadlock situations.  For example, suppose one Minion host is
464	   refusing to consume any more Minion Chunks because it wishes to send
465	   a Reject message for them, but it cannot, because the peer's receive
466	   window is closed.  Suppose also that the reason the peer's receive
467	   window is closed is because the peer also is sitting on a pile of
468	   unwanted Minion Chunks that it refuses to consume until it can send a
469	   Reject message for them.  Possible deadlocks such as these need to be
470	   considered, and mechanisms to avoid them created.

472	6.  Retransmission Policy

474	   One of the main arguments that is often presented to justify why a
475	   particular application protocol is built on UDP instead of TCP is
476	   that, "UDP is better for 'real time' applications."  The supporting
477	   reasoning for this is often that, "TCP insists on continuing to
478	   retransmit data long after the client doesn't need any more."  In
479	   truth the real problem is not retransmission; it is that the
480	   conventional TCP APIs don't allow received data to be delivered out
481	   of order.  Suppose a TCP sender has 50 packets in flight at any given
482	   time (e.g., 1500-byte packets and a 75 kB bandwidth x delay product).
483	   Then the loss of a single packet stalls all 49 following packets at
484	   the receiver, because the API doesn't allow them to be delivered to
485	   the client until the missing packet has been received.

487	   Minion solves this problem by allowing data to be delivered as it
488	   arrives, even if there are gaps.  But the argument still remains that
489	   even after removing the ordering requirement at the receiver, it may
490	   still be a waste of bandwidth to retransmit data that will arrive too
491	   late to be useful.  And indeed, it is possible with TCP to
492	   fraudulently acknowledge segments that were in fact not received, and
493	   this will cause the sender to not retransmit those segments.

495	   However, we chose not to use fraudulent acknowledgements to suppress
496	   retransmissions, because certain NATs, Firewalls and other
497	   middleboxes may block traffic if they observe implausible protocol
498	   actions which they find suspicious.  One of the important goals of
499	   Minion is 100% compatibility with today's existing Internet devices,
500	   not 99% compatibility.

502	   We expect packet loss to be about 1% (at most a few percent) in a
503	   functioning network, and the cost of retransmitting those lost
504	   packets, even in the extreme case where *all* the retransmissions
505	   turn out to be unnecessary, is an overhead of about 1%.  We argue
506	   that an overhead of about 1% is an acceptable price to pay in
507	   exchange for 100% compatibility with existing NATs, Firewalls and
508	   other middleboxes.

510	7.  Optional Kernel Extensions

512	   While Minion can be implemented entirely as a user-level library
513	   built on top of existing standard networking APIs like BSD sockets,
514	   it can also benefit from some optional kernel extensions:

516	   Send Priorities
517	      Normal TCP APIs transmit data strictly in the order it is given to
518	      the kernel.  The addition of priority support allows a sendmsg()
519	      call to be used in conjunction with cmsg ancillary data to
520	      indicate the priority level of the data.  For normal applications
521	      this capability would be of little use because it would most
522	      likely result in corruption of the data stream, but it is useful
523	      with Minion because the RECOBS encoding is robust against message
524	      insertion at arbitrary byte boundaries.  An alternative way to
525	      achieve a similar effect is, instead of buffering data in the
526	      kernel, to keep the data in the user-space library for as long as
527	      possible.  When the TCP congestion window and/or receive window
528	      rules allow more data to be sent, the kernel generates some kind
529	      of upcall (e.g., a kevent notification) to the user-space library
530	      informing it of the ability to transmit, and the user-space
531	      library responds by selecting which particular block of data to
532	      hand to the kernel next.
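
      The user-space alternative described above can be sketched as
      follows.  This is a minimal illustration, not part of the Minion
      specification: the class name, the on_can_transmit() hook standing
      in for the kernel upcall, and the convention that a lower number
      means higher priority are all assumptions made for this example.

```python
import heapq
import itertools


class PendingSendQueue:
    """User-space queue that hands the kernel one block per upcall."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: keeps FIFO order among equal priorities.
        self._order = itertools.count()

    def enqueue(self, priority, block):
        # Assumption of this sketch: lower number = more urgent.
        heapq.heappush(self._heap, (priority, next(self._order), block))

    def on_can_transmit(self):
        """Hypothetical hook run when the kernel signals it can send.

        Returns the block to hand to the kernel next, or None if the
        library has nothing queued.
        """
        if not self._heap:
            return None
        _, _, block = heapq.heappop(self._heap)
        return block


queue = PendingSendQueue()
queue.enqueue(5, b"bulk-1")
queue.enqueue(5, b"bulk-2")
queue.enqueue(0, b"urgent")
sent = [queue.on_can_transmit() for _ in range(4)]
print(sent)
```

      Because the choice is deferred until the kernel can actually
      transmit, a block queued late at a more urgent priority can still
      be handed over ahead of bulk data that was queued earlier.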

534	   Just-In-Time Data Generation
535	      Through operational experience, we have learned (not that this was
536	      any great surprise) that excessive buffering in the kernel leads
537	      to poor behaviors.  For example, two messages at the same priority
538	      level are not interleaved effectively if the first message is
539	      swallowed whole by the kernel, and held in kernel buffers, before
540	      the second message is even created.  When that happens, the result
541	      is that the first message is sent in its entirety, followed by the
542	      second message in its entirety, with no interleaving.

544	      To prevent this unintended serialization, we need to avoid
545	      irrevocably handing off data to the kernel prematurely.  We want
546	      to give the kernel enough data to keep the pipeline full (an
547	      amount equal to the connection's Bandwidth Delay Product) but no
548	      more.

550	      To this end, rather than having the kernel indicate that a socket
551	      is writable any time the kernel has space available to buffer more
552	      data, we'd like the kernel to indicate that a socket is writable
553	      only when TCP (according to its protocol rules, such as receive
554	      window, congestion window, and Nagle's Algorithm) would be willing
555	      to send data, but has no data available to send.  When this
556	      situation occurs, the socket becomes writable, and the client (the
557	      user-level Minion library) is able to perform a just-in-time
558	      determination of what data ought to be sent next.

560	      This just-in-time data generation could be achieved in the BSD
561	      sockets API by adding a new socket option.  When using this new
562	      socket option, a socket will only be writable when TCP is actively
563	      waiting for new data.  If the context-switching latency or
564	      software overhead is such that it takes the user-level code a
565	      little too long to generate data strictly on demand, then a middle
566	      ground can be achieved by modifying the new socket option such
567	      that a socket will only be writable when the socket has less data
568	      buffered than it expects to need imminently.  For example, a TCP
569	      connection in slow start expects it will need four TCP segments
570	      when the next ACK arrives.  When used this way, if an incoming ACK
571	      allows TCP to send out four segments then those four segments are
572	      already buffered and ready in the kernel, and the socket then
573	      becomes writable again to allow the user-level code to generate
574	      the next four segments, so that they will be ready and waiting the
575	      next time TCP is able to transmit additional segments.

577	      We are currently experimenting with just-in-time data generation.
578	      If it proves to be as effective as we hope, it might even work
579	      well enough to provide effective priority support too, eliminating
580	      the need for the "Send Priorities" kernel extension.
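
      The two writability rules discussed above can be modeled as a
      pair of predicates.  This is a sketch under stated assumptions:
      the function names, the byte-based bookkeeping, and the 1460-byte
      segment size are illustrative only, and no existing socket option
      is implied.

```python
def bandwidth_delay_product(bandwidth_bps, rtt_ms):
    """Bytes that must be in flight to keep the path full."""
    # bits/second * milliseconds / (8 bits/byte * 1000 ms/second)
    return bandwidth_bps * rtt_ms // 8000


def writable_strict(unsent_buffered, sendable_now):
    """Strict rule: TCP could send right now but has nothing to send.

    unsent_buffered: bytes handed to the kernel, not yet transmitted
    sendable_now:    bytes the receive and congestion windows permit
    """
    return sendable_now > 0 and unsent_buffered == 0


def writable_lookahead(unsent_buffered, expected_need):
    """Middle-ground rule: less is buffered than TCP expects to need
    when the next ACK arrives."""
    return unsent_buffered < expected_need


MSS = 1460            # assumed segment size for this example
need = 4 * MSS        # four segments, as in the slow-start example

print(bandwidth_delay_product(10_000_000, 100))
print(writable_lookahead(0, need))
print(writable_lookahead(need, need))
```

      With the look-ahead rule the socket stops being writable as soon
      as four segments are buffered, and becomes writable again once an
      ACK lets TCP drain them.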

582	   Immediate Receive
583	      Normal TCP APIs deliver data only in TCP sequence number order.
584	      The addition of support for new cmsg ancillary data in the
585	      recvmsg() call allows the user-space library to request *any*
586	      available data, not only in-order data.  The cmsg ancillary data
587	      returned from the recvmsg() call indicates to the user-space
588	      library where in the TCP sequence space this particular block of
589	      data lies.  A setsockopt() option (or equivalent) is also required
590	      to put the socket into this "Immediate Receive" mode, to inform
591	      the kernel that the client will accept out-of-order data on this
592	      socket, and therefore the client should be notified (via select(),
593	      kevent(), etc.), not only when there is in-order data available to
594	      be read, but also when there is out-of-order data available to be
595	      read.
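
      On the library side, blocks returned in this mode have to be
      tracked by their position in the stream.  The sketch below
      assumes each read yields an (offset, data) pair, with the offset
      taken from the ancillary data, and that blocks never overlap.
      The class and method names are hypothetical.

```python
class OutOfOrderStore:
    """Holds blocks tagged with a stream offset; merges adjacent ones."""

    def __init__(self):
        self._runs = {}  # start offset -> bytes of one contiguous run

    def add(self, offset, data):
        # Assumption: blocks do not overlap.  A real library would
        # have to trim or reject overlapping retransmissions.
        self._runs[offset] = data
        merged = {}
        last = None  # start offset of the run most recently emitted
        for start in sorted(self._runs):
            chunk = self._runs[start]
            if last is not None and last + len(merged[last]) == start:
                merged[last] += chunk  # touches the previous run: join
            else:
                merged[start] = chunk
                last = start
        self._runs = merged

    def runs(self):
        """Contiguous runs currently held, ordered by stream offset."""
        return sorted(self._runs.items())


store = OutOfOrderStore()
store.add(0, b"AB")
store.add(6, b"GH")        # arrives ahead of the missing bytes 2..5
before = store.runs()
store.add(2, b"CDEF")      # the hole is filled
after = store.runs()
print(before)
print(after)
```

      Each contiguous run can be examined for complete messages as
      soon as it is available, without waiting for earlier holes in
      the stream to be filled.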

597	   Integrated Receive Window
598	      Normal TCP APIs raise the receive window any time data is read out
599	      of the kernel into user space.  The addition of new cmsg ancillary
600	      data in the recvmsg() call allows the user-space library to
601	      request that the kernel return received data *without* reflecting
602	      this in its receive window calculation.  After the client
603	      application has consumed the message data from the user-space
604	      Minion library, the Minion library makes a subsequent recvmsg()
605	      call with appropriate cmsg ancillary data to inform the kernel how
606	      many bytes to add back into its receive window.  In essence, the
607	      receive window boundary is stretched outside the kernel to account
608	      for data held by *both* the kernel *and* the user-space Minion
609	      library.
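
      The accounting described above can be modeled in a few lines.
      This is an illustrative model only: the class, its method names,
      and the 65536-byte capacity are assumptions, and the two
      recvmsg() behaviors are represented by ordinary method calls.

```python
class ReceiveWindowModel:
    """Receive window that also counts bytes held by the library."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.kernel_buffered = 0  # received, not yet read by the library
        self.library_held = 0     # read by library, not yet consumed

    def advertised_window(self):
        return self.capacity - self.kernel_buffered - self.library_held

    def segment_arrives(self, nbytes):
        if nbytes > self.advertised_window():
            raise ValueError("sender exceeded the advertised window")
        self.kernel_buffered += nbytes

    def library_reads(self, nbytes):
        # Models recvmsg() with the "do not reopen the window" request:
        # bytes move out of the kernel but still count against the window.
        nbytes = min(nbytes, self.kernel_buffered)
        self.kernel_buffered -= nbytes
        self.library_held += nbytes
        return nbytes

    def application_consumes(self, nbytes):
        # Models the later recvmsg() call that credits bytes back.
        nbytes = min(nbytes, self.library_held)
        self.library_held -= nbytes
        return nbytes


model = ReceiveWindowModel(capacity=65536)
model.segment_arrives(10000)
w_after_arrival = model.advertised_window()
model.library_reads(10000)
w_after_read = model.advertised_window()
model.application_consumes(4000)
w_after_consume = model.advertised_window()
print(w_after_arrival, w_after_read, w_after_consume)
```

      In this model the window does not grow when the library reads
      from the kernel, only when the application consumes data from
      the library.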

611	   These optional kernel extensions are a key part of what makes Minion
612	   compelling.  Minion can be adopted today by any application, using
613	   Minion as a purely user-space library.  Such an application performs
614	   as well as any application can when it is built on top of standard
615	   TCP.  However, unlike an application built on top of standard TCP,
616	   Minion offers the promise of future kernel support for even better
617	   performance.  Any given application with its own application-specific
618	   protocol is unlikely to receive special kernel support to make just
619	   that one application work better.  But when many applications all use
620	   the Minion protocol, it then becomes reasonable to add kernel support
621	   to improve all of those applications.

623	8.  TCP Deviations

625	   When implemented entirely as a user-level library, Minion naturally
626	   adheres to the TCP specifications (insofar as the underlying
627	   operating system adheres to the TCP specifications) because Minion is
628	   merely using the operating system's networking APIs.

630	   When optional kernel extensions are in use, they may allow Minion to
631	   deviate from classical TCP protocol rules.  One such instance of this
632	   deviation has already been identified.  The TCP protocol rules allow
633	   a sender to send a FIN to end a connection, and then follow it with
634	   additional data bytes (with higher TCP sequence numbers, so that they
635	   fall later in the data stream) which the receiver is expected to
636	   discard because it recognizes that they fall after the FIN in the
637	   data stream.  When out-of-order delivery is enabled, it's possible
638	   that if the TCP segment containing the FIN is lost or delayed, then
639	   subsequent TCP segments containing data bytes could be incorrectly
640	   delivered to the client application, when the TCP protocol rules
641	   dictate that they should have been discarded.  The ability to send
642	   data following the FIN that the receiver is expected to discard is
643	   incompatible with out-of-order delivery.  Note that this is referring
644	   to data that follows the FIN in TCP sequence number space, not data
645	   that follows the FIN in transmission order.  If, after the FIN has
646	   been sent, previously transmitted data is lost and needs to be
647	   retransmitted, then this does not cause any problems; the bytes in
648	   such retransmitted TCP segments fall *before* the FIN in TCP sequence
649	   number space, not after.  As a result of this observation, TCP's
650	   protocol rules, when used with Minion traffic, are effectively
651	   modified as follows:

653	   o  A client using Minion MUST NOT send new data on a connection after
654	      that connection has been closed (i.e., a FIN indication has been
655	      sequenced and sent).
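
   A sender-side guard for this rule might look like the following
   sketch.  It is an assumption-laden illustration: the class and
   method names are hypothetical, and plain stream offsets are used in
   place of 32-bit TCP sequence numbers, so wraparound is ignored.

```python
class MinionSendState:
    """Tracks where the FIN sits in the outgoing byte stream."""

    def __init__(self):
        self.next_offset = 0    # offset the next new byte would occupy
        self.fin_offset = None  # set once a FIN has been sequenced

    def send_new(self, data):
        if self.fin_offset is not None:
            raise RuntimeError("new data MUST NOT follow the FIN")
        start = self.next_offset
        self.next_offset += len(data)
        return range(start, self.next_offset)

    def send_fin(self):
        if self.fin_offset is None:
            self.fin_offset = self.next_offset
        return self.fin_offset

    def may_retransmit(self, start, length):
        # Retransmitted bytes must lie entirely within the data already
        # sequenced, which by construction ends where the FIN sits.
        limit = self.next_offset
        return start >= 0 and length > 0 and start + length <= limit


state = MinionSendState()
first = state.send_new(b"hello")
fin_at = state.send_fin()
retransmit_ok = state.may_retransmit(0, 5)
beyond_fin_ok = state.may_retransmit(5, 1)
try:
    state.send_new(b"more")
    rejected = False
except RuntimeError:
    rejected = True
print(fin_at, retransmit_ok, beyond_fin_ok, rejected)
```

   Retransmitting earlier bytes after the FIN has been sent remains
   allowed, because those bytes fall before the FIN in sequence space.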

657	   In reality we do not expect this to be a major burden to TCP
658	   implementations.  We are not aware of TCP implementations that send
659	   data after a connection is closed and then rely on the receiver to
660	   discard that data.

662	9.  IANA Considerations

664	   No IANA actions are required by this document.

666	10.  Security Considerations

668	   We take security seriously.  As this work develops, this section will
669	   contain details of any known security issues and possible
670	   mitigations.

672	11.  Acknowledgements

674	   Many thanks to Bryan Ford, Padma Bhooma and Anumita Biswas for their
675	   contributions to the development of Minion.

677	   Thanks to Joe Touch for pointing out that Minion restricts TCP's
678	   ability to send data, after a connection is closed, that will then be
679	   ignored by the receiver.

681	12.  References

683	12.1.  Normative References

685	   [COBS]     Cheshire, S. and M. Baker, "Consistent Overhead Byte
686	              Stuffing", September 1997,
687	              <http://stuartcheshire.org/papers/COBSforToN.pdf>.

689	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
690	              RFC 793, September 1981.

692	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
693	              Requirement Levels", BCP 14, RFC 2119, March 1997.

695	   [RFC6347]  Rescorla, E. and N. Modadugu, "Datagram Transport Layer
696	              Security Version 1.2", RFC 6347, January 2012.

698	   [minserv]  Iyengar, J., "Minion - Service Model and Conceptual API",
699	              draft-iyengar-minion-concept-00 (work in progress),
700	              June 2013.

702	12.2.  Informative References

704	   [RFC0768]  Postel, J., "User Datagram Protocol", STD 6, RFC 768,
705	              August 1980.

707	   [RFC5128]  Srisuresh, P., Ford, B., and D. Kegel, "State of Peer-to-
708	              Peer (P2P) Communication across Network Address
709	              Translators (NATs)", RFC 5128, March 2008.

711	   [RFC5245]  Rosenberg, J., "Interactive Connectivity Establishment
712	              (ICE): A Protocol for Network Address Translator (NAT)
713	              Traversal for Offer/Answer Protocols", RFC 5245,
714	              April 2010.

716	   [RFC6887]  Wing, D., Cheshire, S., Boucadair, M., Penno, R., and P.
717	              Selkirk, "Port Control Protocol (PCP)", RFC 6887,
718	              April 2013.

720	   [Study]    Hatonen, S., Nyrhinen, A., Eggert, L., Strowes, S.,
721	              Sarolahti, P., and M. Kojo, "An Experimental Study of Home
722	              Gateway Characteristics", November 2010,
723	              <http://conferences.sigcomm.org/imc/2010/papers/p260.pdf>.

725	   [TCPoUDP]  Cheshire, S., Graessley, J., and R. McGuire,
726	              "Encapsulation of TCP and other Transport Protocols over
727	              UDP", draft-cheshire-tcp-over-udp-00 (work in progress),
728	              June 2013.

730	Authors' Addresses

732	   Janardhan Iyengar
733	   Franklin and Marshall College
734	   Mathematics and Computer Science
735	   PO Box 3003
736	   Lancaster, Pennsylvania  17604-3003
737	   USA

739	   Phone: +1 717 358 4774
740	   Email: janardhan.iyengar@fandm.edu

742	   Stuart Cheshire
743	   Apple Inc.
744	   1 Infinite Loop
745	   Cupertino, California  95014
746	   USA

748	   Phone: +1 408 974 3207
749	   Email: cheshire@apple.com

751	   Josh Graessley
752	   Apple Inc.
753	   1 Infinite Loop
754	   Cupertino, California  95014
755	   USA

757	   Phone: +1 408 974 5710
758	   Email: jgraessley@apple.com