idnits 2.17.1 

draft-davie-stt-07.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (January 15, 2016) is 3023 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-16) exists of
     draft-ietf-nvo3-geneve-01

  == Outdated reference: A later version (-07) exists of
     draft-ietf-nvo3-security-requirements-06


     Summary: 1 error (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                      B. Davie, Ed.
3	Internet-Draft                                                  J. Gross
4	Intended status: Informational                              VMware, Inc.
5	Expires: July 18, 2016                                  January 15, 2016

7	  A Stateless Transport Tunneling Protocol for Network Virtualization
8	                                 (STT)
9	                           draft-davie-stt-07

11	Abstract

13	   Network Virtualization places unique requirements on tunneling
14	   protocols.  This draft describes STT (Stateless Transport Tunneling),
15	   a tunnel encapsulation that enables overlay networks to be built in
16	   virtualized networks.  STT is particularly useful when some tunnel
17	   endpoints are in end-systems, as it utilizes the capabilities of the
18	   network interface card to improve performance.  This draft documents
19	   the protocol and the rationale for its design, and highlights issues
20	   that may arise in deployments.

22	Status of This Memo

24	   This Internet-Draft is submitted in full conformance with the
25	   provisions of BCP 78 and BCP 79.

27	   Internet-Drafts are working documents of the Internet Engineering
28	   Task Force (IETF).  Note that other groups may also distribute
29	   working documents as Internet-Drafts.  The list of current Internet-
30	   Drafts is at http://datatracker.ietf.org/drafts/current/.

32	   Internet-Drafts are draft documents valid for a maximum of six months
33	   and may be updated, replaced, or obsoleted by other documents at any
34	   time.  It is inappropriate to use Internet-Drafts as reference
35	   material or to cite them other than as "work in progress."

37	   This Internet-Draft will expire on July 18, 2016.

39	Copyright Notice

41	   Copyright (c) 2016 IETF Trust and the persons identified as the
42	   document authors.  All rights reserved.

44	   This document is subject to BCP 78 and the IETF Trust's Legal
45	   Provisions Relating to IETF Documents
46	   (http://trustee.ietf.org/license-info) in effect on the date of
47	   publication of this document.  Please review these documents
48	   carefully, as they describe your rights and restrictions with respect
49	   to this document.  Code Components extracted from this document must
50	   include Simplified BSD License text as described in Section 4.e of
51	   the Trust Legal Provisions and are provided without warranty as
52	   described in the Simplified BSD License.

54	Table of Contents

56	   1.  Introduction
57	     1.1.  Requirements Language
58	     1.2.  Terminology
59	     1.3.  Reference Model
60	   2.  Design Rationale
61	     2.1.  Segmentation Offload
62	     2.2.  Metadata
63	     2.3.  Context Information
64	     2.4.  Alignment
65	     2.5.  Equal Cost Multipath
66	     2.6.  Efficient Software Processing
67	   3.  Frame Formats
68	     3.1.  STT Frame Format
69	       3.1.1.  Handling non-TCP/IP and non-UDP/IP payloads
70	     3.2.  Usage of TCP Header by STT
71	     3.3.  Encapsulation of STT Segments in IP
72	       3.3.1.  Diffserv and ECN-Marking
73	       3.3.2.  Packet Loss
74	     3.4.  Broadcast and Multicast
75	   4.  Interoperability Issues
76	   5.  IANA Considerations
77	   6.  Security Considerations
78	   7.  Contributors
79	   8.  Acknowledgements
80	   9.  References
81	     9.1.  Normative References
82	     9.2.  Informative References
83	   Authors' Addresses

85	1.  Introduction

87	   Network Virtualization places unique requirements on tunneling
88	   protocols.  The utility of tunneling in virtualized data centers has
89	   been described elsewhere; see, for example [RFC7364], [VL2],
90	   [RFC7348], [RFC7637], [I-D.ietf-nvo3-geneve].  Tunneling allows a
91	   virtual overlay topology to be constructed on top of the physical
92	   data center network, and provides benefits such as:

94	   o  Ability to manage overlapping addresses between multiple tenants

95	   o  Decoupling of the virtual topology provided by the tunnels from
96	      the physical topology of the network

98	   o  Support for virtual machine mobility independent of the physical
99	      network

101	   o  Support for essentially unlimited numbers of virtual networks (in
102	      contrast to VLANs, for example)

104	   o  Decoupling of the network service provided to servers from the
105	      technology used in the physical network (e.g. providing an L2
106	      service over an L3 fabric)

108	   o  Isolating the physical network from the addressing of the virtual
109	      networks, thus avoiding issues such as MAC table size in physical
110	      switches.

112	   This draft describes STT (Stateless Transport Tunneling), a tunnel
113	   encapsulation that enables overlay networks to be built in
114	   virtualized data center networks, providing the benefits outlined
115	   above.  STT is particularly useful when some tunnel endpoints are in
116	   end-systems, as it utilizes the capabilities of standard network
117	   interface cards to improve performance.  Multiple independent
118	   implementations of STT exist and are in production use.

120	   STT is an IP-based encapsulation and utilizes a TCP-like header
121	   inside the IP header.  It is, however, stateless, i.e., there is no
122	   TCP connection state of any kind associated with the tunnel.  The
123	   TCP-like header is used for pragmatic reasons, to leverage the
124	   capabilities of existing network interface cards, but should not be
125	   interpreted as implying any sort of connection state between
126	   endpoints.

128	   STT is typically used to carry Ethernet frames between tunnel
129	   endpoints.  These frames may be considerably larger than the MTU of
130	   the physical network - up to 64KB.  Fields in the tunnel header are
131	   used to allow these large frames to be segmented at the entrance to
132	   the tunnel according to the MTU of the physical network and
133	   subsequently reassembled at the far end of the tunnel.

135	   Because STT uses TCP's header format and protocol number (6), some
136	   care needs to be taken in the deployment of STT.  Section 4 describes
137	   these deployment considerations.

139	1.1.  Requirements Language

141	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
142	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
143	   document are to be interpreted as described in RFC 2119 [RFC2119].

145	1.2.  Terminology

147	   The following terms are used in this document:

149	   Stateless Transport Tunneling (STT).  The tunneling mechanism defined
150	   in this document.  The name derives from the fact that the tunnel
151	   header resembles the TCP/IP headers (hence "transport" tunneling)
152	   while "stateless" refers to the fact that none of the normal TCP
153	   state (connection state, send and receive windows, congestion state
154	   etc.) is associated with the tunnel (as would be required if an
155	   actual TCP connection were used for tunneling).

157	   STT Frame.  The unit of data that is passed into the tunnel prior to
158	   segmentation and encapsulation.  This frame typically consists of an
159	   Ethernet frame and an STT Frame header.  These frames may be up to
160	   64KB in size.

162	   STT Segment.  The unit of data that is transmitted on the underlay
163	   network over which the tunnel operates.  An STT segment has headers
164	   that are syntactically the same as the TCP/IP headers, and typically
165	   contains part of an STT frame as the payload.  These segments must
166	   fit within the MTU of the physical network.

168	   Context ID.  A 64-bit field in the STT frame header that conveys
169	   information about the disposition of the STT frame between the tunnel
170	   endpoints.  One example use of the Context ID is to direct delivery
171	   of the STT frame payload to the appropriate virtual network or
172	   virtual machine.

174	   MSS.  Maximum Segment Size.  The maximum number of bytes that can be
175	   sent in one TCP segment [RFC0793].

177	   NIC.  Network Interface Card.

179	   TSO.  TCP Segmentation Offload.  A function provided by many
180	   commercial NICs that allows large data units to be passed to the NIC,
181	   the NIC being responsible for creating MSS-sized segments with
182	   correct TCP/IP headers.

184	   LRO.  Large Receive Offload.  The receive-side equivalent function of
185	   TSO, in which multiple TCP segments are coalesced into larger data
186	   units.

188	   VM.  Virtual Machine.

190	1.3.  Reference Model

192	   Our conceptual model for a virtualized network is shown in Figure 1.
193	   STT tunnels extend in this figure from one virtual switch to another,
194	   providing a virtual link between the switches over some arbitrary
195	   underlay.  More generally, STT tunnels operate between a pair of
196	   tunnel endpoints; these endpoints may be virtual switches, physical
197	   switches, or some other device (e.g. an appliance).  The STT tunnel
198	   provides a virtual point-to-point Ethernet link between the
199	   endpoints.  Frames are handed to the tunnel by some entity (e.g. a VM
200	   that is connected to a virtual switch in this picture) and first
201	   encapsulated with an STT Frame header.  STT Frames may then be
202	   fragmented in the NIC, and are encapsulated with a tunnel header (the
203	   STT segment header) for transmission over the underlay.  Note that
204	   other models are possible, e.g., where one or both tunnel endpoints
205	   are implemented in a physical switch.  In such cases the tunnel
206	   endpoint may forward packets to and from another link (physical or
207	   virtual) rather than to a VM.

209	      +----------------------+             +----------------------+
210	      | +--+   +-------+---+ |             | +---+-------+   +--+ |
211	      | |VM|---|       |   | |             | |   |       |---|VM| |
212	      | +--+   |Virtual|NIC|--- Underlay --- |NIC|Virtual|   +--+ |
213	      | +--+   |Switch |   | |   Network   | |   |Switch |   +--+ |
214	      | |VM|---|       |   | |             | |   |       |---|VM| |
215	      | +--+   +-------+---+ |             | +---+-------+   +--+ |
216	      +----------------------+             +----------------------+

218	                   ()===============================()
219	                        Switch-Switch tunnel

221	                       Figure 1: STT Reference Model

223	2.  Design Rationale

225	   We take as given the need for some form of tunneling to support the
226	   virtualization of the network as described in Section 1.  One might
227	   reasonably ask whether some existing tunneling protocol such as
228	   GRE [RFC2784] or L2TPv3 [RFC3931] might suffice.  In fact, [RFC7637]
229	   does just that, using GRE.  The primary motivation for STT as opposed
230	   to one of the existing tunneling methods is to improve the
231	   performance of data transfers from hosts that implement tunnel
232	   endpoints.  We expand on this rationale below.

234	2.1.  Segmentation Offload

236	   A large percentage of network interface cards (NICs) in use today are
237	   able to perform TCP segmentation offload (TSO).  When a NIC supports
238	   TSO, the host hands a large (greater than 1 TCP MSS) frame of data to
239	   the NIC along with a set of metadata which includes, among other
240	   things, the desired MSS, and various fields needed to complete the
241	   TCP header.  The NIC fragments the frame into MSS-sized segments,
242	   performs the TCP Checksum operation, and applies the appropriate
243	   headers (TCP, IP and MAC) to each segment.

245	   On the receive side, some NICs support the reassembly of TCP
246	   segments, a function referred to as large receive offload (LRO).  In
247	   this case, NICs attempt to reassemble TCP segments and pass larger
248	   aggregates of data to the host.  (Since TCP's service model is a byte
249	   stream, there is no higher level frame for the NIC to reassemble, but
250	   it can pass chunks of the stream larger than one MSS to the host.)
251	   The benefits to the host include fewer per-packet operations and
252	   larger data transfers between host and NIC, which amortize the
253	   per-transfer cost (such as interrupt processing) more efficiently.
254	   These savings can translate into significant performance gains for
255	   data transfer from the host to the network.

257	   STT is explicitly designed to leverage the TSO capabilities of
258	   currently available NICs.  While one might think of segmentation as a
259	   generic function, the majority of NICs are designed specifically to
260	   support TCP segmentation offload, as the details of the segmentation
261	   function are highly dependent on the specifics of TCP.  In order to
262	   leverage such capability, therefore, the STT segment header is
263	   syntactically identical to a valid TCP header.  However, we use some
264	   of the fields in the TCP header (specifically, sequence number and
265	   ACK number) to support the objectives of STT.  The details are
266	   described in Section 3.2.  In essence, we need the same set of
267	   information that IP datagrams carry when IP fragmentation takes
268	   place: a unique identifier for the frame that has been fragmented, an
269	   offset into that frame for the current fragment, and the length of
270	   the frame to be reassembled.  We fit these fields into the TCP header
271	   fields traditionally used for the SEQ and ACK numbers.  STT segments
272	   are transmitted as IP datagrams using the TCP protocol number (6).
273	   The primary means to recognize STT segments is the destination port
274	   number.  We discuss the interoperability impact of these design
275	   choices in Section 4.
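
   The fragmentation-style bookkeeping described above (a frame
   identifier, an offset, and a total frame length) can be sketched as
   follows.  This is an illustrative sketch only: the names Segment,
   segment_frame, and reassemble are hypothetical and not part of STT,
   and the three values are kept as plain integers rather than packed
   into the SEQ and ACK fields, whose layout is given in Section 3.2.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Segment:
    frame_id: int      # identical for every segment of one frame
    offset: int        # byte offset of this payload within the frame
    frame_length: int  # total length of the frame to be reassembled
    payload: bytes


def segment_frame(frame: bytes, frame_id: int, mss: int) -> List[Segment]:
    """Split a frame into MSS-sized pieces, as a TSO-capable NIC would."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    return [
        Segment(frame_id, off, len(frame), frame[off:off + mss])
        for off in range(0, len(frame), mss)
    ]


def reassemble(segments: Iterable[Segment]) -> Optional[bytes]:
    """Return the frame once every byte has arrived, otherwise None."""
    segs = list(segments)
    if not segs:
        return None
    total = segs[0].frame_length
    buf = bytearray(total)
    seen = set()
    received = 0
    for seg in segs:
        if seg.frame_id != segs[0].frame_id or seg.offset in seen:
            continue  # ignore other frames and duplicate segments
        seen.add(seg.offset)
        buf[seg.offset:seg.offset + len(seg.payload)] = seg.payload
        received += len(seg.payload)
    return bytes(buf) if received == total else None
```

   Because each segment carries its own offset and the total length,
   the receiver in this sketch needs no per-tunnel connection state and
   can accept segments in any order.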

277	   The net effect of using TSO is that the frame size that is sent by
278	   endpoints in the virtualized network can be much larger than the MTU
279	   of the underlying physical network.  The primary benefit of this is a
280	   significant performance gain when large amounts of data are being
281	   transferred between nodes in the virtual network.  A secondary effect
282	   is that the header of the STT frame is amortized across a larger
283	   amount of data, reducing the need to shrink the STT frame header to
284	   minimum size.

286	   Note that, while segmentation offload is the primary NIC function
287	   that STT takes advantage of, other NIC offload functions such as
288	   checksum calculation can also be leveraged.

290	2.2.  Metadata

292	   When a frame is delivered to the NIC that supports TSO for
293	   segmentation and transmission, a certain amount of metadata is
294	   typically passed along with it.  This includes the MSS and
295	   potentially a VLAN tag to be applied to the transmitted packets.

297	   In some virtualized network deployments, an STT frame may traverse a
298	   tunnel, be received and reassembled at an STT endpoint, and then be
299	   sent on another physical interface.  In such cases, the tunnel
300	   terminating endpoint may need to pass metadata to a NIC to enable
301	   transmission of frames on the physical link.  For this reason,
302	   appropriate metadata is carried in the STT frame header.

304	2.3.  Context Information

306	   When an STT Frame is received by a tunnel endpoint, it needs to be
307	   directed to the appropriate entity in the virtualized network to
308	   which it belongs.  For this reason, a Context ID is required in the
309	   STT frame header.  Some other encapsulations (e.g.  [RFC7348],
310	   [RFC7637]) use an explicit tenant network identifier or virtual
311	   network identifier.  The Context Identifier can be thought of as a
312	   generalized form of virtual network identifier.  Using a larger and
313	   more general identifier allows for a broader range of service models
314	   and allows ample room for future expansion.  There is little downside
315	   to using a larger field here because it is amortized across the
316	   entire STT Frame rather than being present in each packet.

318	2.4.  Alignment

320	   Software implementations of tunnel endpoints benefit from 32-bit
321	   alignment of the data to be manipulated.  Because the Ethernet header
322	   is not a multiple of 32 bits (it is 14 bytes), 2 bytes of padding are
323	   added to the STT header, causing the payload beyond the encapsulated
324	   Ethernet header, which typically includes the IP header of the
325	   encapsulated frame, to be 32-bit aligned.
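
   The arithmetic behind the padding can be checked with a few lines
   of code.  The field sizes are taken from Figure 3; the constant and
   function names are hypothetical, and the sketch assumes that the
   STT frame header itself begins on a 32-bit boundary.

```python
# Sizes in bytes, from Figure 3: Version, Flags, L4 Offset, Reserved,
# Max. Segment Size, PCP/V/VLAN ID, Context ID.
STT_FIXED_FIELDS = 1 + 1 + 1 + 1 + 2 + 2 + 8
STT_PADDING = 2
ETHERNET_HEADER = 14


def inner_payload_offset(padding: int) -> int:
    """Offset from the start of the STT frame header to the first byte
    after the encapsulated Ethernet header."""
    return STT_FIXED_FIELDS + padding + ETHERNET_HEADER


without_padding = inner_payload_offset(0)          # 30 bytes
with_padding = inner_payload_offset(STT_PADDING)   # 32 bytes
```

   Without padding the inner payload would start 30 bytes in, which is
   not a multiple of 4; with the 2 bytes of padding it starts at 32.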

327	2.5.  Equal Cost Multipath

329	   It is essential that traffic passing through the physical network can
330	   be efficiently distributed across multiple paths.  Standard equal
331	   cost multipath (ECMP) techniques involve hashing on address and port
332	   numbers in the outer protocol headers.  There are two main issues to
333	   address with ECMP.  First, it is important that, when a set of
334	   packets belong to a single flow (e.g. a TCP connection in the virtual
335	   network), all those packets should follow the same path.  Second, all
336	   paths should be used efficiently, i.e. there needs to be sufficient
337	   entropy among the different flows to ensure they get distributed
338	   evenly across multiple paths.

340	   STT achieves the first goal by ensuring that the source and
341	   destination ports and addresses in the outer header are all the same
342	   for a single flow.  The second goal is achieved by generating the
343	   source port using a random hash of fields in the headers of the inner
344	   packets, e.g. the ports and addresses of the virtual flow's packets.
345	   We provide more details on the usage of port numbers in Section 3.2.
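
   One way to derive such a source port is sketched below.  The choice
   of SHA-256 as the hash, the function name stt_source_port, and the
   restriction to the port range 49152-65535 are assumptions made for
   illustration; this section does not mandate any of them.

```python
import hashlib
import ipaddress
import struct


def stt_source_port(src_ip: str, dst_ip: str,
                    src_port: int, dst_port: int,
                    protocol: int = 6) -> int:
    """Map the inner flow's addresses and ports to an outer source port.

    The same inner flow always yields the same port, so all of its
    packets hash to the same path; different flows spread across ports.
    """
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HHB", src_port, dst_port, protocol)
    )
    digest = hashlib.sha256(key).digest()
    value = int.from_bytes(digest[:2], "big")
    # Assumption: confine the result to the 16384 ports from 49152 up.
    return 49152 + (value % 16384)
```

   Any hash with good mixing properties would serve equally well; the
   only requirements drawn from the text above are determinism per
   flow and a roughly even spread across flows.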

347	2.6.  Efficient Software Processing

349	   The design of STT is largely motivated by the desire to tunnel
350	   packets efficiently between virtual switches running in software.  In
351	   addition to the points noted above, this leads to some design
352	   optimizations to simplify processing of packets, such as the use of
353	   an "L4 offset" field in the STT header to enable the payload to be
354	   located quickly without extensive header parsing.

356	3.  Frame Formats

358	   STT encapsulates data payloads of up to 64KB (limited by the length
359	   field in the STT segment header, described in Section 3.2).  Those
360	   frames are then handed to the NIC, which segments them to an
361	   appropriate size given the MTU of the underlying physical network,
362	   and encapsulates the resulting segments in a standard TCP header,
363	   which in turn is encapsulated by an IP header and finally a MAC
364	   header.  This is illustrated in Figure 2.

366	                      +-----------+    +----------+     +----------+
367	                      | IP Header |    |IP Header |     |IP header |
368	   +-----------+      +-----------+    +----------+     +----------+
369	   |STT Frame  |      |TCP-like   |    |TCP-like  |     |TCP-like  |
370	   | Header    |      | header    |    | header   |     | header   |
371	   +-----------+      +-----------+    +----------+     +----------+
372	   |           | ---> | STT Frame |    |Next part | ... |Last part |
373	   |Payload    |      |  Header   |    |of Payload|     |of Payload|
374	   .           .      +-----------+    |          |     |          |
375	   .           .      |           |    |          |     |          |
376	   .           .      |  Start of |    |          |     |          |
377	   +-----------+      |  Payload  |    |          |     +----------+
378	                      +-----------+    +----------+

380	   Original data            STT Frame is segmented by the NIC and
381	   frame is encapped       transmitted as a set of TCP segments (MAC
382	   with STT Header                  headers not shown)

384	              Figure 2: STT Frame Fragments and Encapsulation

386	   The STT Frame header and the usage of the TCP-like header are
387	   described in detail below.  The TCP segments shown in
388	   Figure 2 are of course further encapsulated as IP datagrams, and may
389	   be sent as either IPv4 or IPv6.  The resulting IP datagrams are then
390	   transmitted in the appropriate MAC level frame (e.g.  Ethernet, not
391	   shown in the figure) for the underlying physical network over which
392	   the tunnels are established.

394	3.1.  STT Frame Format

396	   Figure 3 illustrates the header of an STT frame before it is
397	   segmented.

399	    0                   1                   2                   3
400	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
401	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
402	   |  Version      | Flags         |  L4 Offset    |  Reserved     |
403	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
404	   |    Max. Segment Size          | PCP |V|     VLAN ID           |
405	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
406	   |                                                               |
407	   +                     Context ID (64 bits)                      +
408	   |                                                               |
409	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
410	   |     Padding                   |    data                       |
411	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
412	   |                                                               |

414	                        Figure 3: STT Frame Format

416	   The STT frame header contains the following fields:

418	   o  Version - currently 0.  If a non-zero version field is received by
419	      an implementation that supports only version zero, the frame MUST
420	      be discarded.

422	   o  Flags - describes encapsulated packet, see below.

424	   o  L4 offset - offset in bytes from the end of the STT Frame header
425	      to the start of the encapsulated layer 4 (TCP/UDP) header.  If the
426	      encapsulated packet is not IPv4 or IPv6, this field SHOULD be set
427	      to zero.

429	   o  Reserved field - MUST be zero on transmission and ignored on
430	      receipt.

432	   o  Max Segment Size - the segment size (the negotiated MSS in the
433	      case of TCP) that should be used by a tunnel endpoint that is
434	      transmitting this frame onto another network.  MUST be zero if
435	      segmentation offload is not in use.

437	   o  PCP - the 3-bit Priority Code Point field that should be applied
438	      to this packet by an STT tunnel endpoint on transmission to
439	      another network (see Section 2.2).  Meaningful only if the V bit
440	      is set.

442	   o  V - a one bit flag that, if set, indicates the presence of a valid
443	      VLAN ID in the following field and valid PCP in the preceding
444	      field.  When this flag is set, an 802.1Q header will be applied to
445	      the packet by the STT tunnel endpoint on transmission.  The TPID
446	      will be 0x8100.

448	   o  VLAN ID - 12-bit VLAN tag that should be applied to this packet by
449	      an STT tunnel endpoint on transmission to another network (see
450	      Section 2.2).  Any valid VLAN ID (including zero) may be used.
451	      Meaningful only if the V bit is set.

453	   o  Context ID - 64 bits of context information, described in detail
454	      in Section 2.3.

456	   o  Padding - 16 bits as described in Section 2.4.  MUST be set to
457	      zero on transmission and ignored on receipt.

459	   The flags field is an 8-bit field organized as follows:

461	            0 1 2 3 4 5 6 7
462	           +-+-+-+-+-+-+-+-+
463	           |C|P|V|T| Res.  |
464	           +-+-+-+-+-+-+-+-+

466	                            Figure 4: STT Flags

468	   The meanings of the flags are as follows:

470	   o  C: Checksum verified.  Set if the checksum of the encapsulated
471	      packet has been verified by the sender.

473	   o  P: Checksum partial.  Set if the checksum in the encapsulated
474	      packet has been computed only over the TCP/IP pseudoheader (or
475	      UDP/IP pseudoheader, if the encapsulated packet is UDP).  This bit
476	      MUST be set if segmentation offload is used by the sender.  Note
477	      that bit 0 and bit 1 cannot both be set in the same header.

479	   o  V: IP version.  Set if the encapsulated packet is IPv4, not set if
480	      the packet is IPv6.  See below for discussion of non-IP payloads.

482	   o  T: TCP payload.  Set if the encapsulated packet is TCP.

484	   o  Bits 4 through 7 are reserved and MUST be zero on transmission and
485	      ignored on receipt.

487	   As noted above, several of these fields are present primarily to
488	   enable efficient processing of the packet when it is received at a
489	   tunnel endpoint.  (For example, it's entirely possible to determine
490	   if the packet is IPv4 or IPv6 by looking at the Ethernet header -
491	   it's just more efficient not to have to do so.)
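
   The header layout in Figures 3 and 4 can be expressed as a short
   packing and parsing sketch.  The function and constant names are
   hypothetical, and the flag masks assume the usual convention that
   bit 0 in the figures is the most significant bit of the octet.

```python
import struct

# Version, Flags, L4 Offset, Reserved, Max. Segment Size,
# PCP/V/VLAN ID, Context ID, Padding: 18 bytes in network byte order.
_STT_HEADER = struct.Struct("!BBBBHHQH")

FLAG_C = 0x80       # checksum verified
FLAG_P = 0x40       # checksum partial
FLAG_V_IPV4 = 0x20  # set for IPv4, clear for IPv6
FLAG_T = 0x10       # TCP payload
_RESERVED_FLAGS = 0x0F


def pack_stt_header(flags, l4_offset, mss, context_id,
                    pcp=0, vlan_valid=False, vlan_id=0):
    """Build a version 0 STT frame header."""
    if flags & _RESERVED_FLAGS:
        raise ValueError("reserved flag bits must be zero")
    if flags & FLAG_C and flags & FLAG_P:
        raise ValueError("C and P cannot both be set")
    if not (0 <= pcp <= 7 and 0 <= vlan_id <= 0xFFF):
        raise ValueError("PCP is 3 bits and VLAN ID is 12 bits")
    tci = (pcp << 13) | (int(vlan_valid) << 12) | vlan_id
    # Version, Reserved and Padding are all sent as zero.
    return _STT_HEADER.pack(0, flags, l4_offset, 0, mss, tci,
                            context_id, 0)


def unpack_stt_header(data):
    """Parse an STT frame header; Reserved and Padding are ignored."""
    (version, flags, l4_offset, _reserved, mss, tci,
     context_id, _padding) = _STT_HEADER.unpack_from(data)
    if version != 0:
        raise ValueError("unsupported version: frame must be discarded")
    return {
        "flags": flags & ~_RESERVED_FLAGS & 0xFF,
        "l4_offset": l4_offset,
        "mss": mss,
        "pcp": tci >> 13,
        "vlan_valid": bool(tci & 0x1000),
        "vlan_id": tci & 0x0FFF,
        "context_id": context_id,
    }
```

   A receiver using this sketch reads the IP version and L4 protocol
   directly from the flags octet and does not need to parse the
   encapsulated Ethernet and IP headers to find them.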

493	   The payload of the STT frame is an untagged Ethernet frame.

495	   In the case where the Ethernet frame contains TCP/IP or UDP/IP as its
496	   payload, this encapsulated packet should be correctly formatted as if
497	   it were about to undergo unfragmented transmission (even though it
498	   will ultimately be segmented as part of the transmission process).
499	   This means it should have a correct TCP or UDP checksum (possibly
500	   "partial", as noted above), correct length fields for its
501	   unfragmented state, and correct IP header checksum (if IPv4).

503	   If the length of the payload to be encapsulated exceeds 64KB, or if
504	   the offset to the L4 header exceeds 255 bytes, then it will not be
505	   possible to offload the packet to the NIC for segmentation.  In this
506	   case, the payload needs to be segmented and checksummed before being
507	   encapsulated in STT frames.

509	   Because there is no negotiation between end-points of an STT tunnel,
510	   only basic TSO capabilities should be assumed.  For example, ECN
511	   (explicit congestion notification) support should not be assumed, so
512	   TSO should not be requested for packets requiring such support.
513	   Instead, such payloads should be segmented before being encapsulated
514	   in STT frames.
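
   The two preceding paragraphs amount to a simple decision made by
   the sender before handing a payload to the NIC.  The sketch below
   is illustrative: the function name and the needs_ecn parameter are
   hypothetical, and "64KB" is read here as 65536 bytes.

```python
MAX_STT_PAYLOAD = 64 * 1024  # assumption: "64KB" taken as 65536 bytes
MAX_L4_OFFSET = 255          # L4 Offset is an 8-bit field


def can_offload_to_nic(payload_len: int, l4_offset: int,
                       needs_ecn: bool = False) -> bool:
    """Return True if segmentation may be left to the NIC.

    When this returns False, the payload has to be segmented (and
    checksummed) in software before STT encapsulation.
    """
    if payload_len > MAX_STT_PAYLOAD:
        return False
    if l4_offset > MAX_L4_OFFSET:
        return False
    if needs_ecn:
        # Only basic TSO capabilities are assumed at the far end.
        return False
    return True
```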

516	3.1.1.  Handling non-TCP/IP and non-UDP/IP payloads

518	   Note that the STT header does not have a general "protocol" field to
519	   allow the efficient processing of arbitrary payloads.  The current
520	   version is designed to provide a virtual Ethernet link, and hence
521	   efficiently supports only Ethernet frames as the payload.  The
522	   Ethernet header itself contains a protocol field, which then
523	   identifies the higher layer protocol, so it is straightforward to
524	   accommodate non-IP traffic.  Note however that offloading support
525	   will not typically be available for traffic other than the following:
526	   TCP and UDP over IPv4 or IPv6, with a maximum of a single VLAN tag
527	   stored in the STT header.  Other protocols will need to be
528	   appropriately formatted for direct transmission prior to
529	   encapsulation.

531	   It will be noted that the STT Frame header does contain fields that
532	   are intended to assist in efficient processing of IPv4 and IPv6
533	   packets.  For packets not being offloaded, these fields MUST be set
534	   to zero on transmission and ignored on receipt.

536	   The use of STT to carry payloads other than Ethernet is theoretically
537	   possible but is beyond the scope of this document.

539	3.2.  Usage of TCP Header by STT

541	   Figure 5 illustrates the usage of the TCP header by STT.  This figure
542	   is essentially identical to that in [RFC0793] with the exception that
543	   we denote with an asterisk (*) two fields that are used by STT to
544	   convey something other than the information that is conveyed by TCP.
545	   Syntactically, STT segments look identical to TCP segments.  However,
546	   STT tunnel endpoints treat the Sequence number and Acknowledgment
547	   number differently than TCP endpoints treat those fields.
548	   Furthermore, as noted above, there is no TCP state machine associated
549	   with an STT tunnel.

551	    0                   1                   2                   3
552	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
553	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
554	   |          Source Port          |       Destination Port        |
555	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
556	   |                        Sequence Number(*)                     |
557	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
558	   |                    Acknowledgment Number(*)                   |
559	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
560	   |  Data |           |U|A|P|R|S|F|                               |
561	   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
562	   |       |           |G|K|H|T|N|N|                               |
563	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
564	   |           Checksum            |         Urgent Pointer        |
565	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
566	   |                    Options                    |    Padding    |
567	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
568	   |                             data                              |
569	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

571	                       Figure 5: STT Segment Format

573	   The Destination port, assigned by IANA, is 7471.

575	   In order to allow correct reassembly of the STT frame, the source
576	   port MUST be constant for all segments of a single STT frame.

578	   As noted above (Section 2.5) the source port SHOULD be the same for
579	   all frames that belong to a single flow in the virtual network, e.g.
580	   a single TCP connection.

582	   Also, to encourage efficient distribution of traffic among multiple
583	   paths when ECMP is used, the method to calculate the source port
584	   should provide a random distribution of source port numbers.  An
585	   example mechanism would be a hash over the addresses and ports in
586	   the IP and TCP headers of the flow in the virtual network.
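
   As a non-normative illustration, the source port selection described
   above might be sketched as follows.  The function name, the use of
   SHA-256, and the restriction to the 49152-65535 range are assumptions
   made for this example only; any hash that spreads flows evenly and
   maps a given flow to a stable value would serve.

```python
import hashlib

# Assumption for this sketch: source ports are drawn from the
# dynamic/ephemeral range. The text above does not mandate a range.
PORT_MIN, PORT_MAX = 49152, 65535


def stt_source_port(src_ip: str, dst_ip: str, proto: int,
                    src_port: int, dst_port: int) -> int:
    """Map an inner flow's 5-tuple to a stable outer source port."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    h = int.from_bytes(digest[:4], "big")
    # Same flow -> same port; different flows spread across the range.
    return PORT_MIN + h % (PORT_MAX - PORT_MIN + 1)
```

   Because the port depends only on the inner flow, every segment of
   every frame in that flow carries the same source port, which
   satisfies both the MUST and the SHOULD above.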

588	   The Sequence number and Acknowledgment number fields are re-purposed
589	   in a way that does not confuse NICs that expect them to be used in
590	   the conventional manner.  The ACK field is used as a packet
591	   identifier for the purposes of fragmentation, equivalent in function
592	   to the Identification field of IPv4 or the IPv6 Fragment header: it
593	   MUST be constant for all STT segments of a given frame, and different
594	   from any value used recently for other STT frames sent over this
595	   tunnel.  ("Recent" in this context means a long enough interval that
596	   packets from the frame that last used this value of the ACK field
597	   should have all been delivered.  Similar considerations apply to the
598	   reuse of the IP Fragment Identifier, discussed in [RFC6864], but note
599	   that packet lifetimes in a data center are likely to be relatively
600	   short.)

602	   The upper 16 bits of the SEQ field are used to convey the length
603	   of the STT frame in bytes.  The lower 16 bits of the SEQ field are
604	   used to convey the offset (in bytes) of the current fragment within
605	   the larger STT frame.  The task of updating the SEQ field on each
606	   transmitted segment is the responsibility of the NIC.
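
   The SEQ field layout just described can be illustrated with a short,
   non-normative sketch.  The helper names pack_seq and unpack_seq are
   hypothetical and are not part of STT.

```python
from typing import Tuple


def pack_seq(frame_len: int, offset: int) -> int:
    """Build the 32-bit SEQ value: frame length high, fragment offset low."""
    if not 0 <= frame_len <= 0xFFFF:
        raise ValueError("frame length must fit in 16 bits")
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset must fit in 16 bits")
    return (frame_len << 16) | offset


def unpack_seq(seq: int) -> Tuple[int, int]:
    """Return (frame_len, offset) from a 32-bit SEQ value."""
    return (seq >> 16) & 0xFFFF, seq & 0xFFFF
```

   For example, a fragment at byte offset 1448 of a 3000-byte frame
   carries SEQ 0x0BB805A8.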

608	   Reassembly of the fragments may be done partially by NICs that
609	   perform LRO, since the sequence numbers of a frame's segments
610	   increment appropriately.  That is, the upper 16 bits don't change,
611	   and the lower 16 bits increment by N for every N-byte segment that
612	   is transmitted, just as would be the case if an actual sequence
613	   number were being sent.  Note that the size limit of an STT frame
614	   ensures that sequence numbers cannot wrap while sending the segments
615	   of a single STT frame.
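
   The following non-normative sketch shows, in software, the SEQ
   values that segmentation produces.  In practice this work is done by
   the NIC; the function name segment_frame and the mss parameter are
   assumptions of the example.

```python
from typing import Iterator, Tuple


def segment_frame(frame: bytes, mss: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (seq, payload) for each segment of one STT frame."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    if len(frame) > 0xFFFF:
        raise ValueError("frame length must fit in 16 bits")
    for offset in range(0, len(frame), mss):
        chunk = frame[offset:offset + mss]
        # Upper 16 bits stay fixed; lower 16 bits advance by len(chunk).
        yield (len(frame) << 16) | offset, chunk
```

   Since every offset is smaller than the frame length, and the frame
   length fits in 16 bits, the lower half of SEQ never carries into the
   upper half.  Consecutive SEQ values therefore differ by exactly the
   number of payload bytes in the earlier segment, as a TCP sequence
   number would.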

617	   In the event that a NIC does not consider the ACK field when merging
618	   received packets, LRO MUST be disabled for this NIC when using STT.
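
   A minimal, non-normative sketch of software reassembly follows.  It
   assumes that fragments are grouped by source address, source port
   and ACK value; that choice of key, the class name, and the absence
   of any timeout for stale entries are simplifications made for the
   example.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, int, int]  # (source IP, source port, ACK value)


class Reassembler:
    def __init__(self) -> None:
        # key -> {fragment offset: payload bytes}
        self._pending: Dict[Key, Dict[int, bytes]] = {}

    def add(self, src_ip: str, src_port: int, seq: int, ack: int,
            payload: bytes) -> Optional[bytes]:
        """Store one fragment; return the whole frame once complete."""
        frame_len, offset = (seq >> 16) & 0xFFFF, seq & 0xFFFF
        if not payload or offset + len(payload) > frame_len:
            return None  # malformed fragment, ignored in this sketch
        key = (src_ip, src_port, ack)
        frags = self._pending.setdefault(key, {})
        frags[offset] = payload
        # Walk forward from offset 0 while fragments are contiguous.
        data = bytearray()
        while len(data) < frame_len and len(data) in frags:
            data += frags[len(data)]
        if len(data) != frame_len:
            return None
        del self._pending[key]
        return bytes(data)
```

   Fragments that carry a different ACK value are never merged, which
   is the property a NIC must also preserve if LRO is left enabled.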

620	   All the fields after ACK have their conventional meaning, although
621	   nothing will be done with the Window or Urgent pointer values.  Those
622	   fields SHOULD be zero on transmit and ignored on receipt.  It is
623	   RECOMMENDED that the PSH (Push) flag be set when transmitting the
624	   last segment of a frame in order to cause data to be delivered by the
625	   NIC without waiting for other fragments.  The ACK flag SHOULD be set
626	   to ensure that a receiving NIC examines the ACK field.  All other
627	   flags SHOULD be zero on transmit and ignored on receipt.
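
   Putting the field usage of this section together, a 20-byte header
   without options could be built as in the following non-normative
   sketch.  The function and parameter names are hypothetical, and the
   checksum is left at zero on the assumption that it is filled in
   later (for example by checksum offload).

```python
import struct

STT_DST_PORT = 7471   # IANA-assigned destination port
TCP_FLAG_PSH = 0x08
TCP_FLAG_ACK = 0x10


def build_stt_tcp_header(src_port: int, frame_len: int, offset: int,
                         frame_id: int, is_last: bool) -> bytes:
    """Build the TCP-format header for one STT segment (no options)."""
    seq = (frame_len << 16) | offset
    # ACK flag always set; PSH only on the last segment of the frame.
    flags = TCP_FLAG_ACK | (TCP_FLAG_PSH if is_last else 0)
    offset_and_flags = (5 << 12) | flags  # data offset = 5 words
    return struct.pack(
        "!HHIIHHHH",
        src_port, STT_DST_PORT,
        seq,                # Sequence Number(*): length and offset
        frame_id,           # Acknowledgment Number(*): frame identifier
        offset_and_flags,
        0,                  # Window: zero on transmit
        0,                  # Checksum: placeholder in this sketch
        0,                  # Urgent Pointer: zero on transmit
    )
```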

629	3.3.  Encapsulation of STT Segments in IP

631	   From the perspective of IP, an STT segment is just like any other TCP
632	   segment.  The protocol number (IPv4) or Next Header (IPv6) has the
633	   value 6, as for regular TCP.  The resulting IP datagram is then
634	   encapsulated in the appropriate L2 header (e.g. Ethernet) for
635	   transmission on the physical medium.

637	3.3.1.  Diffserv and ECN-Marking

639	   When traffic is encapsulated in a tunnel header, there are numerous
640	   options as to how the Diffserv Code-Point (DSCP) and ECN markings are
641	   set in the outer header and propagated to the inner header on
642	   decapsulation.

644	   [RFC2983] defines two modes for mapping the DSCP markings from inner
645	   to outer headers and vice versa.  The Uniform model copies the inner
646	   DSCP marking to the outer header on tunnel ingress, and copies that
647	   outer header value back to the inner header at tunnel egress.  The
648	   Pipe model sets the DSCP value to some value based on local policy at
649	   ingress and does not modify the inner header on egress.  Both models
650	   SHOULD be supported by STT endpoints.  However, there is an
651	   additional complexity with the uniform model for STT, because a
652	   single IP datagram that is transmitted over the tunnel appears as
653	   multiple IP datagrams on the wire.  Thus it is not guaranteed that
654	   all segments of the STT frame will have the same DSCP at egress.  If
655	   uniform model behavior is configured, it is RECOMMENDED that the DSCP
656	   of the first segment of the STT frame be used to set the DSCP value
657	   of the IP header in the decapsulated STT frame.
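
   The egress DSCP choice can be sketched as follows (non-normative).
   The function name and the string labels for the two models are
   assumptions, as is the convention that segment_dscps is ordered by
   fragment offset so that its first element belongs to the first
   segment of the frame.

```python
from typing import Sequence


def egress_dscp(model: str, inner_dscp: int,
                segment_dscps: Sequence[int]) -> int:
    """Return the DSCP to place in the decapsulated packet's IP header."""
    if model == "pipe":
        # Pipe model: the inner header is left unmodified on egress.
        return inner_dscp
    if model == "uniform":
        # Uniform model: take the outer DSCP of the first segment.
        return segment_dscps[0]
    raise ValueError(f"unknown tunnel model: {model}")
```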

659	   [RFC6040] describes the correct ECN behavior for any type of IP in IP
660	   tunnel, and this behavior SHOULD be followed for STT tunnels.  As
661	   with the Uniform Diffserv tunnel model, the fact that one inner IP
662	   datagram is segmented into multiple outer datagrams makes the
663	   situation slightly more complex.  It is RECOMMENDED that, if any
664	   segment of the received STT frame has the CE (congestion experienced)
665	   bit set in its IP header, the CE bit be set in the IP header of the
666	   decapsulated STT frame.
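
   A non-normative sketch of the CE aggregation across segments
   follows.  It covers only the "any segment" rule above; how the
   result is combined with the inner ECN field is governed by
   [RFC6040] and is not shown.  The function name is hypothetical.

```python
from typing import Iterable

ECN_MASK = 0b11
ECN_CE = 0b11  # Congestion Experienced codepoint


def frame_ce_seen(segment_tos_bytes: Iterable[int]) -> bool:
    """True if any outer header of the frame's segments was CE-marked.

    Each element is the IPv4 TOS or IPv6 Traffic Class octet of one
    received segment; the ECN field is its two low-order bits.
    """
    return any((tos & ECN_MASK) == ECN_CE for tos in segment_tos_bytes)
```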

668	3.3.2.  Packet Loss

670	   Individual IP datagrams may be dropped (most often due to congestion)
671	   and, since there is no acknowledgment or reliable delivery of these
672	   datagrams, there is the potential to corrupt an entire STT Frame due
673	   to the loss of a single IP datagram.  The negative consequences of
674	   such partial losses have been known for many years (see, for example,
675	   [KM87]).  Fortunately, there are solutions to this problem in the
676	   case where the higher layer protocol running over STT is TCP.  An STT
677	   receiving endpoint running in an end-system, as shown in Figure 1 for
678	   example, is not required to deliver complete STT frames to the TCP
679	   stack in the receiving VM.  A partial frame payload can be delivered
680	   and the receiving TCP stack can deal with the missing bytes just as
681	   it would if running directly over a physical network.  That is, TCP
682	   in the VM can send ACKs for the contiguous bytes received to trigger
683	   retransmission of the missing bytes by the sender.  This is similar
684	   to the operation of LRO in current NICs.  There are some subtleties
685	   to making this work correctly in the STT context, and it does depend
686	   on the STT endpoint being aware of the higher layer protocols
687	   consuming data in the VM to which it is connected.  The main point of
688	   this discussion is that, in the common deployments of STT running in
689	   a virtual switch, the potential harm of losing individual packets is
690	   not as serious as it might first appear.
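
   As a non-normative illustration of partial delivery, the following
   sketch computes which byte ranges of a frame have arrived.  The
   function name is hypothetical.  How an endpoint turns these ranges
   into data handed to the VM depends on the higher layer protocol and
   is outside the scope of the example.

```python
from typing import List, Tuple


def received_ranges(fragments: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge (offset, length) fragments into sorted [start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for start, length in sorted(fragments):
        end = start + length
        if ranges and start <= ranges[-1][1]:
            # Adjacent or overlapping: extend the previous range.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges
```

   If the middle segment of a three-segment, 3000-byte frame is lost,
   the result is two ranges with a gap between them.  The receiving TCP
   stack can acknowledge the first range and rely on the sender to
   retransmit the missing bytes.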

692	3.4.  Broadcast and Multicast

694	   It is possible to establish point-to-multipoint STT tunnels by using
695	   an IP multicast address as the destination address of the tunnel.
696	   These may be used for broadcast or multicast traffic if the
697	   underlying physical network supports IP multicast.  Control
698	   mechanisms for setting up such multicast groups are beyond the scope
699	   of this document.  It is worth repeating that, despite the syntactic
700	   resemblance between the STT segment header and the TCP header, there
701	   is no TCP state machine associated with an STT tunnel, so the
702	   traditional issues of combining multicast with TCP (or reliable
703	   transports more generally) do not arise.

705	4.  Interoperability Issues

707	   It will be noted that an STT packet on the wire appears exactly the
708	   same as a TCP packet, but that processing of an STT packet on
709	   reception is entirely different from TCP - no three-way handshake to
710	   establish a connection, no ACKs, retransmission, etc.  Hence, an STT
711	   tunnel endpoint clearly needs to be configured to behave in the
712	   correct manner rather than to perform standard TCP processing on the
713	   packet.  The primary way to recognize an STT segment is the
714	   destination port number in the TCP header.  In the event that an STT
715	   packet is inadvertently delivered to a device that is not configured
716	   to behave as an STT tunnel endpoint, no TCP connection will be
717	   established and STT packets will be dropped.
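
   The port-based recognition described above might look as follows
   for an IPv4 packet (non-normative).  The function name is
   hypothetical, and the sketch deliberately ignores IP fragments other
   than the first, since they carry no TCP header.

```python
import struct

STT_DST_PORT = 7471
IPPROTO_TCP = 6


def is_stt_ipv4(packet: bytes) -> bool:
    """Classify a raw IPv4 packet as STT by protocol and TCP dest port."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return False
    ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or packet[9] != IPPROTO_TCP or len(packet) < ihl + 4:
        return False
    frag_offset = struct.unpack_from("!H", packet, 6)[0] & 0x1FFF
    if frag_offset != 0:
        return False  # not the first IP fragment: no TCP header here
    _, dst_port = struct.unpack_from("!HH", packet, ihl)
    return dst_port == STT_DST_PORT
```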

719	   Being stateless, STT does not provide any sort of congestion control.
720	   In this sense it is equivalent to other tunneling protocols such as
721	   GRE.  The assumption is that congestion control, if required, is
722	   provided by higher layers (e.g. a real TCP connection generating the
723	   payloads of STT frames), just as in any other tunneling protocol.

725	   STT deployments are almost entirely limited at present to intra-data
726	   center environments.  In these environments, STT tunnels between
727	   pairs of endpoints are typically created by some sort of network
728	   virtualization controller.  STT packets should therefore remain
729	   within the perimeter of the overlay that is managed by that
730	   controller.  In the event of some misconfiguration or erroneous
731	   controller behavior, STT packets could be sent outside of this
732	   controlled domain into the broader Internet.  As noted above, any
733	   endpoint that is not expecting STT packets will drop them, as they
734	   will appear to belong to an unestablished TCP session.  Many
735	   firewalls are also likely to drop erroneously sent STT packets for
736	   the same reason.

738	   Within a network virtualization overlay, there may be middle boxes
739	   (e.g. firewalls) that process TCP.  It is likely that, in the near
740	   term at least, such devices will drop STT packets, as there will be
741	   no TCP connection state established.  This could prevent the correct
742	   operation of the overlay.  This is clearly undesirable, but it is a
743	   general issue with any form of tunneling - the nature of many middle
744	   boxes is that they will not permit tunnels to pass through them.
745	   Hence the best solution is simply to avoid deploying middle boxes at
746	   locations where STT tunnels (or other forms of tunnels for network
747	   virtualization) will need to pass through them.  This will not,
748	   however, always be feasible, especially when virtualized networks
749	   extend among multiple data centers.  Other solutions include
750	   configuring the middle boxes to permit TCP packets to pass through
751	   when the port number matches the port assigned for STT.  In this case
752	   the middle boxes would have to permit the packets to pass in spite of
753	   the lack of an established TCP connection and the repurposing of the
754	   SEQ and ACK fields.

756	   In the longer term, we might reasonably expect that middle boxes
757	   would be able to recognize STT traffic, and to terminate and
758	   originate STT tunnels if necessary (e.g. to perform functions that
759	   require the STT payload to be inspected such as stateful
760	   firewalling).

762	   It is also possible to provide all the functionality of STT using a
763	   different IP protocol number (or next header value in IPv6).  This
764	   approach could make sense in the long run but will typically not
765	   enable current NIC hardware to be leveraged for TSO and LRO
766	   functions.  An alternative approach is to move to a UDP-based
767	   encapsulation such as Geneve [I-D.ietf-nvo3-geneve].  This, too,
768	   requires NICs to evolve to support TSO and LRO on tunneled traffic.

770	   It is also possible to run STT traffic over other forms of tunnel
771	   (GRE, IPsec, etc.) in which case the STT traffic can pass through
772	   appropriately configured middle boxes.

774	5.  IANA Considerations

776	   IANA has allocated TCP port 7471 for STT.  This document makes no
777	   further request of IANA.

779	6.  Security Considerations

781	   In the physical network, STT packets are simply IP datagrams, and do
782	   not introduce new security issues.  Most standard IP security
783	   mechanisms (such as IPsec encryption or authentication) can be
784	   implemented on STT packets if desired.  As noted above, however,
785	   tunneling generally interacts poorly with middle boxes, and STT is no
786	   exception.  Devices such as firewalls are likely to drop STT traffic
787	   unless the capability to recognize STT packets is implemented, or
788	   unless the STT traffic is itself run over some sort of tunnel that
789	   the firewall is configured to permit.  Intrusion detection systems
790	   would similarly need to be enhanced to be able to look inside STT
791	   packets.

793	   It should also be noted that while STT packets resemble TCP segments,
794	   the lack of a TCP state machine means that TCP-related security
795	   issues (e.g. SYN-flooding) do not apply.  Similarly, some of the
796	   benefits of the TCP state machine (e.g. the ability to discard
797	   packets with unexpected sequence numbers) are also absent for STT
798	   traffic.

800	   More general issues of security related to network virtualization
801	   overlays are described in [I-D.ietf-nvo3-security-requirements].

803	7.  Contributors

805	   The following individuals contributed to this document:

807	   Brad McConnell
808	   Rackspace
809	   5000 Walzem Road
810	   San Antonio, TX  78218
811	   Email: bmcconne@rackspace.com

813	   JC Martin
814	   eBay
815	   2145 Hamilton Ave.
816	   San Jose, CA 95125
817	   Email: jcmartin@ebaysf.com

819	   Iben Rodriguez
820	   eBay
821	   2477 Woodland Ave
822	   San Jose, CA 95128
823	   Email: Iben.rodriguez@gmail.com

825	   Ilango Ganga
826	   Intel Corporation
827	   2200 Mission College Blvd.
828	   Santa Clara, CA - 95054
829	   Email: ilango.s.ganga@intel.com

831	   Igor Gashinsky
832	   Yahoo!
833	   111 West 40th Street
834	   New York, NY 10018
835	   Email: igor@yahoo-inc.com

837	8.  Acknowledgements

839	   We thank Martin Casado for inspiring this work and making all the
840	   introductions, and Ben Pfaff for his explanations of the
841	   implementation.  Thanks also to Pierre Ettori, Yukio Ogawa and
842	   Koichiro Seto for their helpful comments.

844	9.  References

846	9.1.  Normative References

848	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
849	              RFC 793, DOI 10.17487/RFC0793, September 1981,
850	              <http://www.rfc-editor.org/info/rfc793>.

852	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
853	              Requirement Levels", BCP 14, RFC 2119,
854	              DOI 10.17487/RFC2119, March 1997,
855	              <http://www.rfc-editor.org/info/rfc2119>.

857	9.2.  Informative References

859	   [I-D.ietf-nvo3-geneve]
860	              Gross, J. and I. Ganga, "Geneve: Generic Network
861	              Virtualization Encapsulation", draft-ietf-nvo3-geneve-01
862	              (work in progress), January 2016.

864	   [I-D.ietf-nvo3-security-requirements]
865	              Hartman, S., Zhang, D., Wasserman, M., Qiang, Z., and M.
866	              Zhang, "Security Requirements of NVO3", draft-ietf-nvo3-
867	              security-requirements-06 (work in progress), December
868	              2015.

870	   [KM87]     Kent, C. and J. Mogul, "Fragmentation Considered Harmful",
871	              Proc. ACM SIGCOMM 1987, August 1987.

873	   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
874	              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
875	              DOI 10.17487/RFC2784, March 2000,
876	              <http://www.rfc-editor.org/info/rfc2784>.

878	   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
879	              RFC 2983, DOI 10.17487/RFC2983, October 2000,
880	              <http://www.rfc-editor.org/info/rfc2983>.

882	   [RFC3931]  Lau, J., Ed., Townsley, M., Ed., and I. Goyret, Ed.,
883	              "Layer Two Tunneling Protocol - Version 3 (L2TPv3)",
884	              RFC 3931, DOI 10.17487/RFC3931, March 2005,
885	              <http://www.rfc-editor.org/info/rfc3931>.

887	   [RFC6040]  Briscoe, B., "Tunnelling of Explicit Congestion
888	              Notification", RFC 6040, DOI 10.17487/RFC6040, November
889	              2010, <http://www.rfc-editor.org/info/rfc6040>.

891	   [RFC6864]  Touch, J., "Updated Specification of the IPv4 ID Field",
892	              RFC 6864, DOI 10.17487/RFC6864, February 2013,
893	              <http://www.rfc-editor.org/info/rfc6864>.

895	   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
896	              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
897	              eXtensible Local Area Network (VXLAN): A Framework for
898	              Overlaying Virtualized Layer 2 Networks over Layer 3
899	              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014,
900	              <http://www.rfc-editor.org/info/rfc7348>.

902	   [RFC7364]  Narten, T., Ed., Gray, E., Ed., Black, D., Fang, L.,
903	              Kreeger, L., and M. Napierala, "Problem Statement:
904	              Overlays for Network Virtualization", RFC 7364,
905	              DOI 10.17487/RFC7364, October 2014,
906	              <http://www.rfc-editor.org/info/rfc7364>.

908	   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network
909	              Virtualization Using Generic Routing Encapsulation",
910	              RFC 7637, DOI 10.17487/RFC7637, September 2015,
911	              <http://www.rfc-editor.org/info/rfc7637>.

913	   [VL2]      Greenberg, A., "VL2: A Scalable and Flexible Data Center
914	              Network", Proc. ACM SIGCOMM 2009, August 2009.

916	Authors' Addresses

918	   Bruce Davie (editor)
919	   VMware, Inc.
920	   3401 Hillview Ave.
921	   Palo Alto, CA  94304
922	   USA

924	   Email: bdavie@vmware.com

926	   Jesse Gross
927	   VMware, Inc.
928	   3401 Hillview Ave.
929	   Palo Alto, CA  94304
930	   USA

932	   Email: jgross@vmware.com