idnits 2.17.1 

draft-davie-stt-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (March 5, 2012) is 4428 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  ** Obsolete normative reference: RFC  793 (Obsoleted by RFC 9293)

  == Outdated reference: A later version (-09) exists of
     draft-mahalingam-dutt-dcops-vxlan-01

  == Outdated reference: A later version (-04) exists of
     draft-narten-nvo3-overlay-problem-statement-01

  == Outdated reference: A later version (-08) exists of
     draft-sridharan-virtualization-nvgre-00


     Summary: 1 error (**), 0 flaws (~~), 4 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                      B. Davie, Ed.
3	Internet-Draft                                                  J. Gross
4	Intended status: Informational                     Nicira Networks, Inc.
5	Expires: September 6, 2012                                 March 5, 2012

7	  A Stateless Transport Tunneling Protocol for Network Virtualization
8	                                 (STT)
9	                           draft-davie-stt-01

11	Abstract

13	   Network Virtualization places unique requirements on tunneling
14	   protocols.  This draft describes STT (Stateless Transport Tunneling),
15	   a tunnel encapsulation that enables overlay networks to be built in
16	   virtualized networks.  STT is particularly useful when some tunnel
17	   endpoints are in end-systems, as it utilizes the capabilities of the
18	   network interface card to improve performance.

20	Status of this Memo

22	   This Internet-Draft is submitted in full conformance with the
23	   provisions of BCP 78 and BCP 79.

25	   Internet-Drafts are working documents of the Internet Engineering
26	   Task Force (IETF).  Note that other groups may also distribute
27	   working documents as Internet-Drafts.  The list of current Internet-
28	   Drafts is at http://datatracker.ietf.org/drafts/current/.

30	   Internet-Drafts are draft documents valid for a maximum of six months
31	   and may be updated, replaced, or obsoleted by other documents at any
32	   time.  It is inappropriate to use Internet-Drafts as reference
33	   material or to cite them other than as "work in progress."

35	   This Internet-Draft will expire on September 6, 2012.

37	Copyright Notice

39	   Copyright (c) 2012 IETF Trust and the persons identified as the
40	   document authors.  All rights reserved.

42	   This document is subject to BCP 78 and the IETF Trust's Legal
43	   Provisions Relating to IETF Documents
44	   (http://trustee.ietf.org/license-info) in effect on the date of
45	   publication of this document.  Please review these documents
46	   carefully, as they describe your rights and restrictions with respect
47	   to this document.  Code Components extracted from this document must
48	   include Simplified BSD License text as described in Section 4.e of
49	   the Trust Legal Provisions and are provided without warranty as
50	   described in the Simplified BSD License.

52	Table of Contents

54	   1.  Introduction
55	     1.1.  Requirements Language
56	     1.2.  Terminology
57	     1.3.  Reference Model
58	   2.  Design Rationale
59	     2.1.  Segmentation Offload
60	     2.2.  Metadata
61	     2.3.  Context Information
62	     2.4.  Alignment
63	     2.5.  Equal Cost Multipath
64	     2.6.  Efficient Software Processing
65	   3.  Frame Formats
66	     3.1.  STT Frame Format
67	       3.1.1.  Handling non-IP payloads
68	     3.2.  Usage of TCP Header by STT
69	     3.3.  Encapsulation of STT Segments in IP
70	       3.3.1.  Diffserv and ECN-Marking
71	       3.3.2.  Packet Loss
72	     3.4.  Broadcast and Multicast
73	   4.  Interoperability Issues
74	   5.  IANA Considerations
75	   6.  Security Considerations
76	   7.  Contributors
77	   8.  Acknowledgements
78	   9.  References
79	     9.1.  Normative References
80	     9.2.  Informative References
81	   Authors' Addresses

83	1.  Introduction

85	   Network Virtualization places unique requirements on tunneling
86	   protocols.  The utility of tunneling in virtualized data centers has
87	   been described elsewhere; see, for example,
88	   [I-D.narten-nvo3-overlay-problem-statement], [VL2],
89	   [I-D.mahalingam-dutt-dcops-vxlan],
90	   [I-D.sridharan-virtualization-nvgre].  Tunneling allows a virtual
91	   overlay topology to be constructed on top of the physical data center
92	   network, and provides benefits such as:

94	   o  Ability to manage overlapping addresses between multiple tenants

96	   o  Decoupling of the virtual topology provided by the tunnels from
97	      the physical topology of the network

99	   o  Support for virtual machine mobility independent of the physical
100	      network

102	   o  Support for essentially unlimited numbers of virtual networks (in
103	      contrast to VLANs, for example)

105	   o  Decoupling of the network service provided to servers from the
106	      technology used in the physical network (e.g. providing an L2
107	      service over an L3 fabric)

109	   o  Isolating the physical network from the addressing of the virtual
110	      networks, thus avoiding issues such as MAC table size in physical
111	      switches.

113	   This draft describes STT (Stateless Transport Tunneling), a tunnel
114	   encapsulation that enables overlay networks to be built in
115	   virtualized data center networks, providing the benefits outlined
116	   above.  STT is particularly useful when some tunnel endpoints are in
117	   end-systems, as it utilizes the capabilities of standard network
118	   interface cards to improve performance.  STT is an IP-based
119	   encapsulation and utilizes a TCP-like header inside the IP header.
120	   It is, however, stateless, i.e., there is no TCP connection state of
121	   any kind associated with the tunnel.  The TCP-like header is used for
122	   pragmatic reasons, to leverage the capabilities of existing network
123	   interface cards, but should not be interpreted as implying any sort
124	   of connection state between endpoints.

126	   STT is typically used to carry Ethernet frames between tunnel
127	   endpoints.  These frames may be considerably larger than the MTU of
128	   the physical network - up to 64KB.  Fields in the tunnel header are
129	   used to allow these large frames to be segmented at the entrance to
130	   the tunnel according to the MTU of the physical network and
131	   subsequently reassembled at the far end of the tunnel.

133	1.1.  Requirements Language

135	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
136	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
137	   document are to be interpreted as described in RFC 2119 [RFC2119].

139	1.2.  Terminology

141	   The following terms are used in this document:

143	   Stateless Transport Tunneling (STT).  The tunneling mechanism defined
144	   in this document.  The name derives from the fact that the tunnel
145	   header resembles the TCP/IP headers (hence "transport" tunneling)
146	   while "stateless" refers to the fact that none of the normal TCP
147	   state (connection state, send and receive windows, congestion state
148	   etc.) is associated with the tunnel (as would be required if an
149	   actual TCP connection were used for tunneling).

151	   STT Frame.  The unit of data that is passed into the tunnel prior to
152	   segmentation and encapsulation.  This frame typically consists of an
153	   Ethernet frame and an STT Frame header.  These frames may be up to
154	   64KB in size.

156	   STT Segment.  The unit of data that is transmitted on the underlay
157	   network over which the tunnel operates.  An STT segment has headers
158	   that are syntactically the same as the TCP/IP headers, and typically
159	   contains part of an STT frame as the payload.  These segments must
160	   fit within the MTU of the physical network.

162	   Context ID.  A 64-bit field in the STT frame header that conveys
163	   information about the disposition of the STT frame between the tunnel
164	   endpoints.  One example use of the Context ID is to direct delivery
165	   of the STT frame payload to the appropriate virtual network or
166	   virtual machine.

168	   MSS.  Maximum Segment Size.  The maximum number of bytes that can be
169	   sent in one TCP segment [RFC0793].

171	   NIC.  Network Interface Card.

173	   TSO.  TCP Segmentation Offload.  A function provided by many
174	   commercial NICs that allows large data units to be passed to the NIC,
175	   the NIC being responsible for creating MSS-sized segments with
176	   correct TCP/IP headers.

178	   LRO.  Large Receive Offload.  The receive-side equivalent function of
179	   TSO, in which multiple TCP segments are coalesced into larger data
180	   units.

182	   VM.  Virtual Machine.

184	1.3.  Reference Model

186	   Our conceptual model for a virtualized network is shown in Figure 1.
187	   STT tunnels extend in this figure from one virtual switch to another,
188	   providing a virtual link between the switches over some arbitrary
189	   underlay.  More generally, STT tunnels operate between a pair of
190	   tunnel endpoints; these endpoints may be virtual switches, physical
191	   switches, or some other device (e.g. an appliance).  The STT tunnel
192	   provides a virtual point-to-point Ethernet link between the
193	   endpoints.  Frames are handed to the tunnel by some entity (e.g. a VM
194	   that is connected to a virtual switch in this picture) and first
195	   encapsulated with an STT Frame header.  STT Frames may then be
196	   fragmented in the NIC, and are encapsulated with a tunnel header (the
197	   STT segment header) for transmission over the underlay.  Note that
198	   other models are possible, e.g., where one or both tunnel endpoints
199	   are implemented in a physical switch.  In such cases the tunnel
200	   endpoint may forward packets to and from another link (physical or
201	   virtual) rather than to a VM.

203	      +----------------------+             +----------------------+
204	      | +--+   +-------+---+ |             | +---+-------+   +--+ |
205	      | |VM|---|       |   | |             | |   |       |---|VM| |
206	      | +--+   |Virtual|NIC|--- Underlay --- |NIC|Virtual|   +--+ |
207	      | +--+   |Switch |   | |   Network   | |   |Switch |   +--+ |
208	      | |VM|---|       |   | |             | |   |       |---|VM| |
209	      | +--+   +-------+---+ |             | +---+-------+   +--+ |
210	      +----------------------+             +----------------------+

212	                   ()===============================()
213	                        Switch-Switch tunnel

215	                       Figure 1: STT Reference Model

217	2.  Design Rationale

219	   We take as given the need for some form of tunneling to support the
220	   virtualization of the network as described in Section 1.  One might
221	   reasonably ask whether some existing tunneling protocol such as
222	   GRE[RFC2784] or L2TPv3[RFC3931] might suffice.  In fact,

224	   [I-D.sridharan-virtualization-nvgre] does just that, using GRE.  The
225	   primary motivation for STT as opposed to one of the existing
226	   tunneling methods is to improve the performance of data transfers
227	   from hosts that implement tunnel endpoints.  We expand on this
228	   rationale below.

230	2.1.  Segmentation Offload

232	   A large percentage of network interface cards (NICs) in use today are
233	   able to perform TCP segmentation offload (TSO).  When a NIC supports
234	   TSO, the host hands a large (greater than 1 TCP MSS) frame of data to
235	   the NIC along with a set of metadata which includes, among other
236	   things, the desired MSS, and various fields needed to complete the
237	   TCP header.  The NIC fragments the frame into MSS-sized segments,
238	   performs the TCP Checksum operation, and applies the appropriate
239	   headers (TCP, IP and MAC) to each segment.

241	   On the receive side, some NICs support the reassembly of TCP
242	   segments, a function referred to as large receive offload (LRO).  In
243	   this case, NICs attempt to reassemble TCP segments and pass larger
244	   aggregates of data to the host.  (Since TCP's service model is a byte
245	   stream, there is no higher level frame for the NIC to reassemble, but
246	   it can pass chunks of the stream larger than one MSS to the host.
247	   Full reassembly of STT frames is handled in the host.)  The benefits
248	   to the host include fewer per-packet operations and larger data
249	   transfers between host and NIC, which amortizes the per-transfer cost
250	   (such as interrupt processing) more efficiently.  These savings can
251	   translate into significant performance gains for data transfer from
252	   the host to the network.

254	   STT is explicitly designed to leverage the TSO capabilities of
255	   currently available NICs.  While one might think of segmentation as a
256	   generic function, the majority of NICs are designed specifically to
257	   support TCP segmentation offload, as the details of the segmentation
258	   function are highly dependent on the specifics of TCP.  In order to
259	   leverage such capability, therefore, the STT segment header is
260	   syntactically identical to a valid TCP header.  However, we use some
261	   of the fields in the TCP header (specifically, sequence number and
262	   ACK number) to support the objectives of STT.  The details are
263	   described in Section 3.2.  In essence, we need the same set of
264	   information that IP datagrams carry when IP fragmentation takes
265	   place: a unique identifier for the frame that has been fragmented, an
266	   offset into that frame for the current fragment, and the length of
267	   the frame to be reassembled.  We fit these fields into the TCP header
268	   fields traditionally used for the SEQ and ACK numbers.  STT segments
269	   are transmitted as IP datagrams using the TCP protocol number (6).
270	   The primary means to recognize STT segments is the destination port
271	   number.  We discuss the interoperability impact of these design
272	   choices in Section 4.

274	   The net effect of using TSO is that the frame size that is sent by
275	   endpoints in the virtualized network can be much larger than the MTU
276	   of the underlying physical network.  The primary benefit of this is a
277	   significant performance gain when large amounts of data are being
278	   transferred between nodes in the virtual network.  A secondary effect
279	   is that the header of the STT frame is amortized across a larger
280	   amount of data, reducing the need to shrink the STT frame header to
281	   minimum size.

283	   Note that, while segmentation offload is the primary NIC function
284	   that STT takes advantage of, other NIC offload functions such as
285	   checksum calculation can also be leveraged.

287	2.2.  Metadata

289	   When a frame is delivered to a NIC that supports TSO for
290	   segmentation and transmission, a certain amount of metadata is
291	   typically passed along with it.  This includes the MSS and
292	   potentially a VLAN tag to be applied to the transmitted packets.

294	   In some virtualized network deployments, an STT frame may traverse a
295	   tunnel, be received and reassembled at an STT endpoint, and then be
296	   sent on another physical interface.  In such cases, the tunnel
297	   terminating endpoint may need to pass metadata to a NIC to enable
298	   transmission of frames on the physical link.  For this reason,
299	   appropriate metadata is carried in the STT frame header.

301	2.3.  Context Information

303	   When an STT Frame is received by a tunnel endpoint, it needs to be
304	   directed to the appropriate entity in the virtualized network to
305	   which it belongs.  For this reason, a Context ID is required in the
306	   STT frame header.  Some other encapsulations (e.g.
307	   [I-D.mahalingam-dutt-dcops-vxlan],
308	   [I-D.sridharan-virtualization-nvgre]) use an explicit tenant network
309	   identifier or virtual network identifier.  The Context Identifier can
310	   be thought of as a generalized form of virtual network identifier.
311	   Using a larger and more general identifier allows for a broader range
312	   of service models and allows ample room for future expansion.  There
313	   is little downside to using a larger field here because it is
314	   amortized across the entire STT Frame rather than being present in
315	   each packet.

317	2.4.  Alignment

319	   Software implementations of tunnel endpoints benefit from 32-bit
320	   alignment of the data to be manipulated.  Because the Ethernet header
321	   is not a multiple of 32 bits (it is 14 bytes), 2 bytes of padding are
322	   added to the STT header, causing the payload beyond the encapsulated
323	   Ethernet header, which typically includes the IP header of the
324	   encapsulated frame, to be 32-bit aligned.
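
   The alignment arithmetic can be checked with a short sketch.  The
   field widths below are read off the STT frame header layout in
   Section 3.1; the names STT_FIELDS and inner_payload_offset are
   illustrative only and do not appear in this draft.  The sketch also
   assumes that the STT frame header itself starts on a 32-bit
   boundary.

```python
# Field widths in bytes, taken from the STT frame header figure.
STT_FIELDS = {
    "version": 1,
    "flags": 1,
    "l4_offset": 1,
    "reserved": 1,
    "max_segment_size": 2,
    "pcp_v_vlan": 2,
    "context_id": 8,
    "padding": 2,
}
ETHERNET_HEADER_LEN = 14  # untagged Ethernet header


def inner_payload_offset(with_padding=True):
    """Offset, from the start of the STT frame header, of the data that
    follows the encapsulated Ethernet header (typically the inner IP
    header)."""
    stt_len = sum(STT_FIELDS.values())
    if not with_padding:
        stt_len -= STT_FIELDS["padding"]
    return stt_len + ETHERNET_HEADER_LEN
```

   With the padding the offset is a multiple of 4 bytes; without it
   the inner IP header would begin 2 bytes off a 32-bit boundary.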

326	2.5.  Equal Cost Multipath

328	   It is essential that traffic passing through the physical network can
329	   be efficiently distributed across multiple paths.  Standard equal
330	   cost multipath (ECMP) techniques involve hashing on address and port
331	   numbers in the outer protocol headers.  There are two main issues to
332	   address with ECMP.  First, it is important that, when a set of
333	   packets belong to a single flow (e.g. a TCP connection in the virtual
334	   network), all those packets should follow the same path.  Second, all
335	   paths should be used efficiently, i.e. there needs to be sufficient
336	   entropy among the different flows to ensure they get distributed
337	   evenly across multiple paths.

339	   STT achieves the first goal by ensuring that the source and
340	   destination ports and addresses in the outer header are all the same
341	   for a single flow.  The second goal is achieved by generating the
342	   source port using a random hash of fields in the headers of the inner
343	   packets, e.g. the ports and addresses of the virtual flow's packets.
344	   We provide more details on the usage of port numbers in Section 3.2.

346	2.6.  Efficient Software Processing

348	   The design of STT is largely motivated by the desire to tunnel
349	   packets efficiently between virtual switches running in software.  In
350	   addition to the points noted above, this leads to some design
351	   optimizations to simplify processing of packets, such as the use of
352	   an "L4 offset" field in the STT header to enable the payload to be
353	   located quickly without extensive header parsing.

355	3.  Frame Formats

357	   STT encapsulates data payloads of up to 64KB (limited by the length
358	   field in the STT header, described below).  Those frames are then
359	   segmented (depending on the MTU of the underlying physical network)
360	   and the resulting segments are encapsulated in a standard TCP header,
361	   which in turn is encapsulated by an IP header and finally a MAC
362	   header.  This is illustrated in Figure 2.

364	                      +-----------+    +----------+     +----------+
365	                      | IP Header |    |IP Header |     |IP header |
366	   +-----------+      +-----------+    +----------+     +----------+
367	   |STT Frame  |      |TCP-like   |    |TCP-like  |     |TCP-like  |
368	   | Header    |      | header    |    | header   |     | header   |
369	   +-----------+      +-----------+    +----------+     +----------+
370	   |           | ---> | STT Frame |    |Next part | ... |Last part |
371	   |Payload    |      |  Header   |    |of Payload|     |of Payload|
372	   .           .      +-----------+    |          |     |          |
373	   .           .      |           |    |          |     |          |
374	   .           .      |  Start of |    |          |     |          |
375	   +-----------+      |  Payload  |    |          |     +----------+
376	                      +-----------+    +----------+

378	   Original data           STT Frame is segmented and transmitted as
379	   frame is encapped           a set of TCP segments (MAC
380	   with STT Header                 headers not shown)

382	              Figure 2: STT Frame Fragments and Encapsulation

384	   The STT Frame header and the usage of the TCP-like header are
385	   described in detail below.  The TCP segments shown in
386	   Figure 2 are of course further encapsulated as IP datagrams, and may
387	   be sent as either IPv4 or IPv6.  The resulting IP datagrams are then
388	   transmitted in the appropriate MAC level frame (e.g.  Ethernet, not
389	   shown in the figure) for the underlying physical network over which
390	   the tunnels are established.

392	3.1.  STT Frame Format

394	   Figure 3 illustrates the header of an STT frame before it is
395	   segmented.

397	    0                   1                   2                   3
398	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
399	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
400	   |  Version      | Flags         |  L4 Offset    |  Reserved     |
401	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
402	   |    Max. Segment Size          | PCP |V|     VLAN ID           |
403	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
404	   |                                                               |
405	   +                     Context ID (64 bits)                      +
406	   |                                                               |
407	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
408	   |     Padding                   |    data                       |
409	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
410	   |                                                               |

412	                        Figure 3: STT Frame Format

414	   The STT frame header contains the following fields:

416	   o  Version - currently 0.

418	   o  Flags - describes encapsulated packet, see below.

420	   o  L4 offset - offset in bytes from the end of the STT Frame header
421	      to the start of the encapsulated layer 4 (TCP/UDP) header.

423	   o  Reserved field - MUST be zero on transmission and ignored on
424	      receipt.

426	   o  Max Segment Size - the TCP MSS that should be used by a tunnel
427	      endpoint that is transmitting this frame onto another network.

429	   o  PCP - the 3-bit Priority Code Point field that should be applied
430	      to this packet by an STT tunnel endpoint on transmission to
431	      another network (see Section 2.2).

433	   o  V - a one bit flag that, if set, indicates the presence of a valid
434	      VLAN ID in the following field and valid PCP in the preceding
435	      field.

437	   o  VLAN ID - 12-bit VLAN tag that should be applied to this packet by
438	      an STT tunnel endpoint on transmission to another network (see
439	      Section 2.2).

441	   o  Context ID - 64 bits of context information, described in detail
442	      in Section 2.3.

444	   o  Padding - 16 bits as described above.

446	   The flags field contains:

448	   o  0: Checksum verified.  Set if the checksum of the encapsulated
449	      packet has been verified by the sender.

451	   o  1: Checksum partial.  Set if the checksum in the encapsulated
452	      packet has been computed only over the TCP/IP header.  This bit
453	      MUST be set if TSO is used by the sender.  Note that bit 0 and bit
454	      1 cannot both be set in the same header.

456	   o  2: IP version.  Set if the encapsulated packet is IPv4, not set if
457	      the packet is IPv6.  See below for discussion of non-IP payloads.

459	   o  3: TCP payload.  Set if the encapsulated packet is TCP.

461	   o  4-7: Unused, MUST be zero on transmission and ignored on receipt.

463	   As noted above, several of these fields are present primarily to
464	   enable efficient processing of the packet when it is received at a
465	   tunnel endpoint.  (For example, it's entirely possible to determine
466	   if the packet is IPv4 or IPv6 by looking at the Ethernet header -
467	   it's just more efficient not to have to do so.)

469	   The payload of the STT frame is an untagged Ethernet frame.
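
   As an illustration of the layout in Figure 3, the following sketch
   packs and parses the 18-byte STT frame header in network byte
   order.  It is not a reference implementation.  The function and
   constant names are hypothetical, and the sketch assumes that flag
   "bit 0" is the least significant bit of the Flags octet, which this
   draft does not state explicitly.

```python
import struct

# Assumption: flag bit 0 is the least significant bit of the Flags octet.
FLAG_CSUM_VERIFIED = 1 << 0
FLAG_CSUM_PARTIAL = 1 << 1
FLAG_IPV4 = 1 << 2
FLAG_TCP = 1 << 3

# Version, Flags, L4 Offset, Reserved, Max Segment Size,
# PCP/V/VLAN ID, Context ID, Padding (18 bytes in total).
_STT_HDR = struct.Struct("!BBBBHHQH")


def pack_stt_header(flags, l4_offset, mss, context_id,
                    vlan_id=None, pcp=0, version=0):
    """Build an STT frame header; Reserved and Padding are sent as zero."""
    if flags & FLAG_CSUM_VERIFIED and flags & FLAG_CSUM_PARTIAL:
        raise ValueError("checksum verified and partial are exclusive")
    if flags & 0xF0:
        raise ValueError("flag bits 4-7 must be zero on transmission")
    tci = 0
    if vlan_id is not None:
        if not 0 <= vlan_id < 4096 or not 0 <= pcp < 8:
            raise ValueError("VLAN ID is 12 bits, PCP is 3 bits")
        # PCP in the top 3 bits, V flag next, VLAN ID in the low 12 bits.
        tci = (pcp << 13) | (1 << 12) | vlan_id
    return _STT_HDR.pack(version, flags, l4_offset, 0, mss, tci,
                         context_id, 0)


def unpack_stt_header(data):
    """Parse an STT frame header, ignoring Reserved, Padding and the
    unused flag bits as a receiver is expected to."""
    (version, flags, l4_offset, _reserved, mss, tci,
     context_id, _padding) = _STT_HDR.unpack_from(data)
    has_vlan = bool(tci & 0x1000)
    return {
        "version": version,
        "flags": flags & 0x0F,
        "l4_offset": l4_offset,
        "mss": mss,
        "pcp": (tci >> 13) if has_vlan else None,
        "vlan_id": (tci & 0x0FFF) if has_vlan else None,
        "context_id": context_id,
    }
```

   PCP and VLAN ID are reported only when the V bit is set, since the
   V bit is what marks those two fields as valid.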

471	3.1.1.  Handling non-IP payloads

473	   Note that the STT header does not have a general "protocol" field to
474	   allow the efficient processing of arbitrary payloads.  The current
475	   version is designed to provide a virtual Ethernet link, and hence
476	   efficiently supports only Ethernet frames as the payload.  The
477	   Ethernet header itself contains a protocol field, which then
478	   identifies the higher layer protocol, so it is straightforward to
479	   accommodate non-IP traffic.

481	   It will be noted that the STT Frame header does contain fields that
482	   are intended to assist in efficient processing of IPv4 and IPv6
483	   packets.  These fields MUST be set to zero on transmission and
484	   ignored on receipt for non-IP payloads.

486	   The use of STT to carry payloads other than Ethernet is theoretically
487	   possible but is beyond the scope of this document.

489	3.2.  Usage of TCP Header by STT

491	   Figure 4 illustrates the usage of the TCP header by STT.  This figure is
492	   essentially identical to that in [RFC0793] with the exception that we
493	   denote with an asterisk (*) two fields that are used by STT to convey
494	   something other than the information that is conveyed by TCP.
495	   Syntactically, STT segments look identical to TCP segments.  However,
496	   STT tunnel endpoints treat the Sequence number and Acknowledgment
497	   number differently than TCP endpoints treat those fields.
498	   Furthermore, as noted above, there is no TCP state machine associated
499	   with an STT tunnel.

501	    0                   1                   2                   3
502	    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
503	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
504	   |          Source Port          |       Destination Port        |
505	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
506	   |                        Sequence Number(*)                     |
507	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
508	   |                    Acknowledgment Number(*)                   |
509	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
510	   |  Data |           |U|A|P|R|S|F|                               |
511	   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
512	   |       |           |G|K|H|T|N|N|                               |
513	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
514	   |           Checksum            |         Urgent Pointer        |
515	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
516	   |                    Options                    |    Padding    |
517	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
518	   |                             data                              |
519	   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

521	                       Figure 4: STT Segment Format

523	   The Destination port is to be requested from IANA, in the user range
524	   (1024-49151).

526	   In order to allow correct reassembly of the STT frame, the source
527	   port MUST be constant for all segments of a single STT frame.

529	   As noted above (Section 2.5) the source port SHOULD be the same for
530	   all frames that belong to a single flow in the virtual network, e.g.
531	   a single TCP connection.

533	   Also, to encourage efficient distribution of traffic among multiple
534	   paths when ECMP is used, the method to calculate the source port
535	   should provide a random distribution of source port numbers.  An
536	   example mechanism would be a random hash on ports and addresses of
537	   the TCP headers of the flow in the virtual network.

539	   It is RECOMMENDED to use a source port number from the ephemeral
540	   range defined by IANA (49152-65535).
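
   One possible way to derive such a source port is sketched below.
   This draft only asks for a hash with a random distribution; the
   choice of CRC-32 over a textual flow key, and the function name
   stt_source_port, are assumptions made for illustration.

```python
import zlib

# Ephemeral port range quoted above.
EPHEMERAL_MIN = 49152
EPHEMERAL_MAX = 65535


def stt_source_port(src_ip, dst_ip, proto, src_port, dst_port):
    """Map the inner flow's addresses and ports to an outer source port.

    The same inner flow always yields the same port, so all of its
    packets hash to the same ECMP path; different flows are spread
    across the ephemeral range.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    span = EPHEMERAL_MAX - EPHEMERAL_MIN + 1
    return EPHEMERAL_MIN + zlib.crc32(key) % span
```

   Because the result depends only on the inner headers, it is also
   constant across all segments of a single STT frame, which is
   required for reassembly.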

542	   The Sequence number and Acknowledgment number fields are re-purposed
543	   in a way that does not confuse NICs that expect them to be used in
544	   the conventional manner.  The ACK field is used as a packet
545	   identifier for the purposes of fragmentation, equivalent in function
546	   to the Identification field of IPv4 or the IPv6 Fragment header: it
547	   MUST be constant for all STT segments of a given frame, and different
548	   from any value used recently for other STT frames sent over this
549	   tunnel.

551	   The upper 16 bits of the SEQ field are used to convey the length
552	   of the STT frame in bytes.  The lower 16 bits of the SEQ field are
553	   used to convey the offset (in bytes) of the current fragment within
554	   the larger STT frame.

556	   Reassembly of the fragments may be done partially by NICs that
557	   perform LRO, since the sequence numbers of segments will increment
558	   appropriately.  That is, the upper 16 bits don't change, and the
559	   lower 16 bits increment by N for every N byte segment that is
560	   transmitted, just as would be the case if an actual sequence number
561	   were being sent.  Note that the size limit of an STT frame ensures
562	   that sequence numbers cannot wrap while sending the segments of a
563	   single STT frame.
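
   The use of the SEQ and ACK fields described above can be sketched
   as follows.  This is a simplified software model, not a description
   of what a NIC does: all function names are hypothetical, a segment
   is modelled as a (seq, ack, payload) tuple, and the reassembly
   routine assumes that no segment is duplicated or overlaps another.

```python
def encode_seq(frame_len, offset):
    """Upper 16 bits: STT frame length; lower 16 bits: fragment offset."""
    if not 0 <= frame_len <= 0xFFFF or not 0 <= offset <= 0xFFFF:
        raise ValueError("length and offset must each fit in 16 bits")
    return (frame_len << 16) | offset


def decode_seq(seq):
    """Return (frame_len, offset) from a 32-bit SEQ value."""
    return seq >> 16, seq & 0xFFFF


def segment_frame(frame, frame_id, mss):
    """Split one STT frame into (seq, ack, payload) tuples.

    The ACK value carries the frame identifier and is the same for
    every segment of the frame.
    """
    if mss <= 0:
        raise ValueError("mss must be positive")
    total = len(frame)
    for offset in range(0, total, mss):
        yield encode_seq(total, offset), frame_id, frame[offset:offset + mss]


def reassemble(segments):
    """Rebuild one frame from its segments, in any order.

    Returns None if bytes are still missing.  Assumes no duplicate or
    overlapping segments.
    """
    segments = list(segments)
    if not segments:
        return None
    total, _ = decode_seq(segments[0][0])
    frame_id = segments[0][1]
    buf = bytearray(total)
    received = 0
    for seq, ack, payload in segments:
        length, offset = decode_seq(seq)
        if ack != frame_id or length != total:
            raise ValueError("segment belongs to a different STT frame")
        buf[offset:offset + len(payload)] = payload
        received += len(payload)
    return bytes(buf) if received == total else None
```

   Within one frame the lower 16 bits advance by the payload size of
   each segment while the upper 16 bits stay fixed, which is the
   pattern an LRO-capable NIC sees for an ordinary TCP byte stream.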

565	   All the fields after ACK have their conventional meaning, although
566	   nothing will be done with the Window or Urgent pointer values.  Those
567	   fields SHOULD be zero on transmit and ignored on receipt.  It is
568	   RECOMMENDED that the PSH (Push) flag be set when transmitting the
569	   last segment of a frame in order to cause data to be delivered by the
570	   NIC without waiting for other fragments.  The ACK flag SHOULD be set
571	   to ensure that a receiving NIC passes the ACK field to the host to
572	   assist in reassembly.  All other flags SHOULD be zero on transmit and
573	   ignored on receipt.
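
The flag recommendations above can be summarized in a short Python sketch. The function name `stt_tcp_flags` is hypothetical; the bit values are the standard TCP flag positions.

```python
# Standard TCP flag bit positions.
TCP_PSH = 0x08
TCP_ACK = 0x10


def stt_tcp_flags(offset, payload_length, frame_length):
    """Return the TCP flags byte for one STT segment.

    ACK is always set so that the NIC passes the ACK field to the host;
    PSH is set only on the segment that completes the frame. All other
    flags are left at zero.
    """
    flags = TCP_ACK
    if offset + payload_length >= frame_length:
        flags |= TCP_PSH
    return flags
```

Under the same recommendations, a sender would also write zero into the Window and Urgent pointer fields of every segment.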

575	3.3.  Encapsulation of STT Segments in IP

577	   From the perspective of IP, an STT segment is just like any other TCP
578	   segment.  The protocol number (IPv4) or Next Header (IPv6) has the
579	   value 6, as for regular TCP.  The resulting IP datagram is then
580	   encapsulated in the appropriate L2 header (e.g.  Ethernet) for
581	   transmission on the physical medium.

583	3.3.1.  Diffserv and ECN-Marking

585	   When traffic is encapsulated in a tunnel header, there are numerous
586	   options as to how the Diffserv Code-Point (DSCP) and ECN markings are
587	   set in the outer header and propagated to the inner header on
588	   decapsulation.

590	   [RFC2983] defines two modes for mapping the DSCP markings from inner
591	   to outer headers and vice versa.  The Uniform model copies the inner
592	   DSCP marking to the outer header on tunnel ingress, and copies that
593	   outer header value back to the inner header at tunnel egress.  The
594	   Pipe model sets the DSCP value to some value based on local policy at
595	   ingress and does not modify the inner header on egress.  Both models
596	   SHOULD be supported by STT endpoints.  However, there is an
597	   additional complexity with the uniform model for STT, because a
598	   single IP datagram that is transmitted over the tunnel appears as
599	   multiple IP datagrams on the wire.  Thus it is not guaranteed that
600	   all segments of the STT frame will have the same DSCP at egress.  If
601	   uniform model behavior is configured, it is RECOMMENDED that the DSCP
602	   of the first segment of the STT frame be used to set the DSCP value
603	   of the IP header in the decapsulated STT frame.
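
A minimal Python sketch of the egress behavior described above is given below. The function name `egress_dscp` and the string values used to select the model are assumptions for illustration, and the list of outer DSCP values is assumed to be ordered by fragment offset, so that its first element belongs to the first segment of the STT frame.

```python
def egress_dscp(model, inner_dscp, segment_dscps):
    """Choose the DSCP for the decapsulated STT frame at tunnel egress.

    model         -- "uniform" or "pipe" (hypothetical configuration values)
    inner_dscp    -- DSCP found in the inner IP header
    segment_dscps -- outer DSCP of each received segment, ordered by offset
    """
    if model == "uniform":
        # Segments may arrive with different markings; use the first one.
        return segment_dscps[0]
    if model == "pipe":
        # The inner header is not modified on egress.
        return inner_dscp
    raise ValueError("unknown Diffserv tunnel model: %r" % (model,))
```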

605	   [RFC6040] describes the correct ECN behavior for any type of IP in IP
606	   tunnel, and this behavior SHOULD be followed for STT tunnels.  As
607	   with the Uniform Diffserv tunnel model, the fact that one inner IP
608	   datagram is segmented into multiple outer datagrams makes the
609	   situation slightly more complex.  If any segment of the received STT
610	   frame has the CE (congestion experienced) bit set in its IP header,
611	   it is RECOMMENDED that the CE bit be set in the IP header of the
612	   decapsulated STT frame.
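
The CE aggregation described above can be sketched in Python as follows. The helper names are hypothetical, and the sketch covers only the step of combining CE across segments; it does not model the full RFC 6040 decapsulation rules (for example, the handling of an inner packet that is not ECN-capable).

```python
# The ECN field occupies the two low-order bits of the IPv4 TOS or IPv6
# Traffic Class octet; the value 0b11 is CE (congestion experienced).
ECN_MASK = 0b11
ECN_CE = 0b11


def frame_has_ce(segment_tos_values):
    """Return True if any outer segment of the frame carried CE."""
    return any((tos & ECN_MASK) == ECN_CE for tos in segment_tos_values)


def set_ce(tos):
    """Return the TOS/Traffic Class octet with the ECN field set to CE."""
    return tos | ECN_CE
```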

614	3.3.2.  Packet Loss

616	   Individual IP datagrams may be dropped (most often due to congestion)
617	   and, since there is no acknowledgment or reliable delivery of these
618	   datagrams, there is the potential to corrupt an entire STT Frame due
619	   to the loss of a single IP datagram.  Fortunately, there are
620	   solutions to this problem in the case where the higher layer protocol
621	   running over STT is TCP.  An STT receiving endpoint running in an
622	   end-system, as shown in Figure 1 for example, is not required to
623	   deliver complete STT frames to the TCP stack in the receiving VM.  A
624	   partial frame payload can be delivered and the receiving TCP stack
625	   can deal with the missing bytes just as it would if running directly
626	   over a physical network.  That is, TCP in the VM can send ACKs for
627	   the contiguous bytes received to trigger retransmission of the
628	   missing bytes by the sender.  This is similar to the operation of LRO
629	   in current NICs.  There are some subtleties to making this work
630	   correctly in the STT context, and it does depend on the STT endpoint
631	   being aware of the higher layer protocols consuming data in the VM to
632	   which it is connected.  The main point of this discussion is that, in
633	   the common deployments of STT running in a virtual switch, the
634	   potential harm of losing individual packets is not as serious as it
635	   might first appear.
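
One way to picture the partial delivery described above is a receiver that hands up only the bytes received contiguously from the start of the frame. The Python sketch below is a simplification under that assumption; the function name `contiguous_prefix` is hypothetical, and the subtleties mentioned in the text (such as awareness of the higher layer protocol in the VM) are not modeled.

```python
def contiguous_prefix(fragments):
    """Return the bytes received contiguously from offset 0.

    fragments -- iterable of (offset, payload) pairs for one STT frame,
                 in any order; payload is a bytes object.
    """
    received = bytearray()
    for offset, payload in sorted(fragments, key=lambda f: f[0]):
        if offset > len(received):
            # A fragment is missing before this one; stop at the gap.
            break
        # Append only the part that extends past what is already held.
        received += payload[len(received) - offset:]
    return bytes(received)
```

With fragments at offsets 0 and 8 but none at offset 4, only the first four bytes are returned; the TCP stack in the VM would then acknowledge those bytes and the sender would retransmit the rest.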

637	3.4.  Broadcast and Multicast

639	   It is possible to establish point-to-multipoint STT tunnels by using
640	   an IP multicast address as the destination address of the tunnel.
641	   These may be used for broadcast or multicast traffic if the
642	   underlying physical network supports IP multicast.  Control
643	   mechanisms for setting up such multicast groups are beyond the scope
644	   of this document.  It is worth repeating that, despite the syntactic
645	   resemblance between the STT segment header and the TCP header, there
646	   is no TCP state machine associated with an STT tunnel, so the
647	   traditional issues of combining multicast with TCP (or reliable
648	   transports more generally) do not arise.

650	4.  Interoperability Issues

652	   It will be noted that an STT packet on the wire appears exactly the
653	   same as a TCP packet, but that processing of an STT packet on
654	   reception is entirely different from TCP - no three-way handshake to
655	   establish a connection, no ACKs, retransmission, etc.  Hence, an STT
656	   tunnel endpoint clearly needs to be configured to behave in the
657	   correct manner rather than to perform standard TCP processing on the
658	   packet.  The primary way to recognize an STT segment is the
659	   destination port number in the TCP header.  In the event that an STT
660	   packet is inadvertently delivered to a device that is not configured
661	   to behave as an STT tunnel endpoint, no TCP connection will be
662	   established and STT packets will be dropped.
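
The port-based recognition described above can be sketched in Python as follows. The function name `is_stt_segment` is hypothetical, and the STT port is passed in as a parameter because no port number is assigned in this document.

```python
import struct


def is_stt_segment(tcp_header, stt_port):
    """Return True if the TCP destination port matches the STT port.

    tcp_header -- bytes starting at the first octet of the TCP header
    stt_port   -- the port configured for STT (an assumption here; the
                  value is not defined by this document)
    """
    if len(tcp_header) < 4:
        return False
    # Source and destination ports are the first two 16-bit fields.
    _src_port, dst_port = struct.unpack_from("!HH", tcp_header, 0)
    return dst_port == stt_port
```

An endpoint that matches on this check would hand the segment to STT processing instead of the regular TCP stack; a device without such configuration sees an ordinary TCP segment for which no connection exists.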

664	   In the event that STT packets pass through middle boxes that process
665	   TCP, it is likely that (in the near term at least) they will be
666	   dropped, as there will be no TCP connection state established.  This
667	   is clearly undesirable, but it is a general issue with any form of
668	   tunneling - the nature of many middle boxes is that they will not
669	   permit tunnels to pass through them.  Hence the best solution is
670	   simply to avoid deploying middle boxes at locations where STT tunnels
671	   (or other forms of tunnels for network virtualization) will need to
672	   pass through them.  This will not, however, always be feasible,
673	   especially when virtualized networks extend among multiple data
674	   centers.  Other solutions include configuring the middle boxes to
675	   permit TCP packets to pass through when the port number matches the
676	   port assigned for STT.

678	   In the longer term, we might reasonably expect that middle boxes
679	   would be able to recognize STT traffic, and to terminate and
680	   originate STT tunnels if necessary (e.g. to perform functions that
681	   require the STT payload to be inspected such as stateful
682	   firewalling).

684	   It is also of course possible to provide all the functionality of STT
685	   using a different IP protocol number (or next header value in IPv6).
686	   This approach makes sense in the long run but will typically not
687	   enable current NIC hardware to be leveraged for TSO and LRO
688	   functions.

690	   It is also possible to run STT traffic over other forms of tunnel
691	   (GRE, IPSEC, etc.) in which case the STT traffic can pass
692	   through appropriately configured middle boxes.

694	5.  IANA Considerations

696	   A TCP port in the user range (1024-49151) will be requested from
697	   IANA.

699	6.  Security Considerations

701	   In the physical network, STT packets are simply IP datagrams, and do
702	   not introduce new security issues.  Most standard IP security
703	   mechanisms (such as IPSEC encryption or authentication) can be
704	   implemented on STT packets if desired.  As noted above, however,
705	   tunneling generally interacts poorly with middle boxes, and STT is no
706	   exception.  Devices such as firewalls are likely to drop STT traffic
707	   unless the capability to recognize STT packets is implemented, or
708	   unless the STT traffic is itself run over some sort of tunnel that
709	   the firewall is configured to permit.  Intrusion detection systems
710	   would similarly need to be enhanced to be able to look inside STT
711	   packets.

713	   It should also be noted that while STT packets resemble TCP segments,
714	   the lack of a TCP state machine means that TCP-related security
715	   issues (e.g.  SYN-flooding) do not apply.  Similarly, some of the
716	   benefits of the TCP state machine (e.g. the ability to discard
717	   packets with unexpected sequence numbers) are also absent for STT
718	   traffic.

720	7.  Contributors

722	   The following individuals contributed to this document:

724	   Brad McConnell
725	   Rackspace
726	   5000 Walzem Road
727	   San Antonio, TX  78218
728	   Email: bmcconne@rackspace.com

730	   JC Martin
731	   eBay
732	   2477 Woodland Ave
733	   San Jose, CA 95128
734	   Email:

736	   Iben Rodriguez
737	   eBay
738	   2477 Woodland Ave
739	   San Jose, CA 95128
740	   Email: Iben.rodriguez@gmail.com

742	   Ilango Ganga
743	   Intel Corporation
744	   2200 Mission College Blvd.
745	   Santa Clara, CA - 95054
746	   Email: ilango.s.ganga@intel.com

748	   Igor Gashinsky
749	   Yahoo!
750	   111 West 40th Street
751	   New York, NY 10018
752	   Email: igor@yahoo-inc.com

754	8.  Acknowledgements

756	   We thank Martin Casado for inspiring this work and making all the
757	   introductions, and Ben Pfaff for his explanations of the
758	   implementation.  Thanks also to Pierre Ettori, Yukio Ogawa and
759	   Koichiro Seto for their helpful comments.

761	9.  References

763	9.1.  Normative References

765	   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
766	              RFC 793, September 1981.

768	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
769	              Requirement Levels", BCP 14, RFC 2119, March 1997.

771	9.2.  Informative References

773	   [I-D.mahalingam-dutt-dcops-vxlan]
774	              Sridhar, T., Bursell, M., Kreeger, L., Dutt, D., Wright,
775	              C., Mahalingam, M., Duda, K., and P. Agarwal, "VXLAN: A
776	              Framework for Overlaying Virtualized Layer 2 Networks over
777	              Layer 3 Networks", draft-mahalingam-dutt-dcops-vxlan-01
778	              (work in progress), February 2012.

780	   [I-D.narten-nvo3-overlay-problem-statement]
781	              Narten, T., Sridharan, M., Dutt, D., Black, D., and L.
782	              Kreeger, "Problem Statement: Overlays for Network
783	              Virtualization",
784	              draft-narten-nvo3-overlay-problem-statement-01 (work in
785	              progress), October 2011.

787	   [I-D.sridharan-virtualization-nvgre]
788	              Sridharan, M., Duda, K., Ganga, I., Greenberg, A., Lin,
789	              G., Pearson, M., Thaler, P., Tumuluri, C., and Y. Wang,
790	              "NVGRE: Network Virtualization using Generic Routing
791	              Encapsulation", draft-sridharan-virtualization-nvgre-00
792	              (work in progress), September 2011.

794	   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
795	              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
796	              March 2000.

798	   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
799	              RFC 2983, October 2000.

801	   [RFC3931]  Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling
802	              Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005.

804	   [RFC6040]  Briscoe, B., "Tunnelling of Explicit Congestion
805	              Notification", RFC 6040, November 2010.

807	   [VL2]      Greenberg et al, "VL2: A Scalable and Flexible Data Center
808	              Network", Proc. ACM SIGCOMM 2009, 2009.

812	Authors' Addresses

814	   Bruce Davie (editor)
815	   Nicira Networks, Inc.
816	   3460 W. Bayshore Rd.
817	   Palo Alto, CA  94303
818	   USA

820	   Email: bsd@nicira.com

822	   Jesse Gross
823	   Nicira Networks, Inc.
824	   3460 W. Bayshore Rd.
825	   Palo Alto, CA  94303
826	   USA

828	   Email: jesse@nicira.com