Network Working Group                                      B. Davie, Ed.
Internet-Draft                                                  J. Gross
Intended status: Informational                              VMware, Inc.
Expires: October 17, 2014                                 April 15, 2014

  A Stateless Transport Tunneling Protocol for Network Virtualization
                                 (STT)
                           draft-davie-stt-06

Abstract

   Network Virtualization places unique requirements on tunneling
   protocols. This draft describes STT (Stateless Transport Tunneling),
   a tunnel encapsulation that enables overlay networks to be built in
   virtualized networks. STT is particularly useful when some tunnel
   endpoints are in end-systems, as it utilizes the capabilities of the
   network interface card to improve performance.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 17, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
     1.2.  Terminology
     1.3.  Reference Model
   2.  Design Rationale
     2.1.  Segmentation Offload
     2.2.  Metadata
     2.3.  Context Information
     2.4.  Alignment
     2.5.  Equal Cost Multipath
     2.6.  Efficient Software Processing
   3.  Frame Formats
     3.1.  STT Frame Format
       3.1.1.  Handling non-TCP/IP and non-UDP/IP payloads
     3.2.  Usage of TCP Header by STT
     3.3.  Encapsulation of STT Segments in IP
       3.3.1.  Diffserv and ECN-Marking
       3.3.2.  Packet Loss
     3.4.  Broadcast and Multicast
   4.  Interoperability Issues
   5.  IANA Considerations
   6.  Security Considerations
   7.  Contributors
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Network Virtualization places unique requirements on tunneling
   protocols. The utility of tunneling in virtualized data centers has
   been described elsewhere; see, for example
   [I-D.ietf-nvo3-overlay-problem-statement], [VL2],
   [I-D.mahalingam-dutt-dcops-vxlan],
   [I-D.sridharan-virtualization-nvgre], [I-D.gross-geneve]. Tunneling
   allows a virtual overlay topology to be constructed on top of the
   physical data center network, and provides benefits such as:

   o  Ability to manage overlapping addresses between multiple tenants

   o  Decoupling of the virtual topology provided by the tunnels from
      the physical topology of the network

   o  Support for virtual machine mobility independent of the physical
      network

   o  Support for essentially unlimited numbers of virtual networks (in
      contrast to VLANs, for example)

   o  Decoupling of the network service provided to servers from the
      technology used in the physical network (e.g. providing an L2
      service over an L3 fabric)

   o  Isolating the physical network from the addressing of the virtual
      networks, thus avoiding issues such as MAC table size in physical
      switches.

   This draft describes STT (Stateless Transport Tunneling), a tunnel
   encapsulation that enables overlay networks to be built in
   virtualized data center networks, providing the benefits outlined
   above. STT is particularly useful when some tunnel endpoints are in
   end-systems, as it utilizes the capabilities of standard network
   interface cards to improve performance. STT is an IP-based
   encapsulation and utilizes a TCP-like header inside the IP header.
   It is, however, stateless, i.e., there is no TCP connection state of
   any kind associated with the tunnel. The TCP-like header is used for
   pragmatic reasons, to leverage the capabilities of existing network
   interface cards, but should not be interpreted as implying any sort
   of connection state between endpoints.

   STT is typically used to carry Ethernet frames between tunnel
   endpoints. These frames may be considerably larger than the MTU of
   the physical network - up to 64KB. Fields in the tunnel header are
   used to allow these large frames to be segmented at the entrance to
   the tunnel according to the MTU of the physical network and
   subsequently reassembled at the far end of the tunnel.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.2.  Terminology

   The following terms are used in this document:

   Stateless Transport Tunneling (STT). The tunneling mechanism defined
   in this document. The name derives from the fact that the tunnel
   header resembles the TCP/IP headers (hence "transport" tunneling)
   while "stateless" refers to the fact that none of the normal TCP
   state (connection state, send and receive windows, congestion state
   etc.) is associated with the tunnel (as would be required if an
   actual TCP connection were used for tunneling).

   STT Frame. The unit of data that is passed into the tunnel prior to
   segmentation and encapsulation. This frame typically consists of an
   Ethernet frame and an STT Frame header. These frames may be up to
   64KB in size.

   STT Segment. The unit of data that is transmitted on the underlay
   network over which the tunnel operates. An STT segment has headers
   that are syntactically the same as the TCP/IP headers, and typically
   contains part of an STT frame as the payload. These segments must
   fit within the MTU of the physical network.

   Context ID. A 64-bit field in the STT frame header that conveys
   information about the disposition of the STT frame between the tunnel
   endpoints.
   One example use of the Context ID is to direct delivery of the STT
   frame payload to the appropriate virtual network or virtual machine.

   MSS. Maximum Segment Size. The maximum number of bytes that can be
   sent in one TCP segment [RFC0793].

   NIC. Network Interface Card.

   TSO. TCP Segmentation Offload. A function provided by many
   commercial NICs that allows large data units to be passed to the NIC,
   the NIC being responsible for creating MSS-sized segments with
   correct TCP/IP headers.

   LRO. Large Receive Offload. The receive-side equivalent function of
   TSO, in which multiple TCP segments are coalesced into larger data
   units.

   VM. Virtual Machine.

1.3.  Reference Model

   Our conceptual model for a virtualized network is shown in Figure 1.
   STT tunnels extend in this figure from one virtual switch to another,
   providing a virtual link between the switches over some arbitrary
   underlay. More generally, STT tunnels operate between a pair of
   tunnel endpoints; these endpoints may be virtual switches, physical
   switches, or some other device (e.g. an appliance). The STT tunnel
   provides a virtual point-to-point Ethernet link between the
   endpoints. Frames are handed to the tunnel by some entity (e.g. a VM
   that is connected to a virtual switch in this picture) and first
   encapsulated with an STT Frame header. STT Frames may then be
   fragmented in the NIC, and are encapsulated with a tunnel header (the
   STT segment header) for transmission over the underlay. Note that
   other models are possible, e.g., where one or both tunnel endpoints
   are implemented in a physical switch. In such cases the tunnel
   endpoint may forward packets to and from another link (physical or
   virtual) rather than to a VM.

   +----------------------+                  +----------------------+
   | +--+   +-------+---+ |                  | +---+-------+   +--+ |
   | |VM|---|       |   | |                  | |   |       |---|VM| |
   | +--+   |Virtual|NIC|---    Underlay    ---|NIC|Virtual|   +--+ |
   | +--+   |Switch |   | |      Network     | |   |Switch |   +--+ |
   | |VM|---|       |   | |                  | |   |       |---|VM| |
   | +--+   +-------+---+ |                  | +---+-------+   +--+ |
   +----------------------+                  +----------------------+

               ()===============================()
                       Switch-Switch tunnel

                    Figure 1: STT Reference Model

2.  Design Rationale

   We take as given the need for some form of tunneling to support the
   virtualization of the network as described in Section 1. One might
   reasonably ask whether some existing tunneling protocol such as
   GRE [RFC2784] or L2TPv3 [RFC3931] might suffice. In fact,
   [I-D.sridharan-virtualization-nvgre] does just that, using GRE. The
   primary motivation for STT as opposed to one of the existing
   tunneling methods is to improve the performance of data transfers
   from hosts that implement tunnel endpoints. We expand on this
   rationale below.

2.1.  Segmentation Offload

   A large percentage of network interface cards (NICs) in use today are
   able to perform TCP segmentation offload (TSO). When a NIC supports
   TSO, the host hands a large (greater than 1 TCP MSS) frame of data to
   the NIC along with a set of metadata which includes, among other
   things, the desired MSS, and various fields needed to complete the
   TCP header. The NIC fragments the frame into MSS-sized segments,
   performs the TCP Checksum operation, and applies the appropriate
   headers (TCP, IP and MAC) to each segment.

   On the receive side, some NICs support the reassembly of TCP
   segments, a function referred to as large receive offload (LRO). In
   this case, NICs attempt to reassemble TCP segments and pass larger
   aggregates of data to the host.
   (Since TCP's service model is a byte stream, there is no higher level
   frame for the NIC to reassemble, but it can pass chunks of the stream
   larger than one MSS to the host. Full reassembly of STT frames is
   handled in the host.) The benefits to the host include fewer
   per-packet operations and larger data transfers between host and NIC,
   which amortizes the per-transfer cost (such as interrupt processing)
   more efficiently. These savings can translate into significant
   performance gains for data transfer from the host to the network.

   STT is explicitly designed to leverage the TSO capabilities of
   currently available NICs. While one might think of segmentation as a
   generic function, the majority of NICs are designed specifically to
   support TCP segmentation offload, as the details of the segmentation
   function are highly dependent on the specifics of TCP. In order to
   leverage such capability, therefore, the STT segment header is
   syntactically identical to a valid TCP header. However, we use some
   of the fields in the TCP header (specifically, sequence number and
   ACK number) to support the objectives of STT. The details are
   described in Section 3.2. In essence, we need the same set of
   information that IP datagrams carry when IP fragmentation takes
   place: a unique identifier for the frame that has been fragmented, an
   offset into that frame for the current fragment, and the length of
   the frame to be reassembled. We fit these fields into the TCP header
   fields traditionally used for the SEQ and ACK numbers. STT segments
   are transmitted as IP datagrams using the TCP protocol number (6).
   The primary means to recognize STT segments is the destination port
   number. We discuss the interoperability impact of these design
   choices in Section 4.
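
   As an illustration of how the identifier, offset, and length fit
   into the SEQ and ACK fields, the following Python sketch uses the
   layout given in Section 3.2 (upper 16 bits of SEQ carry the frame
   length, lower 16 bits carry the fragment offset, ACK carries the
   frame identifier). The function names are hypothetical and the
   segmentation loop only mimics in software what a TSO-capable NIC
   would do; it is not part of the protocol definition.

```python
def encode_seq_ack(frame_len, offset, frame_id):
    """Pack STT fragmentation fields into 32-bit SEQ and ACK values."""
    if not 0 <= frame_len <= 0xFFFF:
        raise ValueError("frame length must fit in 16 bits")
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("offset must fit in 16 bits")
    seq = (frame_len << 16) | offset   # upper 16: length, lower 16: offset
    ack = frame_id & 0xFFFFFFFF        # same value for every segment of a frame
    return seq, ack


def decode_seq_ack(seq, ack):
    """Recover (frame_len, offset, frame_id) from SEQ and ACK values."""
    return seq >> 16, seq & 0xFFFF, ack


def segment_frame(frame, mss, frame_id):
    """Yield (seq, ack, chunk) for each MSS-sized piece of an STT frame."""
    for offset in range(0, len(frame), mss):
        seq, ack = encode_seq_ack(len(frame), offset, frame_id)
        yield seq, ack, frame[offset:offset + mss]


def reassemble(segments):
    """Rebuild one frame from (seq, ack, chunk) tuples in any order.

    Simplifying assumptions: all segments belong to the same frame and
    there are no duplicates. Returns None if bytes are still missing.
    """
    buf = None
    received = 0
    for seq, ack, chunk in segments:
        frame_len, offset, _ = decode_seq_ack(seq, ack)
        if buf is None:
            buf = bytearray(frame_len)
        buf[offset:offset + len(chunk)] = chunk
        received += len(chunk)
    if buf is None or received != len(buf):
        return None
    return bytes(buf)
```

   Because only the lower 16 bits change from one segment to the next,
   the SEQ value advances by the segment size, which is the behavior a
   NIC expects of an ordinary TCP sequence number.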

   The net effect of using TSO is that the frame size that is sent by
   endpoints in the virtualized network can be much larger than the MTU
   of the underlying physical network. The primary benefit of this is a
   significant performance gain when large amounts of data are being
   transferred between nodes in the virtual network. A secondary effect
   is that the header of the STT frame is amortized across a larger
   amount of data, reducing the need to shrink the STT frame header to
   minimum size.

   Note that, while segmentation offload is the primary NIC function
   that STT takes advantage of, other NIC offload functions such as
   checksum calculation can also be leveraged.

2.2.  Metadata

   When a frame is delivered to a NIC that supports TSO for
   segmentation and transmission, a certain amount of metadata is
   typically passed along with it. This includes the MSS and
   potentially a VLAN tag to be applied to the transmitted packets.

   In some virtualized network deployments, an STT frame may traverse a
   tunnel, be received and reassembled at an STT endpoint, and then be
   sent on another physical interface. In such cases, the tunnel
   terminating endpoint may need to pass metadata to a NIC to enable
   transmission of frames on the physical link. For this reason,
   appropriate metadata is carried in the STT frame header.

2.3.  Context Information

   When an STT Frame is received by a tunnel endpoint, it needs to be
   directed to the appropriate entity in the virtualized network to
   which it belongs. For this reason, a Context ID is required in the
   STT frame header. Some other encapsulations (e.g.
   [I-D.mahalingam-dutt-dcops-vxlan],
   [I-D.sridharan-virtualization-nvgre]) use an explicit tenant network
   identifier or virtual network identifier. The Context Identifier can
   be thought of as a generalized form of virtual network identifier.
   Using a larger and more general identifier allows for a broader range
   of service models and allows ample room for future expansion. There
   is little downside to using a larger field here because it is
   amortized across the entire STT Frame rather than being present in
   each packet.

2.4.  Alignment

   Software implementations of tunnel endpoints benefit from 32-bit
   alignment of the data to be manipulated. Because the Ethernet header
   is not a multiple of 32 bits (it is 14 bytes), 2 bytes of padding are
   added to the STT header, causing the payload beyond the encapsulated
   Ethernet header, which typically includes the IP header of the
   encapsulated frame, to be 32-bit aligned.

2.5.  Equal Cost Multipath

   It is essential that traffic passing through the physical network can
   be efficiently distributed across multiple paths. Standard equal
   cost multipath (ECMP) techniques involve hashing on address and port
   numbers in the outer protocol headers. There are two main issues to
   address with ECMP. First, it is important that, when a set of
   packets belong to a single flow (e.g. a TCP connection in the virtual
   network), all those packets should follow the same path. Second, all
   paths should be used efficiently, i.e. there needs to be sufficient
   entropy among the different flows to ensure they get distributed
   evenly across multiple paths.

   STT achieves the first goal by ensuring that the source and
   destination ports and addresses in the outer header are all the same
   for a single flow. The second goal is achieved by generating the
   source port using a random hash of fields in the headers of the inner
   packets, e.g. the ports and addresses of the virtual flow's packets.
   We provide more details on the usage of port numbers in Section 3.2.

2.6.  Efficient Software Processing

   The design of STT is largely motivated by the desire to tunnel
   packets efficiently between virtual switches running in software. In
   addition to the points noted above, this leads to some design
   optimizations to simplify processing of packets, such as the use of
   an "L4 offset" field in the STT header to enable the payload to be
   located quickly without extensive header parsing.

3.  Frame Formats

   STT encapsulates data payloads of up to 64KB (limited by the length
   field in the STT segment header, described in Section 3.2). Those
   frames are then segmented (depending on the MTU of the underlying
   physical network) and the resulting segments are encapsulated in a
   standard TCP header, which in turn is encapsulated by an IP header
   and finally a MAC header. This is illustrated in Figure 2.

                           +-----------+   +----------+     +----------+
                           | IP Header |   |IP Header |     |IP header |
   +-----------+           +-----------+   +----------+     +----------+
   |STT Frame  |           |TCP-like   |   |TCP-like  |     |TCP-like  |
   |  Header   |           |  header   |   |  header  |     |  header  |
   +-----------+           +-----------+   +----------+     +----------+
   |           |   --->    | STT Frame |   |Next part | ... |Last part |
   |Payload    |           |  Header   |   |of Payload|     |of Payload|
   .           .           +-----------+   |          |     |          |
   .           .           |           |   |          |     |          |
   .           .           | Start of  |   |          |     |          |
   +-----------+           | Payload   |   |          |     +----------+
                           +-----------+   +----------+

   Original data           STT Frame is segmented and transmitted as
   frame is encapped       a set of TCP segments (MAC
   with STT Header         headers not shown)

             Figure 2: STT Frame Fragments and Encapsulation

   The STT Frame header and the usage of the TCP-like header are
   described in detail below. The TCP segments shown in Figure 2 are of
   course further encapsulated as IP datagrams, and may be sent as
   either IPv4 or IPv6. The resulting IP datagrams are then
   transmitted in the appropriate MAC level frame (e.g.
   Ethernet, not shown in the figure) for the underlying physical
   network over which the tunnels are established.

3.1.  STT Frame Format

   Figure 3 illustrates the header of an STT frame before it is
   segmented.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Version      | Flags         |  L4 Offset    |  Reserved     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Max. Segment Size          | PCP |V|     VLAN ID           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                     Context ID (64 bits)                      +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Padding                   |    data                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |

                       Figure 3: STT Frame Format

   The STT frame header contains the following fields:

   o  Version - currently 0.

   o  Flags - describes encapsulated packet, see below.

   o  L4 offset - offset in bytes from the end of the STT Frame header
      to the start of the encapsulated layer 4 (TCP/UDP) header.

   o  Reserved field - MUST be zero on transmission and ignored on
      receipt.

   o  Max Segment Size - the segment size (the negotiated MSS in the
      case of TCP) that should be used by a tunnel endpoint that is
      transmitting this frame onto another network. MUST be zero if
      segmentation offload is not in use.

   o  PCP - the 3-bit Priority Code Point field that should be applied
      to this packet by an STT tunnel endpoint on transmission to
      another network (see Section 2.2).

   o  V - a one bit flag that, if set, indicates the presence of a valid
      VLAN ID in the following field and valid PCP in the preceding
      field.

   o  VLAN ID - 12-bit VLAN tag that should be applied to this packet by
      an STT tunnel endpoint on transmission to another network (see
      Section 2.2).

   o  Context ID - 64 bits of context information, described in detail
      in Section 2.3.

   o  Padding - 16 bits, added for alignment as described in
      Section 2.4.

   The flags field contains:

   o  0: Checksum verified. Set if the checksum of the encapsulated
      packet has been verified by the sender.

   o  1: Checksum partial. Set if the checksum in the encapsulated
      packet has been computed only over the TCP/IP pseudoheader (or
      UDP/IP pseudoheader, if the encapsulated packet is UDP). This bit
      MUST be set if segmentation offload is used by the sender. Note
      that bit 0 and bit 1 cannot both be set in the same header.

   o  2: IP version. Set if the encapsulated packet is IPv4, not set if
      the packet is IPv6. See below for discussion of non-IP payloads.

   o  3: TCP payload. Set if the encapsulated packet is TCP.

   o  4-7: Unused, MUST be zero on transmission and ignored on receipt.

   As noted above, several of these fields are present primarily to
   enable efficient processing of the packet when it is received at a
   tunnel endpoint. (For example, it's entirely possible to determine
   if the packet is IPv4 or IPv6 by looking at the Ethernet header -
   it's just more efficient not to have to do so.)

   The payload of the STT frame is an untagged Ethernet frame.

   In the case where the Ethernet frame contains TCP/IP or UDP/IP as its
   payload, this encapsulated packet should be correctly formatted as if
   it were about to undergo unfragmented transmission (even though it
   will ultimately be segmented as part of the transmission process).
   This means it should have a correct TCP or UDP checksum (possibly
   "partial", as noted above), correct length fields for its
   unfragmented state, and correct IP header checksum (if IPv4).

   If the length of the payload to be encapsulated exceeds 64KB, or if
   the offset to the L4 header exceeds 255 bytes, then it will not be
   possible to offload the packet to the NIC for segmentation. In this
   case, the payload needs to be checksummed and segmented before
   handing to the NIC.
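
   The 18-byte STT frame header of Figure 3 can be serialized as in the
   following Python sketch. Two points are assumptions rather than
   statements of this document: flag bit 0 is taken to be the least
   significant bit of the Flags octet, and the 64KB payload limit is
   approximated as the largest value of a 16-bit length field. The
   names pack_stt_header and can_offload are hypothetical.

```python
import struct

# Version, Flags, L4 Offset, Reserved, MSS, PCP/V/VLAN ID, Context ID,
# Padding, all in network byte order: 1+1+1+1+2+2+8+2 = 18 bytes.
STT_HDR = struct.Struct("!BBBBHHQH")

# Assumption: flag bit 0 is the least significant bit of the octet.
FLAG_CSUM_VERIFIED = 1 << 0
FLAG_CSUM_PARTIAL = 1 << 1
FLAG_IPV4 = 1 << 2
FLAG_TCP = 1 << 3


def pack_stt_header(flags, l4_offset, mss, context_id, vlan_id=None, pcp=0):
    """Serialize an STT frame header (version 0)."""
    if flags & FLAG_CSUM_VERIFIED and flags & FLAG_CSUM_PARTIAL:
        raise ValueError("bit 0 and bit 1 cannot both be set")
    if flags & ~0x0F:
        raise ValueError("flag bits 4-7 must be zero on transmission")
    if not 0 <= l4_offset <= 255:
        raise ValueError("L4 offset must fit in 8 bits")
    tci = 0
    if vlan_id is not None:
        # PCP (3 bits), V (1 bit, set), VLAN ID (12 bits)
        tci = ((pcp & 0x7) << 13) | (1 << 12) | (vlan_id & 0xFFF)
    # Reserved and Padding are always sent as zero.
    return STT_HDR.pack(0, flags, l4_offset, 0, mss, tci, context_id, 0)


def can_offload(payload_len, l4_offset):
    """Check the two offload limits (assumes a 16-bit length field)."""
    return payload_len <= 0xFFFF and l4_offset <= 255
```

   For an untagged Ethernet frame carrying IPv4 without options, the L4
   offset would be 34 (14 bytes of Ethernet header plus 20 bytes of IP
   header).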

   Because there is no negotiation between end-points of an STT tunnel,
   only basic TSO capabilities should be assumed. For example, ECN
   (explicit congestion notification) support should not be assumed, so
   TSO should not be requested for packets requiring such support.
   Instead, such payloads should be segmented before handing to the NIC.

3.1.1.  Handling non-TCP/IP and non-UDP/IP payloads

   Note that the STT header does not have a general "protocol" field to
   allow the efficient processing of arbitrary payloads. The current
   version is designed to provide a virtual Ethernet link, and hence
   efficiently supports only Ethernet frames as the payload. The
   Ethernet header itself contains a protocol field, which then
   identifies the higher layer protocol, so it is straightforward to
   accommodate non-IP traffic. Note however that offloading support
   will not typically be available for traffic other than the following:
   TCP and UDP over IPv4 or IPv6, with a maximum of a single VLAN tag
   stored in the STT header. Other protocols will need to be
   appropriately formatted for direct transmission prior to
   encapsulation.

   It will be noted that the STT Frame header does contain fields that
   are intended to assist in efficient processing of IPv4 and IPv6
   packets. For packets that are not being offloaded, these fields MUST
   be set to zero on transmission and ignored on receipt.

   The use of STT to carry payloads other than Ethernet is theoretically
   possible but is beyond the scope of this document.

3.2.  Usage of TCP Header by STT

   Figure 4 illustrates the usage of the TCP header by STT. This figure
   is essentially identical to that in [RFC0793] with the exception that
   we denote with an asterisk (*) two fields that are used by STT to
   convey something other than the information that is conveyed by TCP.
   Syntactically, STT segments look identical to TCP segments.
   However, STT tunnel endpoints treat the Sequence number and
   Acknowledgment number differently than TCP endpoints treat those
   fields. Furthermore, as noted above, there is no TCP state machine
   associated with an STT tunnel.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Sequence Number(*)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Acknowledgment Number(*)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 4: STT Segment Format

   The Destination port is to be requested from IANA, in the user range
   (1024-49151).

   In order to allow correct reassembly of the STT frame, the source
   port MUST be constant for all segments of a single STT frame.

   As noted above (Section 2.5) the source port SHOULD be the same for
   all frames that belong to a single flow in the virtual network, e.g.
   a single TCP connection.

   Also, to encourage efficient distribution of traffic among multiple
   paths when ECMP is used, the method to calculate the source port
   should provide a random distribution of source port numbers. An
   example mechanism would be a random hash on ports and addresses of
   the TCP headers of the flow in the virtual network.
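
   One possible way to derive such a source port is sketched below in
   Python. The choice of SHA-256 as the hash, the inclusion of the
   protocol number in the hash input, and the function name
   stt_source_port are illustrative assumptions; this document does not
   mandate a particular hash. The result is mapped into the IANA
   ephemeral range (49152-65535).

```python
import hashlib
import ipaddress
import struct

EPHEMERAL_BASE = 49152
EPHEMERAL_SIZE = 65536 - EPHEMERAL_BASE   # 16384 ports


def stt_source_port(src_ip, dst_ip, src_port, dst_port, proto=6):
    """Map an inner flow to a stable outer source port.

    The same inner addresses and ports always yield the same outer
    port, so all frames of one flow follow one ECMP path, while
    different flows are spread across the ephemeral range.
    """
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HHB", src_port, dst_port, proto)
    )
    digest = hashlib.sha256(key).digest()
    value = int.from_bytes(digest[:4], "big")
    return EPHEMERAL_BASE + value % EPHEMERAL_SIZE
```

   A cheaper non-cryptographic hash would serve equally well here; the
   only properties relied upon are determinism per flow and a roughly
   uniform spread across flows.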

   It is RECOMMENDED to use a source port number from the ephemeral
   range defined by IANA (49152-65535).

   The Sequence number and Acknowledgment number fields are re-purposed
   in a way that does not confuse NICs that expect them to be used in
   the conventional manner. The ACK field is used as a packet
   identifier for the purposes of fragmentation, equivalent in function
   to the Identification field of IPv4 or the IPv6 Fragment header: it
   MUST be constant for all STT segments of a given frame, and different
   from any value used recently for other STT frames sent over this
   tunnel.

   The upper 16 bits of the SEQ field are used to convey the length
   of the STT frame in bytes. The lower 16 bits of the SEQ field are
   used to convey the offset (in bytes) of the current fragment within
   the larger STT frame.

   Reassembly of the fragments may be done partially by NICs that
   perform LRO, since the sequence numbers of frames will increment
   appropriately. That is, the upper 16 bits don't change, and the
   lower 16 bits increment by N for every N byte segment that is
   transmitted, just as would be the case if an actual sequence number
   were being sent. Note that the size limit of an STT frame ensures
   that sequence numbers cannot wrap while sending the segments of a
   single STT frame.

   In the event that a NIC does not consider the ACK field when merging
   received packets, LRO MUST be disabled for this NIC when using STT.

   All the fields after ACK have their conventional meaning, although
   nothing will be done with the Window or Urgent pointer values. Those
   fields SHOULD be zero on transmit and ignored on receipt. It is
   RECOMMENDED that the PSH (Push) flag be set when transmitting the
   last segment of a frame in order to cause data to be delivered by the
   NIC without waiting for other fragments.
   The ACK flag SHOULD be set to ensure that a receiving NIC passes the
   ACK field to the host to assist in reassembly. All other flags
   SHOULD be zero on transmit and ignored on receipt.

3.3.  Encapsulation of STT Segments in IP

   From the perspective of IP, an STT segment is just like any other TCP
   segment. The protocol number (IPv4) or Next Header (IPv6) has the
   value 6, as for regular TCP. The resulting IP datagram is then
   encapsulated in the appropriate L2 header (e.g. Ethernet) for
   transmission on the physical medium.

3.3.1.  Diffserv and ECN-Marking

   When traffic is encapsulated in a tunnel header, there are numerous
   options as to how the Diffserv Code-Point (DSCP) and ECN markings are
   set in the outer header and propagated to the inner header on
   decapsulation.

   [RFC2983] defines two modes for mapping the DSCP markings from inner
   to outer headers and vice versa. The Uniform model copies the inner
   DSCP marking to the outer header on tunnel ingress, and copies that
   outer header value back to the inner header at tunnel egress. The
   Pipe model sets the DSCP value to some value based on local policy at
   ingress and does not modify the inner header on egress. Both models
   SHOULD be supported by STT endpoints. However, there is an
   additional complexity with the Uniform model for STT, because a
   single IP datagram that is transmitted over the tunnel appears as
   multiple IP datagrams on the wire. Thus it is not guaranteed that
   all segments of the STT frame will have the same DSCP at egress. If
   Uniform model behavior is configured, it is RECOMMENDED that the DSCP
   of the first segment of the STT frame be used to set the DSCP value
   of the IP header in the decapsulated STT frame.

   [RFC6040] describes the correct ECN behavior for any type of IP in IP
   tunnel, and this behavior SHOULD be followed for STT tunnels.
   As with the Uniform Diffserv tunnel model, the fact that one inner IP
   datagram is segmented into multiple outer datagrams makes the
   situation slightly more complex. It is RECOMMENDED that, if any
   segment of the received STT frame has the CE (congestion experienced)
   bit set in its IP header, the CE bit be set in the IP header of the
   decapsulated STT frame.

3.3.2. Packet Loss

   Individual IP datagrams may be dropped (most often due to congestion)
   and, since there is no acknowledgment or reliable delivery of these
   datagrams, there is the potential to corrupt an entire STT frame due
   to the loss of a single IP datagram. Fortunately, there are
   solutions to this problem in the case where the higher layer protocol
   running over STT is TCP. An STT receiving endpoint running in an
   end-system, as shown in Figure 1 for example, is not required to
   deliver complete STT frames to the TCP stack in the receiving VM. A
   partial frame payload can be delivered and the receiving TCP stack
   can deal with the missing bytes just as it would if running directly
   over a physical network. That is, TCP in the VM can send ACKs for
   the contiguous bytes received to trigger retransmission of the
   missing bytes by the sender. This is similar to the operation of LRO
   in current NICs. There are some subtleties to making this work
   correctly in the STT context, and it does depend on the STT endpoint
   being aware of the higher layer protocols consuming data in the VM to
   which it is connected. The main point of this discussion is that, in
   the common deployments of STT running in a virtual switch, the
   potential harm of losing individual packets is not as serious as it
   might first appear.

3.4. Broadcast and Multicast

   It is possible to establish point-to-multipoint STT tunnels by using
   an IP multicast address as the destination address of the tunnel.
   These may be used for broadcast or multicast traffic if the
   underlying physical network supports IP multicast. Control
   mechanisms for setting up such multicast groups are beyond the scope
   of this document. It is worth repeating that, despite the syntactic
   resemblance between the STT segment header and the TCP header, there
   is no TCP state machine associated with an STT tunnel, so the
   traditional issues of combining multicast with TCP (or reliable
   transports more generally) do not arise.

4. Interoperability Issues

   It will be noted that an STT packet on the wire appears exactly the
   same as a TCP packet, but that processing of an STT packet on
   reception is entirely different from TCP - no three-way handshake to
   establish a connection, no ACKs, retransmission, etc. Hence, an STT
   tunnel endpoint clearly needs to be configured to behave in the
   correct manner rather than to perform standard TCP processing on the
   packet. The primary way to recognize an STT segment is the
   destination port number in the TCP header. In the event that an STT
   packet is inadvertently delivered to a device that is not configured
   to behave as an STT tunnel endpoint, no TCP connection will be
   established and STT packets will be dropped.

   Being stateless, STT does not provide any sort of congestion control.
   In this sense it is equivalent to other tunneling protocols such as
   GRE. The assumption is that congestion control, if required, is
   provided by higher layers (e.g. a real TCP connection generating the
   payloads of STT frames), just as in any other tunneling protocol.

   STT deployments are almost entirely limited at present to intra-data
   center environments. In these environments, STT tunnels between
   pairs of endpoints are typically created by some sort of network
   virtualization controller.
   STT packets should therefore remain
   within the perimeter of the overlay that is managed by that
   controller. In the event of some misconfiguration or erroneous
   controller behavior, STT packets could be sent outside of this
   controlled domain into the broader Internet. As noted above, any
   endpoint that is not expecting STT packets will drop them, as they
   will appear to belong to an unestablished TCP session. Many
   firewalls are also likely to drop erroneously sent STT packets for
   the same reason.

   Within a network virtualization overlay, there may be middle boxes
   (e.g. firewalls) that process TCP. It is likely that, in the near
   term at least, such devices will drop STT packets, as there will be
   no TCP connection state established. This could prevent the correct
   operation of the overlay. This is clearly undesirable, but it is a
   general issue with any form of tunneling - the nature of many middle
   boxes is that they will not permit tunnels to pass through them.
   Hence the best solution is simply to avoid deploying middle boxes at
   locations where STT tunnels (or other forms of tunnels for network
   virtualization) will need to pass through them. This will not,
   however, always be feasible, especially when virtualized networks
   extend among multiple data centers. Other solutions include
   configuring the middle boxes to permit TCP packets to pass through
   when the port number matches the port assigned for STT.

   In the longer term, we might reasonably expect that middle boxes
   would be able to recognize STT traffic, and to terminate and
   originate STT tunnels if necessary (e.g. to perform functions that
   require the STT payload to be inspected such as stateful
   firewalling).

   It is also possible to provide all the functionality of STT using a
   different IP protocol number (or next header value in IPv6).
   This approach could make sense in the long run but will typically
   not enable current NIC hardware to be leveraged for TSO and LRO
   functions. An alternative approach is to move to a UDP-based
   encapsulation such as Geneve [I-D.gross-geneve]. This, too, requires
   NICs to evolve to support TSO and LRO on tunneled traffic.

   It is also possible to run STT traffic over other forms of tunnel
   (GRE, IPsec, etc.) in which case the STT traffic can pass through
   appropriately configured middle boxes.

5. IANA Considerations

   IANA is requested to allocate a TCP port in the user range
   (1024-49151). The specific request is as follows:

      Service Name: STT
      Transport Protocol(s): TCP
      Assignee: Bruce Davie
      Contact: Bruce Davie
      Description: Stateless Transport Tunneling
      Reference: This document (intended for publication as
         Informational RFC)
      Port Number: TBA (7471 requested)

6. Security Considerations

   In the physical network, STT packets are simply IP datagrams, and do
   not introduce new security issues. Most standard IP security
   mechanisms (such as IPsec encryption or authentication) can be
   implemented on STT packets if desired. As noted above, however,
   tunneling generally interacts poorly with middle boxes, and STT is no
   exception. Devices such as firewalls are likely to drop STT traffic
   unless the capability to recognize STT packets is implemented, or
   unless the STT traffic is itself run over some sort of tunnel that
   the firewall is configured to permit. Intrusion detection systems
   would similarly need to be enhanced to be able to look inside STT
   packets.

   It should also be noted that while STT packets resemble TCP segments,
   the lack of a TCP state machine means that TCP-related security
   issues (e.g. SYN-flooding) do not apply. Similarly, some of the
   benefits of the TCP state machine (e.g.
   the ability to discard
   packets with unexpected sequence numbers) are also absent for STT
   traffic.

   More general issues of security related to network virtualization
   overlays are described in [I-D.ietf-nvo3-security-requirements].

7. Contributors

   The following individuals contributed to this document:

      Brad McConnell
      Rackspace
      5000 Walzem Road
      San Antonio, TX 78218
      Email: bmcconne@rackspace.com

      JC Martin
      eBay
      2145 Hamilton Ave.
      San Jose, CA 95125
      Email: jcmartin@ebaysf.com

      Iben Rodriguez
      eBay
      2477 Woodland Ave
      San Jose, CA 95128
      Email: Iben.rodriguez@gmail.com

      Ilango Ganga
      Intel Corporation
      2200 Mission College Blvd.
      Santa Clara, CA - 95054
      Email: ilango.s.ganga@intel.com

      Igor Gashinsky
      Yahoo!
      111 West 40th Street
      New York, NY 10018
      Email: igor@yahoo-inc.com

8. Acknowledgements

   We thank Martin Casado for inspiring this work and making all the
   introductions, and Ben Pfaff for his explanations of the
   implementation. Thanks also to Pierre Ettori, Yukio Ogawa and
   Koichiro Seto for their helpful comments.

9. References

9.1. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, September 1981.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2. Informative References

   [I-D.gross-geneve]
              Gross, J., Sridhar, T., Garg, P., Wright, C., and I.
              Ganga, "Geneve: Generic Network Virtualization
              Encapsulation", draft-gross-geneve-00 (work in progress),
              February 2014.

   [I-D.ietf-nvo3-overlay-problem-statement]
              Narten, T., Gray, E., Black, D., Fang, L., Kreeger, L.,
              and M. Napierala, "Problem Statement: Overlays for Network
              Virtualization",
              draft-ietf-nvo3-overlay-problem-statement-04 (work in
              progress), July 2013.

   [I-D.ietf-nvo3-security-requirements]
              Hartman, S., Zhang, D., and M. Wasserman, "Security
              Requirements of NVO3",
              draft-ietf-nvo3-security-requirements-02 (work in
              progress), January 2014.

   [I-D.mahalingam-dutt-dcops-vxlan]
              Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A
              Framework for Overlaying Virtualized Layer 2 Networks over
              Layer 3 Networks", draft-mahalingam-dutt-dcops-vxlan-09
              (work in progress), April 2014.

   [I-D.sridharan-virtualization-nvgre]
              Sridharan, M., Greenberg, A., Wang, Y., Garg, P.,
              Venkataramiah, N., Duda, K., Ganga, I., Lin, G., Pearson,
              M., Thaler, P., and C. Tumuluri, "NVGRE: Network
              Virtualization using Generic Routing Encapsulation",
              draft-sridharan-virtualization-nvgre-04 (work in
              progress), February 2014.

   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
              March 2000.

   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
              RFC 2983, October 2000.

   [RFC3931]  Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling
              Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005.

   [RFC6040]  Briscoe, B., "Tunnelling of Explicit Congestion
              Notification", RFC 6040, November 2010.

   [VL2]      Greenberg et al., "VL2: A Scalable and Flexible Data
              Center Network", Proc. ACM SIGCOMM 2009, 2009.

Authors' Addresses

   Bruce Davie (editor)
   VMware, Inc.
   3401 Hillview Ave.
   Palo Alto, CA 94304
   USA

   Email: bdavie@vmware.com

   Jesse Gross
   VMware, Inc.
   3401 Hillview Ave.
   Palo Alto, CA 94304
   USA

   Email: jgross@vmware.com