Network Working Group                                      B. Davie, Ed.
Internet-Draft                                                  J. Gross
Intended status: Informational                              VMware, Inc.
Expires: October 30, 2016                                 April 28, 2016

  A Stateless Transport Tunneling Protocol for Network Virtualization
                                 (STT)
                           draft-davie-stt-08

Abstract

   Network Virtualization places unique requirements on tunneling
   protocols. This draft describes STT (Stateless Transport Tunneling),
   a tunnel encapsulation that enables overlay networks to be built in
   virtualized networks.
   STT is particularly useful when some tunnel
   endpoints are in end-systems, as it utilizes the capabilities of the
   network interface card to improve performance. This draft documents
   the protocol and the rationale for its design, and highlights issues
   that may arise in deployments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 30, 2016.

Copyright Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
     1.2.  Terminology
     1.3.  Reference Model
   2.  Design Rationale
     2.1.  Segmentation Offload
     2.2.  Metadata
     2.3.  Context Information
     2.4.  Alignment
     2.5.  Equal Cost Multipath
     2.6.  Efficient Software Processing
   3.  Frame Formats
     3.1.  STT Frame Format
       3.1.1.  Handling non-TCP/IP and non-UDP/IP payloads
     3.2.  Usage of TCP Header by STT
     3.3.  Encapsulation of STT Segments in IP
       3.3.1.  Diffserv and ECN-Marking
       3.3.2.  Packet Loss
     3.4.  Broadcast and Multicast
   4.  Interoperability Issues
   5.  IANA Considerations
   6.  Security Considerations
   7.  Contributors
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Network Virtualization places unique requirements on tunneling
   protocols. The utility of tunneling in virtualized data centers has
   been described elsewhere; see, for example [RFC7364], [VL2],
   [RFC7348], [RFC7637], [I-D.ietf-nvo3-geneve].
   Tunneling allows a
   virtual overlay topology to be constructed on top of the physical
   data center network, and provides benefits such as:

   o  Ability to manage overlapping addresses between multiple tenants

   o  Decoupling of the virtual topology provided by the tunnels from
      the physical topology of the network

   o  Support for virtual machine mobility independent of the physical
      network

   o  Support for essentially unlimited numbers of virtual networks (in
      contrast to VLANs, for example)

   o  Decoupling of the network service provided to servers from the
      technology used in the physical network (e.g. providing an L2
      service over an L3 fabric)

   o  Isolating the physical network from the addressing of the virtual
      networks, thus avoiding issues such as MAC table size in physical
      switches.

   This draft describes STT (Stateless Transport Tunneling), a tunnel
   encapsulation that enables overlay networks to be built in
   virtualized data center networks, providing the benefits outlined
   above. STT is particularly useful when some tunnel endpoints are in
   end-systems, as it utilizes the capabilities of standard network
   interface cards to improve performance. Multiple independent
   implementations of STT exist and are in production use.

   STT is an IP-based encapsulation and utilizes a TCP-like header
   inside the IP header. It is, however, stateless, i.e., there is no
   TCP connection state of any kind associated with the tunnel. The
   TCP-like header is used for pragmatic reasons, to leverage the
   capabilities of existing network interface cards, but should not be
   interpreted as implying any sort of connection state between
   endpoints.

   STT is typically used to carry Ethernet frames between tunnel
   endpoints. These frames may be considerably larger than the MTU of
   the physical network - up to 64KB.
   Fields in the tunnel header are
   used to allow these large frames to be segmented at the entrance to
   the tunnel according to the MTU of the physical network and
   subsequently reassembled at the far end of the tunnel.

   Because STT uses TCP's header format and protocol number (6), some
   care needs to be taken in the deployment of STT. Section 4 describes
   these deployment considerations.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.2.  Terminology

   The following terms are used in this document:

   Stateless Transport Tunneling (STT). The tunneling mechanism defined
   in this document. The name derives from the fact that the tunnel
   header resembles the TCP/IP headers (hence "transport" tunneling)
   while "stateless" refers to the fact that none of the normal TCP
   state (connection state, send and receive windows, congestion state
   etc.) is associated with the tunnel (as would be required if an
   actual TCP connection were used for tunneling).

   STT Frame. The unit of data that is passed into the tunnel prior to
   segmentation and encapsulation. This frame typically consists of an
   Ethernet frame and an STT Frame header. These frames may be up to
   64KB in size.

   STT Segment. The unit of data that is transmitted on the underlay
   network over which the tunnel operates. An STT segment has headers
   that are syntactically the same as the TCP/IP headers, and typically
   contains part of an STT frame as the payload. These segments must
   fit within the MTU of the physical network.

   Context ID. A 64-bit field in the STT frame header that conveys
   information about the disposition of the STT frame between the tunnel
   endpoints.
   One example use of the Context ID is to direct delivery
   of the STT frame payload to the appropriate virtual network or
   virtual machine.

   MSS. Maximum Segment Size. The maximum number of bytes that can be
   sent in one TCP segment [RFC0793].

   NIC. Network Interface Card.

   TSO. TCP Segmentation Offload. A function provided by many
   commercial NICs that allows large data units to be passed to the NIC,
   the NIC being responsible for creating MSS-sized segments with
   correct TCP/IP headers.

   LRO. Large Receive Offload. The receive-side equivalent function of
   TSO, in which multiple TCP segments are coalesced into larger data
   units.

   VM. Virtual Machine.

1.3.  Reference Model

   Our conceptual model for a virtualized network is shown in Figure 1.
   STT tunnels extend in this figure from one virtual switch to another,
   providing a virtual link between the switches over some arbitrary
   underlay. More generally, STT tunnels operate between a pair of
   tunnel endpoints; these endpoints may be virtual switches, physical
   switches, or some other device (e.g. an appliance). The STT tunnel
   provides a virtual point-to-point Ethernet link between the
   endpoints. Frames are handed to the tunnel by some entity (e.g. a VM
   that is connected to a virtual switch in this picture) and first
   encapsulated with an STT Frame header. STT Frames may then be
   fragmented in the NIC, and are encapsulated with a tunnel header (the
   STT segment header) for transmission over the underlay. Note that
   other models are possible, e.g., where one or both tunnel endpoints
   are implemented in a physical switch. In such cases the tunnel
   endpoint may forward packets to and from another link (physical or
   virtual) rather than to a VM.

   +----------------------+                   +----------------------+
   | +--+   +-------+---+ |                   | +---+-------+   +--+ |
   | |VM|---|       |   | |                   | |   |       |---|VM| |
   | +--+   |Virtual|NIC|---    Underlay    --- |NIC|Virtual|   +--+ |
   | +--+   |Switch |   | |      Network      | |   |Switch |   +--+ |
   | |VM|---|       |   | |                   | |   |       |---|VM| |
   | +--+   +-------+---+ |                   | +---+-------+   +--+ |
   +----------------------+                   +----------------------+

                   ()===============================()
                          Switch-Switch tunnel

                     Figure 1: STT Reference Model

2.  Design Rationale

   We take as given the need for some form of tunneling to support the
   virtualization of the network as described in Section 1. One might
   reasonably ask whether some existing tunneling protocol such as
   GRE [RFC2784] or L2TPv3 [RFC3931] might suffice. In fact, [RFC7637]
   does just that, using GRE. The primary motivation for STT as opposed
   to one of the existing tunneling methods is to improve the
   performance of data transfers from hosts that implement tunnel
   endpoints. We expand on this rationale below.

2.1.  Segmentation Offload

   A large percentage of network interface cards (NICs) in use today are
   able to perform TCP segmentation offload (TSO). When a NIC supports
   TSO, the host hands a large (greater than 1 TCP MSS) frame of data to
   the NIC along with a set of metadata which includes, among other
   things, the desired MSS, and various fields needed to complete the
   TCP header. The NIC fragments the frame into MSS-sized segments,
   performs the TCP Checksum operation, and applies the appropriate
   headers (TCP, IP and MAC) to each segment.

   On the receive side, some NICs support the reassembly of TCP
   segments, a function referred to as large receive offload (LRO). In
   this case, NICs attempt to reassemble TCP segments and pass larger
   aggregates of data to the host.
   (Since TCP's service model is a byte
   stream, there is no higher level frame for the NIC to reassemble, but
   it can pass chunks of the stream larger than one MSS to the host.)
   The benefits to the host include fewer per-packet operations and
   larger data transfers between host and NIC, which amortizes the
   per-transfer cost (such as interrupt processing) more efficiently.
   These savings can translate into significant performance gains for
   data transfer from the host to the network.

   STT is explicitly designed to leverage the TSO capabilities of
   currently available NICs. While one might think of segmentation as a
   generic function, the majority of NICs are designed specifically to
   support TCP segmentation offload, as the details of the segmentation
   function are highly dependent on the specifics of TCP. In order to
   leverage such capability, therefore, the STT segment header is
   syntactically identical to a valid TCP header. However, we use some
   of the fields in the TCP header (specifically, sequence number and
   ACK number) to support the objectives of STT. The details are
   described in Section 3.2. In essence, we need the same set of
   information that IP datagrams carry when IP fragmentation takes
   place: a unique identifier for the frame that has been fragmented, an
   offset into that frame for the current fragment, and the length of
   the frame to be reassembled. We fit these fields into the TCP header
   fields traditionally used for the SEQ and ACK numbers. STT segments
   are transmitted as IP datagrams using the TCP protocol number (6).
   The primary means to recognize STT segments is the destination port
   number. We discuss the interoperability impact of these design
   choices in Section 4.

   The net effect of using TSO is that the frame size that is sent by
   endpoints in the virtualized network can be much larger than the MTU
   of the underlying physical network.
   The primary benefit of this is a
   significant performance gain when large amounts of data are being
   transferred between nodes in the virtual network. A secondary effect
   is that the header of the STT frame is amortized across a larger
   amount of data, reducing the need to shrink the STT frame header to
   minimum size.

   Note that, while segmentation offload is the primary NIC function
   that STT takes advantage of, other NIC offload functions such as
   checksum calculation can also be leveraged.

2.2.  Metadata

   When a frame is delivered to a NIC that supports TSO for
   segmentation and transmission, a certain amount of metadata is
   typically passed along with it. This includes the MSS and
   potentially a VLAN tag to be applied to the transmitted packets.

   In some virtualized network deployments, an STT frame may traverse a
   tunnel, be received and reassembled at an STT endpoint, and then be
   sent on another physical interface. In such cases, the tunnel
   terminating endpoint may need to pass metadata to a NIC to enable
   transmission of frames on the physical link. For this reason,
   appropriate metadata is carried in the STT frame header.

2.3.  Context Information

   When an STT Frame is received by a tunnel endpoint, it needs to be
   directed to the appropriate entity in the virtualized network to
   which it belongs. For this reason, a Context ID is required in the
   STT frame header. Some other encapsulations (e.g. [RFC7348],
   [RFC7637]) use an explicit tenant network identifier or virtual
   network identifier. The Context Identifier can be thought of as a
   generalized form of virtual network identifier. Using a larger and
   more general identifier allows for a broader range of service models
   and allows ample room for future expansion. There is little downside
   to using a larger field here because it is amortized across the
   entire STT Frame rather than being present in each packet.

2.4.  Alignment

   Software implementations of tunnel endpoints benefit from 32-bit
   alignment of the data to be manipulated. Because the Ethernet header
   is not a multiple of 32 bits (it is 14 bytes), 2 bytes of padding are
   added to the STT header, causing the payload beyond the encapsulated
   Ethernet header, which typically includes the IP header of the
   encapsulated frame, to be 32-bit aligned.

2.5.  Equal Cost Multipath

   It is essential that traffic passing through the physical network can
   be efficiently distributed across multiple paths. Standard equal
   cost multipath (ECMP) techniques involve hashing on address and port
   numbers in the outer protocol headers. There are two main issues to
   address with ECMP. First, it is important that, when a set of
   packets belong to a single flow (e.g. a TCP connection in the virtual
   network), all those packets follow the same path. Second, all
   paths should be used efficiently, i.e. there needs to be sufficient
   entropy among the different flows to ensure they get distributed
   evenly across multiple paths.

   STT achieves the first goal by ensuring that the source and
   destination ports and addresses in the outer header are all the same
   for a single flow. The second goal is achieved by generating the
   source port using a random hash of fields in the headers of the inner
   packets, e.g. the ports and addresses of the virtual flow's packets.
   We provide more details on the usage of port numbers in Section 3.2.

2.6.  Efficient Software Processing

   The design of STT is largely motivated by the desire to tunnel
   packets efficiently between virtual switches running in software. In
   addition to the points noted above, this leads to some design
   optimizations to simplify processing of packets, such as the use of
   an "L4 offset" field in the STT header to enable the payload to be
   located quickly without extensive header parsing.
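
   As a non-normative illustration of the source port selection
   described in Section 2.5, the following Python sketch hashes the
   inner flow's addresses, protocol, and ports into an outer source
   port. The function name, the use of SHA-256 as the hash, and the
   port range 49152-65535 are assumptions made for this example; this
   document does not specify any of them.

```python
import hashlib
import ipaddress
import struct


def stt_source_port(src_ip: str, dst_ip: str, proto: int,
                    src_port: int, dst_port: int,
                    low: int = 49152, high: int = 65535) -> int:
    """Map an inner flow to a stable outer source port (hypothetical helper).

    The same inner 5-tuple always yields the same port, so all packets
    of one virtual flow follow one ECMP path; distinct flows are spread
    across the [low, high] range by the hash.
    """
    key = (ipaddress.ip_address(src_ip).packed
           + ipaddress.ip_address(dst_ip).packed
           + struct.pack("!BHH", proto, src_port, dst_port))
    digest = hashlib.sha256(key).digest()
    value = int.from_bytes(digest[:4], "big")
    return low + value % (high - low + 1)
```

   Any hash with good dispersion would serve equally well; the only
   properties the text relies on are determinism per flow and an even
   spread across flows.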

3.  Frame Formats

   STT encapsulates data payloads of up to 64KB (limited by the length
   field in the STT segment header, described in Section 3.2). Those
   frames are then handed to the NIC, which segments them to an
   appropriate size given the MTU of the underlying physical network,
   and encapsulates the resulting segments in a TCP-like header, which
   in turn is encapsulated by an IP header and finally a MAC header.
   (The header is "TCP-like" in the sense that it has all the same
   fields as a standard TCP header, but some are interpreted differently
   as described in Section 3.2.) The encapsulation process is
   illustrated in Figure 2.

                       +-----------+  +----------+       +----------+
                       | IP Header |  |IP Header |       |IP header |
   +-----------+       +-----------+  +----------+       +----------+
   |STT Frame  |       |TCP-like   |  |TCP-like  |       |TCP-like  |
   |  Header   |       |  header   |  |  header  |       |  header  |
   +-----------+       +-----------+  +----------+       +----------+
   |           | --->  | STT Frame |  |Next part |  ...  |Last part |
   |Payload    |       |  Header   |  |of Payload|       |of Payload|
   .           .       +-----------+  |          |       |          |
   .           .       |           |  |          |       |          |
   .           .       | Start of  |  |          |       |          |
   +-----------+       |  Payload  |  |          |       +----------+
                       +-----------+  +----------+

   Original data       STT Frame is segmented by the NIC and
   frame is encapped   transmitted as a set of TCP segments (MAC
   with STT Header     headers not shown)

             Figure 2: STT Frame Fragments and Encapsulation

   The STT Frame header and the usage of the TCP-like header are
   described in detail below. The TCP segments shown in
   Figure 2 are of course further encapsulated as IP datagrams, and may
   be sent as either IPv4 or IPv6. The resulting IP datagrams are then
   transmitted in the appropriate MAC level frame (e.g. Ethernet, not
   shown in the figure) for the underlying physical network over which
   the tunnels are established.

3.1.  STT Frame Format

   Figure 3 illustrates the header of an STT frame before it is
   segmented.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Version    |     Flags     |   L4 Offset   |   Reserved    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Max. Segment Size       | PCP |V|        VLAN ID        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                     Context ID (64 bits)                      +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Padding            |             data              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |

                       Figure 3: STT Frame Format

   The STT frame header contains the following fields:

   o  Version - currently 0. If a non-zero version field is received by
      an implementation that supports only version zero, the frame MUST
      be discarded.

   o  Flags - describe the encapsulated packet; see below.

   o  L4 offset - offset in bytes from the end of the STT Frame header
      to the start of the encapsulated layer 4 (TCP/UDP) header. If the
      encapsulated packet is not IPv4 or IPv6, this field SHOULD be set
      to zero.

   o  Reserved field - MUST be zero on transmission and ignored on
      receipt.

   o  Max Segment Size - the segment size (the negotiated MSS in the
      case of TCP) that should be used by a tunnel endpoint that is
      transmitting this frame onto another network. MUST be zero if
      segmentation offload is not in use.

   o  PCP - the 3-bit Priority Code Point field that should be applied
      to this packet by an STT tunnel endpoint on transmission to
      another network (see Section 2.2). Meaningful only if the V bit
      is set.

   o  V - a one bit flag that, if set, indicates the presence of a valid
      VLAN ID in the following field and valid PCP in the preceding
      field. When this flag is set, an 802.1Q header will be applied to
      the packet by the STT tunnel endpoint on transmission.
      The TPID will be 0x8100.

   o  VLAN ID - 12-bit VLAN tag that should be applied to this packet by
      an STT tunnel endpoint on transmission to another network (see
      Section 2.2). Any valid VLAN ID (including zero) may be used.
      Meaningful only if the V bit is set.

   o  Context ID - 64 bits of context information, described in detail
      in Section 2.3.

   o  Padding - 16 bits as described in Section 2.4. MUST be set to
      zero on transmission and ignored on receipt.

   The flags field is an 8-bit field organized as follows:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |C|P|V|T| Res.  |
   +-+-+-+-+-+-+-+-+

                          Figure 4: STT Flags

   The meanings of the flags are as follows:

   o  C: Checksum verified. Set if the checksum of the encapsulated
      packet has been verified by the sender.

   o  P: Checksum partial. Set if the checksum in the encapsulated
      packet has been computed only over the TCP/IP pseudoheader (or
      UDP/IP pseudoheader, if the encapsulated packet is UDP). This bit
      MUST be set if segmentation offload is used by the sender. Note
      that bit 0 and bit 1 cannot both be set in the same header.

   o  V: IP version. Set if the encapsulated packet is IPv4, not set if
      the packet is IPv6. See below for discussion of non-IP payloads.

   o  T: TCP payload. Set if the encapsulated packet is TCP.

   o  Bits 4 through 7 are reserved and MUST be zero on transmission and
      ignored on receipt.

   As noted above, several of these fields are present primarily to
   enable efficient processing of the packet when it is received at a
   tunnel endpoint. (For example, it's entirely possible to determine
   if the packet is IPv4 or IPv6 by looking at the Ethernet header -
   it's just more efficient not to have to do so.)

   The payload of the STT frame is an untagged Ethernet frame.
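
   As a non-normative illustration, the header layout of Figure 3 and
   the flags of Figure 4 can be packed and parsed as sketched below in
   Python. All names (SttHeader, pack_stt_header, unpack_stt_header,
   and the FLAG_* constants) are hypothetical. The sketch assumes
   network byte order and that bit 0 in the figures is the most
   significant bit of its field, as is conventional for such diagrams.

```python
import struct
from typing import NamedTuple

# Version, Flags, L4 Offset, Reserved, MSS, PCP/V/VLAN ID, Context ID, Padding
STT_HDR = struct.Struct("!BBBBHHQH")  # 18 bytes in total

FLAG_CSUM_VERIFIED = 0x80  # C, bit 0 (assumed MSB-first numbering)
FLAG_CSUM_PARTIAL = 0x40   # P, bit 1
FLAG_IPV4 = 0x20           # V, bit 2
FLAG_TCP = 0x10            # T, bit 3


class SttHeader(NamedTuple):
    version: int
    flags: int
    l4_offset: int
    mss: int
    pcp: int
    vlan_valid: bool
    vlan_id: int
    context_id: int


def pack_stt_header(h: SttHeader) -> bytes:
    """Serialize the header; Reserved and Padding are sent as zero."""
    if h.flags & FLAG_CSUM_VERIFIED and h.flags & FLAG_CSUM_PARTIAL:
        raise ValueError("C and P flags cannot both be set")
    tci = ((h.pcp & 0x7) << 13) | (int(h.vlan_valid) << 12) | (h.vlan_id & 0xFFF)
    return STT_HDR.pack(h.version, h.flags, h.l4_offset, 0,
                        h.mss, tci, h.context_id, 0)


def unpack_stt_header(data: bytes) -> SttHeader:
    """Parse the header; Reserved and Padding are ignored on receipt."""
    version, flags, l4_offset, _reserved, mss, tci, context_id, _padding = \
        STT_HDR.unpack_from(data)
    if version != 0:
        raise ValueError("unsupported STT version; frame must be discarded")
    return SttHeader(version, flags, l4_offset, mss,
                     tci >> 13, bool(tci & 0x1000), tci & 0xFFF, context_id)
```

   With this layout the STT frame header is 18 bytes long, so an
   encapsulated 14-byte Ethernet header ends at byte 32 and the inner
   IP header starts on a 32-bit boundary, which is the alignment
   property described in Section 2.4.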

   In the case where the Ethernet frame contains TCP/IP or UDP/IP as its
   payload, this encapsulated packet should be correctly formatted as if
   it were about to undergo unfragmented transmission (even though it
   will ultimately be segmented as part of the transmission process).
   This means it should have a correct TCP or UDP checksum (possibly
   "partial", as noted above), correct length fields for its
   unfragmented state, and correct IP header checksum (if IPv4).

   If the length of the payload to be encapsulated exceeds 64KB, or if
   the offset to the L4 header exceeds 255 bytes, then it will not be
   possible to offload the packet to the NIC for segmentation. In this
   case, the payload needs to be segmented and checksummed before being
   encapsulated in STT frames.

   Because there is no negotiation between end-points of an STT tunnel,
   only basic TSO capabilities should be assumed. For example, ECN
   (explicit congestion notification) support should not be assumed, so
   TSO should not be requested for packets requiring such support.
   Instead, such payloads should be segmented before being encapsulated
   in STT frames.

3.1.1.  Handling non-TCP/IP and non-UDP/IP payloads

   Note that the STT header does not have a general "protocol" field to
   allow the efficient processing of arbitrary payloads. The current
   version is designed to provide a virtual Ethernet link, and hence
   efficiently supports only Ethernet frames as the payload. The
   Ethernet header itself contains a protocol field, which then
   identifies the higher layer protocol, so it is straightforward to
   accommodate non-IP traffic. Note however that offloading support
   will not typically be available for traffic other than the following:
   TCP and UDP over IPv4 or IPv6, with a maximum of a single VLAN tag
   stored in the STT header.
   Other protocols will need to be
   appropriately formatted for direct transmission prior to
   encapsulation.

   It will be noted that the STT Frame header does contain fields that
   are intended to assist in efficient processing of IPv4 and IPv6
   packets. These fields MUST be set to zero on transmission and
   ignored on receipt for packets that are not being offloaded.

   The use of STT to carry payloads other than Ethernet is theoretically
   possible but is beyond the scope of this document.

3.2.  Usage of TCP Header by STT

   Figure 5 illustrates the usage of the TCP header by STT. This figure
   is essentially identical to that in [RFC0793] with the exception that
   we denote with an asterisk (*) two fields that are used by STT to
   convey something other than the information that is conveyed by TCP.
   Syntactically, STT segments look identical to TCP segments. However,
   STT tunnel endpoints treat the Sequence number and Acknowledgment
   number differently than TCP endpoints treat those fields.
   Furthermore, as noted above, there is no TCP state machine associated
   with an STT tunnel.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Sequence Number(*)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Acknowledgment Number(*)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 5: STT Segment Format

   The Destination port, assigned by IANA, is 7471.

   In order to allow correct reassembly of the STT frame, the source
   port MUST be constant for all segments of a single STT frame.

   As noted above (Section 2.5) the source port SHOULD be the same for
   all frames that belong to a single flow in the virtual network, e.g.
   a single TCP connection.

   Also, to encourage efficient distribution of traffic among multiple
   paths when ECMP is used, the method to calculate the source port
   should provide a random distribution of source port numbers. An
   example mechanism would be a random hash on ports and addresses of
   the TCP headers of the flow in the virtual network.

   The Sequence number and Acknowledgment number fields are re-purposed
   in a way that does not confuse NICs that expect them to be used in
   the conventional manner.
   The ACK field is used as a packet
   identifier for the purposes of fragmentation, equivalent in function
   to the Identification field of IPv4 or the IPv6 Fragment header: it
   MUST be constant for all STT segments of a given frame, and different
   from any value used recently for other STT frames sent over this
   tunnel. ("Recent" in this context means a long enough interval that
   packets from the frame that last used this value of the ACK field
   should have all been delivered. Similar considerations apply to the
   reuse of the IP Fragment Identifier, discussed in [RFC6864], but note
   that packet lifetimes in a data center are likely to be relatively
   short.)

   The upper 16 bits of the SEQ field are used to convey the length
   of the STT frame in bytes. The lower 16 bits of the SEQ field are
   used to convey the offset (in bytes) of the current fragment within
   the larger STT frame. The task of updating the SEQ field on each
   transmitted segment is the responsibility of the NIC.

   Reassembly of the fragments may be done partially by NICs that
   perform LRO, since the sequence numbers of frames will increment
   appropriately. That is, the upper 16 bits don't change, and the
   lower 16 bits increment by N for every N byte segment that is
   transmitted, just as would be the case if an actual sequence number
   were being sent. Note that the size limit of an STT frame ensures
   that sequence numbers cannot wrap while sending the segments of a
   single STT frame.

   Many NICs, when performing LRO, will only merge packets with the same
   ACK value. In the event that a NIC does not require the ACK field to
   be constant when merging received packets, LRO MUST be disabled for
   this NIC when using STT. In this case, STT frame reassembly will be
   the responsibility of the software on the receiving host.
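
   As a non-normative illustration of the SEQ and ACK usage described
   above, the following Python sketch emulates in software what a
   TSO-capable NIC does on transmit and what a receiver does when it
   reassembles a frame without LRO. The function names and the
   (SEQ, ACK, payload) tuple representation are assumptions made for
   this example, and the reassembly routine assumes segments of a
   single frame with no duplicates or overlaps.

```python
from typing import Iterable, Iterator, Optional, Tuple

Segment = Tuple[int, int, bytes]  # (SEQ field, ACK field, payload)

MAX_FRAME_LEN = 0xFFFF  # the length must fit in the upper 16 bits of SEQ


def seq_field(frame_len: int, offset: int) -> int:
    """Upper 16 bits: STT frame length; lower 16 bits: fragment offset."""
    return (frame_len << 16) | offset


def split_seq(seq: int) -> Tuple[int, int]:
    """Return (frame_len, offset) recovered from a received SEQ field."""
    return seq >> 16, seq & 0xFFFF


def segment(frame: bytes, frame_id: int, mss: int) -> Iterator[Segment]:
    """Split a frame into MSS-sized segments sharing one ACK value."""
    if not 0 < len(frame) <= MAX_FRAME_LEN:
        raise ValueError("STT frame length must be 1..65535 bytes")
    for offset in range(0, len(frame), mss):
        yield seq_field(len(frame), offset), frame_id, frame[offset:offset + mss]


def reassemble(segments: Iterable[Segment]) -> Optional[bytes]:
    """Rebuild one frame from its segments; None if bytes are missing."""
    buf = None
    frame_id = None
    received = 0
    for seq, ack, payload in segments:
        frame_len, offset = split_seq(seq)
        if buf is None:
            buf, frame_id = bytearray(frame_len), ack
        if (ack != frame_id or frame_len != len(buf)
                or offset + len(payload) > frame_len):
            raise ValueError("segment does not belong to this frame")
        buf[offset:offset + len(payload)] = payload
        received += len(payload)
    if buf is None or received != len(buf):
        return None
    return bytes(buf)
```

   Because the lower 16 bits advance by exactly the payload size of
   each segment, the SEQ values look like an ordinary TCP sequence to a
   NIC, and the largest possible value (length 65535, offset below
   65535) still fits in 32 bits, so no wrap can occur within a frame.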

   All the fields after ACK have their conventional meaning, although
   nothing will be done with the Window or Urgent pointer values. Those
   fields SHOULD be zero on transmit and ignored on receipt. It is
   RECOMMENDED that the PSH (Push) flag be set when transmitting the
   last segment of a frame in order to cause data to be delivered by the
   NIC without waiting for other fragments. The ACK flag SHOULD be set
   to ensure that a receiving NIC examines the ACK field. All other
   flags SHOULD be zero on transmit and ignored on receipt.

3.3.  Encapsulation of STT Segments in IP

   From the perspective of IP, an STT segment is just like any other TCP
   segment. The protocol number (IPv4) or Next Header (IPv6) has the
   value 6, as for regular TCP. The resulting IP datagram is then
   encapsulated in the appropriate L2 header (e.g. Ethernet) for
   transmission on the physical medium.

3.3.1.  Diffserv and ECN-Marking

   When traffic is encapsulated in a tunnel header, there are numerous
   options as to how the Diffserv Code-Point (DSCP) and ECN markings are
   set in the outer header and propagated to the inner header on
   decapsulation.

   [RFC2983] defines two modes for mapping the DSCP markings from inner
   to outer headers and vice versa. The Uniform model copies the inner
   DSCP marking to the outer header on tunnel ingress, and copies that
   outer header value back to the inner header at tunnel egress. The
   Pipe model sets the DSCP value to some value based on local policy at
   ingress and does not modify the inner header on egress. Both models
   SHOULD be supported by STT endpoints. However, there is an
   additional complexity with the uniform model for STT, because a
   single IP datagram that is transmitted over the tunnel appears as
   multiple IP datagrams on the wire. Thus it is not guaranteed that
   all segments of the STT frame will have the same DSCP at egress.
   If Uniform model behavior is configured, it is RECOMMENDED that the
   DSCP of the first segment of the STT frame be used to set the DSCP
   value of the IP header in the decapsulated STT frame.

   [RFC6040] describes the correct ECN behavior for any type of
   IP-in-IP tunnel, and this behavior SHOULD be followed for STT
   tunnels. As with the Uniform Diffserv tunnel model, the fact that one
   inner IP datagram is segmented into multiple outer datagrams makes
   the situation slightly more complex. It is RECOMMENDED that if any
   segment of the received STT frame has the CE (congestion experienced)
   bit set in its IP header, the CE bit be set in the IP header of the
   decapsulated STT frame.

3.3.2. Packet Loss

   Individual IP datagrams may be dropped (most often due to congestion)
   and, since there is no acknowledgment or reliable delivery of these
   datagrams, there is the potential to corrupt an entire STT frame due
   to the loss of a single IP datagram. The negative consequences of
   such partial losses have been known for many years (see, for example,
   [KM87]). Fortunately, there are solutions to this problem in the
   case where the higher layer protocol running over STT is TCP. An STT
   receiving endpoint running in an end-system, as shown in Figure 1 for
   example, is not required to deliver complete STT frames to the TCP
   stack in the receiving VM. A partial frame payload can be delivered
   and the receiving TCP stack can deal with the missing bytes just as
   it would if running directly over a physical network. That is, TCP
   in the VM can send ACKs for the contiguous bytes received to trigger
   retransmission of the missing bytes by the sender. This is similar
   to the operation of LRO in current NICs.
   There are some subtleties to making this work correctly in the STT
   context, and it does depend on the STT endpoint being aware of the
   higher layer protocols consuming data in the VM to which it is
   connected. The main point of this discussion is that, in the common
   deployment of STT running in a virtual switch, the potential harm of
   losing individual packets is not as serious as it might first appear.

3.4. Broadcast and Multicast

   It is possible to establish point-to-multipoint STT tunnels by using
   an IP multicast address as the destination address of the tunnel.
   These may be used for broadcast or multicast traffic if the
   underlying physical network supports IP multicast. Control
   mechanisms for setting up such multicast groups are beyond the scope
   of this document. It is worth repeating that, despite the syntactic
   resemblance between the STT segment header and the TCP header, there
   is no TCP state machine associated with an STT tunnel, so the
   traditional issues of combining multicast with TCP (or reliable
   transports more generally) do not arise.

4. Interoperability Issues

   An STT packet on the wire appears exactly the same as a TCP packet,
   but the processing of an STT packet on reception is entirely
   different from TCP: there is no three-way handshake to establish a
   connection, no ACKs, no retransmission, etc. Hence, an STT tunnel
   endpoint needs to be configured to behave in the correct manner
   rather than to perform standard TCP processing on the packet. The
   primary way to recognize an STT segment is the destination port
   number in the TCP header. In the event that an STT packet is
   inadvertently delivered to a device that is not configured to behave
   as an STT tunnel endpoint, no TCP connection will be established and
   STT packets will be dropped.

   Being stateless, STT does not provide any sort of congestion control.
   In this sense it is equivalent to other tunneling protocols such as
   GRE. The assumption is that congestion control, if required, is
   provided by higher layers (e.g., a real TCP connection generating the
   payloads of STT frames), just as in any other tunneling protocol.

   STT deployments are almost entirely limited at present to intra-data
   center environments. In these environments, STT tunnels between
   pairs of endpoints are typically created by some sort of network
   virtualization controller. STT packets should therefore remain
   within the perimeter of the overlay that is managed by that
   controller. In the event of some misconfiguration or erroneous
   controller behavior, STT packets could be sent outside of this
   controlled domain into the broader Internet. As noted above, any
   endpoint that is not expecting STT packets will drop them, as they
   will appear to belong to an unestablished TCP session. Many
   firewalls are also likely to drop erroneously sent STT packets for
   the same reason.

   Within a network virtualization overlay, there may be middle boxes
   (e.g., firewalls) that process TCP. It is likely that, in the near
   term at least, such devices will drop STT packets, as there will be
   no TCP connection state established. This could prevent the correct
   operation of the overlay. This is clearly undesirable, but it is a
   general issue with any form of tunneling: many middle boxes, by
   their nature, will not permit tunnels to pass through them.
   Hence the best solution is simply to avoid deploying middle boxes at
   locations where STT tunnels (or other forms of tunnels for network
   virtualization) will need to pass through them. This will not,
   however, always be feasible, especially when virtualized networks
   extend among multiple data centers.
   Another solution is to configure the middle boxes to permit TCP
   packets to pass through when the port number matches the port
   assigned for STT. In this case the middle boxes would have to permit
   the packets to pass in spite of the lack of an established TCP
   connection and the repurposing of the SEQ and ACK fields.

   In the longer term, we might reasonably expect that middle boxes
   would be able to recognize STT traffic, and to terminate and
   originate STT tunnels if necessary (e.g., to perform functions that
   require the STT payload to be inspected, such as stateful
   firewalling).

   It is also possible to provide all the functionality of STT using a
   different IP protocol number (or next header value in IPv6). This
   approach could make sense in the long run but will typically not
   enable current NIC hardware to be leveraged for TSO and LRO
   functions. An alternative approach is to move to a UDP-based
   encapsulation such as Geneve [I-D.ietf-nvo3-geneve]. This, too,
   requires NICs to evolve to support TSO and LRO on tunneled traffic.

   It is also possible to run STT traffic over other forms of tunnel
   (GRE, IPsec, etc.), in which case the STT traffic can pass through
   appropriately configured middle boxes.

5. IANA Considerations

   IANA has allocated TCP port 7471 for STT. This document makes no
   further request of IANA.

6. Security Considerations

   In the physical network, STT packets are simply IP datagrams, and do
   not introduce new security issues. Most standard IP security
   mechanisms (such as IPsec encryption or authentication) can be
   implemented on STT packets if desired. As noted above, however,
   tunneling generally interacts poorly with middle boxes, and STT is no
   exception.
   Devices such as firewalls are likely to drop STT traffic
   unless the capability to recognize STT packets is implemented, or
   unless the STT traffic is itself run over some sort of tunnel that
   the firewall is configured to permit. Intrusion detection systems
   would similarly need to be enhanced to be able to look inside STT
   packets.

   It should also be noted that while STT packets resemble TCP segments,
   the lack of a TCP state machine means that TCP-related security
   issues (e.g., SYN-flooding) do not apply. Similarly, some of the
   benefits of the TCP state machine (e.g., the ability to discard
   packets with unexpected sequence numbers) are also absent for STT
   traffic.

   More general issues of security related to network virtualization
   overlays are described in [I-D.ietf-nvo3-security-requirements].

7. Contributors

   The following individuals contributed to this document:

   Brad McConnell
   Rackspace
   5000 Walzem Road
   San Antonio, TX 78218
   Email: bmcconne@rackspace.com

   JC Martin
   eBay
   2145 Hamilton Ave.
   San Jose, CA 95125
   Email: jcmartin@ebaysf.com

   Iben Rodriguez
   eBay
   2477 Woodland Ave
   San Jose, CA 95128
   Email: Iben.rodriguez@gmail.com

   Ilango Ganga
   Intel Corporation
   2200 Mission College Blvd.
   Santa Clara, CA 95054
   Email: ilango.s.ganga@intel.com

   Igor Gashinsky
   Yahoo!
   111 West 40th Street
   New York, NY 10018
   Email: igor@yahoo-inc.com

8. Acknowledgements

   We thank Martin Casado for inspiring this work and making all the
   introductions, and Ben Pfaff for his explanations of the
   implementation. Thanks also to Pierre Ettori, Yukio Ogawa, Koichiro
   Seto, Erik Nordmark, Michael Orr and Aibing Zhou for their helpful
   comments.

9. References

9.1. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

9.2. Informative References

   [I-D.ietf-nvo3-geneve]
              Gross, J. and I. Ganga, "Geneve: Generic Network
              Virtualization Encapsulation", draft-ietf-nvo3-geneve-01
              (work in progress), January 2016.

   [I-D.ietf-nvo3-security-requirements]
              Hartman, S., Zhang, D., Wasserman, M., Qiang, Z., and M.
              Zhang, "Security Requirements of NVO3",
              draft-ietf-nvo3-security-requirements-06 (work in
              progress), December 2015.

   [KM87]     Kent, C. and J. Mogul, "Fragmentation Considered Harmful",
              Proc. ACM SIGCOMM 1987, August 1987.

   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
              DOI 10.17487/RFC2784, March 2000.

   [RFC2983]  Black, D., "Differentiated Services and Tunnels",
              RFC 2983, DOI 10.17487/RFC2983, October 2000.

   [RFC3931]  Lau, J., Ed., Townsley, M., Ed., and I. Goyret, Ed.,
              "Layer Two Tunneling Protocol - Version 3 (L2TPv3)",
              RFC 3931, DOI 10.17487/RFC3931, March 2005.

   [RFC6040]  Briscoe, B., "Tunnelling of Explicit Congestion
              Notification", RFC 6040, DOI 10.17487/RFC6040, November
              2010.

   [RFC6864]  Touch, J., "Updated Specification of the IPv4 ID Field",
              RFC 6864, DOI 10.17487/RFC6864, February 2013.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
              eXtensible Local Area Network (VXLAN): A Framework for
              Overlaying Virtualized Layer 2 Networks over Layer 3
              Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

   [RFC7364]  Narten, T., Ed., Gray, E., Ed., Black, D., Fang, L.,
              Kreeger, L., and M. Napierala, "Problem Statement:
              Overlays for Network Virtualization", RFC 7364,
              DOI 10.17487/RFC7364, October 2014.

   [RFC7637]  Garg, P., Ed. and Y. Wang, Ed., "NVGRE: Network
              Virtualization Using Generic Routing Encapsulation",
              RFC 7637, DOI 10.17487/RFC7637, September 2015.

   [VL2]      Greenberg, A., "VL2: A Scalable and Flexible Data Center
              Network", Proc. ACM SIGCOMM 2009, August 2009.

Authors' Addresses

   Bruce Davie (editor)
   VMware, Inc.
   3401 Hillview Ave.
   Palo Alto, CA 94304
   USA

   Email: bdavie@vmware.com

   Jesse Gross
   VMware, Inc.
   3401 Hillview Ave.
   Palo Alto, CA 94304
   USA

   Email: jgross@vmware.com