Network Working Group                                      B. Davie, Ed.
Internet-Draft                                                  J. Gross
Intended status: Informational                              VMware, Inc.
Expires: March 14, 2014                               September 10, 2013

  A Stateless Transport Tunneling Protocol for Network Virtualization
                                 (STT)
                           draft-davie-stt-04

Abstract

   Network Virtualization places unique requirements on tunneling
   protocols. This draft describes STT (Stateless Transport Tunneling),
   a tunnel encapsulation that enables overlay networks to be built in
   virtualized networks.
   STT is particularly useful when some tunnel endpoints are in
   end-systems, as it utilizes the capabilities of the network
   interface card to improve performance.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 14, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
     1.2.  Terminology
     1.3.  Reference Model
   2.  Design Rationale
     2.1.  Segmentation Offload
     2.2.  Metadata
     2.3.  Context Information
     2.4.  Alignment
     2.5.  Equal Cost Multipath
     2.6.  Efficient Software Processing
   3.  Frame Formats
     3.1.  STT Frame Format
       3.1.1.  Handling non-IP payloads
     3.2.  Usage of TCP Header by STT
     3.3.  Encapsulation of STT Segments in IP
       3.3.1.  Diffserv and ECN-Marking
       3.3.2.  Packet Loss
     3.4.  Broadcast and Multicast
   4.  Interoperability Issues
   5.  IANA Considerations
   6.  Security Considerations
   7.  Contributors
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
     9.2.  Informative References
   Authors' Addresses

1.  Introduction

   Network Virtualization places unique requirements on tunneling
   protocols. The utility of tunneling in virtualized data centers has
   been described elsewhere; see, for example
   [I-D.narten-nvo3-overlay-problem-statement], [VL2],
   [I-D.mahalingam-dutt-dcops-vxlan],
   [I-D.sridharan-virtualization-nvgre].
   Tunneling allows a virtual overlay topology to be constructed on top
   of the physical data center network, and provides benefits such as:

   o  Ability to manage overlapping addresses between multiple tenants

   o  Decoupling of the virtual topology provided by the tunnels from
      the physical topology of the network

   o  Support for virtual machine mobility independent of the physical
      network

   o  Support for essentially unlimited numbers of virtual networks (in
      contrast to VLANs, for example)

   o  Decoupling of the network service provided to servers from the
      technology used in the physical network (e.g. providing an L2
      service over an L3 fabric)

   o  Isolating the physical network from the addressing of the virtual
      networks, thus avoiding issues such as MAC table size in physical
      switches.

   This draft describes STT (Stateless Transport Tunneling), a tunnel
   encapsulation that enables overlay networks to be built in
   virtualized data center networks, providing the benefits outlined
   above. STT is particularly useful when some tunnel endpoints are in
   end-systems, as it utilizes the capabilities of standard network
   interface cards to improve performance. STT is an IP-based
   encapsulation and utilizes a TCP-like header inside the IP header.
   It is, however, stateless, i.e., there is no TCP connection state of
   any kind associated with the tunnel. The TCP-like header is used for
   pragmatic reasons, to leverage the capabilities of existing network
   interface cards, but should not be interpreted as implying any sort
   of connection state between endpoints.

   STT is typically used to carry Ethernet frames between tunnel
   endpoints. These frames may be considerably larger than the MTU of
   the physical network - up to 64KB.
   Fields in the tunnel header are used to allow these large frames to
   be segmented at the entrance to the tunnel according to the MTU of
   the physical network and subsequently reassembled at the far end of
   the tunnel.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.2.  Terminology

   The following terms are used in this document:

   Stateless Transport Tunneling (STT). The tunneling mechanism defined
   in this document. The name derives from the fact that the tunnel
   header resembles the TCP/IP headers (hence "transport" tunneling)
   while "stateless" refers to the fact that none of the normal TCP
   state (connection state, send and receive windows, congestion state
   etc.) is associated with the tunnel (as would be required if an
   actual TCP connection were used for tunneling).

   STT Frame. The unit of data that is passed into the tunnel prior to
   segmentation and encapsulation. This frame typically consists of an
   Ethernet frame and an STT Frame header. These frames may be up to
   64KB in size.

   STT Segment. The unit of data that is transmitted on the underlay
   network over which the tunnel operates. An STT segment has headers
   that are syntactically the same as the TCP/IP headers, and typically
   contains part of an STT frame as the payload. These segments must
   fit within the MTU of the physical network.

   Context ID. A 64-bit field in the STT frame header that conveys
   information about the disposition of the STT frame between the tunnel
   endpoints. One example use of the Context ID is to direct delivery
   of the STT frame payload to the appropriate virtual network or
   virtual machine.

   MSS. Maximum Segment Size. The maximum number of bytes that can be
   sent in one TCP segment [RFC0793].

   NIC. Network Interface Card.

   TSO. TCP Segmentation Offload. A function provided by many
   commercial NICs that allows large data units to be passed to the NIC,
   the NIC being responsible for creating MSS-sized segments with
   correct TCP/IP headers.

   LRO. Large Receive Offload. The receive-side equivalent function of
   TSO, in which multiple TCP segments are coalesced into larger data
   units.

   VM. Virtual Machine.

1.3.  Reference Model

   Our conceptual model for a virtualized network is shown in Figure 1.
   STT tunnels extend in this figure from one virtual switch to another,
   providing a virtual link between the switches over some arbitrary
   underlay. More generally, STT tunnels operate between a pair of
   tunnel endpoints; these endpoints may be virtual switches, physical
   switches, or some other device (e.g. an appliance). The STT tunnel
   provides a virtual point-to-point Ethernet link between the
   endpoints. Frames are handed to the tunnel by some entity (e.g. a VM
   that is connected to a virtual switch in this picture) and first
   encapsulated with an STT Frame header. STT Frames may then be
   fragmented in the NIC, and are encapsulated with a tunnel header (the
   STT segment header) for transmission over the underlay. Note that
   other models are possible, e.g., where one or both tunnel endpoints
   are implemented in a physical switch. In such cases the tunnel
   endpoint may forward packets to and from another link (physical or
   virtual) rather than to a VM.

   +----------------------+                +----------------------+
   | +--+   +-------+---+ |                | +---+-------+   +--+ |
   | |VM|---|       |   | |                | |   |       |---|VM| |
   | +--+   |Virtual|NIC|---   Underlay   ---|NIC|Virtual|   +--+ |
   | +--+   |Switch |   | |    Network     | |   |Switch |   +--+ |
   | |VM|---|       |   | |                | |   |       |---|VM| |
   | +--+   +-------+---+ |                | +---+-------+   +--+ |
   +----------------------+                +----------------------+

                 ()===============================()
                        Switch-Switch tunnel

                    Figure 1: STT Reference Model

2.  Design Rationale

   We take as given the need for some form of tunneling to support the
   virtualization of the network as described in Section 1. One might
   reasonably ask whether some existing tunneling protocol such as GRE
   [RFC2784] or L2TPv3 [RFC3931] might suffice. In fact,
   [I-D.sridharan-virtualization-nvgre] does just that, using GRE. The
   primary motivation for STT as opposed to one of the existing
   tunneling methods is to improve the performance of data transfers
   from hosts that implement tunnel endpoints. We expand on this
   rationale below.

2.1.  Segmentation Offload

   A large percentage of network interface cards (NICs) in use today are
   able to perform TCP segmentation offload (TSO). When a NIC supports
   TSO, the host hands a large (greater than 1 TCP MSS) frame of data to
   the NIC along with a set of metadata which includes, among other
   things, the desired MSS, and various fields needed to complete the
   TCP header. The NIC fragments the frame into MSS-sized segments,
   performs the TCP Checksum operation, and applies the appropriate
   headers (TCP, IP and MAC) to each segment.

   On the receive side, some NICs support the reassembly of TCP
   segments, a function referred to as large receive offload (LRO). In
   this case, NICs attempt to reassemble TCP segments and pass larger
   aggregates of data to the host.
   (Since TCP's service model is a byte stream, there is no higher
   level frame for the NIC to reassemble, but it can pass chunks of the
   stream larger than one MSS to the host. Full reassembly of STT
   frames is handled in the host.) The benefits to the host include
   fewer per-packet operations and larger data transfers between host
   and NIC, which amortizes the per-transfer cost (such as interrupt
   processing) more efficiently. These savings can translate into
   significant performance gains for data transfer from the host to the
   network.

   STT is explicitly designed to leverage the TSO capabilities of
   currently available NICs. While one might think of segmentation as a
   generic function, the majority of NICs are designed specifically to
   support TCP segmentation offload, as the details of the segmentation
   function are highly dependent on the specifics of TCP. In order to
   leverage such capability, therefore, the STT segment header is
   syntactically identical to a valid TCP header. However, we use some
   of the fields in the TCP header (specifically, sequence number and
   ACK number) to support the objectives of STT. The details are
   described in Section 3.2. In essence, we need the same set of
   information that IP datagrams carry when IP fragmentation takes
   place: a unique identifier for the frame that has been fragmented, an
   offset into that frame for the current fragment, and the length of
   the frame to be reassembled. We fit these fields into the TCP header
   fields traditionally used for the SEQ and ACK numbers. STT segments
   are transmitted as IP datagrams using the TCP protocol number (6).
   The primary means to recognize STT segments is the destination port
   number. We discuss the interoperability impact of these design
   choices in Section 4.
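
   As an illustration only, the following sketch shows in software
   what a TSO-capable NIC would otherwise do in hardware: it splits one
   STT frame into MSS-sized pieces and derives the SEQ and ACK values
   for each piece using the field layout given in Section 3.2 (frame
   length in the upper 16 bits of SEQ, byte offset in the lower 16
   bits, frame identifier in ACK). The names SttSegment and
   segment_frame are hypothetical and are not defined by this document;
   header construction and checksums are omitted.

```python
from typing import Iterator, NamedTuple


class SttSegment(NamedTuple):
    """Hypothetical container for the per-segment values discussed above."""
    seq: int        # upper 16 bits: STT frame length, lower 16 bits: offset
    ack: int        # frame identifier, constant for all segments of a frame
    payload: bytes  # slice of the STT frame carried by this segment


def segment_frame(frame: bytes, frame_id: int, mss: int) -> Iterator[SttSegment]:
    """Split one STT frame into MSS-sized segments (software stand-in for TSO)."""
    if not 0 < len(frame) <= 0xFFFF:
        raise ValueError("STT frame length must fit in 16 bits")
    if mss <= 0:
        raise ValueError("MSS must be positive")
    for offset in range(0, len(frame), mss):
        # The length stays fixed while the offset advances by the bytes sent,
        # so the value behaves like an ordinary TCP sequence number.
        seq = (len(frame) << 16) | offset
        yield SttSegment(seq=seq,
                         ack=frame_id & 0xFFFFFFFF,
                         payload=frame[offset:offset + mss])
```

   For example, a 3000-byte frame sent with an MSS of 1400 yields three
   segments whose offsets are 0, 1400 and 2800, all carrying the same
   length and frame identifier.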

   The net effect of using TSO is that the frame size that is sent by
   endpoints in the virtualized network can be much larger than the MTU
   of the underlying physical network. The primary benefit of this is a
   significant performance gain when large amounts of data are being
   transferred between nodes in the virtual network. A secondary effect
   is that the header of the STT frame is amortized across a larger
   amount of data, reducing the need to shrink the STT frame header to
   minimum size.

   Note that, while segmentation offload is the primary NIC function
   that STT takes advantage of, other NIC offload functions such as
   checksum calculation can also be leveraged.

2.2.  Metadata

   When a frame is delivered to a NIC that supports TSO for
   segmentation and transmission, a certain amount of metadata is
   typically passed along with it. This includes the MSS and
   potentially a VLAN tag to be applied to the transmitted packets.

   In some virtualized network deployments, an STT frame may traverse a
   tunnel, be received and reassembled at an STT endpoint, and then be
   sent on another physical interface. In such cases, the tunnel
   terminating endpoint may need to pass metadata to a NIC to enable
   transmission of frames on the physical link. For this reason,
   appropriate metadata is carried in the STT frame header.

2.3.  Context Information

   When an STT Frame is received by a tunnel endpoint, it needs to be
   directed to the appropriate entity in the virtualized network to
   which it belongs. For this reason, a Context ID is required in the
   STT frame header. Some other encapsulations (e.g.
   [I-D.mahalingam-dutt-dcops-vxlan],
   [I-D.sridharan-virtualization-nvgre]) use an explicit tenant network
   identifier or virtual network identifier. The Context Identifier can
   be thought of as a generalized form of virtual network identifier.
   Using a larger and more general identifier allows for a broader range
   of service models and allows ample room for future expansion. There
   is little downside to using a larger field here because it is
   amortized across the entire STT Frame rather than being present in
   each packet.

2.4.  Alignment

   Software implementations of tunnel endpoints benefit from 32-bit
   alignment of the data to be manipulated. Because the Ethernet header
   is not a multiple of 32 bits (it is 14 bytes), 2 bytes of padding are
   added to the STT header, causing the payload beyond the encapsulated
   Ethernet header, which typically includes the IP header of the
   encapsulated frame, to be 32-bit aligned.

2.5.  Equal Cost Multipath

   It is essential that traffic passing through the physical network can
   be efficiently distributed across multiple paths. Standard equal
   cost multipath (ECMP) techniques involve hashing on address and port
   numbers in the outer protocol headers. There are two main issues to
   address with ECMP. First, it is important that, when a set of
   packets belong to a single flow (e.g. a TCP connection in the virtual
   network), all those packets should follow the same path. Second, all
   paths should be used efficiently, i.e. there needs to be sufficient
   entropy among the different flows to ensure they get distributed
   evenly across multiple paths.

   STT achieves the first goal by ensuring that the source and
   destination ports and addresses in the outer header are all the same
   for a single flow. The second goal is achieved by generating the
   source port using a random hash of fields in the headers of the inner
   packets, e.g. the ports and addresses of the virtual flow's packets.
   We provide more details on the usage of port numbers in Section 3.2.

2.6.  Efficient Software Processing

   The design of STT is largely motivated by the desire to tunnel
   packets efficiently between virtual switches running in software. In
   addition to the points noted above, this leads to some design
   optimizations to simplify processing of packets, such as the use of
   an "L4 offset" field in the STT header to enable the payload to be
   located quickly without extensive header parsing.

3.  Frame Formats

   STT encapsulates data payloads of up to 64KB (limited by the length
   field in the STT header, described below). Those frames are then
   segmented (depending on the MTU of the underlying physical network)
   and the resulting segments are encapsulated in a standard TCP header,
   which in turn is encapsulated by an IP header and finally a MAC
   header. This is illustrated in Figure 2.

                        +-----------+  +----------+       +----------+
                        | IP Header |  |IP Header |       |IP header |
   +-----------+        +-----------+  +----------+       +----------+
   |STT Frame  |        |TCP-like   |  |TCP-like  |       |TCP-like  |
   | Header    |        | header    |  | header   |       | header   |
   +-----------+        +-----------+  +----------+       +----------+
   |           |  --->  | STT Frame |  |Next part |  ...  |Last part |
   |Payload    |        | Header    |  |of Payload|       |of Payload|
   .           .        +-----------+  |          |       |          |
   .           .        |           |  |          |       |          |
   .           .        | Start of  |  |          |       |          |
   +-----------+        | Payload   |  |          |       +----------+
                        +-----------+  +----------+

   Original data        STT Frame is segmented and transmitted as
   frame is encapped    a set of TCP segments (MAC
   with STT Header      headers not shown)

           Figure 2: STT Frame Fragments and Encapsulation

   The STT Frame header and the usage of the TCP-like header are
   described in detail below. The TCP segments shown in Figure 2 are of
   course further encapsulated as IP datagrams, and may be sent as
   either IPv4 or IPv6. The resulting IP datagrams are then transmitted
   in the appropriate MAC level frame (e.g.
   Ethernet, not shown in the figure) for the underlying physical
   network over which the tunnels are established.

3.1.  STT Frame Format

   Figure 3 illustrates the header of an STT frame before it is
   segmented.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Version      | Flags         |  L4 Offset    |  Reserved     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Max. Segment Size          | PCP |V|     VLAN ID           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                     Context ID (64 bits)                      +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     Padding                   |    data                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   |                                                               |

                     Figure 3: STT Frame Format

   The STT frame header contains the following fields:

   o  Version - currently 0.

   o  Flags - describes encapsulated packet, see below.

   o  L4 offset - offset in bytes from the end of the STT Frame header
      to the start of the encapsulated layer 4 (TCP/UDP) header.

   o  Reserved field - MUST be zero on transmission and ignored on
      receipt.

   o  Max Segment Size - the TCP MSS that should be used by a tunnel
      endpoint that is transmitting this frame onto another network.

   o  PCP - the 3-bit Priority Code Point field that should be applied
      to this packet by an STT tunnel endpoint on transmission to
      another network (see Section 2.2).

   o  V - a one bit flag that, if set, indicates the presence of a valid
      VLAN ID in the following field and valid PCP in the preceding
      field.

   o  VLAN ID - 12-bit VLAN tag that should be applied to this packet by
      an STT tunnel endpoint on transmission to another network (see
      Section 2.2).

   o  Context ID - 64 bits of context information, described in detail
      in Section 2.3.

   o  Padding - 16 bits as described above.

   The flags field contains:

   o  0: Checksum verified.
      Set if the checksum of the encapsulated packet has been verified
      by the sender.

   o  1: Checksum partial. Set if the checksum in the encapsulated
      packet has been computed only over the TCP/IP header. This bit
      MUST be set if TSO is used by the sender. Note that bit 0 and bit
      1 cannot both be set in the same header.

   o  2: IP version. Set if the encapsulated packet is IPv4, not set if
      the packet is IPv6. See below for discussion of non-IP payloads.

   o  3: TCP payload. Set if the encapsulated packet is TCP.

   o  4-7: Unused, MUST be zero on transmission and ignored on receipt.

   As noted above, several of these fields are present primarily to
   enable efficient processing of the packet when it is received at a
   tunnel endpoint. (For example, it's entirely possible to determine
   if the packet is IPv4 or IPv6 by looking at the Ethernet header -
   it's just more efficient not to have to do so.)

   The payload of the STT frame is an untagged Ethernet frame.

3.1.1.  Handling non-IP payloads

   Note that the STT header does not have a general "protocol" field to
   allow the efficient processing of arbitrary payloads. The current
   version is designed to provide a virtual Ethernet link, and hence
   efficiently supports only Ethernet frames as the payload. The
   Ethernet header itself contains a protocol field, which then
   identifies the higher layer protocol, so it is straightforward to
   accommodate non-IP traffic.

   It will be noted that the STT Frame header does contain fields that
   are intended to assist in efficient processing of IPv4 and IPv6
   packets. These fields MUST be set to zero and ignored on receipt for
   non-IP payloads.

   The use of STT to carry payloads other than Ethernet is theoretically
   possible but is beyond the scope of this document.

3.2.  Usage of TCP Header by STT

   Figure 4 illustrates the usage of the TCP header by STT.
   This figure is essentially identical to that in [RFC0793] with the
   exception that we denote with an asterisk (*) two fields that are
   used by STT to convey something other than the information that is
   conveyed by TCP. Syntactically, STT segments look identical to TCP
   segments. However, STT tunnel endpoints treat the Sequence number
   and Acknowledgment number differently than TCP endpoints treat those
   fields. Furthermore, as noted above, there is no TCP state machine
   associated with an STT tunnel.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Sequence Number(*)                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                   Acknowledgment Number(*)                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Figure 4: STT Segment Format

   The Destination port is to be requested from IANA, in the user range
   (1024-49151).

   In order to allow correct reassembly of the STT frame, the source
   port MUST be constant for all segments of a single STT frame.

   As noted above (Section 2.5) the source port SHOULD be the same for
   all frames that belong to a single flow in the virtual network, e.g.
   a single TCP connection.
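
   One way to obtain a source port that is stable for a flow is to hash
   the addresses and ports of the inner flow, as suggested in
   Section 2.5. The sketch below is an assumption-laden example rather
   than a required algorithm: the function name stt_source_port is
   hypothetical, CRC-32 is used only because it is readily available
   (this document does not name a hash function), and the result is
   mapped into the ephemeral range (49152-65535) that this section
   recommends for the source port.

```python
import zlib

EPHEMERAL_BASE = 49152                   # first port of the ephemeral range
EPHEMERAL_SIZE = 65536 - EPHEMERAL_BASE  # 16384 ports, up to 65535


def stt_source_port(src_ip: str, dst_ip: str,
                    src_port: int, dst_port: int, proto: int = 6) -> int:
    """Derive an outer source port from the inner flow's addresses and ports.

    The same inner flow always maps to the same port, so every segment of
    every frame in that flow presents the same outer headers to ECMP.
    CRC-32 is an illustrative choice, not one mandated by the draft.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return EPHEMERAL_BASE + (zlib.crc32(key) % EPHEMERAL_SIZE)
```

   Because the port depends only on the inner flow, it is also constant
   across the segments of any single STT frame, which satisfies the
   MUST above.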

   Also, to encourage efficient distribution of traffic among multiple
   paths when ECMP is used, the method to calculate the source port
   should provide a random distribution of source port numbers. An
   example mechanism would be a random hash on ports and addresses of
   the TCP headers of the flow in the virtual network.

   It is RECOMMENDED to use a source port number from the ephemeral
   range defined by IANA (49152-65535).

   The Sequence number and Acknowledgment number fields are re-purposed
   in a way that does not confuse NICs that expect them to be used in
   the conventional manner. The ACK field is used as a packet
   identifier for the purposes of fragmentation, equivalent in function
   to the Identification field of IPv4 or the IPv6 Fragment header: it
   MUST be constant for all STT segments of a given frame, and different
   from any value used recently for other STT frames sent over this
   tunnel.

   The upper 16 bits of the SEQ field are used to convey the length of
   the STT frame in bytes. The lower 16 bits of the SEQ field are used
   to convey the offset (in bytes) of the current fragment within the
   larger STT frame.

   Reassembly of the fragments may be done partially by NICs that
   perform LRO, since the sequence numbers of frames will increment
   appropriately. That is, the upper 16 bits don't change, and the
   lower 16 bits increment by N for every N byte segment that is
   transmitted, just as would be the case if an actual sequence number
   were being sent. Note that the size limit of an STT frame ensures
   that sequence numbers cannot wrap while sending the segments of a
   single STT frame.

   All the fields after ACK have their conventional meaning, although
   nothing will be done with the Window or Urgent pointer values. Those
   fields SHOULD be zero on transmit and ignored on receipt.
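
   The SEQ and ACK conventions above are enough for a receiver to
   rebuild a frame from segments that arrive in any order. The
   following is a deliberately simplified sketch: the class name
   Reassembler and its method are hypothetical, the tunnel is
   identified only by a source address string, and duplicate segments,
   timeouts and buffer limits are all ignored.

```python
from typing import Dict, Optional, Tuple


class Reassembler:
    """Rebuild STT frames from segments using the SEQ/ACK layout above."""

    def __init__(self) -> None:
        # (tunnel source, frame id) -> (frame buffer, bytes received so far)
        self._partial: Dict[Tuple[str, int], Tuple[bytearray, int]] = {}

    def add_segment(self, src: str, seq: int, ack: int,
                    payload: bytes) -> Optional[bytes]:
        """Store one segment; return the whole frame once it is complete."""
        frame_len = (seq >> 16) & 0xFFFF   # upper 16 bits: frame length
        offset = seq & 0xFFFF              # lower 16 bits: byte offset
        if offset + len(payload) > frame_len:
            raise ValueError("segment extends past the advertised length")
        key = (src, ack)                   # ACK acts as the frame identifier
        buf, received = self._partial.get(key, (bytearray(frame_len), 0))
        buf[offset:offset + len(payload)] = payload
        received += len(payload)           # assumes no duplicate segments
        if received >= frame_len:
            self._partial.pop(key, None)
            return bytes(buf)
        self._partial[key] = (buf, received)
        return None
```

   A production endpoint would also need to expire incomplete frames,
   since a lost segment otherwise leaves its buffer allocated
   indefinitely.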

   It is RECOMMENDED that the PSH (Push) flag be set when transmitting
   the last segment of a frame in order to cause data to be delivered by
   the NIC without waiting for other fragments. The ACK flag SHOULD be
   set to ensure that a receiving NIC passes the ACK field to the host
   to assist in reassembly. All other flags SHOULD be zero on transmit
   and ignored on receipt.

3.3.  Encapsulation of STT Segments in IP

   From the perspective of IP, an STT segment is just like any other TCP
   segment. The protocol number (IPv4) or Next Header (IPv6) has the
   value 6, as for regular TCP. The resulting IP datagram is then
   encapsulated in the appropriate L2 header (e.g. Ethernet) for
   transmission on the physical medium.

3.3.1.  Diffserv and ECN-Marking

   When traffic is encapsulated in a tunnel header, there are numerous
   options as to how the Diffserv Code-Point (DSCP) and ECN markings are
   set in the outer header and propagated to the inner header on
   decapsulation.

   [RFC2983] defines two modes for mapping the DSCP markings from inner
   to outer headers and vice versa. The Uniform model copies the inner
   DSCP marking to the outer header on tunnel ingress, and copies that
   outer header value back to the inner header at tunnel egress. The
   Pipe model sets the DSCP value to some value based on local policy at
   ingress and does not modify the inner header on egress. Both models
   SHOULD be supported by STT endpoints. However, there is an
   additional complexity with the uniform model for STT, because a
   single IP datagram that is transmitted over the tunnel appears as
   multiple IP datagrams on the wire. Thus it is not guaranteed that
   all segments of the STT frame will have the same DSCP at egress. If
   uniform model behavior is configured, it is RECOMMENDED that the DSCP
   of the first segment of the STT frame be used to set the DSCP value
   of the IP header in the decapsulated STT frame.
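
   The egress choice described above can be written down in a few
   lines. This is only a sketch: the function name egress_dscp and the
   model strings are hypothetical, and "first segment" is interpreted
   here as the segment with the lowest offset within the STT frame,
   which is an assumption on our part.

```python
from typing import Iterable, Tuple


def egress_dscp(model: str, inner_dscp: int,
                segments: Iterable[Tuple[int, int]]) -> int:
    """Pick the DSCP for the decapsulated frame's IP header.

    `segments` holds (offset, outer_dscp) pairs for the received segments
    of one STT frame, in any order.
    """
    if model == "pipe":
        # Pipe model: the inner header is left untouched at egress.
        return inner_dscp
    if model == "uniform":
        # Uniform model: copy the outer DSCP of the first segment, read
        # here as the segment with the lowest offset (an assumption).
        _, dscp = min(segments, key=lambda seg: seg[0])
        return dscp
    raise ValueError(f"unknown tunnel model: {model!r}")
```

   With segments [(1400, 0), (0, 46), (2800, 10)], the uniform model
   returns 46 regardless of arrival order, while the pipe model returns
   whatever value the inner header already carried.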

   [RFC6040] describes the correct ECN behavior for any type of IP in IP
   tunnel, and this behavior SHOULD be followed for STT tunnels. As
   with the Uniform Diffserv tunnel model, the fact that one inner IP
   datagram is segmented into multiple outer datagrams makes the
   situation slightly more complex. It is RECOMMENDED that if any
   segment of the received STT frame has the CE (congestion experienced)
   bit set in its IP header, then the CE bit SHOULD be set in the IP
   header of the decapsulated STT frame.

3.3.2.  Packet Loss

   Individual IP datagrams may be dropped (most often due to congestion)
   and, since there is no acknowledgment or reliable delivery of these
   datagrams, there is the potential to corrupt an entire STT Frame due
   to the loss of a single IP datagram. Fortunately, there are
   solutions to this problem in the case where the higher layer protocol
   running over STT is TCP. An STT receiving endpoint running in an
   end-system, as shown in Figure 1 for example, is not required to
   deliver complete STT frames to the TCP stack in the receiving VM. A
   partial frame payload can be delivered and the receiving TCP stack
   can deal with the missing bytes just as it would if running directly
   over a physical network. That is, TCP in the VM can send ACKs for
   the contiguous bytes received to trigger retransmission of the
   missing bytes by the sender. This is similar to the operation of LRO
   in current NICs. There are some subtleties to making this work
   correctly in the STT context, and it does depend on the STT endpoint
   being aware of the higher layer protocols consuming data in the VM to
   which it is connected. The main point of this discussion is that, in
   the common deployments of STT running in a virtual switch, the
   potential harm of losing individual packets is not as serious as it
   might first appear.

3.4.  Broadcast and Multicast

   It is possible to establish point-to-multipoint STT tunnels by using
   an IP multicast address as the destination address of the tunnel.
   These may be used for broadcast or multicast traffic if the
   underlying physical network supports IP multicast. Control
   mechanisms for setting up such multicast groups are beyond the scope
   of this document. It is worth repeating that, despite the syntactic
   resemblance between the STT segment header and the TCP header, there
   is no TCP state machine associated with an STT tunnel, so the
   traditional issues of combining multicast with TCP (or reliable
   transports more generally) do not arise.

4.  Interoperability Issues

   It will be noted that an STT packet on the wire appears exactly the
   same as a TCP packet, but that processing of an STT packet on
   reception is entirely different from TCP - no three-way handshake to
   establish a connection, no ACKs, retransmission, etc. Hence, an STT
   tunnel endpoint clearly needs to be configured to behave in the
   correct manner rather than to perform standard TCP processing on the
   packet. The primary way to recognize an STT segment is the
   destination port number in the TCP header. In the event that an STT
   packet is inadvertently delivered to a device that is not configured
   to behave as an STT tunnel endpoint, no TCP connection will be
   established and STT packets will be dropped.

   In the event that STT packets pass through middle boxes that process
   TCP, it is likely that (in the near term at least) they will be
   dropped, as there will be no TCP connection state established. This
   is clearly undesirable, but it is a general issue with any form of
   tunneling - the nature of many middle boxes is that they will not
   permit tunnels to pass through them.
   Hence the best solution is simply to avoid deploying middle boxes at
   locations where STT tunnels (or other forms of tunnels for network
   virtualization) will need to pass through them. This will not,
   however, always be feasible, especially when virtualized networks
   extend among multiple data centers. Other solutions include
   configuring the middle boxes to permit TCP packets to pass through
   when the port number matches the port assigned for STT.

   In the longer term, we might reasonably expect that middle boxes
   would be able to recognize STT traffic, and to terminate and
   originate STT tunnels if necessary (e.g. to perform functions that
   require the STT payload to be inspected, such as stateful
   firewalling).

   It is also of course possible to provide all the functionality of STT
   using a different IP protocol number (or next header value in IPv6).
   This approach makes sense in the long run but will typically not
   enable current NIC hardware to be leveraged for TSO and LRO
   functions.

   It is also possible to run STT traffic over other forms of tunnel
   (GRE, IPsec, etc.), in which case the STT traffic can pass through
   appropriately configured middle boxes.

5. IANA Considerations

   A TCP port in the user range (1024-49151) will be requested from
   IANA.

6. Security Considerations

   In the physical network, STT packets are simply IP datagrams, and do
   not introduce new security issues. Most standard IP security
   mechanisms (such as IPsec encryption or authentication) can be
   implemented on STT packets if desired. As noted above, however,
   tunneling generally interacts poorly with middle boxes, and STT is no
   exception. Devices such as firewalls are likely to drop STT traffic
   unless the capability to recognize STT packets is implemented, or
   unless the STT traffic is itself run over some sort of tunnel that
   the firewall is configured to permit.
   Intrusion detection systems would similarly need to be enhanced to be
   able to look inside STT packets.

   It should also be noted that while STT packets resemble TCP segments,
   the lack of a TCP state machine means that TCP-related security
   issues (e.g. SYN-flooding) do not apply. Similarly, some of the
   benefits of the TCP state machine (e.g. the ability to discard
   packets with unexpected sequence numbers) are also absent for STT
   traffic.

7. Contributors

   The following individuals contributed to this document:

   Brad McConnell
   Rackspace
   5000 Walzem Road
   San Antonio, TX 78218
   Email: bmcconne@rackspace.com

   JC Martin
   eBay
   2145 Hamilton Ave.
   San Jose, CA 95125
   Email: jcmartin@ebaysf.com

   Iben Rodriguez
   eBay
   2477 Woodland Ave
   San Jose, CA 95128
   Email: Iben.rodriguez@gmail.com

   Ilango Ganga
   Intel Corporation
   2200 Mission College Blvd.
   Santa Clara, CA - 95054
   Email: ilango.s.ganga@intel.com

   Igor Gashinsky
   Yahoo!
   111 West 40th Street
   New York, NY 10018
   Email: igor@yahoo-inc.com

8. Acknowledgements

   We thank Martin Casado for inspiring this work and making all the
   introductions, and Ben Pfaff for his explanations of the
   implementation. Thanks also to Pierre Ettori, Yukio Ogawa and
   Koichiro Seto for their helpful comments.

9. References

9.1. Normative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7, RFC
              793, September 1981.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

9.2. Informative References

   [I-D.mahalingam-dutt-dcops-vxlan]
              Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A
              Framework for Overlaying Virtualized Layer 2 Networks over
              Layer 3 Networks", draft-mahalingam-dutt-dcops-vxlan-04
              (work in progress), May 2013.

   [I-D.narten-nvo3-overlay-problem-statement]
              Narten, T., Black, D., Dutt, D., Fang, L., Gray, E.,
              Kreeger, L., Napierala, M., and M. Sridharan, "Problem
              Statement: Overlays for Network Virtualization",
              draft-narten-nvo3-overlay-problem-statement-04 (work in
              progress), August 2012.

   [I-D.sridharan-virtualization-nvgre]
              Sridharan, M., Greenberg, A., Wang, Y., Garg, P.,
              Venkataramiah, N., Duda, K., Ganga, I., Lin, G., Pearson,
              M., Thaler, P., and C. Tumuluri, "NVGRE: Network
              Virtualization using Generic Routing Encapsulation",
              draft-sridharan-virtualization-nvgre-03 (work in
              progress), August 2013.

   [RFC2784]  Farinacci, D., Li, T., Hanks, S., Meyer, D., and P.
              Traina, "Generic Routing Encapsulation (GRE)", RFC 2784,
              March 2000.

   [RFC2983]  Black, D., "Differentiated Services and Tunnels", RFC
              2983, October 2000.

   [RFC3931]  Lau, J., Townsley, M., and I. Goyret, "Layer Two Tunneling
              Protocol - Version 3 (L2TPv3)", RFC 3931, March 2005.

   [RFC6040]  Briscoe, B., "Tunnelling of Explicit Congestion
              Notification", RFC 6040, November 2010.

   [VL2]      Greenberg et al., "VL2: A Scalable and Flexible Data
              Center Network", Proc. ACM SIGCOMM 2009, 2009.

Authors' Addresses

   Bruce Davie (editor)
   VMware, Inc.
   3401 Hillview Ave.
   Palo Alto, CA 94304
   USA

   Email: bdavie@vmware.com

   Jesse Gross
   VMware, Inc.
   3401 Hillview Ave.
   Palo Alto, CA 94304
   USA

   Email: jgross@vmware.com