Network Working Group                                            H. Zhou
Internet-Draft                                                     C. Li
Intended Status: Experimental                                  eBay Inc.
Expires: November 3, 2014                                    May 2, 2014


              Segmentation Offloading Extension for VXLAN
                       draft-zhou-li-vxlan-soe-01

Abstract

   Segmentation offloading is now common in network stack
   implementations and is well supported by para-virtualized network
   device drivers for virtual machines (VMs). This draft describes an
   extension to Virtual eXtensible Local Area Network (VXLAN) so that
   segmentation can be decoupled from physical/underlay networks and
   offloaded further to the remote end-point, thus improving data-plane
   performance for VMs running on top of overlay networks.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1 Introduction
     1.1 Requirements Notation
     1.2 Definition of Terms
   2 Approach
     2.1 VXLAN Header Extension
     2.2 TX VTEP
     2.3 RX VTEP - Hypervisors
     2.4 RX VTEP - Gateways
   3 IP Fragmentation
   4 Interoperability
   5 Deployment Examples
     5.1 Example 1
     5.2 Example 2
   6 Security Considerations
   7 IANA Considerations
   8 References
     8.1 Normative References
     8.2 Informative References
   Authors' Addresses

1 Introduction

   Network virtualization over L3 transport has evolved along with
   server virtualization in data centers, and data-plane performance is
   one of the keys to the success of this combination.
   One of the most critical improvements in OS kernel TCP/IP stacks in
   recent years is segmentation offloading, and hypervisor providers now
   support the same mechanism in para-virtualized Ethernet drivers, so
   that virtual servers can benefit from it in the virtualized world by
   offloading segmentation tasks to the lowest layer on hypervisors or
   to NICs (if TSO/UFO is supported by the NICs installed in the
   hypervisor).

   While the general idea of segmentation offloading is to postpone
   segmentation to the latest possible point of packet transmission,
   this draft introduces a mechanism to avoid overlay segmentation
   completely in some situations.

   Essentially, overlay networks have an advantage over physical
   underlay networks in that they do not have a hard MTU limitation.
   Therefore, segmentation offloading can be pushed to the remote
   end-point of the transport tunnel, where segmentation can be omitted
   completely (e.g. when the remote end-point is a hypervisor), unless
   the packet is going to be forwarded to physical networks (e.g. when
   the remote end-point is a gateway).

   However, this advantage is not utilized when the transport of the
   overlay is based on the Virtual eXtensible Local Area Network
   [I-D.mahalingam-dutt-dcops-vxlan], which provides a transport
   mechanism for logically isolated L2 overlay networks between
   hypervisors. Lacking segmentation information in the VXLAN header,
   hypervisor implementations have to make the pessimistic decision to
   always segment the packet to the size specified by VMs before
   delivering it to the hypervisor's IP stack, because they do not know
   whether the remote end-point is bridged to a physical network with
   hard MTU limitations.
   It is worth noting that the segmentation here is not IP fragmentation
   in terms of the physical network MTU, which may still follow if the
   segment size resulting from the process above plus the tunnel outer
   header is greater than the physical network MTU.

   To fulfill the potential of segmentation offloading on overlays, this
   draft introduces segmentation metadata in the VXLAN header. With the
   capability of carrying segmentation metadata in packets, hypervisors
   can offload the segmentation decision further to the remote tunnel
   end-point, where the decision can be made whether segmentation is
   omitted, performed, or offloaded further to NIC hardware or to the
   next-hop tunnel end-point.

   This mechanism decouples segmentation for the overlay from the
   physical limitations of the underlay, giving hypervisor
   implementations more flexibility to achieve significant performance
   gains in a major part of VXLAN deployment scenarios.

   Although the performance gain that can be achieved is affected by the
   physical network MTU, there is inherently no mandatory requirement on
   the physical layer:

   1) When the physical network MTU is far bigger than the overlay MTU,
   the offloading reduces the number of packets transmitted by TX
   hypervisors and received by RX hypervisors and RX VMs.

   2) When the physical network MTU is close to the overlay MTU, the
   number of packets transmitted in the physical network (as a result of
   IP fragmentation) may not be reduced significantly, but on the RX
   side, after IP reassembly, the number of packets delivered from the
   hypervisor to the receiving VM is largely reduced, thus saving the
   cost of hypervisor <-> VM interaction and of protocol stack traversal
   in the receiving VM.
   Furthermore, a minor cost saving is that the number of bytes
   transmitted over the physical network is slightly reduced, because
   only one copy of the headers (inner L2-L4 headers, VXLAN header and
   outer UDP header) is transmitted for a large overlay packet.

   In addition, offloading feature support from NIC hardware is NOT
   required for the performance gains discussed above.

1.1 Requirements Notation

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

1.2 Definition of Terms

   GSO: Generic Segmentation Offload.

   TSO: TCP Segmentation Offload.

   UFO: UDP Segmentation Offload.

   LRO: Large Receive Offload.

   GRO: Generic Receive Offload.

   NIC: Network Interface Card.

   VM: Virtual Machine.

   TX: Sending side.

   RX: Receiving side.

   VTEP: Virtual Tunnel End Point.

2 Approach

2.1 VXLAN Header Extension

   The new VXLAN Segmentation Offloading Extension (VXLAN-soe) header is
   defined as:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |S|R|R|R|I|R|R|R|Overlay MSS Hi |            Reserved           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                VXLAN Network Identifier (VNI) |Overlay MSS Lo |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The changes to VXLAN are:

   S Bit: Flag bit 0 is defined as the S (Segmentation Offloading
   Extension) bit.

      S = 1 indicates that VXLAN-soe is applied to the encapsulated
      overlay packet, and the Overlay MSS fields (see below) are valid.

      S = 0 indicates that VXLAN-soe is NOT applied, and the Overlay MSS
      fields MUST be set to 0 in accordance with VXLAN.

   Overlay MSS: bits 8 - 15 and bits 56 - 63 together are defined as the
   Overlay Max Segment Size (16-bit unsigned integer) specified by the
   TX VM for the segmentation being offloaded.

      Overlay MSS Hi: bits 8 - 15 carry the higher 8 bits of the 16-bit
      value.

      Overlay MSS Lo: bits 56 - 63 carry the lower 8 bits of the 16-bit
      value.

      The definition of the 16-bit value depends on the inner packet
      type. For TCP packets, it is defined as the maximum size of the
      TCP payload; for UDP packets, it is defined as the maximum size of
      the IP payload. This definition follows the convention of the
      Linux kernel implementation, so the GSO size passed from the VM to
      the hypervisor can be used directly. Definitions for other inner
      packet types can be added in the future.

      This field is valid only if the S bit is set.

2.2 TX VTEP

   The VTEP at the TX side MUST set the S bit to 1 if the packet to be
   encapsulated is NOT segmented and the VTEP decides to offload the
   segmentation to the remote end-point. In this case the Overlay MSS
   field MUST be set accordingly. This is the typical use case when the
   TX VTEP is a hypervisor transmitting TCP streams of VMs with large
   sliding windows.

   The VTEP at the TX side MUST clear the S bit if the packet to be
   encapsulated is already segmented or does NOT need to be segmented in
   terms of the overlay MTU. In this case, the encapsulation is in the
   same format as specified in VXLAN. This is the typical use case when
   the TX VTEP is a hypervisor transmitting small overlay packets, or a
   gateway forwarding overlay packets to physical networks.

2.3 RX VTEP - Hypervisors

   When the VTEP at the RX side is on a hypervisor, where the packet is
   delivered to a receiving VM, the hypervisor checks the S bit. If the
   S bit is NOT set, the packet is handled as a normal VXLAN packet.
   In this case a packet larger than the MTU setting of the receiving
   VM's virtual interface is usually dropped by the hypervisor. If the S
   bit is set, the hypervisor SHALL NOT perform an MTU check against the
   virtual interface of the receiving VM.

2.4 RX VTEP - Gateways

   When the VTEP at the RX side is on a gateway node that connects
   overlay networks and physical networks, the S bit MUST be checked and
   the VTEP MUST ensure that the segmentation specified by the Overlay
   MSS field is performed by the VTEP itself or offloaded further. It
   MAY offload the segmentation again to subsequent transmission
   mechanisms such as TSO/UFO/GSO or, if the link to the next hop is
   also an overlay based on VXLAN-soe (or another tunneling protocol
   that supports segmentation offloading), pass the segmentation
   metadata to the next hop.

3 IP Fragmentation

   Skipping overlay segmentation results in large packets being
   encapsulated in VXLAN and outer UDP/IP headers. When the encapsulated
   packet is bigger than the physical network MTU, IP fragmentation has
   to be performed. This can lead to two problems.

   The first problem is that the loss of a single IP fragment results in
   the whole IP packet being dropped, which wastes bandwidth and has a
   negative impact on throughput. Because of this, it is recommended to
   implement VXLAN-soe as a configurable feature, which should be
   enabled only if the physical network is highly reliable. A data
   center is the typical environment in which to enable this feature.

   The other problem is that the inner packet size plus the outer
   headers can exceed 65535 bytes, which is the upper limit of the IP
   packet size. In this situation special handling can be implemented to
   avoid oversized IP packets, such as falling back to overlay
   segmentation. Other, better solutions are possible but are out of the
   scope of this document.
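
   As an illustration only, the TX-side behaviour of Section 2.2,
   combined with the configurable switch and the 65535-byte limit
   discussed above, could be sketched in Python as follows. The function
   names, the soe_enabled parameter, and the outer IPv4 header without
   options are assumptions made for this sketch and are not part of this
   specification. The header layout follows Section 2.1, and flag bit 4
   is the I bit of base VXLAN.

```python
import struct

FLAG_S = 0x80  # flag bit 0 (most significant bit): VXLAN-soe applied
FLAG_I = 0x08  # flag bit 4: VNI valid, as in base VXLAN

# Assumed outer header sizes: IPv4 without options, UDP, VXLAN.
OUTER_IPV4_LEN = 20
OUTER_UDP_LEN = 8
VXLAN_LEN = 8
IP_MAX_PACKET = 65535  # upper limit of the IP packet size (Section 3)


def pack_header(vni, overlay_mss=None):
    """Build the 8-byte header; overlay_mss=None gives a plain VXLAN header."""
    if not 0 <= vni < 1 << 24:
        raise ValueError("VNI must fit in 24 bits")
    flags, mss = FLAG_I, 0
    if overlay_mss is not None:
        if not 0 < overlay_mss < 1 << 16:
            raise ValueError("Overlay MSS must fit in 16 bits")
        flags, mss = FLAG_I | FLAG_S, overlay_mss
    # Bits 8-15 carry MSS Hi, bits 16-31 are reserved,
    # bits 32-55 carry the VNI and bits 56-63 carry MSS Lo.
    return struct.pack("!BBHI", flags, mss >> 8, 0, (vni << 8) | (mss & 0xFF))


def unpack_header(header):
    """Return (vni, overlay_mss); overlay_mss is None when the S bit is clear."""
    flags, mss_hi, _reserved, word2 = struct.unpack("!BBHI", header[:8])
    vni, mss_lo = word2 >> 8, word2 & 0xFF
    if flags & FLAG_S:
        return vni, (mss_hi << 8) | mss_lo
    return vni, None


def tx_header(vni, inner_len, overlay_mss, soe_enabled=True):
    """Header for an unsegmented inner packet, or None to fall back.

    None means the caller should segment at the overlay MSS first and
    send each segment with a plain VXLAN header.
    """
    outer_len = OUTER_IPV4_LEN + OUTER_UDP_LEN + VXLAN_LEN + inner_len
    if not soe_enabled or outer_len > IP_MAX_PACKET:
        return None
    return pack_header(vni, overlay_mss)
```

   In this sketch a None result stands for the fallback to overlay
   segmentation mentioned above; a real implementation may choose a
   different strategy.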

4 Interoperability

   In addition to offloading segmentation requests from VMs, a VXLAN-soe
   enabled VTEP is able to offload segmentation requests from an STT
   [I-D.davie-stt] overlay. The metadata required in the VXLAN-soe
   header is a subset of the STT metadata, and the additional
   segmentation offloading information carried in STT metadata, such as
   the L4 offset, can be obtained by examining the inner headers of the
   packets.

   VXLAN-soe is compatible with VXLAN-gpe [I-D.quinn-vxlan-gpe], another
   extension of VXLAN. For example, if the packet being encapsulated is
   a TCP/IP packet without an L2 header, TCP segmentation can also be
   skipped at the TX side and offloaded to the RX side. See the example
   in Section 5.2.

5 Deployment Examples

5.1 Example 1

                       .--.  .--.
                      (    '    '.--._
                     ('''  Physical   )
                      (    Network .'-'
                       '--'._.'.  )
                         /     '--'
                Gateway /VLAN
                 +-----'----+
                 |          |
                 |   VTEP   |
                 +----+-----+
                      |VXLAN-soe
                  .--.|.--.
                 (    '    '.--.
              .-.'   Intra-DC   '
             (       network     )
             /                .'-\
    VXLAN-soe/  '--'._.'.     )  \VXLAN-soe
           /            '--'       \
  +--------+-+                  +--+-------+
  | VTEP     |                  |    VTEP  |
  |+-----+   |                  |   +-----+|
  ||VM1  |   |                  |   | VM2 ||
  ++-----+---+                  +---+-----++
  Hypervisor1                   Hypervisor2

                               Figure 1

   Figure 1 shows basic scenarios of VXLAN-soe usage. Taking a TCP
   stream as an example, when VM1 on Hypervisor1 sends a big data buffer
   to VM2 on Hypervisor2, TCP segmentation is offloaded from VM1 to
   Hypervisor1 and, because of VXLAN-soe, from Hypervisor1 to
   Hypervisor2: the VXLAN-soe encapsulated packet is split into IP
   fragments according to the physical network MTU and transmitted to
   Hypervisor2. On Hypervisor2, after IP reassembly, the big TCP data
   buffer is delivered directly to VM2.
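
   To make the packet counts in this scenario concrete, the following
   Python sketch compares overlay segmentation at Hypervisor1 with
   VXLAN-soe plus IP fragmentation. The header sizes (inner Ethernet,
   IPv4 and TCP without options, outer IPv4 without options), the
   function names, and the example numbers are assumptions chosen for
   illustration; they are not values taken from this document.

```python
INNER_HEADERS = 14 + 20 + 20   # assumed inner Ethernet + IPv4 + TCP, no options
OUTER_UDP_VXLAN = 8 + 8        # outer UDP header + VXLAN header
OUTER_IPV4 = 20                # assumed outer IPv4 header, no options


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def overlay_segments(tcp_payload_len, overlay_mss):
    """Packets sent when the TX hypervisor segments at the overlay MSS."""
    return ceil_div(tcp_payload_len, overlay_mss)


def soe_ip_fragments(tcp_payload_len, physical_mtu):
    """IPv4 fragments of one unsegmented VXLAN-soe packet."""
    outer_ip_payload = OUTER_UDP_VXLAN + INNER_HEADERS + tcp_payload_len
    if outer_ip_payload + OUTER_IPV4 <= physical_mtu:
        return 1  # fits in one packet, no fragmentation needed
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (physical_mtu - OUTER_IPV4) // 8 * 8
    return ceil_div(outer_ip_payload, per_fragment)


if __name__ == "__main__":
    buf, mss = 32000, 1400
    print("without soe:", overlay_segments(buf, mss), "packets")
    for mtu in (1500, 9000):
        print("with soe, MTU", mtu, ":", soe_ip_fragments(buf, mtu),
              "fragments, 1 packet delivered to VM2")
```

   Under these assumptions, a 32000-byte buffer becomes 23 overlay
   segments without VXLAN-soe, versus 22 IP fragments at a 1500-byte MTU
   or 4 at a 9000-byte MTU. In both VXLAN-soe cases VM2 receives a
   single packet after reassembly.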

   When VM1 sends a big data buffer to some host behind the Gateway, the
   same process happens on Hypervisor1, but after the IP fragments are
   reassembled on the Gateway, TCP segmentation must be performed
   according to the overlay MSS in the VXLAN-soe header. The Gateway can
   be deployed as a ToR switch or a generic server. If the Gateway is a
   generic server with a TSO-capable NIC, it can offload the
   segmentation task to NIC hardware. In both cases, packets transmitted
   to the physical VLAN are already segmented according to the overlay
   MSS.

   When TCP segments destined to VM1 are received from the physical VLAN
   on the Gateway, and the Gateway is a generic server, NIC hardware
   with LRO/GRO support can aggregate small TCP segments into bigger TCP
   packets, which can be delivered to VM1 efficiently with the help of
   VXLAN-soe.

5.2 Example 2

                          .--.  .--.
                         (    '    '.--._
                        ('''  Inter-DC   )
                         (    network .i-'.,
                       /  '--'._.'.  )    ` \
                      /          '--'        \
                     /   VXLAN-soe + VXLAN-gpe\
          Gateway1  /                          \  Gateway2
            +----------+                   +----------+
            |          |                   |          |
            |   VTEP   |                   |   VTEP   |
            +----+-----+                   +----+-----+
                 |VXLAN-soe                     |VXLAN-soe
             .--.|.--.                      .--.|.--.
            (    '    '.--.                (    '    '.--.
         .-.'   Intra-DC   '            .-.'   Intra-DC   '
        (       network  __)           (       network  __)
         (            .'                (                \
          '--/._.'.  )                   '--'._.'.  )\VXLAN-soe
    VXLAN-soe/    '--'                            '--' \
        +---+------+                            +-----++-+-+
        | VTEP     |                            | VTEP     |
        |+-----+   |                            |+-----+   |
        ||VM1  |   |                            ||VM2  |   |
        ++-----+---+                            ++-----+---+
        Hypervisor1                             Hypervisor2

                               Figure 2

   Figure 2 shows how VXLAN-soe and VXLAN-gpe work together. In this
   example, traffic from VM1 to VM2 needs to traverse the inter-DC
   network connected by Gateway1 and Gateway2. In this case VXLAN-gpe is
   used between Gateway1 and Gateway2 to encapsulate L3 packets
   directly. When a big TCP buffer is sent from VM1, TCP segmentation is
   first offloaded to Hypervisor1 and then to Gateway1.
   With the help of VXLAN-soe between Gateway1 and Gateway2, TCP
   segmentation is offloaded further to Gateway2 and then to
   Hypervisor2, where the big TCP buffer is delivered directly to VM2.

6 Security Considerations

   No special security issues are introduced by this extension to VXLAN.

7 IANA Considerations

   This document creates no new requirements on IANA namespaces
   [RFC5226].

8 References

8.1 Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC5226]  Narten, T. and H. Alvestrand, "Guidelines for Writing an
              IANA Considerations Section in RFCs", BCP 26, RFC 5226,
              May 2008.

8.2 Informative References

   [I-D.mahalingam-dutt-dcops-vxlan]
              Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C. Wright, "VXLAN: A
              Framework for Overlaying Virtualized Layer 2 Networks over
              Layer 3 Networks", draft-mahalingam-dutt-dcops-vxlan-08
              (work in progress), February 2014.

   [I-D.davie-stt]
              Davie, B. and J. Gross, "A Stateless Transport Tunneling
              Protocol for Network Virtualization (STT)",
              draft-davie-stt-05 (work in progress), March 2014.

   [I-D.quinn-vxlan-gpe]
              Agarwal, P., Fernando, R., Kreeger, L., Lewis, D., Maino,
              F., Quinn, P., Yong, L., Xu, X., Smith, M., Yadav, N., and
              U. Elzur, "Generic Protocol Extension for VXLAN",
              draft-quinn-vxlan-gpe-02 (work in progress), December
              2013.

Authors' Addresses

   Han Zhou
   eBay, Inc.

   EMail: hzhou8@ebay.com

   Chengyuan Li
   eBay, Inc.

   EMail: chengyli@ebay.com