idnits 2.17.1

draft-ietf-nvo3-gue-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack a both a reference to RFC 2119 and the
     recommended RFC 2119 boilerplate, even if it appears to use RFC 2119
     keywords.

     RFC 2119 keyword, line 337: '...and decapsulator MUST agree on the mea...'
     RFC 2119 keyword, line 342: '... GUE packet with private data, it MUST...'
     RFC 2119 keyword, line 344: '...capsulator the packet MUST be dropped....'
     RFC 2119 keyword, line 346: '...ntics the packet MUST also be dropped....'
     RFC 2119 keyword, line 525: '...o a decapsulator MUST NOT be ignored. ...'
     (18 more instances...)

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  -- The document date (June 10, 2016) is 2877 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: 'RFC0768' is mentioned on line 131, but not defined

  == Missing Reference: 'RFC2983' is mentioned on line 508, but not defined

  == Missing Reference: 'RFC6040' is mentioned on line 509, but not defined

  == Missing Reference: 'RFC478' is mentioned on line 606, but not defined

  == Missing Reference: 'RFC6935' is mentioned on line 641, but not defined

  == Missing Reference: 'RFC6936' is mentioned on line 641, but not defined

  == Missing Reference: 'GUECSUM' is mentioned on line 654, but not defined

  == Missing Reference: 'RFC768' is mentioned on line 689, but not defined

  == Missing Reference: 'RFC1122' is mentioned on line 677, but not defined

  == Missing Reference: 'RFC2460' is mentioned on line 689, but not defined

  ** Obsolete undefined reference: RFC 2460 (Obsoleted by RFC 8200)

  == Missing Reference: 'RFC4459' is mentioned on line 716, but not defined

  == Missing Reference: 'RFC5405' is mentioned on line 725, but not defined

  ** Obsolete undefined reference: RFC 5405 (Obsoleted by RFC 8085)

  == Missing Reference: 'RFC3378' is mentioned on line 857, but not defined

  == Missing Reference: 'RFC2784' is mentioned on line 858, but not defined

  == Missing Reference: 'RFC4023' is mentioned on line 859, but not defined

  == Missing Reference: 'RFC2661' is mentioned on line 859, but not defined

  == Missing Reference: 'RFC2003' is mentioned on line 862, but not defined

  == Missing Reference: 'RFC2473' is mentioned on line 863, but not defined

  == Missing Reference: 'RFC3948' is mentioned on line 866, but not defined

  == Missing Reference: 'RFC6830' is mentioned on line 867, but not defined

  ** Obsolete undefined reference: RFC 6830 (Obsoleted by RFC 9300, RFC 9301)

  == Outdated reference: A later version (-19) exists of
     draft-ietf-tsvwg-gre-in-udp-encap-06

  == Outdated reference: A later version (-03) exists of
     draft-hy-gue-4-secure-transport-02

  == Outdated reference: A later version (-02) exists of
     draft-herbert-remotecsumoffload-00


     Summary: 4 errors (**), 0 flaws (~~), 24 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2   Network Virtualization Overlays (nvo3)                        T. Herbert
3   Internet-Draft                                                  Facebook
4   Intended status: Standards Track                                 L. Yong
5   Expires December 12, 2016                                     Huawei USA
6                                                                     O. Zia
7                                                                  Microsoft
8                                                              June 10, 2016

10                         Generic UDP Encapsulation
11                           draft-ietf-nvo3-gue-03

13  Status of this Memo

15     This Internet-Draft is submitted in full conformance with the
16     provisions of BCP 78 and BCP 79.

18     Internet-Drafts are working documents of the Internet Engineering
19     Task Force (IETF), its areas, and its working groups. Note that
20     other groups may also distribute working documents as
21     Internet-Drafts.

23     Internet-Drafts are draft documents valid for a maximum of six months
24     and may be updated, replaced, or obsoleted by other documents at any
25     time. It is inappropriate to use Internet-Drafts as reference
26     material or to cite them other than as "work in progress."

28     The list of current Internet-Drafts can be accessed at
29     http://www.ietf.org/ietf/1id-abstracts.txt

31     The list of Internet-Draft Shadow Directories can be accessed at
32     http://www.ietf.org/shadow.html

34     This Internet-Draft will expire on December 12, 2016.

36  Copyright Notice

38     Copyright (c) 2016 IETF Trust and the persons identified as the
39     document authors. All rights reserved.

41     This document is subject to BCP 78 and the IETF Trust's Legal
42     Provisions Relating to IETF Documents
43     (http://trustee.ietf.org/license-info) in effect on the date of
44     publication of this document.
       Please review these documents
45     carefully, as they describe your rights and restrictions with respect
46     to this document. Code Components extracted from this document must
54     include Simplified BSD License text as described in Section 4.e of
55     the Trust Legal Provisions and are provided without warranty as
56     described in the Simplified BSD License.

58  Abstract

60     This specification describes Generic UDP Encapsulation (GUE), which
61     is a scheme for using UDP to encapsulate packets of arbitrary IP
62     protocols for transport across layer 3 networks. By encapsulating
63     packets in UDP, specialized capabilities in networking hardware for
64     efficient handling of UDP packets can be leveraged. GUE specifies
65     basic encapsulation methods upon which higher level constructs, such
66     as tunnels and overlay networks for network virtualization, can be
67     constructed. GUE is extensible by allowing optional data fields as
68     part of the encapsulation, and is generic in that it can encapsulate
69     packets of various IP protocols.

71  Table of Contents

73     1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4
74     2. Base packet format . . . . . . . . . . . . . . . . . . . . . . 5
75       2.1. GUE version . . . . . . . . . . . . . . . . . . . . . . . . 5
76     3. Version 0 . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
77       3.1. Header format . . . . . . . . . . . . . . . . . . . . . . . 6
78       3.2. Proto/ctype field . . . . . . . . . . . . . . . . . . . . . 7
79       3.3. Flags and optional fields . . . . . . . . . . . . . . . . . 8
80       3.4. Private data . . . . . . . . . . . . . . . . . . . . . . . 8
81       3.5 Message types . . . . . . . . . . . . . . . .
          . . . . . . . 9
82         3.5.1. Control messages . . . . . . . . . . . . . . . . . . . 9
83         3.5.2. Data messages . . . . . . . . . . . . . . . . . . . . . 9
84     4. Version 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
85       4.1. Direct encapsulation of IPv4 . . . . . . . . . . . . . . . 10
86       4.2 Direct encapsulation of IPv6 . . . . . . . . . . . . . . . . 11
87     5. Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
88       5.1. Network tunnel encapsulation . . . . . . . . . . . . . . . 12
89       5.2. Transport layer encapsulation . . . . . . . . . . . . . . . 12
90       5.3. Encapsulator operation . . . . . . . . . . . . . . . . . . 12
91       5.4. Decapsulator operation . . . . . . . . . . . . . . . . . . 13
92       5.5. Router and switch operation . . . . . . . . . . . . . . . . 13
93       5.6. Middlebox interactions . . . . . . . . . . . . . . . . . . 13
94         5.6.1. Inferring connection semantics . . . . . . . . . . . . 13
95         5.6.2. Identification of GUE packets . . . . . . . . . . . . . 14
96       5.7. NAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
97       5.8. Checksum Handling . . . . . . . . . . . . . . . . . . . . . 15
98         5.8.1. Checksum requirements . . . . . . . . . . . . . . . . . 15
99         5.8.2. GUE header checksum . . . . . . . . . . . . . . . . . . 15
100        5.8.3. UDP Checksum with IPv4 . . . . . . . . . . . . . . . . 16
101        5.8.4. UDP Checksum with IPv6 . . . . . . . . . . . . . . . . 16
102      5.9. MTU and fragmentation . . . . . . . . . . . . . . . . . . . 17
103      5.10. Congestion control . . . . . . . . . . . . . . . . . . . . 17
104      5.11. Multicast . . . . . . . . . . . . . . . . . . . . . . . . 18
105    6. Inner flow identifier properties . . . . . . . . . . . . . . . 18
106      6.1. Flow classification . . . . . . . . . . . . . . . . . . . . 18
107      6.2. Inner flow identifier properties . . . . . . . . . . . . . 19
108    7. Motivation for GUE . . . . . . . . . . . . . . . . . . . . . . 20
109    8. Security Considerations . . . . . . . . . . . . . . . . . . . . 21
110    9.
       IANA Consideration . . . . . . . . . . . . . . . . . . . . . . 21
111    10. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 21
112    11. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22
113      11.1. Normative References . . . . . . . . . . . . . . . . . . . 22
114      11.2. Informative References . . . . . . . . . . . . . . . . . . 22
115    Appendix A: NIC processing for GUE . . . . . . . . . . . . . . . . 23
116      A.1. Receive multi-queue . . . . . . . . . . . . . . . . . . . . 23
117      A.2. Checksum offload . . . . . . . . . . . . . . . . . . . . . 23
118        A.2.1. Transmit checksum offload . . . . . . . . . . . . . . . 23
119        A.2.2. Receive checksum offload . . . . . . . . . . . . . . . 24
120      A.3. Transmit Segmentation Offload . . . . . . . . . . . . . . . 25
121      A.4. Large Receive Offload . . . . . . . . . . . . . . . . . . . 26
122    Appendix B: Privileged ports . . . . . . . . . . . . . . . . . . . 26
123    Appendix C: Inner flow identifier as a route selector . . . . . . 27
124    Appendix D: Hardware protocol implementation considerations . . . 27
125    Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 27

127 1. Introduction

129    This specification describes Generic UDP Encapsulation (GUE) which is
130    a general method for encapsulating packets of arbitrary IP protocols
131    within User Datagram Protocol (UDP) [RFC0768] packets. Encapsulating
132    packets in UDP facilitates efficient transport across networks.
133    Networking devices widely provide protocol specific processing and
134    optimizations for UDP (as well as TCP) packets. Packets for atypical
135    IP protocols (those not usually parsed by networking hardware) can be
136    encapsulated in UDP packets to maximize deliverability and to
137    leverage flow specific mechanisms for routing and packet steering.

139    GUE provides an extensible header format for including optional data
140    in the encapsulation header.
       This data potentially covers items such
141    as virtual networking identifier, security data for validating or
142    authenticating the GUE header, congestion control data, etc. GUE also
143    allows private optional data in the encapsulation header. This
144    feature can be used by a site or implementation to define local
145    custom optional data, and allows experimentation with options that may
146    eventually become standard.

148 2. Base packet format

150    A GUE packet consists of a UDP packet whose payload is a GUE
151    header followed by a payload which is either an encapsulated packet
152    of some IP protocol or a control message (like an OAM message). A GUE
153    packet has the general format:

155       +-------------------------------+
156       |                               |
157       |         UDP/IP header         |
158       |                               |
159       |-------------------------------|
160       |                               |
161       |          GUE Header           |
162       |                               |
163       |-------------------------------|
164       |                               |
165       |      Encapsulated packet      |
166       |      or control message       |
167       |                               |
168       +-------------------------------+

170    The GUE header is variable length as determined by the presence of
171    optional fields.

173 2.1. GUE version

175    The first two bits of the GUE header contain the GUE protocol version
176    number. The rest of the fields after the GUE version number are
177    defined based on the version number. Versions 0x0 and 0x1 are
178    described in this specification; versions 0x2 and 0x3 are reserved.

180 3. Version 0

182    Version 0 of GUE defines a generic extensible format to
183    encapsulate packets by IP protocol number.

185 3.1.
         Header format

187    The header format for version 0x0 of GUE in UDP is:

189     0                   1                   2                   3
190     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
191    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
192    |          Source port          |       Destination port        |
193    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
194    |            Length             |           Checksum            |
195    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
196    |0x0|C|  Hlen   |  Proto/ctype  |            Flags            |E|
197    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
198    |                                                               |
199    ~                       Fields (optional)                       ~
200    |                                                               |
201    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
202    |                  Extension flags (optional)                   |
203    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
204    |                                                               |
205    ~                  Extension fields (optional)                  ~
206    |                                                               |
207    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
208    |                                                               |
209    ~                    Private data (optional)                    ~
210    |                                                               |
211    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

213    The contents of the UDP header are:

215       o Source port (inner flow identifier): This should be set to a
216         value that represents the encapsulated flow. The properties of
217         the inner flow identifier are described below.

219       o Destination port: The GUE assigned port number, 6080.

221       o Length: Canonical length of the UDP packet (length of UDP header
222         and payload).

224       o Checksum: Standard UDP checksum (see section 5.8).

226    The GUE header consists of:

228       o Ver: GUE protocol version (0x0).

230       o C: Control flag. When set indicates a control message, not set
231         indicates a data message.

233       o Hlen: Length in 32-bit words of the GUE header, including
234         optional fields but not the first four bytes of the header.
235         Computed as (header_len - 4) / 4. All GUE headers are a multiple
236         of four bytes in length. Maximum header length is 128 bytes.

238       o Proto/ctype: When the C bit is set this field contains a control
239         message type for the payload.
            When the C bit is not set, the field
240         holds the IP protocol number for the encapsulated packet in the
241         payload. The control message or encapsulated packet begins at
242         the offset provided by Hlen.

244       o Flags: Header flags that may be allocated for various purposes
245         and may indicate presence of optional fields. Undefined header
246         flag bits must be set to zero on transmission.

248       o 'E' Extension flag: Indicates presence of extension flags option
249         in the optional fields.

251       o Fields: Optional fields whose presence is indicated by
252         corresponding flags.

254       o Extension flags: An optional field indicated by the E bit. This
255         field provides an additional set of thirty-two bits for flags.

257       o Extension fields: Optional fields whose presence is indicated by
258         corresponding extension flags.

260       o Private data: Optional private data. If private data is present
261         it immediately follows the last field present in the header.
262         The length of this data is determined by subtracting the
263         starting offset of the private data from the header length.

265 3.2. Proto/ctype field

267    When the C bit is not set, the proto/ctype field must be set to a
268    valid IP protocol number. The IP protocol number serves as an
269    indication of the type of next protocol header which is contained in
270    the GUE payload at the offset indicated in Hlen. Intermediate devices
271    may parse the GUE payload per the IP protocol number in the
272    proto/ctype field, and header flags cannot affect the interpretation
273    of the proto/ctype field.

275    IP protocol number 59 ("No next header") may be set to indicate that
276    the GUE payload does not begin with the header of an IP protocol.
277    This would be the case, for instance, if the GUE payload were a
278    fragment when performing GUE level fragmentation.
       The interpretation
279    of the payload is performed through other means (such as flags and
280    optional fields), and intermediate devices must not parse the
281    packet based on the IP protocol number in this case.

283    When the C bit is set, the proto/ctype field must be set to a valid
284    control message type. A value of zero indicates that the GUE payload
285    requires further interpretation to deduce the control type. This
286    might be the case when the payload is a fragment of a control
287    message, where only the reassembled packet can be interpreted as a
288    control message.

290 3.3. Flags and optional fields

292    Flags and associated optional fields are the primary mechanism of
293    extensibility in GUE. There are sixteen flag bits in the primary GUE
294    header with one being reserved to indicate that an optional extension
295    flags field is present. The extension flags field contains an
296    additional thirty-two flag bits.

298    A flag may indicate presence of optional fields. The size of an
299    optional field indicated by a flag must be fixed.

301    Flags may be paired together to allow different lengths for an
302    optional field. For example, if two flag bits are paired, a field may
303    have one of three different lengths. Regardless of how flag bits may
304    be paired, the lengths and offsets of optional fields corresponding
305    to a set of flags must be well defined.

307    Optional fields are placed in order of the flags. New flags should be
308    allocated from high to low order bit contiguously. Flags allow random
309    access: for instance, to inspect the field corresponding to the Nth
310    flag bit, an implementation only considers the previous N-1 flags to
311    determine the offset. Flags after the Nth flag are not pertinent in
312    calculating the offset of the Nth flag.

314    Flags (or paired flags) are idempotent such that new flags should not
315    cause reinterpretation of old flags.
       Also, new flags should not alter
316    interpretation of other elements in the GUE header nor how the
317    message is parsed (for instance, in a data message the proto/ctype
318    field always holds an IP protocol number as an invariant).

320 3.4. Private data

322    An implementation may use private data for its own purposes. The private
323    data immediately follows the last field in the GUE header and is not
324    a fixed length. This data is considered part of the GUE header and
325    must be accounted for in header length (Hlen). The length of the
326    private data must be a multiple of four bytes and is determined by
327    subtracting the offset of private data in the GUE header from the
328    header length. Specifically:

330       Private_length = (Hlen * 4) - Length(flags)

332    Where "Length(flags)" returns the sum of lengths of all the optional
333    fields present in the GUE header. When there is no private data
334    present, the length of the private data is zero.

336    The semantics and interpretation of private data are implementation
337    specific. An encapsulator and decapsulator MUST agree on the meaning
338    of private data before using it. The private data may be structured
339    as necessary, for instance it might contain its own set of flags and
340    optional fields.

342    If a decapsulator receives a GUE packet with private data, it MUST
343    validate the private data appropriately. If a decapsulator does not
344    expect private data from an encapsulator the packet MUST be dropped.
345    If a decapsulator cannot validate the contents of private data per
346    the provided semantics the packet MUST also be dropped. An
347    implementation may place security data in GUE private data which must
348    be verified for packet acceptance.

350 3.5 Message types

352 3.5.1. Control messages

354    Control messages are indicated in the GUE header when the C bit is
355    set. The payload is interpreted as a control message with type
356    specified in the proto/ctype field.
       The format and contents of the
357    control message are indicated by the type and can be variable length.

359    Other than interpreting the proto/ctype field as a control message
360    type, the meaning and semantics of the rest of the elements in the
361    GUE header are the same as that of data messages. Forwarding and
362    routing of control messages should be the same as that of a data
363    message with the same outer IP and UDP header and GUE flags; this
364    ensures that control messages can be created that follow the same
365    path as data messages.

367    Control messages can be defined for OAM type messages. For instance,
368    an echo request and corresponding echo reply message may be defined
369    to test for liveness.

371 3.5.2. Data messages

373    Data messages are indicated in the GUE header with the C bit not set. The
374    payload of a data message is interpreted as an encapsulated packet of
375    an IP protocol indicated in the proto/ctype field. The packet
376    immediately follows the GUE header.

378    Data messages are a primary means of encapsulation and can be used to
379    create tunnels for overlay networks.

381 4. Version 1

383    Version 1 of GUE allows direct encapsulation of IPv4 and IPv6 in UDP.
384    In this version there is no GUE header; a UDP packet carries an IP
385    packet. The first two bits of the UDP payload indicate the GUE
386    version and coincide with the first two bits of the version number in
387    the IP header. So, the first two version bits of IPv4 and IPv6 are 01,
388    which corresponds to GUE version 1.

390 4.1. Direct encapsulation of IPv4

392    The format for encapsulating IPv4 directly in UDP is demonstrated
393    below.
394     0                   1                   2                   3
395     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
396    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
397    |          Source port          |       Destination port        |
398    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
399    |            Length             |           Checksum            |
400    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
401    |0|1|0|0|  IHL  |Type of Service|          Total Length         |
402    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
403    |         Identification        |Flags|      Fragment Offset    |
404    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
405    |  Time to Live |    Protocol   |         Header Checksum       |
406    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
407    |                      Source IPv4 Address                      |
408    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
409    |                   Destination IPv4 Address                    |
410    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

412    Note that the value 0100 in the IP version field expresses the GUE
413    version as 1 (bits 01) and the IP version as 4 (bits 0100).

415 4.2 Direct encapsulation of IPv6

417    The format for encapsulating IPv6 directly in UDP is demonstrated
418    below.
419     0                   1                   2                   3
420     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
421    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
422    |          Source port          |       Destination port        |
423    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
424    |            Length             |           Checksum            |
425    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
426    |0|1|1|0| Traffic Class |           Flow Label                  |
427    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
428    |         Payload Length        |     NextHdr   |   Hop Limit   |
429    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
430    |                                                               |
431    +                                                               +
432    |                                                               |
433    +                   Outer Source IPv6 Address                   +
434    |                                                               |
435    +                                                               +
436    |                                                               |
437    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
438    |                                                               |
439    +                                                               +
440    |                                                               |
441    +                Outer Destination IPv6 Address                 +
442    |                                                               |
443    +                                                               +
444    |                                                               |
445    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

447    Note that the value 0110 in the IP version field expresses the GUE
448    version as 1 (bits 01) and the IP version as 6 (bits 0110).

450 5. Operation

452    The figure below illustrates the use of GUE encapsulation between two
453    servers. Server 1 is sending packets to server 2. An encapsulator
454    performs encapsulation of packets from server 1. These encapsulated
455    packets traverse the network as UDP packets. At the decapsulator,
456    packets are decapsulated and sent on to server 2. Packet flow in the
457    reverse direction need not be symmetric; GUE encapsulation is not
458    required in the reverse path.

460    +---------------+                       +---------------+
461    |               |                       |               |
462    |   Server 1    |                       |   Server 2    |
463    |               |                       |               |
464    +---------------+                       +---------------+
465            |                                       ^
466            V                                       |
467    +---------------+   +---------------+   +---------------+
468    |               |   |               |   |               |
469    | Encapsulator  |-->|    Layer 3    |-->| Decapsulator  |
470    |               |   |    Network    |   |               |
471    +---------------+   +---------------+   +---------------+

473    The encapsulator and decapsulator may be co-resident with the
474    corresponding servers, or may be on separate nodes in the network.

476 5.1.
         Network tunnel encapsulation

478    Network tunneling can be achieved by encapsulating layer 2 or layer 3
479    packets. In this case the encapsulator and decapsulator nodes are the
480    tunnel endpoints. These could be routers that provide network tunnels
481    on behalf of communicating servers.

483 5.2. Transport layer encapsulation

485    When encapsulating layer 4 packets, the encapsulator and decapsulator
486    should be co-resident with the servers. In this case, the
487    encapsulation headers are inserted between the IP header and the
488    transport packet. The addresses in the IP header refer to both the
489    endpoints of the encapsulation and the endpoints for terminating the
490    transport protocol.

492 5.3. Encapsulator operation

494    Encapsulators create GUE data messages, set the source port to the
495    inner flow identifier, set flags and optional fields in the GUE
496    header, and forward packets to a decapsulator.

498    An encapsulator may be an end host originating the packets of a flow,
499    or may be a network device performing encapsulation on behalf of
500    servers (routers implementing tunnels for instance). In either case,
501    the intended target (decapsulator) is indicated by the outer
502    destination IP address.

504    If an encapsulator is tunneling packets, that is encapsulating
505    packets of layer 2 or layer 3 protocols (e.g. EtherIP, IPIP, ESP
506    tunnel mode), it should follow standard conventions for tunneling of
507    one IP protocol over another. Diffserv interaction with tunnels is
508    described in [RFC2983]; ECN propagation for tunnels is described in
509    [RFC6040].

511 5.4. Decapsulator operation

513    A decapsulator performs decapsulation of GUE packets. A decapsulator
514    is addressed by the outer destination IP address of a GUE packet.
515    The decapsulator validates packets, including fields of the GUE
516    header.
       If a packet is acceptable, the UDP and GUE headers are
517    removed and the packet is resubmitted for IP protocol processing or
518    control message processing if it is a control message.

520    If a decapsulator receives a GUE packet with an unsupported version,
521    unknown flag, bad header length (too small for included optional
522    fields), unknown control message type, or an otherwise malformed
523    header, it must drop the packet and may log the event. No error
524    message is returned to the encapsulator. Note that set flags in
525    GUE that are unknown to a decapsulator MUST NOT be ignored. If a GUE
526    packet is received by a decapsulator with unknown flags, the packet
527    MUST be dropped.

529 5.5. Router and switch operation

531    Routers and switches should forward GUE packets as standard UDP/IP
532    packets. The outer five-tuple should contain sufficient information
533    to perform flow classification corresponding to the flow of the inner
534    packet. A router should not normally need to parse a GUE header, and
535    none of the flags or optional fields in the GUE header should affect
536    routing.

538    A router should not modify a GUE header when forwarding a packet. It
539    may encapsulate a GUE packet in another GUE packet, for instance to
540    implement a network tunnel. In this case the router takes the role of
541    an encapsulator, and the corresponding decapsulator is the logical
542    endpoint of the tunnel.

544 5.6. Middlebox interactions

546    A middlebox may interpret some flags and optional fields of the GUE
547    header for classification purposes, but is not required to understand
548    all flags and fields in GUE packets. A middlebox should not drop a
549    GUE packet because there are flags unknown to it. The header length
550    in the GUE header allows a middlebox to inspect the payload packet
551    without needing to parse the flags or optional fields.

553 5.6.1.
           Inferring connection semantics

555    A middlebox may infer bidirectional connection semantics for a UDP
556    flow. For instance, a stateful firewall may create a five-tuple rule
557    to match flows on egress, and a corresponding five-tuple rule for
558    matching ingress packets where the roles of source and destination
559    are reversed for the IP addresses and UDP port numbers. To operate in
560    this environment, a GUE tunnel must assume connected semantics
561    defined by the UDP five tuple and the use of GUE encapsulation must
562    be symmetric between both endpoints. The source port set in the UDP
563    header must be the destination port the peer would set for replies.

565 5.6.2. Identification of GUE packets

567    For a middlebox to be able to parse GUE packets it must be able to
568    positively identify GUE packets. The obvious way to do this would be
569    by destination port number, that is if the destination UDP port
570    matches the GUE port number it would be considered GUE. This method
571    is problematic since port numbers do not have global meaning
572    ([RFC7605]) and a packet which is not GUE but destined to the same
573    port number could be misinterpreted.

575    To allow middleboxes to accurately identify GUE packets, UDP magic
576    numbers may be used ([UDPMAG]). A UDP magic number is a sixty-four bit
577    value inserted between the UDP header and the encapsulated protocol
578    headers. The value is a constant for each defined protocol and can be
579    matched by a middlebox to identify (with very high probability) the
580    encapsulated protocol.
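
   As a rough illustration of the identification step just described,
   the following Python sketch shows how a middlebox might match the
   magic number and then use Hlen to locate the payload without parsing
   flags or optional fields. It is a minimal sketch under stated
   assumptions, not part of this specification: the function name
   parse_gue_after_magic is hypothetical, the magic number is assumed
   to be carried in network byte order, the constant is the GUE value
   0xffd871a2:0x217431f9 stated in the text that follows, and the bit
   layout is the version 0 header of section 3.1.

```python
from typing import Optional, Tuple

# Assumption: the 64-bit GUE magic number appears in network byte order
# immediately after the UDP header.
GUE_MAGIC = bytes.fromhex("ffd871a2217431f9")


def parse_gue_after_magic(udp_payload: bytes) -> Optional[Tuple[bool, int, int]]:
    """Hypothetical helper: identify a version 0 GUE packet by magic number.

    Returns (is_control, proto_ctype, payload_offset), or None when the
    payload is not identified as GUE or is too short to be valid.
    """
    # Need the 8-byte magic number plus the first 4 bytes of the GUE header.
    if len(udp_payload) < 12 or udp_payload[:8] != GUE_MAGIC:
        return None
    first = udp_payload[8]
    version = first >> 6            # top two bits: GUE version
    if version != 0:
        return None
    is_control = bool(first & 0x20)  # C bit
    hlen = first & 0x1F              # 5-bit length in 32-bit words
    proto_ctype = udp_payload[9]
    # Hlen excludes the first four bytes of the GUE header.
    payload_offset = 8 + 4 + hlen * 4
    if payload_offset > len(udp_payload):
        return None
    return is_control, proto_ctype, payload_offset


# Example: data message, Hlen = 1 (one 32-bit optional field), proto 4 (IPIP).
packet = GUE_MAGIC + bytes([0x01, 4, 0, 0]) + bytes(4) + b"payload"
result = parse_gue_after_magic(packet)
```

   The offset returned is relative to the start of the UDP payload, so
   the encapsulated packet in the example begins at packet[16:].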

582    The sixty-four bit magic number for GUE is 0xffd871a2:0x217431f9, so
583    a GUE packet with a UDP magic number would have the format:

585     0                   1                   2                   3
586     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
587    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
588    |          Source port          |       Destination port        |
589    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
590    |            Length             |           Checksum            |
591    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
592    |                          0xffd871a2                           |
593    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
594    |                          0x217431f9                           |
595    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
596    |0x0|C|  Hlen   |  Proto/ctype  |            Flags            |E|
597    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
598    |                                                               |
599    ~                       Fields (optional)                       ~
600    |                                                               |
601    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

603 5.7. NAT

605    IP address and port translation can be performed on the UDP/IP
606    headers adhering to the requirements for NAT with UDP [RFC4787]. In
607    the case of stateful NAT, connection semantics must be applied to a
608    GUE tunnel as described above.

610    When using transport mode encapsulation and traversing a NAT, the IP
611    addresses may be changed such that the pseudo header checksum used
612    for checksum calculation is modified and the checksum will be found
613    invalid at the receiver. To compensate for this, a GUE option can be
614    added which contains the checksum over the source and destination
615    addresses when the packet is transmitted. Upon receiving this option,
616    the delta of the pseudo header checksum is computed by subtracting
617    the checksum over the source and destination addresses from the
618    checksum value in the option. The resultant value is then added into
619    checksum calculation when validating the inner transport checksum.

621 5.8. Checksum Handling

623    This section describes the requirements around the UDP checksum and
624    GUE header checksum.
         Checksums are an important consideration in
625      that they can provide end to end validation and protect against
626      packet mis-delivery. The latter is made possible by the inclusion of
627      a pseudo header that covers the IP addresses and UDP ports of the
628      encapsulating headers.

630   5.8.1. Checksum requirements

632      The potential for mis-delivery of packets due to corruption of IP,
633      UDP, or GUE headers must be considered. One of the following
634      requirements must be met:

636         o UDP checksums are enabled (for IPv4 or IPv6).

638         o The GUE header checksum is used.

640         o Zero UDP checksums are used in accordance with applicable
641           requirements in [GREUDP], [RFC6935], and [RFC6936].

643   5.8.2. GUE header checksum

645      The GUE header checksum (in version 0x0) provides a UDP-lite
646      [RFC3828] type of checksum capability as an optional field of the
647      GUE header. The GUE header checksum minimally covers the GUE header
648      and a GUE pseudo header. The GUE pseudo header includes the
649      corresponding IP addresses as well as the UDP ports of the
650      encapsulating headers. This checksum should provide adequate
651      protection against address corruption in IPv6 when the UDP checksum
652      is zero. Additionally, the GUE checksum provides protection of the
653      GUE header when the UDP checksum is set to zero with either IPv4 or
654      IPv6. The GUE header checksum is defined in [GUECSUM].

656   5.8.3. UDP Checksum with IPv4

658      For UDP in IPv4, the UDP checksum MUST be processed as specified in
659      [RFC768] and [RFC1122] for both transmit and receive. An
660      encapsulator MAY set the UDP checksum to zero for performance or
661      implementation considerations. The IPv4 header includes a checksum
662      that protects against mis-delivery of the packet due to corruption
663      of IP addresses. The UDP checksum potentially provides protection
664      against corruption of the UDP header, GUE header, and GUE payload.
665      Enabling or disabling the use of checksums is a deployment
666      consideration that should take into account the risk and effects of
667      packet corruption, and whether the packets in the network are
668      already adequately protected by other, possibly stronger mechanisms
669      such as the Ethernet CRC. If an encapsulator sets a zero UDP
670      checksum for IPv4 it SHOULD use the GUE header checksum as described
671      in section 5.8.2.

673      When a decapsulator receives a packet, the UDP checksum field MUST
674      be processed. If the UDP checksum is non-zero, the decapsulator MUST
675      verify the checksum before accepting the packet. By default a
676      decapsulator SHOULD accept UDP packets with a zero checksum. A node
677      MAY be configured to disallow zero checksums per [RFC1122]; this may
678      be done selectively, for instance disallowing zero checksums from
679      certain hosts that are known to be sending over paths subject to
680      packet corruption. If verification of a non-zero checksum fails, a
681      decapsulator lacks the capability to verify a non-zero checksum, or
682      a packet with a zero checksum was received and the decapsulator is
683      configured to disallow them, the packet MUST be dropped and an event
684      MAY be logged.

686   5.8.4. UDP Checksum with IPv6

688      For UDP in IPv6, the UDP checksum MUST be processed as specified in
689      [RFC768] and [RFC2460] for both transmit and receive. Unlike IPv4,
690      there is no header checksum in IPv6 that protects against
691      mis-delivery due to address corruption. Therefore, when GUE is used
692      over IPv6, either the UDP checksum must be enabled or the GUE header
693      checksum must be used. An encapsulator MAY set a zero UDP checksum
694      for performance or implementation reasons, in which case the GUE
695      header checksum MUST be used or applicable requirements for using
696      zero UDP checksums in [GREUDP] MUST be met. If the UDP checksum is
697      enabled, then the GUE header checksum should not be used since it is
698      mostly redundant.
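
         The receive rules for IPv4 in section 5.8.3 and the transmit rule
         for IPv6 above can be summarized as two small decision functions.
         This is only a sketch: the names, the Verdict type, and the boolean
         parameters are illustrative assumptions, and actual checksum
         verification is assumed to happen elsewhere and be passed in as
         csum_valid.

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"
    DROP = "drop"


def ipv4_decap_verdict(udp_csum, csum_valid, can_verify=True, allow_zero=True):
    """Decide what an IPv4 decapsulator does with a received GUE packet.

    udp_csum   -- value of the UDP checksum field
    csum_valid -- result of verifying a non-zero checksum (ignored if zero)
    can_verify -- whether this node is able to verify UDP checksums
    allow_zero -- local policy; zero checksums are accepted by default
    """
    if udp_csum == 0:
        return Verdict.ACCEPT if allow_zero else Verdict.DROP
    if not can_verify or not csum_valid:
        return Verdict.DROP
    return Verdict.ACCEPT


def ipv6_encap_config_ok(udp_csum_enabled, gue_hdr_csum,
                         meets_zero_csum_reqs=False):
    """Check an IPv6 encapsulator configuration against the rule above.

    A zero UDP checksum is only acceptable if the GUE header checksum is
    used or the zero-checksum requirements in [GREUDP] are met.
    """
    if udp_csum_enabled:
        return True
    return gue_hdr_csum or meets_zero_csum_reqs
```

         For example, ipv4_decap_verdict(0, False, allow_zero=False) yields
         Verdict.DROP, matching the case of a node configured to disallow
         zero checksums.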

700      When a decapsulator receives a packet, the UDP checksum field MUST
701      be processed. If the UDP checksum is non-zero, the decapsulator MUST
702      verify the checksum before accepting the packet. By default a
703      decapsulator MUST only accept UDP packets with a zero checksum if
704      the GUE header checksum is used and is verified. If verification of
705      a non-zero checksum fails, a decapsulator lacks the capability to
706      verify a non-zero checksum, or a packet with a zero checksum and no
707      GUE header checksum was received, the packet MUST be dropped and an
708      event MAY be logged.

710   5.9. MTU and fragmentation

712      Standard conventions for handling of MTU (Maximum Transmission Unit)
713      and fragmentation in conjunction with networking tunnels
714      (encapsulation of layer 2 or layer 3 packets) should be followed.
715      Details are described in "MTU and Fragmentation Issues with
716      In-the-Network Tunneling" [RFC4459].

718      If a packet is fragmented before encapsulation in GUE, all the
719      related fragments must be encapsulated using the same source port
720      (inner flow identifier). An operator may set the MTU to account for
721      encapsulation overhead and reduce the likelihood of fragmentation.

723   5.10. Congestion control

725      Per requirements of [RFC5405], if the IP traffic encapsulated with
726      GUE implements proper congestion control no additional mechanisms
727      should be required.

729      In the case that the encapsulated traffic does not implement any or
730      sufficient control, or it is not known whether a transmitter will
731      consistently implement proper congestion control, then congestion
732      control at the encapsulation layer must be provided. Note this case
733      applies to a significant use case in network virtualization in which
734      guests run third party networking stacks that cannot be implicitly
735      trusted to implement conformant congestion control.

737      Out of band mechanisms such as rate limiting, Managed Circuit
738      Breaker, or traffic isolation may be used to provide rudimentary
739      congestion control. For finer grained congestion control that allows
740      alternate congestion control algorithms, reaction time within an
741      RTT, and interaction with ECN, in-band mechanisms may be warranted.

743      DCCP may be used to provide congestion control for encapsulated
744      flows. In this case, the protocol stack for an IP tunnel may be
745      IP-GUE-DCCP-IP. Alternatively, GUE can be extended to include
746      congestion control (related data carried in GUE optional fields).
747      Congestion control mechanisms for GUE will be elaborated in other
748      specifications.

750   5.11. Multicast

752      GUE packets may be multicast to decapsulators using a multicast
753      destination address in the encapsulating IP headers. Each receiving
754      host will decapsulate the packet independently following normal
755      decapsulator operations. The receiving decapsulators should agree on
756      the same set of GUE parameters and properties.

758      GUE allows encapsulation of unicast, broadcast, or multicast
759      traffic. Entropy for the inner flow identifier (UDP source port) may
760      be generated from the header of encapsulated unicast or
761      broadcast/multicast packets at an encapsulator. The mapping
762      mechanism between the encapsulated multicast traffic and the
763      multicast capability in the IP network is transparent to and
764      independent of the encapsulation and is otherwise outside the scope
765      of this document.

767   6. Inner flow identifier properties

769   6.1. Flow classification

771      A major objective of using GUE is that a network device can perform
772      flow classification corresponding to the flow of the inner
773      encapsulated packet based on the contents in the outer headers.

775      Hardware devices commonly perform hash computations on packet
776      headers to classify packets into flows or flow buckets.
777      Flow classification is done to support load balancing (statistical
778      multiplexing) of flows across a set of networking resources.
779      Examples of such load balancing techniques are Equal Cost Multipath
780      routing (ECMP), port selection in Link Aggregation, and NIC device
781      Receive Side Scaling (RSS). Hashes are usually either a three-tuple
782      hash of IP protocol, source address, and destination address; or a
783      five-tuple hash consisting of IP protocol, source address,
784      destination address, source port, and destination port. Typically,
785      networking hardware will compute five-tuple hashes for TCP and UDP,
786      but only three-tuple hashes for other IP protocols. Since the
787      five-tuple hash provides more granularity, load balancing can be finer
788      grained with better distribution. When a packet is encapsulated with
789      GUE, the source port in the outer UDP packet is set to reflect the
790      flow of the inner packet. When a device computes a five-tuple hash
791      on the outer UDP/IP header of a GUE packet, the resultant value
792      classifies the packet per its inner flow.

794      To support flow classification, the source port of the UDP header in
795      GUE is set to a value that maps to the inner flow. This is referred
796      to as the inner flow identifier. The inner flow identifier is set by
797      the encapsulator; it can be computed on the fly based on packet
798      contents or retrieved from a state maintained for the inner flow.

800      Examples of deriving an inner flow identifier are:

802         o If the encapsulated packet is a layer 4 packet, TCP/IPv4 for
803           instance, the inner flow identifier could be based on the
804           canonical five-tuple hash of the inner packet.

806         o If the encapsulated packet is an AH transport mode packet with
807           TCP as next header, the inner flow identifier could be a hash
808           over a three-tuple: TCP protocol and TCP ports of the
809           encapsulated packet.

811         o If a node is encrypting a packet using ESP tunnel mode and GUE
812           encapsulation, the inner flow identifier could be based on the
813           contents of the clear-text packet. For instance, a canonical
814           five-tuple hash for a TCP/IP packet could be used.

816   6.2. Inner flow identifier properties

818      The inner flow identifier is the value set in the UDP source
819      port of a GUE packet. The inner flow identifier should adhere to
820      the following properties:

822         o The value set in the source port should be within the ephemeral
823           port range. IANA suggests this range to be 49152 to 65535, where
824           the high order two bits of the port are set to one. This
825           provides fourteen bits of entropy for the inner flow identifier.

827         o The inner flow identifier should have a uniform distribution
828           across encapsulated flows.

830         o An encapsulator may occasionally change the inner flow
831           identifier used for an inner flow per its discretion (for
832           security, route selection, etc.). Changing the value should
833           happen no more than once every thirty seconds.

835         o Decapsulators, or any networking devices, should not attempt any
836           interpretation of the inner flow identifier, nor should they
837           attempt to reproduce any hash calculation. They may use the
838           value to match further received packets for steering decisions,
839           but cannot assume that the hash uniquely or permanently
840           identifies a flow.

842         o Input to the inner flow identifier is not restricted to ports
843           and addresses; input could include the flow label from an IPv6
844           packet, the SPI from an ESP packet, or other flow related state
845           in the encapsulator that is not necessarily conveyed in the
              packet.

847         o The assignment function for inner flow identifiers should be
848           randomly seeded to mitigate denial of service attacks. The seed
849           may be changed periodically.

851   7. Motivation for GUE

853      This section presents the motivation for GUE with respect to other
854      encapsulation methods.
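
         Because the defined semantics of the UDP source port are central
         to that motivation, a brief sketch of an assignment function that
         follows the properties listed in section 6.2 may be useful. It is
         an assumption-laden illustration, not a mandated algorithm: the
         class name, the use of a keyed BLAKE2b hash, and the string form
         of the five-tuple are all choices made for this example only.

```python
import hashlib
import os

EPHEMERAL_BASE = 49152   # 0xC000: high order two bits of the port set
ENTROPY_MASK = 0x3FFF    # remaining fourteen bits carry the flow entropy


class FlowIdAssigner:
    """Hypothetical encapsulator-side helper for inner flow identifiers."""

    def __init__(self, seed=None):
        # Random seed so that outside parties cannot predict the mapping.
        self._seed = seed if seed is not None else os.urandom(16)

    def flow_id(self, proto, src_addr, dst_addr, src_port, dst_port):
        """Map an inner five-tuple to a UDP source port in 49152..65535."""
        material = repr((proto, src_addr, dst_addr,
                         src_port, dst_port)).encode()
        digest = hashlib.blake2b(material, key=self._seed,
                                 digest_size=2).digest()
        return EPHEMERAL_BASE | (int.from_bytes(digest, "big") & ENTROPY_MASK)
```

         With a fixed seed the same inner five-tuple always maps to the
         same source port, which keeps the packets of one flow on one path.
         Other inputs mentioned in section 6.2, such as an IPv6 flow label
         or an ESP SPI, could be added to the hashed material in the same
         way.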

856      A number of different encapsulation techniques have been proposed for
857      the encapsulation of one protocol over another. EtherIP [RFC3378]
858      provides layer 2 tunneling of Ethernet frames over IP. GRE [RFC2784],
859      MPLS [RFC4023], and L2TP [RFC2661] provide methods for tunneling
860      layer 2 and layer 3 packets over IP. NVGRE [NVGRE] and VXLAN
861      [RFC7348] are proposals for encapsulation of layer 2 packets for
862      network virtualization. IPIP [RFC2003] and Generic packet tunneling
863      in IPv6 [RFC2473] provide methods for tunneling IP packets over IP.

865      Several proposals exist for encapsulating packets over UDP including
866      ESP over UDP [RFC3948], TCP directly over UDP [TCPUDP], VXLAN, LISP
867      [RFC6830] which encapsulates layer 3 packets, and Generic UDP
868      Encapsulation for IP Tunneling (GRE over UDP) [GREUDP]. Generic UDP
869      tunneling [GUT] is a proposal similar to GUE in that it aims to
870      tunnel packets of IP protocols over UDP.

872      GUE has the following distinguishing features:

874         o UDP encapsulation leverages specialized network device
875           processing for efficient transport. The semantics for using the
876           UDP source port as an identifier for an inner flow are defined.

878         o GUE permits encapsulation of arbitrary IP protocols, which
879           includes layer 2, 3, and 4 protocols. This potentially allows
880           nearly all traffic within a data center to be normalized to be
881           either TCP or UDP on the wire.

883         o Multiple protocols can be multiplexed over a single UDP port
884           number. This is in contrast to techniques to encapsulate
885           protocols over UDP using a protocol specific port number (such
886           as ESP/UDP, GRE/UDP, SCTP/UDP). GUE provides a uniform and
887           extensible mechanism for encapsulating all IP protocols in UDP
888           with minimal overhead (four bytes of additional header).

890         o GUE is extensible. New flags and optional fields can be defined.

892         o The GUE header includes a header length field.
893           This allows a network node to inspect an encapsulated packet
894           without needing to parse the full encapsulation header.

896         o Private data in the encapsulation header allows local
897           customization and experimentation while being compatible with
898           processing in network nodes (routers and middleboxes).

900         o GUE includes both data messages (encapsulation of packets) and
901           control messages (such as OAM).

903   8. Security Considerations

905      Encapsulation of IP protocols within GUE should not increase security
906      risk, nor provide additional security in itself. As suggested in
907      section 6.2, the source port of UDP packets in GUE should be
908      randomly seeded to mitigate some possible denial of service attacks.

910      Security for Generic UDP Encapsulation, including security for the
911      GUE header and payload, is described in detail in [GUESEC].

913   9. IANA Considerations

915      A user UDP port number has been assigned for GUE:

917         Service Name: gue
918         Transport Protocol(s): UDP
919         Assignee: Tom Herbert
920         Contact: Tom Herbert
921         Description: Generic UDP Encapsulation
922         Reference: draft-herbert-gue
923         Port Number: 6080
924         Service Code: N/A
925         Known Unauthorized Uses: N/A
926         Assignment Notes: N/A

928      IANA is requested to create a "GUE flag-fields" registry to allocate
929      flags and optional fields for the primary GUE header flags and
930      extension flags. This shall be a registry of bit assignments for
931      flags, length of optional fields for corresponding flags, and
932      descriptive strings. There are sixteen bits for primary GUE header
933      flags (bit number 0-15) where bit 15 is reserved as the extension
934      flag in this document. There are thirty-two bits for extension flags.

936   10. Acknowledgements

938      The authors would like to thank David Liu, Erik Nordmark, and Fred
939      Templin for valuable input on this draft.

941   11. References

943   11.1. Normative References

945   11.2.
Informative References

947      [RFC3828] Larzon, L-A., Degermark, M., Pink, S., Jonsson, L-E., Ed.,
948                and G. Fairhurst, Ed., "The Lightweight User Datagram
949                Protocol (UDP-Lite)", RFC 3828, July 2004.

952      [RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
953                L., Sridhar, T., Bursell, M., and C. Wright, "Virtual
954                eXtensible Local Area Network (VXLAN): A Framework for
955                Overlaying Virtualized Layer 2 Networks over Layer 3
956                Networks", RFC 7348, August 2014.

959      [RFC7605] Touch, J., "Recommendations on Using Assigned Transport
960                Port Numbers", BCP 165, RFC 7605, DOI 10.17487/RFC7605,
961                August 2015.

963      [NVGRE]   Garg, P., and Wang, Y., "NVGRE: Network Virtualization
964                using Generic Routing Encapsulation",
965                draft-sridharan-virtualization-nvgre-08.

967      [TCPUDP]  Cheshire, S., Graessley, J., and McGuire, R.,
968                "Encapsulation of TCP and other Transport Protocols over
969                UDP", draft-cheshire-tcp-over-udp-00.

971      [GREUDP]  Crabbe, E., Yong, L., Xu, X., and Herbert, T.,
972                "Generic UDP Encapsulation for IP Tunneling",
973                draft-ietf-tsvwg-gre-in-udp-encap-06.

975      [GUESEC]  Yong, L., Herbert, T., "Generic UDP Encapsulation (GUE)
976                for Secure Transport", draft-hy-gue-4-secure-transport-02,
977                work in progress.

979      [UDPMAG]  Herbert, T., Yong, L., "UDP Magic Numbers",
980                draft-herbert-udp-magic-numbers-01.

982      [GUT]     Manner, J., Varia, N., and Briscoe, B., "Generic UDP
983                Tunnelling (GUT)", draft-manner-tsvwg-gut-02.txt.

985      [REMCSUM] Herbert, T., "Remote Checksum Offload",
986                draft-herbert-remotecsumoffload-00.

988      [LCO]     Cree, E., "Checksum Offloads in the Linux Networking
989                Stack", Linux Documentation,
990                Documentation/networking/checksum-offloads.txt.

991   Appendix A: NIC processing for GUE

993      This appendix provides some guidelines for Network Interface Cards
994      (NICs) to implement common offloads and accelerations to support GUE.
995      Note that most of this discussion is generally applicable to other
996      methods of UDP based encapsulation.

998   A.1. Receive multi-queue

1000     Contemporary NICs support multiple receive descriptor queues
1001     (multi-queue). Multi-queue enables load balancing of network
1002     processing for a NIC across multiple CPUs. On packet reception, a NIC
1003     must select the appropriate queue for host processing. Receive Side
1004     Scaling is a common method which uses the flow hash for a packet to
1005     index an indirection table where each entry stores a queue number.
1006     Flow Director and Accelerated Receive Flow Steering (aRFS) allow a
1007     host to program the queue that is used for a given flow which is
1008     identified either by an explicit five-tuple or by the flow's hash.

1010     GUE encapsulation should be compatible with multi-queue NICs that
1011     support five-tuple hash calculation for UDP/IP packets as input to
1012     RSS. The inner flow identifier (source port) ensures classification
1013     of the encapsulated flow even in the case that the outer source and
1014     destination addresses are the same for all flows (e.g. all flows are
1015     going over a single tunnel).

1017     By default, UDP RSS support is often disabled in NICs to avoid out of
1018     order reception that can occur when UDP packets are fragmented. As
1019     discussed above, fragmentation of GUE packets should be mitigated by
1020     fragmenting packets before entering a tunnel, path MTU discovery in
1021     higher layer protocols, or operators adjusting MTUs. Other UDP
1022     traffic may not implement such procedures to avoid fragmentation, so
1023     enabling UDP RSS support in the NIC should be a considered tradeoff
1024     during configuration.

1026  A.2. Checksum offload

1028     Many NICs provide capabilities to calculate the standard ones
1029     complement payload checksum for packets on transmit or receive.
1030     When using GUE encapsulation there are at least two checksums that
1031     may be of interest: the encapsulated packet's transport checksum, and
1032     the UDP checksum in the outer header.

1034  A.2.1. Transmit checksum offload

1035     NICs may provide a protocol agnostic method to offload transmit
1036     checksum (NETIF_F_HW_CSUM in Linux parlance) that can be used with
1037     GUE. In this method the host provides checksum related parameters in
1038     a transmit descriptor for a packet. These parameters include the
1039     starting offset of data to checksum, the length of data to checksum,
1040     and the offset in the packet where the computed checksum is to be
1041     written. The host initializes the checksum field to the pseudo header
1042     checksum.

1044     In the case of GUE, the checksum for an encapsulated transport layer
1045     packet, a TCP packet for instance, can be offloaded by setting the
1046     appropriate checksum parameters.

1048     NICs typically can offload only one transmit checksum per packet.
1049     Multiple checksums may simultaneously be offloaded using Local
1050     Checksum Offload [LCO]. LCO is a technique for efficiently computing
1051     the outer checksum of an encapsulated datagram when the inner
1052     checksum is due to be offloaded. The ones complement sum of a
1053     correctly checksummed TCP or UDP packet is equal to the complement of
1054     the sum of the pseudo header, because everything else gets 'canceled
1055     out' by the checksum field. This is because the sum was complemented
1056     before being written to the checksum field. More generally, this
1057     holds in any case where the 'IP-style' ones complement checksum is
1058     used, and thus any checksum that TX Checksum Offload supports.
1059     That is, if we have set up TX Checksum Offload with a start/offset
1060     pair, we know that after the device has filled in that checksum, the
1061     ones complement sum from starting offset to the end of the packet
1062     will be equal to the complement of whatever value we put in the
1063     checksum field beforehand.

1065     If an encapsulator is co-resident with a host, then checksum offload
1066     may be performed using remote checksum offload [REMCSUM]. Remote
1067     checksum offload relies on NIC offload of the simple UDP/IP checksum
1068     which is commonly supported even in legacy devices. In remote
1069     checksum offload the outer UDP checksum is set and the GUE header
1070     includes an option indicating the start and offset of the inner
1071     "offloaded" checksum. The inner checksum is initialized to the pseudo
1072     header checksum. When a decapsulator receives a GUE packet with the
1073     remote checksum offload option, it completes the offload operation by
1074     determining the packet checksum from the indicated start point to the
1075     end of the packet, and then adds this into the checksum field at the
1076     offset given in the option. Computing the checksum from the start to
1077     end of packet is efficient if checksum-complete is provided on the
1078     receiver.

1080  A.2.2. Receive checksum offload

1082     GUE is compatible with NICs that perform a protocol agnostic receive
1083     checksum (CHECKSUM_COMPLETE in Linux parlance). In this technique, a
1084     NIC computes a ones complement checksum over all (or some predefined
1085     portion) of a packet. The computed value is provided to the host
1086     stack in the packet's receive descriptor. The host driver can use
1087     this checksum to "patch up" and validate any inner packet transport
1088     checksum, as well as the outer UDP checksum if it is non-zero.

1090     Many legacy NICs don't provide checksum-complete but instead provide
1091     an indication that a checksum has been verified (CHECKSUM_UNNECESSARY
1092     in Linux).
         Usually, such validation is only done for simple TCP/IP or
1093     UDP/IP packets. If a NIC indicates that a UDP checksum is valid, the
1094     checksum-complete value for the UDP packet is the "not" of the pseudo
1095     header checksum. In this way, checksum-unnecessary can be converted
1096     to checksum-complete. So if the NIC provides checksum-unnecessary for
1097     the outer UDP header in an encapsulation, checksum conversion can be
1098     done so that the checksum-complete value is derived and can be used
1099     by the stack to validate any checksums in the encapsulated packet.

1101  A.3. Transmit Segmentation Offload

1103     Transmit Segmentation Offload (TSO) is a NIC feature where a host
1104     provides a large (>MTU size) TCP packet to the NIC, which in turn
1105     splits the packet into separate segments and transmits each one. This
1106     is useful to reduce CPU load on the host.

1108     The process of TSO can be generalized as:

1110        - Split the TCP payload into segments which allow packets with
1111          size less than or equal to MTU.

1113        - For each created segment:

1115          1. Replicate the TCP header and all preceding headers of the
1116             original packet.

1118          2. Set payload length fields in any headers to reflect the
1119             length of the segment.

1121          3. Set TCP sequence number to correctly reflect the offset of
1122             the TCP data in the stream.

1124          4. Recompute and set any checksums that either cover the payload
1125             of the packet or cover a header which was changed by setting
1126             a payload length.

1128     Following this general process, TSO can be extended to support TCP
1129     encapsulation in GUE. For each segment the Ethernet, outer IP, UDP
1130     header, GUE header, inner IP header if tunneling, and TCP headers are
1131     replicated. Any packet length header fields need to be set properly
1132     (including the length in the outer UDP header), and checksums need to
1133     be set correctly (including the outer UDP checksum if being used).
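
         The generalized TSO process above, extended to GUE, can be
         sketched as follows. This is a simplified model rather than a
         device implementation: headers are represented as a few fields of
         a hypothetical Packet record, the inner packet is assumed to be
         IPv4 and TCP without options, and checksum recomputation (step 4)
         is only noted in a comment.

```python
from dataclasses import dataclass, replace
from typing import List

UDP_HLEN = 8        # outer UDP header
INNER_IP_HLEN = 20  # assumption: inner IPv4 header without options
TCP_HLEN = 20       # assumption: TCP header without options


@dataclass(frozen=True)
class Packet:
    """Hypothetical model of a GUE-encapsulated TCP packet."""
    outer_udp_len: int   # Length field of the outer UDP header
    gue_header: bytes    # copied bit for bit into every segment
    inner_ip_len: int    # total length field of the inner IP header
    tcp_seq: int         # TCP sequence number of the first payload byte
    payload: bytes       # TCP payload


def gue_tso(pkt: Packet, mss: int) -> List[Packet]:
    """Split one large packet into segments carrying at most mss bytes."""
    segments = []
    for off in range(0, len(pkt.payload), mss):
        chunk = pkt.payload[off:off + mss]
        inner_len = INNER_IP_HLEN + TCP_HLEN + len(chunk)
        segments.append(replace(
            pkt,                                  # step 1: replicate headers
            payload=chunk,
            inner_ip_len=inner_len,               # step 2: length fields,
            outer_udp_len=UDP_HLEN + len(pkt.gue_header) + inner_len,
            tcp_seq=(pkt.tcp_seq + off) % 2**32,  # step 3: sequence number
        ))
        # Step 4 (recomputing the TCP and outer UDP checksums) is omitted.
    return segments
```

         Note that the GUE header is carried through unchanged, which is
         only valid when it holds no per-segment values.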

1135     To facilitate TSO with GUE it is recommended that optional fields
1136     should not contain values that must be updated on a per segment
1137     basis; for example, the GUE fields should not include checksums,
1138     lengths, or sequence numbers that refer to the payload. If the GUE
1139     header does not contain such fields then the TSO engine only needs to
1140     copy the bits in the GUE header when creating each segment and does
1141     not need to parse the GUE header.

1143  A.4. Large Receive Offload

1145     Large Receive Offload (LRO) is a NIC feature where packets of a TCP
1146     connection are reassembled, or coalesced, in the NIC and delivered to
1147     the host as one large packet. This feature can reduce CPU utilization
1148     in the host.

1150     LRO requires significant protocol awareness to be implemented
1151     correctly and is difficult to generalize. Packets in the same flow
1152     need to be unambiguously identified. In the presence of tunnels or
1153     network virtualization, this may require more than a five-tuple match
1154     (for instance packets for flows in two different virtual networks may
1155     have identical five-tuples). Additionally, a NIC needs to perform
1156     validation over packets that are being coalesced, and needs to
1157     fabricate a single meaningful header from all the coalesced packets.

1159     The conservative approach to supporting LRO for GUE would be to
1160     assign packets to the same flow only if they have identical
1161     five-tuples and were encapsulated the same way. That is, the outer IP
1162     addresses, the outer UDP ports, GUE protocol, GUE flags and fields,
1163     and inner five-tuple are all identical.

1165  Appendix B: Privileged ports

1167     Using the source port to contain an inner flow identifier value
1168     disallows the security method of a receiver enforcing that the source
1169     port be a privileged port. Privileged ports are defined by some
         operating systems to restrict source port binding.
1171     Unix, for instance, considers port numbers less than 1024 to be
         privileged.

1173     Enforcing that packets are sent from a privileged port is widely
1174     considered an inadequate security mechanism and has been mostly
1175     deprecated. To approximate this behavior, an implementation could
1176     restrict a user from sending a packet destined to the GUE port
1177     without proper credentials.

1179  Appendix C: Inner flow identifier as a route selector

1181     An encapsulator generating an inner flow identifier may modulate the
1182     value to perform a type of multipath source routing. Assuming that
1183     networking switches perform ECMP based on the flow hash, a sender can
1184     affect the path by altering the inner flow identifier. For instance,
1185     a host may store a flow hash in its PCB for an inner flow, and may
1186     alter the value upon detecting that packets are traversing a lossy
1187     path. Changing the inner flow identifier for a flow should be subject
1188     to hysteresis (at most once every thirty seconds) to limit the number
1189     of out of order packets.

1191  Appendix D: Hardware protocol implementation considerations

1193     A low level protocol, such as GUE, is likely to be of interest for
1194     support in high speed network devices. Variable length header (VLH)
1195     protocols like GUE are often considered difficult to efficiently
1196     implement in hardware. In order to retain the important
1197     characteristics of an extensible and robust protocol, hardware
1198     vendors may practice "constrained flexibility". In this model, only
1199     certain combinations or protocol header parameterizations are
1200     implemented in the hardware fast path. Each such parameterization is
1201     fixed length so that the particular instance can be optimized as a
1202     fixed length protocol. In the case of GUE this constitutes specific
1203     combinations of GUE flags, fields, and next protocol.
         The selected
1204     combinations would naturally be the most common cases which form the
1205     "fast path", and other combinations are assumed to take the "slow
1206     path".

1208     In time, needs and requirements of the protocol may change, which may
1209     manifest themselves as new parameterizations to be supported in the
1210     fast path. To allow this extensibility, a device practicing
1211     constrained flexibility should allow the fast path parameterizations
1212     to be programmable.

1214  Authors' Addresses

1216     Tom Herbert
1217     Facebook
1218     1 Hacker Way
1219     Menlo Park, CA 94052
1220     US

1222     Email: tom@herbertland.com

1224     Lucy Yong
1225     Huawei USA
1226     5340 Legacy Dr.
1228     Plano, TX 75024
1229     US

1231     Email: lucy.yong@huawei.com

1233     Osama Zia
1234     Microsoft
1235     1 Microsoft Way
1236     Redmond, WA 98029
1237     US

1239     Email: osamaz@microsoft.com