Network Virtualization Overlays (nvo3)                        T. Herbert
Internet-Draft                                                   Facebook
Intended status: Standard track                                  L. Yong
Expires June 24, 2016                                          Huawei USA
                                                                  O. Zia
                                                               Microsoft
                                                       December 22, 2015


                       Generic UDP Encapsulation
                         draft-ietf-nvo3-gue-02

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on June 24, 2016.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Abstract

   This specification describes Generic UDP Encapsulation (GUE), which
   is a scheme for using UDP to encapsulate packets of arbitrary IP
   protocols for transport across layer 3 networks. By encapsulating
   packets in UDP, specialized capabilities in networking hardware for
   efficient handling of UDP packets can be leveraged. GUE specifies
   basic encapsulation methods upon which higher level constructs, such
   as tunnels and overlay networks for network virtualization, can be
   constructed. GUE is extensible by allowing optional data fields as
   part of the encapsulation, and is generic in that it can encapsulate
   packets of various IP protocols.

Table of Contents

   1. Introduction
   2. Packet formats
      2.1. GUE version
      2.2. GUE header
      2.3. Proto/ctype field
      2.4. Flags and optional fields
      2.5. Private data
   3. Message types
      3.1. Control messages
      3.2. Data messages
   4. Operation
      4.1. Network tunnel encapsulation
      4.2. Transport layer encapsulation
      4.3. Encapsulator operation
      4.4. Decapsulator operation
      4.5. Router and switch operation
      4.6. Middlebox interactions
         4.6.1. Inferring connection semantics
         4.6.2. Identification of GUE packets
      4.7. NAT
      4.8. Checksum Handling
         4.8.1. Checksum requirements
         4.8.2. GUE header checksum
         4.8.3. UDP Checksum with IPv4
         4.8.4. UDP Checksum with IPv6
      4.9. MTU and fragmentation
      4.10. Congestion control
      4.11. Multicast
   5. Inner flow identifier properties
      5.1. Flow classification
      5.2. Inner flow identifier properties
   6. Motivation for GUE
   7. Security Considerations
   8. IANA Consideration
   9. Acknowledgements
   10. References
      10.1. Normative References
      10.2. Informative References
   Appendix A: NIC processing for GUE
      A.1. Receive multi-queue
      A.2. Checksum offload
         A.2.1. Transmit checksum offload
         A.2.2. Receive checksum offload
      A.3. Transmit Segmentation Offload
      A.4. Large Receive Offload
   Appendix B: Privileged ports
   Appendix C: Inner flow identifier as a route selector
   Appendix D: Hardware protocol implementation considerations
   Authors' Addresses

1. Introduction

   This specification describes Generic UDP Encapsulation (GUE), which
   is a general method for encapsulating packets of arbitrary IP
   protocols within User Datagram Protocol (UDP) [RFC0768] packets.
   Encapsulating packets in UDP facilitates efficient transport across
   networks. Networking devices widely provide protocol-specific
   processing and optimizations for UDP (as well as TCP) packets.
   Packets for atypical IP protocols (those not usually parsed by
   networking hardware) can be encapsulated in UDP packets to maximize
   deliverability and to leverage flow-specific mechanisms for routing
   and packet steering.

   GUE provides an extensible header format for including optional data
   in the encapsulation header. This data potentially covers items such
   as a virtual network identifier, security data for validating or
   authenticating the GUE header, congestion control data, etc. GUE also
   allows private optional data in the encapsulation header. This
   feature can be used by a site or implementation to define local
   custom optional data, and allows experimentation with options that
   may eventually become standard.

2. Packet formats

   A GUE packet consists of a UDP packet whose payload is a GUE header
   followed by a payload which is either an encapsulated packet of some
   IP protocol or a control message (like an OAM message).
   A GUE packet has the general format:

   +-------------------------------+
   |                               |
   |         UDP/IP header         |
   |                               |
   |-------------------------------|
   |                               |
   |          GUE Header           |
   |                               |
   |-------------------------------|
   |                               |
   |      Encapsulated packet      |
   |       or control message      |
   |                               |
   +-------------------------------+

   The GUE header is variable length as determined by the presence of
   optional fields.

2.1. GUE version

   The first two bits of the GUE header contain the GUE protocol version
   number. The rest of the fields after the GUE version number are
   defined based on the version number. The remainder of this
   specification describes version 0x0 of GUE.

2.2. GUE header

   The header format for version 0x0 of GUE in UDP is:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Source port            |      Destination port         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Length              |          Checksum             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0x0|C|   Hlen  |  Proto/ctype  |             Flags           |E|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                       Fields (optional)                       ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Extension flags (optional)                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                  Extension fields (optional)                  ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                    Private data (optional)                    ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The contents of the UDP header are:

   o  Source port (inner flow identifier): This should be set to a
      value that represents the encapsulated flow. The properties of
      the inner flow identifier are described below.

   o  Destination port: The GUE assigned port number, 6080.

   o  Length: Canonical length of the UDP packet (length of UDP header
      and payload).

   o  Checksum: Standard UDP checksum (see Section 4.8).

   The GUE header consists of:

   o  Ver: GUE protocol version (0x0).

   o  C: Control flag. When set, it indicates a control message; when
      not set, it indicates a data message.

   o  Hlen: Length in 32-bit words of the GUE header, including
      optional fields but not the first four bytes of the header.
      Computed as (header_len - 4) / 4. All GUE headers are a multiple
      of four bytes in length. Because Hlen is a five-bit field, the
      maximum header length is 4 + (31 * 4) = 128 bytes.

   o  Proto/ctype: When the C bit is set, this field contains a control
      message type for the payload. When the C bit is not set, the
      field holds the IP protocol number for the encapsulated packet in
      the payload. The control message or encapsulated packet begins at
      the offset provided by Hlen.

   o  Flags: Header flags that may be allocated for various purposes
      and may indicate presence of optional fields. Undefined header
      flag bits must be set to zero on transmission.

   o  E: Extension flag. Indicates presence of the extension flags
      option in the optional fields.

   o  Fields: Optional fields whose presence is indicated by
      corresponding flags.

   o  Extension flags: An optional field indicated by the E bit. This
      field provides an additional set of thirty-two bits for flags.

   o  Extension fields: Optional fields whose presence is indicated by
      corresponding extension flags.

   o  Private data: Optional private data. If private data is present,
      it immediately follows the last field present in the header.
      The length of this data is determined by subtracting the
      starting offset from the header length.

2.3. Proto/ctype field

   When the C bit is not set, the proto/ctype field must be set to a
   valid IP protocol number.
   The IP protocol number serves as an indication of the type of next
   protocol header which is contained in the GUE payload at the offset
   indicated in Hlen. Intermediate devices may parse the GUE payload per
   the IP protocol number in the proto/ctype field, and header flags
   cannot affect the interpretation of the proto/ctype field.

   IP protocol number 59 ("No next header") may be set to indicate that
   the GUE payload does not begin with the header of an IP protocol.
   This would be the case, for instance, if the GUE payload were a
   fragment when performing GUE-level fragmentation. The interpretation
   of the payload is performed through other means (such as flags and
   optional fields), and intermediate devices must not parse the packet
   based on the IP protocol number in this case.

   When the C bit is set, the proto/ctype field must be set to a valid
   control message type. A value of zero indicates that the GUE payload
   requires further interpretation to deduce the control type. This
   might be the case when the payload is a fragment of a control
   message, where only the reassembled packet can be interpreted as a
   control message.

2.4. Flags and optional fields

   Flags and associated optional fields are the primary mechanism of
   extensibility in GUE. There are sixteen flag bits in the primary GUE
   header, with one reserved to indicate that an optional extension
   flags field is present. The extension flags field contains an
   additional thirty-two flag bits.

   A flag may indicate presence of optional fields. The size of an
   optional field indicated by a flag must be fixed.

   Flags may be paired together to allow different lengths for an
   optional field. For example, if two flag bits are paired, a field may
   possibly be three different lengths (one for each non-zero
   combination of the two bits, with both bits clear meaning the field
   is absent).
   Regardless of how flag bits may be paired, the lengths and offsets of
   optional fields corresponding to a set of flags must be well defined.

   Optional fields are placed in order of the flags. New flags should be
   allocated from high to low order bit contiguously without holes.
   Flags allow random access: for instance, to inspect the field
   corresponding to the Nth flag bit, an implementation only considers
   the previous N-1 flags to determine the offset. Flags after the Nth
   flag are not pertinent in calculating the offset of the Nth flag.

   Flags (or paired flags) are idempotent, such that new flags should
   not cause reinterpretation of old flags. Also, new flags should not
   alter interpretation of other elements in the GUE header nor how the
   message is parsed (for instance, in a data message the proto/ctype
   field always holds an IP protocol number as an invariant).

2.5. Private data

   An implementation may use private data for its own use. The private
   data immediately follows the last field in the GUE header and is not
   a fixed length. This data is considered part of the GUE header and
   must be accounted for in header length (Hlen). The length of the
   private data must be a multiple of four and is determined by
   subtracting the offset of private data in the GUE header from the
   header length. Specifically:

      Private_length = (Hlen * 4) - Length(flags)

   where "Length(flags)" returns the sum of lengths of all the optional
   fields present in the GUE header. When there is no private data
   present, the length of the private data is zero.

   The semantics and interpretation of private data are implementation
   specific. An encapsulator and decapsulator MUST agree on the meaning
   of private data before using it. The private data may be structured
   as necessary; for instance, it might contain its own set of flags and
   optional fields.
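   As a non-normative illustration of the Hlen, field-offset, and
   private data rules above, the following Python sketch parses the
   first word of a version 0x0 GUE header, walks the optional fields in
   flag order, and derives the private data length as
   (Hlen * 4) - Length(flags). This section defines no concrete flags,
   so the flag-to-length table HYPOTHETICAL_FLAGS and the function name
   parse_gue_header are hypothetical. For simplicity, the sketch treats
   any flag missing from the table, including the E bit, as unknown and
   rejects the packet, consistent with the rule in Section 4.4 that
   unknown flags must not be ignored. It does not parse extension flags
   or validate private data.

```python
import struct

# HYPOTHETICAL flag assignments: this section defines no concrete flags.
# Each entry is (bit in the 16-bit Flags field, optional field length in
# bytes), listed from high-order bit to low-order bit, which is also the
# order in which the corresponding optional fields appear.
HYPOTHETICAL_FLAGS = [
    (0x8000, 4),
    (0x4000, 8),
]
KNOWN_FLAGS_MASK = 0
for _bit, _length in HYPOTHETICAL_FLAGS:
    KNOWN_FLAGS_MASK |= _bit


def parse_gue_header(hdr: bytes) -> dict:
    """Parse a version 0x0 GUE header; hdr starts right after the UDP header."""
    if len(hdr) < 4:
        raise ValueError("truncated GUE header")
    (word,) = struct.unpack("!I", hdr[:4])
    version = word >> 30                  # bits 0-1
    control = (word >> 29) & 0x1          # bit 2 (C)
    hlen = (word >> 24) & 0x1F            # bits 3-7: words after the first
    proto_ctype = (word >> 16) & 0xFF     # bits 8-15
    flags = word & 0xFFFF                 # bits 16-31; E is the lowest bit

    if version != 0:
        raise ValueError("unsupported GUE version")
    if flags & ~KNOWN_FLAGS_MASK:
        # Unknown set flags must not be ignored; this sketch also treats E
        # as unknown because it does not parse extension flags.
        raise ValueError("unknown flag set")

    header_len = 4 + hlen * 4
    if len(hdr) < header_len:
        raise ValueError("packet shorter than Hlen indicates")

    # Walk the flags from high to low order: each present field starts
    # where the previous present field ended, so the offset of field N
    # depends only on the flags before it.
    offset = 4
    fields = {}
    for bit, length in HYPOTHETICAL_FLAGS:
        if flags & bit:
            fields[bit] = hdr[offset:offset + length]
            offset += length

    # Private_length = (Hlen * 4) - Length(flags)
    private_len = hlen * 4 - (offset - 4)
    if private_len < 0:
        raise ValueError("Hlen too small for the optional fields present")

    return {
        "control": control,
        "proto_ctype": proto_ctype,
        "fields": fields,
        "private_data": hdr[offset:header_len],
        "payload_offset": header_len,
    }
```

   For example, a header whose first word is 0x03048000 has Hlen 3,
   proto/ctype 4 (IPv4), and only the hypothetical 4-byte flag 0x8000
   set, so it carries (3 * 4) - 4 = 8 bytes of private data, and the
   encapsulated packet begins at byte offset 16.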

   If a decapsulator receives a GUE packet with private data, it MUST
   validate the private data appropriately. If a decapsulator does not
   expect private data from an encapsulator, the packet MUST be dropped.
   If a decapsulator cannot validate the contents of private data per
   the provided semantics, the packet MUST also be dropped. An
   implementation may place security data in GUE private data which must
   be verified for packet acceptance.

3. Message types

3.1. Control messages

   Control messages are indicated in the GUE header when the C bit is
   set. The payload is interpreted as a control message with the type
   specified in the proto/ctype field. The format and contents of the
   control message are indicated by the type and can be variable length.

   Other than interpreting the proto/ctype field as a control message
   type, the meaning and semantics of the rest of the elements in the
   GUE header are the same as those of data messages. Forwarding and
   routing of control messages should be the same as that of a data
   message with the same outer IP and UDP header and GUE flags; this
   ensures that control messages can be created that follow the same
   path as data messages.

   Control messages can be defined for OAM-type messages. For instance,
   an echo request and corresponding echo reply message may be defined
   to test for liveness.

3.2. Data messages

   Data messages are indicated in the GUE header with the C bit not set.
   The payload of a data message is interpreted as an encapsulated
   packet of the IP protocol indicated in the proto/ctype field. The
   packet immediately follows the GUE header.

   Data messages are a primary means of encapsulation and can be used to
   create tunnels for overlay networks.

4. Operation

   The figure below illustrates the use of GUE encapsulation between two
   servers. Server 1 is sending packets to server 2.
   An encapsulator performs encapsulation of packets from server 1.
   These encapsulated packets traverse the network as UDP packets. At
   the decapsulator, packets are decapsulated and sent on to server 2.
   Packet flow in the reverse direction need not be symmetric; GUE
   encapsulation is not required in the reverse path.

   +---------------+                       +---------------+
   |               |                       |               |
   |   Server 1    |                       |   Server 2    |
   |               |                       |               |
   +---------------+                       +---------------+
           |                                       ^
           V                                       |
   +---------------+   +---------------+   +---------------+
   |               |   |               |   |               |
   |  Encapsulator |-->|    Layer 3    |-->|  Decapsulator |
   |               |   |    Network    |   |               |
   +---------------+   +---------------+   +---------------+

   The encapsulator and decapsulator may be co-resident with the
   corresponding servers, or may be on separate nodes in the network.

4.1. Network tunnel encapsulation

   Network tunneling can be achieved by encapsulating layer 2 or layer 3
   packets. In this case, the encapsulator and decapsulator nodes are
   the tunnel endpoints. These could be routers that provide network
   tunnels on behalf of communicating servers.

4.2. Transport layer encapsulation

   When encapsulating layer 4 packets, the encapsulator and decapsulator
   should be co-resident with the servers. In this case, the
   encapsulation headers are inserted between the IP header and the
   transport packet. The addresses in the IP header refer to both the
   endpoints of the encapsulation and the endpoints for terminating the
   transport protocol.

4.3. Encapsulator operation

   Encapsulators create GUE data messages, set the source port to the
   inner flow identifier, set flags and optional fields in the GUE
   header, and forward packets to a decapsulator.

   An encapsulator may be an end host originating the packets of a flow,
   or may be a network device performing encapsulation on behalf of
   servers (routers implementing tunnels, for instance).
   In either case, the intended target (decapsulator) is indicated by
   the outer destination IP address.

   If an encapsulator is tunneling packets, that is, encapsulating
   packets of layer 2 or layer 3 protocols (e.g., EtherIP, IPIP, ESP
   tunnel mode), it should follow standard conventions for tunneling of
   one IP protocol over another. Diffserv interaction with tunnels is
   described in [RFC2983], and ECN propagation for tunnels is described
   in [RFC6040].

4.4. Decapsulator operation

   A decapsulator performs decapsulation of GUE packets. A decapsulator
   is addressed by the outer destination IP address of a GUE packet.
   The decapsulator validates packets, including fields of the GUE
   header. If a packet is acceptable, the UDP and GUE headers are
   removed and the packet is resubmitted for IP protocol processing, or
   for control message processing if it is a control message.

   If a decapsulator receives a GUE packet with an unsupported version,
   unknown flag, bad header length (too small for included optional
   fields), unknown control message type, or an otherwise malformed
   header, it must drop the packet and may log the event. No error
   message is returned back to the encapsulator. Note that set flags in
   GUE that are unknown to a decapsulator MUST NOT be ignored. If a GUE
   packet is received by a decapsulator with unknown flags, the packet
   MUST be dropped.

4.5. Router and switch operation

   Routers and switches should forward GUE packets as standard UDP/IP
   packets. The outer five-tuple should contain sufficient information
   to perform flow classification corresponding to the flow of the inner
   packet. A switch should not normally need to parse a GUE header, and
   none of the flags or optional fields in the GUE header should affect
   routing.

   A router should not modify a GUE header when forwarding a packet.
   It may encapsulate a GUE packet in another GUE packet, for instance
   to implement a network tunnel. In this case, the router takes the
   role of an encapsulator, and the corresponding decapsulator is the
   logical endpoint of the tunnel.

4.6. Middlebox interactions

   A middlebox may interpret some flags and optional fields of the GUE
   header for classification purposes, but is not required to understand
   all flags and fields in GUE packets. A middlebox should not drop a
   GUE packet because there are flags unknown to it. The header length
   in the GUE header allows a middlebox to inspect the payload packet
   without needing to parse the flags or optional fields.

4.6.1. Inferring connection semantics

   A middlebox may infer bidirectional connection semantics for a UDP
   flow. For instance, a stateful firewall may create a five-tuple rule
   to match flows on egress, and a corresponding five-tuple rule for
   matching ingress packets where the roles of source and destination
   are reversed for the IP addresses and UDP port numbers. To operate in
   this environment, a GUE tunnel must assume connected semantics
   defined by the UDP five-tuple, and the use of GUE encapsulation must
   be symmetric between both endpoints. The source port set in the UDP
   header must be the destination port the peer would set for replies.

4.6.2. Identification of GUE packets

   For a middlebox to be able to parse GUE packets, it must be able to
   positively identify GUE packets. The obvious way to do this would be
   by destination port number; that is, if the destination UDP port
   matches the GUE port number, the packet would be considered GUE. This
   method is problematic since port numbers do not have global meaning
   ([RFC7605]), and a packet which is not GUE but is destined to the
   same port number could be misinterpreted.

   To allow middleboxes to accurately identify GUE packets, UDP magic
   numbers may be used ([UDPMAG]).
   A UDP magic number is a sixty-four bit
   value inserted between the UDP header and the encapsulated protocol
   headers. The value is a constant for each defined protocol and can be
   matched by a middlebox to identify (with very high probability) the
   encapsulated protocol.

   The sixty-four bit magic number for GUE is 0xffd871a2:0x217431f9, so
   a GUE packet with a UDP magic number would have the format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Source port            |      Destination port         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Length              |          Checksum             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          0xffd871a2                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          0x217431f9                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0x0|C|   Hlen  |  Proto/ctype  |             Flags           |E|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   ~                       Fields (optional)                       ~
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.7. NAT

   IP address and port translation can be performed on the UDP/IP
   headers adhering to the requirements for NAT with UDP [RFC4787]. In
   the case of stateful NAT, connection semantics must be applied to a
   GUE tunnel as described in Section 4.6.1.

   When using transport mode encapsulation and traversing a NAT, the IP
   addresses may be changed such that the pseudo header checksum used
   for checksum calculation is modified and the checksum will be found
   invalid at the receiver. To compensate for this, a GUE option can be
   added which contains the checksum over the source and destination
   addresses when the packet is transmitted.
   Upon receiving this option, the delta of the pseudo header checksum
   is computed by subtracting the checksum over the source and
   destination addresses from the checksum value in the option. The
   resultant value is then added into the checksum calculation when
   validating the inner transport checksum.

4.8. Checksum Handling

   This section describes the requirements around the UDP checksum and
   GUE header checksum. Checksums are an important consideration in that
   they can provide end-to-end validation and protect against packet
   mis-delivery. The latter is made possible by the inclusion of a
   pseudo header that covers the IP addresses and UDP ports of the
   encapsulating headers.

4.8.1. Checksum requirements

   The potential for mis-delivery of packets due to corruption of IP,
   UDP, or GUE headers must be considered. One of the following
   requirements must be met:

   o  UDP checksums are enabled (for IPv4 or IPv6).

   o  The GUE header checksum is used.

   o  Zero UDP checksums are used in accordance with applicable
      requirements in [GREUDP], [RFC6935], and [RFC6936].

4.8.2. GUE header checksum

   The GUE header checksum provides a UDP-lite [RFC3828] type of
   checksum capability as an optional field of the GUE header. The GUE
   header checksum minimally covers the GUE header and a GUE pseudo
   header. The GUE pseudo header includes the corresponding IP
   addresses as well as the UDP ports of the encapsulating headers.
   This checksum should provide adequate protection against address
   corruption in IPv6 when the UDP checksum is zero. Additionally, the
   GUE checksum provides protection of the GUE header when the UDP
   checksum is set to zero with either IPv4 or IPv6. The GUE header
   checksum is defined in [GUECSUM].

4.8.3. UDP Checksum with IPv4

   For UDP in IPv4, the UDP checksum MUST be processed as specified in
   [RFC0768] and [RFC1122] for both transmit and receive.
   An encapsulator MAY set the UDP checksum to zero for performance or
   implementation considerations. The IPv4 header includes a checksum
   that protects against mis-delivery of the packet due to corruption
   of IP addresses. The UDP checksum potentially provides protection
   against corruption of the UDP header, GUE header, and GUE payload.
   Enabling or disabling the use of checksums is a deployment
   consideration that should take into account the risk and effects of
   packet corruption, and whether the packets in the network are
   already adequately protected by other, possibly stronger mechanisms
   such as the Ethernet CRC. If an encapsulator sets a zero UDP
   checksum for IPv4, it SHOULD use the GUE header checksum as
   described in Section 4.8.2.

   When a decapsulator receives a packet, the UDP checksum field MUST be
   processed. If the UDP checksum is non-zero, the decapsulator MUST
   verify the checksum before accepting the packet. By default, a
   decapsulator SHOULD accept UDP packets with a zero checksum. A node
   MAY be configured to disallow zero checksums per [RFC1122]; this may
   be done selectively, for instance disallowing zero checksums from
   certain hosts that are known to be sending over paths subject to
   packet corruption. If verification of a non-zero checksum fails, a
   decapsulator lacks the capability to verify a non-zero checksum, or a
   packet with a zero checksum was received and the decapsulator is
   configured to disallow such packets, the packet MUST be dropped and
   an event MAY be logged.

4.8.4. UDP Checksum with IPv6

   For UDP in IPv6, the UDP checksum MUST be processed as specified in
   [RFC0768] and [RFC2460] for both transmit and receive. Unlike IPv4,
   there is no header checksum in IPv6 that protects against
   mis-delivery due to address corruption. Therefore, when GUE is used
   over IPv6, either the UDP checksum must be enabled or the GUE header
   checksum must be used.
An encapsulator MAY set a zero UDP checksum for performance or implementation reasons. In that case, either the GUE header checksum MUST be used or the applicable requirements for using zero UDP checksums in [GREUDP] MUST be met. If the UDP checksum is enabled, the GUE header checksum should not be used, since it is mostly redundant.

When a decapsulator receives a packet, the UDP checksum field MUST be processed. If the UDP checksum is non-zero, the decapsulator MUST verify the checksum before accepting the packet. By default, a decapsulator MUST only accept UDP packets with a zero checksum if the GUE header checksum is used and is verified. The packet MUST be dropped, and an event MAY be logged, in any of these cases: verification of a non-zero checksum fails; the decapsulator lacks the capability to verify a non-zero checksum; or a packet with a zero checksum and no GUE header checksum is received.

4.9. MTU and fragmentation

Standard conventions for handling MTU (Maximum Transmission Unit) and fragmentation in conjunction with networking tunnels (encapsulation of layer 2 or layer 3 packets) should be followed. Details are described in "MTU and Fragmentation Issues with In-the-Network Tunneling" [RFC4459].

If a packet is fragmented before encapsulation in GUE, all the related fragments must be encapsulated using the same source port (inner flow identifier). An operator may set the MTU to account for encapsulation overhead and reduce the likelihood of fragmentation.

4.10. Congestion control

Per the requirements of [RFC5405], if the IP traffic encapsulated with GUE implements proper congestion control, no additional mechanisms should be required.
Congestion control at the encapsulation layer must be provided in two cases: when the encapsulated traffic implements no congestion control or insufficient congestion control, and when it is not known whether a transmitter will consistently implement proper congestion control. This applies to a significant use case in network virtualization, in which guests run third-party networking stacks that cannot be implicitly trusted to implement conformant congestion control.

Out-of-band mechanisms such as rate limiting, Managed Circuit Breaker, or traffic isolation may be used to provide rudimentary congestion control. In-band mechanisms may be warranted for finer-grained congestion control, which allows alternate congestion control algorithms, reaction time within an RTT, and interaction with ECN.

DCCP may be used to provide congestion control for encapsulated flows. In this case, the protocol stack for an IP tunnel may be IP-GUE-DCCP-IP. Alternatively, GUE can be extended to include congestion control, with related data carried in GUE optional fields. Congestion control mechanisms for GUE will be elaborated in other specifications.

4.11. Multicast

GUE packets may be multicast to decapsulators using a multicast destination address in the encapsulating IP headers. Each receiving host will decapsulate the packet independently following normal decapsulator operations. The receiving decapsulators should agree on the same set of GUE parameters and properties.

GUE allows encapsulation of unicast, broadcast, or multicast traffic. An encapsulator may generate entropy for the inner flow identifier (UDP source port) from the header of encapsulated unicast, broadcast, or multicast packets.
The mapping mechanism between the encapsulated multicast traffic and the multicast capability in the IP network is transparent to and independent of the encapsulation, and is otherwise outside the scope of this document.

5. Inner flow identifier properties

5.1. Flow classification

A major objective of using GUE is that a network device can classify the flow of the inner encapsulated packet based on the contents of the outer headers.

Hardware devices commonly perform hash computations on packet headers to classify packets into flows or flow buckets. Flow classification is done to support load balancing (statistical multiplexing) of flows across a set of networking resources. Examples of such load balancing techniques are Equal Cost Multipath routing (ECMP), port selection in Link Aggregation, and NIC device Receive Side Scaling (RSS). Hashes are usually one of two kinds: a three-tuple hash of IP protocol, source address, and destination address; or a five-tuple hash of IP protocol, source address, destination address, source port, and destination port. Typically, networking hardware will compute five-tuple hashes for TCP and UDP, but only three-tuple hashes for other IP protocols. Since the five-tuple hash provides more granularity, load balancing can be finer grained with better distribution. When a packet is encapsulated with GUE, the source port in the outer UDP packet is set to reflect the flow of the inner packet. When a device computes a five-tuple hash on the outer UDP/IP header of a GUE packet, the resultant value classifies the packet according to its inner flow.

To support flow classification, the source port of the UDP header in GUE is set to a value that maps to the inner flow. This value is referred to as the inner flow identifier.
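The following Python sketch shows one way an encapsulator might compute the inner flow identifier on the fly from an inner five-tuple. It follows the properties listed in Section 5.2: the value falls in the ephemeral range 49152 to 65535, and the assignment is randomly seeded. The keyed BLAKE2s hash and the class name are assumptions chosen for illustration. This document does not mandate any particular hash function.

```python
import hashlib
import os
from typing import Optional

EPHEMERAL_BASE = 0xC000  # 49152: the two high-order bits of the port are one
ENTROPY_MASK = 0x3FFF    # the remaining fourteen bits carry the flow entropy

class InnerFlowIdAssigner:
    """Hypothetical encapsulator helper: inner five-tuple -> outer UDP source port."""

    def __init__(self, seed: Optional[bytes] = None) -> None:
        # A random secret seed makes the mapping hard for an attacker to predict.
        self.seed = seed if seed is not None else os.urandom(16)

    def flow_id(self, proto: int, src: str, dst: str, sport: int, dport: int) -> int:
        key = f"{proto}|{src}|{dst}|{sport}|{dport}".encode()
        digest = hashlib.blake2s(key, digest_size=4, key=self.seed).digest()
        # Keep fourteen hash bits and force the port into 49152-65535.
        return EPHEMERAL_BASE | (int.from_bytes(digest, "big") & ENTROPY_MASK)

assigner = InnerFlowIdAssigner()
port = assigner.flow_id(6, "10.0.0.1", "10.0.0.2", 40000, 80)
```

Because the seed is per encapsulator, the same inner flow always maps to the same port until the seed changes. Changing the seed remaps every flow at once, so it should be infrequent.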
The inner flow identifier is set by the encapsulator. It can be computed on the fly from packet contents or retrieved from state maintained for the inner flow.

Examples of deriving an inner flow identifier are:

o If the encapsulated packet is a layer 4 packet, TCP/IPv4 for instance, the inner flow identifier could be based on the canonical five-tuple hash of the inner packet.

o If the encapsulated packet is an AH transport mode packet with TCP as next header, the inner flow identifier could be a hash over a three-tuple: TCP protocol and TCP ports of the encapsulated packet.

o If a node is encrypting a packet using ESP tunnel mode and GUE encapsulation, the inner flow identifier could be based on the contents of the clear-text packet. For instance, a canonical five-tuple hash for a TCP/IP packet could be used.

5.2. Inner flow identifier properties

The inner flow identifier is the value set in the UDP source port of a GUE packet. The inner flow identifier should adhere to the following properties:

o The value set in the source port should be within the ephemeral port range. IANA suggests this range to be 49152 to 65535, where the two high-order bits of the port are set to one. This provides fourteen bits of entropy for the inner flow identifier.

o The inner flow identifier should have a uniform distribution across encapsulated flows.

o An encapsulator may occasionally change the inner flow identifier used for an inner flow at its discretion (for security, route selection, etc.). The value should change no more than once every thirty seconds.

o Decapsulators, or any networking devices, should not attempt any interpretation of the inner flow identifier, nor should they attempt to reproduce any hash calculation.
  They may use the value to match further received packets for steering decisions, but cannot assume that the hash uniquely or permanently identifies a flow.

o Input to the inner flow identifier is not restricted to ports and addresses. Input could include the flow label from an IPv6 packet, the SPI from an ESP packet, or other flow-related state in the encapsulator that is not necessarily conveyed in the packet.

o The assignment function for inner flow identifiers should be randomly seeded to mitigate denial of service attacks. The seed may be changed periodically.

6. Motivation for GUE

This section presents the motivation for GUE with respect to other encapsulation methods.

A number of different techniques have been proposed for encapsulating one protocol over another. EtherIP [RFC3378] provides layer 2 tunneling of Ethernet frames over IP. GRE [RFC2784], MPLS [RFC4023], and L2TP [RFC2661] provide methods for tunneling layer 2 and layer 3 packets over IP. NVGRE [NVGRE] and VXLAN [RFC7348] are proposals for encapsulation of layer 2 packets for network virtualization. IPIP [RFC2003] and Generic Packet Tunneling in IPv6 [RFC2473] provide methods for tunneling IP packets over IP.

Several proposals exist for encapsulating packets over UDP:

o ESP over UDP [RFC3948].

o TCP directly over UDP [TCPUDP].

o VXLAN.

o LISP [RFC6830], which encapsulates layer 3 packets.

o Generic UDP Encapsulation for IP Tunneling (GRE over UDP) [GREUDP].

Generic UDP Tunneling [GUT] is a proposal similar to GUE in that it aims to tunnel packets of IP protocols over UDP.

GUE has the following discriminating features:

o UDP encapsulation leverages specialized network device processing for efficient transport. The semantics for using the UDP source port as an identifier for an inner flow are defined.
o GUE permits encapsulation of arbitrary IP protocols, including layer 2, 3, and 4 protocols. This potentially allows nearly all traffic within a data center to be normalized to either TCP or UDP on the wire.

o Multiple protocols can be multiplexed over a single UDP port number. This is in contrast to techniques that encapsulate protocols over UDP using a protocol-specific port number (such as ESP/UDP, GRE/UDP, and SCTP/UDP). GUE provides a uniform and extensible mechanism for encapsulating all IP protocols in UDP with minimal overhead (four bytes of additional header).

o GUE is extensible. New flags and optional fields can be defined.

o The GUE header includes a header length field. This allows a network node to inspect an encapsulated packet without needing to parse the full encapsulation header.

o Private data in the encapsulation header allows local customization and experimentation while remaining compatible with processing in network nodes (routers and middleboxes).

o GUE includes both data messages (encapsulation of packets) and control messages (such as OAM).

7. Security Considerations

Encapsulation of IP protocols within GUE should not increase security risk, nor does it provide additional security in itself. As suggested in Section 5, the function that assigns the source port of UDP packets in GUE should be randomly seeded to mitigate some possible denial of service attacks.

Security for Generic UDP Encapsulation, including security for the GUE header and payload, is described in detail in [GUESEC].

8. IANA Considerations

A user port number has been assigned for GUE:

Service Name: gue
Transport Protocol(s): UDP
Assignee: Tom Herbert
Contact: Tom Herbert
Description: Generic UDP Encapsulation
Reference: draft-herbert-gue
Port Number: 6080
Service Code: N/A
Known Unauthorized Uses: N/A
Assignment Notes: N/A

IANA is requested to create a "GUE flag-fields" registry to allocate flags and optional fields for the primary GUE header flags and extension flags. This shall be a registry of bit assignments for flags, lengths of optional fields for the corresponding flags, and descriptive strings. There are sixteen bits for primary GUE header flags (bit numbers 0-15), where bit 15 is reserved as the extension flag in this document. There are thirty-two bits for extension flags.

9. Acknowledgements

The authors would like to thank David Liu, Erik Nordmark, and Fred Templin for valuable input on this draft.

10. References

10.1. Normative References

[RFC0768] Postel, J., "User Datagram Protocol", STD 6, RFC 768, DOI 10.17487/RFC0768, August 1980.

[RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", RFC 2434, DOI 10.17487/RFC2434, October 1998.

[RFC2983] Black, D., "Differentiated Services and Tunnels", RFC 2983, DOI 10.17487/RFC2983, October 2000.

[RFC6040] Briscoe, B., "Tunnelling of Explicit Congestion Notification", RFC 6040, DOI 10.17487/RFC6040, November 2010.

[RFC6936] Fairhurst, G. and M. Westerlund, "Applicability Statement for the Use of IPv6 UDP Datagrams with Zero Checksums", RFC 6936, DOI 10.17487/RFC6936, April 2013.

[RFC4459] Savola, P., "MTU and Fragmentation Issues with In-the-Network Tunneling", RFC 4459, DOI 10.17487/RFC4459, April 2006.

10.2. Informative References

[RFC2003] Perkins, C., "IP Encapsulation within IP", RFC 2003, DOI 10.17487/RFC2003, October 1996.

[RFC3948] Huttunen, A., Swander, B., Volpe, V., DiBurro, L., and M. Stenberg, "UDP Encapsulation of IPsec ESP Packets", RFC 3948, DOI 10.17487/RFC3948, January 2005.

[RFC6830] Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "The Locator/ID Separation Protocol (LISP)", RFC 6830, DOI 10.17487/RFC6830, January 2013.

[RFC3378] Housley, R. and S. Hollenbeck, "EtherIP: Tunneling Ethernet Frames in IP Datagrams", RFC 3378, DOI 10.17487/RFC3378, September 2002.

[RFC2784] Farinacci, D., Li, T., Hanks, S., Meyer, D., and P. Traina, "Generic Routing Encapsulation (GRE)", RFC 2784, DOI 10.17487/RFC2784, March 2000.

[RFC4023] Worster, T., Rekhter, Y., and E. Rosen, Ed., "Encapsulating MPLS in IP or Generic Routing Encapsulation (GRE)", RFC 4023, DOI 10.17487/RFC4023, March 2005.

[RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", RFC 2661, DOI 10.17487/RFC2661, August 1999.

[RFC5925] Touch, J., Mankin, A., and R. Bonica, "The TCP Authentication Option", RFC 5925, DOI 10.17487/RFC5925, June 2010.

[RFC3828] Larzon, L-A., Degermark, M., Pink, S., Jonsson, L-E., Ed., and G. Fairhurst, Ed., "The Lightweight User Datagram Protocol (UDP-Lite)", RFC 3828, July 2004.

[RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, August 2014.

[RFC7605] Touch, J., "Recommendations on Using Assigned Transport Port Numbers", BCP 165, RFC 7605, DOI 10.17487/RFC7605, August 2015.
[NVGRE] Garg, P. and Y. Wang, "NVGRE: Network Virtualization using Generic Routing Encapsulation", draft-sridharan-virtualization-nvgre-08.

[TCPUDP] Cheshire, S., Graessley, J., and R. McGuire, "Encapsulation of TCP and other Transport Protocols over UDP", draft-cheshire-tcp-over-udp-00.

[GREUDP] Crabbe, E., Yong, L., Xu, X., and T. Herbert, "Generic UDP Encapsulation for IP Tunneling", draft-ietf-tsvwg-gre-in-udp-encap-06.

[GUESEC] Yong, L. and T. Herbert, "Generic UDP Encapsulation (GUE) for Secure Transport", draft-hy-gue-4-secure-transport-02, work in progress.

[UDPMAG] Herbert, T. and L. Yong, "UDP Magic Numbers", draft-herbert-udp-magic-numbers-01.

[GUT] Manner, J., Varia, N., and B. Briscoe, "Generic UDP Tunnelling (GUT)", draft-manner-tsvwg-gut-02.

[REMCSUM] Herbert, T., "Remote Checksum Offload", draft-herbert-remotecsumoffload-00.

Appendix A: NIC processing for GUE

This appendix provides some guidelines for Network Interface Cards (NICs) to implement common offloads and accelerations to support GUE. Most of this discussion also applies to other methods of UDP-based encapsulation.

A.1. Receive multi-queue

Contemporary NICs support multiple receive descriptor queues (multi-queue). Multi-queue enables a NIC to load balance network processing across multiple CPUs. On packet reception, a NIC must select the appropriate queue for host processing. Receive Side Scaling (RSS) is a common method. It uses the flow hash for a packet to index an indirection table, where each entry stores a queue number. Flow Director and Accelerated Receive Flow Steering (aRFS) allow a host to program the queue used for a given flow. The flow is identified either by an explicit five-tuple or by the flow's hash.
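The RSS lookup described above can be modeled in a few lines of Python. The model also shows why the inner flow identifier matters when all flows share one tunnel. CRC-32 stands in here for the NIC's real hash function (RSS implementations commonly use a Toeplitz hash). The table size and all names are assumptions for illustration.

```python
import struct
import zlib
from typing import List

def five_tuple_hash(src_ip: bytes, dst_ip: bytes, proto: int, sport: int, dport: int) -> int:
    # Stand-in for the NIC's hash; real RSS implementations commonly use Toeplitz.
    return zlib.crc32(src_ip + dst_ip + struct.pack("!BHH", proto, sport, dport))

class RssIndirectionTable:
    """Toy RSS model: the flow hash indexes a table whose entries are queue numbers."""

    def __init__(self, num_queues: int, size: int = 128) -> None:
        # Spread the queues round-robin across the table entries.
        self.entries: List[int] = [i % num_queues for i in range(size)]

    def queue_for(self, flow_hash: int) -> int:
        return self.entries[flow_hash % len(self.entries)]

# Two inner flows carried over the same tunnel: identical outer addresses and
# destination port (6080), but different inner flow identifiers (source ports).
outer_src, outer_dst = bytes([192, 0, 2, 1]), bytes([198, 51, 100, 2])
h1 = five_tuple_hash(outer_src, outer_dst, 17, 49321, 6080)
h2 = five_tuple_hash(outer_src, outer_dst, 17, 61002, 6080)
rss = RssIndirectionTable(num_queues=8)
q1, q2 = rss.queue_for(h1), rss.queue_for(h2)
```

Different source ports give different hashes, so flows in one tunnel spread across queues. Any two particular flows may still share a queue, because many hash values map onto a small number of queues.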
GUE encapsulation should be compatible with multi-queue NICs that support five-tuple hash calculation for UDP/IP packets as input to RSS. The inner flow identifier (source port) ensures classification of the encapsulated flow even when the outer source and destination addresses are the same for all flows (e.g., all flows go over a single tunnel).

By default, UDP RSS support is often disabled in NICs to avoid the out-of-order reception that can occur when UDP packets are fragmented. Non-initial fragments carry no UDP header, so they cannot be hashed on the five-tuple and may land on a different queue than the first fragment. As discussed above, fragmentation of GUE packets should be mitigated in three ways: fragmenting packets before they enter a tunnel, path MTU discovery in higher layer protocols, or operator adjustment of MTUs. Other UDP traffic may not implement such procedures, so enabling UDP RSS support in the NIC should be a considered tradeoff during configuration.

A.2. Checksum offload

Many NICs can calculate the standard ones' complement payload checksum for packets on transmit or receive. When using GUE encapsulation, at least two checksums may be of interest: the encapsulated packet's transport checksum, and the UDP checksum in the outer header.

A.2.1. Transmit checksum offload

NICs may provide a protocol-agnostic method to offload the transmit checksum (NETIF_F_HW_CSUM in Linux parlance) that can be used with GUE. In this method, the host provides checksum-related parameters in a transmit descriptor for a packet. These parameters include the starting offset of the data to checksum, the length of the data to checksum, and the offset in the packet where the computed checksum is to be written. The host initializes the checksum field to the pseudo header checksum.
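The following Python sketch models the protocol-agnostic transmit offload for an inner TCP segment carried in GUE. The host seeds the TCP checksum field with the inner pseudo header sum. The "NIC" then sums from the start offset over the given length, complements the result, and writes it at the given offset. The parameter names and the 32-byte outer header placeholder are hypothetical.

```python
import struct

def csum16(data: bytes) -> int:
    """16-bit ones' complement sum (not yet complemented)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def nic_tx_checksum(pkt: bytearray, start: int, length: int, write_at: int) -> None:
    """Model of the NIC engine: sum pkt[start:start+length], complement, store."""
    total = csum16(bytes(pkt[start:start + length]))
    struct.pack_into("!H", pkt, write_at, ~total & 0xFFFF)

# Host side: an inner TCP segment (20-byte header plus payload); checksum field at offset 16.
inner_src, inner_dst = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
tcp = bytearray(struct.pack("!HHIIBBHHH", 40000, 80, 1, 0, 5 << 4, 0x18, 65535, 0, 0)) + b"hello"
pseudo = inner_src + inner_dst + struct.pack("!BBH", 0, 6, len(tcp))
struct.pack_into("!H", tcp, 16, csum16(pseudo))  # seed the field with the pseudo header sum

# Outer headers are modeled as 32 zero bytes (placeholder for IPv4 + UDP + GUE).
outer_len = 32
pkt = bytearray(outer_len) + tcp
nic_tx_checksum(pkt, start=outer_len, length=len(tcp), write_at=outer_len + 16)
```

Seeding the field with the pseudo header sum lets the NIC stay protocol agnostic. It only needs offsets and a length, not an understanding of TCP, UDP, or GUE.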
In the case of GUE, the checksum for an encapsulated transport layer packet, a TCP packet for instance, can be offloaded by setting the appropriate checksum parameters.

NICs typically can offload only one transmit checksum per packet, so offloading both an inner transport packet's checksum and the outer UDP checksum at the same time is likely not possible. In this case, setting the UDP checksum to zero (per the discussion above) and offloading the inner transport packet checksum might be acceptable.

If an encapsulator is co-resident with a host, checksum offload may be performed using remote checksum offload [REMCSUM]. Remote checksum offload relies on NIC offload of the simple UDP/IP checksum, which is commonly supported even in legacy devices. It works as follows:

o The outer UDP checksum is set, and the GUE header includes an option indicating the start and offset of the inner "offloaded" checksum.

o The inner checksum is initialized to the pseudo header checksum.

o When a decapsulator receives a GUE packet with the remote checksum offload option, it completes the offload. It computes the packet checksum from the indicated start point to the end of the packet, and adds this into the checksum field at the offset given in the option.

Computing the checksum from the start point to the end of the packet is efficient if checksum-complete is provided on the receiver.

A.2.2. Receive checksum offload

GUE is compatible with NICs that perform a protocol-agnostic receive checksum (CHECKSUM_COMPLETE in Linux parlance). In this technique, a NIC computes a ones' complement checksum over all (or some predefined portion) of a packet. The computed value is provided to the host stack in the packet's receive descriptor.
The host driver can use this checksum to "patch up" and validate any inner packet transport checksum, as well as the outer UDP checksum if it is non-zero.

Many legacy NICs don't provide checksum-complete. Instead, they indicate that a checksum has been verified (CHECKSUM_UNNECESSARY in Linux). Usually, such validation is only done for simple TCP/IP or UDP/IP packets. If a NIC indicates that a UDP checksum is valid, the checksum-complete value for the UDP packet is the bitwise "not" of the pseudo header checksum. In this way, checksum-unnecessary can be converted to checksum-complete. So if the NIC provides checksum-unnecessary for the outer UDP header in an encapsulation, the stack can derive the checksum-complete value and use it to validate any checksums in the encapsulated packet.

A.3. Transmit Segmentation Offload

Transmit Segmentation Offload (TSO) is a NIC feature where a host provides a large (>MTU size) TCP packet to the NIC. The NIC in turn splits the packet into separate segments and transmits each one. This is useful to reduce CPU load on the host.

The process of TSO can be generalized as:

- Split the TCP payload into segments so that each resulting packet has a size less than or equal to the MTU.

- For each created segment:

   1. Replicate the TCP header and all preceding headers of the original packet.

   2. Set payload length fields in any headers to reflect the length of the segment.

   3. Set the TCP sequence number to correctly reflect the offset of the TCP data in the stream.

   4. Recompute and set any checksums that either cover the payload of the packet or cover a header that was changed by setting a payload length.

Following this general process, TSO can be extended to support TCP encapsulation in GUE.
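The general TSO steps listed above can be sketched in Python for a GUE-encapsulated TCP packet. The sketch computes the per-segment payload size, sequence number, and length fields only. Header replication and checksum recomputation are noted in comments but not modeled. The header sizes (outer IPv4, UDP, a 4-byte GUE header with no optional fields, inner IPv4, TCP, none with options) and all names are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class GueTcpLayout:
    """Assumed header sizes: outer IPv4, UDP, GUE (no optional fields), inner IPv4, TCP."""
    outer_ip: int = 20
    udp: int = 8
    gue: int = 4
    inner_ip: int = 20
    tcp: int = 20

    @property
    def overhead(self) -> int:
        return self.outer_ip + self.udp + self.gue + self.inner_ip + self.tcp

@dataclass(frozen=True)
class SegmentPlan:
    seq: int                  # step 3: TCP sequence number of the first payload byte
    payload_len: int
    inner_ip_total_len: int   # step 2: length fields rewritten for each segment
    udp_len: int
    outer_ip_total_len: int

def plan_gue_tso(layout: GueTcpLayout, seq: int, payload_len: int, mtu: int) -> List[SegmentPlan]:
    mss = mtu - layout.overhead  # largest TCP payload keeping the outer packet <= MTU
    if mss <= 0:
        raise ValueError("MTU too small for the encapsulation headers")
    plans = []
    for off in range(0, payload_len, mss):  # split the TCP payload
        # Step 1 (not modeled): each segment reuses a copy of the same header template.
        n = min(mss, payload_len - off)
        inner_total = layout.inner_ip + layout.tcp + n
        udp_len = layout.udp + layout.gue + inner_total
        plans.append(SegmentPlan(
            seq=(seq + off) & 0xFFFFFFFF,  # sequence numbers wrap modulo 2**32
            payload_len=n,
            inner_ip_total_len=inner_total,
            udp_len=udp_len,
            outer_ip_total_len=layout.outer_ip + udp_len,
        ))
    # Step 4 (not modeled): recompute the outer and inner IPv4 header checksums,
    # the TCP checksum, and the outer UDP checksum if it is non-zero.
    return plans

plans = plan_gue_tso(GueTcpLayout(), seq=1000, payload_len=4000, mtu=1500)
```

With these assumed sizes, the encapsulation overhead is 72 bytes. A 1500-byte MTU therefore leaves 1428 bytes of TCP payload per segment.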
For each segment, the Ethernet header, outer IP header, UDP header, GUE header, inner IP header (if tunneling), and TCP header are replicated. Any packet length header fields need to be set properly, including the length in the outer UDP header. Checksums need to be set correctly, including the outer UDP checksum if it is being used.

To facilitate TSO with GUE, it is recommended that optional fields not contain values that must be updated on a per-segment basis. For example, the GUE fields should not include checksums, lengths, or sequence numbers that refer to the payload. If the GUE header does not contain such fields, the TSO engine only needs to copy the bits in the GUE header when creating each segment and does not need to parse the GUE header.

A.4. Large Receive Offload

Large Receive Offload (LRO) is a NIC feature where packets of a TCP connection are reassembled, or coalesced, in the NIC and delivered to the host as one large packet. This feature can reduce CPU utilization in the host.

LRO requires significant protocol awareness to be implemented correctly and is difficult to generalize. Packets in the same flow need to be unambiguously identified. In the presence of tunnels or network virtualization, this may require more than a five-tuple match; for instance, packets for flows in two different virtual networks may have identical five-tuples. Additionally, a NIC needs to validate the packets being coalesced, and needs to fabricate a single meaningful header from all the coalesced packets.

The conservative approach to supporting LRO for GUE would be to assign packets to the same flow only if they have identical five-tuples and were encapsulated the same way. That is, the outer IP addresses, the outer UDP ports, the GUE protocol, the GUE flags and fields, and the inner five-tuple are all identical.
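The conservative LRO matching rule above can be expressed as a Python flow key in which every listed header element must be equal before packets are merged. Requiring in-order TCP data is an extra assumption added here. Real LRO engines also check TCP flags, options, and header validity, which this sketch omits. The names are hypothetical, and GUE optional fields are compared as opaque bytes.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class GueLroKey:
    """Everything that must match before two GUE packets may be coalesced."""
    outer_src: str
    outer_dst: str
    outer_sport: int        # inner flow identifier
    outer_dport: int
    gue_proto: int          # GUE protocol (next header) field
    gue_flags: int
    gue_fields: bytes       # raw optional fields, compared byte for byte
    inner_five_tuple: Tuple[int, str, str, int, int]

def can_coalesce(held: GueLroKey, held_next_seq: int, key: GueLroKey, seq: int) -> bool:
    """Conservative rule: identical encapsulation and inner flow, and in-order data."""
    return key == held and seq == held_next_seq

base = GueLroKey("192.0.2.1", "198.51.100.2", 49321, 6080, 4, 0, b"",
                 (6, "10.0.0.1", "10.0.0.2", 40000, 80))
# Same inner five-tuple but different GUE optional field bytes: never merged.
other = replace(base, gue_fields=b"\x00\x00\x00\x2a")
```

Comparing the GUE fields byte for byte is deliberately strict. For example, two flows with identical inner five-tuples but different optional field contents (possibly from different virtual networks) are kept apart.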
Appendix B: Privileged ports

Because the source port carries an inner flow identifier value, a receiver cannot use the security method of enforcing that the source port be a privileged port. Some operating systems define privileged ports to restrict source port binding. Unix, for instance, considers port numbers less than 1024 to be privileged.

Enforcing that packets are sent from a privileged port is widely considered an inadequate security mechanism and has been mostly deprecated. To approximate this behavior, an implementation could restrict a user from sending a packet destined to the GUE port without proper credentials.

Appendix C: Inner flow identifier as a route selector

An encapsulator generating an inner flow identifier may modulate the value to perform a type of multipath source routing. Assuming that networking switches perform ECMP based on the flow hash, a sender can affect the path by altering the inner flow identifier. For instance, a host may store a flow hash in its PCB for an inner flow, and may alter the value upon detecting that packets are traversing a lossy path. Changing the inner flow identifier for a flow should be subject to hysteresis (at most once every thirty seconds) to limit the number of out-of-order packets.

Appendix D: Hardware protocol implementation considerations

A low-level protocol such as GUE is a likely candidate for support in high-speed network devices. Variable length header (VLH) protocols like GUE are often considered difficult to implement efficiently in hardware. To retain the important characteristics of an extensible and robust protocol, hardware vendors may practice "constrained flexibility". In this model, only certain combinations of protocol header parameterizations are implemented in the hardware fast path.
Each such parameterization is fixed length, so that the particular instance can be optimized as a fixed-length protocol. In the case of GUE, a parameterization is a specific combination of GUE flags, fields, and next protocol. The selected combinations would naturally be the most common cases, which form the "fast path"; other combinations are assumed to take the "slow path".

Over time, the needs and requirements of the protocol may change, and these changes may show up as new parameterizations that the fast path must support. To allow this extensibility, a device practicing constrained flexibility should allow the fast path parameterizations to be programmable.

Authors' Addresses

Tom Herbert
Facebook
1 Hacker Way
Menlo Park, CA 94052
US

Email: tom@herbertland.com

Lucy Yong
Huawei USA
5340 Legacy Dr.
Plano, TX 75024
US

Email: lucy.yong@huawei.com

Osama Zia
Microsoft
1 Microsoft Way
Redmond, WA 98029
US

Email: osamaz@microsoft.com