INTERNET-DRAFT                                                  H. Xiang
Intended Status: Standards Track                                   Y. Yu
Expires: Sep 1, 2018                                 Huawei Technologies
                                                              P. Congdon
                                                         Tallac Networks
                                                                 J. Wang
                                                           China Telecom
                                                           March 1, 2018


               Packet Spraying in Geneve Overlay Network
                draft-xiang-nvo3-geneve-packet-spray-00

Abstract

   Congestion undermines both low latency and high throughput. Network
   congestion occurs on the interconnection links of a data center due
   to poor traffic distribution. Load balancing technologies are used
   to relieve network congestion. Packet spraying is a load balancing
   technology with finer granularity than flow-based approaches. This
   document describes a packet spraying protocol for Geneve
   encapsulation networks [1] that uses a newly defined Geneve option.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright and License Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1  Introduction
   2  Terminology
   3  Abbreviations
   4  Problem Statements & Requirements
   5  Packet Spraying on Geneve
     5.1  Packet Spraying Format
     5.2  Packet Spray Capability Discovery
     5.3  TCP/UDP over Geneve
   6  Security Considerations
   7  IANA Considerations
   8  References
   Authors' Addresses

1  Introduction

   In many current data centers, network utilization is not as high as
   it could be. For example, in some scenarios the average network
   utilization is about 20% and the peak utilization is about 45% [2].
   As end systems (endpoints) improve and multi-service and high-volume
   traffic services are deployed (such as streaming media, big data
   processing applications, and user-oriented large-scale web
   applications), more and more network performance problems appear.
   These problems are caused by traffic bursts and traffic routing
   collisions. The traffic imbalance in the network becomes
   increasingly prominent, which leads to underutilized network
   bandwidth and decreased overall performance of network applications.
   To fully utilize the available network bandwidth, traffic flows
   entering the network are dispersed across multiple paths to achieve
   load balancing.
   The finer the granularity of the load balancing, the higher the
   utilization of available network bandwidth. Current flow-based and
   flowlet-based [3] approaches are coarser-grained than packet-based
   load balancing. This document describes how to extend the Geneve
   header to support packet-based load balancing, called packet
   spraying, in the Geneve encapsulation network.

2  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3  Abbreviations

   GENEVE - Generic Network Virtualization Encapsulation
   ECMP   - Equal-cost multi-path routing
   SDN    - Software Defined Network
   GFP    - Geneve Forwarding Policy

4  Problem Statements & Requirements

   The prevailing network topology in the data center is a multi-rooted
   tree architecture, such as the typical Clos network. This kind of
   network has multiple paths and an equal division of bandwidth across
   those paths, which provides good scalability and flexibility
   depending on how the multiple paths are utilized. To fully utilize
   the network bandwidth, traffic flows entering the network are
   dispersed across the multiple paths to achieve load balancing.
   Currently, load balancing is performed at one of three
   granularities: flow-based load balancing (such as ECMP),
   flowlet-based load balancing (such as CONGA [3]), and packet-based
   load balancing (such as Packet Spraying). The finer the granularity
   of load balancing, the more effective the load balancing is and the
   higher the utilization of network bandwidth can be. Packet-based
   load balancing is the most effective of the three because its
   granularity is the smallest. However, the consequence is that
   packets belonging to the same flow will be allocated to different
   paths.
   When the forwarding delays of the paths differ, packets may arrive
   at the receiver out of order. To detect out-of-order packets and
   restore the correct order, a sequence number is needed in the
   packets.

5  Packet Spraying on Geneve

5.1  Packet Spraying Format

   The Geneve header and the Geneve option have the following
   format [1]:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Virtual Network Identifier (VNI)       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Variable Length Options                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                             Geneve Header

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Option Class         |      Type     |R|R|R| Length  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Variable Option Data                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                             Geneve Option

   Option Class = To be assigned by IANA (TBA).
   Type = TBA.
   Length = 2 (8 bytes)

   The proposed Packet Spraying option for Geneve has the following
   format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Option Class = GFP      |      Type     |R|R|R| Length  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Flow Group ID                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Packet Spraying Format over Geneve

   Option Class = Geneve Forwarding Policy (suggested), to be assigned
                  by IANA (TBA).
   Type = TBA.
   Length = 2 (8 bytes)
   Flow Group ID: described in 5.1.1
   Sequence Number: described in 5.1.2

5.1.1  Flow Group ID Field (4 Bytes)

   The Flow Group ID field is a four-byte field. The Flow Group ID
   identifies a group of flows within the same reorder sequence space
   between a pair of src/dest nodes. The Flow Group ID may correspond
   to an individual flow, some subset of flows, or even all flows
   between the src/dest pair. How a flow is mapped to a Flow Group ID
   is not defined by this draft. The same Flow Group ID can be used by
   different src/dest pairs (i.e. a Flow Group ID is only unique within
   the context of a src/dest pair). A Flow Group is uniquely identified
   by the 3-tuple of src IP, dest IP and Flow Group ID. The source node
   allocates sequence numbers in the order in which packets are sent
   for flows of the same Flow Group. The destination reorders the
   received packets of a Flow Group according to the received sequence
   numbers.

5.1.2  Sequence Number Field

   The Sequence Number field is a four-byte field that closely follows
   the definition of the Sequence Number in RFC 2890 [4]. The sequence
   number value ranges from 0 to (2**32)-1. The first datagram is sent
   with a sequence number of 0. The sequence number is thus a
   monotonically increasing counter represented modulo 2**32. The
   receiver maintains the sequence number value of the last
   successfully decapsulated packet. This value should be initialized
   to (2**32)-1.

   A packet is considered an out-of-sequence packet if the sequence
   number of the received packet is less than or equal to the sequence
   number of the last successfully decapsulated packet. The sequence
   number of a received message is considered less than or equal to the
   last successfully received sequence number if its value lies in the
   range of the last received sequence number and the preceding 2**31-1
   values, inclusive.
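
   The "less than or equal" test above can be sketched in a few lines
   of Python. This is only an illustration of the wraparound
   arithmetic; the function and constant names are assumptions of this
   sketch and are not defined by this draft.

```python
MOD = 1 << 32    # sequence numbers are represented modulo 2**32
HALF = 1 << 31   # "last" plus the 2**31 - 1 values that precede it


def is_out_of_sequence(seq: int, last: int) -> bool:
    """Return True if seq is "less than or equal to" last, modulo 2**32.

    seq qualifies when it equals last or is one of the 2**31 - 1 values
    preceding it, i.e. when (last - seq) mod 2**32 is below 2**31.
    Both names are hypothetical helpers for this sketch.
    """
    return (last - seq) % MOD < HALF


if __name__ == "__main__":
    # With the receiver initialized to 2**32 - 1, the first packet
    # (sequence number 0) is not treated as out-of-sequence.
    print(is_out_of_sequence(0, MOD - 1))
    # A value just before a wrapped "last" is still out-of-sequence.
    print(is_out_of_sequence(MOD - 2, 2))
```

   Because the comparison is done on the difference modulo 2**32, the
   same test works before and after the counter wraps.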
   If the received packet is an in-sequence packet, it is successfully
   decapsulated. An in-sequence packet is one with a sequence number
   exactly 1 greater than (modulo 2**32) that of the last successfully
   decapsulated packet. If the received packet is neither an
   in-sequence nor an out-of-sequence packet, it indicates a sequence
   number gap. The receiver may perform a small amount of buffering in
   an attempt to recover the original sequence of transmitted packets.
   In this case, the packet may be placed in a buffer sorted by
   sequence number. If an in-sequence packet is received and
   successfully decapsulated, the receiver should consult the head of
   this buffer to see if the next in-sequence packet has already been
   received. If so, the receiver should decapsulate it as well as the
   following in-sequence packets that may be present in the buffer. The
   "last successfully decapsulated sequence number" should then be set
   to the sequence number of the last packet that was decapsulated from
   the buffer.

   Under no circumstances should a packet wait more than
   OUTOFORDER_TIMER milliseconds in the buffer. If a packet has been
   waiting that long, the receiver MUST immediately traverse the buffer
   in sorted order, decapsulating packets (and ignoring any sequence
   number gaps) until there are no more packets in the buffer that have
   been waiting longer than OUTOFORDER_TIMER milliseconds. The "last
   successfully decapsulated sequence number" should then be set to the
   sequence number of the last packet so decapsulated.

   The receiver may place a limit on the number of packets in any
   per-flow-group buffer (packets with the same Flow Group ID field
   value belong to one flow group). If a packet arrives that would
   cause the receiver to place more than MAX_PERFLOW_BUFFER packets
   into a given buffer, then the packet at the head of the buffer is
   immediately decapsulated regardless of its sequence number and the
   "last successfully decapsulated sequence number" is set to its
   sequence number. The newly arrived packet may then be placed in the
   buffer.
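
   The receiver behaviour described above can be sketched as follows.
   This is a minimal illustration for a single Flow Group, not a
   definitive implementation. The class and method names, the timer
   and buffer values, and the omission of payloads are assumptions of
   this sketch. Two further assumptions are made where this draft is
   silent: out-of-sequence packets are dropped (as RFC 2890 suggests),
   and after a forced release any packets that have become in-sequence
   are released as well. For simplicity the sketch inserts the new
   packet first and then releases the head when the limit is exceeded,
   so the head may be the new packet itself.

```python
from typing import Dict, List

MOD = 1 << 32               # sequence numbers are modulo 2**32
HALF = 1 << 31
OUTOFORDER_TIMER = 10.0     # milliseconds; illustrative value only
MAX_PERFLOW_BUFFER = 4      # packets; illustrative value only


class ReorderQueue:
    """Reorder state for one Flow Group (hypothetical helper class)."""

    def __init__(self) -> None:
        self.last = MOD - 1               # initialized to 2**32 - 1
        self.buf: Dict[int, float] = {}   # buffered seq -> arrival time (ms)
        self.delivered: List[int] = []    # decapsulation order, for inspection

    def _decap_head(self) -> None:
        # Head = buffered packet closest ahead of self.last (modulo 2**32).
        head = min(self.buf, key=lambda s: (s - self.last) % MOD)
        del self.buf[head]
        self.delivered.append(head)
        self.last = head

    def _drain(self) -> None:
        # Release buffered packets that have become in-sequence.
        while (self.last + 1) % MOD in self.buf:
            self._decap_head()

    def expire(self, now_ms: float) -> None:
        # Flush in sorted order, ignoring gaps, while any packet is overdue.
        while any(now_ms - t >= OUTOFORDER_TIMER for t in self.buf.values()):
            self._decap_head()
        self._drain()

    def receive(self, seq: int, now_ms: float) -> None:
        self.expire(now_ms)
        if (self.last - seq) % MOD < HALF:
            return                        # out-of-sequence: dropped (assumption)
        self.buf[seq] = now_ms            # an in-sequence packet is drained at once
        if len(self.buf) > MAX_PERFLOW_BUFFER:
            self._decap_head()            # limit exceeded: force the head out
        self._drain()


if __name__ == "__main__":
    q = ReorderQueue()
    for t, seq in enumerate([0, 2, 1, 3]):
        q.receive(seq, float(t))
    print(q.delivered)
```

   A real receiver would keep one such queue per (src IP, dest IP, Flow
   Group ID) and would drive expire() from a timer instead of calling
   it only when a packet arrives.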
   The received packets of flows from the same Flow Group are in the
   same reorder sequence space. The source allocates sequence numbers
   in the order in which packets are sent. If the sequence number
   wraps, the source allocates from 0 again.

5.2  Packet Spray Capability Discovery

   The reorder function on the destination needs certain resources. For
   example, there is a reorder queue corresponding to each Group ID
   (Flow Group ID plus the source IP address). For some
   resource-constrained chips such as switch chips, the number of
   queues is limited. Therefore, it is important not to exceed the
   ability of the destination when assigning Group IDs at the source.
   This requires that the source understand the ability of the
   destination. There are several solutions, such as static
   configuration or direct signaling between the two ends.

   In the following situations, a capability notification needs to be
   sent to the peer:

   1. When the source communicates with the destination for the first
      time.

   2. When a packet is received from the peer for the first time.

   3. When a capability notification is received from the source.

   4. When the Group ID of the peer exceeds the local capability.

   In the above cases, the destination needs to notify the source of
   its capability (the reorder queues assigned to the peer). On
   receiving the capability notification from the destination, the
   source needs to tune its Group ID allocation mechanism according to
   the capability of the destination, so that the number of Group IDs
   does not exceed the number of reordering queues allocated to the
   source.

   When the number of Group IDs exceeds the local capability, one of
   the following two actions can be taken. Which action is selected is
   not covered in this draft.

   1. Discard the Geneve packet for the Group ID that exceeds the local
      capability.

   2. Remove the Geneve encapsulation without performing reordering,
      and pass the packet to the higher-layer protocol.
      For higher-layer protocols that can tolerate a certain degree of
      out-of-order packets (such as TCP), the message may still be
      processed correctly.

   When the Group ID exceeds the local capability, the destination
   sends a notification of its reordering capability to the source. To
   prevent capability notifications from being sent too frequently, a
   notification suppression mechanism is needed. When the destination
   sends a capability notification to the source, it enters a
   suppression cycle. The destination does not send another capability
   notification to that source until the suppression cycle ends. The
   suppression period is longer than the round-trip time (RTT) between
   the two nodes.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Virtual Network Identifier (VNI)       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Option Class = GFP      | Type=Capacity |R|R|R| Length=8|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          MAX GROUP ID                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 Capability notification message format

   MAX GROUP ID is a four-byte field. MAX Group ID indicates the
   maximum Group ID assigned to the destination. The Group ID allocated
   by the source must lie in the range 0 to MAX Group ID.

5.3  TCP/UDP over Geneve

   For certain applications, the main fields of the outer IP header are
   the same as those of the Geneve inner IP header. For example, the
   source/destination nodes and IP addresses are the same in both the
   inner and outer headers. When the source/destination nodes are the
   same, the TCP/UDP layer can be carried directly over Geneve, saving
   20 bytes (IPv4) of header. In this situation, the Geneve header
   Protocol Type must specify the transport layer protocol.
   When the destination receives such a message, it strips off the
   Geneve header directly and splices the TCP/UDP message onto the end
   of the IP header.

   Geneve Header:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Ver|  Opt Len  |O|C|    Rsvd.  |    (Protocol Type) TCP/UDP    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Virtual Network Identifier (VNI)       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Variable Length Options                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         TCP/UDP Header                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        TCP/UDP Payload                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

6  Security Considerations

   This document describes a Geneve option that introduces a Flow Group
   ID and a Sequence Number for reordering packets. Through the
   Sequence Number field, an attacker can inject packets with arbitrary
   sequence numbers and launch a denial-of-service attack. This is a
   general security issue that is described in the Geneve security
   requirements [5]. To protect against such attacks, IPsec could be
   used to protect the Geneve header and the tunneled payload. Any
   common Geneve security mechanism also applies to this draft.
7  IANA Considerations

   IANA is requested to allocate a Geneve "option class" number for GFP
   (Geneve Forwarding Policy):

   +---------------+-------------+---------------+
   | Option Class  | Description | Reference     |
   +---------------+-------------+---------------+
   | x             | GFP_ID      | This document |
   +---------------+-------------+---------------+

   IANA/IEEE is requested to allocate a Geneve "Protocol Type" number
   for TCP/UDP over Geneve:

   +---------------+-------------+---------------+
   | Protocol Type | Description | Reference     |
   +---------------+-------------+---------------+
   | 0x9004        | TCP         | This document |
   +---------------+-------------+---------------+
   | 0x9005        | UDP         | This document |
   +---------------+-------------+---------------+

8  References

   [1] J. Gross, Ed., I. Ganga, Ed., T. Sridhar, Ed., "Generic Network
       Virtualization Encapsulation", [I-D.ietf-nvo3-geneve]

   [2] Jiaxin Cao, et al, "Per-packet Load-balanced, Low-Latency
       Routing for Clos-based Data Center Networks", CoNEXT'13

   [3] Mohammad Alizadeh, et al, "CONGA: Distributed Congestion-Aware
       Load Balancing for Datacenters", Sigcomm'14

   [4] G. Dommety, "Key and Sequence Number Extensions to GRE",
       RFC 2890, September 2000

   [5] D. Migault, S. Boutros, D. Wing, S. Krishnan, "Geneve Protocol
       Security Requirement",
       [I-D. draft-mglt-nvo3-geneve-security-requirements-03]

Authors' Addresses

   Haizhou Xiang
   Huawei Technologies Co., Ltd.
   Email: xianghaizhou@huawei.com

   Yolanda Yu
   Huawei Technologies Co., Ltd.
   Email: yolanda.yu@huawei.com

   Paul Congdon
   Tallac Networks
   Email: paul.congdon@tallac.com

   Jianglong Wang
   China Telecom
   Email: wangjl1.bri@chinatelecom.cn