idnits 2.17.1

draft-yu-nvo3-geneve-pkt-reordering-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The document seems to lack separate sections for Informative/Normative
     References. All references will be assumed normative when checking for
     downward references.

  ** The abstract seems to contain references ([1]), which it shouldn't.
     Please replace those with straight textual mentions of the documents in
     question.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 81 has weird spacing: '...s), the deplo...'

  == Line 264 has weird spacing: '...equence numb...'

  -- The document date (Sep 1, 2018) is 2063 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: 'RFC2119' on line 106

  -- Looks like a reference, but probably isn't: 'I-D.ietf-nvo3-geneve' on
     line 366

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  -- Possible downref: Non-RFC (?) normative reference: ref. '3'

  == Outdated reference: A later version (-06) exists of
     draft-mglt-nvo3-geneve-security-requirements-03

  ** Downref: Normative reference to an Informational draft:
     draft-mglt-nvo3-geneve-security-requirements (ref. '5')

     Summary: 3 errors (**), 0 flaws (~~), 4 warnings (==), 6 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

INTERNET-DRAFT                                                     Y. Yu
Intended Status: Standards Track                     Huawei Technologies
Expires: Mar 5, 2019                                             J. Wang
                                                           China Telecom
                                                             Sep 1, 2018

              Packet Reordering in Geneve Overlay Network
                 draft-yu-nvo3-geneve-pkt-reordering-00

Abstract

   Congestion is the killer of low latency and high throughput. Network
   congestion occurs on the interconnection links of a data center due
   to poor traffic distribution. Load balancing technologies are used to
   relieve network congestion. Packet spraying is a load balancing
   technology with finer granularity. With packet spraying, packets may
   arrive at the destination out of order. This document describes a
   reordering protocol in the Geneve encapsulation network [1] using a
   newly defined Geneve Option field.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

INTERNET DRAFT

Copyright and License Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1  Introduction
   2  Terminology
   3  Abbreviations
   4  Problem Statements & Requirements
   5  Packet Reordering on Geneve
     5.1  Packet Reordering Format
     5.2  Packet Reordering Capability Discovery
   6  Security Considerations
   7  IANA Considerations
   8  References
   Authors' Addresses

1  Introduction

   In many current data centers, network utilization is not as high as
   it could be. For example, in some scenarios, the average network
   utilization is about 20% and the peak utilization is about 45% [2].
   With the improvement of end systems (or endpoints) and the deployment
   of multiple services and high-volume traffic services (such as
   streaming media, big data processing applications, and user-oriented
   large-scale web applications), more and more network performance
   problems appear. These problems are created by traffic bursts and
   traffic routing collisions. The imbalance of traffic on the network
   becomes more and more prominent, which leads to underutilized network
   bandwidth and decreased overall performance of network applications.

   In order to fully utilize the available network bandwidth, traffic
   flows entering the network are dispersed across multiple paths to
   achieve load balancing. The finer the granularity of the load
   balancing, the higher the utilization of available network bandwidth.
   Current flow-based and flowlet-based [3] approaches are
   coarser-grained than packet-based load balancing. With packet
   spraying, however, packets may arrive at the destination out of order
   because link latencies differ. This document describes how to extend
   the Geneve header to support reordering for packet-based load
   balancing, referred to here as reordering in the Geneve encapsulation
   network.

2  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3  Abbreviations

   GENEVE - Generic Network Virtualization Encapsulation

   ECMP - Equal-cost multi-path routing

   SDN - Software Defined Network

   GFP - Geneve Forwarding Policy

4  Problem Statements & Requirements

   The prevailing network topology in the data center is a multi-rooted
   tree architecture, such as the typical Clos network.
   This kind of network has multiple paths and an equal division of
   bandwidth across those paths, which provides good scalability and
   flexibility depending on how the multiple paths are utilized. In
   order to fully utilize the network bandwidth, traffic flows entering
   the network are dispersed over the multiple paths to achieve load
   balancing. Currently, load balancing is performed at the following
   granularities: flow-based load balancing (such as ECMP),
   flowlet-based load balancing (such as CONGA [3]) and packet-based
   load balancing (such as Packet Spraying). The finer the granularity
   of load balancing, the more effective the load balancing is and the
   higher the utilization of network bandwidth can be.

   Packet-based load balancing is the most effective of the three
   because its granularity is the finest. However, the consequence is
   that packets belonging to the same flow are allocated to different
   paths. When the forwarding delays of the paths differ, packets may
   arrive at the receiver out of order. To detect out-of-order packets
   and restore the correct order, a sequence number is needed in the
   packets.

5  Packet Reordering on Geneve

5.1  Packet Reordering Format

   The Geneve Header and the Geneve option have the following format
   [1]:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Virtual Network Identifier (VNI)       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Variable Length Options                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                             Geneve Header

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Option Class         |      Type     |R|R|R| Length  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Variable Option Data                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                             Geneve Option

   Option Class = To be assigned by IANA (TBA).
   Type = TBA.
   Length = 2 (8 bytes)

   The proposed Packet Reordering option for Geneve will have the
   following format:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Option Class = GFP      |      Type     |R|R|R| Length  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         Flow Group ID                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                 Packet Reordering Format over Geneve

   Option Class = Geneve Forwarding Policy (suggested), to be assigned
   by IANA (TBA).
   Type = TBA.
   Length = 2 (8 bytes)

   Flow Group ID: described in Section 5.1.1.

   Sequence Number: described in Section 5.1.2.

5.1.1  Flow Group ID Field (4 Bytes)

   The Flow Group ID field is a four-byte field. The Flow Group ID
   identifies a group of flows within the same reorder sequence space
   between a pair of src/dest nodes. The Flow Group ID may correspond to
   an individual flow, some subset of flows, or even all flows between
   the src/dest pair.
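
   As an illustration of the Packet Reordering option layout shown in
   Section 5.1, the following Python sketch packs and parses the 12-byte
   option (a 4-byte option header followed by 8 bytes of option data).
   The Option Class and Type values are hypothetical placeholders,
   because the draft leaves both to be assigned, and the function names
   are this sketch's own rather than part of any specification.

```python
import struct

# Hypothetical placeholders: the draft leaves Option Class and Type "TBA".
GFP_OPTION_CLASS = 0xFFFF  # assumption, not an IANA assignment
REORDER_TYPE = 0x01        # assumption, not an IANA assignment
OPTION_LENGTH = 2          # option data length in 4-byte words (8 bytes)

# Network byte order: 16-bit class, 8-bit type, 8 bits holding the three
# reserved bits plus the 5-bit Length, then two 32-bit data fields.
_LAYOUT = "!HBBII"


def pack_reorder_option(flow_group_id, seq):
    """Build the option bytes for one packet."""
    return struct.pack(
        _LAYOUT,
        GFP_OPTION_CLASS,
        REORDER_TYPE,
        OPTION_LENGTH & 0x1F,        # reserved bits stay zero
        flow_group_id & 0xFFFFFFFF,
        seq & 0xFFFFFFFF,
    )


def unpack_reorder_option(data):
    """Return (flow_group_id, seq) parsed from the option bytes."""
    opt_class, opt_type, rsvd_len, flow_group_id, seq = struct.unpack(
        _LAYOUT, data[:12]
    )
    if opt_class != GFP_OPTION_CLASS or opt_type != REORDER_TYPE:
        raise ValueError("not a Packet Reordering option")
    if rsvd_len & 0x1F != OPTION_LENGTH:
        raise ValueError("unexpected option length")
    return flow_group_id, seq
```

   A sender would append the result of pack_reorder_option() to the
   Geneve header options; a receiver would use the parsed pair, together
   with the outer source and destination addresses, to select a reorder
   queue.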
   How flows are mapped to a Flow Group ID is not defined by this draft.
   The same Flow Group ID can be used by different src/dest pairs (i.e.,
   a Flow Group ID is only unique within the context of a src/dest
   pair). A Flow Group is uniquely identified by the 3-tuple of src IP,
   dest IP and Flow Group ID. The source node allocates sequence numbers
   in the order in which packets are sent for the flows of the same Flow
   Group. The destination reorders the received packets of a Flow Group
   according to the received sequence numbers.

5.1.2  Sequence Number Field

   The Sequence Number field is a four-byte field that closely follows
   the definition of the Sequence Number in RFC 2890 [4]. The sequence
   number value ranges from 0 to (2**32)-1. The first datagram is sent
   with a sequence number of 0. The sequence number is thus a
   monotonically increasing counter represented modulo 2**32. The
   receiver maintains the sequence number value of the last successfully
   decapsulated packet. This value should be initialized to (2**32)-1.

   A packet is considered an out-of-sequence packet if the sequence
   number of the received packet is less than or equal to the sequence
   number of the last successfully decapsulated packet. The sequence
   number of a received message is considered less than or equal to the
   last successfully received sequence number if its value lies in the
   range of the last received sequence number and the preceding 2**31-1
   values, inclusive.

   If the received packet is an in-sequence packet, it is successfully
   decapsulated. An in-sequence packet is one with a sequence number
   exactly 1 greater than (modulo 2**32) the last successfully
   decapsulated packet. If the received packet is neither an in-sequence
   nor an out-of-sequence packet, it indicates a sequence number gap.
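
   The three cases described in this section (in-sequence,
   out-of-sequence, and sequence number gap) can be sketched with
   modulo-2**32 arithmetic as follows. This is a minimal illustration;
   the function name and the returned labels are assumptions of this
   sketch and do not appear in the draft.

```python
MOD = 1 << 32   # sequence numbers are counted modulo 2**32
HALF = 1 << 31  # size of the "less than or equal" window


def classify(seq, last):
    """Classify seq against the last successfully decapsulated number."""
    if (seq - last) % MOD == 1:
        # Exactly one greater than the last number, modulo 2**32.
        return "in-sequence"
    if (last - seq) % MOD <= HALF - 1:
        # seq is `last` itself or one of the preceding 2**31-1 values.
        return "out-of-sequence"
    # Anything else is ahead of `last` by more than one.
    return "gap"
```

   With the receiver state initialized to (2**32)-1, the first datagram
   (sequence number 0) is classified as in-sequence, which is why that
   initial value is used.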
   The receiver may perform a small amount of buffering in an attempt to
   recover the original sequence of transmitted packets. In this case,
   the packet may be placed in a buffer sorted by sequence number. If an
   in-sequence packet is received and successfully decapsulated, the
   receiver should consult the head of this buffer to see if the next
   in-sequence packet has already been received. If so, the receiver
   should decapsulate it as well as the following in-sequence packets
   that may be present in the buffer. The "last successfully
   decapsulated sequence number" should then be set to the last packet
   that was decapsulated from the buffer.

   Under no circumstances should a packet wait more than
   OUTOFORDER_TIMER milliseconds in the buffer. If a packet has been
   waiting that long, the receiver MUST immediately traverse the buffer
   in sorted order, decapsulating packets (and ignoring any sequence
   number gaps) until there are no more packets in the buffer that have
   been waiting longer than OUTOFORDER_TIMER milliseconds. The "last
   successfully decapsulated sequence number" should then be set to the
   last packet so decapsulated.

   The receiver may place a limit on the number of packets in any
   per-flow-group buffer (packets with the same Flow Group ID field
   value belong to the same flow group). If a packet arrives that would
   cause the receiver to place more than MAX_PERFLOW_BUFFER packets into
   a given buffer, then the packet at the head of the buffer is
   immediately decapsulated regardless of its sequence number and the
   "last successfully decapsulated sequence number" is set to its
   sequence number. The newly arrived packet may then be placed in the
   buffer.

   The received packets of flows from the same Flow Group are in the
   same reorder sequence space. The source allocates sequence numbers in
   the order in which the packets are sent.
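
   The buffering rules described in this section can be sketched as a
   small per-flow-group reorder buffer. This is an illustration under
   stated assumptions, not a definitive implementation: the class and
   method names are hypothetical, out-of-sequence packets are silently
   dropped (the behaviour RFC 2890 recommends; this draft does not state
   it), the arriving packet is inserted before the size limit is
   checked, and buffered packets that become in-sequence after a forced
   release are delivered immediately.

```python
MOD = 1 << 32   # sequence numbers are counted modulo 2**32
HALF = 1 << 31


class ReorderBuffer:
    """Per-flow-group reorder buffer (illustrative sketch only)."""

    def __init__(self, timeout_ms, max_packets):
        self.timeout_ms = timeout_ms    # plays the role of OUTOFORDER_TIMER
        self.max_packets = max_packets  # plays the role of MAX_PERFLOW_BUFFER
        self.last = MOD - 1             # last successfully decapsulated number
        self.pending = {}               # seq -> (arrival time in ms, payload)

    def _ahead(self, seq):
        # Distance of seq ahead of the last decapsulated number, mod 2**32.
        return (seq - self.last) % MOD

    def _release_head(self, out):
        # Decapsulate the buffered packet closest to `last`, ignoring gaps.
        head = min(self.pending, key=self._ahead)
        out.append(self.pending.pop(head)[1])
        self.last = head

    def _drain(self, out):
        # Decapsulate buffered packets that have become in-sequence.
        nxt = (self.last + 1) % MOD
        while nxt in self.pending:
            out.append(self.pending.pop(nxt)[1])
            self.last = nxt
            nxt = (nxt + 1) % MOD

    def receive(self, seq, payload, now_ms):
        """Handle one packet; return the payloads decapsulated as a result."""
        out = []
        if self._ahead(seq) == 1:             # in-sequence
            out.append(payload)
            self.last = seq
        elif (self.last - seq) % MOD < HALF:  # out-of-sequence
            return out                        # assumption: drop silently
        else:                                 # sequence number gap
            self.pending[seq] = (now_ms, payload)
            if len(self.pending) > self.max_packets:
                self._release_head(out)       # buffer full: flush the head
        self._drain(out)
        return out

    def expire(self, now_ms):
        """Flush from the head while any packet has waited past the timer."""
        out = []
        while any(now_ms - arrived > self.timeout_ms
                  for arrived, _ in self.pending.values()):
            self._release_head(out)
        self._drain(out)
        return out
```

   A receiver would keep one such object per (src IP, dest IP, Flow
   Group ID) and call expire() periodically. The dictionary lookup of
   the head is linear in the buffer size, which is acceptable for the
   small buffers described above; a sorted structure could replace it.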
   If the sequence number wraps, the source allocates from 0 again.

5.2  Packet Reordering Capability Discovery

   The reorder function on the destination needs certain resources. For
   example, there is a reorder queue corresponding to each Group ID
   (Flow Group ID plus the source IP address). For some
   resource-constrained chips, such as switch chips, the number of
   queues is limited. Therefore, it is important not to exceed the
   capability of the destination when assigning Group IDs at the source.
   This requires that the source understands the capability of the
   destination. There are several solutions, such as static
   configuration or direct signaling between the two ends. In the
   following situations, a capability notification needs to be sent to
   the peer:

   1. When the source communicates with the destination for the first
      time.
   2. When a packet is received from the peer for the first time.
   3. When a capability notification is received from the source.
   4. When the Group ID used by the peer exceeds the local capability.

   In the above cases, the destination needs to notify the source of its
   capability (the reorder queues assigned to the peer). When receiving
   the capability notification from the destination, the source needs to
   tune its Group ID allocation mechanism according to the capability of
   the destination, to ensure that the number of Group IDs does not
   exceed the number of reordering queues allocated to the source.

   When the number of Group IDs exceeds the local capability, one of the
   following two actions can be taken. Which option is selected is not
   covered in this draft.

   1. Discard the Geneve packet for the Group ID that exceeds the local
      capability.

   2. Remove the Geneve encapsulation without performing reordering and
      pass the packet to the higher layer protocol. For higher layer
      protocols that can tolerate a certain degree of out-of-order
      packets (such as TCP), the message may be processed correctly.

   When the Group ID exceeds the local capability, the destination sends
   a notification of the reordering capability to the source. To prevent
   capability notifications from being sent too frequently, a
   notification suppression mechanism is needed. When the destination
   sends a capability notification to the source, it enters a
   suppression cycle. The destination will not send another capability
   notification to the source until the suppression cycle ends. The
   suppression period is longer than the RTT between the two nodes.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Virtual Network Identifier (VNI)       |    Reserved   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       Option Class = GFP      | Type=Capacity |R|R|R| Length  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          MAX GROUP ID                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                Capability notification message format

   Length = 1 (4 bytes)

   MAX GROUP ID is a four-byte field. MAX GROUP ID indicates the maximum
   Group ID assigned to the destination. The Group ID allocated by the
   source must be limited to the range 0 to (MAX GROUP ID - 1).

6  Security Considerations

   This document describes a Geneve option that introduces a Flow Group
   ID and a Sequence Number to reorder packets. Through the Sequence
   Number field, it is possible to inject packets with an arbitrary
   Sequence Number and launch a Denial of Service attack. This is a
   general security issue that is described in the Geneve security
   requirements [5].

   In order to protect against such attacks, IPsec could be used to
   protect the Geneve header and the tunneled payload. Any common Geneve
   security mechanism also applies to this draft.

7  IANA Considerations

   IANA is requested to allocate a Geneve "option class" number for GFP
   (Geneve Forwarding Policy):

   +---------------+-------------+---------------+
   | Option Class  | Description | Reference     |
   +---------------+-------------+---------------+
   | x             | GFP_ID      | This document |
   +---------------+-------------+---------------+

8  References

   [1] J. Gross, Ed., I. Ganga, Ed., T. Sridhar, Ed., "Generic Network
       Virtualization Encapsulation", [I-D.ietf-nvo3-geneve]

   [2] Jiaxin Cao, et al., "Per-packet Load-balanced, Low-Latency
       Routing for Clos-based Data Center Networks", CoNEXT'13

   [3] Mohammad Alizadeh, et al., "CONGA: Distributed Congestion-Aware
       Load Balancing for Datacenters", Sigcomm'14

   [4] G. Dommety, "Key and Sequence Number Extensions to GRE", RFC
       2890, September 2000

   [5] D. Migault, S. Boutros, D. Wing, S. Krishnan, "Geneve Protocol
       Security Requirement",
       [I-D.draft-mglt-nvo3-geneve-security-requirements-03]

Authors' Addresses

   Yolanda Yu
   Huawei Technologies Co., Ltd.
   Email: yolanda.yu@huawei.com

   Jianglong Wang
   China Telecom
   Email: wangjl1.bri@chinatelecom.cn