INTERNET-DRAFT                                                     Y. Yu
Intended Status: Standards Track                     HUAWEI Technologies
Expires: May 4, 2017                                    October 31, 2016


          Layer 3 Quantized Congestion Notification (L3QCN)
                      in the Converged Network
                       draft-yu-tsvwg-l3qcn-00

Abstract

   Demand for lossless, low-latency networks in the modern datacenter
   is growing with the proliferation of demanding applications.
   Congestion control schemes such as CN, PFC, and ETS, introduced by
   IEEE 802.1, focus on the L2 network domain, while current TCP/IP
   stacks cannot meet these requirements on L3 or above networks.  This
   draft introduces L3QCN (Layer 3 Quantized Congestion Notification),
   an end-to-end congestion control scheme that adopts QCN and DCQCN on
   the L2 network.  It specifies protocols, procedures, and managed
   objects to support congestion control in the datacenter network.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 4, 2017.

Copyright and License Notice

   Copyright (c) 2016 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1  Introduction
     1.1  Terminology
   2  Current Congestion Control method
     2.1  QCN Introduction
       2.1.1  QCN Technical Solution
       2.1.2  The limitation of QCN
     2.2  Introduction of DCQCN
       2.2.1  DCQCN technical solution
       2.2.2  The limitation of DCQCN
   3  Layer3 QCN
     3.1  L3QCN Introduction
     3.2  Use case of L3QCN
       3.2.1  A hybrid method with QCN
       3.2.2  L3QCN in CLOS fat-tree
   4  Conclusion
   5  Security Considerations
   6  IANA Considerations
   7  References
     7.1  Normative References
     7.2  Informative References
   Authors' Addresses

1  Introduction

   Currently, there are three classes of traffic in the DC network:

   1) Storage traffic (lossless)
   2) High-performance computing traffic (low latency)
   3) Ethernet traffic (tolerates some packet loss and latency)

   Traditional DC networks carry each class of traffic on a separate
   network, an approach that works in small-scale DCs.  As DCs grow, an
   alternative is to carry all of the traffic over Ethernet by applying
   congestion control.  IEEE has introduced the following
   specifications:

   1. Enhanced Transmission Selection (ETS) [1]: When the offered load
   in a traffic class does not use its allocated bandwidth, ETS allows
   other traffic classes to use the available bandwidth.  This prevents
   a burst in one traffic class from affecting the other classes,
   provides a minimum guaranteed bandwidth to every traffic class, and
   makes it easier for multiple classes to coexist in one network.

   2. Priority-based Flow Control (PFC) [2]: Data Center Bridging
   networks (bridges and end nodes) are characterized by a limited
   bandwidth-delay product and a limited hop count.  A traffic class is
   identified by the VLAN tag priority value.  PFC is intended to
   eliminate frame loss due to congestion.  It makes the storage
   traffic lossless without affecting the other two traffic classes
   when all three classes coexist on the Ethernet.

   3. Quantized Congestion Notification (QCN) [3]: This mechanism
   enables bridges to signal congestion information to end stations
   capable of transmission rate limiting, in order to avoid frame loss.
   It avoids the latency increase caused by flow control or packet
   retransmission and so achieves higher network throughput.

   This draft introduces the L3QCN method to resolve the congestion
   problem in the converged datacenter network.
   Each class of traffic is configured with a corresponding priority.
   A bridge applies congestion control policies according to the class
   of the congested traffic, which is identified by its priority.

1.1  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2  Current Congestion Control method

2.1  QCN Introduction

2.1.1  QCN Technical Solution

   QCN is defined in IEEE 802.1Qau and uses two types of Ethernet
   frame.  The first is the data frame with a CN-TAG, shown in Figure 1.
   A Converged Network Adapter (CNA) that supports QCN sends CN-tagged
   frames when it connects to the network domain.  Such a frame differs
   from a normal frame in the CN-TAG field of the Ethernet header,
   which includes the RPID (also known as the FLOW-ID).  The RPID
   uniquely identifies each flow sent by the adapter.  When congestion
   occurs, the bridge sends a CNM frame (the second frame type,
   described below) to notify the source node to slow down this flow.
   The FLOW-ID of the source frame is encapsulated in the CNM frame.
   When the adapter receives the CNM frame, it reduces the transmission
   rate of the identified flow, so that the specific traffic is
   controlled precisely.
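
   The per-flow reaction described above can be sketched as follows.
   This is a minimal illustration, not part of the QCN specification
   text quoted in this draft: the class name ReactionPoint, the gain
   GD = 1/128, the rate floor, and the multiplicative decrease rule are
   assumptions chosen so that a maximal 6-bit feedback value roughly
   halves the rate.  The only elements taken from the text are that the
   adapter keeps state per RPID (FLOW-ID) and slows only the flow named
   in the CNM.

```python
from dataclasses import dataclass, field

# Assumed gain: with a 6-bit QntzFb (0..63), GD = 1/128 caps a single
# decrease at roughly one half. Illustrative value, not from this draft.
GD = 1 / 128
MIN_RATE_MBPS = 10.0  # hypothetical floor so a flow is never cut to zero


@dataclass
class ReactionPoint:
    """Per-flow rate limiter in the adapter, keyed by RPID (FLOW-ID)."""

    line_rate_mbps: float
    rates: dict = field(default_factory=dict)

    def rate_of(self, rpid: int) -> float:
        # A flow that never received a CNM still sends at line rate.
        return self.rates.get(rpid, self.line_rate_mbps)

    def on_cnm(self, rpid: int, qntz_fb: int) -> float:
        """Reduce the rate of the single flow identified in the CNM."""
        if not 0 <= qntz_fb <= 63:
            raise ValueError("QntzFb is a 6-bit field")
        new_rate = max(self.rate_of(rpid) * (1 - GD * qntz_fb), MIN_RATE_MBPS)
        self.rates[rpid] = new_rate
        return new_rate


rp = ReactionPoint(line_rate_mbps=10_000.0)
rp.on_cnm(rpid=0xA878, qntz_fb=32)  # only flow 0xA878 is slowed
```

   Keying the state by RPID is what lets the adapter throttle one flow
   while its other flows continue at their current rates.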

        Data Frame                           CNM
   +------------------+        +------------------------------+
   | DA (6 bytes)     |        | DA = SampledFrame.SA         |
   +------------------+        +------------------------------+
   | SA (6 bytes)     |        | SA = Switch.QCN.MACSA        |
   +------------------+        +------------------------------+
   | S-TAG (4 bytes)  |        | S-TAG                        |
   +------------------+        +------------------------------+
   | C-TAG (4 bytes)  |        | C-TAG                        |
   +------------------+        +------------------------------+
   | CN-TAG (4 bytes) |---+    | CN-TAG = SampledFrame.CN-TAG |
   +------------------+   |    +------------------------------+
   | MSDU             |   |    | CNM Payload                  |
   +------------------+   |    +------------------------------+
                          |
                          v
    +---------------------+----------------+
    | EtherType (2 bytes) | RPID (2 bytes) |
    +---------------------+----------------+

                 Figure 1. QCN data frame and CNM

               CNM Payload
   +------------------------------------+
   | Version (4 bits)                   |
   +------------------------------------+
   | Reserved (6 bits)                  |
   +------------------------------------+
   | QntzFb (6 bits)                    |
   +------------------------------------+
   | CPID (64 bits)                     |
   +------------------------------------+
   | QOffset (16 bits)                  |
   +------------------------------------+
   | QDelta (16 bits)                   |
   +------------------------------------+
   | Encapsulated Priority (16 bits)    |
   +------------------------------------+
   | Encapsulated MAC-DA (48 bits)      |
   +------------------------------------+
   | Encapsulated Frame Length = 64     |
   +------------------------------------+
   | First 64 bytes of the Sampled Data |
   | Frame MSDU                         |
   +------------------------------------+

                 Figure 2. CNM payload

   The fields of the CNM payload shown in Figure 2 are:

   Field 1: Version of the CNM message (4 bits).

   Field 2: Reserved (6 bits).

   Field 3: QntzFb, the quantized feedback of the CNM message (6 bits).

   Field 4: Congestion Point Identifier (CPID, 8 bytes).
   To ensure that the identifier is unique, the MAC address is used as
   the upper 6 bytes.  The lower 2 bytes identify the different ports
   or priority classes within the same device.

   Field 5: QOffset (2 bytes).  The current number of available bytes
   in the sending queue of the congestion point (CP).

   Field 6: QDelta (2 bytes).  The difference in the available bytes of
   the CP between two points in time.

   Field 7: Encapsulated priority (2 bytes).  The upper 3 bits of the
   first byte carry the priority of the CNM frame.  All other bits are
   0.

   Field 8: Encapsulated destination MAC address (6 bytes).  The
   destination MAC address of the frame that triggered the CNM.

   Field 9: Encapsulated MSDU length (2 bytes).  The length of the
   encapsulated MSDU.

   Field 10: Encapsulated MSDU (64 bytes).  The first 64 bytes of the
   MSDU of the sampled data frame.

2.1.2  The limitation of QCN

   During congestion, the bridge needs to encapsulate the FLOW-ID
   (taken from the Ethernet frame header) in the CNM.  It then sets the
   destination MAC address of the CNM to the source MAC address of the
   congested frame, so that the CNM can be sent back to the sending
   server.  The sending server reduces the rate of the flow identified
   by the FLOW-ID carried in the CNM.  This characteristic limits QCN
   to Ethernet (Layer 2 of the OSI model).  Because the Ethernet header
   is rewritten at every routed hop in an IP network, the FLOW-ID and
   the MAC address of the sending server are lost.  A downstream bridge
   therefore cannot create the CNM and send it back to the sending
   server, so QCN cannot support Layer 3 networking.

2.2  Introduction of DCQCN

2.2.1  DCQCN technical solution

   DCQCN [4] is a congestion control solution proposed by Microsoft for
   the DC network domain.  DCQCN is mainly deployed in RoCEv2
   scenarios.  The CP (Congestion Point, a bridge) marks datagrams with
   CN (congestion notification) with a probability that depends on the
   degree of congestion.
   After a marked datagram reaches the NP (Notification Point, the
   receiving server), the NP constructs a CNP (Congestion Notification
   Packet) and sends it to the RP (Reaction Point, the sending server).
   The RP reduces or increases its transmission rate according to a
   dedicated algorithm that is similar to QCN.

2.2.2  The limitation of DCQCN

   In DCQCN, the CP sets the ECN (Explicit Congestion Notification) [5]
   mark during congestion and forwards the datagram to the NP.  The NP
   then constructs a CNP to notify the RP.  As a result, the reaction
   is not timely (the control loop delay is large).  If the congestion
   occurs at an upstream hop, for example at the TOR, the notification
   incurs the delay of 9 more hops than a direct response would.

3  Layer3 QCN

3.1  L3QCN Introduction

   L3QCN is a technical solution to the congestion problem in the
   converged datacenter network.  Each class of traffic is configured
   with a corresponding priority.  A bridge applies congestion control
   policies according to the class of the congested traffic, which is
   identified by its priority.

3.2  Use case of L3QCN

3.2.1  A hybrid method with QCN

   Priority 5 is assigned to the traffic sent by QCN servers.  When the
   queue buffer for priority 5 exceeds the defined threshold, the
   bridge sends the congestion information back to the access TOR.  The
   TOR and the host can reach each other over Ethernet, which forms an
   L2 domain, so standard QCN is performed there.  The access TOR
   converts the congestion information into a standard CNM frame and
   sends it to the QCN server, which performs the congestion control.

   Priority 7 is assigned to the traffic sent by RoCEv2 servers.  When
   the queue buffer for priority 7 exceeds the defined threshold, the
   CP determines the key flow causing the congestion and constructs a
   standard CNP.  The RoCEv2 server reduces its transmission rate
   according to the probability of CNP reception.
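
   The priority-based choice of policy described above can be sketched
   as a small decision function.  This is only an illustration under
   stated assumptions: the function name congestion_action, the action
   strings, and the byte-based threshold comparison are hypothetical.
   Only the two priority values (5 for QCN servers, 7 for RoCEv2
   servers) and the two reactions come from the text.

```python
QCN_PRIORITY = 5     # priority assigned to traffic from QCN servers
ROCEV2_PRIORITY = 7  # priority assigned to traffic from RoCEv2 servers


def congestion_action(priority: int, queue_bytes: int, threshold_bytes: int) -> str:
    """Pick the reaction of a bridge for one priority queue.

    The returned strings are hypothetical labels for the two behaviours
    described in this section.
    """
    if queue_bytes <= threshold_bytes:
        return "forward"            # no congestion on this queue
    if priority == QCN_PRIORITY:
        return "notify-access-tor"  # access TOR then emits a standard CNM
    if priority == ROCEV2_PRIORITY:
        return "send-cnp"           # CP builds a standard CNP itself
    return "forward"                # other classes are not handled here


print(congestion_action(5, queue_bytes=200_000, threshold_bytes=150_000))
print(congestion_action(7, queue_bytes=200_000, threshold_bytes=150_000))
```

   Keeping the decision per priority queue matches the text: the bridge
   does not inspect the flow type directly, it infers it from the
   priority of the congested queue.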

3.2.2  L3QCN in CLOS fat-tree

   The L3QCN control steps are as follows:

   1) A datagram sent by a QCN server enters the access TOR.  The
   access TOR first saves the source MAC address, FLOW-ID, VLAN tag,
   and IP 5-tuple in a local table, as shown in Table 1.  The TOR then
   performs normal routing.

   Src IP         Dst IP        Src   Dst   Proto  MAC SA        Flow    VLAN
                                Port  Port  -col                 ID      TAG
   192.168.2.100  192.168.3.30  5678  21    6      0x01a4f5aefe  0xa878  100
   10.1.10.2      10.2.20.1     8957  21    6      0xfd16783acd  0xc9a0  1024
   192.168.2.100  10.3.50.1     2345  80    6      0x0a25364101  0x0ac9  3
   200.1.2.3      100.2.3.4     2567  47    17     0xed16d8ea0a  0x37a0  90

                 Table 1. FLOW-ID Mapping Table

   2) As shown in Figure 3, an incast flow causes congestion.  T4
   (TOR#4) detects that a certain queue is congested and exceeds the
   threshold.  It distinguishes the flow type according to the priority
   of the queue.

   [Figure 3 topology: SPINE#1 and SPINE#2 at the top; AGG#1 to AGG#4
   below them; TOR#1 to TOR#4 below the AGG layer; SERVER#1 to SERVER#4
   at the bottom.  The CP is located in TOR#4.]

                 Figure 3. Incast flow model

   3) If the flow comes from a QCN server, T4 constructs a self-defined
   CNM that includes the 5-tuple and the congestion indications defined
   in the QCN specification (such as QntzFb, CPID, QOffset, and
   QDelta), encapsulated in IP and UDP.  The UDP header uses a specific
   port number so that the TOR can recognize the QCN frame.
   Alternatively, a reserved bit in the IP header can indicate the
   frame type.  The destination IP address is set to the source IP
   address of the flow, which ensures that the CNM can be routed to the
   access TOR.  It is preferable to construct the self-defined CNM on
   the basis of the standard CNM, which reduces the number of fields
   that must be rewritten and may improve performance.

   4) As shown in Figures 4 and 5, T2 (TOR#2) recognizes the
   self-defined CNM by its destination UDP port.  T2 maps the
   self-defined CNM to the standard CNM and sends it to H2 (SERVER#2).
   QCN is then performed in the L2 domain, because the adapter of H2
   supports standard QCN.

   [Figure 4 topology: same SPINE, AGG, TOR, and SERVER layers as
   Figure 3, with the CP in TOR#4.  The notification, labelled "Private
   CNP" in the figure, leaves TOR#4 and reaches TOR#2.  TOR#2 and
   SERVER#2 are enclosed in a box labelled "L2 domain", inside which the
   notification labelled "Standard CNP" is delivered to SERVER#2.]

                 Figure 4. Construct the self-defined CNM

   Please view in a fixed-width font such as Courier.

   Private CNM
   +-----------------------------------------------+
   | IP                                            |
   |   DIP: Flow's SIP                             |
   |   SIP: Flow's DIP                             |
   +-----------------------------------------------+
   | UDP                                           |
   |   DPORT: L3QCN Port                           |
   +-----------------------------------------------+
   | Payload (5-tuple, congestion extent metric)   |
   +-----------------------------------------------+
                  |
                  |  Private CNM converted to Standard CNM
                  v
   Standard CNM
   +------------------------------+
   | DA = SampledFrame.SA         |
   +------------------------------+
   | SA = Switch.QCN.MACSA        |
   +------------------------------+
   | S-TAG                        |
   +------------------------------+
   | C-TAG                        |
   +------------------------------+
   | CN-TAG = SampledFrame.CN-TAG |
   +------------------------------+
   | CNM Payload                  |
   +------------------------------+
                  |
                  |  CNM Payload
                  v
   +------------------------------------+
   | Version (4 bits)                   |
   +------------------------------------+
   | Reserved (6 bits)                  |
   +------------------------------------+
   | QntzFb (6 bits)                    |
   +------------------------------------+
   | CPID (64 bits)                     |
   +------------------------------------+
   | QOffset (16 bits)                  |
   +------------------------------------+
   | QDelta (16 bits)                   |
   +------------------------------------+
   | Encapsulated Priority (16 bits)    |
   +------------------------------------+
   | Encapsulated MAC-DA (48 bits)      |
   +------------------------------------+
   | Encapsulated Frame Length = 64     |
   +------------------------------------+
   | First 64 bytes of the Sampled Data |
   | Frame MSDU                         |
   +------------------------------------+

         Figure 5. Transfer Private CNM to Standard CNM

   5) As shown in Figure 6, T4 identifies which flow causes the
   congestion.  The CP constructs a standard CNP.  The adapter of the
   RoCEv2 server supports CNP and reduces its transmission rate
   according to the probability of CNP reception.

   [Figure 6 topology: same SPINE, AGG, TOR, and SERVER layers as
   Figure 3, with the CP in TOR#4.  The notification, labelled "Standard
   CNP" in the figure, leaves TOR#4, reaches TOR#2, and is delivered to
   SERVER#2.]

         Figure 6. CP construct the standard CNP based on RoCEv2

4  Conclusion

   L3QCN solves the problem that QCN cannot support L3 networks by
   extending the QCN mechanism across the L3 network.  No modification
   of the QCN servers is required.

   For RoCEv2 traffic, the CP sends the CNP itself when the congestion
   threshold is reached.  This shortens the control loop delay, which
   can reduce the depth of the queue buffer and the datagram delay and
   so improves the performance of the network.

5  Security Considerations

   N/A

6  IANA Considerations

   A specific UDP port number will be requested if required.

7  References

7.1  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

7.2  Informative References

   [1]  IEEE 802.1, "802.1Qaz Draft 2.5 - Enhanced Transmission
        Selection".

   [2]  IEEE 802.1, "802.1Qbb Draft 2.3 - Priority-based Flow Control".

   [3]  IEEE 802.1, "802.1Qau Draft 2.4 - Congestion Notification".

   [4]  Zhu, Y., et al., "Congestion Control for Large-Scale RDMA
        Deployments", SIGCOMM 2015.

   [5]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of
        Explicit Congestion Notification (ECN) to IP", RFC 3168.

Authors' Addresses

   Yolanda Yu
   101 SOFTWARE AV., YUHUATAI DIST., NANJING,
   JIANGSU, 210012, CHINA

   EMail: yolanda.yu@huawei.com