idnits 2.17.1

draft-ietf-tsvwg-tunnel-congestion-feedback-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 18 has weird spacing: '...A basic model...'

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document date (September 16, 2015) is 3144 days in the past. Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'ConEx' is mentioned on line 105, but not defined

  == Unused Reference: 'RFC3168' is defined on line 490, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC6040' is defined on line 500, but no explicit
     reference was found in the text

  == Unused Reference: 'CONEX' is defined on line 512, but no explicit
     reference was found in the text

  == Outdated reference: A later version (-15) exists of
     draft-ietf-tsvwg-circuit-breaker-01

     Summary: 0 errors (**), 0 flaws (~~), 8 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

2   Internet Engineering Task Force                                   X. Wei
3   INTERNET-DRAFT                                       Huawei Technologies
4   Intended Status: Informational                                     L.Zhu
5   Expires: March 19, 2016                              Huawei Technologies
6                                                                     L.Deng
7                                                               China Mobile
8                                                         September 16, 2015

10                         Tunnel Congestion Feedback
11               draft-ietf-tsvwg-tunnel-congestion-feedback-00

13  Abstract

15     This document describes a mechanism to calculate congestion of a
16     tunnel segment based on RFC 6040 recommendations, and a feedback
17     protocol by which to send the measured congestion of the tunnel from
18     egress to ingress . A basic model for measuring tunnel congestion
19     and feedback is described, and a protocol for carrying the feedback
20     data is outlined.

22  Status of this Memo

24     This Internet-Draft is submitted to IETF in full conformance with the
25     provisions of BCP 78 and BCP 79.

27     Internet-Drafts are working documents of the Internet Engineering
28     Task Force (IETF), its areas, and its working groups. Note that
29     other groups may also distribute working documents as
30     Internet-Drafts.

32     Internet-Drafts are draft documents valid for a maximum of six months
33     and may be updated, replaced, or obsoleted by other documents at any
34     time. It is inappropriate to use Internet-Drafts as reference
35     material or to cite them other than as "work in progress."

37     The list of current Internet-Drafts can be accessed at
38     http://www.ietf.org/1id-abstracts.html

40     The list of Internet-Draft Shadow Directories can be accessed at
41     http://www.ietf.org/shadow.html

43  Copyright and License Notice

45     Copyright (c) 2015 IETF Trust and the persons identified as the
46     document authors. All rights reserved.

48     This document is subject to BCP 78 and the IETF Trust's Legal
49     Provisions Relating to IETF Documents
50     (http://trustee.ietf.org/license-info) in effect on the date of
51     publication of this document.
       Please review these documents
52     carefully, as they describe your rights and restrictions with respect
53     to this document. Code Components extracted from this document must
54     include Simplified BSD License text as described in Section 4.e of
55     the Trust Legal Provisions and are provided without warranty as
56     described in the Simplified BSD License.

58  Table of Contents

60     1. Introduction
61     2. Conventions
62     3. Congestion Information Feedback Models
63       3.1 Direct Model
64       3.2 Centralized Model
65     4. Congestion Level Measurement
66     5. Congestion Information Delivery
67       5.1 IPFIX Extensions
68         5.1.1 ce-cePacketTotalCount
69         5.1.2 ect0-nectPacketTotalCount
70         5.1.3 ect1-nectPacketTotalCount
71         5.1.4 ce-nectPacketTotalCount
72         5.1.5 ce-ect0PacketTotalCount
73         5.1.6 ce-ect1PacketTotalCount
74         5.1.7 ect0-ect0PacketTotalCount
75         5.1.8 ect1-ect1PacketTotalCount
76     6. Congestion Management
77     7. Security
78     8. IANA Considerations
79     9. References
80       9.1 Normative References
81       9.2 Informative References
82     10. Acknowledgements
83     Authors' Addresses

85  1. Introduction

87     In IP networks, persistent congestion (also called congestion collapse)
88     lowers transport throughput and wastes network resources.
89     Appropriate congestion control mechanisms are therefore critical to
90     prevent the network from falling into the persistent congestion
91     state. Currently, transport protocols such as TCP, SCTP, and DCCP have
92     built-in congestion control mechanisms, and even a
93     single transport protocol such as TCP can offer several different
94     congestion control mechanisms to choose from. All these congestion
95     control mechanisms are implemented on the host side, and there are
96     reasons why host-side congestion control alone is not sufficient to
97     keep the whole network away from persistent congestion. For
98     example: (1) a protocol's congestion control scheme may have
99     internal design flaws; (2) a protocol may be implemented
100    improperly in software; (3) some transport protocols provide no
101    congestion control at all.

103    To gain better control over network congestion, it is
104    necessary for the network side to perform some kind of traffic control.
105    For example, ConEx [ConEx] provides a method for network operators to
106    learn how much congestion traffic contributes, so that
107    congestion management actions can be taken based on this information.

109    Tunnels are widely deployed in various networks, including the public
110    Internet, datacenter networks, and enterprise networks. A tunnel
111    consists of an ingress, an egress, and a set of interior routers. For the
112    tunnel scenario, a tunnel-based mechanism which is different from
113    ConEx is introduced for network traffic control to keep the network
114    from persistent congestion. Here, the tunnel ingress implements
115    a congestion management function to control the traffic entering the
116    tunnel.

118    To perform congestion management, the ingress
119    must first obtain the congestion level inside the tunnel. Yet
120    the ingress cannot rely on locally visible traffic rates, because that
121    would require additional knowledge of downstream capacity and
122    topology, as well as cross traffic that does not pass through this
123    ingress.

125    This document provides a mechanism for feeding back the congestion
126    level inside the tunnel to the ingress. Using this mechanism, the egress
127    can feed the tunnel congestion level information it collects back to the
128    ingress. After receiving this information, the ingress can
129    perform congestion management according to network management policy.

131 2. Conventions
132    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
133    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
134    document are to be interpreted as described in RFC 2119 [RFC2119]

136 3. Congestion Information Feedback Models

138    Depending on the specific network deployment, there are two kinds of
139    feedback model: the direct model and the centralized model.

141 3.1 Direct Model
142                          Feedback
143         |-----------------------------------------|
144         |                                         |
145         |                                         |
146         |                                         V
147    +----------+           tunnel            +-----------+
148    |Egress    |==========================   |Ingress    |
149    |(Exporter)|                             |(Collector)|
150    +----------+                             +-----------+

152                   (a) Direct Feedback Model.

154    In the direct model, the egress feeds information directly to the
155    ingress: the egress collects network congestion level information and
156    feeds it back to the ingress for congestion management.
157    The ingress acts as both the decision point that decides how
158    to do congestion management and the action point that implements
159    the congestion management decision.

161 3.2 Centralized Model

163         Feedback  +-----------+
164         --------->|Controller |#####################
165         |         |(Collector)|                    #
166         |         +-----------+                    #
167         |                                          #
168    +----------+            tunnel            +-----V-+
169    |Egress    |   ===========================|Ingress|
170    |(Exporter)|                              +-------+
171    +----------+

173                   (b) Centralized Feedback Model

175    In the centralized model, the ingress only takes the role of action
176    point: it implements traffic control decisions made by another entity
177    named the "controller". After the egress has collected network
178    congestion level information, it feeds the information back to the
179    controller instead of the ingress. The controller then makes the
180    congestion management decision and sends it to the ingress to
181    implement.

183 4. Congestion Level Measurement

185    This section describes how to measure the congestion level in a tunnel.

187    There may be different approaches to packet loss detection for
188    different tunneling protocol scenarios. For instance, if there is a
189    sequence field in the tunneling protocol header, it is easy for the
190    egress to detect packet loss through gaps in the sequence number
191    space. Another approach is to compare the number of packets entering
192    the ingress with the number of packets arriving at the egress over the
193    same span of packets. This document focuses on the latter, which is
194    the more general approach.

196    If the routers support Explicit Congestion Notification (ECN), then
197    once a router's queue length exceeds a predefined threshold, the router
198    marks ECN-capable packets as Congestion Experienced (CE) or
199    drops Not-ECT packets with a probability proportional to the queue
200    length; if the queue overflows, all packets are dropped. If the
201    routers do not support ECN, then once a router's queue length exceeds a
202    predefined threshold, the router drops both ECN-capable
203    packets and Not-ECT packets with a probability proportional to
204    the queue length. It is assumed that all routers in the tunnel
       support ECN.

206    Faked ECN-capable transport (ECT) is used at the ingress to defer
207    packet loss to the egress. The basic idea of faked ECT is that, when
208    encapsulating packets, the ingress first marks the tunnel outer header
209    according to RFC 6040 and then re-marks the outer header of Not-ECT
210    packets as ECT. This yields three combinations of outer header ECN
211    field and inner header ECN field: CE|CE, ECT|N-ECT, ECT|ECT (in the
212    form of outer ECN|inner ECN).

214    When all interior routers support ECN, the network congestion
215    level can be indicated by the ratio of CE-marked packets and
216    the ratio of dropped packets; these two indicators are
217    complementary. If the congestion level in the tunnel is not
218    high, packets are marked as CE instead of being
219    dropped, and the congestion level is easy to calculate from
220    the ratio of CE-marked packets. If the congestion level is so high
221    that ECT packets are dropped, the packet loss ratio can be
222    calculated by comparing the total packets entering the ingress with the
223    total packets arriving at the egress over the same span of packets. If
224    packet loss is detected, it can be assumed that severe congestion has
225    occurred in the tunnel. Because loss is only ever a sign of serious
226    congestion, the loss ratio does not need to be measured accurately.
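
The measurement described above can be sketched in a few lines of Python.
This is only an illustrative sketch under stated assumptions, not part of
the mechanism itself: the function names (faked_ect_outer, delta,
congestion_level), the dictionary keys, and in particular the denominator
chosen for the CE-marking ratio are hypothetical. The sketch assumes RFC
6040 normal-mode encapsulation (outer ECN field copied from the inner one)
and treats ECT(0) and ECT(1) alike as "ECT".

```python
# Illustrative sketch only; every name below is hypothetical.

# Outer|inner combinations that can only appear if a router inside the
# tunnel set CE on the outer header.
TUNNEL_MARKED = ("CE|N-ECT", "CE|ECT")


def faked_ect_outer(inner):
    """Outer ECN field chosen by the ingress for a given inner ECN field."""
    outer = inner          # assumed RFC 6040 normal mode: copy inner to outer
    if outer == "N-ECT":   # faked ECT: re-mark a Not-ECT outer header as ECT
        outer = "ECT"
    return outer


def delta(prev, curr):
    """Per-type packet counts between two cumulative snapshots."""
    return {k: curr[k] - prev.get(k, 0) for k in curr}


def congestion_level(in_prev, in_curr, eg_prev, eg_curr):
    """Return (ce_mark_ratio, loss_ratio) for one span of packets."""
    sent = delta(in_prev, in_curr)
    rcvd = delta(eg_prev, eg_curr)
    sent_total = sum(sent.values())
    rcvd_total = sum(rcvd.values())
    if sent_total == 0:
        return 0.0, 0.0
    # Loss: packets counted at the ingress that never reached the egress.
    loss_ratio = max(sent_total - rcvd_total, 0) / sent_total
    # CE marks added inside the tunnel, relative to arriving packets that
    # entered the tunnel without CE (an assumed definition of the ratio).
    marked = sum(rcvd.get(k, 0) for k in TUNNEL_MARKED)
    unmarked_in = rcvd_total - rcvd.get("CE|CE", 0)
    ce_mark_ratio = marked / unmarked_in if unmarked_in > 0 else 0.0
    return ce_mark_ratio, loss_ratio


# Two consecutive cumulative snapshots from each tunnel endpoint.
in_prev = {"CE|CE": 10, "ECT|N-ECT": 100, "ECT|ECT": 200}
in_curr = {"CE|CE": 20, "ECT|N-ECT": 300, "ECT|ECT": 600}
eg_prev = {"CE|CE": 10, "ECT|N-ECT": 100, "ECT|ECT": 190,
           "CE|N-ECT": 0, "CE|ECT": 10}
eg_curr = {"CE|CE": 20, "ECT|N-ECT": 280, "ECT|ECT": 550,
           "CE|N-ECT": 20, "CE|ECT": 50}

print(congestion_level(in_prev, in_curr, eg_prev, eg_curr))  # (0.1, 0.0)
```

Because both endpoints report cumulative counters, the differences are
taken between any two snapshots that were actually received, so a lost
snapshot only widens the span of packets being compared.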

228    The basic procedure of congestion level measurement is as follows:

230         +-------+                  +------+
231         |Ingress|                  |Egress|
232         +-------+                  +------+
233             |                         |
234    +----------------+                 |
235    |cumulative count|                 |
236    +----------------+                 |
237             |                         |
238             |                         |
239             |------------------------>|
240             |                         |
241             |<------------------------|
242             |                         |
243             |                         |

245           (a) Direct model feedback procedure

247    +----------+    +-------+                  +------+
248    |Controller|    |Ingress|                  |Egress|
249    +----------+    +-------+                  +------+
250         |              |                         |
251         |     +----------------+                 |
252         |     |cumulative count|                 |
253         |     +----------------+                 |
254         |              |                         |
255         |              |                         |
256         |              |------------------------>|
257         |              |                         |
258         |                                        |
259         |                                        |
260         |                                        |
261         |                                        |
262         |<---------------------------------------|
263         |                                        |
264         |                                        |
265         |                                        |

267           (b) Centralized model feedback procedure

269    The ingress encapsulates packets and marks the outer header according
270    to faked ECT as described above. The ingress cumulatively counts packets
271    for three types of ECN combination (CE|CE, ECT|N-ECT, ECT|ECT) and
272    regularly sends a message carrying the cumulative packet count of each
273    type of ECN combination to the egress. When each message arrives, the
274    egress, which cumulatively counts packets coming from the ingress, adds
275    its own packet counts of each type of ECN combination (CE|CE,
276    ECT|N-ECT, CE|N-ECT, CE|ECT, ECT|ECT) to the message and returns the
277    whole message either to the ingress or to a central controller.

279    Packets can be counted at the granularity of all traffic
280    from the ingress to the egress to learn about the overall congestion
281    status of the path between the ingress and the egress. Counting
282    can also be at the granularity of an individual customer's traffic or a
283    specific set of flows to learn about their congestion contribution.

285 5. Congestion Information Delivery

287    As described above, the tunnel ingress needs to convey a message of
288    cumulative packet counts of each type of ECN combination to the tunnel
289    egress, and the tunnel egress also needs to feed the message of
290    cumulative packet counts of each type of ECN combination to the
291    ingress or central controller. This section describes how the messages
292    could be conveyed.

294    The message can travel along the same path as network data traffic,
295    referred to as in-band signaling, or take a different path from
296    network data traffic, referred to as out-of-band signaling. Because the
297    out-of-band scheme needs an additional separate path, which might limit
298    its deployment, only the in-band scheme is discussed here.

300    Because the message is transmitted in band, the message packet may
301    get lost in case of network congestion. To cope with such
302    loss, the packet count values are sent
303    as cumulative counters, so that if a message is lost, the next message
304    recovers the missing information.

306    IPFIX [RFC7011] is selected as a candidate protocol. SCTP is the
307    preferred transport for IPFIX. SCTP allows partially reliable
308    delivery [RFC3758], which ensures the feedback message will not be
309    blocked in case of packet loss due to network congestion.

311    When a message is sent from ingress to egress, the ingress acts as the
312    IPFIX exporter and the egress acts as the IPFIX collector; when a
313    message is sent from egress to ingress or controller, the egress acts as
314    the IPFIX exporter and the ingress or controller acts as the IPFIX
       collector.

316 5.1 IPFIX Extensions

318 5.1.1 ce-cePacketTotalCount
319    Description: The total number of incoming packets with CE|CE ECN
320    marking combination for this Flow at the Observation Point since the
321    Metering Process (re-)initialization for this Observation Point.

323    Abstract Data Type: unsigned64

325    Data Type Semantics: totalCounter

327    ElementId: TBD1

329    Status: current

331    Units: packets

333 5.1.2 ect0-nectPacketTotalCount

335    Description: The total number of incoming packets with ECT(0)|N-ECT
336    ECN marking combination for this Flow at the Observation Point since
337    the Metering Process (re-)initialization for this Observation Point.

339    Abstract Data Type: unsigned64

341    Data Type Semantics: totalCounter

343    ElementId: TBD2

345    Status: current

347    Units: packets

349 5.1.3 ect1-nectPacketTotalCount

351    Description: The total number of incoming packets with ECT(1)|N-ECT
352    ECN marking combination for this Flow at the Observation Point since
353    the Metering Process (re-)initialization for this Observation Point.

355    Abstract Data Type: unsigned64

357    Data Type Semantics: totalCounter

359    ElementId: TBD3

361    Status: current

363    Units: packets

365 5.1.4 ce-nectPacketTotalCount
366    Description: The total number of incoming packets with CE|N-ECT ECN
367    marking combination for this Flow at the Observation Point since the
368    Metering Process (re-)initialization for this Observation Point.

370    Abstract Data Type: unsigned64

372    Data Type Semantics: totalCounter

374    ElementId: TBD4

376    Status: current

378    Units: packets

380 5.1.5 ce-ect0PacketTotalCount

382    Description: The total number of incoming packets with CE|ECT(0) ECN
383    marking combination for this Flow at the Observation Point since the
384    Metering Process (re-)initialization for this Observation Point.

386    Abstract Data Type: unsigned64

388    Data Type Semantics: totalCounter

390    ElementId: TBD5

392    Status: current

394    Units: packets

396 5.1.6 ce-ect1PacketTotalCount

398    Description: The total number of incoming packets with CE|ECT(1) ECN
399    marking combination for this Flow at the Observation Point since the
400    Metering Process (re-)initialization for this Observation Point.

402    Abstract Data Type: unsigned64

404    Data Type Semantics: totalCounter

406    ElementId: TBD6

408    Status: current

410    Units: packets
411 5.1.7 ect0-ect0PacketTotalCount

413    Description: The total number of incoming packets with ECT(0)|ECT(0)
414    ECN marking combination for this Flow at the Observation Point since
415    the Metering Process (re-)initialization for this Observation Point.

417    Abstract Data Type: unsigned64

419    Data Type Semantics: totalCounter

421    ElementId: TBD7

423    Status: current

425    Units: packets

427 5.1.8 ect1-ect1PacketTotalCount

429    Description: The total number of incoming packets with ECT(1)|ECT(1)
430    ECN marking combination for this Flow at the Observation Point since
431    the Metering Process (re-)initialization for this Observation Point.

433    Abstract Data Type: unsigned64

435    Data Type Semantics: totalCounter

437    ElementId: TBD8

439    Status: current

441    Units: packets

443 6. Congestion Management

445    After the tunnel ingress (or controller) receives congestion level
446    information, congestion management actions can be taken based
447    on it; e.g., if the congestion level is higher than a
448    predefined threshold, action can be taken to reduce the
449    congestion level.

451    The design of network-side congestion management SHOULD take the
452    host-side e2e congestion control mechanism into consideration, which
453    means that congestion management needs to avoid interfering with e2e
454    congestion control. For instance, congestion management action must
455    be delayed by more than a worst-case global RTT; otherwise, tunnel
456    traffic management will not give normal e2e congestion control enough
457    time to do its job, and the system could become unstable.

459    A detailed description of congestion management is out of scope for
460    this document. As examples, congestion management mechanisms such as
461    circuit breakers [CB] and congestion policing [CP] could be applied. A
462    circuit breaker is an automatic mechanism that estimates congestion and
463    terminates flow(s) when persistent congestion is detected, to prevent
464    network congestion collapse. Congestion policing is used in data
465    centers to limit the amount of congestion any tenant can cause,
466    according to the congestion information in the tunnels.

468 7. Security

470    This document describes the tunnel congestion calculation and
471    feedback. For feeding back congestion, the security mechanisms of IPFIX
472    are expected to be sufficient. No additional security concerns are
473    expected.

475 8. IANA Considerations

477    This document defines a set of new IPFIX Information Elements (IEs).
478    A new registry for these IE identifiers is needed.

480    TBD1~TBD8.

482 9. References

484 9.1 Normative References

486    [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
487              Requirement Levels", BCP 14, RFC 2119, March 1997.

490    [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
491              of Explicit Congestion Notification (ECN) to IP",
492              RFC 3168, September 2001.

495    [RFC3758] Stewart, R., Ramalho, M., Xie, Q., Tuexen, M., and P.
496              Conrad, "Stream Control Transmission Protocol (SCTP)
497              Partial Reliability Extension", RFC 3758, May 2004.

500    [RFC6040] Briscoe, B., "Tunnelling of Explicit Congestion
501              Notification", RFC 6040, November 2010.

504    [RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
505              "Specification of the IP Flow Information Export (IPFIX)
506              Protocol for the Exchange of Flow Information", STD 77,
507              RFC 7011, September 2013.

510 9.2 Informative References

512    [CONEX]   Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
513              Concepts, Abstract Mechanism and Requirements",
514              draft-ietf-conex-abstract-mech-13, October 24, 2014.

516    [CB]      Fairhurst, G., "Network Transport Circuit Breakers",
517              draft-ietf-tsvwg-circuit-breaker-01, April 02, 2015.

519    [CP]      Briscoe, B. and M. Sridharan, "Network Performance Isolation
520              in Data Centres using Congestion Policing",
521              draft-briscoe-conex-data-centre-02, February 14, 2014.

523 10. Acknowledgements

525    Thanks to Bob Briscoe for his insightful suggestions on the basic
526    mechanisms of congestion information collection and many other useful
527    comments. Thanks to David Black for his useful technical suggestions.
528    Also, thanks to Anthony Chan and John Kaippallimalil for their careful
529    reviews.

531 Authors' Addresses

533    Xinpeng Wei
534    Beiqing Rd. Z-park No.156, Haidian District,
535    Beijing, 100095, P. R. China
536    E-mail: weixinpeng@huawei.com

538    Zhu Lei
539    Beiqing Rd. Z-park No.156, Haidian District,
540    Beijing, 100095, P. R. China
541    E-mail: lei.zhu@huawei.com

543    Lingli Deng
544    Beijing, 100095, P. R. China
545    E-mail: denglingli@gmail.com