TSVWG                                                              Y. Li
Internet-Draft                                                   X. Zhou
Intended status: Informational                                    Huawei
Expires: September 12, 2019                               March 11, 2019

  LOOPS (Localized Optimization of Path Segments) Problem Statement and
                              Opportunities
              draft-li-tsvwg-loops-problem-opportunities-01

Abstract

   Various overlay tunnels are used in networks including WANs,
   enterprise campuses, and others.  End-to-end paths are partitioned
   into multiple segments using overlay tunnels to achieve better path
   selection, lower latency, and so on.  Traditional end-to-end
   transport layers respond to packet loss slowly, especially in
   long-haul networks: they either wait for some signal from the
   receiver indicating a loss and then retransmit from the sender, or
   rely on the sender's timeout, which is often quite long.

   LOOPS (Localized Optimization of Path Segments) attempts to provide
   local (non-end-to-end) in-network recovery to achieve better data
   delivery by making packet loss recovery faster.  In an overlay
   network scenario, LOOPS can be performed over path segments based on
   existing, or purposely created, overlay tunnels.

   This document illustrates the slow packet loss recovery problems
   LOOPS tries to solve in some use cases, and analyzes the impact of
   employing local in-network recovery as a LOOPS mechanism.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."
   This Internet-Draft will expire on September 12, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Cloud-Internet Overlay Network
     2.1.  Tail Loss or Loss in Short Flows
     2.2.  Packet Loss in Real Time Media Streams
     2.3.  Packet Loss and Congestion Control in Bulk Data Transfer
     2.4.  Multipathing
   3.  Features and Impacts to be Considered for LOOPS
     3.1.  Local Recovery and End-to-end Retransmission
       3.1.1.  OE to OE Measurement, Recovery and Multipathing
     3.2.  Congestion Control Interaction
     3.3.  Overlay Protocol Extensions
     3.4.  Summary
   4.  Security Considerations
   5.  IANA Considerations
   6.  Informative References
   Authors' Addresses

1.  Introduction

   Overlay tunnels are widely deployed in various networks, including
   long-haul WAN interconnection, enterprise wireless access networks,
   and others.  The end-to-end connection is partitioned into multiple
   path segments using overlay tunnels.  This serves a number of
   purposes, for instance selecting a better path over the WAN or
   delivering packets over heterogeneous networks, such as enterprise
   access and core networks.

   A reliable transport layer normally employs end-to-end
   retransmission mechanisms which also address congestion control
   [RFC0793] [RFC5681].  The sender either waits for the receiver to
   signal a packet loss or sets some form of retransmission timeout.
   For unreliable transport layer protocols such as RTP [RFC3550],
   optional and limited use of end-to-end retransmission is employed to
   recover from packet loss [RFC4585] [RFC4588].

   End-to-end retransmission to recover lost packets is slow,
   especially when the network is long haul.  When a path is
   partitioned into multiple path segments that are realized as overlay
   tunnels, LOOPS (Localized Optimization of Path Segments) tries to
   enhance transport over individual path segments instead of
   end-to-end.  Local in-network recovery is one example of a LOOPS
   mechanism that makes recovery from packet loss faster.  Figure 1
   shows a basic LOOPS usage scenario.

   This document illustrates the slow packet loss recovery problems
   LOOPS tries to solve in some use cases, and analyzes the impact of
   employing local in-network recovery as a LOOPS mechanism.

   Section 2 presents some of the issues and opportunities found in a
   Cloud-Internet overlay network that call for higher performance and
   more reliable packet transmission over best-effort networks.
   Section 3 describes the corresponding solution features and their
   impact on existing network technologies.
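   The latency benefit of local recovery can be made concrete with a
   small back-of-the-envelope sketch.  The RTT value, the segment
   count, and the even split of delay across segments below are all
   illustrative assumptions, not measurements from this document:

```python
# Rough comparison of loss recovery latency (illustrative numbers only).
# End-to-end retransmission costs about one extra end-to-end RTT, while
# local retransmission on one overlay segment costs about one segment RTT.

def e2e_recovery_ms(e2e_rtt_ms):
    """Extra delay when the end host retransmits: roughly one full RTT."""
    return e2e_rtt_ms

def local_recovery_ms(e2e_rtt_ms, num_segments):
    """Extra delay when one overlay segment retransmits, assuming the
    end-to-end RTT splits evenly across segments (a simplification)."""
    return e2e_rtt_ms / num_segments

rtt_ms = 200.0   # hypothetical long-haul end-to-end RTT
segments = 4     # hypothetical number of path segments

print(e2e_recovery_ms(rtt_ms))              # 200.0
print(local_recovery_ms(rtt_ms, segments))  # 50.0
```

   Under these assumptions, a segment-local retransmission recovers the
   loss in a quarter of the time an end-to-end retransmission would
   take; the gap widens as the end-to-end RTT grows.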
      ON=overlay node
      UN=underlay node

 +---------+                                               +---------+
 |   App   | <---------------- end-to-end ---------------> |   App   |
 +---------+                                               +---------+
 |Transport| <---------------- end-to-end ---------------> |Transport|
 +---------+                                               +---------+
 |         |                                               |         |
 |         |      +--+  path  +--+ path segment2 +--+      |         |
 |         |      |  |<-seg1->|  |<------------->|  |      |         |
 | Network | +--+ |ON|  +--+  |ON|  +--+  +----+ |ON|      | Network |
 |         |-|UN|-|  |--|UN|--|  |--|UN|--| UN |-|  |------|         |
 +---------+ +--+ +--+  +--+  +--+  +--+  +----+ +--+      +---------+
  End Host                                                  End Host
                  <--------------------------------->
                  LOOPS domain: path segment enables
                 optimization for better local transport

          Figure 1: LOOPS in Overlay Network Usage Scenario

1.1.  Terminology

   LOOPS:  Localized Optimization of Path Segments.  LOOPS includes
      local in-network (i.e. non-end-to-end) recovery functions, for
      instance loss detection and measurements.

   LOOPS Node:  A node supporting LOOPS functions.

   Overlay Node (ON):  A node having overlay functions (such as overlay
      protocol encapsulation/decapsulation, header modification, and
      TLV inspection) and LOOPS functions in the LOOPS overlay network
      usage scenario.  Both OR and OE are Overlay Nodes.

   Overlay Tunnel:  A tunnel with designated ingress and egress nodes,
      using some network overlay protocol as encapsulation, optionally
      with a specific traffic type.

   Overlay Path:  A channel within an overlay tunnel; the traffic
      transmitted on the channel passes through zero or more designated
      intermediate overlay nodes.  There may be more than one overlay
      path within an overlay tunnel when different sets of designated
      intermediate overlay nodes are specified.  An overlay path may
      contain multiple path segments.
      When an overlay tunnel contains only one overlay path without any
      intermediate overlay node specified, "overlay path" and "overlay
      tunnel" are used interchangeably.

   Overlay Edge (OE):  An edge node of an overlay tunnel.

   Overlay Relay (OR):  An intermediate overlay node on an overlay
      path.  An overlay path may not contain any OR.

   Path Segment:  The part of an overlay path between two neighboring
      overlay nodes.  It is used interchangeably with "overlay segment"
      in this document when the context emphasizes its overlay-
      encapsulated nature.  An overlay path may contain multiple path
      segments.  When an overlay path contains only one path segment,
      i.e. the segment is between two OEs, the path segment is
      equivalent to the overlay path.  It is also called "segment" for
      simplicity in this document.

   Overlay Segment:  See Path Segment.

   Underlay Node (UN):  A node not participating in any overlay network
      function.

2.  Cloud-Internet Overlay Network

   The Internet is a huge network of networks.  Interconnection of end
   devices over this global network is normally provided by ISPs
   (Internet Service Providers).  This huge ISP-provided network is
   considered the traditional Internet.  CSPs (Cloud Service Providers)
   connect their data centers using the Internet or via self-
   constructed networks/links.  This expands the Internet's
   infrastructure and, together with the original ISP infrastructure,
   forms the Internet underlay.

   NFV (Network Function Virtualization) further makes it easier to
   dynamically provision a new virtual node as a workload in a cloud
   for CPU/storage-intensive functions.  With the aid of various
   mechanisms such as kernel bypass and virtual I/O, forwarding by
   virtual nodes is becoming more and more efficient.
   The interconnections among purposely positioned virtual nodes and/or
   existing nodes with virtualization functions potentially form an
   overlay of the Internet.  It is called the Cloud-Internet Overlay
   Network (CION) in this document.

   CION makes use of overlay technologies to direct traffic through a
   specific overlay path, regardless of the underlying physical
   topology, in order to achieve better service delivery.  It purposely
   creates or selects overlay nodes (ONs) from providers.  By
   continuously measuring the delay of path segments and using these
   measurements as metrics for path selection, when the number of
   overlay nodes is sufficiently large, there is a high chance that a
   better path can be found [DOI_10.1109_ICDCS.2016.49].  Figure 2
   shows an example of an overlay path over large geographic distances.
   The path between two OEs (Overlay Edges) is an overlay path; the OEs
   are ON1 and ON4 in Figure 2.  The part of the path between two ONs
   is a path segment.  Figure 2 shows an overlay path with three
   segments, i.e. ON1-ON2-ON3-ON4.  An ON is usually a virtual node,
   though it does not have to be.  An overlay path transmits packets in
   some form of network overlay protocol encapsulation.  An ON has
   computing and memory resources that can be used for functions like
   packet loss detection, network measurement and feedback, and packet
   recovery.
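   The delay-based path selection described above can be sketched as
   follows.  The node names, topology, and delay figures are
   hypothetical, chosen only for illustration; a real deployment would
   feed continuously refreshed measurements into such a computation:

```python
# Minimal sketch of delay-based overlay path selection.  Each candidate
# overlay path is scored by the sum of its measured per-segment delays,
# and the path with the lowest total is preferred.  All node names and
# delay values below are hypothetical.

SEGMENT_DELAY_MS = {            # recent delay measurement per segment
    ("ON1", "ON2"): 40.0,
    ("ON2", "ON3"): 35.0,
    ("ON3", "ON4"): 30.0,
    ("ON1", "ON4"): 130.0,      # hypothetical "default" direct segment
}

def path_delay(path):
    """Total measured delay of an overlay path given as a node list."""
    return sum(SEGMENT_DELAY_MS[(a, b)] for a, b in zip(path, path[1:]))

def best_path(candidates):
    """Pick the candidate overlay path with the lowest total delay."""
    return min(candidates, key=path_delay)

candidates = [["ON1", "ON4"], ["ON1", "ON2", "ON3", "ON4"]]
print(best_path(candidates))    # the relayed path: 105.0 ms < 130.0 ms
```

   With these made-up measurements the three-segment relayed path beats
   the direct one, mirroring the observation above that a sufficiently
   dense set of overlay nodes often exposes a better path than the
   default.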
                         _____________
                        / domain 1    \
                       /               \
                    ___/ ------------- \
                   /                    \
        PoP1 ->--ON1                     \
         |        |                       ON4------>-- PoP2
         |        |          ON2       ___|__/          |
         |     \__|_        |->|      /     |           |
         |        | \|__|__/    \    /      |           |
         |        |  |      \____/ \__/     |           |
        \|/       |  |       _____          |           |
         |        |  |   ___/     \         |           |
         |        | \|/ /          \_____   |           |
         |        |  | / domain 2        \ /|\          |
         |        |  ||         ON3      |  |           |
         |        |  | \        |->|     |  |           |
         |        |  |  \_____|__|______/   |           |
         |       /|\ |        | \|/         |           |
         |        |  |        |  |          |           |
         |        |  |       /|\ |          |           |
       +--------------------------------------------------+
       | |        |  |        |  |          |  Internet   |
       | o--o     o---o->---o o---o->--o--o    underlay   |
       +--------------------------------------------------+

          Figure 2: Cloud-Internet Overlay Network (CION)

   We ran tests on 37 overlay nodes from multiple cloud providers
   globally.  Each pair of overlay nodes was used as sender and
   receiver.  When the traffic is not intentionally directed through
   any intermediate virtual node, we call the path that the traffic
   takes the _default path_ in the test.  When some virtual node is
   intentionally used as an intermediate node to forward the traffic,
   the path that the traffic takes is an _overlay path_ in the test.
   Probing with Ping packets every second for a week, the preliminary
   experiments showed that the delay of an overlay path was shorter
   than that of the default path in 69% of cases at the 99th
   percentile, with a 17.5% improvement at the 99th percentile.

   Lower delay does not necessarily mean higher throughput.  Different
   path segments may have different packet loss rates, and loss rate is
   another major factor impacting TCP throughput.  Based on some
   customer requirements, we set the target loss rate to be less than
   1% at the 99th and 99.9th percentiles, respectively.  The loss was
   measured between every two overlay nodes, i.e. every potential path
   segment.  Two thousand Ping packets were sent every 20 seconds
   between two overlay nodes for 55 hours.
   This preliminary experiment showed that the target loss rate was met
   in 44.27% and 29.51% of the cases at the 99th and 99.9th
   percentiles, respectively.

   Hence packet loss over an overlay segment is a key issue to be
   solved in CION.  In long-haul networks, the end-to-end
   retransmission of a lost packet can add an extra round-trip time.
   Such extra time is not acceptable in some cases.  As CION naturally
   consists of multiple overlay segments, LOOPS tries to leverage this
   to perform local optimization over a single hop between two overlay
   nodes.  ("Local" here is a concept relative to end-to-end; it does
   not mean such optimization is limited to LANs.)

   The following subsections present different scenarios using overlay
   paths composed of multiple segments, with a common need for local
   in-network loss recovery in best-effort networks.

2.1.  Tail Loss or Loss in Short Flows

   When the lost segments are at the end of a transaction, TCP's fast
   retransmit algorithm does not work, as there are no further ACKs to
   trigger it.  When the ACK for a given segment is not received within
   a certain amount of time called the retransmission timeout (RTO),
   the segment is resent [RFC6298].  The RTO can be as long as several
   seconds, so the recovery of lost segments triggered by the RTO is
   lengthy.  [I-D.dukkipati-tcpm-tcp-loss-probe] indicates that large
   RTOs make a significant contribution to the long tail in the latency
   statistics of short flows such as web page transfers.

   A short flow often completes in one or two RTTs.  Even when the loss
   is not a tail loss, it can add another RTT because of end-to-end
   retransmission (not enough packets are in flight to trigger fast
   retransmit).  In long-haul networks, this can cost tens or even
   hundreds of milliseconds of extra time.

   An overlay segment transmits the aggregated flows from ON to ON.
   As short flows are aggregated, the probability of tail loss over
   this specific overlay segment decreases compared to an individual
   flow.  The overlay segment is much shorter than the end-to-end path
   in a Cloud-Internet overlay network, hence loss recovery over an
   overlay segment is faster.

2.2.  Packet Loss in Real Time Media Streams

   The Real-time Transport Protocol (RTP) is widely used for
   interactive audio and video.  Packet loss degrades the quality of
   the received media.  When the latency tolerance of the application
   is sufficiently large, the RTP sender may use RTCP NACK feedback
   from the receiver [RFC4585] to trigger retransmission of lost
   packets before their playout time is reached at the receiver.

   In a Cloud-Internet overlay network, the end-to-end path delay can
   be hundreds of milliseconds.  End-to-end feedback-based
   retransmission may not be very useful when applications cannot
   tolerate one more RTT of this length.  Loss recovery over an overlay
   segment can then be used in scenarios where RTCP NACK-triggered
   retransmission is not appropriate.

2.3.  Packet Loss and Congestion Control in Bulk Data Transfer

   TCP congestion control algorithms such as Reno and CUBIC basically
   interpret packet loss as congestion experienced somewhere on the
   path.  When a loss is detected, the congestion window is decreased
   at the sender to slow down sending.  It has been observed that
   packet loss is not an accurate way to detect congestion in the
   current Internet [I-D.cardwell-iccrg-bbr-congestion-control].  On
   long-haul links, when the loss is caused by a non-persistent burst
   that is extremely short and fairly random, reducing the sending rate
   at the sender neither responds in time to the instantaneous path
   situation nor mitigates such bursts.
   On the contrary, reducing the window size at the sender
   unnecessarily or too aggressively harms the throughput of long-
   lasting application traffic such as bulk data transfer.

   The overlay nodes distributed over the path have computing
   capability; they are in a better position than the end hosts to
   deduce the underlying links' instantaneous situation by measuring
   delay, loss, or other metrics over a segment.  The shorter round-
   trip time over a path segment allows more accurate and immediate
   measurement of the maximum recent available bandwidth, the minimum
   recent latency, or trends of change.  ONs can further decide whether
   a sending rate reduction at the sender is necessary when a loss
   happens.  Section 3.2 discusses this in more detail.

2.4.  Multipathing

   As an overlay path may suffer from an impairment of the underlying
   network, two or more overlay paths between the same pair of ingress
   and egress overlay nodes can be combined for reliability purposes.
   During the transient period in which a network impairment is
   detected, replicating traffic over two paths can improve
   reliability.

   When two or more disjoint overlay paths are available from ON1 to
   ON2, as shown in Figure 3, different sets of traffic may use
   different overlay paths.  For instance, one path may be used for low
   latency and the other for higher bandwidth, or they can simply be
   used for load balancing and better bandwidth utilization.

   Two disjoint paths can usually be found by measurement, looking for
   segments whose latency changes have very low mathematical
   correlation.  When the number of overlay nodes is large, it is easy
   to find disjoint or partially disjoint segments.

   Different overlay paths may have different characteristics.  The
   overlay tunnel should allow each overlay path to handle packet loss
   based on its own path measurements.
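   The correlation heuristic for finding disjoint paths can be sketched
   as follows.  The latency samples and the 0.2 threshold are made up
   for illustration; they are not taken from the measurements reported
   in this document:

```python
# Sketch of the disjointness heuristic: two overlay segments whose
# latency time series are (nearly) uncorrelated are unlikely to share
# an underlying bottleneck.  All sample data below is hypothetical.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def likely_disjoint(lat_a, lat_b, threshold=0.2):
    """Treat two segments as disjoint candidates when the magnitude of
    their latency correlation is low (threshold chosen arbitrarily)."""
    return abs(pearson(lat_a, lat_b)) < threshold

seg1 = [40, 42, 55, 41, 60, 43]   # latency samples in ms, hypothetical
seg2 = [41, 44, 58, 42, 63, 45]   # spikes track seg1: shared links?
seg3 = [30, 29, 31, 30, 29, 31]   # varies independently of seg1

print(likely_disjoint(seg1, seg2))  # False (highly correlated)
print(likely_disjoint(seg1, seg3))  # True  (nearly uncorrelated)
```

   In practice a longer sample window and a tuned threshold would be
   needed, but the principle matches the text above: low correlation in
   latency change suggests the segments do not share underlying links.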
                      ON-A
          +-----------o-----------------+
          |                             |
          |                             |
    A ----o ON1                 ON2 o---- B
          |                             |
          +-----------------------o-----+
                                 ON-B

            Figure 3: Multiple Overlay Paths

3.  Features and Impacts to be Considered for LOOPS

   LOOPS (Localized Optimization of Path Segments) tries to leverage
   the virtual nodes on a selected path to improve transport
   performance "locally" instead of end-to-end, as those nodes have
   partitioned the path into multiple segments.  With technologies like
   NFV (Network Function Virtualization) and virtual I/O, it is easier
   to add functions to virtual nodes, and even the forwarding on those
   virtual nodes is getting more efficient.  Some overlay protocol,
   such as VXLAN [RFC7348], GENEVE [I-D.ietf-nvo3-geneve], LISP
   [RFC6830], or CAPWAP [RFC5415], is assumed to be employed in the
   network.  In an overlay network usage scenario, LOOPS can extend a
   specific overlay protocol header to perform local measurement and
   local recovery functions, as in the example shown in Figure 4.

   +------------+------------+---------------+---------+---------+
   |Outer IP hdr|Overlay hdr |LOOPS extension|Inner hdr|payload  |
   +------------+------------+---------------+---------+---------+

             Figure 4: LOOPS Extension Header Example

   LOOPS uses a packet number space independent from that of the
   transport layer.  Acknowledgments should be generated from the
   receiving ON to the sending ON for packet loss detection and local
   measurement.  To reduce overhead, a negative ACK over each path
   segment is a good choice here.  A timestamp echo mechanism,
   analogous to TCP's Timestamp option, should be employed in band in
   the LOOPS extension to measure the local RTT and its variation over
   an overlay segment.  Local in-network recovery is performed.  The
   measurement over a segment is expected to give a hint on whether a
   lost (and locally recovered) packet was caused by congestion.
   Such a hint can then be fed back to the end-host sender, for example
   by using ECN Congestion Experienced (CE) markings.  It tells the
   end-host sender whether a congestion window adjustment is necessary.
   LOOPS normally works on an overlay segment that aggregates the same
   type of traffic, for instance TCP traffic, or at a finer
   granularity, such as TCP throughput-sensitive traffic.  LOOPS does
   not look into the inner packet.  Elements to be considered in LOOPS
   are discussed briefly here.

3.1.  Local Recovery and End-to-end Retransmission

   There are basically two ways to perform local recovery:
   retransmission and FEC (forward error correction).  They may be used
   together in some cases.  Such approaches between two overlay nodes
   recover the lost packet over a relatively shorter distance and thus
   with shorter latency.  Therefore local recovery is always faster
   than end-to-end recovery.

   At the same time, most transport layer protocols have their own end-
   to-end retransmission to recover lost packets.  It would be ideal if
   end-to-end retransmission at the sender were not triggered when
   local recovery succeeds.

   End-to-end retransmission is normally triggered by a NACK, as in
   RTCP, or by multiple duplicate ACKs, as in TCP.

   When FEC is used for local recovery, it may come with a buffer to
   ensure that recovered packets are subsequently delivered in order.
   The receiver side is therefore unlikely to see out-of-order packets
   and send a NACK or multiple duplicate ACKs, so the side effect of
   unnecessarily triggering end-to-end retransmission is minimal.  When
   FEC is used, if redundancy and block size are fixed, the extra
   latency required to recover lost packets is also bounded, and the
   RTT variation it causes is predictable.  In some extreme cases, such
   as a large number of packets lost in a persistent burst, FEC may not
   be able to recover the loss.
   End-to-end retransmission then works as a last resort.  In summary,
   when FEC is used for local recovery, the impact on end-to-end
   retransmission is limited.

   When retransmission is used, more care is required.

   For packet loss in RTP streaming, retransmission can recover packets
   that would otherwise not be retransmitted end-to-end due to the long
   RTT.  It would be ideal if the retransmitted packet reached the
   receiver before a NACK for the lost packet would be sent out.
   Therefore, when the segment(s) over which retransmission takes place
   are a small portion of the whole end-to-end path, retransmission
   will significantly improve the quality at the receiver.  When the
   sender also retransmits the packet based on a received NACK, the
   receiver will receive duplicate retransmitted packets and should
   ignore the duplicates.

   For packet loss in TCP flows, TCP Reno and CUBIC use duplicate ACKs
   as a loss signal to trigger fast retransmit.  There are different
   ways to prevent the sender's end-to-end retransmission from being
   triggered prematurely:

   o  The egress overlay node can buffer out-of-order packets for a
      while, giving limited time for a packet retransmitted somewhere
      on the overlay path to reach it.  The retransmitted packet and
      the packets buffered because of it may increase the RTT variation
      seen at the sender.  When the retransmission latency is a small
      portion of the RTT or the loss is rare, such RTT variation will
      be smoothed out without much impact.  Another possibility is to
      make the sender exclude such packets from the RTT measurement.
      Buffer management is nontrivial: it has to be determined how many
      out-of-order packets can be buffered at the egress overlay node
      before it gives up waiting for a successful local retransmission.
      As the lost packet is not always recovered successfully locally,
      the sender may invoke end-to-end fast retransmit later than it
      would in classic TCP.

   o  If the LOOPS network does not buffer out-of-order packets caused
      by packet loss, the TCP sender can use time-based loss detection
      such as RACK [I-D.ietf-tcpm-rack] to avoid invoking fast
      retransmit too early.  RACK uses the notion of time to replace
      the conventional DUPACK-threshold approach to detecting losses.
      RACK needs to be tuned to fit local retransmission better.  If
      there are n segments over the path, segment retransmission will
      on average add at least RTT/n to the reordering window when the
      packet is lost only once over the whole overlay path.  This
      approach is preferable to the one described in the previous
      bullet.  On the other hand, if time-based loss detection is not
      supported at the sender, end-to-end retransmission will be
      invoked as usual, which wastes some bandwidth.

3.1.1.  OE to OE Measurement, Recovery and Multipathing

   When local recovery is performed between two neighboring ONs, it is
   called per-hop recovery.  It can operate between overlay relays, or
   between an overlay relay and an overlay edge.  Another type of local
   recovery, called OE-to-OE recovery, is performed between overlay
   edge nodes.  When the segments of an overlay path have similar
   characteristics and/or only the OEs have the expected processing
   capability, OE-to-OE recovery can be used instead of per-hop
   recovery.

   If there is more than one overlay path in an overlay tunnel,
   multipathing splits and recombines the traffic.  Measurements such
   as round-trip time and loss rate between OEs have to be per-path.
   The ingress OE can use the measurement feedback to determine the FEC
   parameter settings for different paths.  FEC can also be configured
   to work over the combined path.
   The egress OE must be able to remove replicated packets when the
   overlay path is switched during an impairment.

   OE-to-OE measurement can help each segment determine its proportion
   of the edge-to-edge delay.  This is useful for an ON to decide
   whether it is necessary to turn on per-hop recovery, and how to
   fine-tune the parameter settings.  When a segment's delay ratio is
   small, segment retransmission is more effective.

3.2.  Congestion Control Interaction

   When a TCP-like transport layer protocol is used, local recovery in
   LOOPS has to interact with the upper-layer transport congestion
   control.  Classic TCP adjusts the congestion window when a loss is
   detected and fast retransmit is invoked.

   A local recovery mechanism breaks the assumption in classic TCP that
   detected packet loss is a necessary and sufficient condition for
   triggering congestion control at the sender.  A locally recovered
   loss may be caused by non-persistent congestion, such as a
   microburst, or by random loss, which ideally should not make the
   sender invoke its congestion reduction mechanism.  It may also be
   caused by real, persistent congestion, which should make the sender
   reduce its rate.  In either case, the sender does not detect the
   loss if local recovery succeeds.

   When local recovery takes effect, we consider the following two
   cases.  First, the classic TCP sender does not see enough duplicate
   ACKs to trigger fast retransmit.  This can be the result of in-order
   delivery of packets, including locally recovered ones, to the
   receiver, as mentioned in the last subsection.  The classic TCP
   sender in this case will not reduce its congestion window, as no
   loss is detected.
   Second, if time-based loss detection such as RACK is used, as long
   as the ACK for the locally recovered packet reaches the sender
   before the reordering window expires, the congestion window will not
   be reduced.

   Such behavior brings a great throughput improvement, which is
   desirable when the recovered packet was lost due to non-persistent
   congestion or random factors.  It solves the throughput problem
   mentioned in Section 2.3.  However, it also brings the risk that the
   sender cannot detect real persistent congestion in time and then
   overshoots.  Eventually, a severe congestion that is not recoverable
   by a local recovery mechanism may occur.  In addition, it may be
   unfair to other flows (possibly pushing them out) if those flows run
   over the same underlying bottleneck links.

   There is a spectrum of approaches.  At one end, each locally
   recovered packet can be treated exactly as a loss by setting its CE
   (Congestion Experienced) bit, invoking congestion control at the
   sender to guarantee fair sharing as in classic TCP.  Explicit
   Congestion Notification (ECN) can be used here, as ECN marking is
   required to be treated as equivalent to a packet drop [RFC3168].
   Congestion control at the sender then works as usual, and no
   throughput improvement is achieved (although the benefit of faster
   recovery remains).  At the other end, an ON can perform its own
   congestion measurement over the segment, for instance of the local
   RTT and its variation trend, and determine whether a lost packet was
   caused by congestion or by other factors.  It can further decide
   whether to set a CE marking, or even what marking ratio to use, to
   make the sender adjust its sending rate more appropriately.

   There are cases in which the sender detects the loss even with local
   recovery in operation.
   For example, when the reordering window in RACK is not optimally
   adapted, the sender may trigger congestion control at the same time
   as an end-to-end retransmission.  If spurious retransmission
   detection based on DSACK [RFC3708] is used, such end-to-end
   retransmission will be found to be unnecessary when the locally
   recovered packet reaches the receiver successfully, and the
   congestion control changes will be undone at the sender.  This has
   pros and cons similar to those described earlier.  The pro is
   avoiding an unnecessary window reduction, improving throughput when
   the loss was caused by non-persistent congestion or random loss.
   The con is that mechanisms like ECN or its variants have to be used
   wisely to make sure congestion control is invoked in case of
   persistent congestion.

   An approach where the losses on a path segment are not immediately
   made known to the end-to-end congestion control can be combined with
   a "circuit breaker" style of congestion control on the path segment.
   When the overlay flow's usage of the path segment starts to become
   unfair, the path segment sends congestion signals up to the end-to-
   end congestion control.  This must be carefully tuned to avoid
   unwanted oscillation.

   In summary, local recovery can improve Flow Completion Time (FCT) by
   eliminating tail loss in small flows.  As it turns loss events into
   out-of-order events in most cases, from the TCP sender's point of
   view there are implications for throughput if the sender uses loss-
   based congestion control.  We suggest that ECN and spurious
   retransmission detection be enabled when local recovery is in use;
   this yields the desirable throughput behavior: when loss is caused
   by congestion, the congestion window is reduced; otherwise the
   sender's sending rate is kept.
We do not suggest using spurious retransmission detection alone together with local recovery, as it may cause the TCP sender to falsely undo a window reduction when congestion occurs. If only ECN is enabled, or neither ECN nor spurious retransmission detection is enabled, the throughput with local recovery in use is not much different from that of traditional TCP.

3.3. Overlay Protocol Extensions

The overlay usually has no control over how packets are routed in the underlying network between two overlay nodes, but it can control, for example, the sequence of overlay nodes a message traverses before reaching its destination. LOOPS assumes the overlay protocol can deliver the packets in such a designated sequence. Most forms of overlay networking use some sort of "encapsulation". The whole path can be formed by stitching together multiple short overlay paths, as with VXLAN [RFC7348] or GENEVE [I-D.ietf-nvo3-geneve], or it can be a single overlay path with a sequence of intermediate overlay nodes specified, as with SRv6 [I-D.ietf-6man-segment-routing-header]. Either way, LOOPS requires extending the overlay protocol to support data plane measurement and feedback, and retransmission or FEC based loss recovery, either per ON hop or OE to OE.

LOOPS alone has no setup requirement on the control plane. Some overlay protocols, e.g. CAPWAP [RFC5415], have a session setup phase, which can be used to exchange information such as dynamic FEC parameters.

3.4. Summary

LOOPS is expected to extend existing overlay protocols in the data plane. Path selection is assumed to be a feature provided by the overlay protocols via SDN or other approaches and is not a part of LOOPS. LOOPS is a set of functions to be implemented on ONs in a long-haul overlay network. LOOPS includes the following features.

1. Local recovery. Retransmission, FEC or a hybrid can be used as the local recovery method.
   Such a recovery mechanism is in-network; it is performed by two network nodes with computing and memory resources.

2. Local congestion measurement. The sender ON measures the local segment RTT and/or loss to get the overlay segment status immediately.

3. Determination of how to set the ECN CE mark, based on local recovery and/or local congestion measurement information, to make the end host sender adjust its sending rate correctly.

4. Security Considerations

LOOPS does not look at the traffic payload, so encrypted payload does not affect the functionality of LOOPS. The use of LOOPS introduces some issues that impact security. An ON with LOOPS function represents a point in the network where the traffic can potentially be manipulated. A denial-of-service attack can be launched from an ON. A rogue ON might be able to spoof packets as if they came from a legitimate ON. It may also modify the ECN CE marking in packets to influence the sender's rate. In order to protect against such attacks, the overlay protocol itself should have some built-in security protection that can inherently be used by LOOPS. The operator should use an authentication mechanism to make sure ONs are valid and non-compromised.

5. IANA Considerations

No IANA action is required.

6. Informative References

[RFC0793] Postel, J., "Transmission Control Protocol", STD 7, RFC 793, DOI 10.17487/RFC0793, September 1981.

[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001.

[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550, July 2003.

[RFC3708] Blanton, E. and M. Allman, "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream Control Transmission Protocol (SCTP) Duplicate Transmission Sequence Numbers (TSNs) to Detect Spurious Retransmissions", RFC 3708, DOI 10.17487/RFC3708, February 2004.

[RFC4585] Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey, "Extended RTP Profile for Real-time Transport Control Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585, DOI 10.17487/RFC4585, July 2006.

[RFC4588] Rey, J., Leon, D., Miyazaki, A., Varsa, V., and R. Hakenberg, "RTP Retransmission Payload Format", RFC 4588, DOI 10.17487/RFC4588, July 2006.

[RFC5415] Calhoun, P., Ed., Montemurro, M., Ed., and D. Stanley, Ed., "Control And Provisioning of Wireless Access Points (CAPWAP) Protocol Specification", RFC 5415, DOI 10.17487/RFC5415, March 2009.

[RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

[RFC6298] Paxson, V., Allman, M., Chu, J., and M. Sargent, "Computing TCP's Retransmission Timer", RFC 6298, DOI 10.17487/RFC6298, June 2011.

[RFC6830] Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "The Locator/ID Separation Protocol (LISP)", RFC 6830, DOI 10.17487/RFC6830, January 2013.

[RFC7348] Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, August 2014.

[I-D.dukkipati-tcpm-tcp-loss-probe] Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis, "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of Tail Losses", draft-dukkipati-tcpm-tcp-loss-probe-01 (work in progress), February 2013.

[I-D.ietf-nvo3-geneve] Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic Network Virtualization Encapsulation", draft-ietf-nvo3-geneve-11 (work in progress), March 2019.

[I-D.ietf-tcpm-rack] Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha, "RACK: a time-based fast loss detection algorithm for TCP", draft-ietf-tcpm-rack-04 (work in progress), July 2018.

[I-D.ietf-6man-segment-routing-header] Filsfils, C., Previdi, S., Leddy, J., Matsushima, S., and D. Voyer, "IPv6 Segment Routing Header (SRH)", draft-ietf-6man-segment-routing-header-16 (work in progress), February 2019.

[I-D.cardwell-iccrg-bbr-congestion-control] Cardwell, N., Cheng, Y., Yeganeh, S., and V. Jacobson, "BBR Congestion Control", draft-cardwell-iccrg-bbr-congestion-control-00 (work in progress), July 2017.

[DOI_10.1109_ICDCS.2016.49] Cai, C., Le, F., Sun, X., Xie, G., Jamjoom, H., and R. Campbell, "CRONets: Cloud-Routed Overlay Networks", 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS), DOI 10.1109/icdcs.2016.49, June 2016.

Authors' Addresses

Yizhou Li
Huawei Technologies
101 Software Avenue
Nanjing 210012
China

Phone: +86-25-56624584
Email: liyizhou@huawei.com

Xingwang Zhou
Huawei Technologies
101 Software Avenue
Nanjing 210012
China

Email: zhouxingwang@huawei.com