TSVWG                                                              Y. Li
Internet-Draft                                                   X. Zhou
Intended status: Informational                                    Huawei
Expires: November 25, 2019                                  May 24, 2019

 LOOPS (Localized Optimizations of Path Segments) Problem Statement and
                              Opportunities
              draft-li-tsvwg-loops-problem-opportunities-02

Abstract

   In various network deployments, end-to-end paths are partitioned
   into multiple segments.  In some cloud-based WAN connections,
   multiple overlay tunnels in series are used to achieve better path
   selection and lower latency.  In satellite communication, the end-
   to-end path is split into two terrestrial segments and a satellite
   segment.  In these various deployments, packet losses can be caused
   either by random events or by congestion.

   Traditional end-to-end transport layers respond to packet loss
   slowly, especially in long-haul networks: they either wait for some
   signal from the receiver to indicate a loss and then retransmit from
   the sender, or rely on the sender's timeout, which is often quite
   long.  Packet loss that is not caused by congestion may make the TCP
   sender over-reduce its sending rate.  With end-to-end encryption
   moving under the transport (QUIC), traditional PEP (performance
   enhancing proxy) techniques such as TCP splitting are no longer
   applicable.

   LOOPS (Local Optimizations on Path Segments) aims to provide non-
   end-to-end, locally based in-network recovery to achieve better data
   delivery by making packet loss recovery faster and by avoiding the
   senders over-reducing their sending rate.  In an overlay network
   scenario, LOOPS can be performed over the existing, or purposely
   created, overlay tunnel based path segments.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.
   The list of current Internet-Drafts is at
   https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 25, 2019.

Copyright Notice

   Copyright (c) 2019 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Cloud-Internet Overlay Network
     2.1.  Tail Loss or Loss in Short Flows
     2.2.  Packet Loss in Real Time Media Streams
     2.3.  Packet Loss and Congestion Control in Bulk Data Transfer
     2.4.  Multipathing
   3.  Satellite Communication
   4.  Features and Impacts to be Considered for LOOPS
     4.1.  Local Recovery and End-to-end Retransmission
       4.1.1.  OE to OE Measurement, Recovery and Multipathing
     4.2.  Congestion Control Interaction
     4.3.  Overlay Protocol Extensions
     4.4.  Summary
   5.  Security Considerations
   6.  IANA Considerations
   7.  Acknowledgements
   8.  Informative References
   Authors' Addresses

1.  Introduction

   Overlay tunnels are widely deployed in various networks, including
   long-haul WAN interconnections, enterprise wireless access networks,
   etc.  The end-to-end connection is partitioned into multiple path
   segments using overlay tunnels.  This serves a number of purposes,
   for instance, selecting a better path over the WAN or delivering
   packets over heterogeneous networks, such as enterprise access and
   core networks.

   A reliable transport layer normally employs some end-to-end
   retransmission mechanism, which also addresses congestion control
   [RFC0793] [RFC5681].  The sender either waits for the receiver to
   send some signal on a packet loss or sets some form of timeout for
   retransmission.  For unreliable transport layer protocols such as
   RTP [RFC3550], optional and limited usage of end-to-end
   retransmission is employed to recover from packet loss [RFC4585]
   [RFC4588].

   End-to-end retransmission to recover lost packets is slow,
   especially when the network is long haul.  When a path is
   partitioned into multiple path segments that are realized as overlay
   tunnels, LOOPS (Local Optimizations on Path Segments) tries to
   provide local, segment-based in-network recovery to achieve better
   data delivery by making packet loss recovery faster and by avoiding
   the senders over-reducing their sending rate.  In an overlay network
   scenario, LOOPS can be performed over the existing, or purposely
   created, overlay tunnel based path segments.
   Some link types (satellite, microwave) may exhibit unusually high
   loss rates under special conditions (e.g., fades due to heavy rain).
   A traditional TCP sender interprets loss as congestion and over-
   reduces its sending rate, degrading the throughput.  LOOPS is also
   applicable to such scenarios to improve throughput.

   Section 2 presents some of the issues and opportunities found in
   Cloud-Internet overlay networks that require higher performance and
   more reliable packet transmission in best-effort networks.
   Section 3 discusses applications of LOOPS in satellite
   communication.  Section 4 describes the corresponding solution
   features and their impact on existing network technologies.

       ON=overlay node
       UN=underlay node

   +---------+                                               +---------+
   |   App   | <---------------- end-to-end ---------------> |   App   |
   +---------+                                               +---------+
   |Transport| <---------------- end-to-end ---------------> |Transport|
   +---------+                                               +---------+
   |         |        +--+  path  +--+ path segment2   +--+  |         |
   |         |        |  |<-seg1->|  |<--------------->|  |  |         |
   | Network |  +--+  |ON|  +--+  |ON|  +--+   +----+  |ON|  | Network |
   |         |--|UN|--|  |--|UN|--|  |--|UN|---| UN |--|  |--|         |
   +---------+  +--+  +--+  +--+  +--+  +--+   +----+  +--+  +---------+
   End Host                                                   End Host
                <--------------------------------->
                LOOPS domain: path segment enables
                optimizations for better local transport

           Figure 1: LOOPS in Overlay Network Usage Scenario

1.1.  Terminology

   LOOPS:  Local Optimizations on Path Segments.  LOOPS includes the
      local in-network (i.e., non-end-to-end) recovery functions, for
      instance, loss detection and measurements.

   LOOPS Node:  Node supporting LOOPS functions.

   Overlay Node (ON):  Node having overlay functions (like overlay
      protocol encapsulation/decapsulation, header modification, TLV
      inspection) and LOOPS functions in the LOOPS overlay network
      usage scenario.  Both OR and OE are Overlay Nodes.
   Overlay Tunnel:  A tunnel with designated ingress and egress nodes
      using some network overlay protocol as encapsulation, optionally
      with a specific traffic type.

   Overlay Path:  A channel within the overlay tunnel, where the
      traffic transmitted on the channel needs to pass through zero or
      more designated intermediate overlay nodes.  There may be more
      than one overlay path within an overlay tunnel when different
      sets of designated intermediate overlay nodes are specified.  An
      overlay path may contain multiple path segments.  When an overlay
      tunnel contains only one overlay path without any intermediate
      overlay node specified, "overlay path" and "overlay tunnel" are
      used interchangeably.

   Overlay Edge (OE):  Edge node of an overlay tunnel.

   Overlay Relay (OR):  Intermediate overlay node on an overlay path.
      An overlay path need not contain any OR.

   Path segment:  Part of an overlay path between two neighboring
      overlay nodes.  It is used interchangeably with "overlay segment"
      in this document when the context emphasizes its overlay-
      encapsulated nature.  An overlay path may contain multiple path
      segments.  When an overlay path contains only one path segment,
      i.e., the segment is between two OEs, the path segment is
      equivalent to the overlay path.  It is also called "segment" for
      simplicity in this document.

   Overlay segment:  Refers to path segment.

   Underlay Node (UN):  Node not participating in the overlay network
      function.

2.  Cloud-Internet Overlay Network

   The Internet is a huge network of networks.  The interconnections of
   end devices using this global network are normally provided by ISPs
   (Internet Service Providers).  The network created by the
   composition of the ISP networks is considered the traditional
   Internet.  CSPs (Cloud Service Providers) are connecting their data
   centers using the Internet or via self-constructed networks/links.
   This expands the Internet's infrastructure and, together with the
   original ISPs' infrastructure, forms the Internet underlay.

   NFV (network function virtualization) further makes it easier to
   dynamically provision a new virtual node as a workload in a cloud
   for CPU/storage-intensive functions.  With the aid of various
   mechanisms such as kernel bypassing and virtual I/O, forwarding
   based on virtual nodes is becoming more and more efficient.  The
   interconnections among the purposely positioned virtual nodes and/or
   the existing nodes with virtualization functions potentially form an
   overlay over the Internet.  It is called the Cloud-Internet Overlay
   Network (CION) in this document.

   CION makes use of overlay technologies to direct traffic through a
   specific overlay path regardless of the underlying physical
   topology, in order to achieve better service delivery.  It purposely
   creates or selects overlay nodes (ONs) from providers.  By
   continuously measuring the delay of path segments and using it as a
   metric for path selection, when the number of overlay nodes is
   sufficiently large, there is a high chance that a better path can be
   found [DOI_10.1109_ICDCS.2016.49] [DOI_10.1145_3038912.3052560].

   [DOI_10.1145_3038912.3052560] further shows that all cloud providers
   experience random loss episodes and that random loss accounts for
   more than 35% of total loss.

   Figure 2 shows an example of an overlay path over large geographic
   distances.  The path between two OEs (Overlay Edges) is an overlay
   path.  The OEs are ON1 and ON4 in Figure 2.  The part of the path
   between two neighboring ONs is a path segment; the figure shows an
   overlay path with three segments, i.e., ON1-ON2-ON3-ON4.  An ON is
   usually a virtual node, though it does not have to be.  An overlay
   path transmits packets in some form of network overlay protocol
   encapsulation.
   An ON has computing and memory resources that can be used for
   functions like packet loss detection, network measurement and
   feedback, and packet recovery.

                      _____________
                     /  domain 1   \
                    /               \
                ___/   -------------\
               /                     \
     PoP1 ->--ON1                     \
      |        |                      ON4------>-- PoP2
      |        |          ON2       ___|__/
      \__|_    |->|      _____     /   |
      |   \|__|__       /     \   /    |
      |   |   |  \_____/       \__/    |
     \|/  |   |          _____         |
      |   |   |      ___/     \        |
      |   |  \|/    /          \_____  |
      |   |   |    /  domain 2       \ /|\
      |   |   |    |      ON3        |  |
      |   |   |    \      |->|       |  |
      |   |   |     \_____|__|_______/  |
      |  /|\  |           |            \|/
      |   |   |           |             |
      |   |   |          /|\            |
     +--------------------------------------------------+
     |  |   |  |          |             |    Internet   |
     |  o---o  o---o->----o   o---o->--o--o  underlay   |
     +--------------------------------------------------+

           Figure 2: Cloud-Internet Overlay Network (CION)

   We ran tests using 37 overlay nodes from multiple cloud providers
   globally.  Each pair of overlay nodes was used as sender and
   receiver.  When the traffic is not intentionally directed to go
   through any intermediate virtual node, we call the path that the
   traffic takes the _default path_ in the test.  When any of the
   virtual nodes is intentionally used as an intermediate node to
   forward the traffic, the path that the traffic takes is an _overlay
   path_ in the test.  The preliminary experiments, probing with one
   ping packet per second for a week, showed that the delay of an
   overlay path is shorter than that of the default path in 69% of
   cases at the 99th percentile, and the improvement is 17.5% at the
   99th percentile.

   Lower delay does not necessarily mean higher throughput.  Different
   path segments may have different packet loss rates, and loss rate is
   another major factor impacting TCP throughput.  Based on some
   customer requirements, we set the target loss rate to be less than
   1% at the 99th percentile and the 99.9th percentile, respectively.
   The loss was measured between any two overlay nodes, i.e.
   any potential path segment.  Two thousand ping packets were sent
   every 20 seconds between two overlay nodes for 55 hours.  This
   preliminary experiment showed that the packet loss rate target was
   satisfied for 44.27% and 29.51% of segments at the 99th and 99.9th
   percentiles, respectively.

   Hence packet loss in an overlay segment is a key issue to be solved
   in CION.  In long-haul networks, the end-to-end retransmission of a
   lost packet can cost an extra round-trip time.  Such extra time is
   not acceptable in some cases.  As CION naturally consists of
   multiple overlay segments, LOOPS leverages this to perform local
   optimizations on a single hop between two overlay nodes.  ("Local"
   here is relative to end-to-end; it does not mean that such
   optimization is limited to LANs.)

   The following subsections present different scenarios using
   multiple-segment overlay paths with a common need for local in-
   network loss recovery in best-effort networks.

2.1.  Tail Loss or Loss in Short Flows

   When the lost segments are at the end of a transaction, TCP's fast
   retransmit algorithm does not work, as there are no ACKs to trigger
   it.  When a sender does not receive an ACK for a given segment
   within a certain amount of time called the retransmission timeout
   (RTO), it resends the segment [RFC6298].  The RTO can be as long as
   several seconds, so the recovery of lost segments triggered by the
   RTO is lengthy.  [I-D.dukkipati-tcpm-tcp-loss-probe] indicates that
   large RTOs make a significant contribution to the long tail of the
   latency statistics of short flows like web page transfers.

   A short flow often completes in one or two RTTs.  Even when the loss
   is not a tail loss, it can add another RTT because of end-to-end
   retransmission (not enough packets are in flight to trigger fast
   retransmit).  In long-haul networks, this can mean extra time of
   tens or even hundreds of milliseconds.
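A back-of-the-envelope comparison illustrates the latency argument.  The RTT values below are assumed purely for illustration; only the one-second minimum RTO is taken from RFC 6298.

```python
# Rough comparison of recovery latencies for one lost packet.  The RTT
# values are assumed for illustration; the 1-second lower bound on the
# retransmission timeout is the one recommended by RFC 6298.

E2E_RTT = 0.200      # assumed end-to-end RTT of a long-haul path (s)
SEG_RTT = 0.040      # assumed RTT of the overlay segment with the loss (s)
MIN_RTO = 1.0        # RFC 6298 recommended lower bound on the RTO (s)

# Tail loss: no later ACKs arrive, so the sender waits out a full RTO,
# then the resent segment needs half an RTT to reach the receiver.
tail_loss = MIN_RTO + E2E_RTT / 2

# Fast retransmit: roughly one extra end-to-end round trip
# (duplicate ACKs back, resent segment forward).
fast_retx = E2E_RTT + E2E_RTT / 2

# Segment-local recovery: the same exchange confined to one segment.
local_retx = SEG_RTT + SEG_RTT / 2

print(f"tail loss via RTO:   {tail_loss * 1000:.0f} ms")
print(f"e2e fast retransmit: {fast_retx * 1000:.0f} ms")
print(f"segment-local:       {local_retx * 1000:.0f} ms")
```

Whatever the exact numbers, the ordering is the point: a segment-scale exchange is bounded by the segment RTT, not by the end-to-end RTT or the RTO.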
   An overlay segment transmits the aggregated flows from ON to ON.  As
   short flows are aggregated, the probability of tail loss over this
   specific overlay segment decreases compared to an individual flow.
   The overlay segment is also much shorter than the end-to-end path in
   a Cloud-Internet overlay network, hence loss recovery over an
   overlay segment is faster.

2.2.  Packet Loss in Real Time Media Streams

   The Real-time Transport Protocol (RTP) is widely used for
   interactive audio and video.  Packet loss degrades the quality of
   the received media.  When the latency tolerance of the application
   is sufficiently large, the RTP sender may use RTCP NACK feedback
   from the receiver [RFC4585] to trigger the retransmission of lost
   packets before the playout time is reached at the receiver.

   In a Cloud-Internet overlay network, the end-to-end path delay can
   be hundreds of milliseconds.  End-to-end feedback-based
   retransmission may not be very useful when applications cannot
   tolerate one more RTT of this length.  Loss recovery over an overlay
   segment can then be used in scenarios where RTCP NACK-triggered
   retransmission is not appropriate.

2.3.  Packet Loss and Congestion Control in Bulk Data Transfer

   TCP congestion control algorithms such as Reno and CUBIC basically
   interpret packet loss as congestion experienced somewhere on the
   path.  When a loss is detected, the congestion window is decreased
   at the sender to slow the sending.  It has been observed that packet
   loss is not an accurate way to detect congestion in the current
   Internet [I-D.cardwell-iccrg-bbr-congestion-control].  On long-haul
   links, when the loss is caused by a non-persistent burst that is
   extremely short and fairly random, reducing the sending rate at the
   sender can neither respond in time to the instantaneous path
   situation nor mitigate such bursts.
   On the contrary, reducing the window size at the sender
   unnecessarily or too aggressively harms the throughput of long-
   lasting application traffic like bulk data transfer.

   The overlay nodes distributed over the path have computing
   capability, so they are in a better position than the end hosts to
   deduce the underlying links' instantaneous situation by measuring
   the delay, loss, or other metrics over the segment.  The shorter
   round-trip time over a path segment allows more accurate and more
   immediate measurement of the maximum recently available bandwidth,
   the minimum recent latency, or the trend of change.  ONs can further
   decide whether a sending rate reduction at the sender is necessary
   when a loss happens.  Section 4.2 discusses this in more detail.

2.4.  Multipathing

   As an overlay path may suffer from an impairment of the underlying
   network, two or more overlay paths between the same set of ingress
   and egress overlay nodes can be combined for reliability purposes.
   During the transient period in which a network impairment is
   detected, sending replicated traffic over two paths can improve
   reliability.

   When two or more disjoint overlay paths are available from ON1 to
   ON2, as shown in Figure 3, different sets of traffic may use
   different overlay paths.  For instance, one path may be used for low
   latency and the other for higher bandwidth, or they can simply be
   used for load balancing and better bandwidth utilization.

   Two disjoint paths can usually be found by measurement, identifying
   segments with very low mathematical correlation in latency change.
   When the number of overlay nodes is large, it is easy to find
   disjoint or partially disjoint segments.

   Different overlay paths may have varying characteristics.  The
   overlay tunnel should allow each overlay path to handle packet loss
   based on its own path measurements.
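The "low mathematical correlation in latency change" test can be sketched as follows.  The latency samples are made up for illustration; a real implementation would feed in the continuous probe measurements described in Section 2.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-probe latency samples (ms) for candidate segments.
seg_a = [40, 42, 55, 41, 60, 43, 40, 58]   # spikes at probes 2, 4, 7
seg_b = [70, 71, 86, 70, 88, 72, 71, 87]   # spikes at the same probes
seg_c = [30, 31, 32, 30, 30, 30, 32, 31]   # jitter unrelated to seg_a

# Segments whose latency changes move together likely share underlay
# links; low correlation suggests (partially) disjoint segments.
print(pearson(seg_a, seg_b))   # high: probably shares underlay links
print(pearson(seg_a, seg_c))   # low: good disjoint-path candidate
```

A pair of paths built from mutually low-correlation segments is then a reasonable candidate for the replication or load-balancing uses described above.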
                  ON-A
       +----------o------------------+
       |                             |
       |                             |
    A -----o ON1               ON2o----- B
       |                             |
       +-----------------------o-----+
                               ON-B

                  Figure 3: Multiple Overlay Paths

3.  Satellite Communication

   Traditionally, satellite communications deploy PEP (performance
   enhancing proxy) nodes around the satellite link to enhance end-to-
   end performance.  TCP splitting is a common approach employed by
   such PEPs, where the TCP connection is split into three parts: the
   segment before the satellite hop, the satellite section (uplink,
   downlink), and the segment after the satellite hop.  This requires
   heavy interaction with the end-to-end transport protocols, usually
   without the explicit consent of the end hosts.  Unfortunately, this
   is indistinguishable from a man-in-the-middle attack on TCP.  With
   end-to-end encryption moving under the transport (QUIC), this
   approach is no longer useful.

   Geosynchronous Earth Orbit (GEO) satellites have a one-way delay (up
   to the satellite and back) on the order of 250 milliseconds.  This
   does not include queueing, coding, and other delays in the satellite
   ground equipment.  The round-trip time for a TCP or QUIC connection
   going over a satellite hop in both directions will, in the best
   case, be on the order of 600 milliseconds, and it may be
   considerably longer.  RTTs of this order of magnitude have
   significant performance implications.

   Packet loss recovery is an area where splitting the TCP connection
   into different parts helps.  Packets lost on the terrestrial links
   can be recovered at terrestrial latencies.  Packet loss on the
   satellite link can be recovered more quickly by a satellite-
   optimized protocol between the PEPs and/or link-layer FEC than it
   could be end to end.  Again, encryption makes TCP splitting no
   longer applicable.
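The GEO delay figures above can be sanity-checked from the orbit geometry.  This is a simplified calculation: nadir slant range only, ignoring the ground-equipment delays the text mentions.

```python
# Simplified propagation-delay check for the GEO figures quoted above.
C = 299_792.458          # speed of light in vacuum, km/s
GEO_ALTITUDE = 35_786.0  # GEO altitude above the equator, km (nadir case)

# "One-way delay (up to the satellite and back)": up + down, one
# direction of travel.
one_way = 2 * GEO_ALTITUDE / C        # ~0.24 s, matching "order of 250 ms"

# A TCP/QUIC round trip crosses the satellite hop in both directions;
# queueing, coding, and the terrestrial legs push the practical best
# case toward the ~600 ms quoted above.
best_case_rtt = 2 * one_way           # ~0.48 s of pure propagation

print(f"one-way: {one_way * 1000:.0f} ms, "
      f"RTT (propagation only): {best_case_rtt * 1000:.0f} ms")
```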
   Enhanced error recovery at the satellite link layer helps with loss
   on the satellite link but does not help with loss on the terrestrial
   links.  Even when the terrestrial segments are short, any loss there
   must be recovered across the satellite link delay.  And there are
   cases where a satellite ground station connects to the general
   Internet with a potentially large terrestrial segment (e.g., to a
   correspondent host in another country).  Faster recovery over such
   long terrestrial segments is desirable.

   Another aspect of recovery is that terrestrial loss is highly likely
   to be congestion related, while satellite loss is more likely to
   result from transmission errors due to link conditions.  A transport
   endpoint slowing down because it misinterprets these errors as
   congestion losses unnecessarily reduces performance.  But at the
   endpoints, the two are not easily distinguished.  To elaborate on
   loss recovery for satellite communications: while the error rate on
   satellite paths is generally very low most of the time, it may get
   higher under special link conditions (e.g., fades due to heavy
   rain).  The satellite hop itself does know which losses are due to
   link conditions as opposed to congestion, but it has no mechanism to
   signal this difference to the end hosts.

   The protocol under QUIC will need to try to minimize non-congestion
   packet drops.  Specific link layers may have recovery techniques
   such as satellite FEC.  Where their capabilities may be exceeded
   (e.g., rain fade), LOOPS-like approaches can be considered.

   There are two high-level classes of solutions for making encrypted
   transport traffic like QUIC work well over satellite:

   o  Hooks in the protocol that can adapt to large BDPs, where both
      the bandwidth and the latency are large.  This would require end-
      to-end enhancement.
   o  Capabilities (such as LOOPS) under the protocol to improve
      performance over specific segments of the path; in particular,
      separating the terrestrial losses from the satellite losses,
      fixing terrestrial loss quickly, and keeping throughput high over
      the satellite segment by not causing the end hosts to over-reduce
      their sending window in case of non-congestion loss.

   This document focuses on the latter.

4.  Features and Impacts to be Considered for LOOPS

   LOOPS (Local Optimizations on Path Segments) aims to leverage the
   virtual nodes in a selected path to improve the transport
   performance "locally" instead of end-to-end, as those nodes
   partition the path into multiple segments.  With technologies like
   NFV (network function virtualization) and virtual I/O, it is easier
   to add functions to virtual nodes, and even the forwarding on those
   virtual nodes is getting more efficient.  Some overlay protocol such
   as VXLAN [RFC7348], GENEVE [I-D.ietf-nvo3-geneve], LISP [RFC6830],
   or CAPWAP [RFC5415] is assumed to be employed in the network.  In
   the overlay network usage scenario, LOOPS can extend a specific
   overlay protocol header to perform local measurement and local
   recovery functions, as in the example shown in Figure 4.

   +------------+------------+-----------------+---------+---------+
   |Outer IP hdr|Overlay hdr |LOOPS information|Inner hdr|payload  |
   +------------+------------+-----------------+---------+---------+

                Figure 4: LOOPS Extension Header Example

   LOOPS uses a packet number space independent of that of the
   transport layer.  Acknowledgments should be generated from the
   receiving ON to the sending ON for packet loss detection and local
   measurement.  To reduce overhead, a negative ACK (NACK) over each
   path segment is a good choice here.
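A minimal sketch of such per-segment NACK generation follows.  The per-segment sequence number and the class and method names are hypothetical, not part of any defined LOOPS header.

```python
# Sketch of NACK-based loss detection on one path segment, assuming
# (hypothetically) that the LOOPS header carries a per-segment sequence
# number independent of the transport layer's numbers.

class SegmentReceiver:
    """Receiving ON: detects gaps in the LOOPS packet number space."""

    def __init__(self):
        self.expected = 0          # next LOOPS sequence number expected

    def on_packet(self, seq):
        """Return the sequence numbers to NACK back to the sending ON."""
        nacks = list(range(self.expected, seq))   # a gap means likely loss
        self.expected = max(self.expected, seq + 1)
        return nacks

rx = SegmentReceiver()
assert rx.on_packet(0) == []        # in order, nothing to NACK
assert rx.on_packet(1) == []
assert rx.on_packet(4) == [2, 3]    # gap: NACK 2 and 3 for local resend
assert rx.on_packet(2) == []        # late arrival; the sending ON must
                                    # tolerate NACKs for packets that
                                    # were merely reordered
```

Real deployments would add a small reordering tolerance before emitting NACKs; the point here is only that loss detection and recovery stay confined to one segment.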
   A timestamp echo mechanism, analogous to TCP's Timestamp option,
   should be employed in band in the LOOPS extension to measure the
   local RTT and its variation for an overlay segment.  Local in-
   network recovery is performed.  The measurement over the segment is
   expected to give a hint on whether a lost and locally recovered
   packet was lost due to congestion.  Such a hint could be fed back
   further, for example via ECN Congestion Experienced (CE) markings,
   to the end-host sender.  It tells the end-host sender whether a
   congestion window adjustment is necessary.  LOOPS normally works on
   an overlay segment that aggregates the same type of traffic, for
   instance TCP traffic, or a finer granularity such as throughput-
   sensitive TCP traffic.  LOOPS does not look into the inner packet.
   Elements to be considered in LOOPS are discussed briefly here.

4.1.  Local Recovery and End-to-end Retransmission

   There are basically two ways to perform local recovery:
   retransmission and FEC (forward error correction).  They may be used
   together in some cases.  Such approaches between two overlay nodes
   recover the lost packet over a relatively shorter distance and thus
   with shorter latency.  Therefore local recovery is always faster
   than end-to-end recovery.

   At the same time, most transport layer protocols have their own end-
   to-end retransmission to recover lost packets.  Ideally, end-to-end
   retransmission at the sender would not be triggered if the local
   recovery was successful.

   End-to-end retransmission is normally triggered by a NACK as in RTCP
   or by multiple duplicate ACKs as in TCP.

   When FEC is used for local recovery, it may come with a buffer to
   make sure the recovered packets are subsequently delivered in order.
   Therefore the receiver side is unlikely to see out-of-order packets
   and then send a NACK or multiple duplicate ACKs.
   The side effect of unnecessarily triggering end-to-end
   retransmission is thus minimal.  When FEC is used, once the
   redundancy and block size are determined, the extra latency required
   to recover lost packets is also bounded, so the RTT variation it
   causes is predictable.  In some extreme cases, such as a large
   number of packets lost to a persistent burst, FEC may not be able to
   recover them; end-to-end retransmission then works as a last resort.
   In summary, when FEC is used for local recovery, the impact on end-
   to-end retransmission is limited.

   When retransmission is used, more care is required.

   For packet loss in RTP streaming, local retransmission can recover
   packets that would otherwise not be retransmitted end to end due to
   the long RTT.  It would be ideal if the retransmitted packet reached
   the receiver before the receiver sends back information that the
   sender would interpret as a NACK for the lost packet.  Therefore,
   when the segment(s) over which retransmission is performed are a
   small portion of the whole end-to-end path, the retransmission will
   significantly improve the quality at the receiver.  When the sender
   also retransmits the packet based on a received NACK, the receiver
   will receive duplicate retransmitted packets and should ignore the
   duplicates.

   For packet loss in TCP flows, TCP Reno and CUBIC use duplicate ACKs
   as a loss signal to trigger fast retransmit.  There are different
   ways to avoid triggering the sender's end-to-end retransmission
   prematurely:

   o  The egress overlay node can buffer the out-of-order packets for a
      while, giving a limited time for a packet being retransmitted
      somewhere in the overlay path to reach it.  The retransmitted
      packet and the packets buffered because of it may increase the
      RTT variation seen at the sender.  When the retransmission
      latency is a small portion of the RTT or the loss is rare, such
      RTT variation will be smoothed out without much impact.
      Another possible way is to make the sender exclude such packets
      from the RTT measurement.  The locally recovered packets can be
      specially marked, with this marking spun back to the end-host
      sender; the sender's RTT measurement should then not use those
      packets.

      Buffer management is nontrivial in this case.  It has to be
      determined how many out-of-order packets can be buffered at the
      egress overlay node before it gives up waiting for a successful
      local retransmission.  As a lost packet is not always recovered
      successfully by local retransmission, the sender may invoke end-
      to-end fast retransmit later than it would in classic TCP.

   o  If the LOOPS network does not buffer the out-of-order packets
      caused by packet loss, the TCP sender can use time-based loss
      detection like RACK [I-D.ietf-tcpm-rack] to avoid invoking fast
      retransmit too early.  RACK uses the notion of time to replace
      the conventional DUPACK-threshold approach to detecting losses.
      RACK needs to be tuned to fit local retransmission better.  If
      there are n similar segments over the path, segment
      retransmission will on average add at least RTT/n to the required
      reordering window when the packet is lost only once over the
      whole overlay path.  This approach is preferable to the one
      described in the previous bullet.  On the other hand, if time-
      based loss detection is not supported at the sender, end-to-end
      retransmission will be invoked as usual, wasting some bandwidth.

4.1.1.  OE to OE Measurement, Recovery and Multipathing

   When local recovery is performed between two neighboring ONs, it is
   called per-hop recovery.  It can be between two overlay relays or
   between an overlay relay and an overlay edge.  Another type of local
   recovery, called OE-to-OE recovery, is performed between overlay
   edge nodes.
   When the segments of an overlay path have similar characteristics
   and/or only the OEs have the expected processing capability, OE to
   OE local recovery can be used instead of per-hop recovery.

   If there is more than one overlay path in an overlay tunnel,
   multipathing splits and recombines the traffic.  Measurements such
   as round-trip time and loss rate between OEs have to be specific to
   each path.  The ingress OE can use the measurement feedback to
   determine the FEC parameter settings for each path.  FEC can also be
   configured to work over the combined path.  The egress OE must be
   able to remove replicated packets when the overlay path is switched
   during an impairment.

   OE to OE measurement can help determine each segment's proportion of
   the edge-to-edge delay.  This is useful for an ON to decide whether
   it is necessary to turn on per-hop recovery, or how to fine-tune the
   parameter settings.  When a segment's share of the delay is small,
   segment retransmission is more effective.

4.2.  Congestion Control Interaction

   When a TCP-like transport layer protocol is used, local recovery in
   LOOPS has to interact with the upper-layer transport congestion
   control.  Classic TCP adjusts the congestion window when a loss is
   detected and fast retransmit is invoked.

   The local recovery mechanism breaks the assumption, built into
   classic TCP, that detected packet loss is a necessary and sufficient
   condition for triggering congestion control at the sender.  A loss
   that is locally recovered can be caused by non-persistent congestion
   such as a microburst, or by random loss, neither of which should
   ideally make the sender invoke the congestion control mechanism.
   But it can also be caused by real persistent congestion, which
   should make the sender reduce its sending rate.  In either case, the
   sender does not see the locally recovered packet as a loss.
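   This masking effect can be illustrated with a toy model of a
   receiver's cumulative-ACK generation (a sketch only; the sequence
   numbers are illustrative, and the three-DUPACK threshold is the
   classic fast retransmit trigger):

```python
def dupacks_seen(arrivals):
    """Count the duplicate ACKs a receiver would emit for a stream of
    arriving segment numbers, using classic cumulative ACKs."""
    expected = 0          # next in-order segment the receiver wants
    buffered = set()      # out-of-order segments held by the receiver
    dupacks = 0
    for seg in arrivals:
        if seg == expected:
            expected += 1
            while expected in buffered:   # drain any now-in-order data
                buffered.discard(expected)
                expected += 1
        elif seg > expected:
            buffered.add(seg)
            dupacks += 1  # each out-of-order arrival re-ACKs `expected`
    return dupacks

# Segment 2 is lost in the underlay and arrives last, recovered by a
# local retransmission somewhere on the overlay path:
late_recovery = [0, 1, 3, 4, 5, 6, 2]
# With egress buffering, the overlay delivers everything in order:
in_order = [0, 1, 2, 3, 4, 5, 6]

print(dupacks_seen(late_recovery))  # 4 dupacks -> fast retransmit fires
print(dupacks_seen(in_order))       # 0 dupacks -> loss invisible to sender
```

   In the first case the sender would see four duplicate ACKs and
   retransmit end-to-end; with in-order delivery from the egress, the
   locally recovered loss never surfaces as a loss signal.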
   When local recovery takes effect, we consider the following two
   cases.  First, the classic TCP sender does not see enough duplicate
   ACKs to trigger fast retransmit.  This can be the result of in-order
   packet delivery to the receiver, including the locally recovered
   packets, as mentioned in the previous subsection.  A classic TCP
   sender in this case will not reduce the congestion window, as no
   loss is detected.  Second, if time-based loss detection such as RACK
   is used, the congestion window will not be reduced as long as the
   ACK for the locally recovered packet reaches the sender before the
   reordering window expires.

   Such behavior brings the desirable throughput improvement when the
   recovered packet was lost due to non-persistent congestion.  It
   solves the throughput problem mentioned in Section 2.3 and
   Section 3.  However, it also brings the risk that the sender fails
   to detect real persistent congestion in time and overshoots.
   Eventually a severe congestion that is not recoverable by a local
   recovery mechanism may occur.  In addition, it may be unfriendly to
   other flows (possibly pushing them out) if those flows are running
   over the same underlying bottleneck links.

   There is a spectrum of approaches.  At one end, each locally
   recovered packet can be treated exactly as a loss, by setting its CE
   (Congestion Experienced) bit, in order to invoke congestion control
   at the sender and guarantee the same fair sharing as classic TCP.
   Explicit Congestion Notification (ECN) can be used here, as ECN
   marking was required to be equivalent to a packet drop [RFC3168].
   Congestion control at the sender then works as usual and no
   throughput improvement is achieved (although the benefit of faster
   recovery remains).  At the other end, an ON can perform its own
   congestion measurement over the segment, for instance of the local
   RTT and its variation trend.
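   One hypothetical shape of such a segment-level measurement follows;
   the class name, window size, and the 1.5x threshold are all
   illustrative assumptions and not part of LOOPS:

```python
from collections import deque

class SegmentCongestionEstimator:
    """Classify a local loss as congestive or not from the trend of
    segment RTT samples (illustrative sketch)."""

    def __init__(self, window=16, factor=1.5):
        self.samples = deque(maxlen=window)  # recent segment RTTs
        self.factor = factor                 # growth threshold

    def on_rtt_sample(self, rtt):
        self.samples.append(rtt)

    def loss_looks_congestive(self):
        if len(self.samples) < 4:
            return False             # too little data: assume random loss
        base = min(self.samples)     # proxy for uncongested segment RTT
        recent = sum(list(self.samples)[-4:]) / 4
        # A recent mean well above the baseline suggests queue buildup.
        return recent > self.factor * base

est = SegmentCongestionEstimator()
for rtt in [10, 10, 11, 10, 10, 10]:
    est.on_rtt_sample(rtt)
print(est.loss_looks_congestive())   # False: flat RTT, treat as random
for rtt in [18, 22, 25, 30]:
    est.on_rtt_sample(rtt)
print(est.loss_looks_congestive())   # True: rising RTT, congestive
```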
   The ON can then determine whether a lost packet was caused by
   congestion or by other factors, and further decide whether to set
   the CE marking, or even what marking ratio to use, to make the
   sender adjust its sending rate more appropriately.

   There are cases in which the sender detects the loss even with local
   recovery functioning.  For example, when the reordering window in
   RACK is not optimally adapted, the sender may trigger congestion
   control at the same time as an end-to-end retransmission.  If
   spurious retransmission detection based on DSACK [RFC3708] is used,
   such an end-to-end retransmission will be found to be unnecessary
   when the locally recovered packet reaches the receiver successfully,
   and the congestion control changes will be undone at the sender.
   This has pros and cons similar to those described earlier.  The pros
   are preventing unnecessary window reduction and improving throughput
   when the loss is caused by non-persistent congestion or random loss.
   The cons are that mechanisms like ECN or its variants have to be
   used wisely to make sure congestion control is still invoked in case
   of persistent congestion.

   An approach where the losses on a path segment are not immediately
   made known to the end-to-end congestion control can be combined with
   a "circuit breaker" style congestion control on the path segment:
   when the overlay flow's usage of the path segment starts to become
   unfair, the path segment sends congestion signals up to the
   end-to-end congestion control.  This must be carefully tuned to
   avoid unwanted oscillation.

   In summary, local recovery can improve the Flow Completion Time
   (FCT) by eliminating tail loss in small flows.  As it turns loss
   events into out-of-order events in most cases from the TCP sender's
   perspective, there is some implication for throughput if the sender
   uses loss-based congestion control.
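   The DSACK-based undo mentioned above can be sketched as a toy piece
   of sender-side state.  This is a simplification of the [RFC3708]
   idea, not a complete algorithm, and the names are illustrative:

```python
class SpuriousRetxUndo:
    """Toy sender bookkeeping: halve cwnd on a retransmission, and
    undo it if a DSACK later shows the retransmission was spurious."""

    def __init__(self, cwnd=64):
        self.cwnd = cwnd
        self.retransmitted = set()
        self.saved_cwnd = None

    def on_retransmit(self, seq):
        self.retransmitted.add(seq)
        if self.saved_cwnd is None:
            self.saved_cwnd = self.cwnd
            self.cwnd //= 2          # loss-based window reduction

    def on_dsack(self, seq):
        # A DSACK for a sequence we retransmitted means the original
        # copy (e.g. locally recovered in the overlay) also arrived,
        # so the end-to-end retransmission was unnecessary.
        if seq in self.retransmitted and self.saved_cwnd is not None:
            self.cwnd = self.saved_cwnd   # undo the reduction
            self.saved_cwnd = None

s = SpuriousRetxUndo(cwnd=64)
s.on_retransmit(1000)
print(s.cwnd)    # 32: window halved on the end-to-end retransmission
s.on_dsack(1000)
print(s.cwnd)    # 64: DSACK revealed the retransmission was spurious
```

   The risk discussed above is visible here: if the loss was in fact
   congestive, the undo restores the window unless some other signal,
   such as ECN, still forces a reduction.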
   We suggest enabling both ECN and spurious retransmission detection
   when local recovery is in use; this gives the desirable throughput
   behavior, i.e., when a loss is caused by congestion, the congestion
   window is reduced; otherwise the sender's sending rate is kept.  We
   do not suggest using spurious retransmission detection alone
   together with local recovery, as it may cause the TCP sender to
   falsely undo a window reduction when congestion occurs.  If only ECN
   is enabled, or neither ECN nor spurious retransmission detection is
   enabled, the throughput with local recovery in use is not much
   different from that of traditional TCP.

4.3.  Overlay Protocol Extensions

   The overlay usually has no control over how packets are routed in
   the underlying network between two overlay nodes, but it can
   control, for example, the sequence of overlay nodes a message
   traverses before reaching its destination.  LOOPS assumes the
   overlay protocol can deliver the packets in such a designated
   sequence.  Most forms of overlay networking use some sort of
   "encapsulation".  The whole path can be formed by stitching together
   multiple short overlay paths, e.g., with VXLAN [RFC7348] or GENEVE
   [I-D.ietf-nvo3-geneve], or it can be a single overlay path with a
   sequence of intermediate overlay nodes specified, as in SRv6
   [I-D.ietf-6man-segment-routing-header].  Either way, LOOPS
   information needs to be embedded in those protocols to support data
   plane measurement and feedback.  Retransmission or FEC based loss
   recovery can be either per ON-hop or OE to OE.

   LOOPS alone has no setup requirement on the control plane.  Some
   overlay protocols, e.g., CAPWAP [RFC5415], have a session setup
   phase, which can be used to exchange information such as dynamic FEC
   parameters.

4.4.  Summary

   LOOPS is expected to extend existing overlay protocols in the data
   plane.
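   As a rough illustration of the kind of per-packet metadata such an
   encapsulation extension might carry, consider the following sketch.
   The field set and layout are purely hypothetical, not a defined
   LOOPS format:

```python
import struct

# Hypothetical LOOPS metadata block that an overlay encapsulation
# (e.g. a Geneve option) might carry: a packet sequence number for
# loss/reordering detection, a timestamp for segment RTT measurement,
# flags, and an FEC block identifier.
LOOPS_FMT = "!IIBB2x"   # seq, timestamp (us), flags, FEC block id, pad

FLAG_LOCALLY_RECOVERED = 0x01   # marking spun back so the sender can
                                # exclude this packet from RTT sampling

def pack_loops(seq, ts_us, flags, fec_block):
    return struct.pack(LOOPS_FMT, seq, ts_us, flags, fec_block)

def unpack_loops(blob):
    seq, ts_us, flags, fec_block = struct.unpack(LOOPS_FMT, blob)
    return {"seq": seq, "ts_us": ts_us,
            "recovered": bool(flags & FLAG_LOCALLY_RECOVERED),
            "fec_block": fec_block}

blob = pack_loops(seq=42, ts_us=1_000_000,
                  flags=FLAG_LOCALLY_RECOVERED, fec_block=7)
print(len(blob))            # 12-byte option body
print(unpack_loops(blob))
```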
   Path selection is assumed to be a feature provided by the overlay
   protocols, via SDN or other approaches, and is not part of LOOPS.
   LOOPS is a set of functions to be implemented on ONs in a long-haul
   overlay network.  LOOPS includes the following features:

   1.  Local recovery.  Retransmission, FEC, or a hybrid can be used as
       the local recovery method.  Such a recovery mechanism is
       in-network: it is performed by two network nodes with computing
       and memory resources.

   2.  Local congestion measurement.  The sending ON measures the local
       segment RTT, loss, and/or throughput to obtain the overlay
       segment status immediately.

   3.  Signaling to end-to-end congestion control.  A strategy to set
       or not set the ECN CE marking, or simply drop the packet, to
       signal the loss event to the end host sender and help it adjust
       the sending rate.

5.  Security Considerations

   LOOPS does not look at the traffic payload, so encrypted payloads do
   not affect the functionality of LOOPS.  The use of LOOPS introduces
   some issues that impact security.  An ON with LOOPS functions
   represents a point in the network where the traffic can potentially
   be manipulated.  A denial-of-service attack can be launched from an
   ON.  A rogue ON might be able to spoof packets as if they came from
   a legitimate ON.  It may also modify the ECN CE marking in packets
   to influence the sender's rate.  To protect against such attacks,
   the overlay protocol itself should have some built-in security
   protection that can inherently be used by LOOPS.  The operator
   should use an authentication mechanism to make sure ONs are valid
   and non-compromised.

6.  IANA Considerations

   No IANA action is required.

7.  Acknowledgements

   Thanks to the etosat mailing list for the discussion of the SatCom
   use case for LOOPS.

8.  Informative References

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981.
   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
              July 2003.

   [RFC3708]  Blanton, E. and M. Allman, "Using TCP Duplicate Selective
              Acknowledgement (DSACKs) and Stream Control Transmission
              Protocol (SCTP) Duplicate Transmission Sequence Numbers
              (TSNs) to Detect Spurious Retransmissions", RFC 3708,
              DOI 10.17487/RFC3708, February 2004.

   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and
              J. Rey, "Extended RTP Profile for Real-time Transport
              Control Protocol (RTCP)-Based Feedback (RTP/AVPF)",
              RFC 4585, DOI 10.17487/RFC4585, July 2006.

   [RFC4588]  Rey, J., Leon, D., Miyazaki, A., Varsa, V., and R.
              Hakenberg, "RTP Retransmission Payload Format", RFC 4588,
              DOI 10.17487/RFC4588, July 2006.

   [RFC5415]  Calhoun, P., Ed., Montemurro, M., Ed., and D. Stanley,
              Ed., "Control And Provisioning of Wireless Access Points
              (CAPWAP) Protocol Specification", RFC 5415,
              DOI 10.17487/RFC5415, March 2009.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, DOI 10.17487/RFC5681, September 2009.

   [RFC6298]  Paxson, V., Allman, M., Chu, J., and M. Sargent,
              "Computing TCP's Retransmission Timer", RFC 6298,
              DOI 10.17487/RFC6298, June 2011.

   [RFC6830]  Farinacci, D., Fuller, V., Meyer, D., and D. Lewis, "The
              Locator/ID Separation Protocol (LISP)", RFC 6830,
              DOI 10.17487/RFC6830, January 2013.

   [RFC7348]  Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger,
              L., Sridhar, T., Bursell, M., and C.
              Wright, "Virtual eXtensible Local Area Network (VXLAN): A
              Framework for Overlaying Virtualized Layer 2 Networks
              over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348,
              August 2014.

   [I-D.dukkipati-tcpm-tcp-loss-probe]
              Dukkipati, N., Cardwell, N., Cheng, Y., and M. Mathis,
              "Tail Loss Probe (TLP): An Algorithm for Fast Recovery of
              Tail Losses", draft-dukkipati-tcpm-tcp-loss-probe-01
              (work in progress), February 2013.

   [I-D.ietf-nvo3-geneve]
              Gross, J., Ganga, I., and T. Sridhar, "Geneve: Generic
              Network Virtualization Encapsulation",
              draft-ietf-nvo3-geneve-13 (work in progress), March 2019.

   [I-D.ietf-tcpm-rack]
              Cheng, Y., Cardwell, N., Dukkipati, N., and P. Jha,
              "RACK: a time-based fast loss detection algorithm for
              TCP", draft-ietf-tcpm-rack-05 (work in progress),
              April 2019.

   [I-D.ietf-6man-segment-routing-header]
              Filsfils, C., Dukes, D., Previdi, S., Leddy, J.,
              Matsushima, S., and d. daniel.voyer@bell.ca, "IPv6
              Segment Routing Header (SRH)",
              draft-ietf-6man-segment-routing-header-19 (work in
              progress), May 2019.

   [I-D.cardwell-iccrg-bbr-congestion-control]
              Cardwell, N., Cheng, Y., Yeganeh, S., and V. Jacobson,
              "BBR Congestion Control",
              draft-cardwell-iccrg-bbr-congestion-control-00 (work in
              progress), July 2017.

   [DOI_10.1109_ICDCS.2016.49]
              Cai, C., Le, F., Sun, X., Xie, G., Jamjoom, H., and R.
              Campbell, "CRONets: Cloud-Routed Overlay Networks", 2016
              IEEE 36th International Conference on Distributed
              Computing Systems (ICDCS), DOI 10.1109/icdcs.2016.49,
              June 2016.

   [DOI_10.1145_3038912.3052560]
              Haq, O., Raja, M., and F. Dogar, "Measuring and Improving
              the Reliability of Wide-Area Cloud Paths", Proceedings of
              the 26th International Conference on World Wide Web -
              WWW '17, DOI 10.1145/3038912.3052560, 2017.
Authors' Addresses

   Yizhou Li
   Huawei Technologies
   101 Software Avenue,
   Nanjing 210012
   China

   Phone: +86-25-56624584
   Email: liyizhou@huawei.com


   Xingwang Zhou
   Huawei Technologies
   101 Software Avenue,
   Nanjing 210012
   China

   Email: zhouxingwang@huawei.com