IP Performance Measurement                                     C. Paasch
Internet-Draft                                                  R. Meyer
Intended status: Experimental                                S. Cheshire
Expires: April 28, 2022                                       O. Shapira
                                                              Apple Inc.
                                                        October 25, 2021

              Responsiveness under Working Conditions
                draft-cpaasch-ippm-responsiveness-01

Abstract

   For many years, a lack of responsiveness, variously called lag,
   latency, or bufferbloat, has been recognized as an unfortunate but
   common symptom in today's networks.  Even after a decade of work on
   standardizing technical solutions, it remains a common problem for
   end users.

   Everyone "knows" that it is "normal" for a video conference to have
   problems when somebody else at home is watching a 4K movie or
   uploading photos from their phone.  However, there is no technical
   reason for this to be the case.  In fact, various queue management
   solutions (fq_codel, cake, PIE) have solved the problem for tens of
   thousands of people.

   Our networks remain unresponsive not from a lack of technical
   solutions, but rather from a lack of awareness of the problem.  We
   believe that creating a tool whose measurement matches people's
   everyday experience will create the necessary awareness, and result
   in a demand for products that solve the problem.

   This document specifies the "RPM Test" for measuring responsiveness.
   It uses common protocols and mechanisms to measure user experience,
   especially when the network is fully loaded ("responsiveness under
   working conditions").  The measurement is expressed as "Round-trips
   Per Minute" (RPM) and should be reported alongside throughput (up
   and down) and idle latency as critical indicators of network
   quality.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 28, 2022.

Copyright Notice

   Copyright (c) 2021 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Terminology
   2.  Design Constraints
   3.  Goals
   4.  Measuring Responsiveness Under Working Conditions
     4.1.  Working Conditions
       4.1.1.  From single-flow to multi-flow
       4.1.2.  Parallel vs Sequential Uplink and Downlink
       4.1.3.  Reaching saturation
       4.1.4.  Final "Working Conditions" Algorithm
     4.2.  Measuring Responsiveness
       4.2.1.  Aggregating the Measurements
       4.2.2.  Statistical Confidence
   5.  RPM Test Server API
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgments
   9.  Informative References
   Authors' Addresses

1.  Introduction

   For many years, a lack of responsiveness, variously called lag,
   latency, or bufferbloat, has been recognized as an unfortunate but
   common symptom in today's networks [Bufferbloat].  Solutions like
   fq_codel [RFC8290] or PIE [RFC8033] have been standardized and are,
   to some extent, widely implemented.  Nevertheless, people still
   suffer from bufferbloat.

   Although significant, the impact on user experience can be
   transitory - that is, its effect is not always present.  Whenever a
   network is actively being used at its full capacity, buffers can
   fill up and create latency for traffic.  The duration of those full
   buffers may be brief: a medium-sized file transfer, like an email
   attachment or uploading photos, can create bursts of latency
   spikes.  An example of this is lag occurring during a
   videoconference, where a connection is briefly shown as unstable.

   These short-lived disruptions make it hard to narrow down the
   cause.  We believe that it is necessary to create a standardized
   way to measure and express responsiveness.

   Existing network measurement tools could incorporate a
   responsiveness measurement into their set of metrics.  Doing so
   would also raise awareness of the problem and would make
   throughput, idle latency, and responsiveness the standard "network
   quality measures".

1.1.  Terminology

   A word about the term "bufferbloat" - the undesirable latency that
   comes from a router or other network equipment buffering too much
   data.  This document uses the term as a general description of bad
   latency, using more precise wording where warranted.

   "Latency" is a poor measure of responsiveness, since it can be hard
   for the general public to understand.  The units are unfamiliar
   ("what is a millisecond?") and counterintuitive ("100 msec - that
   sounds good - it's only a tenth of a second!").

   Instead, we create the term "Responsiveness under working
   conditions" to make it clear that we are measuring all, not just
   idle, conditions, and use "round-trips per minute" as the metric.
   The values range from 50 (poor) to 3,000 (excellent), with the
   added advantage that "bigger is better."  Finally, we abbreviate
   the measurement to "RPM", a wink to the "revolutions per minute"
   that we use for cars.

   This document defines an algorithm for the "RPM Test" that
   explicitly measures responsiveness under working conditions.

2.  Design Constraints

   There are many challenges around measurements on the Internet.
   They include the dynamic nature of the Internet, the diverse nature
   of the traffic, the large number of devices that affect traffic,
   and the difficulty of attaining appropriate measurement conditions.

   Internet paths are changing all the time.  Daily fluctuations in
   demand make the bottlenecks ebb and flow.  To minimize the
   variability of routing changes, it's best to keep the test duration
   relatively short.

   TCP and UDP traffic, or traffic on ports 80 and 443, may take
   significantly different paths on the Internet and be subject to
   entirely different Quality of Service (QoS) treatment.
A good test will use standard transport layer traffic - typical
   for people's use of the network - that is subject to the
   transport's congestion control, which might reduce the traffic's
   rate and thus its buffering in the network.

   Traditionally, one thinks of bufferbloat happening on the routers
   and switches of the Internet.  However, the networking stacks of
   the clients and servers can also have huge buffers.  Data sitting
   in TCP sockets or waiting for the application to send or read
   causes artificial latency, and affects user experience the same
   way as "traditional" bufferbloat.

   Finally, it is important to note that queueing only happens behind
   a slow "bottleneck" link in the network, and only occurs when
   sufficient traffic is present.  The RPM Test must ensure that
   buffers are actually full for a sustained period, and only then
   make repeated latency measurements in this particular state.

3.  Goals

   The algorithm described here defines an RPM Test that serves as a
   good proxy for user experience.  This means:

   1.  Today's Internet traffic primarily uses HTTP/2 over TLS.  Thus,
       the algorithm should use that protocol.

       As a side note: other types of traffic are gaining in
       popularity (HTTP/3) and/or are already widely used (RTP).
       Traffic prioritization and QoS rules on the Internet may
       subject such traffic to completely different paths; these
       could also be measured separately.

   2.  The Internet is marked by the deployment of countless
       middleboxes, like transparent TCP proxies, and by traffic
       prioritization for certain types of traffic.  The RPM Test
       must take into account their effect on DNS requests [RFC1035],
       the TCP handshake [RFC0793], the TLS handshake, and HTTP
       request/response transactions.

   3.  The test result should be expressed in an intuitive,
       nontechnical form.

   4.  Finally, to be useful to a wide audience, the measurement
       should finish within a short time frame.
       Our target is 20 seconds.

4.  Measuring Responsiveness Under Working Conditions

   To make an accurate measurement, the algorithm must reliably put
   the network in a state that represents those "working conditions".
   Once the network has reached that state, the algorithm can measure
   its responsiveness.  The following explains how the former and the
   latter are achieved.

4.1.  Working Conditions

   For the purpose of this methodology, typical "working conditions"
   represent a state of the network in which the bottleneck node is
   experiencing ingress and egress flows similar to those created by
   humans in the typical day-to-day pattern.

   While a single HTTP transaction might briefly put a network into
   working conditions, making reliable measurements requires
   maintaining that state over a sufficient period of time.

   The algorithm must also detect when the network is in a persistent
   working condition, also called "saturation".

   Desired properties of the "working conditions":

   o  They should not waste traffic, since the person may be paying
      for it.

   o  They should be reached within a short time, to avoid impacting
      other people on the same network, to avoid varying network
      conditions, and to not try the person's patience.

4.1.1.  From single-flow to multi-flow

   A single TCP connection may not be sufficient to saturate a path.
   For example, a 4 MB limit on the TCP window size may not fill the
   pipe.  Additionally, traditional loss-based TCP congestion control
   algorithms react aggressively to packet loss by reducing the
   congestion window.  This reaction (intended by the protocol design)
   decreases the queueing within the network, making it hard to reach
   saturation.

   The goal of the RPM Test is to keep the network as busy as possible
   in a sustained and persistent way.  It uses multiple TCP
   connections and gradually adds more TCP flows until saturation is
   reached.

4.1.2.  Parallel vs Sequential Uplink and Downlink

   Poor responsiveness can be caused by queues in either (or both) the
   upstream and the downstream direction.  Furthermore, the two paths
   may differ significantly due to access link conditions (e.g., 5G
   downstream and LTE upstream) or routing changes within the ISPs.
   To measure responsiveness under working conditions, the algorithm
   must saturate both directions.

   Measuring in parallel yields more data samples for a given
   duration.  Given the desired test duration of 20 seconds,
   sequential uplink and downlink tests would only yield half the
   data.  The RPM Test therefore specifies parallel, concurrent
   measurements.

   However, a number of caveats come with measuring in parallel:

   o  Half-duplex links may not permit simultaneous uplink and
      downlink traffic.  This means the test might not saturate both
      directions at once.

   o  Debuggability of the results becomes harder: during parallel
      measurement it is impossible to differentiate whether the
      observed latency happens in the uplink or the downlink
      direction.

   o  Consequently, the test should have an option for sequential
      testing.

4.1.3.  Reaching saturation

   The RPM Test gradually increases the number of TCP connections and
   measures "goodput" - the sum of actual data transferred across all
   connections in a unit of time.  When the goodput stops increasing,
   saturation has been reached.

   Saturation has two criteria: a) the load bearing connections are
   utilizing all the capacity of the bottleneck, and b) the buffers in
   the bottleneck are completely filled.

   Throughput gradually increases as the TCP connections complete
   their slow-start phase.  At that point, throughput eventually
   stalls, usually due to receive window limitations.  The only means
   to further increase throughput is by adding more TCP connections to
   the pool of load bearing connections.
   If new connections leave the throughput unchanged, saturation has
   been reached and - more importantly - the working condition is
   stable.

   Filling the buffers at the bottleneck depends on the congestion
   control deployed on the sender side.  Congestion control algorithms
   like BBR may reach high throughput without causing queueing,
   because the bandwidth-detection portion of BBR effectively seeks
   the bottleneck capacity.

   RPM Test clients and servers should therefore use loss-based
   congestion controls like Cubic to fill queues reliably.

   The RPM Test detects saturation when the observed goodput stops
   increasing even as connections are being added, or when it detects
   packet loss or ECN marks signaling congestion or a full buffer at
   the bottleneck link.

4.1.4.  Final "Working Conditions" Algorithm

   The following algorithm reaches working conditions (saturation) of
   a network by using HTTP/2 upload (POST) or download (GET) requests
   of infinitely large files.  The algorithm is the same for upload
   and download and uses the same term "load bearing connection" for
   each.

   The steps of the algorithm are:

   o  Create 4 load bearing connections.

   o  At each 1-second interval:

      *  Compute the "instantaneous aggregate" goodput, which is the
         number of bytes transferred within the last second.

      *  Compute a moving average of the last 4 "instantaneous
         aggregate goodput" measurements.

      *  If the moving average > the previous moving average + 5%:

         +  The network did not yet reach saturation.  If no flows
            were added within the last 4 seconds, add 4 more flows.

      *  Else, the network reached saturation for the current flow
         count.

         +  If new flows were added and the moving average throughput
            did not change for 4 seconds: the network has reached
            stable saturation.

         +  Else, add four more flows.

   Note: It is tempting to envision an initial base RTT measurement
   and adjust the intervals as a function of that RTT.
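The ramp-up steps above can be sketched as a small loop.  This is a non-normative illustration only; the helper names `measure_goodput` and `add_flows` are assumptions for the sketch, not part of this specification:

```python
from collections import deque

def reach_saturation(measure_goodput, add_flows, max_seconds=20):
    """Non-normative sketch of the "working conditions" ramp-up loop.

    measure_goodput() returns the bytes transferred across all load
    bearing connections during the last one-second interval, and
    add_flows(n) opens n additional load bearing connections.
    Returns the moving-average goodput once stable saturation is
    detected, or None if saturation is not reached within max_seconds.
    """
    add_flows(4)                        # create 4 load bearing connections
    window = deque(maxlen=4)            # last 4 "instantaneous aggregate" samples
    prev_avg = 0.0
    seconds_since_add = 0
    for _ in range(max_seconds):        # evaluate at each 1-second interval
        window.append(measure_goodput())
        avg = sum(window) / len(window)
        if avg > prev_avg * 1.05:       # moving average grew by more than 5%
            if seconds_since_add >= 4:  # no flows added in the last 4 seconds
                add_flows(4)
                seconds_since_add = 0
        elif seconds_since_add >= 4:    # no growth for 4 s at this flow count
            return avg                  # stable saturation reached
        else:
            add_flows(4)                # plateau may be transient: add flows
            seconds_since_add = 0
        prev_avg = avg
        seconds_since_add += 1
    return None
```

Driving the sketch with a synthetic goodput curve that grows and then plateaus shows the expected behavior: flows are added in batches of four until the moving average stops increasing, at which point the stable value is returned.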
   However, experiments have shown that adjusting the intervals to the
   base RTT makes the saturation detection extremely unstable in
   low-RTT environments.  In the situation where the "unloaded" RTT is
   in the single-digit millisecond range, yet the network's RTT
   increases under load to more than a hundred milliseconds, the
   intervals become much too short to accurately drive the algorithm.

4.2.  Measuring Responsiveness

   Once the network is in consistent working conditions, the RPM Test
   must "probe" the network multiple times to measure its
   responsiveness.

   Each RPM Test probe measures:

   1.  The responsiveness of the different steps needed to create a
       new connection, all during working conditions.

       To do this, the test measures the time needed to make a DNS
       request, establish a TCP connection on port 443, establish a
       TLS context using TLS 1.3 [RFC8446], and send and receive a
       one-byte object with an HTTP/2 GET request.  It repeats these
       steps multiple times for accuracy.

   2.  The responsiveness of the network and the client/server
       networking stacks for the load bearing connections themselves.

       To do this, the load bearing connections multiplex an HTTP/2
       GET request for a one-byte object to get the end-to-end latency
       on the connections that are using the network at full speed.

4.2.1.  Aggregating the Measurements

   The algorithm produces a set of five time values for each probe:
   DNS lookup, TCP handshake, TLS handshake, HTTP/2 request/response
   on separate (idle) connections, and HTTP/2 request/response on load
   bearing connections.  This fine-grained data is useful, but not
   necessary for creating a useful metric.

   To create a single "Responsiveness" (i.e., RPM) number, this first
   iteration of the algorithm gives an equal weight to each of these
   values.  That is, it sums the five time values for each probe, and
   divides by the total number of probes to compute an average probe
   duration.
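As a worked example of this equal-weight aggregation (a sketch only; the function name and the five-tuple representation of a probe are illustrative assumptions):

```python
def responsiveness_rpm(probes):
    """Sketch of the equal-weight aggregation of probes into RPM.

    Each probe is a 5-tuple of durations in seconds: DNS lookup, TCP
    handshake, TLS handshake, and the two HTTP/2 request/response
    times (on an idle and on a load bearing connection).
    """
    # Sum the five time values of every probe, then divide by the
    # number of probes to get the average probe duration in seconds.
    avg_probe_duration = sum(sum(p) for p in probes) / len(probes)
    # The reciprocal, normalized to 60 seconds, is Round-trips Per Minute.
    return 60.0 / avg_probe_duration
```

For instance, probes whose five values sum to 0.2 seconds each yield an average probe duration of 0.2 s and therefore 300 RPM.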
   The reciprocal of this average probe duration, normalized to 60
   seconds, gives the Round-trips Per Minute (RPM).

4.2.2.  Statistical Confidence

   The number of probes necessary for statistical confidence is an
   open question.  One could imagine a computation of the variance and
   confidence interval that would drive the number of measurements and
   balance the accuracy with the speed of the measurement itself.

5.  RPM Test Server API

   The RPM measurement uses standard protocols: no new protocol is
   defined.

   Both the client and the server MUST support HTTP/2 over TLS 1.3.
   The client MUST be able to send a GET request and a POST.  The
   server MUST be able to respond to both of these HTTP commands.
   Further, the server endpoint MUST be accessible through a hostname
   that can be resolved through DNS.  The server MUST have the ability
   to provide content upon a GET request.  Both client and server
   SHOULD use loss-based congestion controls like Cubic.  The server
   MUST use a packet scheduling algorithm that minimizes internal
   queueing to avoid affecting the client's measurement.

   The server MUST respond to 4 URLs:

   1.  A "small" URL/response: The server must respond with a status
       code of 200 and 1 byte in the body.  The actual body content is
       irrelevant.

   2.  A "large" URL/response: The server must respond with a status
       code of 200 and a body size of at least 8 GB.  The body can be
       bigger, and may need to grow as network speeds increase over
       time.  The actual body content is irrelevant.  The client will
       probably never completely download the object, but will instead
       close the connection after reaching working conditions and
       making its measurements.

   3.  An "upload" URL/response: The server must handle a POST request
       with an arbitrary body size.  The server should discard the
       payload.

   4.  A configuration URL that returns a JSON [RFC8259] object with
       the information the client uses to run the test.  Sample JSON:

   {
     "version": 1,
     "urls": {
       "small_https_download_url": "https://networkquality.example.com/api/v1/small",
       "large_https_download_url": "https://networkquality.example.com/api/v1/large",
       "https_upload_url": "https://networkquality.example.com/api/v1/upload"
     }
   }

   The client begins the responsiveness measurement by querying for
   the JSON configuration.  This supplies the URLs for creating the
   load bearing connections in the upstream and downstream direction
   as well as the small object for the latency measurements.

6.  Security Considerations

   TBD

7.  IANA Considerations

   TBD

8.  Acknowledgments

   We would like to thank Rich Brown for his editorial pass over this
   I-D.  We also thank Erik Auerswald for his constructive feedback on
   the I-D.

9.  Informative References

   [Bufferbloat]
              Gettys, J. and K. Nichols, "Bufferbloat: Dark Buffers in
              the Internet", Communications of the ACM, Volume 55,
              Number 1 (2012).

   [RFC0793]  Postel, J., "Transmission Control Protocol", STD 7,
              RFC 793, DOI 10.17487/RFC0793, September 1981,
              <https://www.rfc-editor.org/info/rfc793>.

   [RFC1035]  Mockapetris, P., "Domain names - implementation and
              specification", STD 13, RFC 1035, DOI 10.17487/RFC1035,
              November 1987, <https://www.rfc-editor.org/info/rfc1035>.

   [RFC8033]  Pan, R., Natarajan, P., Baker, F., and G. White,
              "Proportional Integral Controller Enhanced (PIE): A
              Lightweight Control Scheme to Address the Bufferbloat
              Problem", RFC 8033, DOI 10.17487/RFC8033, February 2017,
              <https://www.rfc-editor.org/info/rfc8033>.

   [RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON)
              Data Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017,
              <https://www.rfc-editor.org/info/rfc8259>.

   [RFC8290]  Hoeiland-Joergensen, T., McKenney, P., Taht, D., Gettys,
              J., and E. Dumazet, "The Flow Queue CoDel Packet
              Scheduler and Active Queue Management Algorithm",
              RFC 8290, DOI 10.17487/RFC8290, January 2018,
              <https://www.rfc-editor.org/info/rfc8290>.

   [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS)
              Protocol Version 1.3", RFC 8446, DOI 10.17487/RFC8446,
              August 2018, <https://www.rfc-editor.org/info/rfc8446>.

Authors' Addresses

   Christoph Paasch
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: cpaasch@apple.com

   Randall Meyer
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: rrm@apple.com

   Stuart Cheshire
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: cheshire@apple.com

   Omer Shapira
   Apple Inc.
   One Apple Park Way
   Cupertino, California 95014
   United States of America

   Email: oesh@apple.com