idnits 2.17.1

draft-stephan-quic-interdomain-troubleshooting-04.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document doesn't use any RFC 2119 keywords, yet seems to have RFC
     2119 boilerplate text.

  -- The document date (January 9, 2020) is 1568 days in the past.  Is this
     intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Unused Reference: 'SBA5G' is defined on line 430, but no explicit
     reference was found in the text

     Summary: 0 errors (**), 0 flaws (~~), 3 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

1     QUIC WG                                                       E. Stephan
2     Internet Draft                                                  M. Cayla
3     Intended status: Informational                                  A. Braud
4     Expires: July 2020                                              F. Fieau
5                                                                  A. Ferrieux
6                                                                       Orange
7                                                                     M. Ihlar
8                                                                     Ericsson
9                                                              January 9, 2020

11                      QUIC Interdomain Troubleshooting
12           draft-stephan-quic-interdomain-troubleshooting-04.txt

14    Status of this Memo

16       This Internet-Draft is submitted in full conformance with the
17       provisions of BCP 78 and BCP 79.

19       Internet-Drafts are working documents of the Internet Engineering
20       Task Force (IETF), its areas, and its working groups.
         Note that
21       other groups may also distribute working documents as Internet-
22       Drafts.

24       Internet-Drafts are draft documents valid for a maximum of six
25       months and may be updated, replaced, or obsoleted by other documents
26       at any time. It is inappropriate to use Internet-Drafts as reference
27       material or to cite them other than as "work in progress."

29       The list of current Internet-Drafts can be accessed at
30       http://www.ietf.org/ietf/1id-abstracts.txt

32       The list of Internet-Draft Shadow Directories can be accessed at
33       http://www.ietf.org/shadow.html

35       This Internet-Draft will expire on July 9, 2020.

37    Copyright Notice

39       Copyright (c) 2020 IETF Trust and the persons identified as the
40       document authors. All rights reserved.

42       This document is subject to BCP 78 and the IETF Trust's Legal
43       Provisions Relating to IETF Documents
44       (http://trustee.ietf.org/license-info) in effect on the date of
45       publication of this document. Please review these documents
46       carefully, as they describe your rights and restrictions with
47       respect to this document.

49    Abstract

51       On-path network performance measurement methods currently deployed
52       contribute to the ossification of the Internet because they are
53       expensive to deploy and to maintain. This draft motivates the
54       exposure of QUIC header fields for on-path network measurements and
55       their specification in the QUIC core protocol as a way to keep
56       on-path network performance measurements from ossifying the IP
57       stack in the future.

59    Table of Contents

61       1. Introduction
62       2. Conventions used in this document
63       3. Interdomain UX troubleshooting
64       4. Reference of Network Performance
65       5. Explicit measurement signals
66          5.1. Spinbit, measuring the delay
67          5.2.
               lossbits, measuring packet losses
68       6. QUIC Fallback
69          6.1. Flapping
70       7. Versioning and Implementations
71       8. Tools
72          8.1. Spindump
73       9. Security Considerations
74       10. IANA Considerations
75       11. Discussions
76          11.1. Fallback
77          11.2. On-path Measurement
78       12. References
79          12.1. Normative References
80          12.2. Informative References
81       13. Acknowledgments

83    1. Introduction

85       The IP layer does not include the means for measuring the delay
86       and packet losses of segments of a path. Network performance is
87       currently measured at points of presence along the path [SPATIAL],
88       [COMPO] using transport fields of the upper layers: TCP transport
89       layer, RTP application layer...

91       The evolution of the Internet stack toward end-to-end integrity
92       protection is unavoidable [IABSEC]. This document presents the
93       benefits of preserving the same on-path network performance
94       measurement capabilities in the evolution of TCP (TCP/TLS,
95       TCPinc...) and UDP (QUIC/UDP...) currently specified at the IETF.

97       On-path network performance measurement methods currently deployed
98       contribute to the ossification of the Internet because they are
99       expensive to deploy and complex to maintain. This is due to the use
100      of protocol fields not primarily designed for this purpose.
         This
101      draft motivates the exposure of the fields for on-path network
102      performance measurements of the delay [SPINBIT] and of the packet
103      losses [LOSSBITS] in the QUIC core protocol to keep network
104      performance measurements from ossifying the IP stack in the future.

106      The memo recalls the UX interdomain troubleshooting complexity
107      [FALLBACK] introduced by QUIC deployment. Then it describes
108      operational concerns with QUIC fallbacks and discusses the potential
109      impacts on security. Finally, it discusses the benefits of
110      durably exposing the fields needed for measuring packet delay and
111      losses.

113   2. Conventions used in this document

115      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
116      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
117      document are to be interpreted as described in RFC 2119 [RFC2119].

119   3. Interdomain UX troubleshooting

121      Fast troubleshooting of network performance is mandatory to maintain
122      a high Quality of Experience (QoE) for end users.

124      Troubleshooting is the act of identifying the origin of a problem. A
125      major case is the localization of troubles impacting a large number
126      of customers. This case becomes critical when it appears suddenly
127      and represents a noticeable part of the traffic between the entity
128      that connects customers (ISP) and the entity that provides the data
129      (APP).

131      It becomes critical because network operation center (NOC) teams of
132      the two entities are expected to immediately identify the causes in
133      order to restore UX as quickly as possible. Each team checks whether
134      the point of failure is in its domain or outside. When a team
135      locates the point of failure in its own entity, it investigates its
136      own chain of components (network, routers, reverse proxies, ...) and
137      quickly fixes the issue.

139      It becomes extremely critical when an entity locates the point of
140      failure outside of its own domain.
         In this case the time needed to fix
141      the problem is much longer and unpredictable because it depends on
142      other entities on the path performing the same actions on their
143      segments.

145      There are many cases of troubleshooting. A typical example starts by
146      signaling to the ISP that its end users are experiencing a
147      significant decrease of QoE when using an Internet application.
148      Typical points of failure can be line card memory errors or
149      overloaded routers located somewhere on the path, either in the ISP
150      domain or outside.

152      The ISP NOC must localize the point of failure quickly. Currently it
153      proceeds by dichotomy (is the point of failure inside the ISP domain
154      or outside?) using passive monitoring of packet loss and congestion.

156      The parameters in use are described below:

158      Packet loss downstream (and vice versa for upstream):

160      o  Measuring packet loss before the measurement point requires TCP
161         sequence numbers;

163      o  Measuring loss located after the measurement point requires TCP
164         ACK + SACK.

166      Congestion:

168      o  Congestion also manifests as increased delays in queues, located
169         by measuring half-RTTs upstream and downstream from the
170         measurement point;

172      o  The analysis is based on TCP SEQ/ACK/Timestamp correlation.

174   4. Reference of Network Performance

176      A reference of the real performance of the network is not always
177      provided by counters of network equipment. Counters may not be
178      implemented, and their values are not always stable and could be
179      compromised in case of software bugs or equipment congestion. In
180      addition, to face the increasing network architecture complexity
181      brought by the evolution of access networks, security and hosting
182      infrastructures, NOCs need reliable network performance measurement
183      in near real time.
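
As a rough illustration of the sequence-number-based loss localization listed in Section 3, the following Python sketch classifies the TCP segments of one flow direction as seen at a single measurement point. It is a simplified model under stated assumptions, not a method defined by this memo: the `LossObserver` class and its method names are hypothetical, sequence number wraparound is ignored, and reordered packets would be miscounted as losses. A segment that fills an earlier gap is counted as lost before the measurement point; a segment whose data was already seen is counted as lost after it (the ACK + SACK analysis is approximated here by duplicate detection alone).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LossObserver:
    """Hypothetical passive observer for one direction of one TCP flow."""
    next_seq: Optional[int] = None          # next expected sequence number
    holes: List[Tuple[int, int]] = field(default_factory=list)  # unseen ranges
    upstream_losses: int = 0                # lost before the measurement point
    downstream_losses: int = 0              # lost after the measurement point

    def observe(self, seq: int, length: int) -> str:
        """Classify one observed segment (no wraparound handling)."""
        end = seq + length
        if self.next_seq is None or seq == self.next_seq:
            self.next_seq = end
            return "in-order"
        if seq > self.next_seq:
            # Data is missing at this point: remember the unseen range.
            self.holes.append((self.next_seq, seq))
            self.next_seq = end
            return "gap"
        # Old data: either it fills a hole or it was already seen here.
        for i, (start, stop) in enumerate(self.holes):
            if seq < stop and end > start:
                remaining = []
                if start < seq:
                    remaining.append((start, seq))
                if end < stop:
                    remaining.append((end, stop))
                self.holes[i:i + 1] = remaining
                self.upstream_losses += 1
                return "upstream-loss"
        self.downstream_losses += 1
        return "downstream-loss"
```

Running one such observer per flow and per direction, and comparing the two counters, gives the inside/outside dichotomy described in Section 3, within the limits of the assumptions above.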

185      In practice, these measurement tools shall be able to monitor
186      numerous points and interfaces within the network to provide near
187      real time network performance indicators taking into account the
188      global network state.

190      These measurement systems can also detect issues or
191      unexplained behaviors on equipment, links, peers: for instance, in
192      mobile networks, operators shall be able to identify in real time
193      bottleneck links responsible for customer experience degradation and
194      take necessary actions to avoid a further snowball effect.

196      Charging of data usage is another key feature for Mobile Network
197      Operators where accurate per-flow information is collected. It is
198      important to reconcile the amount of charged data with the
199      amount of data seen by the network (cf. the discussion on goodput in
200      Section 7). This could allow detecting fraud attempts or
201      dysfunctions within the network. In case of a significant gap,
202      operators must be able to react quickly to isolate this traffic.
203      Additionally, charging may require the differentiation of the
204      goodput from the throughput.

206      Continuous network performance monitoring requires packet loss and
207      delay measurements to allow operators to properly manage their
208      networks and to provide them with a reference of the performance of
209      their network for interdomain troubleshooting.

211   5. Explicit measurement signals

213   5.1. Spinbit, measuring the delay

215      An alternative to exposing raw transport protocol data to the path
216      is to have explicitly designed signals with the purpose of
217      facilitating on-path measurements. To facilitate troubleshooting,
218      such a signal should enable passive measurement of RTT, packet loss
219      and other congestion indicators such as ECN. An example of how a
220      transport protocol could expose measurement information to the path
221      would be three flag bits available in a public header.
         The first bit
222      would be used for passive RTT measurement, bit 2 would indicate
223      whether the packet contains retransmitted data and bit 3 would
224      indicate whether the packet contains an ECN echo. An explicit signal
225      is unambiguous and simpler for a middlebox to interpret than parsed
226      transport headers. Furthermore, it could be invariant between
227      revisions of the transport protocol that exposes it, which minimizes
228      the risk of network ossification.

230      At this stage, the WG has specified a signal to measure the
231      round-trip delay [SPINBIT].

233   5.2. lossbits, measuring packet losses

235      QUIC packet numbering is available in QUIC but encrypted.
236      Consequently, an additional signal such as the one described in
237      [LOSSBITS] is needed to measure packet losses on the path.

239      [TODO: add other proposals]

241   6. QUIC Fallback

243      Fallback is necessary to address cases where a QUIC connection
244      establishment fails [QUICAPP] (a device on the path blocks UDP, the
245      stack blocks 0-RTT...).

247      Fallback may also occur when an active QUIC connection drops
248      and tries to reconnect. As an example, the steps of the fallback
249      could be:

251      o  The QUIC connection drops accidentally;

253      o  The UA falls back and connects over TCP/TLS to the origin server;

255      o  The UA receives from the origin server an indication for an
256         alternate service [ALTSVC] supporting QUIC;

258      o  The UA gracefully ends the TCP connection;

260      o  The UA tries to establish a QUIC connection to the server and
261         port described in the alternate service indication;

263      o  The QUIC 1-RTT connection is established.

265   6.1. Flapping

267      A fallback may suddenly occur when one or more elements (links,
268      nodes, reverse proxy, switch, server ...) of the path fail or are
269      reconfigured.

271      There are cases where the fallback loops and triggers flapping
272      between the origin server and the alternate server.
         As an example,
273      this might happen when an alternate service indication is outdated
274      and points to a server which does not support QUIC anymore.

276      This becomes critical for UX when numerous fallbacks occur suddenly
277      on the same path between a set of customers of an ISP and another
278      entity which provides the application data. The time to troubleshoot
279      can be very long. The origin server and the alternate servers can be
280      hosted by different entities.

282      This behavior should be specified in the QUIC core specifications.

284   7. Versioning and Implementations

286      Versioning is an important part of the QUIC protocol framework
287      [QUICCORE]. Multiple versions of the protocol are expected to be
288      deployed and used concurrently. In order to encourage networks to
289      rapidly support the QUIC protocol and to support any version of
290      QUIC in the future, the exposure of the fields for on-path network
291      performance measurement must not depend on the version.

293      There might be numerous implementations of the QUIC protocol in the
294      future. A significant share of them will implement congestion
295      control at the application level. There will be unfair behaviors,
296      such as abnormal retransmission rates, which will impact the fair
297      distribution of bandwidth among the customers of the network.
298      Consequently, the network needs to be able to detect connections
299      which have abnormal throughput/goodput.

301   8. Tools

303   8.1. Spindump

305      The "Spindump" tool [SPINDUMP] is a Unix command-line utility that can
306      be used for latency monitoring in traffic passing through an interface.
307      The tool performs passive, in-network monitoring. It is not a tool to
308      monitor traffic content or metadata of individual connections, and
309      indeed that is not possible in the Internet as most connections are
310      encrypted.
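
To illustrate the kind of passive latency measurement such a tool performs, the following Python sketch derives RTT samples from the QUIC spin bit [SPINBIT] as seen in one direction of a single flow. It is not Spindump's code: the function name and the input format (arrival timestamp, spin bit value) are assumptions made for this example, and it omits the filtering a real tool needs against reordered or lost packets. The underlying idea is that the spin bit changes value once per round trip, so the time between two consecutive changes observed on-path approximates one RTT.

```python
from typing import Iterable, List, Optional, Tuple


def spin_rtt_samples(packets: Iterable[Tuple[float, int]]) -> List[float]:
    """Return RTT samples from (timestamp, spin_bit) pairs in arrival order.

    All packets must belong to the same flow and the same direction.
    The first spin bit transition only starts the clock; every later
    transition yields one sample equal to the time since the previous one.
    """
    samples: List[float] = []
    last_spin: Optional[int] = None
    last_edge_time: Optional[float] = None
    for timestamp, spin in packets:
        if last_spin is not None and spin != last_spin:
            if last_edge_time is not None:
                samples.append(timestamp - last_edge_time)
            last_edge_time = timestamp
        last_spin = spin
    return samples


# Timestamps in milliseconds; transitions occur at 50, 100 and 160 ms.
observed = [(0, 0), (10, 0), (50, 1), (60, 1), (100, 0), (160, 1)]
print(spin_rtt_samples(observed))  # [50, 60]
```

Running the same computation at several points along a path, in both directions, is what allows the delay to be attributed to individual segments.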

312      The tool looks at the characteristics of transport protocols, such as
313      the QUIC Spin Bit, and attempts to derive information about round-trip
314      times for individual connections or for the aggregate or average
315      values. The tool supports TCP, QUIC, CoAP, DNS, and ICMP traffic.
316      There's also an easy way to anonymize connection information so that
317      the resulting statistics cannot be used to infer anything about
318      specific connections or users.

320      Spindump can both generate and read JSON formatted measurement events.
321      Measurements from several collection points can be aggregated at a
322      central point. Having multiple Spindump instances along a path allows
323      for finer segmentation of measurements than a single
324      measurement point, which is useful for both inter- and intra-domain
325      troubleshooting.

327      The Spindump command-line utility is based on the Spindump library
328      which can be integrated with other measurement or troubleshooting
329      software.

331   9. Security Considerations

333      The integrity of the parameters exposed for measuring on-path delay
334      and losses can be end-to-end protected to increase the security of
335      the connection.

337      Flapping from QUIC to a fallback protocol might overload on-path
338      devices and end-points and consequently affect the stability of
339      the connections and introduce weaknesses.

341      The fallback from encrypted-header to clear-header transport
342      protocols might open the door to new types of active attacks.

344      It is not clear yet whether a network can distinguish numerous QUIC
345      fallback flappings from an active attack:

347      o  What is the expected behavior from the network?

349      o  Will networks detect QUIC flapping as an active attack?

351   10. IANA Considerations

353      This draft does not request any IANA actions.

355   11. Discussions

357   11.1. Fallback

359      Troubleshooting QUIC traffic and its fallbacks requires measuring
360      similar metrics.
         One suggestion is to use the integrity mechanism of
361      the TCPinc WG [TCPINC] to protect and keep visible the fields used
362      for on-path measurement.

364      Fallback must be precisely specified in the core specification of
365      QUIC [QUICCORE].

367      To avoid unnecessary flapping, [QUICCORE] might clarify the usage of
368      the advertisement of QUIC support in HTTP protocols [ALTSVC].

370      [QUICMAN] should propose guidance for the management of QUIC
371      fallback in a way that avoids flapping situations.

373   11.2. On-path Measurement

375      QUIC is designed to carry traffic other than HTTP, such as DNS and
376      Web. End-to-end encryption of the transport headers prevents the use
377      of models [E-MODEL] and heuristics to estimate UX on a path segment.
378      To maintain a high level of UX, QUIC capabilities should support the
379      measurement of the delay and the losses of a segment of a
380      source-to-destination path.

382      On-path measurement techniques are currently ad hoc. Adding the
383      exposure of the fields for on-path packet delay and losses in the
384      core specification of the QUIC protocol creates a stable network
385      performance measurement framework. It will be a real incentive for
386      networks to support QUIC rapidly and to support the numerous QUIC
387      versions in the future. This will reduce network impacts on the
388      ossification of the IP stack in the future.

390   12. References

392   12.1. Normative References

394      [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
395                Requirement Levels", BCP 14, RFC 2119, March 1997.

397      [QUICCORE] https://tools.ietf.org/html/draft-ietf-quic-transport

399      [FALLBACK] https://github.com/quicwg/base-drafts/issues/166

401      [IABSEC] https://www.iab.org/2014/11/14/iab-statement-on-internet-
402                confidentiality/

404      [SPINBIT] https://tools.ietf.org/html/draft-ietf-quic-transport-
405                20#section-17.3.1

407      [LOSSBITS] https://tools.ietf.org/html/draft-ferrieuxhamchaoui-quic-
408                lossbits-02

410   12.2.
         Informative References

412      [E-MODEL] https://www.itu.int/rec/T-REC-G.107-199812-S/en

414      [SPATIAL] https://tools.ietf.org/html/rfc5644

416      [COMPO] https://tools.ietf.org/html/rfc6049

418      [QUICAPP] https://tools.ietf.org/wg/quic/draft-ietf-quic-
419                applicability/

421      [QUICMAN] https://tools.ietf.org/wg/quic/draft-ietf-quic-
422                manageability/

424      [TCPINC] https://tools.ietf.org/wg/tcpinc/

426      [ALTSVC] https://tools.ietf.org/html/rfc7838

428      [SPINDUMP] https://github.com/EricssonResearch/spindump

430      [SBA5G] https://www.3gpp.org/ftp/Specs/archive/29_series/
431                29.893/29893-120.zip

433   13. Acknowledgments

435      This document was prepared using 2-Word-v2.0.template.dot.

437   Authors' Addresses

439      Emile Stephan
440      Orange
441      2, avenue Pierre Marzin
442      Lannion 22300
443      France

445      Email: emile.stephan@orange.com

447      Mathilde Cayla
448      Orange
449      6, avenue Albert Durand
450      Blagnac 31700
451      France

453      Email: mathilde.cayla@orange.com

455      Arnaud Braud
456      Orange
457      2, avenue Pierre Marzin
458      Lannion 22300
459      France

461      Email: arnaud.braud@orange.com

463      Fred Fieau
464      Orange
465      40-48, avenue de la Republique
466      Chatillon 92320
467      France

469      Email: frederic.fieau@orange.com

471      Alex Ferrieux
472      Orange
473      2, avenue Pierre Marzin
474      Lannion 22300
475      France

477      Email: alexandre.ferrieux@orange.com

479      Marcus Ihlar
480      Ericsson

482      Email: marcus.ihlar@ericsson.com