QUIC WG                                                       E. Stephan
Internet Draft                                                  M. Cayla
Intended status: Informational                                  A. Braud
Expires: January 2019                                           F. Fieau
                                                             A. Ferrieux
                                                                  Orange
                                                                M. Ihlar
                                                                Ericsson
                                                            July 2, 2018

                    QUIC Interdomain Troubleshooting
         draft-stephan-quic-interdomain-troubleshooting-01.txt

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on January 2, 2019.

Copyright Notice

   Copyright (c) 2018 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Abstract

   On-path network performance measurement methods currently deployed
   contribute to the ossification of the Internet because they are
   expensive to deploy and to maintain.  This draft motivates the
   exposure of QUIC header fields for on-path network measurements and
   their specification in the QUIC core protocol, as a way to keep
   on-path network performance measurements from ossifying the IP stack
   in the future.

Table of Contents

   1. Introduction
   2. Conventions used in this document
   3. Interdomain UX troubleshooting
   4. Reference of Network Performance
   5. Explicit measurement signals
   6. QUIC Fallback
      6.1. Flapping
   7. Versioning and Implementations
   8. Security Considerations
   9. IANA Considerations
   10. Discussions
      10.1. Fallback
      10.2. On-path Measurement
   11. References
      11.1. Normative References
      11.2. Informative References
   12. Acknowledgments

1. Introduction

   The IP layer does not carry the information needed to measure the
   delay and packet loss of individual segments of a path.  Network
   performance is currently measured at points of presence along the
   path [SPATIAL], [COMPO] using transport fields of the upper layers,
   such as the TCP transport layer or the RTP application layer.

   The evolution of the Internet stack toward end-to-end integrity
   protection is unavoidable [IABSEC].  This document presents the
   benefits of preserving the same on-path network performance
   measurement capabilities in the evolution of TCP (TCP/TLS, TCPinc...)
   and UDP (QUIC/UDP...) currently specified at the IETF.

   On-path network performance measurement methods currently deployed
   contribute to the ossification of the Internet because they are
   expensive to deploy and complex to maintain.  This is due to the use
   of protocol fields not primarily designed for this purpose.  This
   draft motivates the exposure of the fields for on-path network
   performance measurements in the QUIC core protocol, so that network
   performance measurements do not ossify the IP stack in the future.

   The memo recalls the UX interdomain troubleshooting complexity
   [Issue166] introduced by QUIC deployment.  Then it describes
   operational concerns about QUIC fallbacks and discusses the potential
   impacts on security.  Finally, it discusses the benefits of durably
   exposing the fields needed for measuring packet delay and losses.

2. Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

3. Interdomain UX troubleshooting

   Fast troubleshooting of network performance is mandatory to maintain
   a high Quality of Experience (QoE) for end users.

   Troubleshooting is the act of identifying the origin of a problem.  A
   major case is the localization of troubles impacting a large number
   of customers.  This case becomes critical when it appears suddenly
   and represents a noticeable part of the traffic between the entity
   that connects customers (ISP) and the entity that provides the data
   (APP).

   It becomes critical because the network operation center (NOC) teams
   of the two entities are expected to immediately identify the causes
   in order to restore UX as quickly as possible.  Each team checks
   whether the point of failure is inside or outside its own entity.
   When a team locates the point of failure in its own entity, it
   investigates its own chain of components (network, routers, reverse
   proxies,...) and can quickly fix the issue.

   It becomes extremely critical when an entity locates the point of
   failure outside of its own domain.  In this case the time needed to
   fix the problem is much longer and unpredictable because it depends
   on other entities on the path performing the same actions on their
   segments.

   There are many cases of troubleshooting.  A typical example starts
   with a report to the ISP that its end users are experiencing a
   significant decrease of QoE when using an Internet application.
   Typical points of failure are line card memory errors or overloaded
   routers located somewhere on the path, either in the ISP domain or
   outside.

   The ISP NOC has to locate the point of failure as quickly as
   possible.  Currently it proceeds by dichotomy (is the point of
   failure inside the ISP domain or outside?) using passive monitoring
   of packet loss and congestion.

   The parameters in use are described below.

   Packet loss downstream (and vice versa for upstream):

   o  Measuring packet loss before the measurement point requires the
      TCP sequence number;

   o  Measuring loss located after the measurement point requires TCP
      ACK + SACK.

   Congestion:

   o  Congestion also manifests as increased delays in queues, located
      by measuring half-RTTs upstream and downstream from the
      measurement point;

   o  The analysis is based on TCP SEQ/ACK/Timestamp correlation.

4. Reference of Network Performance

   A reference of the real performance of the network is not always
   provided by the counters of network equipment.  Counters may not be
   implemented, their values are not always stable, and their values
   could be compromised in case of software bugs or equipment
   congestion.  In addition, to face the increasing complexity of
   network architectures brought by the evolution of access networks,
   security and hosting infrastructures, NOCs need reliable network
   performance measurement in near real time.

   In practice, these measurement tools shall be able to monitor
   numerous points and interfaces within the network to provide near
   real time network performance indicators that take the global
   network state into account.
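
   As an illustration of the loss localization described in Section 3,
   the following Python sketch separates losses that occurred before
   the measurement point from losses that occurred after it.  It is a
   simplified sketch under stated assumptions, not a description of any
   deployed tool: the function names (missing_ranges, upstream_loss,
   downstream_loss) are hypothetical, segments and SACK blocks are
   modeled as half-open (start, end) byte ranges, and sequence number
   wraparound, reordering and retransmissions are ignored.

```python
def missing_ranges(covered, start, end):
    """Return the sub-ranges of [start, end) not covered by `covered`.

    `covered` is an iterable of half-open (lo, hi) byte ranges.
    """
    gaps = []
    cursor = start
    for lo, hi in sorted(covered):
        if hi <= cursor:
            continue  # range lies entirely before the cursor
        if lo >= end:
            break  # range lies entirely after the window
        if lo > cursor:
            gaps.append((cursor, min(lo, end)))
        cursor = max(cursor, hi)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


def upstream_loss(seen_segments):
    """Byte ranges never seen at the measurement point.

    Uses only sequence numbers (assumption: no reordering), so a hole
    means the data was lost before reaching the measurement point.
    """
    if not seen_segments:
        return []
    start = min(lo for lo, _ in seen_segments)
    end = max(hi for _, hi in seen_segments)
    return missing_ranges(seen_segments, start, end)


def downstream_loss(seen_segments, ack, sack_blocks):
    """Byte ranges seen at the measurement point but reported missing
    by the receiver through its cumulative ACK and SACK blocks, i.e.
    data lost after the measurement point.
    """
    if not sack_blocks:
        return []
    highest_sacked = max(hi for _, hi in sack_blocks)
    receiver_holes = missing_ranges(sack_blocks, ack, highest_sacked)
    lost = []
    for lo, hi in receiver_holes:
        # Remove from the receiver hole what the observer never saw:
        # that part was already lost upstream.
        not_seen_here = missing_ranges(seen_segments, lo, hi)
        lost.extend(missing_ranges(not_seen_here, lo, hi))
    return lost
```

   For a measurement point that saw bytes 0-200 and 300-400 while the
   receiver acknowledged up to byte 100 and SACKed 300-400, this sketch
   attributes bytes 200-300 to a loss before the measurement point and
   bytes 100-200 to a loss after it.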

   These measurement systems can also detect issues or unexplained
   behaviors on equipment, links and peers.  For instance, in mobile
   networks, operators shall be able to identify in real time the
   bottleneck links responsible for customer experience degradation and
   take the necessary actions to avoid a snowball effect.

   Charging for data usage is another key function for Mobile Network
   Operators, for which accurate per-flow information is collected.  It
   is important to reconcile the amount of charged data with the amount
   of data seen by the network (cf. the discussion on goodput in
   Section 7).  This allows fraud attempts or malfunctions within the
   network to be detected.  In case of a significant gap, operators must
   be able to react quickly to isolate this traffic.  Additionally,
   charging may require differentiating goodput from throughput.

   Continuous network performance monitoring requires packet loss and
   delay measurements to allow operators to manage their networks
   properly and to provide them with a reference of the performance of
   their network for interdomain troubleshooting.

5. Explicit measurement signals

   An alternative to exposing raw transport protocol data to the path is
   to have explicitly designed signals with the purpose of facilitating
   on-path measurements.  To facilitate troubleshooting, such a signal
   should enable passive measurement of RTT, packet loss and other
   congestion indicators such as ECN.  An example of how a transport
   protocol could expose measurement information to the path would be
   three flag bits available in a public header.  The first bit would be
   used for passive RTT measurement, bit 2 would indicate whether the
   packet contains retransmitted data and bit 3 would indicate whether
   the packet contains an ECN echo.
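
   The three-bit example can be sketched as follows in Python.  This is
   only an illustration: the bit positions (0x01, 0x02, 0x04), the
   MeasurementSignal type and the function names are assumptions made
   for this sketch and are not defined by this memo or by any QUIC
   specification.

```python
from dataclasses import dataclass

# Hypothetical bit positions within a public header flags byte.
RTT_BIT = 0x01          # bit 1: used for passive RTT measurement
RETRANSMIT_BIT = 0x02   # bit 2: packet contains retransmitted data
ECN_ECHO_BIT = 0x04     # bit 3: packet contains an ECN echo


@dataclass(frozen=True)
class MeasurementSignal:
    rtt_bit: bool
    retransmission: bool
    ecn_echo: bool


def decode_signal(flags_byte: int) -> MeasurementSignal:
    """Extract the three measurement flags from one flags byte."""
    return MeasurementSignal(
        rtt_bit=bool(flags_byte & RTT_BIT),
        retransmission=bool(flags_byte & RETRANSMIT_BIT),
        ecn_echo=bool(flags_byte & ECN_ECHO_BIT),
    )


def retransmission_ratio(flag_bytes) -> float:
    """Share of observed packets flagged as carrying retransmitted data."""
    signals = [decode_signal(b) for b in flag_bytes]
    if not signals:
        return 0.0
    return sum(s.retransmission for s in signals) / len(signals)
```

   An on-path observer would only need to read one byte per packet to
   derive a loss indicator such as retransmission_ratio, without
   parsing any other transport header field.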
   An explicit signal is unambiguous and simpler for a middlebox to
   interpret than parsed transport headers.  Furthermore, it could be
   invariant between revisions of the transport protocol that exposes
   it, which minimizes the risk of network ossification.

6. QUIC Fallback

   Fallback is necessary to address cases where a QUIC connection
   establishment fails [QUICAPP] (for example, a device on the path
   blocks UDP or the stack blocks 0-RTT).

   Fallback may additionally occur when an active QUIC connection drops
   and tries to reconnect.  As an example, the steps of the fallback
   could be:

   o  The QUIC connection drops accidentally;

   o  The UA falls back and connects over TCP/TLS to the origin server;

   o  The UA receives from the origin server an indication for an
      alternate service [ALTSVC] supporting QUIC;

   o  The UA gracefully ends the TCP connection;

   o  The UA tries to establish a QUIC connection to the server and port
      described in the alternate service indication;

   o  The QUIC 1-RTT connection is established.

6.1. Flapping

   A fallback may suddenly occur when one or more elements (links,
   nodes, reverse proxy, switch, server ...) of the path fail or are
   reconfigured.

   There are cases where the fallback loops and triggers flapping
   between the origin server and the alternate server.  As an example,
   this might happen when an alternate service indication is outdated
   and points to a server which does not support QUIC anymore.

   This becomes critical for UX when numerous fallbacks occur suddenly
   on the same path between a set of customers of an ISP and another
   entity which provides the application data.  The time to troubleshoot
   can be very long.  The origin server and the alternate servers can be
   hosted by different entities.

7. Versioning and Implementations

   Versioning is an important part of the QUIC protocol framework
   [QUICCORE].  Multiple versions of the protocol are expected to be
   deployed and used concurrently.  In order to encourage networks to
   rapidly support the QUIC protocol and to support any versions of QUIC
   in the future, the exposure of the fields for on-path network
   performance measurement must not depend on the version.

   There might be numerous implementations of the QUIC protocol in the
   future.  A significant share of them will implement congestion
   control at the application level.  There will be unfair behaviors,
   such as abnormal retransmission rates, which will impact the fair
   sharing of bandwidth amongst the customers of the network.  As a
   consequence, the network needs to be able to detect connections which
   have abnormal throughput/goodput.

8. Security Considerations

   The integrity of the transport parameters exposed for measuring
   on-path delay and losses can be end-to-end protected to increase the
   security of the connection.  Additionally, this helps end-points and
   on-path points of presence to compute metrics based on the same raw
   values.

   Flapping from QUIC to a fallback protocol might overload on-path
   devices and end-points and, as a consequence, affect the stability of
   the connections and introduce weaknesses.

   The fallback from transport protocols with encrypted headers to
   transport protocols with clear headers might open the door to new
   types of active attacks.

   It is not clear yet whether a network can distinguish numerous QUIC
   fallback flappings from an active attack:

   o  What is the expected behavior from the network?

   o  Will networks detect QUIC flapping as an active attack?

9. IANA Considerations

   This draft does not request any IANA actions.

10. Discussions

10.1. Fallback

   Troubleshooting QUIC traffic and its fallbacks requires measuring
   similar metrics.  One suggestion is to use the integrity mechanism of
   the TCPinc WG [TCPINC] to protect and keep visible the fields used
   for on-path measurement.

   Fallback must be precisely specified in the core specification of
   QUIC [QUICCORE].

   To avoid unnecessary flapping, [QUICCORE] might clarify the usage of
   the advertisement of QUIC support in HTTP protocols [ALTSVC].

   [QUICMAN] should propose guidance for the management of QUIC fallback
   flapping situations.

   QUIC packet numbering should be continuous to allow packet loss
   monitoring.  Flow control fields (ACK, SACK) are available in QUIC
   but encrypted.

10.2. On-path Measurement

   QUIC is designed to carry traffic other than HTTP, such as DNS and
   Web.  End-to-end encryption of the transport headers prevents the use
   of models [E-MODEL] and heuristics to estimate UX on a path segment.
   To maintain a high level of UX, QUIC capabilities should support the
   measurement of the delay and the losses of a segment of a source to
   destination path.

   On-path measurement techniques are currently ad hoc.  Adding the
   exposure of the fields for on-path packet delay and losses in the
   core specification of the QUIC protocol creates a stable network
   performance measurement framework.  It will be a real incentive for
   networks to support QUIC rapidly and to support the numerous QUIC
   versions in the future.  This will reduce the contribution of
   networks to the ossification of the IP stack in the future.

11. References

11.1. Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [ALTSVC]   https://tools.ietf.org/html/rfc7838

   [QUICCORE] https://tools.ietf.org/html/draft-ietf-quic-transport

   [QUICAPP]  https://tools.ietf.org/wg/quic/draft-ietf-quic-applicability/

   [QUICMAN]  https://tools.ietf.org/wg/quic/draft-ietf-quic-manageability/

   [TCPINC]   https://tools.ietf.org/wg/tcpinc/

   [Issue166] https://github.com/quicwg/base-drafts/issues/166

   [IABSEC]   https://www.iab.org/2014/11/14/iab-statement-on-internet-confidentiality/

11.2. Informative References

   [E-MODEL]  https://www.itu.int/rec/T-REC-G.107-199812-S/en

   [SPATIAL]  https://tools.ietf.org/html/rfc5644

   [COMPO]    https://tools.ietf.org/html/rfc6049

12. Acknowledgments

   This document was prepared using 2-Word-v2.0.template.dot.

Authors' Addresses

   Emile Stephan
   Orange
   2, avenue Pierre Marzin
   Lannion 22300
   France

   Email: emile.stephan@orange.com

   Mathilde Cayla
   Orange
   6, avenue Albert Durand
   Blagnac 31700
   France

   Email: mathilde.cayla@orange.com

   Arnaud Braud
   Orange
   2, avenue Pierre Marzin
   Lannion 22300
   France

   Email: arnaud.braud@orange.com

   Fred Fieau
   Orange
   40-48, avenue de la Republique
   Chatillon 92320
   France

   Email: frederic.fieau@orange.com

   Alex Ferrieux
   Orange
   2, avenue Pierre Marzin
   Lannion 22300
   France

   Email: alexandre.ferrieux@orange.com

   Marcus Ihlar
   Ericsson

   Email: marcus.ihlar@ericsson.com